Background

SRA (Sequence Read Archive) is a file format used by NCBI, EBI, and other archives for storing genomic read data. It can be converted to multiple file types (BAM, HDF5, FASTQ); in our case, we're focusing on FASTQ. Converting SRA into FASTQ is the first step of many pipelines, and it's what we'll cover in this post.

If you're working as an individual or a scientist, you probably just want to use SRA Toolkit to download your files. For our purposes here, though, we're going to assume you already have your files downloaded (and are probably using a tool other than SRA Toolkit for file I/O).

SRA Toolkit is pretty frustrating to use. It is not designed for programmatic use or as part of larger systems (and the developers seem hostile to the idea that someone would even try 😲). It wants you to run an interactive configuration on every install (https://github.com/ncbi/sra-tools/issues/77). We get around this with a dumb hack: creating a config file that makes it think we've already gone through that process. That's what's happening in lines 25-26 of the Dockerfile below. (Creating the config file inline like this is kind of messy; it would probably be cleaner to write it as a separate file and copy it in, but for our purposes I want to contain everything in a single file with no external dependencies.)
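Concretely, the hack just writes a minimal user-settings.mkfg into the user's ~/.ncbi directory. Assuming the shell's echo interprets the \n escape (Ubuntu's default /bin/sh, dash, does), the file ends up containing:

/LIBS/GUID = "mock-uid"
config/default = "true"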

Dockerfile

We will be using Docker and Python for this, so our first step is to create a Dockerfile with the tools we need installed. Here’s our Dockerfile–I’ve added comments to explain what I’m doing, but if you don’t know anything about Docker, this isn’t a day one tutorial, so check out one of those first.

 1# We're using Ubuntu 20.04 as our base image
 2FROM ubuntu:20.04
 3# Set DEBIAN_FRONTEND to non-interactive to avoid interactive configuration
 4ENV DEBIAN_FRONTEND=noninteractive
 5# Set SRA toolkit version
 6ENV SRATOOLKIT_VERSION="3.0.0"
 7ENV USER=alex
 8ENV DATA=/data
 9# change our working directory
10WORKDIR /opt
11
12# update package list and install wget, python
13RUN apt-get update && apt-get install -y \
14    wget \
15    python3 && ln -sf python3 /usr/bin/python && \
16    # download and decompress our version of SRA Toolkit
17    wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/${SRATOOLKIT_VERSION}/sratoolkit.${SRATOOLKIT_VERSION}-ubuntu64.tar.gz && \
18    tar xvf /opt/sratoolkit.${SRATOOLKIT_VERSION}-ubuntu64.tar.gz
19# add SRA toolkit binaries to our path
20ENV PATH=/opt/sratoolkit.${SRATOOLKIT_VERSION}-ubuntu64/bin:${PATH}
21
22# create our user
23RUN useradd -ms /bin/bash ${USER} && \
24    # This creates a file tricking SRA Toolkit into thinking we've gone through the manual configuration
25    mkdir /home/${USER}/.ncbi/ && \
26    echo '/LIBS/GUID = "mock-uid"\nconfig/default = "true"' > /home/${USER}/.ncbi/user-settings.mkfg && \
27    # create a data directory to work out of and give ownership to our user
28    mkdir ${DATA} && chown ${USER} ${DATA}
29
30# copy in our python script
31COPY dump_fastq.py /usr/bin/
32
33# set our user
34USER ${USER}
35WORKDIR ${DATA}
36
37ENTRYPOINT ["python", "/usr/bin/dump_fastq.py"]
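Once we've written our Python script (next section) and saved it alongside the Dockerfile, building the image is a one-liner. The fastq-dump tag is the name we'll reference when we run the container later:

docker build -t fastq-dump .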

Python

Our Python script for this is pretty simple. We're assuming that our SRA file has already been downloaded, and we're going to run the SRA Toolkit binaries using Python's subprocess module.

First, we need to check that our SRA is valid, which we do with the vdb-validate tool. If we get a good return code, we test whether it contains paired-end data. I'm not going to go into a ton of detail about paired-end vs. single-read data, but suffice it to say that paired-end data is more effective. You can read more here: https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html

In most cases, data from modern experiments is paired-end, but it's essential to know for sure. Here the check may not seem very useful, but in a production environment we would be passing these files on to another step in a pipeline, and we need to know whether we'll end up with a single read file or two in order to configure that next step properly.

To determine this, we stand on the shoulders of those who've come before us and use a version of a function described here: https://www.biostars.org/p/139422/.
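To make the line-counting logic in that function concrete: dumping a single spot with -X 1 -Z --split-spot writes one four-line FASTQ record to stdout for single-end data, and two records (eight lines) for paired-end data. The paired case looks something like this (illustrative output with a made-up accession and reads, not from a real run):

@SRR000001.1 1 length=8
ATTGCCAA
+SRR000001.1 1 length=8
IIIIIIII
@SRR000001.1 1 length=8
TTACGGAT
+SRR000001.1 1 length=8
IIIIIIII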

Once we determine the data type, we'll pass our SRA to fastq-dump to split it, telling it to give us gzipped output files. If that succeeds, we simply return the paths of the output files (which, in this case, we're just going to log to stdout rather than upload).
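For the test file we'll use later in the post, the fastq-dump call our script makes amounts to this command line:

fastq-dump --split-files --gzip --outdir /data /data/SRR21712309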

That’s basically it. We’ve also added some simple error handling. I’ve commented the file below to explain better what we’re doing.

import sys
import logging
import argparse
from subprocess import run

logging.basicConfig(level=logging.INFO)
log = logging.getLogger()


def error(message='Unexpected Error', error=None):
    log.error(message)
    if error:
        log.error(error)
    sys.exit(1)


def base_path(sra):
    # drop a trailing .sra extension if present
    # (note: str.strip('.sra') strips *characters* from both ends, not the
    # suffix, so we avoid it)
    return sra[:-len('.sra')] if sra.endswith('.sra') else sra


def is_paired_sra(sra):
    try:
        # dump only the first spot to stdout, split into individual reads
        result = run(['fastq-dump',
                      '-X', '1',
                      '-Z',
                      '--split-spot',
                      sra], capture_output=True, text=True)
        # check for good return
        if result.returncode == 0:
            # get number of lines in stdout
            num_lines = len(result.stdout.splitlines())
            # 4 lines (one FASTQ record) indicates single-end data
            if num_lines == 4:
                return False
            # 8 lines (two records) indicates paired-end data
            elif num_lines == 8:
                return True
            else:
                # There are cases where an index read is included and 12 lines
                # would be output; for our purposes here, we treat this as an error
                error('Unable to determine if SRA is paired-end')
        else:
            error('Unable to determine if SRA is paired-end')

    except Exception as e:
        error(error=e)


def validate(sra):
    try:
        # validate the SRA and return the result, reporting any error
        result = run(['vdb-validate', sra])
        return result.returncode == 0
    except Exception as e:
        error(error=e)


def split_sra(sra):
    # check if the SRA is paired-end
    paired = is_paired_sra(sra)
    # dump the SRA into gzipped FASTQ files
    result = run(['fastq-dump',
                  '--split-files',
                  '--gzip',
                  '--outdir', results_dir,
                  sra])

    if result.returncode == 0:
        # in prod, we would likely upload the files here; for our purposes we
        # just report their local paths (sra already includes the data
        # directory, so no extra prefix is needed)
        if paired:
            # if data is paired-end, return both read 1 and read 2
            return f'{base_path(sra)}_1.fastq.gz', f'{base_path(sra)}_2.fastq.gz'
        else:
            # otherwise, only read 1
            return f'{base_path(sra)}_1.fastq.gz'
    else:
        error('fastq-dump failed')


def main(sra):
    # check for a valid SRA file
    if validate(sra):
        # dump the SRA, report results
        result = split_sra(sra)
        log.info(result)
    else:
        error('SRA is not valid')


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Split SRA')
    parser.add_argument('--sra', type=str, required=True, help='Path to SRA')
    parser.add_argument('--data_dir', default='/data', required=False, type=str, help='Data directory')
    args = parser.parse_args()
    # module-level so split_sra can see it
    results_dir = args.data_dir
    main(f'{results_dir}/{args.sra}')
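If you'd rather test the script outside the container (assuming the SRA Toolkit binaries are on your PATH and your data lives in ./data), you can run it directly:

python dump_fastq.py --sra SRR21712309 --data_dir ./data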

Testing out our code

Next, we're going to need some data. SRA files are often pretty large (sometimes hundreds of gigabytes), but S. cerevisiae RNA-seq datasets are typically small while still being fully functional, so we're going to use one of those: https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR21712309 (to find this, I went to https://ncbi.nlm.nih.gov/sra, entered S. cerevisiae in the search bar, and selected the first result 🙂).
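If you don't already have the file locally, SRA Toolkit's prefetch tool is one way to grab it. Note that, depending on your version, prefetch may drop the .sra file into a subdirectory named after the accession, so you may need to move it up into ./data afterwards:

prefetch -O ./data SRR21712309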

Now that we have data, we can run the container using Docker. We'll bind a directory on our host machine (./data) to the /data directory in the container; this folder is where our input data lives and where the FASTQ files our script generates will be written.

To kick it off, we simply run:

docker run -v $PWD/data:/data fastq-dump --sra SRR21712309

and we'll see output logged to stdout:

2022-10-12T01:21:16 vdb-validate.3.0.0 info: Database 'SRR21712309' metadata: md5 ok
2022-10-12T01:21:16 vdb-validate.3.0.0 info: Table 'SEQUENCE' metadata: md5 ok
2022-10-12T01:21:16 vdb-validate.3.0.0 info: Column 'ALTREAD': checksums ok
2022-10-12T01:21:17 vdb-validate.3.0.0 info: Column 'QUALITY': checksums ok
2022-10-12T01:21:20 vdb-validate.3.0.0 info: Column 'READ': checksums ok
2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database '/data/SRR21712309' contains only unaligned reads
2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database 'SRR21712309' is consistent
Read 8631759 spots for /data/SRR21712309
Written 8631759 spots for /data/SRR21712309
INFO:root:('/data/SRR21712309_1.fastq.gz', '/data/SRR21712309_2.fastq.gz')

We'll also see these files appear in our ./data directory, and their paths match the ones logged at the end:

(venv) alexjacobs@Alexs-MacBook-Pro ~/r/b/fastq-dump (master)> ls -lh data
total 1354280
-rw-r--r--@ 1 alexjacobs  staff   213M Sep 27 07:30 SRR21712309
-rw-r--r--  1 alexjacobs  staff   217M Oct 11 22:57 SRR21712309_1.fastq.gz
-rw-r--r--  1 alexjacobs  staff   221M Oct 11 22:57 SRR21712309_2.fastq.gz

And that's it! This is a lot of explanation for a simple toy example, but hopefully it's helpful to someone just getting started!