Background
SRA (Sequence Read Archive) is a file format used by NCBI, EBI, and others for storing genomic read data. It interoperates with multiple file types (BAM, HDF5, FASTQ); in our case, we're going to focus on FASTQ. Converting SRA into FASTQ is the first step of many pipelines, and it's what we'll cover in this post.
If you're working as an individual or a scientist, you probably want to go ahead and use SRA Toolkit to download your files. For our purposes here, though, we're going to assume you already have your files downloaded (and are probably using a tool outside of SRA Toolkit for file I/O).
SRA Toolkit is pretty frustrating to use. It is not designed for programmatic use or as part of larger systems (and the developers seem hostile to the idea that someone would even try to do this 😲). It wants you to do an interactive configuration on every install (https://github.com/ncbi/sra-tools/issues/77). We get around this with a dumb hack that makes it think we've already gone through this process and configured it. That's what's happening in lines 25-26. (It's kind of messy to create this config file like this; it would probably be better to make it as a file and copy it in, but for our purposes, I want to contain everything in a single file with no external dependencies.)
Dockerfile
We will be using Docker and Python for this, so our first step is to create a Dockerfile with the tools we need installed. Here's our Dockerfile; I've added comments to explain what I'm doing, but if you don't know anything about Docker, this isn't a day-one tutorial, so check out one of those first.
1# We're using ubuntu 20.04 as our base image
2FROM ubuntu:20.04
3# Set DEBIAN_FRONTEND to non-interactive to avoid interactive configuration
4ENV DEBIAN_FRONTEND=noninteractive
5# Set SRA toolkit version
6ENV SRATOOLKIT_VERSION="3.0.0"
7ENV USER=alex
8ENV DATA=/data
9# change our working directory
10WORKDIR /opt
11
12# update package list and install wget, python
13RUN apt-get update && apt-get install -y \
14    wget \
15    python3 && ln -sf python3 /usr/bin/python && \
16    # download and decompress our version of SRA Toolkit
17    wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/${SRATOOLKIT_VERSION}/sratoolkit.${SRATOOLKIT_VERSION}-ubuntu64.tar.gz && \
18    tar xvf /opt/sratoolkit.${SRATOOLKIT_VERSION}-ubuntu64.tar.gz
19# add SRA toolkit binaries to our path
20ENV PATH=/opt/sratoolkit.${SRATOOLKIT_VERSION}-ubuntu64/bin:${PATH}
21
22# create our user
23RUN useradd -ms /bin/bash ${USER} && \
24    # This creates a file tricking SRA Toolkit into thinking we've gone through the manual configuration
25    mkdir /home/${USER}/.ncbi/ && \
26    printf '/LIBS/GUID = "mock-uid"\nconfig/default = "true"\n' > /home/${USER}/.ncbi/user-settings.mkfg && \
27    # create a data directory to work out of and give ownership to our user
28    mkdir ${DATA} && chown ${USER} ${DATA}
29
30# copy in our python script
31COPY dump_fastq.py /usr/bin/
32
33# set our user
34USER ${USER}
35WORKDIR ${DATA}
36
37ENTRYPOINT ["python", "/usr/bin/dump_fastq.py"]
Python
Our Python script for this is pretty simple. We're assuming that our SRA file has already been downloaded. We're going to run SRA Toolkit using the Python subprocess module.
First, we need to check that our SRA is valid. We use the vdb-validate tool to do this. If we get a good return code, we then test whether it's paired-end data. I'm not going to go into a ton of detail about paired-end vs. single-read data, but suffice it to say that paired-end data is more effective. You can read more here: https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html
Data from modern experiments is usually paired, but it's essential to know which kind we have. It may not seem useful here, but in a production environment we would be passing these files on to another step in a pipeline, and we need to know whether there will be a single read file or two in order to configure the next step properly.
To determine this, we stand on the shoulders of those who’ve come before us, and use a version of a function described here https://www.biostars.org/p/139422/.
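To make the reasoning concrete, the core of that check can be distilled into a small pure function (a hypothetical helper for illustration; the actual script below just counts the lines inline):

```python
def classify_first_spot(stdout: str) -> bool:
    """Classify the output of `fastq-dump -X 1 -Z --split-spot` by line count.

    One FASTQ record is 4 lines (header, sequence, separator, quality),
    so 4 lines of output means single-end and 8 lines means paired-end.
    Anything else (e.g. 12 lines when an index read is included) is
    treated as unexpected.
    """
    num_lines = len(stdout.splitlines())
    if num_lines == 4:
        return False   # single-end
    if num_lines == 8:
        return True    # paired-end
    raise ValueError(f'unexpected line count: {num_lines}')
```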
Once we determine the data type, we'll pass our SRA to fastq-dump to split it, telling it to give us gzipped output files. If the run is successful, we simply return the paths of the resulting files (which, in this case, we'll just log to stdout rather than upload).
That's basically it. We've also added some simple error handling. I've commented the file below to better explain what we're doing.
import sys
import logging
import argparse
from subprocess import run

logging.basicConfig(level=logging.INFO)
log = logging.getLogger()


def error(message='Unexpected Error', error=None):
    log.error(message)
    if error:
        log.error(error)
    sys.exit(1)


def strip_sra_extension(sra):
    # note: sra.strip('.sra') would remove any of the characters '.', 's',
    # 'r', 'a' from both ends of the string, so we drop the suffix explicitly
    return sra[:-len('.sra')] if sra.endswith('.sra') else sra


def is_paired_sra(sra):
    try:
        # dump only the first spot to stdout and count the lines produced
        result = run(['fastq-dump',
                      '-X', '1',
                      '-Z',
                      '--split-spot',
                      sra], capture_output=True, text=True)
        # check for good return
        if result.returncode == 0:
            # get number of lines in stdout
            num_lines = len(result.stdout.splitlines())
            # 4 lines (one fastq record) indicates single-end data
            if num_lines == 4:
                return False
            # 8 lines indicates paired-end data
            elif num_lines == 8:
                return True
            else:
                # There are cases where an index may be included, and 12 lines
                # would be output; for our purposes here, we treat this as an error
                error('Unable to determine if SRA is paired ended')
        else:
            error('Unable to determine if SRA is paired ended')

    except Exception as e:
        error(error=e)


def validate(sra):
    try:
        # validate sra, return result, report error
        result = run(['vdb-validate', sra])
        return result.returncode == 0
    except Exception as e:
        error(error=e)


def split_sra(sra):
    # check if sra is paired
    paired = is_paired_sra(sra)
    # dump sra into fastq
    result = run(['fastq-dump',
                  '--split-files',
                  '--gzip',
                  '--outdir', results_dir,
                  sra])

    if result.returncode != 0:
        error('fastq-dump failed')
    # in prod, we would likely upload the files here. For our purposes we are
    # just going to report their local paths (sra already includes results_dir)
    base = strip_sra_extension(sra)
    if paired:
        # if data is paired return both read 1 and read 2
        return f'{base}_1.fastq.gz', f'{base}_2.fastq.gz'
    # otherwise, only read 1
    return f'{base}_1.fastq.gz'


def main(sra):
    # check for valid sra file
    if validate(sra):
        # dump sra, report results
        result = split_sra(sra)
        log.info(result)
    else:
        error('SRA is not valid')


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Split SRA')
    parser.add_argument('--sra', type=str, help='Path to SRA')
    parser.add_argument('--data_dir', default='/data', required=False, type=str, help='Data directory')
    args = parser.parse_args()
    results_dir = args.data_dir
    main(f'{results_dir}/{args.sra}')
Testing out our code
Next, we're going to need some data. SRA files are often pretty large (sometimes hundreds of gigabytes). S. cerevisiae RNA-seq datasets are typically pretty small while still being fully usable, so we're going to use one of those: https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR21712309
(To find this, I went to https://ncbi.nlm.nih.gov/sra, entered S. cerevisiae in the search bar, and selected the first result 🙂)
Now that we have data, we can run this container using Docker. We will bind a directory on our host machine (./data) to the /data directory in our container. This folder is where our input data will live and where the FASTQ files our script generates will be written.
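One step we haven't shown is building the image. Assuming the Dockerfile and dump_fastq.py above sit together in the current directory (the image name fastq-dump is just my choice here), the build looks something like:

```shell
# build the image from the Dockerfile in the current directory
docker build -t fastq-dump .
```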
To run it, we'll use
docker run -v $PWD/data:/data fastq-dump --sra SRR21712309
and we’ll see output logged to stdout
2022-10-12T01:21:16 vdb-validate.3.0.0 info: Database 'SRR21712309' metadata: md5 ok
2022-10-12T01:21:16 vdb-validate.3.0.0 info: Table 'SEQUENCE' metadata: md5 ok
2022-10-12T01:21:16 vdb-validate.3.0.0 info: Column 'ALTREAD': checksums ok
2022-10-12T01:21:17 vdb-validate.3.0.0 info: Column 'QUALITY': checksums ok
2022-10-12T01:21:20 vdb-validate.3.0.0 info: Column 'READ': checksums ok
2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database '/data/SRR21712309' contains only unaligned reads
2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database 'SRR21712309' is consistent
Read 8631759 spots for /data/SRR21712309
Written 8631759 spots for /data/SRR21712309
INFO:root:('/data/SRR21712309_1.fastq.gz', '/data/SRR21712309_2.fastq.gz')
We'll also see these files appear in our ./data directory. The paths to these will match the file paths logged at the end.
(venv) alexjacobs@Alexs-MacBook-Pro ~/r/b/fastq-dump (master)> ls -lh data
total 1354280
-rw-r--r--@ 1 alexjacobs staff 213M Sep 27 07:30 SRR21712309
-rw-r--r-- 1 alexjacobs staff 217M Oct 11 22:57 SRR21712309_1.fastq.gz
-rw-r--r-- 1 alexjacobs staff 221M Oct 11 22:57 SRR21712309_2.fastq.gz
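If you want a quick sanity check that the two mates line up, you can count the records in each gzipped file; the counts should match. (This is a hypothetical spot-check, not part of the pipeline.)

```python
import gzip

def count_fastq_records(path):
    """Count records in a gzipped FASTQ file (4 lines per record)."""
    with gzip.open(path, 'rt') as f:
        return sum(1 for _ in f) // 4
```

For the run above, both files should each report a record count matching the spot count in the fastq-dump log (8631759 for SRR21712309).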
And that's it! This is a lot of explanation for a simple toy example, but hopefully it's helpful to someone just getting started!