<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Bioinformatics on Alex Jacobs</title>
    <link>https://alex-jacobs.com/tags/bioinformatics/</link>
    <description>Recent content in Bioinformatics on Alex Jacobs</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 03 Apr 2022 00:00:00 +0000</lastBuildDate><atom:link href="https://alex-jacobs.com/tags/bioinformatics/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Splitting SRA into FASTQ with SRAToolkit, Python, and Docker</title>
      <link>https://alex-jacobs.com/posts/fastqsplit/</link>
      <pubDate>Sun, 03 Apr 2022 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/fastqsplit/</guid>
      <description>A simple example using Python and Docker to split an SRA file into Fastq</description>
      <content:encoded><![CDATA[<h1 id="background">Background</h1>
<p>SRA (Sequence Read Archive) is a file format used by NCBI, EBI, etc., for storing genomic read data. It works with
multiple file types (BAM, HDF5, FASTQ); here we&rsquo;ll focus on FASTQ. The first step of many pipelines
is converting SRA into FASTQ, and that conversion is the subject of this post.</p>
<p>If you&rsquo;re working as an individual or a scientist, you probably want to
go ahead and use SRA Toolkit to download your files. For our purposes here, though,
we&rsquo;re going to assume you already have your files downloaded (and are probably
using a tool other than SRA Toolkit for file I/O).</p>
<p>SRA Toolkit is pretty frustrating to use. It is not designed for programmatic use or as part of larger systems
(and the developers seem hostile to the idea that someone would even try to do this 😲). It wants you to
do an interactive configuration on every install (<a href="https://github.com/ncbi/sra-tools/issues/77">https://github.com/ncbi/sra-tools/issues/77</a>).
We get around this with a dumb hack to make it think we&rsquo;ve gone through this process and configured it. That&rsquo;s what&rsquo;s happening in lines 25-26 of the Dockerfile below.
(It&rsquo;s kind of messy to create this config file like this; it would probably be better to keep it as a separate file and copy it in,
but for our purposes, I want to contain everything in a single file with no external dependencies.)</p>
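<p>For reference, here is a minimal Python sketch of what that hack writes. The two settings lines are taken from the Dockerfile; the function name <code>write_mock_config</code> and the temporary directory standing in for the real home directory are assumptions made for illustration, not part of the actual build.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import tempfile
from pathlib import Path

# The same two settings the Dockerfile echoes into user-settings.mkfg
MOCK_SETTINGS = '/LIBS/GUID = "mock-uid"\nconfig/default = "true"\n'


def write_mock_config(home):
    """Write the mock settings file under home/.ncbi and return its path."""
    config_dir = Path(home) / '.ncbi'
    config_dir.mkdir(parents=True, exist_ok=True)
    config_file = config_dir / 'user-settings.mkfg'
    config_file.write_text(MOCK_SETTINGS)
    return config_file


if __name__ == '__main__':
    # A temporary directory stands in for the user's home directory here
    with tempfile.TemporaryDirectory() as fake_home:
        path = write_mock_config(fake_home)
        print(path.read_text())
</code></pre></div>
<p>Keeping the settings in one constant makes it easy to see exactly what SRA Toolkit is being told, which is the main thing the inline <code>echo</code> obscures.</p>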
<h2 id="dockerfile">Dockerfile</h2>
<p>We will be using Docker and Python for this, so our first step is to create a Dockerfile with the tools we need
installed. Here&rsquo;s our Dockerfile&ndash;I&rsquo;ve added comments to explain what I&rsquo;m doing, but if you don&rsquo;t know <em>anything</em> about Docker,
this isn&rsquo;t a day one tutorial, so check out one of those first.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># We&#39;re using Ubuntu 20.04 as our base image</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="err"></span><span class="k">FROM</span><span class="s"> ubuntu:20.04</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="err"></span><span class="c"># Set DEBIAN_FRONTEND to non-interactive to avoid interactive configuration</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">DEBIAN_FRONTEND</span><span class="o">=</span>noninteractive
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c"># Set SRA toolkit version</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">SRATOOLKIT_VERSION</span><span class="o">=</span><span class="s2">&#34;3.0.0&#34;</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">USER</span><span class="o">=</span>alex
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">ENV</span> <span class="nv">DATA</span><span class="o">=</span>/data<span class="err">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="err"></span><span class="c"># change our working directory</span><span class="err">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="err"></span><span class="k">WORKDIR</span><span class="s"> /opt</span><span class="err">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="err"></span>    <span class="c1"># update package list and install wget, python</span><span class="err">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="err"></span><span class="k">RUN</span> apt-get update <span class="o">&amp;&amp;</span> apt-get install -y <span class="se">\
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="se"></span>    wget <span class="se">\
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="se"></span>    python3 <span class="o">&amp;&amp;</span> ln -sf python3 /usr/bin/python <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="se"></span>    <span class="c1"># download and decompress our version of SRA Toolkit</span><span class="err">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="err"></span>    wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>/sratoolkit.<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>-ubuntu64.tar.gz <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="se"></span>    tar xvf /opt/sratoolkit.<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>-ubuntu64.tar.gz<span class="err">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="err"></span><span class="c"># add SRA toolkit binaries to our path</span><span class="err">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">PATH</span><span class="o">=</span>/opt/sratoolkit.<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>-ubuntu64/bin:<span class="si">${</span><span class="nv">PATH</span><span class="si">}</span><span class="err">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="err"></span>    <span class="c1"># create our user</span><span class="err">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="err"></span><span class="k">RUN</span> useradd -ms /bin/bash <span class="si">${</span><span class="nv">USER</span><span class="si">}</span> <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="se"></span>    <span class="c1"># This creates a file tricking SRA Toolkit into thinking we&#39;ve gone through the manual configuration</span><span class="err">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="err"></span>    mkdir /home/<span class="si">${</span><span class="nv">USER</span><span class="si">}</span>/.ncbi/ <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="se"></span>    <span class="nb">echo</span> <span class="s1">&#39;/LIBS/GUID = &#34;mock-uid&#34;\nconfig/default = &#34;true&#34;&#39;</span> &gt; /home/<span class="si">${</span><span class="nv">USER</span><span class="si">}</span>/.ncbi/user-settings.mkfg <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="se"></span>    <span class="c1"># create a data directory to work out of and give ownership to our user</span><span class="err">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="err"></span>    mkdir <span class="si">${</span><span class="nv">DATA</span><span class="si">}</span> <span class="o">&amp;&amp;</span> chown <span class="si">${</span><span class="nv">USER</span><span class="si">}</span> <span class="si">${</span><span class="nv">DATA</span><span class="si">}</span><span class="err">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="err"></span><span class="c"># copy in our python script</span><span class="err">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="err"></span><span class="k">COPY</span> dump_fastq.py /usr/bin/<span class="err">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="err"></span><span class="c"># set our user</span><span class="err">
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="err"></span><span class="k">USER</span><span class="s"> ${USER}</span><span class="err">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="err"></span><span class="k">WORKDIR</span><span class="s"> ${DATA}</span><span class="err">
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="err"></span><span class="k">ENTRYPOINT</span> <span class="p">[</span><span class="s2">&#34;python&#34;</span><span class="p">,</span> <span class="s2">&#34;/usr/bin/dump_fastq.py&#34;</span><span class="p">]</span><span class="err">
</span></span></span></code></pre></div><h2 id="python">Python</h2>
<p>Our Python script for this is pretty simple. We&rsquo;re assuming that our SRA file has already been
downloaded. We&rsquo;re going to be running SRA Toolkit using the Python <code>subprocess</code> module.</p>
<p>First, we need to check that our SRA is valid. We use the <code>vdb-validate</code> tool to do this. If we get a good
return code, we will test whether it&rsquo;s paired-end data. I&rsquo;m not going to go into a ton of detail
about paired vs. single-end data, but suffice it to say that it&rsquo;s generally more effective to use paired-end data. You can read more here: <a href="https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html">https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html</a>.</p>
<p>In most cases, data from modern experiments is paired, but it&rsquo;s essential to know for sure. In this toy example the check may not seem helpful,
but in a production environment we would be passing these files on to another step in a pipeline, and we need to be
able to tell whether we will have a single read file or multiple in order to configure the next step properly.</p>
<p>To determine this, we stand on the shoulders of those who&rsquo;ve come before us, and use a version of a function described here
<a href="https://www.biostars.org/p/139422/">https://www.biostars.org/p/139422/</a>.</p>
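<p>The idea behind that check can be sketched without SRA Toolkit at all: dump a single spot, split it into its reads, and count the FASTQ lines. The helper name <code>classify_layout</code> and the sample records are hypothetical; only the 4-line and 8-line rule comes from the approach described above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def classify_layout(fastq_text):
    """Classify one dumped spot as 'single' or 'paired' by its line count."""
    num_lines = len(fastq_text.splitlines())
    # Each FASTQ record is 4 lines: header, sequence, separator, qualities
    if num_lines == 4:
        return 'single'
    if num_lines == 8:
        return 'paired'
    # Anything else (for example 12 lines when an index read is present)
    raise ValueError(f'unexpected line count: {num_lines}')


if __name__ == '__main__':
    # Made-up records, for illustration only
    read_1 = '@spot.1 1\nACGT\n+\nIIII\n'
    read_2 = '@spot.1 2\nTGCA\n+\nIIII\n'
    print(classify_layout(read_1))
    print(classify_layout(read_1 + read_2))
</code></pre></div>
<p>Raising on any other line count keeps the ambiguous cases loud rather than silently guessing a layout.</p>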
<p>Once we determine the data type, we&rsquo;ll pass our SRA to <code>fastq-dump</code> to split it. We&rsquo;re going to
tell it to give us gzipped output files. If the split succeeds, we&rsquo;re simply going to return the paths of the files
(which, in this case, we&rsquo;re just going to log to the console rather than upload).</p>
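<p>Working out the expected output paths is a small string-handling problem with one trap: <code>str.strip(".sra")</code> removes any of the characters <code>.</code>, <code>s</code>, <code>r</code>, and <code>a</code> from both ends of the string rather than removing a suffix. Here is a minimal sketch of a safer approach, assuming <code>fastq-dump</code> names its split outputs after the run with <code>_1.fastq.gz</code> and <code>_2.fastq.gz</code> appended, as in the run shown later in this post. The helper name <code>expected_outputs</code> is hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pathlib import PurePosixPath


def expected_outputs(sra, results_dir, paired):
    """Return the FASTQ paths we expect for an SRA file (assumed naming)."""
    name = PurePosixPath(sra).name
    # Drop only a trailing '.sra' extension, never other characters
    if name.endswith('.sra'):
        name = name[:-len('.sra')]
    reads = (1, 2) if paired else (1,)
    return [str(PurePosixPath(results_dir) / f'{name}_{n}.fastq.gz') for n in reads]


if __name__ == '__main__':
    print(expected_outputs('/data/SRR21712309', '/data', paired=True))
    # str.strip would mangle a name like this one; the suffix check does not
    print(expected_outputs('/data/sars.sra', '/data', paired=False))
</code></pre></div>
<p>Building the paths from the bare file name and the output directory also avoids prefixing the directory twice when the input path already includes it.</p>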
<p>That&rsquo;s basically it. We&rsquo;ve also added some simple error handling. I&rsquo;ve commented the file below to explain better what
we&rsquo;re doing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">sys</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">logging</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">import</span> <span class="nn">argparse</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="kn">from</span> <span class="nn">subprocess</span> <span class="kn">import</span> <span class="n">run</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">INFO</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">log</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="k">def</span> <span class="nf">error</span><span class="p">(</span><span class="n">message</span><span class="o">=</span><span class="s1">&#39;Unexpected Error&#39;</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="n">error</span><span class="p">)</span> <span class="k">if</span> <span class="n">error</span> <span class="k">else</span> <span class="kc">None</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="k">def</span> <span class="nf">is_paired_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="n">run</span><span class="p">([</span><span class="s1">&#39;fastq-dump&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">                      <span class="s1">&#39;-X&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">                      <span class="s1">&#39;1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">                      <span class="s1">&#39;-Z&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">                      <span class="s1">&#39;--split-spot&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">                      <span class="n">sra</span><span class="p">],</span> <span class="n">capture_output</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">        <span class="c1"># check for good return</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">        <span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">            <span class="c1"># get number of lines in stdout</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">            <span class="n">num_lines</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">splitlines</span><span class="p">())</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl">            <span class="c1"># 4 lines indicates single end fastq</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">            <span class="k">if</span> <span class="n">num_lines</span> <span class="o">==</span> <span class="mi">4</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl">                <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl">            <span class="c1"># 8 lines indicates paired end</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">            <span class="k">elif</span> <span class="n">num_lines</span> <span class="o">==</span> <span class="mi">8</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">33</span><span class="cl">                <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl">            <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl">                <span class="c1"># There are cases where an index may be included, and 12 lines would be output</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">                <span class="c1"># for our purposes here, we are going to treat this as an error</span>
</span></span><span class="line"><span class="ln">37</span><span class="cl">                <span class="n">error</span><span class="p">(</span><span class="s1">&#39;Unable to determine if SRA is paired ended&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">38</span><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl">            <span class="n">error</span><span class="p">(</span><span class="s1">&#39;Unable to determine if SRA is paired ended&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl">
</span></span><span class="line"><span class="ln">41</span><span class="cl">    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">42</span><span class="cl">        <span class="n">error</span><span class="p">(</span><span class="n">error</span><span class="o">=</span><span class="n">e</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">43</span><span class="cl">
</span></span><span class="line"><span class="ln">44</span><span class="cl">
</span></span><span class="line"><span class="ln">45</span><span class="cl"><span class="k">def</span> <span class="nf">validate</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">46</span><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">47</span><span class="cl">        <span class="c1"># validate sra, return result, report error</span>
</span></span><span class="line"><span class="ln">48</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="n">run</span><span class="p">([</span><span class="s1">&#39;vdb-validate&#39;</span><span class="p">,</span> <span class="n">sra</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">49</span><span class="cl">        <span class="k">return</span> <span class="n">result</span><span class="o">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span>
</span></span><span class="line"><span class="ln">50</span><span class="cl">    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">51</span><span class="cl">        <span class="n">error</span><span class="p">(</span><span class="n">error</span><span class="o">=</span><span class="n">e</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">52</span><span class="cl">
</span></span><span class="line"><span class="ln">53</span><span class="cl">
</span></span><span class="line"><span class="ln">54</span><span class="cl"><span class="k">def</span> <span class="nf">split_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">55</span><span class="cl">    <span class="c1"># check if sra is paired</span>
</span></span><span class="line"><span class="ln">56</span><span class="cl">    <span class="n">paired</span> <span class="o">=</span> <span class="n">is_paired_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">57</span><span class="cl">    <span class="c1"># dump sra into fastq</span>
</span></span><span class="line"><span class="ln">58</span><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">run</span><span class="p">([</span><span class="s1">&#39;fastq-dump&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">59</span><span class="cl">                  <span class="s1">&#39;--split-files&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">60</span><span class="cl">                  <span class="s1">&#39;--gzip&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">61</span><span class="cl">                  <span class="s1">&#39;--outdir&#39;</span><span class="p">,</span> <span class="n">results_dir</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">62</span><span class="cl">                  <span class="n">sra</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">63</span><span class="cl">
</span></span><span class="line"><span class="ln">64</span><span class="cl">    <span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">65</span><span class="cl">        <span class="c1"># in prod, we would likely upload the files here. For our purposes we are just going to report their local paths</span>
</span></span><span class="line"><span class="ln">66</span><span class="cl">        <span class="k">if</span> <span class="n">paired</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">67</span><span class="cl">            <span class="c1"># if data is paired return both read 1 and read 2</span>
</span></span><span class="line"><span class="ln">68</span><span class="cl">            <span class="k">return</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">sra</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s2">&#34;.sra&#34;</span><span class="p">)</span><span class="si">}</span><span class="s1">_1.fastq.gz&#39;</span><span class="p">,</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">sra</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s2">&#34;.sra&#34;</span><span class="p">)</span><span class="si">}</span><span class="s1">_2.fastq.gz&#39;</span>
</span></span><span class="line"><span class="ln">69</span><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">70</span><span class="cl">            <span class="c1"># otherwise, only read 1</span>
</span></span><span class="line"><span class="ln">71</span><span class="cl">            <span class="k">return</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">sra</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s2">&#34;.sra&#34;</span><span class="p">)</span><span class="si">}</span><span class="s1">_1.fastq.gz&#39;</span>
</span></span><span class="line"><span class="ln">72</span><span class="cl">
</span></span><span class="line"><span class="ln">73</span><span class="cl">
</span></span><span class="line"><span class="ln">74</span><span class="cl"><span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">75</span><span class="cl">    <span class="c1"># check for valid sra file</span>
</span></span><span class="line"><span class="ln">76</span><span class="cl">    <span class="k">if</span> <span class="n">validate</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">77</span><span class="cl">        <span class="c1"># dump sra, report results</span>
</span></span><span class="line"><span class="ln">78</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="n">split_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">79</span><span class="cl">        <span class="n">log</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">80</span><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">81</span><span class="cl">        <span class="n">error</span><span class="p">(</span><span class="s1">&#39;SRA is not valid&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">82</span><span class="cl">
</span></span><span class="line"><span class="ln">83</span><span class="cl">
</span></span><span class="line"><span class="ln">84</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;__main__&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">85</span><span class="cl">    <span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s1">&#39;Split SRA&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">86</span><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;--sra&#39;</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">&#39;Path to SRA&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">87</span><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;-data_dir&#39;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s1">&#39;/data&#39;</span><span class="p">,</span> <span class="n">required</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">&#39;Data directory&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">88</span><span class="cl">    <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">89</span><span class="cl">    <span class="n">results_dir</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">data_dir</span>
</span></span><span class="line"><span class="ln">90</span><span class="cl">    <span class="n">main</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">results_dir</span><span class="si">}</span><span class="s1">/</span><span class="si">{</span><span class="n">args</span><span class="o">.</span><span class="n">sra</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="testing-out-our-code">Testing out our code</h2>
<p>Next, we&rsquo;re going to need some data. SRA files are often pretty large (sometimes hundreds of gigabytes).
Typically, S. cerevisiae RNA-seq datasets are pretty small but also fully functional, so we&rsquo;re going to
be using one of those: <a href="https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR21712309">https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR21712309</a>.
(To find this, I went to <a href="https://ncbi.nlm.nih.gov/sra">https://ncbi.nlm.nih.gov/sra</a>, entered S. cerevisiae in the search bar, and selected the first one 🙂.)</p>
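<p>Now that we have data, we can run this container using Docker (this assumes the image has already been built and tagged <code>fastq-dump</code>, for example with <code>docker build -t fastq-dump .</code>). We will
bind a directory on our host machine (<code>./data</code>) to our <code>/data</code> directory in our container. This folder is where our input data will live, and where the FASTQ files our script generates will be written.</p>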
<p>To run it, we simply execute:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker run -v<span class="nv">$PWD</span>/data:/data fastq-dump --sra SRR21712309
</span></span></code></pre></div><p>and we&rsquo;ll see output like this logged to the console:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">2022-10-12T01:21:16 vdb-validate.3.0.0 info: Database <span class="s1">&#39;SRR21712309&#39;</span> metadata: md5 ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:16 vdb-validate.3.0.0 info: Table <span class="s1">&#39;SEQUENCE&#39;</span> metadata: md5 ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:16 vdb-validate.3.0.0 info: Column <span class="s1">&#39;ALTREAD&#39;</span>: checksums ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:17 vdb-validate.3.0.0 info: Column <span class="s1">&#39;QUALITY&#39;</span>: checksums ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:20 vdb-validate.3.0.0 info: Column <span class="s1">&#39;READ&#39;</span>: checksums ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database <span class="s1">&#39;/data/SRR21712309&#39;</span> contains only unaligned reads
</span></span><span class="line"><span class="cl">2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database <span class="s1">&#39;SRR21712309&#39;</span> is consistent
</span></span><span class="line"><span class="cl">Read <span class="m">8631759</span> spots <span class="k">for</span> /data/SRR21712309
</span></span><span class="line"><span class="cl">Written <span class="m">8631759</span> spots <span class="k">for</span> /data/SRR21712309
</span></span><span class="line"><span class="cl">INFO:root:<span class="o">(</span><span class="s1">&#39;/data/SRR21712309_1.fastq.gz&#39;</span>, <span class="s1">&#39;/data/SRR21712309_2.fastq.gz&#39;</span><span class="o">)</span>
</span></span></code></pre></div><p>We&rsquo;ll also see these files appear in our <code>./data</code> directory. Their paths will match
the file paths logged at the end of the output:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="o">(</span>venv<span class="o">)</span> alexjacobs@Alexs-MacBook-Pro ~/r/b/fastq-dump <span class="o">(</span>master<span class="o">)</span>&gt; ls -lh data
</span></span><span class="line"><span class="cl">total <span class="m">1354280</span>
</span></span><span class="line"><span class="cl">-rw-r--r--@ <span class="m">1</span> alexjacobs  staff   213M Sep <span class="m">27</span> 07:30 SRR21712309
</span></span><span class="line"><span class="cl">-rw-r--r--  <span class="m">1</span> alexjacobs  staff   217M Oct <span class="m">11</span> 22:57 SRR21712309_1.fastq.gz
</span></span><span class="line"><span class="cl">-rw-r--r--  <span class="m">1</span> alexjacobs  staff   221M Oct <span class="m">11</span> 22:57 SRR21712309_2.fastq.gz
</span></span></code></pre></div><p>And that&rsquo;s it! This is a lot of explanation for a simple toy example, but hopefully it is helpful
to someone just getting started!</p>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
