<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Docker on Alex Jacobs</title>
    <link>https://alex-jacobs.com/tags/docker/</link>
    <description>Recent content in Docker on Alex Jacobs</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 15 Jun 2022 00:00:00 +0000</lastBuildDate><atom:link href="https://alex-jacobs.com/tags/docker/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Running Jupyter lab behind NGINX--Part 2</title>
      <link>https://alex-jacobs.com/posts/jupyterlab2/</link>
      <pubDate>Wed, 15 Jun 2022 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/jupyterlab2/</guid>
      <description>Part 2: Configuring Jupyter Lab authentication behind NGINX, handling token-based auth bypass, and securing the reverse proxy setup on EC2.</description>
      <content:encoded><![CDATA[<h4 id="if-you-havent-read-part-1postsjupyterlab1-you-may-want-to-start-there">If you haven&rsquo;t read <a href="/posts/jupyterlab1/">part 1</a>, you may want to start there.</h4>
<p>In the <a href="/posts/jupyterlab1/">last post</a>, we left off with a working reverse proxy, but we couldn&rsquo;t access Jupyter Lab because of
its token auth enforcement. Because of how we&rsquo;re setting this up, we will be handling
authentication upstream of Jupyter Lab, so we don&rsquo;t want Jupyter Lab to enforce its own. Disabling Jupyter&rsquo;s built-in auth like this is generally considered &ldquo;unsafe.&rdquo;<br>
Again, if you&rsquo;re looking to do this for your team, check out <a href="https://jupyter.org/hub">Jupyter Hub</a>&ndash;it probably makes more sense
for your use case.</p>
<p>To disable token auth, we will update our Jupyter Lab config.</p>
<p>There is an extensive config file for Jupyter Lab. In a production environment, I recommend using it (you can generate a sample file
by running <code>jupyter notebook --generate-config</code>). But, for this toy example, we will pass
our config as command-line arguments. To disable token auth and to allow requests from any origin, we&rsquo;re going to update the ENTRYPOINT in our Jupyter Lab
Dockerfile to include these arguments:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="s2">&#34;--ServerApp.token=&#34;</span>, <span class="s2">&#34;--ServerApp.password=&#34;</span>, <span class="s2">&#34;--ServerApp.allow_origin&#34;</span>, <span class="s2">&#34;*&#34;</span>
</span></span></code></pre></div><p>Our Dockerfile should now look like</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl"><span class="k">FROM</span><span class="s"> ubuntu:20.04</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">DEBIAN_FRONTEND</span><span class="o">=</span>noninteractive
</span></span><span class="line"><span class="cl"><span class="k">RUN</span> apt-get update <span class="o">&amp;&amp;</span> apt-get install -y <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3-pip <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3-dev <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3 -m pip install jupyterlab<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">RUN</span> useradd -ms /bin/bash jupyter<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">EXPOSE</span><span class="s"> 8888</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">ENTRYPOINT</span> <span class="p">[</span><span class="s2">&#34;jupyter&#34;</span><span class="p">,</span> <span class="s2">&#34;lab&#34;</span><span class="p">,</span> <span class="s2">&#34;--ip=0.0.0.0&#34;</span><span class="p">,</span> <span class="s2">&#34;--port&#34;</span><span class="p">,</span> <span class="s2">&#34;8888&#34;</span><span class="p">,</span> <span class="s2">&#34;--allow-root&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.token=&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.password=&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.allow_origin&#34;</span><span class="p">,</span> <span class="s2">&#34;*&#34;</span><span class="p">]</span><span class="err">
</span></span></span></code></pre></div><p>And if we rebuild and start our docker compose again</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker compose build <span class="o">&amp;&amp;</span> docker compose up
</span></span></code></pre></div><p>We now get through to Jupyter!
<img loading="lazy" src="/posts/jupyterlab2/img5.png" type="" alt="jupyter lab"  /></p>
<p>But if we try and open the Python kernel, we&rsquo;ll notice it&rsquo;s having trouble connecting.
<img loading="lazy" src="/posts/jupyterlab2/img3.png" type="" alt="jupyter lab_cant_connect"  /></p>
<p>Opening our browser dev tools shows that there is an issue with how our proxy is handling WebSockets
<img loading="lazy" src="/posts/jupyterlab2/img4.png" type="" alt="jupyter lab_websockets"  /></p>
<p>We&rsquo;ll have to update our Nginx config to address this.<br>
We will add these lines to the <code>/</code> location in our server block so that the WebSocket headers are set properly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-nginx" data-lang="nginx"><span class="line"><span class="cl"><span class="k">...</span>
</span></span><span class="line"><span class="cl">  <span class="s">server</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">listen</span>       <span class="mi">8000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">server_name</span>  <span class="s">localhost</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">location</span> <span class="s">/</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">Host</span> <span class="nv">$host</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">X-Real-IP</span> <span class="nv">$remote_addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_hide_header</span> <span class="s">&#34;X-Frame-Options&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_pass</span> <span class="s">http://upstream_jupyter</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1"># websocket support
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kn">proxy_http_version</span> <span class="mi">1</span><span class="s">.1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">Upgrade</span> <span class="s">&#34;websocket&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">Connection</span> <span class="s">&#34;Upgrade&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_read_timeout</span> <span class="mi">86400</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="k">...</span>
</span></span></code></pre></div><p>And now, if we restart our containers using the updated config, we&rsquo;ll see our kernel connects!</p>
<p><img loading="lazy" src="/posts/jupyterlab2/img6.png" type="" alt="jupyter lab_websockets"  /></p>
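<p>To make the role of those <code>Upgrade</code> and <code>Connection</code> headers a bit more concrete, here is a rough sketch of the server side of a WebSocket opening handshake as described in RFC 6455. This is not Jupyter&rsquo;s or Nginx&rsquo;s actual code, and the function names are made up for illustration. The point is that the backend only switches protocols when it sees the upgrade headers, which is why the proxy has to pass them along.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def is_upgrade_request(headers):
    """Return True if the headers ask to switch to the WebSocket protocol."""
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    connection_tokens = [t.strip() for t in lowered.get("connection", "").split(",")]
    return lowered.get("upgrade") == "websocket" and "upgrade" in connection_tokens


def accept_key(client_key):
    """Compute the Sec-WebSocket-Accept value for a Sec-WebSocket-Key."""
    digest = hashlib.sha1((client_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Headers as the backend sees them once the proxy forwards the upgrade
proxied = {
    "Upgrade": "websocket",
    "Connection": "Upgrade",
    "Sec-WebSocket-Key": "dGhlIHNhbXBsZSBub25jZQ==",
}
print(is_upgrade_request(proxied))
print(accept_key(proxied["Sec-WebSocket-Key"]))

# Without the extra proxy_set_header lines, the upgrade headers never arrive
print(is_upgrade_request({"Host": "localhost"}))
</code></pre></div>
<p>The first check passes and the second one doesn&rsquo;t, which is roughly the difference between our kernel connecting and the error we saw in the dev tools.</p>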
<p>If you&rsquo;re wondering how we will handle security when we&rsquo;re basically giving whoever is using this a terminal
into our cloud, the answer is using AWS to isolate the instance via IAM roles and policies. We aren&rsquo;t going to get too much into
that in this post, but it is a valid concern. There isn&rsquo;t much we can do to prevent a privilege escalation/container escape
by a sophisticated user, but we can at least not give root access.</p>
<p>We&rsquo;re going to update our Jupyter Dockerfile to have a new user, &lsquo;jupyter&rsquo;, and we&rsquo;ll run Jupyter Lab as this user.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl">...<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">RUN</span> useradd -ms /bin/bash jupyter<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">USER</span><span class="s"> jupyter</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span>...<span class="err">
</span></span></span></code></pre></div><p>We&rsquo;re also going to update our ENTRYPOINT, so the Jupyter Lab root directory is set to the Jupyter user&rsquo;s home directory</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="s2">&#34;--ServerApp.root_dir&#34;</span>, <span class="s2">&#34;/home/jupyter&#34;</span>, <span class="s2">&#34;--ServerApp.notebook_dir&#34;</span>, <span class="s2">&#34;/home/jupyter&#34;</span>
</span></span></code></pre></div><p>Our Dockerfile should now look like</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl"><span class="k">FROM</span><span class="s"> ubuntu:20.04</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">DEBIAN_FRONTEND</span><span class="o">=</span>noninteractive
</span></span><span class="line"><span class="cl"><span class="k">RUN</span> apt-get update <span class="o">&amp;&amp;</span> apt-get install -y <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3-pip <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3-dev <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3 -m pip install jupyterlab<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="c"># add user and switch to them</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">RUN</span> useradd -ms /bin/bash jupyter<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">USER</span><span class="s"> jupyter</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">EXPOSE</span><span class="s"> 8888</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">ENTRYPOINT</span> <span class="p">[</span><span class="s2">&#34;jupyter&#34;</span><span class="p">,</span> <span class="s2">&#34;lab&#34;</span><span class="p">,</span> <span class="s2">&#34;--ip=0.0.0.0&#34;</span><span class="p">,</span> <span class="s2">&#34;--port&#34;</span><span class="p">,</span> <span class="s2">&#34;8888&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.token=&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.password=&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.allow_origin&#34;</span><span class="p">,</span> <span class="s2">&#34;*&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.root_dir&#34;</span><span class="p">,</span> <span class="s2">&#34;/home/jupyter&#34;</span><span class="p">,</span> <span class="s2">&#34;--ServerApp.notebook_dir&#34;</span><span class="p">,</span> <span class="s2">&#34;/home/jupyter&#34;</span><span class="p">]</span><span class="err">
</span></span></span></code></pre></div><p>If we reload our site, we&rsquo;ll see that the working directory is now set to <code>/home/jupyter</code>, and if we try to
write to <code>/</code>, we&rsquo;ll get a permissions error. It&rsquo;s important to note that while this makes it a little more
difficult for a malicious user to take over this &lsquo;instance&rsquo;, we will be giving them access to the internet,
the ability to download and install packages, execute code, etc. It would not be too difficult for someone with malicious
intent to get around this. Changing the user and working directory does more to keep an innocent user from accidentally breaking
something.</p>
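<p>If you want to check this from inside a notebook, something like the following would do. It&rsquo;s a small sketch using only the standard library; <code>describe_access</code> is a made-up helper name, and what it prints depends on how your container is set up.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import os


def describe_access(path):
    """Report which user ID we run as and whether that user may write to path."""
    return {
        "uid": os.getuid(),  # 0 would mean we are still running as root
        "path": path,
        "writable": os.access(path, os.W_OK),
    }


# With the Dockerfile above we would expect "/" to be read-only
# and the home directory to be writable
print(describe_access("/"))
print(describe_access(os.path.expanduser("~")))
</code></pre></div>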
<p>Great! Now we have disabled token authentication, added a system user (who is now running Jupyter), and changed our notebook
directory to our user&rsquo;s directory! In the next post, we&rsquo;ll set up a task definition and deploy to ECS.</p>
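<p>As an aside, the flags we added to the ENTRYPOINT in this post could also live in a config file instead. Here is a rough sketch of what that might look like. The file name <code>jupyter_server_config.py</code> and its location (a directory Jupyter searches, such as <code>~/.jupyter/</code>) are assumptions on my part, and nothing else in this series depends on it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># jupyter_server_config.py (assumed file name, not part of this post's setup)
c = get_config()  # noqa: F821 (get_config is injected by Jupyter when it loads this file)

# Same effect as "--ServerApp.token=" and "--ServerApp.password="
c.ServerApp.token = ""
c.ServerApp.password = ""

# Same effect as "--ServerApp.allow_origin", "*"
c.ServerApp.allow_origin = "*"

# Same effect as "--ServerApp.root_dir", "/home/jupyter"
c.ServerApp.root_dir = "/home/jupyter"
</code></pre></div>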
]]></content:encoded>
    </item>
    
    <item>
      <title>Running Jupyter lab behind NGINX--Part 1</title>
      <link>https://alex-jacobs.com/posts/jupyterlab1/</link>
      <pubDate>Sun, 08 May 2022 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/jupyterlab1/</guid>
      <description>This multipart blog will walk through our setup running Jupyter Lab, using Nginx as a reverse proxy using a sidecar pattern to deploy to EC2.</description>
      <content:encoded><![CDATA[<h1 id="background">Background</h1>
<p>Jupyter Lab is an open source web-based IDE for notebooks with Python and R support, geared towards the data science crowd.
It&rsquo;s a powerful, mature application with a potentially complex configuration. Our requirement was to deliver Jupyter Lab to users
so that each user would have their own isolated &ldquo;instance.&rdquo; There is an off-the-shelf solution for this called Jupyter Hub that probably makes
the most sense for your organization. This example will be a proof of concept of how you could roll your own solution.</p>
<h2 id="step-1--jupyter-lab-docker">Step 1&ndash;Jupyter Lab Docker</h2>
<h5 id="if-you-arent-familiar-with-docker-were-going-to-be-using-it-a-lot-here-so-check-out-some-guides">If you aren&rsquo;t familiar with Docker, we&rsquo;re going to be using it a lot here, so check out some guides</h5>
<p>Our first step will be getting Jupyter Lab up and running in a container.
There are many Docker images available on <a href="https://hub.docker.com/">Docker Hub</a> for Jupyter Lab, but since we&rsquo;re rolling
everything ourselves, we might as well make our own image. It gives us more control over our code, and it&rsquo;s a pretty simple Dockerfile anyway.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl"><span class="k">FROM</span><span class="s"> ubuntu:20.04</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">DEBIAN_FRONTEND</span><span class="o">=</span>noninteractive
</span></span><span class="line"><span class="cl"><span class="k">RUN</span> apt-get update <span class="o">&amp;&amp;</span> apt-get install -y <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3-pip <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>      python3-dev <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    <span class="o">&amp;&amp;</span> python3 -m pip install jupyterlab<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">EXPOSE</span><span class="s"> 8888</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">ENTRYPOINT</span> <span class="p">[</span><span class="s2">&#34;jupyter&#34;</span><span class="p">,</span> <span class="s2">&#34;lab&#34;</span><span class="p">,</span> <span class="s2">&#34;--ip=0.0.0.0&#34;</span><span class="p">,</span> <span class="s2">&#34;--port&#34;</span><span class="p">,</span> <span class="s2">&#34;8888&#34;</span><span class="p">,</span> <span class="s2">&#34;--allow-root&#34;</span><span class="p">]</span><span class="err">
</span></span></span></code></pre></div><p>There are probably some good arguments for why you should use Alpine or something else as the base image here, but I&rsquo;m
a sucker for Ubuntu. Since this isn&rsquo;t a Docker tutorial, I&rsquo;m not going to go into great detail here about what each
line in this Dockerfile does, but assume that it installs Jupyter Lab and configures it to run on port 8888.
We&rsquo;ll expand on the Jupyter Lab config (and make some changes) later, but for now, this works fine.</p>
<p>We&rsquo;re going to use Docker compose to run this. Our compose file looks like</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;3&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">jupyter</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">build</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">context</span><span class="p">:</span><span class="w"> </span><span class="l">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">dockerfile</span><span class="p">:</span><span class="w"> </span><span class="l">Dockerfile</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">jupyter</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">jupyter</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="m">8888</span><span class="p">:</span><span class="m">8888</span><span class="w">
</span></span></span></code></pre></div><p>We can run this with <code>docker compose up</code></p>
<p>This will start our Jupyter Lab container and make it available at <a href="http://127.0.0.1:8888/lab/">http://127.0.0.1:8888/lab/</a></p>
<p><img loading="lazy" src="/posts/jupyterlab1/img1.png" type="" alt="jupyter lab"  /></p>
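<p>If you&rsquo;d rather check this from a script than a browser, a quick sketch like the one below should work. It only uses the standard library, and <code>is_up</code> is a made-up helper name; all it tells you is whether something answers HTTP at that address, not that it is actually Jupyter Lab.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import http.client
import urllib.error
import urllib.request


def is_up(url, timeout=3.0):
    """Return True if anything answers HTTP at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status
        return True
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures and garbled replies
        return False


print(is_up("http://127.0.0.1:8888/lab/"))
</code></pre></div>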
<h3 id="awesome-were-part-of-the-way-there">Awesome! We&rsquo;re part of the way there!</h3>
<p>Next, we need to put together an Nginx Dockerfile.</p>
<p>While we could use the official Nginx image, in keeping with the theme, we&rsquo;re going to create our own Nginx image (and it&rsquo;s also
<em>really</em> simple)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="cl"><span class="k">FROM</span><span class="s"> ubuntu:20.04</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">RUN</span> apt-get update <span class="o">&amp;&amp;</span> apt-get -y install nginx<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">COPY</span> nginx.conf /etc/nginx/nginx.conf<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">EXPOSE</span><span class="s"> 8000</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="k">ENTRYPOINT</span> <span class="p">[</span><span class="s2">&#34;/usr/sbin/nginx&#34;</span><span class="p">]</span><span class="err">
</span></span></span></code></pre></div><p>Pretty straightforward. Our config file is also pretty simple. We&rsquo;re going to use port 8000, and we&rsquo;re going to simply
forward all requests directly to Jupyter Lab.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-nginx" data-lang="nginx"><span class="line"><span class="cl"><span class="k">daemon</span> <span class="no">off</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="k">error_log</span> <span class="s">/dev/stdout</span> <span class="s">info</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">events</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="k">http</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="kn">access_log</span> <span class="s">/dev/stdout</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="kn">upstream</span> <span class="s">upstream_jupyter</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">server</span> <span class="n">jupyter</span><span class="p">:</span><span class="mi">8888</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">keepalive</span> <span class="mi">32</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="kn">server</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">listen</span>       <span class="mi">8000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">server_name</span>  <span class="s">localhost</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">location</span> <span class="s">/</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">Host</span> <span class="nv">$host</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">X-Real-IP</span> <span class="nv">$remote_addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_hide_header</span> <span class="s">&#34;X-Frame-Options&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_pass</span> <span class="s">http://upstream_jupyter</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The final piece that will tie these together is our docker-compose file. Our docker-compose is pretty simple as well.
Because we&rsquo;re using Docker Compose, the networking between the containers is handled for us, and we can point to the Jupyter container by its service
name, <code>jupyter</code>, in our nginx.conf.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;3&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">jupyter</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">build</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">context</span><span class="p">:</span><span class="w"> </span><span class="l">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">dockerfile</span><span class="p">:</span><span class="w"> </span><span class="l">Dockerfile</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">jupyter</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">jupyter</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="m">8888</span><span class="p">:</span><span class="m">8888</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">nginx</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">build</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">context</span><span class="p">:</span><span class="w"> </span><span class="l">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">dockerfile</span><span class="p">:</span><span class="w"> </span><span class="l">nginx.Dockerfile</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">jupyter-nginx</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">nginx</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="m">8000</span><span class="p">:</span><span class="m">8000</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./nginx.conf:/etc/nginx/nginx.conf</span><span class="w">
</span></span></span></code></pre></div><p>Now, let&rsquo;s head to <a href="http://127.0.0.1:8000">http://127.0.0.1:8000</a>, and&hellip;</p>
<p><img loading="lazy" src="/posts/jupyterlab1/img2.png" type="" alt="nginx jupyter lab"  />
Awesome! We&rsquo;re being proxied to Jupyter Lab. But, we see a page requiring token auth. This is because Jupyter is currently
configured to enforce this. In the next post, we&rsquo;ll deal with this and some other things regarding permissions, creating a user, and making a
task definition for deploying this configuration to ECS.</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Splitting SRA into FASTQ with SRAToolkit, Python, and Docker</title>
      <link>https://alex-jacobs.com/posts/fastqsplit/</link>
      <pubDate>Sun, 03 Apr 2022 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/fastqsplit/</guid>
      <description>A simple example using Python and Docker to split an SRA file into Fastq</description>
      <content:encoded><![CDATA[<h1 id="background">Background</h1>
<p>SRA (Sequence Read Archive) is a file format used by NCBI, EBI, etc., for storing genomic read data. It works with
multiple file types (BAM, HDF5, FASTQ). In our case, we&rsquo;re going to be focusing on FASTQ. The first step of many pipelines
is converting SRA into FASTQ, which will be our focus in this post.</p>
<p>If you&rsquo;re working as an individual or a scientist, you probably want to
go ahead and use SRA Toolkit to download your files. For our purposes here, though,
we&rsquo;re going to assume you already have your files downloaded (and are probably
using a tool other than SRA Toolkit for file I/O).</p>
<p>SRA Toolkit is pretty frustrating to use. It is not designed for programmatic use or as part of larger systems
(and the developers seem hostile to the idea that someone would even try and do this 😲). It wants you to
do an interactive configuration on every install (see <a href="https://github.com/ncbi/sra-tools/issues/77">https://github.com/ncbi/sra-tools/issues/77</a>).
We get around this with a dumb hack to make it think we&rsquo;ve gone through this process and configured it. That&rsquo;s what&rsquo;s happening in lines 25-26 of the Dockerfile below.
(It&rsquo;s kind of messy to create the config file this way. It would probably be better to make it a separate file and copy it in,
but for our purposes, I want to contain everything in a single file with no external dependencies.)</p>
<h2 id="dockerfile">Dockerfile</h2>
<p>We will be using Docker and Python for this, so our first step is to create a Dockerfile with the tools we need
installed. Here&rsquo;s our Dockerfile&ndash;I&rsquo;ve added comments to explain what I&rsquo;m doing, but if you don&rsquo;t know <em>anything</em> about Docker,
this isn&rsquo;t a day one tutorial, so check out one of those first.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-dockerfile" data-lang="dockerfile"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># We&#39;re using ubuntu 20.04 as our base image</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="err"></span><span class="k">FROM</span><span class="s"> ubuntu:20.04</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="err"></span><span class="c"># Set DEBIAN_FRONTEND to non-interactive to avoid interactive configuration</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">DEBIAN_FRONTEND</span><span class="o">=</span>noninteractive
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c"># Set SRA toolkit version</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">SRATOOLKIT_VERSION</span><span class="o">=</span><span class="s2">&#34;3.0.0&#34;</span><span class="err">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">USER</span><span class="o">=</span>alex
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">ENV</span> <span class="nv">DATA</span><span class="o">=</span>/data<span class="err">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="err"></span><span class="c"># change our working directory</span><span class="err">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="err"></span><span class="k">WORKDIR</span><span class="s"> /opt</span><span class="err">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="err"></span>    <span class="c1"># update package list and install wget, python</span><span class="err">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="err"></span><span class="k">RUN</span> apt-get update <span class="o">&amp;&amp;</span> apt-get install -y <span class="se">\
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="se"></span>    wget <span class="se">\
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="se"></span>    python3 <span class="o">&amp;&amp;</span> ln -sf python3 /usr/bin/python <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="se"></span>    <span class="c1"># download and decompress our version of SRA Toolkit</span><span class="err">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="err"></span>    wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>/sratoolkit.<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>-ubuntu64.tar.gz <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="se"></span>    tar xvf /opt/sratoolkit.<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>-ubuntu64.tar.gz<span class="err">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="err"></span><span class="c"># add SRA toolkit binaries to our path</span><span class="err">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="err"></span><span class="k">ENV</span> <span class="nv">PATH</span><span class="o">=</span>/opt/sratoolkit.<span class="si">${</span><span class="nv">SRATOOLKIT_VERSION</span><span class="si">}</span>-ubuntu64/bin:<span class="si">${</span><span class="nv">PATH</span><span class="si">}</span><span class="err">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="err"></span>    <span class="c1"># create our user</span><span class="err">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="err"></span><span class="k">RUN</span> useradd -ms /bin/bash <span class="si">${</span><span class="nv">USER</span><span class="si">}</span> <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="se"></span>    <span class="c1"># This creates a file tricking SRA Toolkit into thinking we&#39;ve gone through the manual configuration</span><span class="err">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="err"></span>    mkdir /home/<span class="si">${</span><span class="nv">USER</span><span class="si">}</span>/.ncbi/ <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="se"></span>    <span class="nb">echo</span> <span class="s1">&#39;/LIBS/GUID = &#34;mock-uid&#34;\nconfig/default = &#34;true&#34;&#39;</span> &gt; /home/<span class="si">${</span><span class="nv">USER</span><span class="si">}</span>/.ncbi/user-settings.mkfg <span class="o">&amp;&amp;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="se"></span>    <span class="c1"># create a data directory to work out of and give ownership to our user</span><span class="err">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="err"></span>    mkdir <span class="si">${</span><span class="nv">DATA</span><span class="si">}</span> <span class="o">&amp;&amp;</span> chown <span class="si">${</span><span class="nv">USER</span><span class="si">}</span> <span class="si">${</span><span class="nv">DATA</span><span class="si">}</span><span class="err">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="err"></span><span class="c"># copy in our python script</span><span class="err">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="err"></span><span class="k">COPY</span> dump_fastq.py /usr/bin/<span class="err">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="err"></span><span class="c"># set our user</span><span class="err">
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="err"></span><span class="k">USER</span><span class="s"> ${USER}</span><span class="err">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="err"></span><span class="k">WORKDIR</span><span class="s"> ${DATA}</span><span class="err">
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="err"></span><span class="k">ENTRYPOINT</span> <span class="p">[</span><span class="s2">&#34;python&#34;</span><span class="p">,</span> <span class="s2">&#34;/usr/bin/dump_fastq.py&#34;</span><span class="p">]</span><span class="err">
</span></span></span></code></pre></div><h2 id="python">Python</h2>
<p>Our python script for this is pretty simple. We&rsquo;re assuming that our SRA file has been
downloaded. We&rsquo;re going to be running SRA Toolkit using the python subprocess module.</p>
<p>First, we need to check that our SRA is valid. We use the <code>vdb-validate</code> tool to do this. If we get a good
return code, we then test whether it&rsquo;s paired-end data. I&rsquo;m not going to go into a ton of detail
about paired-end vs. single-end data, but suffice it to say that paired-end reads generally align more accurately. You can read more here: <a href="https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html">https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html</a>.</p>
<p>In most cases, data from modern experiments is paired, but we still need to check. For this toy example, the check may not seem helpful,
but in a production env we would be passing these files on to another step in a pipeline, and we need to be
able to tell if we will have a single read file or multiple to configure the next step properly.</p>
<p>To determine this, we stand on the shoulders of those who&rsquo;ve come before us, and use a version of a function described here
<a href="https://www.biostars.org/p/139422/">https://www.biostars.org/p/139422/</a>.</p>
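<p>The idea behind that function: a FASTQ record is four lines (header, sequence, separator, qualities), so dumping a single spot with <code>-X 1</code> and <code>--split-spot</code> prints four lines for single-ended data and eight for paired. Here&rsquo;s a minimal sketch of just the counting step, using made-up records; the helper name <code>count_reads_in_spot</code> is hypothetical and not part of SRA Toolkit.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def count_reads_in_spot(fastq_text):
    """Return how many reads one dumped spot contains (4 FASTQ lines per read)."""
    lines = fastq_text.splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError('not a whole number of FASTQ records')
    return len(lines) // 4


# made-up records, only here to illustrate the shape of the output
single = '@spot1\nACGT\n+\nIIII\n'
paired = '@spot1\nACGT\n+\nIIII\n@spot1\nTGCA\n+\nIIII\n'
print(count_reads_in_spot(single))  # 1, single-ended
print(count_reads_in_spot(paired))  # 2, paired-ended
</code></pre></div>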
<p>Once we determine the data type, we&rsquo;ll pass our SRA to <code>fastq-dump</code> to split it. We&rsquo;re going to
tell it to give us gzipped output files. If the result is successful, we&rsquo;re simply going to return the paths of the files
(which, in this case, we&rsquo;re just going to log rather than upload).</p>
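<p>Working out those paths is mostly string handling. Here&rsquo;s a rough sketch, assuming (as the output later in this post suggests) that the split files are named after the input&rsquo;s base name plus <code>_1</code> or <code>_2</code>. The helper <code>expected_outputs</code> is a hypothetical name used only for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import os


def expected_outputs(sra_path, out_dir, paired):
    """Guess the gzipped FASTQ paths for an SRA file dumped with --split-files."""
    base = os.path.basename(sra_path)
    # remove only a trailing ".sra"; str.strip('.sra') would strip characters instead
    if base.endswith('.sra'):
        base = base[:-len('.sra')]
    reads = (1, 2) if paired else (1,)
    return [os.path.join(out_dir, f'{base}_{n}.fastq.gz') for n in reads]


print(expected_outputs('/data/SRR21712309', '/data', paired=True))
</code></pre></div>
<p>One thing to watch for: <code>str.strip</code> takes a set of characters, not a suffix, so <code>'data.sra'.strip('.sra')</code> gives <code>'dat'</code>. Checking <code>endswith</code> and slicing avoids that.</p>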
<p>That&rsquo;s basically it. We&rsquo;ve also added some simple error handling. I&rsquo;ve commented the file below to explain better what
we&rsquo;re doing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">sys</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">logging</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">import</span> <span class="nn">argparse</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="kn">from</span> <span class="nn">subprocess</span> <span class="kn">import</span> <span class="n">run</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">INFO</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">log</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="k">def</span> <span class="nf">error</span><span class="p">(</span><span class="n">message</span><span class="o">=</span><span class="s1">&#39;Unexpected Error&#39;</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="n">error</span><span class="p">)</span> <span class="k">if</span> <span class="n">error</span> <span class="k">else</span> <span class="kc">None</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="k">def</span> <span class="nf">is_paired_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="n">run</span><span class="p">([</span><span class="s1">&#39;fastq-dump&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">                      <span class="s1">&#39;-X&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">                      <span class="s1">&#39;1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">                      <span class="s1">&#39;-Z&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">                      <span class="s1">&#39;--split-spot&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">                      <span class="n">sra</span><span class="p">],</span> <span class="n">capture_output</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">        <span class="c1"># check for good return</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">        <span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">            <span class="c1"># get number of lines in stdout</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">            <span class="n">num_lines</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">splitlines</span><span class="p">())</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl">            <span class="c1"># 4 lines indicates single end fastq</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">            <span class="k">if</span> <span class="n">num_lines</span> <span class="o">==</span> <span class="mi">4</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl">                <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl">            <span class="c1"># 8 lines indicates paired end</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">            <span class="k">elif</span> <span class="n">num_lines</span> <span class="o">==</span> <span class="mi">8</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">33</span><span class="cl">                <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl">            <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl">                <span class="c1"># There are cases where an index may be included, and 12 lines would be output</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">                <span class="c1"># for our purposes here, we are going to treat this as an error</span>
</span></span><span class="line"><span class="ln">37</span><span class="cl">                <span class="n">error</span><span class="p">(</span><span class="s1">&#39;Unable to determine if SRA is paired ended&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">38</span><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl">            <span class="n">error</span><span class="p">(</span><span class="s1">&#39;Unable to determine if SRA is paired ended&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl">
</span></span><span class="line"><span class="ln">41</span><span class="cl">    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">42</span><span class="cl">        <span class="n">error</span><span class="p">(</span><span class="n">error</span><span class="o">=</span><span class="n">e</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">43</span><span class="cl">
</span></span><span class="line"><span class="ln">44</span><span class="cl">
</span></span><span class="line"><span class="ln">45</span><span class="cl"><span class="k">def</span> <span class="nf">validate</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">46</span><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">47</span><span class="cl">        <span class="c1"># validate sra, return result, report error</span>
</span></span><span class="line"><span class="ln">48</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="n">run</span><span class="p">([</span><span class="s1">&#39;vdb-validate&#39;</span><span class="p">,</span> <span class="n">sra</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">49</span><span class="cl">        <span class="k">return</span> <span class="n">result</span><span class="o">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span>
</span></span><span class="line"><span class="ln">50</span><span class="cl">    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">51</span><span class="cl">        <span class="n">error</span><span class="p">(</span><span class="n">error</span><span class="o">=</span><span class="n">e</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">52</span><span class="cl">
</span></span><span class="line"><span class="ln">53</span><span class="cl">
</span></span><span class="line"><span class="ln">54</span><span class="cl"><span class="k">def</span> <span class="nf">split_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">55</span><span class="cl">    <span class="c1"># check if sra is paired</span>
</span></span><span class="line"><span class="ln">56</span><span class="cl">    <span class="n">paired</span> <span class="o">=</span> <span class="n">is_paired_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">57</span><span class="cl">    <span class="c1"># dump sra into fastq</span>
</span></span><span class="line"><span class="ln">58</span><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">run</span><span class="p">([</span><span class="s1">&#39;fastq-dump&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">59</span><span class="cl">                  <span class="s1">&#39;--split-files&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">60</span><span class="cl">                  <span class="s1">&#39;--gzip&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">61</span><span class="cl">                  <span class="s1">&#39;--outdir&#39;</span><span class="p">,</span> <span class="n">results_dir</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">62</span><span class="cl">                  <span class="n">sra</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">63</span><span class="cl">
</span></span><span class="line"><span class="ln">64</span><span class="cl">    <span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">65</span><span class="cl">        <span class="c1"># in prod, we would likely upload the files here. For our purposes we are just going to report their local paths</span>
</span></span><span class="line"><span class="ln">66</span><span class="cl">        <span class="c1"># drop only a trailing &#34;.sra&#34;; str.strip() removes characters, not a suffix</span>
</span></span><span class="line"><span class="ln">67</span><span class="cl">        <span class="n">prefix</span> <span class="o">=</span> <span class="n">sra</span><span class="p">[:</span><span class="o">-</span><span class="mi">4</span><span class="p">]</span> <span class="k">if</span> <span class="n">sra</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">&#39;.sra&#39;</span><span class="p">)</span> <span class="k">else</span> <span class="n">sra</span>
</span></span><span class="line"><span class="ln">68</span><span class="cl">        <span class="k">if</span> <span class="n">paired</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">69</span><span class="cl">            <span class="k">return</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">prefix</span><span class="si">}</span><span class="s1">_1.fastq.gz&#39;</span><span class="p">,</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">prefix</span><span class="si">}</span><span class="s1">_2.fastq.gz&#39;</span>
</span></span><span class="line"><span class="ln">70</span><span class="cl">        <span class="c1"># single-ended data only produces read 1 (sra already includes the data directory)</span>
</span></span><span class="line"><span class="ln">71</span><span class="cl">        <span class="k">return</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">prefix</span><span class="si">}</span><span class="s1">_1.fastq.gz&#39;</span>
</span></span><span class="line"><span class="ln">72</span><span class="cl">
</span></span><span class="line"><span class="ln">73</span><span class="cl">
</span></span><span class="line"><span class="ln">74</span><span class="cl"><span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">75</span><span class="cl">    <span class="c1"># check for valid sra file</span>
</span></span><span class="line"><span class="ln">76</span><span class="cl">    <span class="k">if</span> <span class="n">validate</span><span class="p">(</span><span class="n">sra</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">77</span><span class="cl">        <span class="c1"># dump sra, report results</span>
</span></span><span class="line"><span class="ln">78</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="n">split_sra</span><span class="p">(</span><span class="n">sra</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">79</span><span class="cl">        <span class="n">log</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">80</span><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">81</span><span class="cl">        <span class="n">error</span><span class="p">(</span><span class="s1">&#39;SRA is not valid&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">82</span><span class="cl">
</span></span><span class="line"><span class="ln">83</span><span class="cl">
</span></span><span class="line"><span class="ln">84</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;__main__&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">85</span><span class="cl">    <span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s1">&#39;Split SRA&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">86</span><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;--sra&#39;</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">&#39;Path to SRA&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">87</span><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;-data_dir&#39;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s1">&#39;/data&#39;</span><span class="p">,</span> <span class="n">required</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">&#39;Data directory&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">88</span><span class="cl">    <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">89</span><span class="cl">    <span class="n">results_dir</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">data_dir</span>
</span></span><span class="line"><span class="ln">90</span><span class="cl">    <span class="n">main</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="n">results_dir</span><span class="si">}</span><span class="s1">/</span><span class="si">{</span><span class="n">args</span><span class="o">.</span><span class="n">sra</span><span class="si">}</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="testing-out-our-code">Testing out our code</h2>
<p>Next, we&rsquo;re going to need some data. SRA files are often pretty large (sometimes hundreds of gigabytes).<br>
Typically, S. cerevisiae RNAseq datasets are pretty small but also fully functional, so we&rsquo;re going to
be using one of those: <a href="https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR21712309">https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR21712309</a>.
(To find this, I went to <a href="https://ncbi.nlm.nih.gov/sra">https://ncbi.nlm.nih.gov/sra</a>, entered S. cerevisiae in the search bar, and selected the first one 🙂.)</p>
<p>Now that we have data, we can run this container using Docker (this assumes the image has been built and tagged <code>fastq-dump</code>, e.g. with <code>docker build -t fastq-dump .</code>). We will
bind a directory on our host machine (<code>./data</code>) to our <code>/data</code> directory in our container. This folder is where our input data will live and where the FASTQ files our script generates will be written.</p>
<p>To run it, we&rsquo;ll simply execute:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker run -v<span class="nv">$PWD</span>/data:/data fastq-dump --sra SRR21712309
</span></span></code></pre></div><p>and we&rsquo;ll see output like this logged to the console:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">2022-10-12T01:21:16 vdb-validate.3.0.0 info: Database <span class="s1">&#39;SRR21712309&#39;</span> metadata: md5 ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:16 vdb-validate.3.0.0 info: Table <span class="s1">&#39;SEQUENCE&#39;</span> metadata: md5 ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:16 vdb-validate.3.0.0 info: Column <span class="s1">&#39;ALTREAD&#39;</span>: checksums ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:17 vdb-validate.3.0.0 info: Column <span class="s1">&#39;QUALITY&#39;</span>: checksums ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:20 vdb-validate.3.0.0 info: Column <span class="s1">&#39;READ&#39;</span>: checksums ok
</span></span><span class="line"><span class="cl">2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database <span class="s1">&#39;/data/SRR21712309&#39;</span> contains only unaligned reads
</span></span><span class="line"><span class="cl">2022-10-12T01:21:21 vdb-validate.3.0.0 info: Database <span class="s1">&#39;SRR21712309&#39;</span> is consistent
</span></span><span class="line"><span class="cl">Read <span class="m">8631759</span> spots <span class="k">for</span> /data/SRR21712309
</span></span><span class="line"><span class="cl">Written <span class="m">8631759</span> spots <span class="k">for</span> /data/SRR21712309
</span></span><span class="line"><span class="cl">INFO:root:<span class="o">(</span><span class="s1">&#39;/data/SRR21712309_1.fastq.gz&#39;</span>, <span class="s1">&#39;/data/SRR21712309_2.fastq.gz&#39;</span><span class="o">)</span>
</span></span></code></pre></div><p>We&rsquo;ll also see these files appear in our <code>./data</code> directory. Their paths will match
the file paths logged at the end:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="o">(</span>venv<span class="o">)</span> alexjacobs@Alexs-MacBook-Pro ~/r/b/fastq-dump <span class="o">(</span>master<span class="o">)</span>&gt; ls -lh data
</span></span><span class="line"><span class="cl">total <span class="m">1354280</span>
</span></span><span class="line"><span class="cl">-rw-r--r--@ <span class="m">1</span> alexjacobs  staff   213M Sep <span class="m">27</span> 07:30 SRR21712309
</span></span><span class="line"><span class="cl">-rw-r--r--  <span class="m">1</span> alexjacobs  staff   217M Oct <span class="m">11</span> 22:57 SRR21712309_1.fastq.gz
</span></span><span class="line"><span class="cl">-rw-r--r--  <span class="m">1</span> alexjacobs  staff   221M Oct <span class="m">11</span> 22:57 SRR21712309_2.fastq.gz
</span></span></code></pre></div><p>And that&rsquo;s it! This is a lot of explanation for a simple toy example, but hopefully it&rsquo;s helpful
to someone just getting started!</p>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
