<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Production on Alex Jacobs</title>
    <link>https://alex-jacobs.com/tags/production/</link>
    <description>Recent content in Production on Alex Jacobs</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 29 Oct 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://alex-jacobs.com/tags/production/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Case Against pgvector</title>
      <link>https://alex-jacobs.com/posts/the-case-against-pgvector/</link>
      <pubDate>Wed, 29 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/the-case-against-pgvector/</guid>
      <description>What happens when you try to run pgvector in production and discover all the things the blog posts conveniently forgot to mention</description>
      <content:encoded><![CDATA[

<a class="simon-callout" href="https://simonwillison.net/2025/Nov/3/the-case-against-pgvector/" target="_blank" rel="noopener noreferrer">
  <span class="simon-label">Simon Willison</span> shared his thoughts on this post <span class="simon-arrow">&rarr;</span>
</a>

<h2 id="everyone-loves-pgvector-in-theory">Everyone Loves pgvector (in theory)</h2>
<p>If you&rsquo;ve spent any time in the vector search space over the past year, you&rsquo;ve probably read blog posts explaining why pgvector is the obvious choice for your vector database needs. The argument goes something like this: you already have Postgres, vector embeddings are just another data type, why add complexity with a dedicated vector database when you can keep everything in one place?</p>
<p>It&rsquo;s a compelling story. And like most of the AI influencer bullshit that fills my timeline, it glosses over the inconvenient details.</p>
<p>I&rsquo;m not here to tell you pgvector is bad. It&rsquo;s not. It&rsquo;s a useful extension that brings vector similarity search to Postgres. But after spending some time trying to build a production system on top of it, I&rsquo;ve learned that the gap between &ldquo;works in a demo&rdquo; and &ldquo;scales in production&rdquo; is&hellip; significant.</p>
<h2 id="nobodys-actually-run-this-in-production">Nobody&rsquo;s actually run this in production</h2>
<p>What bothers me most: the majority of content about pgvector reads like it was written by someone who spun up a local Postgres instance, inserted 10,000 vectors, ran a few queries, and called it a day. The posts are optimistic, the benchmarks are clean, and the conclusions are confident.</p>
<p>They&rsquo;re also missing about 80% of what you actually need to know.</p>
<p>I&rsquo;ve read through  <style>
    .pgvector-trigger {
        display: inline;
        color: var(--primary, #3273dc);
        text-decoration: underline;
        cursor: pointer;
        text-underline-offset: 2px;
        transition: opacity 0.2s;
    }
    
    .pgvector-trigger:hover {
        opacity: 0.7;
    }

    .pgvector-overlay {
        display: none;
        position: fixed;
        z-index: 9999;
        left: 0;
        top: 0;
        width: 100%;
        height: 100%;
        overflow: auto;
        background-color: rgba(0, 0, 0, 0.7);
        animation: fadeIn 0.2s ease-in;
    }

    .pgvector-overlay.active {
        display: flex;
        align-items: center;
        justify-content: center;
        padding: 20px;
    }

    .pgvector-modal {
        background-color: var(--entry, #fff);
        color: var(--content, #000);
        margin: auto;
        padding: 32px;
        border-radius: 16px;
        width: 90%;
        max-width: 650px;
        max-height: 80vh;
        box-shadow: 0 20px 60px rgba(0, 0, 0, 0.4);
        animation: slideUp 0.3s ease-out;
        display: flex;
        flex-direction: column;
        position: relative;
    }

    .pgvector-close {
        position: absolute;
        top: 16px;
        right: 16px;
        background: none;
        border: none;
        font-size: 28px;
        font-weight: 300;
        cursor: pointer;
        color: var(--secondary, #666);
        padding: 4px 8px;
        line-height: 1;
        border-radius: 4px;
        transition: all 0.2s;
    }

    .pgvector-close:hover {
        background-color: var(--code-bg, #f5f5f5);
        color: var(--content, #000);
    }

    .pgvector-content {
        overflow-y: auto;
    }

    .pgvector-posts {
        display: flex;
        flex-direction: column;
        gap: 8px;
    }

    .pgvector-post {
        padding: 0;
        list-style: none;
    }

    .pgvector-post a {
        text-decoration: none;
        color: var(--primary, #3273dc);
        display: block;
        font-size: 0.95em;
        line-height: 1.6;
        padding: 8px 0;
        transition: all 0.2s;
    }

    .pgvector-post a:hover {
        color: var(--content, #000);
        padding-left: 8px;
    }

    @keyframes fadeIn {
        from { opacity: 0; }
        to { opacity: 1; }
    }

    @keyframes slideUp {
        from {
            opacity: 0;
            transform: translateY(20px) scale(0.95);
        }
        to {
            opacity: 1;
            transform: translateY(0) scale(1);
        }
    }

    @media (max-width: 600px) {
        .pgvector-modal {
            width: 95%;
            max-height: 85vh;
            padding: 24px;
            border-radius: 12px;
        }

        .pgvector-close {
            top: 12px;
            right: 12px;
        }

        .pgvector-post a {
            font-size: 0.9em;
        }
    }
</style> <span class="pgvector-trigger" onclick="openPgvectorModal()">dozens of these posts.</span><div id="pgvectorOverlay" class="pgvector-overlay" onclick="closePgvectorModal(event)">
    <div class="pgvector-modal" onclick="event.stopPropagation()">
        <button class="pgvector-close" onclick="closePgvectorModal()">&times;</button>
        <div class="pgvector-content">
            <div class="pgvector-posts">
                <div class="pgvector-post"><a href="https://neon.com/blog/understanding-vector-search-and-hnsw-index-with-pgvector" target="_blank" rel="noopener">Understanding Vector Search and HNSW Index with pgvector</a></div>
                <div class="pgvector-post"><a href="https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector" target="_blank" rel="noopener">HNSW Indexes with Postgres and pgvector</a></div>
                <div class="pgvector-post"><a href="https://www.stormatics.tech/blog/understand-indexes-in-pgvector" target="_blank" rel="noopener">Understand Indexes in pgvector</a></div>
                <div class="pgvector-post"><a href="https://www.lantern.dev/blog/external-indexing-for-pgvector" target="_blank" rel="noopener">External Indexing for pgvector</a></div>
                <div class="pgvector-post"><a href="https://www.lantern.dev/blog/exploring-postgres-pgvector-hnsw-index-storage" target="_blank" rel="noopener">Exploring Postgres pgvector HNSW Index Storage</a></div>
                <div class="pgvector-post"><a href="https://supabase.com/blog/pgvector-v0-5-0" target="_blank" rel="noopener">pgvector v0.5.0: Faster semantic search with HNSW indexes</a></div>
                <div class="pgvector-post"><a href="https://info.crunchydata.com/blog/early-look-at-hnsw-performance" target="_blank" rel="noopener">Early Look at HNSW Performance with pgvector</a></div>
                <div class="pgvector-post"><a href="https://tembo.io/blog/vector-indexes-in-postgres-using-pgvector-ivfflat-vs-hnsw" target="_blank" rel="noopener">Vector Indexes in Postgres using pgvector: IVFFlat vs HNSW</a></div>
                <div class="pgvector-post"><a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" target="_blank" rel="noopener">Vector Database Basics: HNSW Index</a></div>
                <div class="pgvector-post"><a href="https://dev.to/azure/postgresql-vector-indexing-hnsw-cosmosdb" target="_blank" rel="noopener">PostgreSQL Vector Indexing with HNSW</a></div>
            </div>
        </div>
    </div>
</div>

<script>
    function openPgvectorModal() {
        const overlay = document.getElementById('pgvectorOverlay');
        overlay.classList.add('active');
        document.body.style.overflow = 'hidden';
    }

    function closePgvectorModal(event) {
        if (!event || event.target === document.getElementById('pgvectorOverlay')) {
            const overlay = document.getElementById('pgvectorOverlay');
            overlay.classList.remove('active');
            document.body.style.overflow = '';
        }
    }

    
    document.addEventListener('keydown', function(event) {
        if (event.key === 'Escape') {
            closePgvectorModal();
        }
    });
</script>
 They all cover the same ground: here&rsquo;s how to install pgvector, here&rsquo;s how to create a vector column, here&rsquo;s a simple similarity search query. Some of them even mention that you should probably add an index.</p>
<p>What they don&rsquo;t tell you is what happens when you actually try to run this in production.</p>
<h2 id="picking-an-index-there-are-no-good-options">Picking an index (there are no good options)</h2>
<p>Let&rsquo;s start with indexes, because this is where the tradeoffs start.</p>
<p>pgvector gives you two index types: IVFFlat and HNSW. The blog posts will tell you that HNSW is newer and generally better, which is&hellip; technically true but deeply unhelpful.</p>
<h3 id="ivfflat">IVFFlat</h3>
<p>IVFFlat (an inverted-file index over flat, uncompressed vectors) partitions your vector space into clusters. During a search, it identifies the clusters nearest the query and searches only within those.</p>
<p>The good:</p>
<ul>
<li>Lower memory footprint during index creation</li>
<li>Reasonable query performance for many use cases</li>
<li>Index creation is faster than HNSW</li>
</ul>
<p>The bad:</p>
<ul>
<li>Requires you to specify the number of lists (clusters) upfront</li>
<li>That number significantly impacts both recall and query performance</li>
<li>The commonly recommended formula (<code>rows / 1000</code> up to about a million rows, <code>sqrt(rows)</code> beyond that) is a starting point at best</li>
<li>Recall can be&hellip; disappointing depending on your data distribution</li>
<li>New vectors get assigned to existing clusters, but clusters don&rsquo;t rebalance without a full rebuild</li>
</ul>
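<p>Concretely, the two knobs look like this (table and column names are illustrative, and the <code>lists</code> and <code>probes</code> values are starting points, not recommendations):</p>
<pre><code class="language-sql">-- lists is fixed at index creation time.
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);

-- probes controls how many clusters are searched per query (default 1).
-- More probes: better recall, slower queries.
SET ivfflat.probes = 10;

SELECT id FROM documents ORDER BY embedding &lt;-&gt; query_vector LIMIT 10;
</code></pre>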
<!-- IMAGE 1: IVFFlat Cluster Visualization
Prompt: Technical diagram showing IVFFlat vector index structure. Show a 2D vector space divided into Voronoi cells/clusters with different colored regions. Include small dots representing vectors clustered within each partition. Label showing 'Query Vector' with arrows pointing to 2-3 nearest clusters that would be searched. Clean, minimal style with a light background. Similar to technical documentation diagrams.
-->
<p><img loading="lazy" src="/posts/the-case-against-pgvector/img_1.png" type="" alt="img_1.png"  />
<em>Image source: <a href="https://unfoldai.com/ivfflat-vs-hnsw/">IVFFlat or HNSW index for similarity search?</a> by Simeon Emanuilov</em></p>
<h3 id="hnsw">HNSW</h3>
<p>HNSW (Hierarchical Navigable Small World) builds a multi-layer graph structure for search.</p>
<p>The good:</p>
<ul>
<li>Better recall than IVFFlat for most datasets</li>
<li>More consistent query performance</li>
<li>Scales well to larger datasets</li>
</ul>
<p>The bad:</p>
<ul>
<li>Significantly higher memory requirements during index builds</li>
<li>Index creation is slow—painfully slow for large datasets</li>
<li>The memory requirements aren&rsquo;t theoretical; they will take down your database if you&rsquo;re not careful</li>
</ul>
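<p>The equivalent HNSW sketch (same illustrative table): <code>m</code> and <code>ef_construction</code> are build-time parameters that drive both graph quality and build memory, while <code>hnsw.ef_search</code> is the query-time recall knob:</p>
<pre><code class="language-sql">-- m = max connections per graph node; ef_construction = candidate list size
-- during the build. Higher values: better recall, slower and hungrier builds.
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);

-- Query-time recall knob (default 40).
SET hnsw.ef_search = 100;
</code></pre>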
<!-- IMAGE 2: HNSW Graph Structure
Prompt: Technical diagram of HNSW hierarchical graph structure showing 3-4 layers. Top layer has sparse nodes with long-range connections, middle layers have medium density, bottom layer is dense with many nodes and local connections. Use different colors for each layer. Show example search path highlighted in a different color traversing from top to bottom. Clean technical style, light background.
-->
<p><img loading="lazy" src="/posts/the-case-against-pgvector/img_2.png" type="" alt="img_2.png"  />
<em>Image source: <a href="https://unfoldai.com/ivfflat-vs-hnsw/">IVFFlat or HNSW index for similarity search?</a> by Simeon Emanuilov</em></p>
<p>None of the blogs mention that building an HNSW index on a few million vectors can consume 10&nbsp;GB of RAM or more (depending on your vector dimensions and dataset size). On your production database. While it&rsquo;s running. For potentially hours.</p>
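<p>You can bound and parallelize the build, but the knobs are blunt. A sketch (the values here are illustrative):</p>
<pre><code class="language-sql">-- Give the build enough memory that the graph fits during construction...
SET maintenance_work_mem = '8GB';
-- ...and more parallel workers. Both raise peak resource usage.
SET max_parallel_maintenance_workers = 7;

-- CONCURRENTLY avoids blocking writes, at the cost of an even slower build.
CREATE INDEX CONCURRENTLY docs_embedding_hnsw
  ON documents USING hnsw (embedding vector_l2_ops);
</code></pre>
<p>Even then, the build competes with production traffic for RAM and I/O for its entire duration.</p>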
<!-- IMAGE 6: Memory Spike Graph
Prompt: Line graph showing RAM usage over time during HNSW index build on production database. X-axis: Time (0-6 hours), Y-axis: RAM usage (GB). Show baseline at ~8GB labeled 'Normal operations', then sharp spike to 25-30GB labeled 'Index build starts', sustained high usage, then drop back to baseline labeled 'Index complete'. Add horizontal dashed line at server RAM limit (e.g., 32GB) marked 'Danger zone'. Include annotation: 'Production queries still running'. Use typical monitoring dashboard style with blue line and red zones.
-->
<p><img loading="lazy" src="/posts/the-case-against-pgvector/img_6.png" type="" alt="img_6.png"  /></p>
<h2 id="real-time-search-is-basically-impossible">Real-time search is basically impossible</h2>
<p>In a typical application, you want newly uploaded data to be searchable immediately. User uploads a document, you generate embeddings, insert them into your database, and they should be available in search results. Simple, right?</p>
<h3 id="how-index-updates-actually-work">How index updates actually work</h3>
<p>When you insert new vectors into a table with an index, one of two things happens:</p>
<ol>
<li>
<p><strong>IVFFlat</strong>: The new vectors are inserted into the appropriate clusters based on the existing structure. This works, but it means your cluster distribution gets increasingly suboptimal over time. The solution is to rebuild the index periodically. Which means downtime, or maintaining a separate index and doing an atomic swap, or accepting degraded search quality.</p>
</li>
<li>
<p><strong>HNSW</strong>: New vectors are added to the graph structure. This is better than IVFFlat, but it&rsquo;s not free. Each insertion requires updating the graph, which means memory allocation, graph traversals, and potential lock contention.</p>
</li>
</ol>
<p>Neither of these is a deal-breaker in isolation. But here&rsquo;s what happens in practice: you&rsquo;re inserting vectors continuously throughout the day. Each insertion is individually cheap, but the aggregate load adds up. Your database is now handling your normal transactional workload, analytical queries, AND maintaining graph structures in memory for vector search.</p>
<h3 id="handling-new-inserts">Handling new inserts</h3>
<p>Let&rsquo;s say you&rsquo;re building a document search system. Users upload PDFs, you extract text, generate embeddings, and insert them. The user expects to immediately search for that document.</p>
<p>Here&rsquo;s what actually happens:</p>
<p><strong>With no index</strong>: The insert is fast, the document is immediately available, but your searches do a full sequential scan. This works fine for a few thousand documents. At a few hundred thousand? Your searches start taking seconds. Millions? Good luck.</p>
<p><strong>With IVFFlat</strong>: The insert is still relatively fast. The vector gets assigned to a cluster. But whoops, a problem. Those initial cluster assignments were based on the data distribution when you built the index. As you add more data, especially if it&rsquo;s not uniformly distributed, some clusters get overloaded. Your search quality degrades. You rebuild the index periodically to fix this, but during the rebuild (which can take hours for large datasets), what do you do with new inserts? Queue them? Write to a separate unindexed table and merge later?</p>
<p><strong>With HNSW</strong>: The graph gets updated on each insert through incremental insertion, which sounds great. But updating an HNSW graph isn&rsquo;t free—you&rsquo;re traversing the graph to find the right place to insert the new node and updating connections. Each insert acquires locks on the graph structure. Under heavy write load, this becomes a bottleneck. And if your write rate is high enough, you start seeing lock contention that slows down both writes and reads.</p>
<!-- IMAGE 3: Real-time Ingestion Timeline
Prompt: Timeline diagram showing the challenges of real-time vector ingestion. Horizontal timeline with events: 'User uploads document' → 'Generate embeddings' → 'Insert to DB' → 'Index rebuild starts (hours)' with a long bar showing duration → 'New data searchable?'. Show a second parallel timeline of 'More users uploading' with question marks about where those writes go. Use warning colors (amber/orange) for problematic areas. Clean infographic style.
-->
<p><img loading="lazy" src="/posts/the-case-against-pgvector/img_3.jpg" type="" alt="img_3.jpg"  /></p>
<h3 id="the-operational-reality">The operational reality</h3>
<p>Here&rsquo;s the real nightmare: you&rsquo;re not just storing vectors. You have metadata—document titles, timestamps, user IDs, categories, etc. That metadata lives in other tables (or other columns in the same table). You need that metadata and the vectors to stay in sync.</p>
<p>In a normal Postgres table, this is easy—transactions handle it. But when you&rsquo;re dealing with index builds that take hours, keeping everything consistent gets complicated. For IVFFlat, periodic rebuilds are basically required to maintain search quality. For HNSW, you might need to rebuild if you want to tune parameters or if performance has degraded.</p>
<p>The problem is that index builds are memory-intensive operations, and Postgres doesn&rsquo;t have a great way to throttle them. You&rsquo;re essentially asking your production database to allocate many gigabytes of RAM (possibly dozens) for an operation that might take hours, while continuing to serve queries.</p>
<p>You end up with strategies like:</p>
<ul>
<li>Write to a staging table, build the index offline, then swap it in (but now you have a window where searches miss new data)</li>
<li>Maintain two indexes and write to both (double the memory, double the update cost)</li>
<li>Build indexes on replicas and promote them</li>
<li>Accept eventual consistency (users upload documents that aren&rsquo;t searchable for N minutes)</li>
<li>Provision significantly more RAM than your &ldquo;working set&rdquo; would suggest</li>
</ul>
<p>None of these are &ldquo;wrong&rdquo; exactly. But they&rsquo;re all workarounds for the fact that pgvector wasn&rsquo;t really designed for high-velocity real-time ingestion.</p>
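<p>The index-swap strategy, for instance, ends up as a script like this (index names are hypothetical):</p>
<pre><code class="language-sql">-- Build a fresh index next to the stale one: slow and memory-hungry,
-- but non-blocking.
CREATE INDEX CONCURRENTLY docs_embedding_idx_new
  ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 2000);

-- Swap names atomically, then drop the old index.
BEGIN;
ALTER INDEX docs_embedding_idx RENAME TO docs_embedding_idx_old;
ALTER INDEX docs_embedding_idx_new RENAME TO docs_embedding_idx;
COMMIT;
DROP INDEX CONCURRENTLY docs_embedding_idx_old;
</code></pre>
<p>Workable, but now you own index lifecycle tooling that a dedicated system would handle for you.</p>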
<h2 id="pre--vs-post-filtering-or-why-you-need-to-become-a-query-planner-expert">Pre- vs. Post-Filtering (or: why you need to become a query planner expert)</h2>
<p>Okay, but let&rsquo;s say you solve your index and insert problems. Now you have a document search system with millions of vectors. Documents have metadata—maybe they&rsquo;re marked as <code>draft</code>, <code>published</code>, or <code>archived</code>. A user searches for something, and you only want to return published documents.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;published&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="n">query_vector</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div><p>Simple enough. But now you have a problem: should Postgres filter on status first (pre-filter) or do the vector search first and then filter (post-filter)?</p>
<p>This seems like an implementation detail. It&rsquo;s not. It&rsquo;s the difference between queries that take 50ms and queries that take 5 seconds. It&rsquo;s also the difference between returning the most relevant results and&hellip; not.</p>
<!-- IMAGE 4: Pre-filter vs Post-filter Comparison
Prompt: Side-by-side comparison diagram of pre-filter vs post-filter vector search. Left side labeled 'Pre-filter': shows filter icon → reduced dataset → vector search icon → results. Right side labeled 'Post-filter': shows vector search icon → all results → filter icon → fewer results (with some crossed out). Include example numbers like '1M docs → 100K filtered → top 10' vs '1M docs → top 10 → maybe 2 match filter'. Use flow diagram style with icons and arrows.
-->
<p><img loading="lazy" src="/posts/the-case-against-pgvector/img_4.jpg" type="" alt="img_4.jpg"  /></p>
<p><strong>Pre-filter</strong> works great when the filter is highly selective (1,000 docs out of 10M). It works terribly when the filter isn&rsquo;t selective—you&rsquo;re still searching millions of vectors.</p>
<p><strong>Post-filter</strong> works when your filter is permissive. Here&rsquo;s where it breaks: imagine you ask for 10 results with <code>LIMIT 10</code>. pgvector finds the 10 nearest neighbors, then applies your filter. Only 3 of those 10 are published. You get 3 results back, even though there might be hundreds of relevant published documents slightly further away in the embedding space.</p>
<p>The user searched, got 3 mediocre results, and has no idea they&rsquo;re missing way better matches that didn&rsquo;t make it into the initial k=10 search.</p>
<!-- IMAGE 5: The Recall Problem Visualization
Prompt: Visualization of the filtered search recall problem. Show a 2D scatter plot of documents in embedding space with two colors: green dots for 'published' and gray dots for 'draft/archived'. Draw a circle around the query point showing 'top 10 nearest neighbors' containing mostly gray dots with only 1-2 green dots. Then show several green dots just outside the circle labeled 'Highly relevant published docs (missed)'. Add text: 'User gets 2 results, misses 50+ relevant matches'. Use clean data visualization style.
-->
<p><img loading="lazy" src="/posts/the-case-against-pgvector/img_5.png" type="" alt="img_5.png"  /></p>
<p>You can work around this by fetching more vectors (say, <code>LIMIT 100</code>) and then filtering, but now:</p>
<ul>
<li>You&rsquo;re doing way more distance calculations than needed</li>
<li>You still don&rsquo;t know if 100 is enough</li>
<li>Your query performance suffers</li>
<li>You&rsquo;re guessing at the right oversampling factor</li>
</ul>
<p>With pre-filter, you avoid this problem, but you get the performance problems I mentioned. Pick your poison.</p>
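<p>The oversampling workaround looks like this: over-fetch from the index, filter in an outer query, and hope the multiplier was big enough (the 100 here is a guess, which is exactly the problem):</p>
<pre><code class="language-sql">SELECT * FROM (
  SELECT * FROM documents
  ORDER BY embedding &lt;-&gt; query_vector
  LIMIT 100  -- oversample: 10x what we actually want
) candidates
WHERE status = 'published'
LIMIT 10;
</code></pre>
<p>Newer pgvector releases add iterative index scans (e.g. <code>SET hnsw.iterative_scan = 'relaxed_order'</code>) that keep fetching candidates until enough survive the filter, which helps, but the extra distance computations don&rsquo;t go away.</p>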
<h3 id="multiple-filters">Multiple filters</h3>
<p>Now add another dimension: you&rsquo;re filtering by user_id AND category AND date_range.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;user123&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;technical&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2024-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="n">query_vector</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div><p>What&rsquo;s the right strategy now?</p>
<ul>
<li>Apply all filters first, then search? (Pre-filter)</li>
<li>Search first, then apply all filters? (Post-filter)</li>
<li>Apply some filters first, search, then apply remaining filters? (Hybrid)</li>
<li>Which filters should you apply in which order?</li>
</ul>
<p>The planner will look at table statistics, index selectivity, and estimated row counts and come up with a plan. That plan will probably be wrong, or at least suboptimal, because the planner&rsquo;s cost model wasn&rsquo;t built for vector similarity search.</p>
<p>And it gets worse: you&rsquo;re inserting new vectors throughout the day. Your index statistics are outdated. The plans get increasingly suboptimal until you ANALYZE the table. But ANALYZE on a large table with millions of rows takes time and resources. And it doesn&rsquo;t really understand vector data distribution in a meaningful way—it can tell you how many rows match <code>user_id = 'user123'</code>, but not how clustered those vectors are in the embedding space, which is what actually matters for search performance.</p>
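<p>The only way to know what the planner actually chose is to look. If the plan shows a sequential scan, or your vector index never appears, the query isn&rsquo;t doing what you think:</p>
<pre><code class="language-sql">EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM documents
WHERE user_id = 'user123'
ORDER BY embedding &lt;-&gt; query_vector
LIMIT 10;
-- Look for an index scan on your hnsw/ivfflat index vs. a Seq Scan,
-- and compare estimated vs. actual row counts.
</code></pre>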
<h3 id="workarounds">Workarounds</h3>
<p>You end up with hacks: query rewriting for different user types, partitioning your data into separate tables, CTE optimization fences to force the planner&rsquo;s hand, or just fetching way more results than needed and filtering in application code.</p>
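<p>Partial indexes are another hack in the same family; a sketch that only works when the filter values are few and known up front:</p>
<pre><code class="language-sql">-- One vector index per status value. Viable for three statuses,
-- useless for arbitrary user_id filters.
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops)
  WHERE status = 'published';
</code></pre>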
<p>None of these are sustainable at scale.</p>
<h3 id="what-vector-databases-do">What vector databases do</h3>
<p>Dedicated vector databases have solved this. They understand the cost model of filtered vector search and make intelligent decisions:</p>
<ul>
<li><strong>Adaptive strategies</strong>: Some databases dynamically choose pre-filter or post-filter based on estimated selectivity</li>
<li><strong>Configurable modes</strong>: Others let you specify the strategy explicitly when you know your data distribution</li>
<li><strong>Specialized indexes</strong>: Some build indexes that support efficient filtered search (like filtered HNSW)</li>
<li><strong>Query optimization</strong>: They track statistics specific to vector operations and optimize accordingly</li>
</ul>
<p>OpenSearch&rsquo;s k-NN plugin, for example, lets you specify pre-filter or post-filter behavior. Pinecone automatically handles filter selectivity. Weaviate has optimizations for common filter patterns.</p>
<p>With pgvector, you get to build all of this yourself. Or live with suboptimal queries. Or hire a Postgres expert to spend weeks tuning your query patterns.</p>
<h2 id="hybrid-search-build-it-yourself">Hybrid search? Build it yourself</h2>
<p>Oh, and if you want hybrid search—combining vector similarity with traditional full-text search—you get to build that yourself too.</p>
<p>Postgres has excellent full-text search capabilities. pgvector has excellent vector search capabilities. Combining them in a meaningful way? That&rsquo;s on you.</p>
<p>You need to:</p>
<ul>
<li>Decide how to weight vector similarity vs. text relevance</li>
<li>Normalize scores from two different scoring systems</li>
<li>Tune the balance for your use case</li>
<li>Probably implement Reciprocal Rank Fusion or something similar</li>
</ul>
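<p>A Reciprocal Rank Fusion sketch in SQL, assuming a <code>tsv</code> tsvector column alongside <code>embedding</code> (the schema, the candidate depth of 50, and the conventional k=60 constant are all illustrative; this is the DIY version of what some engines ship built in):</p>
<pre><code class="language-sql">WITH vec AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding &lt;-&gt; query_vector) AS rnk
  FROM documents
  ORDER BY embedding &lt;-&gt; query_vector
  LIMIT 50
),
txt AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, q) DESC) AS rnk
  FROM documents, plainto_tsquery('english', 'search terms') AS q
  WHERE tsv @@ q
  ORDER BY ts_rank(tsv, q) DESC
  LIMIT 50
)
-- RRF: each result scores 1/(k + rank) per list it appears in.
SELECT COALESCE(vec.id, txt.id) AS id,
       COALESCE(1.0 / (60 + vec.rnk), 0) +
       COALESCE(1.0 / (60 + txt.rnk), 0) AS rrf_score
FROM vec FULL OUTER JOIN txt ON vec.id = txt.id
ORDER BY rrf_score DESC
LIMIT 10;
</code></pre>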
<p>Again, not impossible. Just another thing that many dedicated vector databases provide out of the box.</p>
<h2 id="pgvectorscale-it-doesnt-solve-everything">pgvectorscale (it doesn&rsquo;t solve everything)</h2>
<p>Timescale has released <a href="https://github.com/timescale/pgvectorscale">pgvectorscale</a>, which addresses some of these issues. It adds:</p>
<ul>
<li>StreamingDiskANN, a new search backend that&rsquo;s more memory-efficient</li>
<li>Better support for incremental index builds</li>
<li>Improved filtering performance</li>
</ul>
<p>This is great! It&rsquo;s also an admission that pgvector out of the box isn&rsquo;t sufficient for production use cases.</p>
<p>pgvectorscale is still relatively new, and adopting it means adding another dependency, another extension, another thing to manage and upgrade. For some teams, that&rsquo;s fine. For others, it&rsquo;s just more evidence that maybe the &ldquo;keep it simple, use Postgres&rdquo; argument isn&rsquo;t as simple as it seemed.</p>
<p>Oh, and if you&rsquo;re running on RDS, pgvectorscale isn&rsquo;t available. AWS doesn&rsquo;t support it. So enjoy managing your own Postgres instance if you want these improvements, or just&hellip; keep dealing with the limitations of vanilla pgvector.</p>
<p>The &ldquo;just use Postgres&rdquo; simplicity keeps getting simpler.</p>
<h2 id="just-use-a-real-vector-database">Just use a real vector database</h2>
<p>I get the appeal of pgvector. Consolidating your stack is good. Reducing operational complexity is good. Not having to manage another database is good.</p>
<p>But here&rsquo;s what I&rsquo;ve learned: for most teams, especially small teams, dedicated vector databases are actually simpler.</p>
<h3 id="what-you-actually-get">What you actually get</h3>
<p>With a managed vector database (Pinecone, Weaviate, Turbopuffer, etc.), you typically get:</p>
<ul>
<li>Intelligent query planning for filtered searches</li>
<li>Hybrid search built in</li>
<li>Real-time indexing without memory spikes</li>
<li>Horizontal scaling without complexity</li>
<li>Monitoring and observability designed for vector workloads</li>
</ul>
<!-- IMAGE 7: pgvector vs Vector DB Comparison
Prompt: Comparison table showing pgvector vs dedicated vector databases. Two columns labeled 'pgvector' and 'Vector DBs (Pinecone, Weaviate, etc.)'. Rows for: 'Filtered search optimization' (❌ vs ✓), 'Hybrid search' (DIY vs Built-in), 'Real-time indexing' (Complex vs Seamless), 'Memory management' (Manual vs Automatic), 'Horizontal scaling' (Limited vs Native), 'Setup complexity' (Lower vs Higher). Use checkmarks, X marks, and simple icons. Clean table design with alternating row colors.
-->
<p><img loading="lazy" src="/posts/the-case-against-pgvector/img_7.jpg" type="" alt="img_7.jpg"  /></p>
<h3 id="its-probably-cheaper-than-you-think">It&rsquo;s probably cheaper than you think</h3>
<p>Yes, it&rsquo;s another service to pay for. But compare:</p>
<ul>
<li>The cost of a managed vector database for your workload</li>
<li>vs. the cost of over-provisioning your Postgres instance to handle index builds</li>
<li>vs. the engineering time to tune queries and manage index rebuilds</li>
<li>vs. the opportunity cost of not building features because you&rsquo;re fighting your database</li>
</ul>
<p>Turbopuffer starts at $64/month with generous limits.</p>
<p>For a lot of teams, the managed service is actually cheaper.</p>
<h2 id="what-i-wish-someone-had-told-me">What I wish someone had told me</h2>
<p>pgvector is an impressive piece of technology. It brings vector search to Postgres in a way that&rsquo;s technically sound and genuinely useful for many applications.</p>
<p>But it&rsquo;s not a panacea. Understand the tradeoffs.</p>
<p>If you&rsquo;re building a production vector search system:</p>
<ol>
<li>
<p><strong>Index management is hard</strong>. Rebuilds are memory-intensive, time-consuming, and disruptive. Plan for this from day one.</p>
</li>
<li>
<p><strong>Query planning matters</strong>. Filtered vector search is a different beast than traditional queries, and Postgres&rsquo;s planner wasn&rsquo;t built for this.</p>
</li>
<li>
<p><strong>Real-time indexing has costs</strong>. Either in memory, in search quality, or in engineering time to manage it.</p>
</li>
<li>
<p><strong>The blog posts are lying to you</strong> (by omission). They&rsquo;re showing you the happy path and ignoring the operational reality.</p>
</li>
<li>
<p><strong>Managed offerings exist for a reason</strong>. There&rsquo;s a reason that Pinecone, Weaviate, Qdrant, and others exist and are thriving. Vector search at scale has unique challenges that general-purpose databases weren&rsquo;t designed to handle.</p>
</li>
</ol>
<p>The question isn&rsquo;t &ldquo;should I use pgvector?&rdquo; It&rsquo;s &ldquo;am I willing to take on the operational complexity of running vector search in Postgres?&rdquo;</p>
<p>For some teams, the answer is yes. You have database expertise, you need the tight integration, you&rsquo;re willing to invest the time.</p>
<p>For many teams—maybe most teams—the answer is probably no. Use a tool designed for the job. Your future self will thank you.</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>A Production Framework for LLM Feature Evaluation</title>
      <link>https://alex-jacobs.com/posts/practicalaifeatures/</link>
      <pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/practicalaifeatures/</guid>
      <description>An empirical analysis of LLM application patterns that successfully scale in production systems, focusing on extraction, generation, and classification use cases</description>
      <content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>After several years of integrating LLMs into production systems, I&rsquo;ve observed a consistent pattern: the features that
deliver real value rarely align with what gets attention at conferences. While the industry focuses on AGI and emergent
behaviors, the mundane applications—data extraction, classification, controlled generation—are quietly transforming how
we build software.</p>
<p>This post presents a framework I&rsquo;ve developed for evaluating LLM features based on what actually ships and scales. It&rsquo;s
deliberately narrow in scope, focusing on patterns that have proven reliable across multiple deployments rather than
exploring the theoretical boundaries of what&rsquo;s possible.</p>
<h2 id="the-three-categories-that-actually-work">The Three Categories That Actually Work</h2>
<p>Through trial, error, and more error, I&rsquo;ve found that LLMs consistently excel in three specific areas. When I&rsquo;m
evaluating a potential AI feature, I ask: &ldquo;Does this clearly fit into one of these categories?&rdquo; If not, it&rsquo;s probably
not worth pursuing (yet).</p>
<h3 id="1-extracting-structured-data-from-unstructured-inputs">1. Extracting Structured Data from Unstructured Inputs</h3>
<p>This is the unsexy workhorse of AI features. Think of it as having an intelligent data entry assistant who never gets
tired of parsing messy inputs.</p>
<p><strong>What makes this valuable:</strong></p>
<ul>
<li>Humans hate data entry</li>
<li>Traditional parsing is brittle and breaks with slight format changes</li>
<li>LLMs can handle ambiguity and variations gracefully</li>
</ul>
<p><strong>Real examples I&rsquo;ve built:</strong></p>
<ul>
<li><strong>PDF to JSON converter</strong>: Taking uploaded forms (PDFs, images, even handwritten docs) and extracting structured data.
What used to require complex OCR pipelines and regex nightmares now works with a simple prompt.</li>
<li><strong>API response mapper</strong>: Taking inconsistent third-party API responses and mapping them to your internal data model.
Every integration engineer&rsquo;s nightmare—different field names, nested structures that change randomly, optional fields
that are sometimes null and sometimes missing entirely.</li>
<li><strong>Customer feedback analyzer</strong>: Extracting actionable insights from the stream of unstructured feedback across emails,
Slack, support tickets. Automatically pulling out feature requests, bug reports, severity, and sentiment. What used to
be a PM&rsquo;s full-time job.</li>
</ul>
<p>The key insight here is that LLMs excel at handling structural variance and ambiguity—the exact things that make
traditional parsers brittle. A single well-crafted prompt can replace hundreds of lines of mapping logic, regex
patterns, and edge case handling. The model&rsquo;s ability to understand intent rather than just pattern match is what makes
this category so powerful.</p>
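<p>As a sketch of what this pipeline looks like (the prompt, the <code>call_llm</code> stub, and the invoice fields are all hypothetical, not a real API): prompt for JSON, parse it, and validate the required fields before trusting the output.</p>

```python
import json

EXTRACTION_PROMPT = """Extract the vendor, total, and due_date from the
invoice text below. Respond with JSON only, using exactly those keys.

Invoice:
{text}"""

REQUIRED_KEYS = {"vendor", "total", "due_date"}

def call_llm(prompt):
    # Stubbed model response; in production this is a real model call.
    return '{"vendor": "Acme Corp", "total": 1249.50, "due_date": "2025-07-01"}'

def extract_invoice(text):
    raw = call_llm(EXTRACTION_PROMPT.format(text=text))
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        # Never trust structure the model was merely asked for; verify it.
        raise ValueError(f"model omitted fields: {missing}")
    return data

record = extract_invoice("ACME CORP ... amount due $1,249.50 by July 1, 2025")
```

<p>The single prompt replaces the mapping logic; the validation replaces blind trust. Both halves matter.</p>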
<p><strong>Production considerations:</strong> For high-volume extraction from standardized formats, purpose-built services
like <a href="https://reducto.ai/">Reducto</a> offer better economics and reliability than raw LLM calls. These platforms have
already solved for edge cases around OCR quality, table extraction, and format variations. The build-vs-buy calculation
here typically favors buying unless you have unique requirements or scale that justifies the engineering investment.</p>
<h3 id="2-content-generation-and-summarization">2. Content Generation and Summarization</h3>
<p>This is probably what most people think of when they hear &ldquo;AI features,&rdquo; but the key is being specific about the use
case.</p>
<p><strong>What makes this valuable:</strong></p>
<ul>
<li>Reduces cognitive load on users</li>
<li>Provides consistent quality and tone</li>
<li>Can process and synthesize large amounts of information quickly</li>
</ul>
<p><strong>Real examples I&rsquo;ve built:</strong></p>
<ul>
<li><strong>Smart report generation</strong>: Taking raw data and generating human-readable reports with insights and recommendations.</li>
<li><strong>Meeting summarizer</strong>: Processing transcripts to extract key decisions, action items, and important discussions.</li>
<li><strong>Documentation assistant</strong>: Generating first drafts of technical documentation from code comments and README files.</li>
</ul>
<p>The critical lesson here is that unconstrained generation is rarely what you want in production. Effective generation
features require explicit boundaries: output structure, length constraints, tone guidelines, and forbidden topics. The
challenge isn&rsquo;t getting the model to generate—it&rsquo;s getting it to generate within your specific constraints reliably.</p>
<p>This is where prompt engineering transitions from art to engineering: defining schemas, enforcing structural
requirements, and building validation layers. The most successful generation features I&rsquo;ve seen treat the LLM as one
component in a larger pipeline, not a magic box.</p>
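<p>A minimal sketch of that validation layer (the section names, word limit, and retry count are illustrative assumptions, not a prescription): generate, check the output against explicit constraints, and retry on failure.</p>

```python
def validate_summary(text, max_words=120,
                     required_sections=("Decisions", "Action items")):
    # Explicit boundaries: length constraints and required output structure.
    if len(text.split()) > max_words:
        return False, "too long"
    for section in required_sections:
        if section not in text:
            return False, f"missing section: {section}"
    return True, "ok"

def generate_with_retries(generate, validate, max_attempts=3):
    # The LLM is one component in the pipeline, not a magic box:
    # its output only ships once it passes validation.
    for _ in range(max_attempts):
        candidate = generate()
        ok, reason = validate(candidate)
        if ok:
            return candidate
    raise RuntimeError("no candidate passed validation")

# Stubbed generator standing in for a model call:
stub = lambda: "Decisions: ship the fix.\nAction items: update the runbook."
summary = generate_with_retries(stub, validate_summary)
```
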
<h3 id="3-categorization-and-classification">3. Categorization and Classification</h3>
<p>This is where LLMs really shine compared to traditional ML. What used to require thousands of labeled examples and
complex training pipelines can now be done with a well-crafted prompt.</p>
<p><strong>What makes this valuable:</strong></p>
<ul>
<li>No need for labeled training data</li>
<li>Can handle edge cases and ambiguity</li>
<li>Easy to adjust categories without retraining</li>
</ul>
<p>The architectural advantage here is profound: you&rsquo;re essentially defining classifiers declaratively rather than
imperatively. No training data, no model selection, no hyperparameter tuning—just clear descriptions of your categories.
The model&rsquo;s pre-trained understanding of language and context does the heavy lifting.</p>
<p>This fundamentally changes the iteration cycle. Adding a new category or adjusting definitions happens in minutes, not
weeks. The trade-off is less fine-grained control over the decision boundary, but for most business applications, this
is a feature, not a bug.</p>
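<p>A sketch of the declarative style (category names and descriptions are illustrative, and the model call itself is omitted): the classifier <em>is</em> the category definitions, plus a defensive parser that maps whatever the model replies back onto a known label.</p>

```python
# Categories defined declaratively as descriptions — no training data.
# Adding or adjusting a category is an edit to this dict, not a retrain.
CATEGORIES = {
    "bug_report": "The user describes broken or unexpected behavior.",
    "feature_request": "The user asks for new functionality.",
    "billing": "The user asks about invoices, charges, or refunds.",
}

def build_prompt(text):
    lines = [f"- {name}: {desc}" for name, desc in CATEGORIES.items()]
    return (
        "Classify the message into exactly one category and reply with "
        "the category name only.\n\nCategories:\n" + "\n".join(lines)
        + f"\n\nMessage: {text}"
    )

def parse_label(reply, fallback="needs_review"):
    # Defensive parsing: models sometimes add punctuation or casing.
    label = reply.strip().strip(".").lower()
    return label if label in CATEGORIES else fallback
```

<p>Anything the parser can&rsquo;t map to a known category falls through to a review queue rather than silently becoming a wrong label.</p>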
<p><strong>Scaling considerations:</strong> Production deployments require:</p>
<ul>
<li><strong>Structured output guarantees</strong>: Libraries like <a href="https://github.com/pydantic/pydantic-ai">Pydantic AI</a>
and <a href="https://github.com/outlines-dev/outlines">Outlines</a> enforce schema compliance at the token generation level,
eliminating post-processing failures.</li>
<li><strong>Prompt optimization</strong>: <a href="https://github.com/stanfordnlp/dspy">DSPy</a> and similar frameworks apply optimization
techniques to prompt engineering, treating it as a learnable parameter rather than a manual craft.</li>
<li><strong>Evals, observability, and error analysis</strong>: This deserves its own post, and will likely get one.</li>
</ul>
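<p>Libraries like Outlines constrain the output <em>during</em> token generation; a post-hoc check only approximates that guarantee, but the core idea can still be sketched in plain Python (the ticket fields and allowed values here are illustrative): coerce the parsed output into a typed schema and reject anything non-compliant.</p>

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    category: str
    severity: int

ALLOWED_CATEGORIES = {"bug_report", "feature_request", "billing"}

def parse_ticket(payload: dict) -> Ticket:
    # Coerce and validate: a schema violation fails loudly here,
    # not three services downstream.
    ticket = Ticket(category=str(payload["category"]),
                    severity=int(payload["severity"]))
    if ticket.category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {ticket.category}")
    if not 1 <= ticket.severity <= 5:
        raise ValueError(f"severity out of range: {ticket.severity}")
    return ticket

# Models often return numbers as strings; coercion handles it.
ticket = parse_ticket({"category": "bug_report", "severity": "3"})
```
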
<h2 id="the-anti-patterns-what-doesnt-work">The Anti-Patterns: What Doesn&rsquo;t Work</h2>
<p>Let me save you some pain by sharing what consistently fails:</p>
<h3 id="1-trying-to-replace-domain-expertise">1. Trying to Replace Domain Expertise</h3>
<p>LLMs are great at general knowledge but terrible at specialized domains without extensive context. If you need deep
expertise, you still need experts.</p>
<h3 id="2-real-time-high-frequency-operations">2. Real-time, High-frequency Operations</h3>
<p>Sub-100ms response times and high-frequency calls remain outside the practical envelope for LLM applications. The
latency floor of current models, even with optimizations like speculative decoding, makes them unsuitable for hot-path
operations.</p>
<h3 id="3-anything-requiring-perfect-accuracy">3. Anything Requiring Perfect Accuracy</h3>
<p>LLMs are probabilistic. If you need 100% accuracy (financial calculations, legal compliance, etc.), use traditional
code.</p>
<h2 id="a-practical-evaluation-framework">A Practical Evaluation Framework</h2>
<p>When someone comes to me with an AI feature idea, here&rsquo;s my checklist:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>Good Sign</th>
<th>Red Flag</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Does it fit one of the three categories?</strong></td>
<td>Clear fit with examples</td>
<td>&ldquo;It&rsquo;s like ChatGPT but&hellip;&rdquo;</td>
</tr>
<tr>
<td><strong>What&rsquo;s the failure mode?</strong></td>
<td>Graceful degradation</td>
<td>Catastrophic failure</td>
</tr>
<tr>
<td><strong>Can a human do it in &lt;5 minutes?</strong></td>
<td>Yes, but it&rsquo;s tedious</td>
<td>No, requires deep expertise</td>
</tr>
<tr>
<td><strong>Is accuracy critical?</strong></td>
<td>Good enough is fine</td>
<td>Must be 100% correct</td>
</tr>
<tr>
<td><strong>What&rsquo;s the response time requirement?</strong></td>
<td>Seconds are fine</td>
<td>Needs to be instant</td>
</tr>
<tr>
<td><strong>Do we have the data?</strong></td>
<td>Yes, and it&rsquo;s accessible</td>
<td>&ldquo;We&rsquo;ll figure it out&rdquo;</td>
</tr>
</tbody>
</table>
<h2 id="implementation-strategy">Implementation Strategy</h2>
<p>For teams evaluating their first LLM feature, I recommend starting with categorization. The reasoning is purely
pragmatic: it has the clearest evaluation metrics, the most forgiving failure modes, and provides immediate value. You
can validate the approach with a small dataset and scale incrementally.</p>
<p>The implementation complexity is also minimal—you&rsquo;re essentially building a discriminator rather than a generator, which
sidesteps many of the challenges around hallucination, output formatting, and content safety. Most importantly, when
classification confidence is low, you can gracefully fall back to human review without breaking the user experience.</p>
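<p>That fallback is a few lines of routing logic (the threshold and route names here are illustrative, not a recommendation): above a confidence threshold the label is applied automatically, below it the item goes to a human.</p>

```python
def route(prediction, confidence, threshold=0.8):
    # Graceful degradation: low-confidence classifications fall back to
    # human review instead of failing silently.
    if confidence >= threshold:
        return {"label": prediction, "route": "auto"}
    return {"label": prediction, "route": "human_review"}
```

<p>The threshold becomes a product dial: lower it as your evals improve, raise it when the cost of a wrong label is high.</p>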
<h2 id="the-reality-of-production-ai">The Reality of Production AI</h2>
<p>The gap between AI demos and production systems remains vast. The features that succeed in production share a common
trait: they augment existing workflows rather than attempting to replace them entirely. They handle the tedious,
error-prone tasks that humans perform inconsistently, freeing cognitive capacity for higher-value work.</p>
<p>This isn&rsquo;t a limitation—it&rsquo;s the current sweet spot for LLM applications. The technology excels at tasks that are
simultaneously too complex for traditional automation but too mundane to justify human attention. Understanding this
paradox is key to building AI features that actually ship.</p>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
