<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LLM on Alex Jacobs</title>
    <link>https://alex-jacobs.com/tags/llm/</link>
    <description>Recent content in LLM on Alex Jacobs</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 04 Jan 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://alex-jacobs.com/tags/llm/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Beating BERT? Small LLMs vs Fine-Tuned Encoders for Classification</title>
      <link>https://alex-jacobs.com/posts/beatingbert/</link>
      <pubDate>Sun, 04 Jan 2026 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/beatingbert/</guid>
      <description>I ran 32 experiments comparing small LLMs to BERT on classification tasks. Turns out 2018-era BERT is still really good at what it does.</description>
      <content:encoded><![CDATA[<p>&ldquo;Just use an LLM.&rdquo;</p>
<p>That was my advice to a colleague recently when they asked about a classification problem. Who fine-tunes BERT anymore? Haven&rsquo;t decoder models eaten the entire NLP landscape?</p>
<p>The look I got back was&hellip; skeptical. And it stuck with me.</p>
<p>I&rsquo;ve been deep in LLM-land for a few years now. When your daily driver can architect systems, write production code, and reason through problems better than most junior devs, you start reaching for it reflexively. Maybe my traditional ML instincts had atrophied.</p>
<p>So I decided to actually test my assumptions instead of just vibing on them.</p>
<p>I ran 32 experiments pitting small instruction-tuned LLMs against good old BERT and DeBERTa. I figured I&rsquo;d just be confirming what I already believed, that these new decoder models would obviously crush the ancient encoders.</p>
<p>I was wrong.</p>
<p>The results across Gemma 2B, Qwen 0.5B/1.5B, BERT-base, and DeBERTa-v3 were&hellip; not what I expected. If you&rsquo;re trying to decide between these approaches for classification, you might want to actually measure things instead of assuming the newer model is better.</p>
<p><img loading="lazy" src="/posts/beatingbert/scorecard.png" type="" alt="TL;DR: DeBERTa wins 3/4 tasks, LLM wins on adversarial NLI, but LLMs need zero training"  /></p>
<p>All the code is <a href="https://github.com/alexjacobs08/beatingBERT">on GitHub</a> if you want to run your own experiments.</p>
<h2 id="experiment-setup">Experiment Setup</h2>
<h3 id="what-i-tested">What I Tested</h3>
<p><strong>BERT Family (Fine-tuned)</strong></p>
<ul>
<li>BERT-base-uncased (110M parameters)</li>
<li>DeBERTa-v3-base (184M parameters)</li>
</ul>
<p><strong>Small LLMs</strong></p>
<ul>
<li>Qwen2-0.5B-Instruct</li>
<li>Qwen2.5-1.5B-Instruct</li>
<li>Gemma-2-2B-it</li>
</ul>
<p>For the LLMs, I tried two approaches:</p>
<ol>
<li><strong>Zero-shot</strong> - Just prompt engineering, no training</li>
<li><strong>Few-shot (k=5)</strong> - Include 5 examples in the prompt</li>
</ol>
<h3 id="tasks">Tasks</h3>
<p>Four classification benchmarks ranging from easy sentiment to adversarial NLI:</p>
<table>
<thead>
<tr>
<th>Task</th>
<th>Type</th>
<th>Labels</th>
<th>Difficulty</th>
</tr>
</thead>
<tbody>
<tr>
<td>SST-2</td>
<td>Sentiment</td>
<td>2</td>
<td>Easy</td>
</tr>
<tr>
<td>RTE</td>
<td>Textual Entailment</td>
<td>2</td>
<td>Medium</td>
</tr>
<tr>
<td>BoolQ</td>
<td>Yes/No QA</td>
<td>2</td>
<td>Medium</td>
</tr>
<tr>
<td>ANLI (R1)</td>
<td>Adversarial NLI</td>
<td>3</td>
<td>Hard</td>
</tr>
</tbody>
</table>
<h3 id="methodology">Methodology</h3>
<p>For anyone who wants to reproduce this or understand what &ldquo;fine-tuned&rdquo; and &ldquo;zero-shot&rdquo; actually mean here:</p>
<p><strong>BERT/DeBERTa Fine-tuning:</strong></p>
<ul>
<li>Standard HuggingFace Trainer with AdamW optimizer</li>
<li>Learning rate: 2e-5, batch size: 32, epochs: 3</li>
<li>Max sequence length: 128 tokens</li>
<li>Evaluation on validation split (GLUE test sets don&rsquo;t have public labels)</li>
</ul>
<p><strong>LLM Zero-shot:</strong></p>
<ul>
<li>Greedy decoding (temperature=0.0) for deterministic outputs</li>
<li>Task-specific prompts asking for single-word classification labels</li>
<li>No examples in context—just instructions and the input text</li>
</ul>
<p><strong>LLM Few-shot (k=5):</strong></p>
<ul>
<li>Same as zero-shot, but with 5 labeled examples prepended to each prompt</li>
<li>Examples randomly sampled from training set (stratified by class)</li>
</ul>
<p>All experiments used a fixed random seed (99) for reproducibility. Evaluation metrics are accuracy on the validation split. Hardware: RunPod instance with RTX A4500 (20GB VRAM), 20GB RAM, 5 vCPU.</p>
<p><img loading="lazy" src="/posts/beatingbert/nvitop.png" type="" alt="nvitop running on RunPod"  />
<em>I&rsquo;d forgotten how pretty text-only land can be. When you spend most of your time in IDEs and notebooks, SSH-ing into a headless GPU box and watching nvitop do its thing feels almost meditative.</em></p>
<h2 id="results">Results</h2>
<p>Let&rsquo;s dive into what actually happened:</p>


<style>
.results-table { width: 100%; border-collapse: collapse; margin: 1.5rem 0; font-size: 0.9rem; }
.results-table th, .results-table td { padding: 0.5rem 0.75rem; text-align: left; border-bottom: 1px solid var(--border); }
.results-table th { color: var(--secondary); font-weight: 500; }
.results-table td { color: var(--content); }
.winner { background: rgba(87, 62, 170, 0.15); font-weight: 600; border-radius: 4px; padding: 0.2rem 0.4rem; }
</style>
<table class="results-table">
<thead>
<tr><th>Model</th><th>Method</th><th>SST-2</th><th>RTE</th><th>BoolQ</th><th>ANLI</th></tr>
</thead>
<tbody>
<tr><td><strong>DeBERTa-v3</strong></td><td>Fine-tuned</td><td><span class="winner">94.8%</span></td><td><span class="winner">80.9%</span></td><td><span class="winner">82.6%</span></td><td>47.4%</td></tr>
<tr><td><strong>BERT-base</strong></td><td>Fine-tuned</td><td>91.5%</td><td>61.0%</td><td>71.5%</td><td>35.3%</td></tr>
<tr><td>Qwen2.5-1.5B</td><td>Zero-shot</td><td>93.8%</td><td>78.7%</td><td>74.6%</td><td>40.8%</td></tr>
<tr><td>Qwen2.5-1.5B</td><td>Few-shot</td><td>89.0%</td><td>53.4%</td><td>73.6%</td><td>45.0%</td></tr>
<tr><td>Gemma-2-2B</td><td>Zero-shot</td><td>90.0%</td><td>61.4%</td><td>80.9%</td><td>36.1%</td></tr>
<tr><td>Gemma-2-2B</td><td>Few-shot</td><td>86.5%</td><td>73.6%</td><td>81.5%</td><td><span class="winner">47.8%</span></td></tr>
<tr><td>Qwen2-0.5B</td><td>Zero-shot</td><td>87.6%</td><td>53.1%</td><td>61.8%</td><td>33.2%</td></tr>
</tbody>
</table>

<p><img loading="lazy" src="/posts/beatingbert/accuracy_comparison.png" type="" alt="Model Accuracy Comparison"  /></p>
<p><strong>DeBERTa-v3 wins most tasks—but not all</strong></p>
<p>DeBERTa hit 94.8% on SST-2, 80.9% on RTE, and 82.6% on BoolQ. For standard classification with decent training data, the fine-tuned encoders still dominate.</p>
<p>On ANLI—the hardest benchmark, specifically designed to fool models—Gemma few-shot actually beats DeBERTa (47.8% vs 47.4%). It&rsquo;s a narrow win, but it&rsquo;s a win on the task that matters most for robustness.</p>
<p><strong>Zero-shot LLMs actually beat BERT-base</strong></p>
<p>The LLMs aren&rsquo;t losing to BERT—they&rsquo;re losing to DeBERTa. Qwen2.5-1.5B zero-shot hit 93.8% on SST-2, beating BERT-base&rsquo;s 91.5%. Same story on RTE (78.7% vs 61.0%) and BoolQ (Gemma&rsquo;s 80.9% vs BERT&rsquo;s 71.5%). For models running purely on prompts with zero training? I&rsquo;m calling it a win.</p>
<p><strong>Few-shot is a mixed bag</strong></p>
<p>Adding examples to the prompt doesn&rsquo;t always help.</p>
<p>On RTE, Qwen2.5-1.5B went from 78.7% zero-shot down to 53.4% with few-shot. On SST-2, it dropped from 93.8% to 89.0%. But on ANLI, few-shot helped significantly—Gemma jumped from 36.1% to 47.8%, enough to beat DeBERTa.</p>
<p>Few-shot helps on harder tasks where examples demonstrate the thought process, but can confuse models on simpler pattern matching tasks where they already &ldquo;get it.&rdquo; Sometimes examples add noise instead of signal.</p>
<h2 id="bert-goes-brrrr">BERT Goes Brrrr</h2>
<p>Okay, so the accuracy gap isn&rsquo;t huge. Maybe I could still justify using an LLM?</p>
<p>Then I looked at throughput:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Method</th>
<th>Throughput (samples/s)</th>
<th>Latency (ms/sample)</th>
</tr>
</thead>
<tbody>
<tr>
<td>BERT-base</td>
<td>Fine-tuned</td>
<td><strong>277</strong></td>
<td>3.6</td>
</tr>
<tr>
<td>DeBERTa-v3</td>
<td>Fine-tuned</td>
<td><strong>232</strong></td>
<td>4.3</td>
</tr>
<tr>
<td>Qwen2-0.5B</td>
<td>Zero-shot</td>
<td>17.5</td>
<td>57</td>
</tr>
<tr>
<td>Qwen2.5-1.5B</td>
<td>Zero-shot</td>
<td>12.3</td>
<td>81</td>
</tr>
<tr>
<td>Gemma-2-2B</td>
<td>Zero-shot</td>
<td>11.6</td>
<td>86</td>
</tr>
</tbody>
</table>
<p><img loading="lazy" src="/posts/beatingbert/accuracy_vs_latency.png" type="" alt="Accuracy vs Latency"  /></p>
<p><strong>BERT is ~20x faster.</strong></p>
<p>BERT processes 277 samples per second. Gemma-2-2B manages 12. If you&rsquo;re classifying a million documents, that&rsquo;s one hour vs a full day.</p>
<p>Encoders process the whole sequence in one forward pass. Decoders generate tokens autoregressively, even just to output &ldquo;positive&rdquo; or &ldquo;negative&rdquo;.</p>
<blockquote>
<p><strong>Note on LLM latency:</strong> These numbers use <code>max_length=256</code> for tokenization. When I bumped it to <code>max_length=2048</code>, latency jumped 8x—from 57ms to 445ms per sample for Qwen-0.5B. Context window scales roughly linearly with inference time. For short classification tasks, keep it short or make it dynamic.</p>
</blockquote>
<h3 id="try-it-yourself">Try It Yourself</h3>
<p>These models struggled on nuanced reviews. Can you do better? Try classifying some of the trickiest examples from my experiments:</p>


<style>
.classifier-demo {
  max-width: 100%;
  margin: 1.5rem 0;
  padding: 1.5rem;
  border: 1px solid var(--border);
  border-radius: var(--radius);
  background: var(--code-bg);
}
.demo-header {
  text-align: center;
  margin-bottom: 1rem;
}
.demo-title {
  margin: 0 0 0.25rem 0;
  font-size: 1.1rem;
  color: var(--primary);
}
.demo-subtitle {
  margin: 0 0 0.75rem 0;
  font-size: 0.85rem;
  color: var(--secondary);
}
.demo-score {
  display: flex;
  justify-content: center;
  gap: 2rem;
  font-size: 0.9rem;
  color: var(--secondary);
}
.demo-score strong {
  color: var(--primary);
}
.review-box {
  background: var(--entry);
  padding: 1rem 1.25rem;
  border-radius: var(--radius);
  margin: 1rem 0;
  font-style: italic;
  line-height: 1.6;
  border-left: 3px solid #573eaa;
  color: var(--content);
}
.btn-group {
  display: flex;
  gap: 0.75rem;
  justify-content: center;
  margin: 1rem 0;
}
.demo-btn {
  padding: 0.6rem 1.5rem;
  font-size: 0.9rem;
  border: none;
  border-radius: var(--radius);
  cursor: pointer;
  transition: all 0.2s;
  font-weight: 500;
}
.demo-btn:hover:not(:disabled) { opacity: 0.9; }
.demo-btn:disabled { opacity: 0.5; cursor: not-allowed; }
.btn-positive { background: #27ae60; color: white; }
.btn-negative { background: #c0392b; color: white; }
.result-box {
  margin-top: 1rem;
  padding: 1rem;
  border-radius: var(--radius);
  display: none;
}
.result-box.show { display: block; }
.result-correct { background: rgba(39, 174, 96, 0.15); border: 1px solid rgba(39, 174, 96, 0.3); }
.result-wrong { background: rgba(192, 57, 43, 0.15); border: 1px solid rgba(192, 57, 43, 0.3); }
.model-results {
  margin-top: 0.75rem;
  font-size: 0.8rem;
  color: var(--secondary);
}
.model-row {
  display: flex;
  justify-content: space-between;
  padding: 0.2rem 0;
  border-bottom: 1px solid var(--border);
}
.model-row:last-child { border-bottom: none; }
.model-correct { color: #27ae60; }
.model-wrong { color: #c0392b; }
.next-btn {
  display: block;
  margin: 1rem auto 0;
  padding: 0.5rem 1.25rem;
  background: #573eaa;
  color: white;
  border: none;
  border-radius: var(--radius);
  cursor: pointer;
  font-size: 0.85rem;
}
.next-btn:hover { background: #6549c0; }
.progress-bar {
  height: 3px;
  background: var(--border);
  border-radius: 2px;
  margin-bottom: 1rem;
}
.progress-fill {
  height: 100%;
  background: #573eaa;
  border-radius: 2px;
  transition: width 0.3s;
}
.demo-complete {
  text-align: center;
  padding: 1rem;
}
.final-score {
  font-size: 1.25rem;
  font-weight: 600;
  margin: 0.5rem 0;
  color: var(--primary);
}
#completeSummary {
  color: var(--secondary);
}
</style>

<div class="classifier-demo" id="classifierDemo">
  <div class="demo-header">
    <div class="demo-title">Can You Beat the Models?</div>
    <p class="demo-subtitle">Classify these tricky movie reviews</p>
    <div class="demo-score">
      <span>You: <strong id="userScore">0</strong>/<span id="totalAnswered">0</span></span>
      <span>Models: <strong id="modelScore">0</strong>/<span id="totalAnswered2">0</span></span>
    </div>
  </div>
  <div class="progress-bar">
    <div class="progress-fill" id="progressFill" style="width: 0%"></div>
  </div>
  <div id="questionArea">
    <div class="review-box" id="reviewText"></div>
    <div class="btn-group">
      <button class="demo-btn btn-negative" onclick="submitAnswer('negative')">Negative</button>
      <button class="demo-btn btn-positive" onclick="submitAnswer('positive')">Positive</button>
    </div>
    <div class="result-box" id="resultBox">
      <div id="resultText"></div>
      <div class="model-results" id="modelResults"></div>
      <button class="next-btn" onclick="nextQuestion()">Next Review →</button>
    </div>
  </div>
  <div id="completeArea" style="display:none;" class="demo-complete">
    <div class="demo-title">Challenge Complete!</div>
    <div class="final-score">You: <span id="finalUserScore"></span> | Models: <span id="finalModelScore"></span></div>
    <p id="completeSummary"></p>
    <button class="next-btn" onclick="restartDemo()">Play Again</button>
  </div>
</div>

<script>
const demoExamples = [
  {text: "hilariously inept and ridiculous.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "negative"}},
  {text: "all that's missing is the spontaneity, originality and delight.", true_label: "negative", predictions: {"Gemma": "positive", "Qwen": "positive"}},
  {text: "reign of fire looks as if it was made without much thought -- and is best watched that way.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "negative"}},
  {text: "we root for (clara and paul), even like them, though perhaps it's an emotion closer to pity.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "negative"}},
  {text: "a solid film... but more conscientious than it is truly stirring.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "positive"}},
  {text: "this riveting world war ii moral suspense story deals with the shadow side of american culture: racial prejudice in its ugly and diverse forms.", true_label: "negative", predictions: {"Gemma": "positive", "Qwen": "positive"}}
];
let currentIndex = 0, userCorrect = 0, modelCorrect = 0, totalAnswered = 0;
function initDemo() {
  currentIndex = 0; userCorrect = 0; modelCorrect = 0; totalAnswered = 0;
  document.getElementById('completeArea').style.display = 'none';
  document.getElementById('questionArea').style.display = 'block';
  updateScores(); showQuestion();
}
function showQuestion() {
  const ex = demoExamples[currentIndex];
  document.getElementById('reviewText').textContent = '"' + ex.text + '"';
  document.getElementById('resultBox').classList.remove('show');
  document.querySelectorAll('.demo-btn').forEach(b => b.disabled = false);
  document.getElementById('progressFill').style.width = ((currentIndex / demoExamples.length) * 100) + '%';
}
function submitAnswer(answer) {
  const ex = demoExamples[currentIndex];
  const correct = answer === ex.true_label;
  totalAnswered++;
  if (correct) userCorrect++;
  let mc = 0;
  Object.values(ex.predictions).forEach(p => { if (p === ex.true_label) mc++; });
  modelCorrect += mc / Object.keys(ex.predictions).length;
  const resultBox = document.getElementById('resultBox');
  resultBox.className = 'result-box show ' + (correct ? 'result-correct' : 'result-wrong');
  document.getElementById('resultText').innerHTML = correct
    ? '<strong>Correct!</strong> This review is ' + ex.true_label + '.'
    : '<strong>Tricky!</strong> This review is actually <em>' + ex.true_label + '</em>.';
  let modelHtml = '<strong>Model predictions:</strong>';
  for (const [model, pred] of Object.entries(ex.predictions)) {
    const isCorrect = pred === ex.true_label;
    modelHtml += '<div class="model-row"><span>' + model + '</span><span class="' + (isCorrect ? 'model-correct' : 'model-wrong') + '">' + pred + ' ' + (isCorrect ? '✓' : '✗') + '</span></div>';
  }
  document.getElementById('modelResults').innerHTML = modelHtml;
  document.querySelectorAll('.demo-btn').forEach(b => b.disabled = true);
  updateScores();
}
function updateScores() {
  document.getElementById('userScore').textContent = userCorrect;
  document.getElementById('modelScore').textContent = modelCorrect.toFixed(1);
  document.getElementById('totalAnswered').textContent = totalAnswered;
  document.getElementById('totalAnswered2').textContent = totalAnswered;
}
function nextQuestion() {
  currentIndex++;
  if (currentIndex >= demoExamples.length) showComplete();
  else showQuestion();
}
function showComplete() {
  document.getElementById('questionArea').style.display = 'none';
  document.getElementById('completeArea').style.display = 'block';
  document.getElementById('finalUserScore').textContent = userCorrect + '/' + demoExamples.length;
  document.getElementById('finalModelScore').textContent = modelCorrect.toFixed(1) + '/' + demoExamples.length;
  const diff = userCorrect - modelCorrect;
  let msg = diff > 1 ? "You crushed the AI! Human intuition wins." : diff > 0 ? "You edged out the models!" : diff === 0 ? "Dead heat with AI." : "The models got you this time. These are genuinely tricky!";
  document.getElementById('completeSummary').textContent = msg;
}
function restartDemo() { initDemo(); }
document.addEventListener('DOMContentLoaded', initDemo);
if (document.readyState !== 'loading') initDemo();
</script>

<h2 id="when-llms-make-sense">When LLMs Make Sense</h2>
<p>Despite the efficiency gap, there are cases where small LLMs are the right choice:</p>
<p><strong>Zero Training Data</strong></p>
<p>If you have no labeled data, LLMs win by default. Zero-shot Qwen2.5-1.5B at 93.8% on SST-2 is production-ready without a single training example. You can&rsquo;t fine-tune BERT with zero examples.</p>
<p><strong>Rapidly Changing Categories</strong></p>
<p>If your categories change frequently (new product types, emerging topics), re-prompting an LLM takes seconds. Re-training BERT requires new labeled data, training time, validation, deployment. The iteration cycle matters.</p>
<p><strong>Explanations with Predictions</strong></p>
<p>LLMs can provide reasoning: &ldquo;This review is negative because the customer mentions &lsquo;defective product&rsquo; and &lsquo;waste of money.&rsquo;&rdquo; BERT gives you a probability. Sometimes you need the story, not just the number.</p>
<p><strong>Low Volume</strong></p>
<p>If you&rsquo;re processing 100 support tickets a day, throughput doesn&rsquo;t matter. The 20x speed difference is irrelevant when you&rsquo;re not hitting any resource constraints.</p>
<h2 id="when-bert-still-wins">When BERT Still Wins</h2>
<p><strong>High-Volume Production Systems</strong></p>
<p>If you&rsquo;re classifying millions of items daily, BERT&rsquo;s 20x throughput advantage matters. That&rsquo;s a job finishing in an hour vs. running all day.</p>
<p><strong>Well-Defined, Stable Tasks</strong></p>
<p>Sentiment analysis. Spam detection. Topic classification. If your task definition hasn&rsquo;t changed since 2019, fine-tuned BERT is proven and stable. No need to fix what isn&rsquo;t broken.</p>
<p><strong>You Have Training Data</strong></p>
<p>With a few thousand labeled examples, fine-tuned DeBERTa will beat small LLMs. It&rsquo;s a dedicated specialist vs. a generalist. Specialization still works.</p>
<p><strong>Latency Matters</strong></p>
<p>Real-time classification in a user-facing app where every millisecond counts? BERT&rsquo;s parallel processing wins. LLMs can&rsquo;t compete on speed.</p>
<h2 id="limitations">Limitations</h2>
<p>Before you @ me on Twitter—yes, I know this isn&rsquo;t the final word. Some caveats:</p>
<p><strong>I only tested small LLMs.</strong> Kept everything under 2B parameters to fit comfortably on a 20GB GPU. Bigger models like Llama-3-8B or Qwen-7B would probably do better, but then the efficiency comparison becomes even more lopsided. You&rsquo;re not beating BERT&rsquo;s throughput with a 7B model.</p>
<p><strong>Generic prompts.</strong> I used straightforward prompts without heavy optimization. Task-specific prompt engineering could boost LLM performance. DSPy-style optimization would probably help too—but that&rsquo;s another blog post.</p>
<p><strong>Four benchmarks isn&rsquo;t everything.</strong> There are plenty of classification scenarios I didn&rsquo;t test. Your domain might be different. Measure, don&rsquo;t assume.</p>
<h2 id="conclusion">Conclusion</h2>
<p>So, can small LLMs beat BERT at classification?</p>
<p>Sometimes, and on the hardest task, they actually do. Gemma few-shot edges out DeBERTa on adversarial NLI, the benchmark specifically designed to break models.</p>
<p>DeBERTa-v3 still wins 3 out of 4 tasks when you have training data. And BERT&rsquo;s efficiency advantage is real—~20x faster throughput matters when you&rsquo;re processing millions of documents and paying for compute.</p>
<p>Zero-shot LLMs aren&rsquo;t just a parlor trick either. Qwen2.5-1.5B hits 93.8% on sentiment with zero training examples—that&rsquo;s production-ready without a single label. For cold-start problems, rapidly changing domains, or when you need explanations alongside predictions, they genuinely work.</p>
<p>Hopefully this gives some actual data points for making that call instead of just following the hype cycle.</p>
<p>All the code is <a href="https://github.com/alexjacobs08/beatingBERT">on GitHub</a>. Go run your own experiments.</p>
<hr>
<p><em>Surely I&rsquo;ve made some embarrassing mistakes here. Don&rsquo;t just tell me—tell everyone! Share this post on your favorite social media with your corrections :)</em></p>
]]></content:encoded>
    </item>
    
    <item>
      <title>A Production Framework for LLM Feature Evaluation</title>
      <link>https://alex-jacobs.com/posts/practicalaifeatures/</link>
      <pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/practicalaifeatures/</guid>
      <description>An empirical analysis of LLM application patterns that successfully scale in production systems, focusing on extraction, generation, and classification use cases</description>
      <content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>After several years of integrating LLMs into production systems, I&rsquo;ve observed a consistent pattern: the features that
deliver real value rarely align with what gets attention at conferences. While the industry focuses on AGI and emergent
behaviors, the mundane applications—data extraction, classification, controlled generation—are quietly transforming how
we build software.</p>
<p>This post presents a framework I&rsquo;ve developed for evaluating LLM features based on what actually ships and scales. It&rsquo;s
deliberately narrow in scope, focusing on patterns that have proven reliable across multiple deployments rather than
exploring the theoretical boundaries of what&rsquo;s possible.</p>
<h2 id="the-three-categories-that-actually-work">The Three Categories That Actually Work</h2>
<p>Through trial, error, and more error, I&rsquo;ve found that LLMs consistently excel in three specific areas. When I&rsquo;m
evaluating a potential AI feature, I ask: &ldquo;Does this clearly fit into one of these categories?&rdquo; If not, it&rsquo;s probably
not worth pursuing (yet).</p>
<h3 id="1-extracting-structured-data-from-unstructured-inputs">1. Extracting Structured Data from Unstructured Inputs</h3>
<p>This is the unsexy workhorse of AI features. Think of it as having an intelligent data entry assistant who never gets
tired of parsing messy inputs.</p>
<p><strong>What makes this valuable:</strong></p>
<ul>
<li>Humans hate data entry</li>
<li>Traditional parsing is brittle and breaks with slight format changes</li>
<li>LLMs can handle ambiguity and variations gracefully</li>
</ul>
<p><strong>Real examples I&rsquo;ve built:</strong></p>
<ul>
<li><strong>PDF to JSON converter</strong>: Taking uploaded forms (PDFs, images, even handwritten docs) and extracting structured data.
What used to require complex OCR pipelines and regex nightmares now works with a simple prompt.</li>
<li><strong>API response mapper</strong>: Taking inconsistent third-party API responses and mapping them to your internal data model.
Every integration engineer&rsquo;s nightmare—different field names, nested structures that change randomly, optional fields
that are sometimes null and sometimes missing entirely.</li>
<li><strong>Customer feedback analyzer</strong>: Extracting actionable insights from the stream of unstructured feedback across emails,
Slack, support tickets. Automatically pulling out feature requests, bug reports, severity, and sentiment. What used to
be a PM&rsquo;s full-time job.</li>
</ul>
<p>The key insight here is that LLMs excel at handling structural variance and ambiguity—the exact things that make
traditional parsers brittle. A single well-crafted prompt can replace hundreds of lines of mapping logic, regex
patterns, and edge case handling. The model&rsquo;s ability to understand intent rather than just pattern match is what makes
this category so powerful.</p>
<p><strong>Production considerations:</strong> For high-volume extraction from standardized formats, purpose-built services
like <a href="https://reducto.ai/">Reducto</a> offer better economics and reliability than raw LLM calls. These platforms have
already solved for edge cases around OCR quality, table extraction, and format variations. The build-vs-buy calculation
here typically favors buying unless you have unique requirements or scale that justifies the engineering investment.</p>
<h3 id="2-content-generation-and-summarization">2. Content Generation and Summarization</h3>
<p>This is probably what most people think of when they hear &ldquo;AI features,&rdquo; but the key is being specific about the use
case.</p>
<p><strong>What makes this valuable:</strong></p>
<ul>
<li>Reduces cognitive load on users</li>
<li>Provides consistent quality and tone</li>
<li>Can process and synthesize large amounts of information quickly</li>
</ul>
<p><strong>Real examples I&rsquo;ve built:</strong></p>
<ul>
<li><strong>Smart report generation</strong>: Taking raw data and generating human-readable reports with insights and recommendations.</li>
<li><strong>Meeting summarizer</strong>: Processing transcripts to extract key decisions, action items, and important discussions.</li>
<li><strong>Documentation assistant</strong>: Generating first drafts of technical documentation from code comments and README files.</li>
</ul>
<p>The critical lesson here is that unconstrained generation is rarely what you want in production. Effective generation
features require explicit boundaries: output structure, length constraints, tone guidelines, and forbidden topics. The
challenge isn&rsquo;t getting the model to generate—it&rsquo;s getting it to generate within your specific constraints reliably.</p>
<p>This is where prompt engineering transitions from art to engineering: defining schemas, enforcing structural
requirements, and building validation layers. The most successful generation features I&rsquo;ve seen treat the LLM as one
component in a larger pipeline, not a magic box.</p>
<h3 id="3-categorization-and-classification">3. Categorization and Classification</h3>
<p>This is where LLMs really shine compared to traditional ML. What used to require thousands of labeled examples and
complex training pipelines can now be done with a well-crafted prompt.</p>
<p><strong>What makes this valuable:</strong></p>
<ul>
<li>No need for labeled training data</li>
<li>Can handle edge cases and ambiguity</li>
<li>Easy to adjust categories without retraining</li>
</ul>
<p>The architectural advantage here is profound: you&rsquo;re essentially defining classifiers declaratively rather than
imperatively. No training data, no model selection, no hyperparameter tuning—just clear descriptions of your categories.
The model&rsquo;s pre-trained understanding of language and context does the heavy lifting.</p>
<p>This fundamentally changes the iteration cycle. Adding a new category or adjusting definitions happens in minutes, not
weeks. The trade-off is less fine-grained control over the decision boundary, but for most business applications, this
is a feature, not a bug.</p>
<p><strong>Scaling considerations:</strong> Production deployments require:</p>
<ul>
<li><strong>Structured output guarantees</strong>: Libraries like <a href="https://github.com/pydantic/pydantic-ai">Pydantic AI</a>
and <a href="https://github.com/outlines-dev/outlines">Outlines</a> enforce schema compliance at the token generation level,
eliminating post-processing failures.</li>
<li><strong>Prompt optimization</strong>: <a href="https://github.com/stanfordnlp/dspy">DSPy</a> and similar frameworks apply optimization
techniques to prompt engineering, treating it as a learnable parameter rather than a manual craft.</li>
<li><strong>Evals, Observability, and Error Analysis</strong>: This could and will likely eventually be its own post</li>
</ul>
<h2 id="the-anti-patterns-what-doesnt-work">The Anti-Patterns: What Doesn&rsquo;t Work</h2>
<p>Let me save you some pain by sharing what consistently fails:</p>
<h3 id="1-trying-to-replace-domain-expertise">1. Trying to Replace Domain Expertise</h3>
<p>LLMs are great at general knowledge but terrible at specialized domains without extensive context. If you need deep
expertise, you still need experts.</p>
<h3 id="2-real-time-high-frequency-operations">2. Real-time, High-frequency Operations</h3>
<p>Sub-100ms response times and high-frequency calls remain outside the practical envelope for LLM applications. The
latency floor of current models, even with optimizations like speculative decoding, makes them unsuitable for hot-path
operations.</p>
<h3 id="3-anything-requiring-perfect-accuracy">3. Anything Requiring Perfect Accuracy</h3>
<p>LLMs are probabilistic. If you need 100% accuracy (financial calculations, legal compliance, etc.), use traditional
code.</p>
<h2 id="a-practical-evaluation-framework">A Practical Evaluation Framework</h2>
<p>When someone comes to me with an AI feature idea, here&rsquo;s my checklist:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>Good Sign</th>
<th>Red Flag</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Does it fit one of the three categories?</strong></td>
<td>Clear fit with examples</td>
<td>&ldquo;It&rsquo;s like ChatGPT but&hellip;&rdquo;</td>
</tr>
<tr>
<td><strong>What&rsquo;s the failure mode?</strong></td>
<td>Graceful degradation</td>
<td>Catastrophic failure</td>
</tr>
<tr>
<td><strong>Can a human do it in &lt;5 minutes?</strong></td>
<td>Yes, but it&rsquo;s tedious</td>
<td>No, requires deep expertise</td>
</tr>
<tr>
<td><strong>Is accuracy critical?</strong></td>
<td>Good enough is fine</td>
<td>Must be 100% correct</td>
</tr>
<tr>
<td><strong>What&rsquo;s the response time requirement?</strong></td>
<td>Seconds are fine</td>
<td>Needs to be instant</td>
</tr>
<tr>
<td><strong>Do we have the data?</strong></td>
<td>Yes, and it&rsquo;s accessible</td>
<td>&ldquo;We&rsquo;ll figure it out&rdquo;</td>
</tr>
</tbody>
</table>
<h2 id="implementation-strategy">Implementation Strategy</h2>
<p>For teams evaluating their first LLM feature, I recommend starting with categorization. The reasoning is purely
pragmatic: it has the clearest evaluation metrics, the most forgiving failure modes, and provides immediate value. You
can validate the approach with a small dataset and scale incrementally.</p>
<p>The implementation complexity is also minimal—you&rsquo;re essentially building a discriminator rather than a generator, which
sidesteps many of the challenges around hallucination, output formatting, and content safety. Most importantly, when
classification confidence is low, you can gracefully fall back to human review without breaking the user experience.</p>
<h2 id="the-reality-of-production-ai">The Reality of Production AI</h2>
<p>The gap between AI demos and production systems remains vast. The features that succeed in production share a common
trait: they augment existing workflows rather than attempting to replace them entirely. They handle the tedious,
error-prone tasks that humans perform inconsistently, freeing cognitive capacity for higher-value work.</p>
<p>This isn&rsquo;t a limitation—it&rsquo;s the current sweet spot for LLM applications. The technology excels at tasks that are
simultaneously too complex for traditional automation but too mundane to justify human attention. Understanding this
paradox is key to building AI features that actually ship.</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>RAG: From Context Injection to Knowledge Integration</title>
      <link>https://alex-jacobs.com/posts/rag/</link>
      <pubDate>Mon, 17 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/rag/</guid>
      <description>A technical dive into the limitations of current RAG approaches, examining architectural challenges and exploring pathways to more integrated knowledge-aware LLM architectures.</description>
      <content:encoded><![CDATA[<h1 id="retrieval-augmented-generation-architectural-limitations-and-future-directions">Retrieval-Augmented Generation: Architectural Limitations and Future Directions</h1>
<p>Retrieval-Augmented Generation (RAG) has rapidly become a cornerstone in the practical application of Large Language Models (LLMs). Its promise is compelling: to expand LLMs beyond their training data by connecting them to external knowledge sources – from enterprise databases and real-time data streams to proprietary knowledge bases. The allure of RAG lies in its apparent simplicity – augment the LLM&rsquo;s input context with retrieved information, and witness enhanced output quality. However, beneath this layer of simplicity lies a more complex reality&ndash;its a bit of a hack. RAG only works because LLMs are generally robust. The more you think on it, the more it becomes clear it <em>shouldn&rsquo;t</em> really work, and should serve only as a stepping stone to a new paradigm.</p>
<h2 id="generation-vs-retrieval">Generation vs. Retrieval</h2>
<p>At their core, LLMs are generative models that produce text by navigating through a high-dimensional latent space. During pre-training on large datasets, these models learn to map language into this space, capturing relationships between words, phrases, and concepts. Text generation isn&rsquo;t a simple lookup process - it&rsquo;s a sequential operation where the model predicts each token based on both the previous context and its learned representations.</p>
<p>RAG changes this core process significantly. Rather than relying only on the model&rsquo;s learned representations, RAG injects external information directly into the context window alongside the user&rsquo;s query. While this works well in practice, it raises important questions about the theoretical and architectural implications:</p>
<ol>
<li>
<p><strong>Impact on Generation Quality:</strong> How does inserting external information affect the model&rsquo;s learned generation process? Does mixing training-derived and retrieved information create inconsistencies in the model&rsquo;s outputs?</p>
</li>
<li>
<p><strong>Information Integration:</strong> Can the model effectively combine information from different sources during generation? Or is it simply stitching together pieces without truly understanding how they relate?</p>
</li>
<li>
<p><strong>Architectural Fitness:</strong> Are transformer architectures and their training objectives actually suited for combining retrieved information with generation? Or are we forcing an approach that doesn&rsquo;t align with how these models were designed to work?</p>
</li>
</ol>
<h2 id="real-world-limitations">Real-World Limitations</h2>
<p>These theoretical concerns manifest in several practical ways:</p>
<h3 id="1-context-integration-problems">1. Context Integration Problems</h3>
<p>Current RAG implementations often struggle with:</p>
<ul>
<li>Abrupt transitions between retrieved content and generated text</li>
<li>Inconsistent voice and style when mixing sources</li>
<li>Difficulty maintaining coherent reasoning across retrieved facts</li>
<li>Limited ability to synthesize information from multiple sources</li>
</ul>
<h3 id="2-attention-mechanism-overload">2. Attention Mechanism Overload</h3>
<p>The transformer&rsquo;s attention mechanism faces significant challenges:</p>
<ul>
<li>Managing attention across disconnected chunks of information</li>
<li>Balancing focus between query, retrieved content, and generated text</li>
<li>Handling potentially contradictory information from different sources</li>
<li>Maintaining coherence when dealing with multiple retrieved documents</li>
</ul>
<h3 id="3-knowledge-conflicts">3. Knowledge Conflicts</h3>
<p>RAG systems often struggle to resolve conflicts between:</p>
<ul>
<li>The model&rsquo;s pretrained knowledge</li>
<li>Retrieved information</li>
<li>Different retrieved sources</li>
<li>User queries and retrieved content</li>
</ul>
<h2 id="the-path-forward-beyond-basic-rag">The Path Forward: Beyond Basic RAG</h2>
<p>Recent research and development suggest several promising directions for addressing these limitations:</p>
<h3 id="1-improved-knowledge-integration">1. Improved Knowledge Integration</h3>
<p>Future systems might:</p>
<ul>
<li>Process retrieved information before injection</li>
<li>Maintain explicit source tracking throughout generation</li>
<li>Use structured knowledge representations</li>
<li>Implement hierarchical attention mechanisms</li>
</ul>
<h3 id="2-enhanced-source-handling">2. Enhanced Source Handling</h3>
<p>Advanced approaches could:</p>
<ul>
<li>Evaluate source reliability and relevance</li>
<li>Resolve conflicts between sources</li>
<li>Maintain provenance information</li>
<li>Generate explicit citations and references</li>
</ul>
<h3 id="3-architectural-innovations">3. Architectural Innovations</h3>
<p>New architectures might include:</p>
<ul>
<li>Dedicated pathways for retrieved information</li>
<li>Specialized attention mechanisms for source integration</li>
<li>Dynamic context window management</li>
<li>Explicit fact-checking mechanisms</li>
</ul>
<h2 id="the-next-evolution-anthropics-citations-api">The Next Evolution: Anthropic&rsquo;s Citations API</h2>
<p>Anthropic&rsquo;s Citations API represents a significant step beyond traditional RAG implementations. While the exact implementation details aren&rsquo;t public, we can make informed speculations about its architectural innovations based on the capabilities it demonstrates.</p>
<h3 id="architectural-innovations">Architectural Innovations</h3>
<p>The Citations API likely goes beyond simple prompt engineering to include fundamental architectural changes:</p>
<ol>
<li>
<p><strong>Enhanced Context Processing</strong></p>
<ul>
<li>Specialized attention mechanisms for source document processing</li>
<li>Dedicated layers for maintaining source awareness throughout generation</li>
<li>Architectural separation between query processing and source document handling</li>
<li>Advanced chunking and document representation strategies</li>
</ul>
</li>
<li>
<p><strong>Citation-Aware Generation</strong></p>
<ul>
<li>Built-in tracking of source-claim relationships</li>
<li>Automatic detection of when citations are needed</li>
<li>Dynamic weighting of source relevance</li>
<li>Real-time fact verification against sources</li>
</ul>
</li>
<li>
<p><strong>Training Innovations</strong></p>
<ul>
<li>Custom loss functions for citation accuracy</li>
<li>Source fidelity metrics during training</li>
<li>Explicit training for source grounding</li>
<li>Specialized datasets for citation learning</li>
</ul>
</li>
</ol>
<h3 id="speculation-on-implementation">Speculation on Implementation</h3>
<p>The system likely employs several key mechanisms:</p>
<ol>
<li>
<p><strong>Dual-Stream Processing</strong></p>
<ul>
<li>Separate processing paths for user queries and source documents</li>
<li>Specialized attention heads for citation tracking</li>
<li>Fusion layers for combining information streams</li>
<li>Dynamic context management</li>
</ul>
</li>
<li>
<p><strong>Source Integration</strong></p>
<ul>
<li>Fine-grained document chunking</li>
<li>Semantic similarity tracking</li>
<li>Citation boundary detection</li>
<li>Provenance preservation</li>
</ul>
</li>
<li>
<p><strong>Training Approach</strong></p>
<ul>
<li>Multi-task training combining generation and citation</li>
<li>Custom datasets focused on source grounding</li>
<li>Citation-specific loss functions</li>
<li>Source fidelity metrics</li>
</ul>
</li>
</ol>
<h2 id="beyond-traditional-rag">Beyond Traditional RAG</h2>
<p>The Citations API and similar emerging technologies point to a future where knowledge integration isn&rsquo;t just an add-on but a core capability of language models. This evolution requires moving beyond simply stuffing context windows with retrieved documents toward architectures specifically designed for knowledge-aware generation.</p>
<p>The next generation of these systems will likely feature:</p>
<ul>
<li>Native citation capabilities</li>
<li>Real-time fact verification</li>
<li>Seamless source integration</li>
<li>Dynamic knowledge updates</li>
<li>Explicit handling of source conflicts</li>
</ul>
<p>As we move forward, the goal isn&rsquo;t to patch the limitations of current RAG systems but to fundamentally rethink how we combine language models with external knowledge. This might lead to entirely new architectures specifically designed for knowledge-enhanced generation, moving us beyond the current paradigm of context window injection toward truly integrated knowledge-aware AI systems.</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>CheeseGPT</title>
      <link>https://alex-jacobs.com/posts/cheesegpt/</link>
      <pubDate>Tue, 14 Mar 2023 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/cheesegpt/</guid>
      <description>A (Very) Simple RAG Tutorial</description>
      <content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>This toy project was originally created for a guest lecture to a Data Science 101 course (and its quality may reflect that :)
This post extends that lecture, designed to provide a high-level understanding and example of <em>Retrieval Augmented Generation</em> (RAG).</p>
<p>We&rsquo;ll go through the steps of creating a RAG based LLM system, explaining what we&rsquo;re doing along the way, and why.</p>
<p>You can follow along with the slides and code <a href="https://github.com/alexjacobs08/cheeseGPT">here</a></p>
<p><img loading="lazy" src="/posts/cheesegpt/img1.png" type="" alt="cheeseGPT"  /></p>
<h3 id="the-cheesegpt-system">The CheeseGPT System</h3>
<p>CheeseGPT combines Large Language Models (LLMs) with the advanced capabilities
of Retrieval-Augmented Generation (RAG). At its core, CheeseGPT uses OpenAI&rsquo;s GPT-4 model for natural language processing.
This model serves as the backbone for generating human-like text responses. However, what sets CheeseGPT apart is its
integration with Langchain and a Redis database containing all of the information on Wikipedia relating to cheese.</p>
<p>When a user asks a question, the system utilizes RAG to retrieve the most relevant information/documents from its vector database, and then includes those in its message to the LLM.  This
allows the LLM to have specific and up-to-date information to use, extending from the data that it was trained on.</p>
<p>The image below, flow from right to left (steps 1-5) shows the high level design of this.  The user&rsquo;s query is passed into our embedding model.  We do a similarity search against our
database to retrieve the most relevant documents to our users question. And then these are included in context passed to our LLM.</p>
<p><img loading="lazy" src="/posts/cheesegpt/img2.png" type="" alt="rag_based_llm_design"  /></p>
<h6 id="httpswwwanyscalecombloga-comprehensive-guide-for-building-rag-based-llm-applications-part-1"><em><a href="https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1">https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1</a></em></h6>
<p>Below, we&rsquo;ll outline the steps to building this system.</p>
<p><strong>NOTE</strong>: this is an example, and probably doesn&rsquo;t make a ton of sense as a useful system.  (For one, we&rsquo;re getting our
data from wikipedia, which is already contained within the training data of GPT-4)  This is meant to be a high level
example that can show how a RAG based system can work, and to show what the possibilities are when integrating external
data with LLMs  (proprietary data, industry specific technical docs, etc.)</p>
<h2 id="data-collection-and-processing">Data Collection and Processing</h2>
<p>As with most projects, getting and munging your data is one of the most time consuming yet crucial elements.
For our CheeseGPT example, this involved scraping Wikipedia for
cheese-related articles, generating embeddings, and storing them in a Redis database. Below, I&rsquo;ll outline these steps
with code snippets for clarity.</p>
<h3 id="scraping-wikipedia">Scraping Wikipedia</h3>
<p>We start by extracting content from Wikipedia. We made a recursive function <code>get_page_content</code> to fetch pages related
to cheese, including summaries and sections.  (Note: This function could definitely be improved.)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl">
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">def</span> <span class="nf">get_page_content</span><span class="p">(</span><span class="n">page_title</span><span class="p">,</span> <span class="n">depth</span><span class="p">,</span> <span class="n">max_depth</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">wiki_wiki</span> <span class="o">=</span> <span class="n">wikipediaapi</span><span class="o">.</span><span class="n">Wikipedia</span><span class="p">(</span><span class="s1">&#39;MyCheeseRAGApp/1.0 (myemail@example.com)&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="k">if</span> <span class="n">depth</span> <span class="o">&gt;</span> <span class="n">max_depth</span> <span class="ow">or</span> <span class="n">page_title</span> <span class="ow">in</span> <span class="n">visited_pages</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">        <span class="k">return</span> <span class="p">[],</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="n">visited_pages</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">page_title</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">visited_pages</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        <span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Visited pages: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">visited_pages</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">        <span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Fetching page &#39;</span><span class="si">{</span><span class="n">page_title</span><span class="si">}</span><span class="s2">&#39; (depth=</span><span class="si">{</span><span class="n">depth</span><span class="si">}</span><span class="s2">)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">        <span class="n">page</span> <span class="o">=</span> <span class="n">wiki_wiki</span><span class="o">.</span><span class="n">page</span><span class="p">(</span><span class="n">page_title</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">        <span class="k">if</span> <span class="ow">not</span> <span class="n">page</span><span class="o">.</span><span class="n">exists</span><span class="p">():</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">            <span class="k">return</span> <span class="p">[],</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl">        <span class="n">texts</span><span class="p">,</span> <span class="n">metadata</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span> <span class="o">%</span> <span class="mi">100</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">            <span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Texts length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl">        <span class="c1"># Add page summary</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">        <span class="n">texts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">page</span><span class="o">.</span><span class="n">summary</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">        <span class="n">metadata</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s1">&#39;title&#39;</span><span class="p">:</span> <span class="n">page_title</span><span class="p">,</span> <span class="s1">&#39;section&#39;</span><span class="p">:</span> <span class="s1">&#39;Summary&#39;</span><span class="p">,</span> <span class="s1">&#39;url&#39;</span><span class="p">:</span> <span class="n">page</span><span class="o">.</span><span class="n">fullurl</span><span class="p">})</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl">        <span class="c1"># Add sections</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">        <span class="k">for</span> <span class="n">section</span> <span class="ow">in</span> <span class="n">page</span><span class="o">.</span><span class="n">sections</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">            <span class="n">texts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">section</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl">            <span class="n">metadata</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s1">&#39;title&#39;</span><span class="p">:</span> <span class="n">page_title</span><span class="p">,</span> <span class="s1">&#39;section&#39;</span><span class="p">:</span> <span class="n">section</span><span class="o">.</span><span class="n">title</span><span class="p">,</span> <span class="s1">&#39;url&#39;</span><span class="p">:</span> <span class="n">page</span><span class="o">.</span><span class="n">fullurl</span><span class="p">})</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">
</span></span><span class="line"><span class="ln">30</span><span class="cl">        <span class="c1"># Recursive fetching for links</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl">        <span class="k">if</span> <span class="n">depth</span> <span class="o">&lt;</span> <span class="n">max_depth</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">            <span class="k">for</span> <span class="n">link_title</span> <span class="ow">in</span> <span class="n">page</span><span class="o">.</span><span class="n">links</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">33</span><span class="cl">                <span class="n">link_texts</span><span class="p">,</span> <span class="n">link_metadata</span> <span class="o">=</span> <span class="n">get_page_content</span><span class="p">(</span><span class="n">link_title</span><span class="p">,</span> <span class="n">depth</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">max_depth</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl">                <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">link_texts</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl">                    <span class="n">texts</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">link_texts</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">                    <span class="n">metadata</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">link_metadata</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">37</span><span class="cl">
</span></span><span class="line"><span class="ln">38</span><span class="cl">        <span class="k">return</span> <span class="n">texts</span><span class="p">,</span> <span class="n">metadata</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl">    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl">        <span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Error fetching page &#39;</span><span class="si">{</span><span class="n">page_title</span><span class="si">}</span><span class="s2">&#39;: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">41</span><span class="cl">        <span class="k">return</span> <span class="p">[],</span> <span class="p">[]</span>
</span></span></code></pre></div><p>This is a very greedy (and lazy) approach.  We don&rsquo;t discriminate at all, and we end up
with a ton of noise (things not related to cheese at all), but for our purposes of example, it works.</p>
<h3 id="generating-embeddings">Generating Embeddings</h3>
<p>Next, we need to generate our embeddings from our collected documents.</p>
<h4 id="what-are-embeddings">What are embeddings?</h4>
<p>Embeddings are high-dimensional, continuous vector representations of text, words, or other types of data,
where similar items have similar representations. They capture semantic relationships and features in a space where
operations like distance or angle measurement can indicate similarity or dissimilarity.</p>
<p>In machine learning, embeddings are used to convert categorical, symbolic, or textual data into a form that
algorithms can process more effectively, enabling tasks like natural language processing, recommendation systems,
and more sophisticated pattern recognition.</p>
<p>With our textual data collected, we&rsquo;ll be using OpenAI and Langchain to generate our embeddings. There are
lots of different ways to generate embeddings (plenty packages that run locally, too), but using OpenAI API to get them
is fast and easy for us. (and also dirt cheap)</p>
<p><strong>NOTE:</strong>
In a true production system, there would be <em>much</em> more consideration taken around generating embeddings.  This
is arguably the most important step in a RAG based system.  We&rsquo;d need to do experimentation with chunk size to see what
gives us the best results.  We&rsquo;d need to explore our vectors to make sure their working as expected, remove noise, etc.</p>
<h4 id="generating-embeddings-using-langchain-and-openai">Generating embeddings using LangChain and OpenAI</h4>
<p>Langchain makes it very easy to do create embeddings and store them in Redis without much thought,
but this step requires extreme care to generate good results in a production system</p>
<p>The snippet below takes our scraped wikipedia sections, generates embeddings for them using OpenAI&rsquo;s embeddings API,
and stores them in Redis.  Again, LangChain abstracts away a ton of complexity and makes this <em>really</em> easy for us.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">langchain.embeddings</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">embeddings</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">rds</span> <span class="o">=</span> <span class="n">Redis</span><span class="o">.</span><span class="n">from_texts</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="n">texts</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="n">embeddings</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="n">metadatas</span><span class="o">=</span><span class="n">metadata</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="n">redis_url</span><span class="o">=</span><span class="s2">&#34;redis://localhost:6379&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="n">index_name</span><span class="o">=</span><span class="s2">&#34;cheese&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><h2 id="implementation-of-rag">Implementation of RAG</h2>
<p>Our RAG operates by creating an embedding of the user&rsquo;s question and then finding the most semantically similar documents
in our database (via cosine similarity between the embedding of our user&rsquo;s query and the N closest documents in our database).</p>
<p>We then include these documents / snippets in our request to the LLM, telling it that they are the most relevant documents based
on a similarity search.  The LLM can then use these documents as reference when generating its response.</p>
<p>Here&rsquo;s a simplified overview of the process with code snippets:</p>
<h3 id="embedding-user-queries">Embedding User Queries</h3>
<p>The user&rsquo;s query is converted into an embedding using the OpenAI API. This embedding represents the semantic content of
the query in a format that can be compared against the pre-computed embeddings of the database articles.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="kn">from</span> <span class="nn">langchain.embeddings</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">embeddings</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">query_embedding</span> <span class="o">=</span> <span class="n">embeddings</span><span class="o">.</span><span class="n">embed_text</span><span class="p">(</span><span class="n">user_query</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="retrieving-related-articles">Retrieving Related Articles</h3>
<p>We then use the query embedding to perform a similarity search in the Redis database. It retrieves a set number
of articles that are most semantically similar to the query.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">get_related_articles</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="k">return</span> <span class="n">rds</span><span class="o">.</span><span class="n">similarity_search</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="integrating-retrieved-data-into-gpt-4-prompts">Integrating Retrieved Data into GPT-4 Prompts</h3>
<p>The retrieved articles are formatted and integrated into the prompt for GPT-4. This allows GPT-4 to use the information
from these articles to generate a response that is not only contextually relevant but also rich in content.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">create_prompt_with_articles</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">articles</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">article_summaries</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">article</span><span class="p">[</span><span class="s1">&#39;title&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">article</span><span class="p">[</span><span class="s1">&#39;summary&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span> <span class="k">for</span> <span class="n">article</span> <span class="ow">in</span> <span class="n">articles</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="k">return</span> <span class="sa">f</span><span class="s2">&#34;Question: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="se">\n\n</span><span class="s2">Related Information:</span><span class="se">\n</span><span class="s2">&#34;</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">article_summaries</span><span class="p">)</span>
</span></span></code></pre></div><h3 id="generating-the-response">Generating the Response</h3>
<p>Finally, the enriched prompt is fed to GPT-4, which generates a response based on both the user&rsquo;s query and the
additional context provided by the retrieved articles.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">ChatCompletion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-4&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">create_prompt_with_articles</span><span class="p">(</span><span class="n">user_query</span><span class="p">,</span> <span class="n">related_articles</span><span class="p">)}]</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>Through this process, CheeseGPT effectively combines the generative power of GPT-4 with the information retrieval
capabilities of RAG, resulting in responses that are informative, accurate, and contextually rich.</p>
<h2 id="the-chat-interface">The Chat Interface</h2>
<p>CheeseGPT&rsquo;s chat interface is an important component, orchestrating the interaction between the user, the
retrieval-augmented generation system, and the underlying Large Language Model (LLM).</p>
<p>For the purposes of our example, we have built the bindings for the interface, but did not create a fully interactive interface.</p>
<p>Let&rsquo;s dive into the key functions that make this interaction possible.</p>
<h3 id="connecting-to-redis">Connecting to Redis</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">rds_connect</span><span class="p">():</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">rds</span> <span class="o">=</span> <span class="n">Redis</span><span class="o">.</span><span class="n">from_existing_index</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">        <span class="n">embeddings</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">        <span class="n">redis_url</span><span class="o">=</span><span class="s2">&#34;redis://localhost:6379&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">        <span class="n">index_name</span><span class="o">=</span><span class="s2">&#34;cheese&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">        <span class="n">schema</span><span class="o">=</span><span class="s2">&#34;redis_schema.yaml&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">    <span class="k">return</span> <span class="n">rds</span>
</span></span></code></pre></div><p>This function establishes a connection to the Redis database, where the precomputed embeddings of cheese-related
Wikipedia pages are stored.</p>
<h3 id="applying-filters">Applying Filters</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">get_filters</span><span class="p">():</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">is_not_external_link</span> <span class="o">=</span> <span class="n">RedisFilter</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="s2">&#34;section&#34;</span><span class="p">)</span> <span class="o">!=</span> <span class="s1">&#39;External Links&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="n">is_not_see_also</span> <span class="o">=</span> <span class="n">RedisFilter</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="s2">&#34;section&#34;</span><span class="p">)</span> <span class="o">!=</span> <span class="s1">&#39;See Also&#39;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">    <span class="n">_filter</span> <span class="o">=</span> <span class="n">is_not_external_link</span> <span class="o">&amp;</span> <span class="n">is_not_see_also</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="k">return</span> <span class="n">_filter</span>
</span></span></code></pre></div><p>Filters are applied to ensure that irrelevant sections like &lsquo;External Links&rsquo; and &lsquo;See Also&rsquo; are excluded from the search
results.</p>
<h3 id="deduplicating-results">Deduplicating Results</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">dedupe_results</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">seen</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="n">deduped_results</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">    <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">        <span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">page_content</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">seen</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">            <span class="n">deduped_results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">            <span class="n">seen</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">page_content</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">    <span class="k">return</span> <span class="n">deduped_results</span>
</span></span></code></pre></div><p>This function ensures that duplicate content from the search results is removed, enhancing the quality of the final
output.  (This is necessary in our case because we were greedy / lazy when pulling our data / generating our vectors)</p>
<h3 id="retrieving-document-results">Retrieving Document Results</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">get_results</span><span class="p">(</span><span class="n">rds</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">_filters</span> <span class="o">=</span> <span class="n">get_filters</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="n">results</span> <span class="o">=</span> <span class="n">dedupe_results</span><span class="p">(</span><span class="n">rds</span><span class="o">.</span><span class="n">similarity_search</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">,</span> <span class="nb">filter</span><span class="o">=</span><span class="n">_filters</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">    <span class="k">return</span> <span class="n">results</span>
</span></span></code></pre></div><p>This key function performs a similarity search in the Redis database using the user&rsquo;s query, filtered and deduplicated.</p>
<h3 id="formatting-rag-results">Formatting RAG Results</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">format_rag_results</span><span class="p">(</span><span class="n">rag_results</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">divider</span> <span class="o">=</span> <span class="s2">&#34;*********************RESULT*********************</span><span class="se">\n</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="k">return</span> <span class="p">[</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">divider</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">result</span><span class="o">.</span><span class="n">page_content</span><span class="si">}</span><span class="s2"> (</span><span class="si">{</span><span class="n">result</span><span class="o">.</span><span class="n">metadata</span><span class="p">[</span><span class="s1">&#39;url&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">#</span><span class="si">{</span><span class="n">result</span><span class="o">.</span><span class="n">metadata</span><span class="p">[</span><span class="s1">&#39;section&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">)&#34;</span> <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">            <span class="n">rag_results</span><span class="p">]</span>
</span></span></code></pre></div><p>The function formats the search results, making them readable and including the source information for transparency.</p>
<p>This is what our message looks like when we send it to GPT-4.  Our system prompt is first and includes instructions for the
model to use the retrieved documents when answering the question.</p>
<p>In our user message, you can see the user&rsquo;s question, and then the documents we retrieved, presented as a list with some formatting.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="cl">   <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;You are cheeseGPT, a retrieval augmented chatbot with expert knowledge of cheese. You are here to answer questions about cheese, and you should, when possible, cite your sources with the documents provided to you.&#34;</span>
</span></span><span class="line"><span class="cl">   <span class="p">},</span>
</span></span><span class="line"><span class="cl">   <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;User question: what is the biggest cheese sporting event.  Retrieved documents: [&#39;*********************RESULT*********************\\n The Lucerne Cheese Festival (German: Käsefest Luzern) is a cheese festival held annually in Lucerne, Switzerland. It was established in 2001 and is normally run on a weekend in the middle of October at the Kapellplatz (Chapel Square) in the city centre. The next festival is planned to take place on 14 October 2023.The event features the biggest cheese market in central Switzerland, and offers the greatest selection of cheeses. As well as the cheese market and live demonstrations of cheesemaking, typical events during the festival include a milking competition and music such as the Swiss alphorn.The 2012 event featured over 200 varieties of cheese over 23 market stalls, including goat and sheep cheese. The 2020 event was almost cancelled because of social distancing restrictions during the COVID-19 pandemic, but was approved a few days before with a strict requirement to wear masks. Instead of the Kapellplatz, the festival was run from the nearby Kurplatz (Spa Square). 288 variety of cheeses were available at the festival, including cheesemakers from outside the local region such as the Bernese Jura and Ticino, who had their own festivals cancelled. Around 5,800 people attended the festival, lower than the previous year, with around two-thirds fewer sales. The following year\\&#39;s event continued restrictions, where customers had to taste and buy cheese at a distance, though masks were no longer mandatory. The 2022 event featured demonstrations of the cheese making process, a chalet built of Swiss cheese, and a \&#34;cheese chalet\&#34; hosting cheese fondue and raclette.India Times in 2014 called it out as one of 10 world food festivals for foodies. (https://en.wikipedia.org/wiki/Lucerne_Cheese_Festival#Summary)&#39;, &#39;*********************RESULT*********************\\n In addition to sampling and purchasing more than 4,600 cheeses in the Cheese Pavilion, visitors to the show are treated to various attractions throughout the day including cheese making demonstrations, trophy presentations and live cookery demonstrations. (https://en.wikipedia.org/wiki/International_Cheese_Awards#Show features)&#39;, &#39;*********************RESULT*********************\\n The first annual event was held in 2000 in Oxfordshire, and was founded by Juliet Harbutt. Each year it is preceded by the British Cheese Awards, a ceremony which Harbutt created in 1994, judged by food experts and farmers, in which the best cheeses are awarded bronze, silver and gold medals.\\nAll cheeses are tasted blind, and the winners can then display their awards during the public-attended festival.  There are usually a variety of events at the festival such as seminars, masterclasses, and cheesemaking demonstrations.The event moved to Cheltenham in Gloucestershire in 2005. In 2006 the Sunday of the weekend was cancelled at great cost after the venue experienced flooding, and the decision was made to return to Oxfordshire in 2007.Early in 2008 the festival was sold to Cardiff Council; subsequently the event has been held in the grounds of Cardiff Castle in 2008, 2009, 2010, and 2011.The 2012 Great British Cheese Festival was held at Cardiff Castle on Saturday, September 22, and Sunday, September 23.\\nIn 2015, Harbutt returned to her native New Zealand. (https://en.wikipedia.org/wiki/The_Great_British_Cheese_Festival#History)&#39;]&#34;</span>
</span></span><span class="line"><span class="cl">   <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span></code></pre></div><h3 id="generating-messages-for-the-llm">Generating Messages for the LLM</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">get_messages</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">rag_results</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">system_prompt</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">         <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;User question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s2">.  Retrieved documents: </span><span class="si">{</span><span class="n">format_rag_results</span><span class="p">(</span><span class="n">rag_results</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">    <span class="k">return</span> <span class="n">messages</span>
</span></span></code></pre></div><p>This function prepares the input for the LLM, combining the system prompt, user question, and the retrieved documents.</p>
<p>The integration of these functions creates a seamless flow from the user&rsquo;s question to the LLM&rsquo;s informed response,
enabling CheeseGPT to provide expert-level insights into the world of cheese.</p>
<p>Putting it all together might look something like&hellip;</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">question</span> <span class="o">=</span> <span class="s2">&#34;what is the biggest cheese sporting event&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">results</span> <span class="o">=</span> <span class="n">get_results</span><span class="p">(</span><span class="n">rds</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">messages</span> <span class="o">=</span> <span class="n">get_messages</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">results</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">ChatCompletion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-4&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="n">messages</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</span></span></code></pre></div><h2 id="outcomes">Outcomes</h2>
<p>So, let&rsquo;s compare a question using our system vs. asking ChatGPT.  We&rsquo;ll use the same question above.</p>
<p>Using our system, we get this response:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">The Lucerne Cheese Festival in Switzerland and the International Cheese Awards are two of the biggest cheese sporting events. 
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">The Lucerne Cheese Festival, a yearly event in Lucerne, Switzerland, features the largest cheese market in central Switzerland 
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">and provides a vast selection of cheeses. The festival also includes live cheesemaking demonstrations, a milking competition, 
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">and music. In past years, it has presented over 200 varieties of cheese over 23 market stalls, including goat and sheep cheese. 
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">The festival attracted around 5,800 attendees in 2020, despite social distancing restrictions. The India Times recognized it 
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">as one of 10 world food festivals for foodies in 2014 [^1^].
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">On the other hand, the International Cheese Awards allow visitors 
</span></span><span class="line"><span class="ln">10</span><span class="cl">to sample and purchase over 4,600 different cheeses in the Cheese Pavilion. The show features various attractions such as
</span></span><span class="line"><span class="ln">11</span><span class="cl">cheese making demonstrations, trophy presentations, and live cookery demonstrations [^2^].
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl">The Great British Cheese Festival, although not as large as the other two events, is also a notable cheese event. It 
</span></span><span class="line"><span class="ln">14</span><span class="cl">hosts the British Cheese Awards, where the best cheeses are awarded bronze, silver, and gold medals, judged by food 
</span></span><span class="line"><span class="ln">15</span><span class="cl">experts and farmers. All cheeses are tasted blindly, and the winners can display their awards during the festival. 
</span></span><span class="line"><span class="ln">16</span><span class="cl">The festival typically includes seminars, masterclasses, and cheesemaking demonstrations [^3^].
</span></span><span class="line"><span class="ln">17</span><span class="cl">
</span></span><span class="line"><span class="ln">18</span><span class="cl">In conclusion, both the Lucerne Cheese Festival and the International Cheese Awards are considered some of the biggest 
</span></span><span class="line"><span class="ln">19</span><span class="cl">cheese sporting events worldwide.
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl">[^1^]: (https://en.wikipedia.org/wiki/Lucerne_Cheese_Festival#Summary)
</span></span><span class="line"><span class="ln">22</span><span class="cl">[^2^]: (https://en.wikipedia.org/wiki/International_Cheese_Awards#Show_features)
</span></span><span class="line"><span class="ln">23</span><span class="cl">[^3^]: (https://en.wikipedia.org/wiki/The_Great_British_Cheese_Festival#History)&#34;
</span></span></code></pre></div><p>And if we ask ChatGPT the same question&hellip;</p>
<p><img loading="lazy" src="/posts/cheesegpt/img3.jpg" type="" alt="img3.jpg"  /></p>
<p>I&rsquo;m not sure which of these answer is <em>more</em> correct, and it doesn&rsquo;t matter for the purposes of this example.  The point is that we were
able to retrieve and include our own information, external to the model, and make it use that information in it&rsquo;s response.<br>
It&rsquo;s clear how many amazing use cases there are for something like this!</p>
<p>Hopefully this high level toy example was able to shed some light on what a RAG based system may look like.  Checkout the
additional resources linked before for more in-depth information. Thanks for reading!</p>
<h3 id="additional-resources">Additional resources</h3>
<p><a href="https://github.com/ray-project/llm-applications/blob/main/notebooks/rag.ipynb">https://github.com/ray-project/llm-applications/blob/main/notebooks/rag.ipynb</a>
<a href="https://github.com/pchunduri6/rag-demystified">https://github.com/pchunduri6/rag-demystified</a>
<a href="https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1">https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1</a></p>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
