<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>BERT on Alex Jacobs</title>
    <link>https://alex-jacobs.com/tags/bert/</link>
    <description>Recent content in BERT on Alex Jacobs</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 04 Jan 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://alex-jacobs.com/tags/bert/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Beating BERT? Small LLMs vs Fine-Tuned Encoders for Classification</title>
      <link>https://alex-jacobs.com/posts/beatingbert/</link>
      <pubDate>Sun, 04 Jan 2026 00:00:00 +0000</pubDate>
      
      <guid>https://alex-jacobs.com/posts/beatingbert/</guid>
      <description>I ran 32 experiments comparing small LLMs to BERT on classification tasks. Turns out 2018-era BERT is still really good at what it does.</description>
      <content:encoded><![CDATA[<p>&ldquo;Just use an LLM.&rdquo;</p>
<p>That was my advice to a colleague recently when they asked about a classification problem. Who fine-tunes BERT anymore? Haven&rsquo;t decoder models eaten the entire NLP landscape?</p>
<p>The look I got back was&hellip; skeptical. And it stuck with me.</p>
<p>I&rsquo;ve been deep in LLM-land for a few years now. When your daily driver can architect systems, write production code, and reason through problems better than most junior devs, you start reaching for it reflexively. Maybe my traditional ML instincts had atrophied.</p>
<p>So I decided to actually test my assumptions instead of just vibing on them.</p>
<p>I ran 32 experiments pitting small instruction-tuned LLMs against good old BERT and DeBERTa. I figured I&rsquo;d just be confirming what I already believed, that these new decoder models would obviously crush the ancient encoders.</p>
<p>I was wrong.</p>
<p>The results across Gemma 2B, Qwen 0.5B/1.5B, BERT-base, and DeBERTa-v3 were&hellip; not what I expected. If you&rsquo;re trying to decide between these approaches for classification, you might want to actually measure things instead of assuming the newer model is better.</p>
<p><img loading="lazy" src="/posts/beatingbert/scorecard.png" type="" alt="TL;DR: DeBERTa wins 3/4 tasks, LLM wins on adversarial NLI, but LLMs need zero training"  /></p>
<p>All the code is <a href="https://github.com/alexjacobs08/beatingBERT">on GitHub</a> if you want to run your own experiments.</p>
<h2 id="experiment-setup">Experiment Setup</h2>
<h3 id="what-i-tested">What I Tested</h3>
<p><strong>BERT Family (Fine-tuned)</strong></p>
<ul>
<li>BERT-base-uncased (110M parameters)</li>
<li>DeBERTa-v3-base (184M parameters)</li>
</ul>
<p><strong>Small LLMs</strong></p>
<ul>
<li>Qwen2-0.5B-Instruct</li>
<li>Qwen2.5-1.5B-Instruct</li>
<li>Gemma-2-2B-it</li>
</ul>
<p>For the LLMs, I tried two approaches:</p>
<ol>
<li><strong>Zero-shot</strong> - Just prompt engineering, no training</li>
<li><strong>Few-shot (k=5)</strong> - Include 5 examples in the prompt</li>
</ol>
<h3 id="tasks">Tasks</h3>
<p>Four classification benchmarks ranging from easy sentiment to adversarial NLI:</p>
<table>
<thead>
<tr>
<th>Task</th>
<th>Type</th>
<th>Labels</th>
<th>Difficulty</th>
</tr>
</thead>
<tbody>
<tr>
<td>SST-2</td>
<td>Sentiment</td>
<td>2</td>
<td>Easy</td>
</tr>
<tr>
<td>RTE</td>
<td>Textual Entailment</td>
<td>2</td>
<td>Medium</td>
</tr>
<tr>
<td>BoolQ</td>
<td>Yes/No QA</td>
<td>2</td>
<td>Medium</td>
</tr>
<tr>
<td>ANLI (R1)</td>
<td>Adversarial NLI</td>
<td>3</td>
<td>Hard</td>
</tr>
</tbody>
</table>
<h3 id="methodology">Methodology</h3>
<p>For anyone who wants to reproduce this or understand what &ldquo;fine-tuned&rdquo; and &ldquo;zero-shot&rdquo; actually mean here:</p>
<p><strong>BERT/DeBERTa Fine-tuning:</strong></p>
<ul>
<li>Standard HuggingFace Trainer with AdamW optimizer</li>
<li>Learning rate: 2e-5, batch size: 32, epochs: 3</li>
<li>Max sequence length: 128 tokens</li>
<li>Evaluation on the validation split (GLUE and SuperGLUE test sets don&rsquo;t have public labels)</li>
</ul>
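<p>For a sense of scale, here is a rough sketch of how many optimizer steps those settings imply. It assumes a single GPU, no gradient accumulation, and a hypothetical 2,500-example training set; the <code>total_optimizer_steps</code> helper is illustrative and not part of the experiment code.</p>

```python
import math

# Hyperparameters as listed above; the training-set size below is a made-up example.
CONFIG = {"learning_rate": 2e-5, "batch_size": 32, "epochs": 3, "max_length": 128}


def total_optimizer_steps(train_size, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(train_size / batch_size)  # the last partial batch is kept
    return steps_per_epoch * epochs


print(total_optimizer_steps(2_500, CONFIG["batch_size"], CONFIG["epochs"]))  # 237
```

<p>A few hundred steps is a short run, which is part of why fine-tuning an encoder is cheap once labeled data exists.</p>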
<p><strong>LLM Zero-shot:</strong></p>
<ul>
<li>Greedy decoding (temperature=0.0) for deterministic outputs</li>
<li>Task-specific prompts asking for single-word classification labels</li>
<li>No examples in context—just instructions and the input text</li>
</ul>
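<p>A minimal sketch of what zero-shot prompt construction and label parsing could look like. The prompt wording and the <code>build_prompt</code> and <code>parse_label</code> helpers are illustrative assumptions, not the prompts used in these experiments, and the model call itself is left out.</p>

```python
def build_prompt(text, labels, task="sentiment"):
    """Build a zero-shot prompt that asks for a single-word label (hypothetical wording)."""
    options = ", ".join(labels)
    return (
        f"Classify the {task} of the following text.\n"
        f"Answer with exactly one word from: {options}.\n\n"
        f"Text: {text}\n"
        "Answer:"
    )


def parse_label(generation, labels):
    """Map raw generated text to a known lowercase label, or None if none is found."""
    cleaned = generation.strip().lower()
    # Pick the label that appears earliest in the output, so that
    # "negative, not positive" parses as "negative".
    hits = [(cleaned.find(lab), lab) for lab in labels if lab in cleaned]
    return min(hits)[1] if hits else None


labels = ["positive", "negative"]
prompt = build_prompt("a solid film", labels)
print(parse_label(" Positive.", labels))     # positive
print(parse_label("I am not sure", labels))  # None
```

<p>Returning <code>None</code> for unparseable output keeps those cases visible, so they can be counted as errors instead of being silently assigned a class. Substring matching is a simplification: it would need care for label sets where one label contains another.</p>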
<p><strong>LLM Few-shot (k=5):</strong></p>
<ul>
<li>Same as zero-shot, but with 5 labeled examples prepended to each prompt</li>
<li>Examples randomly sampled from training set (stratified by class)</li>
</ul>
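<p>One way to draw class-balanced few-shot examples with a fixed seed is sketched below. The round-robin scheme and the <code>sample_few_shot</code> and <code>format_few_shot</code> helpers are assumptions for illustration; the sampling code behind these results may differ.</p>

```python
import random
from collections import defaultdict


def sample_few_shot(examples, k=5, seed=99):
    """Pick k (text, label) pairs, cycling through classes so each is represented."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    # Shuffle within each class, then draw round-robin across classes.
    pools = [rng.sample(items, len(items)) for _, items in sorted(by_label.items())]
    picked = []
    while len(picked) != k and any(pools):
        for pool in pools:
            if pool and len(picked) != k:
                picked.append(pool.pop())
    rng.shuffle(picked)  # avoid a fixed class order in the prompt
    return picked


def format_few_shot(shots, query):
    """Prepend labeled examples to the query in a simple Text/Label layout."""
    blocks = [f"Text: {t}\nLabel: {l}" for t, l in shots]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


train = [("great", "positive"), ("awful", "negative"), ("fine", "positive"),
         ("bad", "negative"), ("loved it", "positive"), ("boring", "negative")]
shots = sample_few_shot(train, k=5)
print(format_few_shot(shots, "not my thing"))
```

<p>With two classes and k=5 the split is necessarily uneven (3 vs. 2), which is one reason few-shot results can be sensitive to which examples get drawn.</p>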
<p>All experiments used a fixed random seed (99) for reproducibility. Evaluation metrics are accuracy on the validation split. Hardware: RunPod instance with RTX A4500 (20GB VRAM), 20GB RAM, 5 vCPU.</p>
<p><img loading="lazy" src="/posts/beatingbert/nvitop.png" type="" alt="nvitop running on RunPod"  />
<em>I&rsquo;d forgotten how pretty text-only land can be. When you spend most of your time in IDEs and notebooks, SSH-ing into a headless GPU box and watching nvitop do its thing feels almost meditative.</em></p>
<h2 id="results">Results</h2>
<p>Let&rsquo;s dive into what actually happened:</p>


<style>
.results-table { width: 100%; border-collapse: collapse; margin: 1.5rem 0; font-size: 0.9rem; }
.results-table th, .results-table td { padding: 0.5rem 0.75rem; text-align: left; border-bottom: 1px solid var(--border); }
.results-table th { color: var(--secondary); font-weight: 500; }
.results-table td { color: var(--content); }
.winner { background: rgba(87, 62, 170, 0.15); font-weight: 600; border-radius: 4px; padding: 0.2rem 0.4rem; }
</style>
<table class="results-table">
<thead>
<tr><th>Model</th><th>Method</th><th>SST-2</th><th>RTE</th><th>BoolQ</th><th>ANLI</th></tr>
</thead>
<tbody>
<tr><td><strong>DeBERTa-v3</strong></td><td>Fine-tuned</td><td><span class="winner">94.8%</span></td><td><span class="winner">80.9%</span></td><td><span class="winner">82.6%</span></td><td>47.4%</td></tr>
<tr><td><strong>BERT-base</strong></td><td>Fine-tuned</td><td>91.5%</td><td>61.0%</td><td>71.5%</td><td>35.3%</td></tr>
<tr><td>Qwen2.5-1.5B</td><td>Zero-shot</td><td>93.8%</td><td>78.7%</td><td>74.6%</td><td>40.8%</td></tr>
<tr><td>Qwen2.5-1.5B</td><td>Few-shot</td><td>89.0%</td><td>53.4%</td><td>73.6%</td><td>45.0%</td></tr>
<tr><td>Gemma-2-2B</td><td>Zero-shot</td><td>90.0%</td><td>61.4%</td><td>80.9%</td><td>36.1%</td></tr>
<tr><td>Gemma-2-2B</td><td>Few-shot</td><td>86.5%</td><td>73.6%</td><td>81.5%</td><td><span class="winner">47.8%</span></td></tr>
<tr><td>Qwen2-0.5B</td><td>Zero-shot</td><td>87.6%</td><td>53.1%</td><td>61.8%</td><td>33.2%</td></tr>
</tbody>
</table>

<p><img loading="lazy" src="/posts/beatingbert/accuracy_comparison.png" type="" alt="Model Accuracy Comparison"  /></p>
<p><strong>DeBERTa-v3 wins most tasks—but not all</strong></p>
<p>DeBERTa hit 94.8% on SST-2, 80.9% on RTE, and 82.6% on BoolQ. For standard classification with decent training data, fine-tuned DeBERTa still comes out on top.</p>
<p>On ANLI—the hardest benchmark, specifically designed to fool models—Gemma few-shot actually beats DeBERTa (47.8% vs 47.4%). It&rsquo;s a narrow win, but it&rsquo;s a win on the task that matters most for robustness.</p>
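<p>To put a 0.4-point gap in context, a quick binomial standard-error estimate helps. The sketch below assumes an evaluation split of 1,000 examples; that size is an assumption for illustration, so plug in the real count when checking this yourself.</p>

```python
import math


def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy estimate from n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


n = 1000  # assumption: evaluation-split size, chosen for illustration
gemma, deberta = 0.478, 0.474
se = accuracy_standard_error(gemma, n)
print(f"gap: {(gemma - deberta) * 100:.1f} points, standard error: {se * 100:.1f} points")
# gap: 0.4 points, standard error: 1.6 points
```

<p>Under that assumption the gap is well inside one standard error, so it reads more like a tie between a prompted 2B model and a fine-tuned encoder than a decisive win.</p>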
<p><strong>Zero-shot LLMs actually beat BERT-base</strong></p>
<p>The LLMs aren&rsquo;t losing to BERT—they&rsquo;re losing to DeBERTa. Qwen2.5-1.5B zero-shot hit 93.8% on SST-2, beating BERT-base&rsquo;s 91.5%. Same story on RTE (78.7% vs 61.0%) and BoolQ (Gemma&rsquo;s 80.9% vs BERT&rsquo;s 71.5%). For models running purely on prompts with zero training? I&rsquo;m calling it a win.</p>
<p><strong>Few-shot is a mixed bag</strong></p>
<p>Adding examples to the prompt doesn&rsquo;t always help.</p>
<p>On RTE, Qwen2.5-1.5B went from 78.7% zero-shot down to 53.4% with few-shot. On SST-2, it dropped from 93.8% to 89.0%. But on ANLI, few-shot helped significantly—Gemma jumped from 36.1% to 47.8%, enough to beat DeBERTa.</p>
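<p>My read: few-shot tends to help on harder tasks, where examples show the model what each label looks like, but it can confuse models on simpler pattern-matching tasks where they already &ldquo;get it.&rdquo; It&rsquo;s also model-dependent: on RTE, Gemma improved from 61.4% to 73.6% with few-shot while Qwen2.5-1.5B collapsed. Sometimes examples add noise instead of signal.</p>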
<h2 id="bert-goes-brrrr">BERT Goes Brrrr</h2>
<p>Okay, so the accuracy gap isn&rsquo;t huge. Maybe I could still justify using an LLM?</p>
<p>Then I looked at throughput:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Method</th>
<th>Throughput (samples/s)</th>
<th>Latency (ms/sample)</th>
</tr>
</thead>
<tbody>
<tr>
<td>BERT-base</td>
<td>Fine-tuned</td>
<td><strong>277</strong></td>
<td>3.6</td>
</tr>
<tr>
<td>DeBERTa-v3</td>
<td>Fine-tuned</td>
<td><strong>232</strong></td>
<td>4.3</td>
</tr>
<tr>
<td>Qwen2-0.5B</td>
<td>Zero-shot</td>
<td>17.5</td>
<td>57</td>
</tr>
<tr>
<td>Qwen2.5-1.5B</td>
<td>Zero-shot</td>
<td>12.3</td>
<td>81</td>
</tr>
<tr>
<td>Gemma-2-2B</td>
<td>Zero-shot</td>
<td>11.6</td>
<td>86</td>
</tr>
</tbody>
</table>
<p><img loading="lazy" src="/posts/beatingbert/accuracy_vs_latency.png" type="" alt="Accuracy vs Latency"  /></p>
<p><strong>BERT is ~20x faster.</strong></p>
<p>BERT processes 277 samples per second. Gemma-2-2B manages 12. If you&rsquo;re classifying a million documents, that&rsquo;s one hour vs a full day.</p>
<p>Encoders process the whole sequence in one forward pass. Decoders generate tokens autoregressively, even just to output &ldquo;positive&rdquo; or &ldquo;negative&rdquo;.</p>
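<p>The &ldquo;one hour vs a full day&rdquo; figure follows directly from the throughput table. Here is the arithmetic as a small sketch, assuming a single worker running at the measured steady-state throughput (the <code>hours_to_classify</code> helper is just for illustration):</p>

```python
def hours_to_classify(n_docs, samples_per_second):
    """Wall-clock hours for a single worker at a steady throughput."""
    return n_docs / samples_per_second / 3600


# Samples per second, copied from the throughput table above.
throughput = {"BERT-base": 277, "DeBERTa-v3": 232, "Qwen2-0.5B": 17.5,
              "Qwen2.5-1.5B": 12.3, "Gemma-2-2B": 11.6}

for model, sps in throughput.items():
    hours = hours_to_classify(1_000_000, sps)
    ratio = throughput["BERT-base"] / sps
    print(f"{model:13s} {hours:5.1f} h   BERT-base is {ratio:4.1f}x faster")
```

<p>That works out to about 1.0 hours for BERT-base and about 24 hours for Gemma-2-2B, with speedups ranging from roughly 16x (vs. Qwen2-0.5B) to roughly 24x (vs. Gemma-2-2B).</p>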
<blockquote>
<p><strong>Note on LLM latency:</strong> These numbers use <code>max_length=256</code> for tokenization. When I bumped it to <code>max_length=2048</code>, latency jumped about 8x, from 57ms to 445ms per sample for Qwen2-0.5B. Inference time scales roughly linearly with context length. For short classification tasks, keep it short or make it dynamic.</p>
</blockquote>
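<p>As a sanity check on that note, a simple proportional model gets close to the measured number. This sketch assumes inputs are padded to <code>max_length</code>, so compute grows with the configured length and not with the actual text; <code>predicted_latency_ms</code> is a hypothetical helper.</p>

```python
def predicted_latency_ms(base_latency_ms, base_max_length, new_max_length):
    """Scale latency in proportion to padded sequence length (a rough linear model)."""
    return base_latency_ms * new_max_length / base_max_length


predicted = predicted_latency_ms(57, 256, 2048)
observed = 445
print(f"predicted {predicted:.0f} ms, observed {observed} ms")
# predicted 456 ms, observed 445 ms
```

<p>The prediction lands within a few percent of the measurement, which is why padding to the longest input in each batch, and not to a fixed maximum, is usually worth the effort for short texts.</p>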
<h3 id="try-it-yourself">Try It Yourself</h3>
<p>These models struggled on nuanced reviews. Can you do better? Try classifying some of the trickiest examples from my experiments:</p>


<style>
.classifier-demo {
  max-width: 100%;
  margin: 1.5rem 0;
  padding: 1.5rem;
  border: 1px solid var(--border);
  border-radius: var(--radius);
  background: var(--code-bg);
}
.demo-header {
  text-align: center;
  margin-bottom: 1rem;
}
.demo-title {
  margin: 0 0 0.25rem 0;
  font-size: 1.1rem;
  color: var(--primary);
}
.demo-subtitle {
  margin: 0 0 0.75rem 0;
  font-size: 0.85rem;
  color: var(--secondary);
}
.demo-score {
  display: flex;
  justify-content: center;
  gap: 2rem;
  font-size: 0.9rem;
  color: var(--secondary);
}
.demo-score strong {
  color: var(--primary);
}
.review-box {
  background: var(--entry);
  padding: 1rem 1.25rem;
  border-radius: var(--radius);
  margin: 1rem 0;
  font-style: italic;
  line-height: 1.6;
  border-left: 3px solid #573eaa;
  color: var(--content);
}
.btn-group {
  display: flex;
  gap: 0.75rem;
  justify-content: center;
  margin: 1rem 0;
}
.demo-btn {
  padding: 0.6rem 1.5rem;
  font-size: 0.9rem;
  border: none;
  border-radius: var(--radius);
  cursor: pointer;
  transition: all 0.2s;
  font-weight: 500;
}
.demo-btn:hover:not(:disabled) { opacity: 0.9; }
.demo-btn:disabled { opacity: 0.5; cursor: not-allowed; }
.btn-positive { background: #27ae60; color: white; }
.btn-negative { background: #c0392b; color: white; }
.result-box {
  margin-top: 1rem;
  padding: 1rem;
  border-radius: var(--radius);
  display: none;
}
.result-box.show { display: block; }
.result-correct { background: rgba(39, 174, 96, 0.15); border: 1px solid rgba(39, 174, 96, 0.3); }
.result-wrong { background: rgba(192, 57, 43, 0.15); border: 1px solid rgba(192, 57, 43, 0.3); }
.model-results {
  margin-top: 0.75rem;
  font-size: 0.8rem;
  color: var(--secondary);
}
.model-row {
  display: flex;
  justify-content: space-between;
  padding: 0.2rem 0;
  border-bottom: 1px solid var(--border);
}
.model-row:last-child { border-bottom: none; }
.model-correct { color: #27ae60; }
.model-wrong { color: #c0392b; }
.next-btn {
  display: block;
  margin: 1rem auto 0;
  padding: 0.5rem 1.25rem;
  background: #573eaa;
  color: white;
  border: none;
  border-radius: var(--radius);
  cursor: pointer;
  font-size: 0.85rem;
}
.next-btn:hover { background: #6549c0; }
.progress-bar {
  height: 3px;
  background: var(--border);
  border-radius: 2px;
  margin-bottom: 1rem;
}
.progress-fill {
  height: 100%;
  background: #573eaa;
  border-radius: 2px;
  transition: width 0.3s;
}
.demo-complete {
  text-align: center;
  padding: 1rem;
}
.final-score {
  font-size: 1.25rem;
  font-weight: 600;
  margin: 0.5rem 0;
  color: var(--primary);
}
#completeSummary {
  color: var(--secondary);
}
</style>

<div class="classifier-demo" id="classifierDemo">
  <div class="demo-header">
    <div class="demo-title">Can You Beat the Models?</div>
    <p class="demo-subtitle">Classify these tricky movie reviews</p>
    <div class="demo-score">
      <span>You: <strong id="userScore">0</strong>/<span id="totalAnswered">0</span></span>
      <span>Models: <strong id="modelScore">0</strong>/<span id="totalAnswered2">0</span></span>
    </div>
  </div>
  <div class="progress-bar">
    <div class="progress-fill" id="progressFill" style="width: 0%"></div>
  </div>
  <div id="questionArea">
    <div class="review-box" id="reviewText"></div>
    <div class="btn-group">
      <button class="demo-btn btn-negative" onclick="submitAnswer('negative')">Negative</button>
      <button class="demo-btn btn-positive" onclick="submitAnswer('positive')">Positive</button>
    </div>
    <div class="result-box" id="resultBox">
      <div id="resultText"></div>
      <div class="model-results" id="modelResults"></div>
      <button class="next-btn" onclick="nextQuestion()">Next Review →</button>
    </div>
  </div>
  <div id="completeArea" style="display:none;" class="demo-complete">
    <div class="demo-title">Challenge Complete!</div>
    <div class="final-score">You: <span id="finalUserScore"></span> | Models: <span id="finalModelScore"></span></div>
    <p id="completeSummary"></p>
    <button class="next-btn" onclick="restartDemo()">Play Again</button>
  </div>
</div>

<script>
const demoExamples = [
  {text: "hilariously inept and ridiculous.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "negative"}},
  {text: "all that's missing is the spontaneity, originality and delight.", true_label: "negative", predictions: {"Gemma": "positive", "Qwen": "positive"}},
  {text: "reign of fire looks as if it was made without much thought -- and is best watched that way.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "negative"}},
  {text: "we root for (clara and paul), even like them, though perhaps it's an emotion closer to pity.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "negative"}},
  {text: "a solid film... but more conscientious than it is truly stirring.", true_label: "positive", predictions: {"Gemma": "negative", "Qwen": "positive"}},
  {text: "this riveting world war ii moral suspense story deals with the shadow side of american culture: racial prejudice in its ugly and diverse forms.", true_label: "negative", predictions: {"Gemma": "positive", "Qwen": "positive"}}
];
let currentIndex = 0, userCorrect = 0, modelCorrect = 0, totalAnswered = 0;
function initDemo() {
  currentIndex = 0; userCorrect = 0; modelCorrect = 0; totalAnswered = 0;
  document.getElementById('completeArea').style.display = 'none';
  document.getElementById('questionArea').style.display = 'block';
  updateScores(); showQuestion();
}
function showQuestion() {
  const ex = demoExamples[currentIndex];
  document.getElementById('reviewText').textContent = '"' + ex.text + '"';
  document.getElementById('resultBox').classList.remove('show');
  document.querySelectorAll('.demo-btn').forEach(b => b.disabled = false);
  document.getElementById('progressFill').style.width = ((currentIndex / demoExamples.length) * 100) + '%';
}
function submitAnswer(answer) {
  const ex = demoExamples[currentIndex];
  const correct = answer === ex.true_label;
  totalAnswered++;
  if (correct) userCorrect++;
  let mc = 0;
  Object.values(ex.predictions).forEach(p => { if (p === ex.true_label) mc++; });
  modelCorrect += mc / Object.keys(ex.predictions).length;
  const resultBox = document.getElementById('resultBox');
  resultBox.className = 'result-box show ' + (correct ? 'result-correct' : 'result-wrong');
  document.getElementById('resultText').innerHTML = correct
    ? '<strong>Correct!</strong> This review is ' + ex.true_label + '.'
    : '<strong>Tricky!</strong> This review is actually <em>' + ex.true_label + '</em>.';
  let modelHtml = '<strong>Model predictions:</strong>';
  for (const [model, pred] of Object.entries(ex.predictions)) {
    const isCorrect = pred === ex.true_label;
    modelHtml += '<div class="model-row"><span>' + model + '</span><span class="' + (isCorrect ? 'model-correct' : 'model-wrong') + '">' + pred + ' ' + (isCorrect ? '✓' : '✗') + '</span></div>';
  }
  document.getElementById('modelResults').innerHTML = modelHtml;
  document.querySelectorAll('.demo-btn').forEach(b => b.disabled = true);
  updateScores();
}
function updateScores() {
  document.getElementById('userScore').textContent = userCorrect;
  document.getElementById('modelScore').textContent = modelCorrect.toFixed(1);
  document.getElementById('totalAnswered').textContent = totalAnswered;
  document.getElementById('totalAnswered2').textContent = totalAnswered;
}
function nextQuestion() {
  currentIndex++;
  if (currentIndex >= demoExamples.length) showComplete();
  else showQuestion();
}
function showComplete() {
  document.getElementById('questionArea').style.display = 'none';
  document.getElementById('completeArea').style.display = 'block';
  // Fill the progress bar once every review has been answered.
  document.getElementById('progressFill').style.width = '100%';
  document.getElementById('finalUserScore').textContent = userCorrect + '/' + demoExamples.length;
  document.getElementById('finalModelScore').textContent = modelCorrect.toFixed(1) + '/' + demoExamples.length;
  const diff = userCorrect - modelCorrect;
  let msg = diff > 1 ? "You crushed the AI! Human intuition wins." : diff > 0 ? "You edged out the models!" : diff === 0 ? "Dead heat with AI." : "The models got you this time. These are genuinely tricky!";
  document.getElementById('completeSummary').textContent = msg;
}
function restartDemo() { initDemo(); }
// Run initDemo exactly once: wait for the DOM if it is still loading, otherwise start now.
if (document.readyState === 'loading') document.addEventListener('DOMContentLoaded', initDemo);
else initDemo();
</script>

<h2 id="when-llms-make-sense">When LLMs Make Sense</h2>
<p>Despite the efficiency gap, there are cases where small LLMs are the right choice:</p>
<p><strong>Zero Training Data</strong></p>
<p>If you have no labeled data, LLMs win by default. Zero-shot Qwen2.5-1.5B at 93.8% on SST-2 is production-ready without a single training example. You can&rsquo;t fine-tune BERT with zero examples.</p>
<p><strong>Rapidly Changing Categories</strong></p>
<p>If your categories change frequently (new product types, emerging topics), re-prompting an LLM takes seconds. Re-training BERT requires new labeled data, training time, validation, deployment. The iteration cycle matters.</p>
<p><strong>Explanations with Predictions</strong></p>
<p>LLMs can provide reasoning: &ldquo;This review is negative because the customer mentions &lsquo;defective product&rsquo; and &lsquo;waste of money.&rsquo;&rdquo; BERT gives you a probability. Sometimes you need the story, not just the number.</p>
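<p>If you want the label and the explanation together, one option is to ask the model for a small JSON object and validate it before trusting it. The reply format and the <code>parse_explained_prediction</code> helper below are assumptions for illustration; none of the experiments in this post used this setup.</p>

```python
import json


def parse_explained_prediction(generation, labels):
    """Parse a JSON reply of the form {"label": ..., "reason": ...}; None if malformed."""
    try:
        reply = json.loads(generation)
    except json.JSONDecodeError:
        return None
    if not isinstance(reply, dict) or reply.get("label") not in labels:
        return None
    return reply["label"], str(reply.get("reason", ""))


raw = '{"label": "negative", "reason": "mentions a defective product"}'
print(parse_explained_prediction(raw, ["positive", "negative"]))
# ('negative', 'mentions a defective product')
```

<p>Small models do not always produce valid JSON, so treating malformed replies as missing predictions keeps the failure rate measurable.</p>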
<p><strong>Low Volume</strong></p>
<p>If you&rsquo;re processing 100 support tickets a day, throughput doesn&rsquo;t matter. Even at Gemma&rsquo;s 86 ms per sample, that&rsquo;s under nine seconds of compute per day, so the 20x speed difference is irrelevant when you&rsquo;re not hitting any resource constraints.</p>
<h2 id="when-bert-still-wins">When BERT Still Wins</h2>
<p><strong>High-Volume Production Systems</strong></p>
<p>If you&rsquo;re classifying millions of items daily, BERT&rsquo;s 20x throughput advantage matters. Per million items, that&rsquo;s a job finishing in about an hour vs. running all day.</p>
<p><strong>Well-Defined, Stable Tasks</strong></p>
<p>Sentiment analysis. Spam detection. Topic classification. If your task definition hasn&rsquo;t changed since 2019, fine-tuned BERT is proven and stable. No need to fix what isn&rsquo;t broken.</p>
<p><strong>You Have Training Data</strong></p>
<p>With a few thousand labeled examples, fine-tuned DeBERTa will usually beat small LLMs (it won 3 of 4 tasks here). It&rsquo;s a dedicated specialist vs. a generalist. Specialization still works.</p>
<p><strong>Latency Matters</strong></p>
<p>Real-time classification in a user-facing app where every millisecond counts? BERT&rsquo;s parallel processing wins. LLMs can&rsquo;t compete on speed.</p>
<h2 id="limitations">Limitations</h2>
<p>Before you @ me on Twitter—yes, I know this isn&rsquo;t the final word. Some caveats:</p>
<p><strong>I only tested small LLMs.</strong> I kept everything at roughly 2B parameters or below to fit comfortably on a 20GB GPU. Bigger models like Llama-3-8B or Qwen-7B would probably do better, but then the efficiency comparison becomes even more lopsided. You&rsquo;re not beating BERT&rsquo;s throughput with a 7B model.</p>
<p><strong>Generic prompts.</strong> I used straightforward prompts without heavy optimization. Task-specific prompt engineering could boost LLM performance. DSPy-style optimization would probably help too—but that&rsquo;s another blog post.</p>
<p><strong>Four benchmarks isn&rsquo;t everything.</strong> There are plenty of classification scenarios I didn&rsquo;t test. Your domain might be different. Measure, don&rsquo;t assume.</p>
<h2 id="conclusion">Conclusion</h2>
<p>So, can small LLMs beat BERT at classification?</p>
<p>Sometimes. On the hardest task, they actually do: Gemma few-shot edges out DeBERTa on adversarial NLI, the benchmark specifically designed to break models.</p>
<p>DeBERTa-v3 still wins 3 out of 4 tasks when you have training data. And BERT&rsquo;s efficiency advantage is real—~20x faster throughput matters when you&rsquo;re processing millions of documents and paying for compute.</p>
<p>Zero-shot LLMs aren&rsquo;t just a parlor trick either. Qwen2.5-1.5B hits 93.8% on sentiment with zero training examples—that&rsquo;s production-ready without a single label. For cold-start problems, rapidly changing domains, or when you need explanations alongside predictions, they genuinely work.</p>
<p>Hopefully this gives some actual data points for making that call instead of just following the hype cycle.</p>
<p>All the code is <a href="https://github.com/alexjacobs08/beatingBERT">on GitHub</a>. Go run your own experiments.</p>
<hr>
<p><em>Surely I&rsquo;ve made some embarrassing mistakes here. Don&rsquo;t just tell me—tell everyone! Share this post on your favorite social media with your corrections :)</em></p>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
