<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Michael Cabaza</title>
    <description>The latest articles on Forem by Michael Cabaza (@bozbuilds).</description>
    <link>https://forem.com/bozbuilds</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866200%2Fb287c05d-3870-4818-b1a2-bb69122f4959.jpeg</url>
      <title>Forem: Michael Cabaza</title>
      <link>https://forem.com/bozbuilds</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bozbuilds"/>
    <language>en</language>
    <item>
      <title>Perfect Retrieval Recall on the Hardest AI Memory Benchmark — Running Fully Local</title>
      <dc:creator>Michael Cabaza</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:28:10 +0000</pubDate>
      <link>https://forem.com/bozbuilds/perfect-retrieval-recall-on-the-hardest-ai-memory-benchmark-running-fully-local-5dhc</link>
      <guid>https://forem.com/bozbuilds/perfect-retrieval-recall-on-the-hardest-ai-memory-benchmark-running-fully-local-5dhc</guid>
      <description>&lt;p&gt;We've been benchmarking Aingram's hybrid retrieval pipeline against LongMemEval, the most rigorous public benchmark for long-term memory in AI chat assistants. This post covers the retrieval-only results — before any LLM generation step — because we think they tell an important story about where memory system failures actually come from.&lt;/p&gt;




&lt;h2&gt;Background: What LongMemEval Tests&lt;/h2&gt;

&lt;p&gt;LongMemEval (Wu et al., ICLR 2025) is a benchmark of 500 hand-curated questions embedded in scalable user-assistant chat histories. The LongMemEval-S split gives each question a history of approximately 115,000 tokens (~40 sessions). Questions span five memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. The standard evaluation is end-to-end: ingest the conversation history, retrieve relevant sessions, pass them to an LLM, generate an answer, and score with an LLM judge. Most published numbers (Zep: 71.2%, Emergence AI: 86%) are end-to-end accuracy.&lt;/p&gt;

&lt;p&gt;But LongMemEval also includes oracle metadata: ground-truth labels for which sessions contain the answer. That means you can measure pure retrieval quality separately from LLM reasoning quality. We think this distinction matters a lot.&lt;/p&gt;




&lt;h2&gt;The Oracle Run: Establishing the Retrieval Ceiling&lt;/h2&gt;

&lt;p&gt;We first ran Aingram's retrieval pipeline against longmemeval_oracle.json, which contains only the evidence sessions — a direct measure of whether our hybrid retrieval can find the right material.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Score&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;ndcg_any@1&lt;/td&gt;&lt;td&gt;0.976&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;ndcg_any@10&lt;/td&gt;&lt;td&gt;&lt;strong&gt;0.994&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_any@1&lt;/td&gt;&lt;td&gt;0.976&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_any@3&lt;/td&gt;&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_any@10&lt;/td&gt;&lt;td&gt;1.000&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_all@10&lt;/td&gt;&lt;td&gt;1.000&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Median latency&lt;/td&gt;&lt;td&gt;22ms&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;recall_any@3 = 1.000&lt;/strong&gt; across all 500 queries. The relevant session appeared in the top 3 results for every single question. At rank 10, all relevant sessions were present for every query. This tells us something specific: Aingram's retrieval component is not the bottleneck for end-to-end performance on this benchmark. Whatever end-to-end accuracy we achieve is bounded by LLM reasoning quality over the retrieved context, not by whether the right sessions were found.&lt;/p&gt;
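&lt;p&gt;For clarity, here is how we interpret the two recall variants over session ids; this is a minimal sketch with names of our own choosing, not Aingram's actual evaluation harness:&lt;/p&gt;

```python
def recall_any_at_k(retrieved, relevant, k):
    """1.0 if at least one ground-truth session id appears in the top-k results."""
    return 1.0 if set(relevant).intersection(retrieved[:k]) else 0.0

def recall_all_at_k(retrieved, relevant, k):
    """1.0 only if every ground-truth session id appears in the top-k results."""
    return 1.0 if set(relevant).issubset(retrieved[:k]) else 0.0
```

&lt;p&gt;The benchmark-level score is the mean of these per-query values across all 500 questions.&lt;/p&gt;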




&lt;h2&gt;The Real Benchmark: LongMemEval-S&lt;/h2&gt;

&lt;p&gt;The oracle split is an upper bound. LongMemEval-S is the real test: 500 instances with full noisy conversation histories, no hints about which sessions matter.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Score&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;ndcg_any@10&lt;/td&gt;&lt;td&gt;&lt;strong&gt;0.836&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_any@1&lt;/td&gt;&lt;td&gt;0.759&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_any@3&lt;/td&gt;&lt;td&gt;0.902&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_any@10&lt;/td&gt;&lt;td&gt;&lt;strong&gt;0.955&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;recall_all@10&lt;/td&gt;&lt;td&gt;0.883&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Median latency&lt;/td&gt;&lt;td&gt;27ms&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;recall_any@10 = 0.955&lt;/strong&gt;: the relevant session appears in the top 10 results for 95.5% of queries. The gap down to recall_any@1 (0.759) tells you that the correct session isn't always ranked first — but it's almost always present within the first 10 results.&lt;/p&gt;
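&lt;p&gt;The ndcg_any@10 figure captures exactly this rank sensitivity: a relevant session at rank 1 earns full credit, one at rank 10 much less. Under a binary-relevance reading (our assumption about how the "any" variant scores; the official harness may differ), it can be sketched as:&lt;/p&gt;

```python
import math

def ndcg_any_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: each relevant hit earns credit that decays
    logarithmically with its rank; normalized by the ideal ordering."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in rel)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(rel)) + 1))
    return dcg / ideal if ideal else 0.0
```

&lt;p&gt;A query whose single evidence session lands at rank 2 instead of rank 1 scores about 0.63 rather than 1.0, which is why ndcg_any@10 sits below recall_any@10.&lt;/p&gt;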




&lt;h2&gt;What This Means for End-to-End Performance&lt;/h2&gt;

&lt;p&gt;Zep's published end-to-end accuracy of 71.2% (using gpt-4o) and Emergence AI's 86% (using gpt-4o-2024-08-06) combine retrieval and LLM generation. Neither has published retrieval-only numbers. Here's the key relationship: end-to-end accuracy cannot exceed retrieval recall. If the correct session isn't retrieved, no LLM can answer the question correctly. A system with recall_any@10 = 0.71 can achieve at best 71% end-to-end accuracy, no matter how capable the LLM is. Aingram's recall_any@10 of 0.955 means the ceiling on end-to-end accuracy is set by LLM reasoning over the retrieved context, not by retrieval failure: the system puts the right material in front of the LLM 95.5% of the time.&lt;/p&gt;
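&lt;p&gt;To make the bound concrete (the 0.90 LLM accuracy below is a purely hypothetical figure for illustration, not a measured result):&lt;/p&gt;

```python
def end_to_end_ceiling(retrieval_recall_at_k, llm_accuracy_given_context):
    """End-to-end accuracy is capped by retrieval recall times the LLM's
    accuracy when the right sessions are actually in its context."""
    return retrieval_recall_at_k * llm_accuracy_given_context

# recall_any@10 = 0.955 with a hypothetical LLM that answers 90% of
# questions correctly when handed the right sessions:
ceiling = end_to_end_ceiling(0.955, 0.90)  # about 0.86
```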




&lt;h2&gt;Retrieval Architecture&lt;/h2&gt;

&lt;p&gt;The recall numbers above come from Aingram's hybrid retrieval pipeline, which combines three signals via Reciprocal Rank Fusion (RRF):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FTS5 full-text search&lt;/strong&gt; — keyword matching, fast, effective for exact terminology&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqlite-vec vector search&lt;/strong&gt; — semantic similarity via nomic-embed-text-v1.5 (ONNX, 768 dims)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge graph traversal&lt;/strong&gt; — entity relationships, multi-hop connections via CTE&lt;/li&gt;
&lt;/ul&gt;
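&lt;p&gt;The fusion step itself is simple. A minimal sketch of RRF over the three ranked lists follows; the constant k=60 is the common default from the original RRF paper, not necessarily Aingram's setting:&lt;/p&gt;

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists with Reciprocal Rank Fusion.
    Each ranking is a list of session ids, best first; a session's fused
    score is the sum of 1/(k + rank) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;A session that shows up in two or three of the lists accumulates score from each, so agreement between keyword, vector, and graph signals pushes it toward the top even when no single signal ranks it first.&lt;/p&gt;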

&lt;p&gt;This is the open-source Lite pipeline. Everything runs locally on SQLite — no external services, no cloud round-trip, no vector database to manage. Median retrieval latency is 22ms on an RTX 4060 8GB (measured on the oracle evaluation run with no caching layer active). The Pro tier adds a GPU-resident neural retrieval cache that shortcuts the full pipeline for high-confidence queries, keeping latency flat as memory grows. It doesn't change retrieval quality — the recall numbers here reflect the Lite pipeline alone.&lt;/p&gt;
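&lt;p&gt;For a feel of the FTS5 leg, here is a minimal illustrative version (assuming a Python build whose bundled SQLite has FTS5 enabled; the table schema and session contents are invented for this example and are not Aingram's actual schema):&lt;/p&gt;

```python
import sqlite3

# Keyword leg only; the sqlite-vec and knowledge-graph legs are omitted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE sessions USING fts5(session_id, content)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("s1", "user asked about a marathon training plan"),
     ("s2", "assistant explained sourdough starter hydration"),
     ("s3", "follow-up on marathon taper week")],
)
# FTS5 exposes a built-in BM25-based rank; lower rank sorts better.
rows = conn.execute(
    "SELECT session_id FROM sessions WHERE sessions MATCH ? ORDER BY rank",
    ("marathon",),
).fetchall()
ranking = [r[0] for r in rows]
```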




&lt;h2&gt;Honest Caveats&lt;/h2&gt;

&lt;p&gt;These are retrieval metrics, not end-to-end accuracy. The comparison to Zep's 71.2% or Emergence AI's 86% requires running the full QA pipeline — which we're doing and will publish separately. The oracle run's perfect recall also reflects that oracle sessions are curated to be the exact evidence needed. LongMemEval-S is substantially harder because you're searching through ~40 sessions of noise to find 1–3 relevant ones.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Aingram v1.1.0-alpha | RTX 4060 8GB | nomic-embed-text-v1.5&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Aingram is a local-first, privacy-preserving shared memory layer for AI agent teams/swarms. The retrieval pipeline described here is the open-source Lite tier — the recall numbers reflect what the Lite architecture delivers on its own. We'll be publishing more benchmark results and opening up early access soon.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
