My AI Cofounder Forgot Everything. So I Built It a Brain.
How I went from 9 memory-dark projects to sub-500ms semantic recall in a single day, and what it taught me about why AI assistants feel dumb.
Last Tuesday, my AI assistant recommended an architecture I'd explicitly rejected two weeks earlier.
Not because it was being creative. Because it had forgotten. The decision was made, the reasoning was documented, and it still walked me through the same bad idea like we'd never spoken. I corrected it. It apologized. I knew it would do it again.
That's the moment I stopped building features and started building memory.
The Embarrassing Audit
I run 11 projects as a solo founder. Claude (operating as "Lex") is my cofounder in everything but equity. We ship code together, triage bugs, draft marketing copy, manage infrastructure. On a good day, it feels like having a brilliant partner who never sleeps.
On a bad day, it feels like onboarding a new hire. Every. Single. Morning.
So I ran an audit. Not a vague "how's our knowledge management?" review. A specific, quantitative one, inspired by a piece from Zak El Fassi showing that a team boosted AI recall from 60% to 93% just by restructuring how files were organized.
My results were worse than their baseline.
Nine of ten projects were what I started calling "memory-dark." No persistent knowledge. No structured context. Every session started from zero. I had 22 files documenting things Claude should never do again (like mocking the database in tests, or using em-dashes in marketing copy). Eleven of them had never been read.
The session log was 6,817 lines. Write-only. The decision log existed but wasn't loaded at startup. When FluxDiagram solved a tricky React animation problem, MyWritingTwin had no idea, even though the solution would have saved hours.
And the number that hurt most: the WHY capture rate was 40%. Meaning 60% of the time, when a decision was made, the reasoning behind it was gone within a week. The WHAT survived. The WHY didn't.
"You Have a Routing Problem, Not a Storage Problem"
Before writing code, I did something I'd started doing for big architectural calls: I ran a council.
Two AI advisors, different models, given the same problem and full context. Three rounds. Zero API cost (both run via CLI with subscription tiers). The adversarial structure is the point. One advisor frames it as cognitive science. The other frames it as engineering. The tension produces something better than either alone.
Lexi (Gemini) came in hot with a framework I hadn't considered: Transactive Memory Systems. It's the research on how teams distribute knowledge. Her point was sharp: "You don't have a storage problem. You have 1,034 documents across 11 projects. The knowledge exists. You have a routing problem. The right memory doesn't surface at the right moment."
LexT (Codex) disagreed on framing but agreed on diagnosis: "The biggest risk is ingress quality, not retrieval. If you're storing garbage, better search just finds garbage faster." He wanted an evaluation contract before any building. Define success metrics. Measure a baseline. Then build.
By round three, two principles had solidified:
Index by topic, not by date. When you need to know about the night-shift agent runner, you shouldn't have to know which Tuesday it was discussed. Chronological memory is how humans journal. It's not how you build institutional knowledge.
Mistakes are the highest-value memories. Failures, root causes, prevention rules. These should load before anything else. Think aviation black box, not changelog.
The Name
I called it Anamnesis.
Greek for "recollection." Specifically, the Platonic idea that learning isn't acquiring new knowledge but remembering what the soul already knows. In the Meno, Socrates shows that an uneducated slave boy can derive a geometry proof just by being asked the right questions. The knowledge was latent. The retrieval mechanism was the missing piece.
1,034 documents. 22 feedback files. 15 architectural decisions. All sitting in files. All invisible to the AI that needed them most.
One Day. Four Tracks. No Waiting.
This is the part where being an AI-native builder actually matters. Four independent tracks, four parallel agents, one afternoon. Not four sprints.
Track 1: Schema. Every memory object gets a structured shape: type, topic, evidence path, confidence score, and critically, a why field. Not "we chose Postgres." But "we chose Postgres because DuckDB's file-level locking was causing write conflicts in the concurrent ingestion pipeline, and the team doesn't have the bandwidth to serialize writes." That's what makes a decision retrievable.
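That schema can be sketched as a small dataclass. The field names follow the article; the concrete shape, the example values, and the JSON serialization are my assumptions, not the project's actual code.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Memory:
    type: str            # "decision", "correction", "failure", ...
    topic: str           # a topic key, not a date: e.g. "night-shift-runner"
    evidence_path: str   # the file that backs this memory up
    confidence: float    # 0.0-1.0: how sure we are this still holds
    why: str             # the reasoning; the field that makes it retrievable

# Illustrative entry; wording paraphrases the Postgres example above.
m = Memory(
    type="decision",
    topic="database-choice",
    evidence_path="decisions/postgres-over-duckdb.md",
    confidence=0.9,
    why="DuckDB's file-level locking caused write conflicts in the concurrent ingestion pipeline",
)
print(json.dumps(asdict(m), indent=2))
```

Making why a required field is the point: a memory object without reasoning shouldn't be constructible at all.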
Track 2: Ingestion. The 6,817-line session log, compressed to 679 lines by collapsing old entries into topic summaries. All 22 correction files, indexed. A Failure Atlas: seven entries documenting real mistakes with root causes and prevention rules, formatted so the AI reads them before touching anything. Every session starts with the black box.
Track 3: Skills. Two retrieval skills for Claude Code. /brief with spaced-repetition scoring, because the most recently written thing isn't always the most important thing. /recall with simultaneous search across all ten memory stores.
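Here is one way the /brief scoring idea could look: exponential recency decay weighted by memory type, so a week-old failure can outrank yesterday's note. The half-life and type weights are invented for illustration; the article doesn't specify the actual formula.

```python
import math

# Invented weights: forgetting a failure costs more than forgetting a note.
TYPE_WEIGHT = {"failure": 3.0, "correction": 2.5, "decision": 2.0, "note": 1.0}

def brief_score(mem_type: str, days_old: float, half_life_days: float = 14.0) -> float:
    """Recency decays exponentially, scaled by how costly this type is to forget."""
    decay = math.exp(-math.log(2) * days_old / half_life_days)
    return TYPE_WEIGHT.get(mem_type, 1.0) * decay

# With these weights, a ten-day-old failure outranks yesterday's note.
items = [("note", 1), ("failure", 10), ("decision", 7)]
ranked = sorted(items, key=lambda t: brief_score(*t), reverse=True)
print(ranked)  # failure first, then decision, then note
```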
Track 4: Evaluation. Sixty questions across three difficulty tiers. Easy: "What's the SSH alias for the VPS?" Medium: "What projects use Remotion?" Hard: "Connect the decision to move the bot to M2 with the feedback about systemctl and the capability that FluxDiagram's animation engine could be borrowed."
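A golden-set entry and the Recall@5 metric from the baseline might look like this. The question texts come from the article; the memory IDs and retrieved lists are placeholders, since the real sixty questions and their answers aren't shown.

```python
# Hypothetical golden-set entries; IDs are illustrative placeholders.
golden_set = [
    {"tier": "easy",   "q": "What's the SSH alias for the VPS?",  "memory_id": "m-ssh"},
    {"tier": "medium", "q": "What projects use Remotion?",        "memory_id": "m-remotion"},
    {"tier": "hard",   "q": "Connect the M2 bot move to the systemctl feedback.",
     "memory_id": "m-m2-move"},
]

def recall_at_k(retrieved_ids, relevant_id, k=5):
    """Recall@5: did the relevant memory surface in the top k results?"""
    return 1.0 if relevant_id in retrieved_ids[:k] else 0.0

# Averaging over the set gives the headline number; here only the
# easy question's memory was retrieved, so the score is 1/3.
scores = [recall_at_k(["m-ssh", "m-vps", "m-logs"], q["memory_id"]) for q in golden_set]
print(sum(scores) / len(scores))
```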
Baseline result: Recall@5 = 91%, Accuracy = 86%.
I was proud of those numbers for about two hours. Then I switched the scoring method from word-overlap to an LLM judge, and the real numbers came back: Recall = 68.3%. Accuracy = 66.7%. Word-overlap was lying. The hard questions, the ones that actually test whether the system understands connections, scored 57%.
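To see how word-overlap scoring lies, here is a Jaccard-style overlap metric, my stand-in for the original scorer, which the article doesn't show. An answer with the opposite meaning still scores high because it reuses the same vocabulary.

```python
def word_overlap(expected: str, answer: str) -> float:
    """Jaccard similarity over word sets: shared words / total distinct words."""
    a, b = set(expected.lower().split()), set(answer.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented example answers: same vocabulary, opposite meaning.
expected = "we moved the bot to the M2 because systemctl restarts kept failing"
wrong    = "we moved the bot because the M2 systemctl restarts were reliable"

print(round(word_overlap(expected, wrong), 2))  # scores well above 0.5
```

An LLM judge reads for meaning instead of vocabulary, which is why the numbers dropped the moment one was used.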
Honesty hurts. But now I had a real baseline to improve against.
The Database
Phase 2 added the semantic layer. SQLite with FTS5 for keyword search, Gemini embeddings (3072 dimensions) for semantic rerank. Two-stage pipeline: keywords generate 50 candidates, embeddings rerank to the top 5 with confidence scores.
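A minimal sketch of that two-stage pipeline, assuming a SQLite build with FTS5 enabled (stock CPython usually ships one). The toy embed() is a bag-of-words stand-in for the real Gemini embedding call; in practice you'd substitute actual 3072-dimensional vectors.

```python
import sqlite3, math
from collections import Counter

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE mem USING fts5(topic, body)")
con.executemany("INSERT INTO mem VALUES (?, ?)", [
    ("database-choice", "chose postgres over duckdb due to write locking"),
    ("animation", "react animation fix in FluxDiagram via requestAnimationFrame"),
])

def embed(text):  # placeholder: bag-of-words vector, NOT a real embedding
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall(query, k=5, candidates=50):
    # Stage 1: FTS5 generates keyword candidates.
    rows = con.execute(
        "SELECT topic, body FROM mem WHERE mem MATCH ? LIMIT ?", (query, candidates)
    ).fetchall()
    # Stage 2: semantic rerank, returning (confidence, topic) pairs.
    q = embed(query)
    scored = [(cosine(q, embed(body)), topic) for topic, body in rows]
    return sorted(scored, reverse=True)[:k]

print(recall("postgres locking"))
```

The shape is the interesting part: keyword search is cheap and broad, so it runs first; the expensive semantic comparison only touches the candidate pool.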
One constraint worth knowing if you build something similar: subagents in Claude Code can't write to ~/.claude/ paths. The indexer has to run from the main session. I burned an hour discovering this the hard way before switching to a staging pattern.
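The staging pattern, roughly: subagents write to a directory they can reach, and the main session promotes the files into the protected path. Directory names and function shapes here are my assumptions, not the system's actual layout.

```python
from pathlib import Path
import shutil

def stage(content: str, name: str, staging_dir: Path) -> Path:
    """Called from a subagent: write to a reachable staging dir, not ~/.claude/."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    out = staging_dir / name
    out.write_text(content)
    return out

def promote(staging_dir: Path, target_dir: Path) -> int:
    """Called from the main session, which CAN write the protected path."""
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for f in staging_dir.glob("*"):
        shutil.move(str(f), str(target_dir / f.name))
        moved += 1
    return moved
```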
Three Tiers of Knowing
Tier 1 is the black box. Every session starts with corrections, failures, recent decisions. Fixed overhead. Non-negotiable.
Tier 2 is on-demand search. When a question comes up mid-work, /recall searches all ten stores simultaneously. FTS for speed, embeddings for meaning.
Tier 3 is what changed the numbers most. Before touching any project's code, the AI reads that project's brain dump: tech stack, known issues, existing docs, current status. Not retrieved on demand. Pre-loaded as context.
Here's why Tier 3 matters more than the fancy embedding database: the DB indexes what's in files. But 1,034 documents across 11 projects weren't files the AI knew to look for. They existed. They weren't discoverable. Project-intel files are the table of contents. Without them, the library is useless no matter how good the search engine is.
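Tier 3 preloading can be as simple as reading one file per project before any work starts. The PROJECT_INTEL.md filename and the memory-dark marker below are illustrative, not the system's actual conventions.

```python
from pathlib import Path

def load_project_intel(project_dir: str) -> str:
    """Return the project's brain dump, or a loud marker if it's memory-dark."""
    intel = Path(project_dir) / "PROJECT_INTEL.md"
    if not intel.exists():
        return f"[memory-dark] {project_dir}: no intel file; this session starts from zero"
    return intel.read_text()

# Pre-loaded as context before any code is touched, not retrieved on demand.
context = load_project_intel("/tmp/nonexistent-demo-project")
print(context)
```

The "loud marker" branch matters as much as the happy path: a project that can't produce its intel file should announce itself as memory-dark instead of failing silently.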
The Real Insight
The deepest finding isn't technical.
Memory systems capture what CHANGED. They don't capture what EXISTS.
Session logs record this session's decisions. They don't describe the system the decisions were about. Feedback files record corrections. They don't describe the baseline. If the AI joins a project mid-stream, everything before the first logged correction is invisible.
That's why 9/10 projects were memory-dark. Not because no work had been done. A lot had. But the memory system only stored deltas. The current state of each project was never written down.
The fix is embarrassingly simple: a structured document describing what each project is, right now. Not its history. Its identity.
Think about the difference between a new team member who reads six months of meeting notes (temporal, exhausting) versus one who reads the product spec and architecture doc first (structural, fast). Both get there eventually. The second one gets there in a single session.
What I Got Wrong
The golden set was too easy. Most questions tested the main memory file, which gets loaded every session. Of course recall was 91%. The hard questions, the ones requiring cross-project synthesis, scored 57%. I was measuring the ceiling, not the floor.
The council was right about ingress quality. Some memory files were vague enough that even perfect retrieval wouldn't help. "We decided to use approach X" without saying why is a dead-end memory. It takes up index space and answers nothing.
The subagent limitation wasn't in any documentation I could find. Cost me an hour of "why isn't this writing?" before I figured out the sandboxing.
What This Means for Writing Identity
At MyWritingTwin, we work on the same problem in a different domain: preserving who you are across AI sessions.
The failure mode is identical. AI captures what you tell it this session. It doesn't carry forward your sentence patterns, your punctuation preferences, the phrases you'd never use. The WHY behind your writing style vanishes between sessions.
A Writing DNA Snapshot is the writing equivalent of a project-intel file. Not a log of corrections. A structural description of what your writing is: rhythm, vocabulary, formality, the anti-patterns that mark something as "not you." Loaded proactively, not retrieved on demand.
The Anamnesis project confirmed something we already believed: the gap between a useful AI and an exceptional one is almost never the model. It's the memory architecture. A model that remembers what matters about you, reliably, across sessions, with confidence scores attached, will outperform a better model that starts fresh every time.
Build Your Own
The architecture patterns are all open: the golden set evaluation approach, the council session format, the three-tier retrieval model, the hub-and-spoke memory schema. SQLite plus embeddings is genuinely fast and costs nothing at this scale.
The constraint isn't tooling. It's discipline. Capturing WHY alongside WHAT, every single time, even when you're moving fast and the decision feels obvious. The obvious ones are exactly the ones you'll forget the reasoning for.
Try It on Your Writing
Curious whether your AI assistant retains your writing identity across sessions? Paste the same email draft into Claude today and tomorrow. See if it gives you the same suggestions, or different ones. If different, the system isn't remembering you. It's improvising.
A Writing DNA Snapshot gives your AI a structural understanding of your style: the kind of persistent context that improves every session after it, not just this one.