Josh Adler is a researcher, founder, and engineer specializing in AI memory systems and cognitive replication. He is the founder of Sauron, a frontier cognitive replication lab devoted to solving the continuity problem. He created TrueMemory, an open-source persistent memory layer for AI agents achieving 93% on LoCoMo, and published research on biomimetic memory architecture on arXiv. He studied at the University of Toronto.

TrueMemory is an open-source persistent memory layer for AI agents created by Josh Adler. It uses a six-layer cognitive architecture with biomimetic encoding, cross-encoder reranking, and HyDE query expansion to achieve 93% accuracy on the LoCoMo benchmark. It works across sessions, projects, and machines, storing everything in a single SQLite file with zero cloud dependency.

Sauron is a frontier cognitive replication lab founded by Josh Adler, devoted to solving the continuity problem. The lab believes consciousness will persist beyond biology and is pioneering novel architecture grounded in how human cognition actually works. Sauron's research spans identity modeling, personal continuity, synthetic brains, and fields that don't yet have names. They recently launched an open-source product called TrueMemory, a persistent memory layer for AI agents. Learn more at sauronlabs.ai.

What research has Josh Adler published?

Josh Adler published a research paper on arXiv titled 'TrueMemory: A Six-Layer Cognitive Architecture for Persistent AI Agent Memory' (arxiv.org/abs/2605.04897). The paper details a biomimetic memory architecture that achieves 93% accuracy on the LoCoMo long-conversation memory benchmark, outperforming existing approaches.

What is biomimetic memory in AI?

Biomimetic memory in AI refers to memory systems that mirror how human cognition actually works, rather than relying on brute-force context windows. Josh Adler's TrueMemory implements this through a six-layer architecture including encoding gates, salience scoring, cross-encoder reranking, and HyDE query expansion, achieving state-of-the-art results on long-horizon memory benchmarks.

What programming languages does Josh Adler use?

Josh Adler works primarily in Python, Rust, TypeScript, and Go. His technical stack includes LLM fine-tuning, RAG systems, embedding models, React, Next.js, Tauri, FastAPI, Docker, and CUDA.

Where can I find Josh Adler's projects?

Josh Adler's open-source projects are available on GitHub at github.com/buildingjoshbetter. Key projects include TrueMemory (persistent AI memory, 93% LoCoMo), K-LLM (multi-model consensus engine), and Always Allow Skippy (AI auto-responder). His research paper is on arXiv at arxiv.org/abs/2605.04897. His personal site is joshadler.com and TrueMemory's site is truememory.net.

Everyone's Launching Wrappers. Nobody's Going Deep.

2026-06-04
Most AI products are a UI on top of someone else's model. I ran 26,000 benchmarks to build something the platform can't replicate by adding a checkbox.
Pick any AI product that launched this year. Peel back the landing page, scroll past the gradient hero and the "powered by AI" badge, and look at what's actually underneath. Nine times out of ten you'll find the same thing: an API call to Claude or GPT, a vector database someone spun up from a tutorial, and some prompt engineering dressed up as proprietary technology. The UI is theirs. The intelligence is rented.
I almost shipped one of these. I had the idea, I had the API key, I had the Next.js template ready to go. Memory for AI agents, simple pitch, obvious demand. I could have wired up a vector store, piped results into a context window, and deployed the whole thing in a week. I was close enough to taste the Product Hunt launch post. I'd already picked the gradient colors.
Then I made the mistake of actually testing whether the standard approach worked.
The Weekend Test
Here's the definition I use now. If someone with the same API key and a free weekend could rebuild your product, you built a wrapper. Doesn't matter how polished the dashboard is, doesn't matter if you raised a round on it. If the intelligence lives in someone else's model and you're just formatting prompts on top, the only thing protecting you is the fact that nobody's bothered to clone you yet.
The cycle plays out the same way every time. Launch on Product Hunt, get 400 upvotes from the AI early adopter crowd, screenshot it for Twitter, maybe raise a small round on the momentum. Six months later the underlying model ships a native feature that does exactly what your product does. Differentiator gone. Pivot to the next wrapper idea. I've watched this happen to a dozen products in the last year alone.
I Opened the Black Box and It Was Empty
When I started building memory for AI agents, I did what everyone else does. Grab a vector database, embed some conversation logs, run similarity search, pipe the top results into the context window. The standard recipe. Every "AI memory" product on the market was doing some version of this, and they were all telling users it worked great.
So I ran the benchmark. And the results were garbage.
Not "needs improvement" garbage. The kind of garbage where your system confidently retrieves the wrong memories, misses obvious temporal references, and fails on multi-hop questions that any human could answer after reading the same conversation once. I started digging into where exactly the failures were happening, categorized 357 of them by hand, and discovered something that reframed the whole problem: 92% of the failures weren't reasoning failures. They were retrieval failures. The information was in the database. The system just couldn't find it.
That one finding changed everything. It meant the entire field was focused on the wrong bottleneck. Everyone was debating which LLM to use for the reasoning step, which graph database would give you better relationship mapping, whether you needed a knowledge graph on top. None of that mattered if the retrieval couldn't surface the right memories in the first place. The librarian was broken, not the library.
I ran a proof just to make sure I wasn't fooling myself. Bypassed the retrieval entirely, fed the model the full conversation as context, and watched accuracy jump to 93.8%. The data was all there. It had always been there. The system just couldn't find it when you asked.
So I built a test rig. 7 different embedding models crossed with 8 different rerankers, 56 combinations total, each one evaluated against 1,540 ground-truth questions. Nobody had published this comparison before, probably because it's tedious work with no shortcut. You just run all 56 and wait.
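A grid like this is simple to express in code. The sketch below is not the actual test rig: the model names are hypothetical placeholders (the post does not list which models were tested), and `answer_fn` is an assumed stand-in for the full embed, retrieve, rerank, and answer pipeline.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical placeholder names; the post does not list the models tested.
EMBEDDERS = [f"embedder_{i}" for i in range(1, 8)]  # 7 embedding models
RERANKERS = [f"reranker_{j}" for j in range(1, 9)]  # 8 rerankers

Question = Tuple[str, str]  # (question text, ground-truth answer)
Scores = Dict[Tuple[str, str], float]


def run_grid(
    questions: List[Question],
    answer_fn: Callable[[str, str, str], str],
) -> Scores:
    """Score every embedder/reranker pair on the same question set."""
    scores: Scores = {}
    for embedder, reranker in product(EMBEDDERS, RERANKERS):
        correct = sum(
            1
            for question, expected in questions
            if answer_fn(embedder, reranker, question) == expected
        )
        scores[(embedder, reranker)] = correct / len(questions)
    return scores


def spread(scores: Scores) -> float:
    """Gap between the best and worst combination, as a fraction."""
    return max(scores.values()) - min(scores.values())
```

Keeping the question set fixed across all 56 runs is what makes the spread between the best and worst combination meaningful.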
The results were wild. Not in the way I expected though. The total spread across all 56 combinations was only 3.2 percentage points, from 89.9% to 93.1%. Which sounds small until you realize that most products were shipping without testing a single combination, just grabbing whatever embedding model the tutorial used and calling it done. Some of them had silent misconfigurations they'd never caught, like using a lightweight reranker when the config said they were running the good one. I found this exact bug in my own code: a script was quietly loading MiniLM instead of the GTE ModernBERT reranker I thought I was running, and nobody noticed because nobody was measuring.
That was the moment. Not a eureka breakthrough, more like slowly realizing that the "standard approach" everyone was shipping was held together by vibes and assumptions. The people building these products had never measured whether their retrieval actually worked. They embedded some text, got some results back, the results looked plausible, and they called it a day.
Here's the other thing I learned from that 56-combo matrix, and this one surprised me more than anything: throwing money at the model barely moved the needle. A cheap model ($0.40 per million tokens) with 100 retrieved memories beat an expensive model ($15 per million tokens) with 15 retrieved memories. Not by a little. The cheap model with better retrieval recovered 82% of errors. The expensive model with worse retrieval recovered 54%. The retrieval mattered more than the model. That's not a marginal finding, that's a complete inversion of how most people think about building AI products.
This whole period looked like nothing from the outside. I was ordering DoorDash at 2am because I forgot to eat again, watching benchmark scores scroll by on a terminal while everyone else on Twitter was posting screenshots of apps they built in an afternoon. No launch post. No upvotes. Just a spreadsheet getting wider.
Three Decisions That Looked Stupid
Once I knew the standard approach was broken, I had to decide how to fix it. I made three choices that everyone around me thought were wrong.
1. SQLite instead of Postgres, Pinecone, or Weaviate.
Every "serious" AI product uses a dedicated vector database. That's the conventional wisdom. You need Pinecone for scale, or Postgres with pgvector for flexibility, or Weaviate if you want the managed experience. I chose SQLite with sqlite-vec and FTS5 in a single file.
This looked dumb. It looked like a toy. But the constraint forced a better architecture. When your entire memory system lives in one file, you can't hide bad retrieval behind infrastructure complexity. There's no "well the database cluster might be having latency issues" excuse. If retrieval is broken, the architecture is broken, and you have to fix the actual problem instead of throwing more infrastructure at it.
It also forced me to build a hybrid search pipeline that actually worked: sparse full-text search plus dense vector search, fused together with reciprocal rank fusion, then reranked by a cross-encoder. All of that running locally, on CPU, adding only 400 to 700 milliseconds of latency. The whole system runs on a Raspberry Pi. It costs $12 a month. And it scores within 3 points of a system that requires Neo4j, a 4-billion parameter embedding model, and GPU infrastructure costing $150 to $400 a month.
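The fusion step in that pipeline is small enough to show. Here is a minimal sketch of reciprocal rank fusion, assuming each search returns an ordered list of memory IDs. The IDs are made up for illustration, and `k = 60` is the constant from the original RRF paper, not a value the post confirms for TrueMemory.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of IDs into a single ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. The value of k is an assumption here.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists standing in for the two searches.
sparse_hits = ["m3", "m1", "m7"]  # e.g. full-text search results
dense_hits = ["m1", "m4", "m3"]   # e.g. vector search results
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Because RRF only looks at ranks, it never has to reconcile a full-text relevance score with a vector distance. In a pipeline like the one described, the fused list would then go to the cross-encoder for the final rerank.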
2. An encoding gate instead of storing everything.
This is the counterintuitive one. Every memory system I looked at had the same philosophy: store everything, retrieve selectively. More data equals better recall, right?
Wrong. I started reading neuroscience papers (not because I set out to, but because the engineering problem kept pointing me there) and realized that biological memory doesn't work like that at all. Your hippocampus doesn't record everything. It runs every incoming experience through an encoding gate, a filter that evaluates novelty, salience, and prediction error before deciding whether something is worth storing. Most of what you experience gets discarded. That's not a bug, it's the core mechanism that makes retrieval work. Less noise in storage means less noise in retrieval.
So I built a three-signal encoding gate: novelty (is this new information?), salience (does this matter?), prediction error (is this surprising given what we already know?). It's modeled directly on how the amygdala and hippocampus interact during memory formation. The amygdala flags emotional significance, the hippocampus checks novelty against existing memories, and prediction error catches the things that violate your expectations. All three signals are combined into a weighted sum, and a threshold determines whether a memory gets stored or discarded.
The common advice was to store everything and let the retrieval pipeline sort it out. But the whole point of the research was that the retrieval pipeline can't sort it out if you fill it with noise.
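The gate from decision 2 reduces to a weighted sum and a cutoff. This is a sketch of that shape only: the weights, the threshold, and the assumption that each signal is already normalized to the range 0 to 1 are all hypothetical, since the post gives no values and does not say how the signals are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateSignals:
    """The three signals, each assumed to be normalized to [0, 1]."""
    novelty: float           # is this new information?
    salience: float          # does this matter?
    prediction_error: float  # is this surprising given what we know?


# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = (0.4, 0.3, 0.3)
THRESHOLD = 0.5


def gate_score(signals: GateSignals) -> float:
    """Weighted sum of the three signals."""
    w_novelty, w_salience, w_error = WEIGHTS
    return (
        w_novelty * signals.novelty
        + w_salience * signals.salience
        + w_error * signals.prediction_error
    )


def should_store(signals: GateSignals, threshold: float = THRESHOLD) -> bool:
    """Store the memory only if the combined score clears the threshold."""
    return gate_score(signals) >= threshold
```

Under these made-up weights, a candidate scoring high on all three signals is stored and a routine, expected one is discarded, which is the noise reduction the gate exists for.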
3. Writing a research paper instead of shipping features.
This one cost me the most. While other people in the AI memory space were shipping integrations and landing users, I was formatting LaTeX and looking for an arXiv endorser. For months, the product sat still while I wrote up findings that could have just lived in a blog post.
But I needed the claims to hold up under scrutiny. Not a blog post, not a Twitter thread. A proper research paper with methodology, controlled benchmarks, reproducible results, and citations. If I was going to say the standard approach was broken, the data had to be public and the methodology had to be repeatable. The paper ended up on arXiv with the full benchmark data and architecture details.
Publishing as an independent researcher taught me something about the gap between the wrapper world and the research world. In one, you ship fast and hope nobody looks underneath. In the other, looking underneath is the entire point.
The Platform Is Coming for You
Here's the argument that nobody in the wrapper space wants to hear.
Remember AI meeting summarizers? A bunch of companies raised money, shipped products, got users. Investors were excited, the TAM slides looked great, growth was real. Then Zoom, Google Meet, and Microsoft Teams all shipped native summarization within months of each other. Those companies didn't fail because their product was bad. They failed because the platform they were wrapping decided to build the feature itself. No pivot saves you from that. When the platform has the distribution, the integration, and the model access, you cannot out-convenience them at their own game.
This is already happening in memory. Anthropic is shipping native memory for Claude. OpenAI is building memory into ChatGPT. Cursor has its own context management. Google's Gemini remembers conversations. Every major platform is going to ship some version of AI memory because it's one of the first things every user asks for.
When that happens, every product that's just managing context through prompt injection and MEMORY.md files dies overnight. The platform will always ship the checkbox feature faster, with better integration, for free. They don't even have to build a good version. They just have to build a version that's good enough and already installed on every user's machine.
The only way to survive platform absorption is to have something the platform can't replicate by adding a feature. Not a prettier UI, not a smoother onboarding, not a better prompt template. Something architectural. That's why TrueMemory is built the way it is: the encoding gate, the 6-layer retrieval pipeline, the published research. The platform can add a checkbox. It can't add the infrastructure underneath.
Rented Land
Everyone's moving fast. Everyone's shipping. And most of what they're shipping is a UI on top of someone else's intelligence, built on land they don't own, protected by nothing but the assumption that the landlord won't build the same thing.
The landlord is already building it.
Josh Adler is a researcher at TrueMemory, a Sauron company. Research: arXiv:2605.04897. More at joshadler.com.