2026-06-04

Everyone's Launching Wrappers. Nobody's Going Deep.

Most AI products are a UI on top of someone else's model. I ran 26,000 benchmarks to build something the platform can't replicate by adding a checkbox.


Pick any AI product that launched this year. Peel back the landing page, scroll past the gradient hero and the "powered by AI" badge, and look at what's actually underneath. Nine times out of ten you'll find the same thing: an API call to Claude or GPT, a vector database someone spun up from a tutorial, and some prompt engineering dressed up as proprietary technology. The UI is theirs. The intelligence is rented.

I almost shipped one of these. I had the idea, I had the API key, I had the Next.js template ready to go. Memory for AI agents, simple pitch, obvious demand. I could have wired up a vector store, piped results into a context window, and deployed the whole thing in a week. I was close enough to taste the Product Hunt launch post. I'd already picked the gradient colors.

Then I made the mistake of actually testing whether the standard approach worked.

The Weekend Test

Here's the definition I use now. If someone with the same API key and a free weekend could rebuild your product, you built a wrapper. Doesn't matter how polished the dashboard is, doesn't matter if you raised a round on it. If the intelligence lives in someone else's model and you're just formatting prompts on top, the only thing protecting you is the fact that nobody's bothered to clone you yet.

The cycle plays out the same way every time. Launch on Product Hunt, get 400 upvotes from the AI early adopter crowd, screenshot it for Twitter, maybe raise a small round on the momentum. Six months later the underlying model ships a native feature that does exactly what your product does. Differentiator gone. Pivot to the next wrapper idea. I've watched this happen to a dozen products in the last year alone.

I Opened the Black Box and It Was Empty

When I started building memory for AI agents, I did what everyone else does. Grab a vector database, embed some conversation logs, run similarity search, pipe the top results into the context window. The standard recipe. Every "AI memory" product on the market was doing some version of this, and they were all telling users it worked great.

So I ran the benchmark. And the results were garbage.

Not "needs improvement" garbage. The kind of garbage where your system confidently retrieves the wrong memories, misses obvious temporal references, and fails on multi-hop questions that any human could answer after reading the same conversation once. I started digging into where exactly the failures were happening, categorized 357 of them by hand, and discovered something that reframed the whole problem: 92% of the failures weren't reasoning failures. They were retrieval failures. The information was in the database. The system just couldn't find it.

That one finding changed everything. It meant the entire field was focused on the wrong bottleneck. Everyone was debating which LLM to use for the reasoning step, which graph database would give you better relationship mapping, whether you needed a knowledge graph on top. None of that mattered if the retrieval couldn't surface the right memories in the first place. The librarian was broken, not the library.

I ran one more check just to make sure I wasn't fooling myself. I bypassed retrieval entirely, fed the model the full conversation as context, and watched accuracy jump to 93.8%. The data was all there. It had always been there. The retrieval step just wasn't surfacing it when you asked.
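The split between retrieval failures and reasoning failures can be roughed out in code. This is only a sketch of one possible first pass, not the hand categorization described above: it assumes each benchmark question comes with the ids of the memories that contain the answer, and the names classify_failure and retrieval_failure_share are made up for illustration.

```python
from typing import Iterable, List, Set, Tuple


def classify_failure(gold_ids: Iterable[str], retrieved_ids: Iterable[str]) -> str:
    """Label a wrong answer by whether the evidence ever reached the model."""
    gold: Set[str] = set(gold_ids)
    retrieved: Set[str] = set(retrieved_ids)
    if not gold:
        raise ValueError("each question needs at least one gold evidence id")
    # Multi-hop questions need every supporting memory, so require all of them.
    if gold <= retrieved:
        return "reasoning"  # evidence was in context, the model still got it wrong
    return "retrieval"      # at least one needed memory never made it into context


def retrieval_failure_share(failures: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Fraction of failed questions where retrieval missed the evidence."""
    labels = [classify_failure(gold, retrieved) for gold, retrieved in failures]
    return labels.count("retrieval") / len(labels)


# Toy data (hypothetical ids): (gold evidence ids, ids the retriever returned)
failures = [
    ({"m1"}, ["m4", "m9"]),        # evidence missing entirely
    ({"m2", "m3"}, ["m2", "m8"]),  # one hop of a multi-hop question missing
    ({"m5"}, ["m5", "m6"]),        # evidence present, answer still wrong
]
share = retrieval_failure_share(failures)
print(f"retrieval failures: {share:.0%}")
```

A check like this only flags candidates. Whether a retrieved memory was actually usable is still a judgment call, which is why reading the failures by hand matters.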

So I built a test rig. 7 different embedding models crossed with 8 different rerankers, 56 combinations total, each one evaluated against 1,540 ground-truth questions. As far as I could tell, nobody had done this before, because honestly it's tedious, boring work. There's no shortcut: you just run all 56 and wait.
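The shape of a rig like that is simple, which is part of why it's tedious rather than hard. Here's a minimal sketch, assuming a scoring function that answers one question with one embedder/reranker pair and reports whether it was right. The model names are placeholders and stub_is_correct is a deterministic stand-in, not a real evaluation.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def run_grid(
    embedders: Sequence[str],
    rerankers: Sequence[str],
    questions: Sequence[int],
    is_correct: Callable[[str, str, int], bool],
) -> Dict[Tuple[str, str], float]:
    """Score every embedder/reranker pair on the same question set."""
    if not questions:
        raise ValueError("need at least one question")
    scores: Dict[Tuple[str, str], float] = {}
    for embedder, reranker in product(embedders, rerankers):
        hits = sum(1 for q in questions if is_correct(embedder, reranker, q))
        scores[(embedder, reranker)] = hits / len(questions)
    return scores


# Placeholder names: the actual models in the benchmark are not listed here.
embedders = [f"embedder-{i}" for i in range(7)]
rerankers = [f"reranker-{j}" for j in range(8)]
questions = list(range(1540))  # stand-ins for the ground-truth questions


def stub_is_correct(embedder: str, reranker: str, question: int) -> bool:
    # Deterministic stub so the sketch runs; swap in a real answer-and-grade step.
    return (question + len(embedder) + len(reranker)) % 10 != 0


scores = run_grid(embedders, rerankers, questions, stub_is_correct)
best_pair = max(scores, key=scores.get)
spread = max(scores.values()) - min(scores.values())
print(len(scores), "combinations; best:", best_pair, "spread:", round(spread, 4))
```

The important design choice is that every pair sees the identical question set, so differences in score come from the models and not from the sample.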

The results were wild, though not in the way I expected. The total spread across all 56 combinations was only 3.2 percentage points, from 89.9% to 93.1%. That sounds small until you realize that most products were shipping without testing a single combination, just grabbing whatever embedding model the tutorial used and calling it done. Some of them had silent misconfigurations they'd never caught, like running a lightweight reranker when the config said they were running the good one. I found this exact bug in my own code: a script was quietly loading MiniLM instead of the GTE ModernBERT reranker I thought I was running, and nobody noticed because nobody was measuring.
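A bug like that is cheap to guard against once you know it can happen. The sketch below is one hypothetical way to do it: compare the reranker name in the config with the name the loaded model reports, and fail loudly on a mismatch. The function name and the identifier strings are illustrative assumptions, not anything from the actual codebase.

```python
class RerankerMismatchError(RuntimeError):
    """Raised when the loaded reranker is not the one the config asked for."""


def _normalize(name: str) -> str:
    # Compare case-insensitively and ignore stray whitespace.
    return name.strip().lower()


def assert_reranker_matches(configured: str, loaded: str) -> None:
    """Fail at startup instead of silently benchmarking the wrong model."""
    if _normalize(configured) != _normalize(loaded):
        raise RerankerMismatchError(
            f"config says {configured!r} but {loaded!r} was loaded"
        )


# Hypothetical identifiers standing in for the two rerankers mentioned above.
try:
    assert_reranker_matches("gte-modernbert-reranker", "minilm-reranker")
except RerankerMismatchError as err:
    print("caught:", err)
```

Calling a check like this right after model load, before any scores are recorded, turns a silent accuracy drop into an immediate crash.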

That was the moment. Not a eureka breakthrough, more like slowly realizing that the "standard approach" everyone was shipping was held together by vibes and assumptions. The people building these products had never measured whether their retrieval actually worked. They embedded some text, got some results back, the results looked plausible, and they called it a day.

Here's the other thing I learned from that 56-combo matrix, and this one surprised me more than anything: throwing money at the model barely moved the needle. A cheap model ($0.40 per million tokens) with 100 retrieved memories beat an expensive model ($15 per million tokens) with 15 retrieved memories. Not by a little. The cheap model with better retrieval recovered 82% of errors. The expensive model with worse retrieval recovered 54%. The retrieval mattered more than the model. That's not a marginal finding, that's a complete inversion of how most people think about building AI products.
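The cost side of that comparison is easy to work through. This is a back-of-the-envelope sketch under loud assumptions: the quoted prices are treated as input-token prices, every memory is the same size (the 50 tokens below is an arbitrary placeholder), and the rest of the prompt and the output tokens are ignored.

```python
def context_cost_usd(
    price_per_million_tokens: float, memories: int, tokens_per_memory: int
) -> float:
    """Cost of the retrieved-memory part of one prompt, in dollars."""
    if price_per_million_tokens < 0 or memories < 0 or tokens_per_memory < 0:
        raise ValueError("inputs must be non-negative")
    return price_per_million_tokens * memories * tokens_per_memory / 1_000_000


TOKENS_PER_MEMORY = 50  # assumption: average memory size, not a measured value

cheap = context_cost_usd(0.40, 100, TOKENS_PER_MEMORY)      # cheap model, 100 memories
expensive = context_cost_usd(15.00, 15, TOKENS_PER_MEMORY)  # expensive model, 15 memories
ratio = cheap / expensive
print(f"cheap: ${cheap:.5f}  expensive: ${expensive:.5f}  ratio: {ratio:.2f}")
```

Under these assumptions the memory size cancels out: the ratio is (0.40 x 100) / (15 x 15) = 40/225, or roughly 0.18. The cheap model with nearly seven times as many memories still spends less than a fifth as much on retrieved context per query.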

This whole period looked like nothing from the outside. I was ordering DoorDash at 2am because I forgot to eat again, watching benchmark scores scroll by on a terminal while everyone else on Twitter was posting screenshots of apps they built in an afternoon. No launch post. No upvotes. Just a spreadsheet getting wider.

Three Decisions That Looked Stupid

Once I knew the standard approach was broken, I had to decide how to fix it. I made three choices that everyone around me thought were wrong.

1. SQLite instead of Postgres, Pinecone, or Weaviate.

Every "serious" AI product uses a dedicated vector database. That's the conventional wisdom. You need Pinecone for scale, or Postgres with pgvector for flexibility, or Weaviate if you want the managed experience. I chose SQLite with sqlite-vec and FTS5 in a single file.

This looked dumb. It looked like a toy. But the constraint forced a better architecture. When your entire memory system lives in one file, you can't hide bad retrieval behind infrastructure complexity. There's no "well the database cluster might be having latency issues" excuse. If retrieval is broken, the architecture is broken, and you have to fix the actual problem instead of throwing more infrastructure at it.

It also forced me to build a hybrid search pipeline that actually worked: sparse full-text search plus dense vector search, fused together with reciprocal rank fusion, then reranked by a cross-encoder. All of that running locally, on CPU, adding only 400 to 700 milliseconds of latency. The whole system runs on a Raspberry Pi. It costs $12 a month. And it scores within 3 points of a system that requires Neo4j, a 4-billion parameter embedding model, and GPU infrastructure costing $150 to $400 a month.
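The fusion step is the least obvious part of that pipeline, so here is a minimal sketch of reciprocal rank fusion on its own. It assumes the sparse and dense searches each return an ordered list of memory ids. The constant k=60 is the value commonly used for RRF, not necessarily the one in this system, and the ids are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse_hits = ["m3", "m1", "m7"]  # e.g. from full-text search
dense_hits = ["m1", "m5", "m3"]   # e.g. from vector search
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)  # ['m1', 'm3', 'm5', 'm7']
```

RRF only looks at rank positions, never at raw scores, so it can combine a full-text relevance score and a vector distance without having to put them on a common scale. The fused list is what would then go to the cross-encoder for reranking.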

2. An encoding gate instead of storing everything.

This is the counterintuitive one. Every memory system I looked at had the same philosophy: store everything, retrieve selectively. More data equals better recall, right?

Wrong. I started reading neuroscience papers (not because I set out to, but because the engineering problem kept pointing me there) and realized that biological memory doesn't work like that at all. Your hippocampus doesn't record everything. It runs every incoming experience through an encoding gate, a filter that evaluates novelty, salience, and prediction error before deciding whether something is worth storing. Most of what you experience gets discarded. That's not a bug, it's the core mechanism that makes retrieval work. Less noise in storage means less noise in retrieval.

So I built a three-signal encoding gate: novelty (is this new information?), salience (does this matter?), and prediction error (is this surprising given what we already know?). It's modeled directly on how the amygdala and hippocampus interact during memory formation. The amygdala flags emotional significance, the hippocampus checks novelty against existing memories, and prediction error catches the things that violate your expectations. All three signals are combined into a weighted sum, and a threshold determines whether a memory gets stored or discarded.
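In code, the weighted-sum-plus-threshold idea is small. The sketch below assumes each signal has already been scored between 0 and 1 by some upstream step. The weights (0.4, 0.3, 0.3) and the 0.5 threshold are placeholder values chosen for illustration, not the tuned values from the real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateWeights:
    # Placeholder weights: assumptions for this sketch only.
    novelty: float = 0.4
    salience: float = 0.3
    prediction_error: float = 0.3


def gate_score(
    novelty: float,
    salience: float,
    prediction_error: float,
    weights: GateWeights = GateWeights(),
) -> float:
    """Weighted sum of the three signals, each expected in [0, 1]."""
    for name, value in (
        ("novelty", novelty),
        ("salience", salience),
        ("prediction_error", prediction_error),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        weights.novelty * novelty
        + weights.salience * salience
        + weights.prediction_error * prediction_error
    )


def should_store(
    novelty: float, salience: float, prediction_error: float, threshold: float = 0.5
) -> bool:
    """Store the memory only if the combined score clears the threshold."""
    return gate_score(novelty, salience, prediction_error) >= threshold


print(should_store(0.9, 0.8, 0.7))  # new, important, surprising
print(should_store(0.1, 0.2, 0.1))  # repeated small talk
```

The hard part is not this arithmetic. It is producing honest novelty, salience, and prediction-error scores in the first place, and picking a threshold that discards noise without discarding things you'll need later.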

Everyone told me this was over-engineering, that I should just store everything and let the retrieval pipeline sort it out. But the whole point of my research was that the retrieval pipeline can't sort it out if you fill it with noise.

3. Writing a research paper instead of shipping features.

This one cost me the most. While other people in the AI memory space were shipping integrations and landing users, I was formatting LaTeX and looking for an arXiv endorser. For months, the product sat still while I wrote up findings that could have just lived in a blog post.

But I needed the paper to be real. Not a Medium article, not a Twitter thread. A proper research paper with methodology, controlled benchmarks, reproducible results, and citations. Because if I was going to claim that the standard approach to AI memory was fundamentally broken, I needed more than a blog post saying "trust me." The paper ended up on arXiv. I'm 27, no PhD, no academic affiliations, no lab. Just a stack of benchmark data and an architecture that tested better than everything except one system that costs 20 times more to run.

The paper is the moat. You can clone a product, you can't clone the research that explains why it works. And honestly, publishing on arXiv as an independent researcher without institutional backing taught me something about the gap between the wrapper world and the research world: in one, you ship fast and hope nobody looks underneath. In the other, looking underneath is the entire point.

The Platform Is Coming for You

Here's the argument that nobody in the wrapper space wants to hear.

Remember AI meeting summarizers? A bunch of companies raised money, shipped products, got users. Investors were excited, the TAM slides looked great, growth was real. Then Zoom, Google Meet, and Microsoft Teams all shipped native summarization within months of each other. Those companies didn't fail because their product was bad. They failed because the platform they were wrapping decided to build the feature itself. No pivot saves you from that. When the platform has the distribution, the integration, and the model access, you cannot out-convenience them at their own game.

This is already happening in memory. Anthropic is shipping native memory for Claude. OpenAI is building memory into ChatGPT. Cursor has its own context management. Google's Gemini remembers conversations. Every major platform is going to ship some version of AI memory because it's one of the first things every user asks for.

When that happens, every product that's just managing context through prompt injection and MEMORY.md files dies overnight. The platform will always ship the checkbox feature faster, with better integration, for free. They don't even have to build a good version. They just have to build a version that's good enough and already installed on every user's machine.

The only way to survive platform absorption is to have something underneath that the platform can't replicate by adding a feature. Not a prettier UI, not a smoother onboarding, not a better prompt template. Something real. A biologically inspired encoding gate that decides what to store based on novelty and salience. A 6-layer retrieval pipeline tuned across 56 embedding/reranker combinations. Published research that documents why the standard approach breaks. That's what TrueMemory is built on, and the distinction matters: I'm not building on the platform, I'm building underneath it.

Rented Land

Everyone's moving fast. Everyone's shipping. And most of what they're shipping is a UI on top of someone else's intelligence, built on land they don't own, protected by nothing but the assumption that the landlord won't build the same thing.

The landlord is already building it.


Josh Adler is a researcher at TrueMemory, a Sauron company. Research: arXiv:2605.04897. More at joshadler.com.
