OpenAI Taught ChatGPT to Dream. My Agent's Memory Is Still a JSON File.

OpenAI AI Agents Agent Memory ChatGPT LLM Architecture

On June 4, OpenAI shipped the biggest change to ChatGPT's memory since the feature existed. They call it Dreaming V3, and for once the name isn't just marketing. There is now a background process that runs while you're not chatting, reads across your conversation history, and rewrites a synthesized profile of what the model knows about you. Plus and Pro users in the US got it first. Free users are next, because OpenAI says it cut the compute cost of serving the feature by roughly 5x.

I've built memory for agents three times now — for a GitHub PR-reviewer agent, for the multi-agent system we eventually collapsed back into a single agent, and for the small ops agent that runs my own infrastructure. Every version was, if I'm honest, a JSON file with ambitions. So when the biggest consumer AI product on the planet announces "we made memory a batch job," I pay attention. This post covers what actually shipped, why the architecture is the real story, what I'm stealing for my own agents — and the part of the research record the launch post didn't mention.

What actually shipped

Old ChatGPT memory was append-only. You said "remember that I use TypeScript," it wrote a bullet to a list, and that bullet sat there forever — equally weighted against everything else, never reconciled, never expiring. Dreaming V3 throws that model out. Memory is now drawn automatically from your chat history and maintained as a coherent prose profile, sorted into categories like work, hobbies, and travel. There's a new memory summary page where you can read the profile, correct it, or tell ChatGPT which topics to track or avoid.

OpenAI's internal numbers for the upgrade: fact retrieval at 82.8% (up from 67.9% in 2025 and 41.5% in 2024), preference-following at 71.3% (up from 31.4%), and sensitivity to elapsed time at 75.1% (up from 52.2%). That last metric is the interesting one — they're explicitly benchmarking whether the model knows that what you said three months ago may no longer be true. Nobody benchmarks that unless staleness was hurting them in production.

The real story is the architecture

Strip the branding and Dreaming V3 is three design decisions:

1. Memory is written offline. A background consolidation process — the "dreaming" — does the synthesis between sessions, not inline while you chat. The expensive thinking about what to remember happens when nobody is waiting on a response.

2. The store is a rewritten document, not an append-only list. Each consolidation pass produces a fresh prose profile. Old facts don't accumulate next to new ones — they get reconciled, superseded, or dropped.

3. Retrieval optimizes for freshness, relevance, and continuity — recent context outweighs old context by design, and noise is filtered before it ever reaches the conversation.

None of this is new research. It's how human memory consolidation during sleep has been described for decades, and it's the "reflection" step from the generative-agents papers of 2023. What's new is that it's now running for hundreds of millions of users, cheaply enough that OpenAI plans to give it away on the free tier. The pattern just got production-validated at the largest possible scale.

Why append-only memory fails (I have the scars)

My PR-reviewer agent stored facts about each repo it reviewed. Useful facts, like which test framework the project used. Then one repo migrated test runners, and the agent kept recommending mocks for the old framework for weeks — because nothing in an append-only store ever deletes the stale fact. The new information went in right next to the old information, and retrieval happily served whichever one matched the query better that day.

Every naive agent memory dies one of three deaths: staleness (true facts that stopped being true), noise (so much stored that retrieval surfaces the plausible-but-irrelevant), or contradiction (two facts disagree and the model picks one with full confidence). Append-only guarantees all three eventually. RAG over raw chat logs doesn't fix it — it just postpones the funeral and adds a vector database to the invoice.

What I'm stealing

You don't need OpenAI's infrastructure to copy the idea. The whole pattern fits in a cron job and one LLM call:

// memory-consolidation.ts — runs nightly, never per-request
async function consolidate(agentId: string) {
  const episodes = await db.episodes.since(agentId, lastRun); // raw interaction logs
  const profile = await db.profile.get(agentId);              // ONE prose document

  const updated = await llm.generate({
    system: `You maintain an agent's long-term memory.
      Rewrite the profile to incorporate the new episodes.
      Reconcile contradictions in favor of newer evidence.
      Drop anything stale or irrelevant. Keep it under 1500 tokens.
      Date every claim you keep.`,
    prompt: JSON.stringify({ profile, episodes }),
  });

  await db.profile.put(agentId, { text: updated, updatedAt: now });
  await db.episodes.archive(agentId, lastRun); // raw logs leave the hot path
}

At runtime the agent gets the profile injected into its system prompt. That's it. No retrieval step, no embeddings, no reranker. The three rules that matter:

Rewrite, don't append. The consolidation prompt's job is reconciliation. "Repo X migrated from Jest to Vitest" should replace the old fact, not live alongside it. This is the single change that kills staleness and contradiction at the same time.

Cap the document. A hard token budget forces the model to prioritize, which is exactly the noise filter you want. Most agents don't need a memory system; they need 1,500 well-chosen tokens. Only shard into multiple documents when one genuinely overflows — I haven't needed to yet.

Date everything. OpenAI's elapsed-time benchmark exists because models are bad at "how old is this fact" unless you make it explicit. A claim with a date lets the next consolidation pass — and the agent itself — reason about whether it's still trustworthy.

The asterisks

First, every number above is OpenAI's own. No independent verification, no comparison against competitors, no published methodology for what counts as a successfully "retrieved fact." The trend line across three years is probably real; the absolute numbers are marketing until someone external reproduces them.

Second, the part the launch coverage mostly skipped: a February arXiv study analyzed 2,050 ChatGPT memory entries and found that 96% were created unilaterally by the system rather than by user instruction — 52% containing psychological insights about the user, 28% containing personal data as defined by GDPR. Dreaming V3 makes the profile visible and editable, which is a genuine improvement. But the architecture is now explicitly built to infer things about you that you never asked it to store, by default, in the background. That's the trade you're making as a user.

And if you build this pattern into a product, you inherit the same problem. A consolidation job that infers "user seems frustrated with their manager" from support chats is a GDPR incident waiting for a subject-access request. Show users the profile, let them edit it, and keep the consolidation prompt constrained to operational facts. There's also a security angle I learned the hard way after my own agent got prompt-injected last week: persistent memory is persistence for attackers too. A poisoned fact survives across sessions until a consolidation pass questions it — so make your consolidation prompt skeptical of instructions embedded in episode data, the same way you'd sanitize anything else that crosses a trust boundary.

Verdict

Dreaming V3 is the most useful agent-engineering idea to ship inside a consumer product this year — not because the benchmarks are impressive, but because it settles an argument. Agent memory is not a database problem, and it's not a retrieval problem. It's a synthesis problem, and synthesis is a batch job. Rewrite instead of appending, cap the document, date the claims, show the user the file.

My PR-reviewer agent is getting a nightly consolidation pass this month. If the stale-fact recommendations stop, you'll get the numbers in a follow-up post. If they don't, you'll get that post too — that's the deal here.

Share this article

Share on X LinkedIn Bluesky Reddit WhatsApp Email

More writing

Like what you read?

Stay in the loop.

New articles on engineering, architecture, and building software that lasts. Straight to your inbox.

or follow

GitHub LinkedIn @flcn16