Stop wasting tokens with the wrong AI agent memory

Your agent blows its token budget on a single tool call, or forgets what the user said three turns ago. Same root cause: it has two kinds of memory and they got mixed up. One holds the conversation; the other holds large tool outputs like logs. They need different storage and different retrieval, and treating them as one store is what makes agents slow, expensive, and wrong.

This post shows how to keep them separate: the framework now offloads large data for you (no more pointer code by hand), and in production the two memories map to two AWS services. I deployed it and measured the difference.

This post builds on AI Context Window Overflow: Memory Pointer Fix. The code uses Strands Agents, but the patterns carry over to other frameworks. Repo: sample-why-agents-fail.

What are the two kinds of agent memory?

An AI agent has two kinds of memory: conversation memory holds what was said (turns, preferences, facts) and is recalled by meaning, while context memory holds large tool outputs (logs, datasets, documents) and is recalled by an exact identifier. They are different stores with different retrieval, and using one where the other belongs is the root cause of both "my agent forgets things" and "my agent blew the token budget."

Before any code, get the distinction straight:

	Conversation memory	Context memory
Holds	Turns, preferences, extracted facts	Large tool outputs (logs, datasets)
Recalled by	Meaning (semantic similarity)	Exact identifier (a reference)
Question it answers	"What did the user tell me earlier?"	"Give me that 5MB log file back, exactly"
Wrong fit for	A 5MB log blob	"What's the user's name again?"

That table is the whole article. Everything below is just where each row lives in code.
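
To make the two rows concrete, here is a minimal sketch in plain Python. It does not use Strands or any AWS service: ConversationMemory, ContextMemory, and the word-overlap scoring are hypothetical stand-ins, and a real conversation store would use embeddings rather than word counts.

```python
import math
import re
from collections import Counter


def _bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ConversationMemory:
    """Recalled by meaning: returns the stored fact most similar to the query."""

    def __init__(self) -> None:
        self._facts: list[str] = []

    def remember(self, fact: str) -> None:
        self._facts.append(fact)

    def recall(self, query: str) -> str:
        q = _bag_of_words(query)
        return max(self._facts, key=lambda f: _cosine(q, _bag_of_words(f)))


class ContextMemory:
    """Recalled by exact identifier: returns the stored bytes unchanged."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def store(self, reference: str, data: bytes) -> None:
        self._blobs[reference] = data

    def retrieve(self, reference: str) -> bytes:
        return self._blobs[reference]  # KeyError if the reference is wrong


conversation = ConversationMemory()
conversation.remember("The user's name is Ana.")
conversation.remember("The user prefers answers in Spanish.")

context = ContextMemory()
logs = b"2025-01-01T00:00:00Z ERROR api-gateway timeout\n" * 1000
context.store("logs/api-gateway/001", logs)

best_fact = conversation.recall("what is the user's name?")
same_logs = context.retrieve("logs/api-gateway/001")
```

The conversation store answers a fuzzy question with its closest match, while the context store either returns the exact bytes or fails. Neither behavior is a good substitute for the other.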

Why context memory overflows first

Large tool outputs overflow the context window because they are indivisible and re-sent on every model call. A tool that returns 200KB of logs doesn't just cost 200KB once. That payload rides along in the input of every subsequent turn until it pushes the original question out of the window.
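
A back-of-the-envelope sketch of that compounding cost. The numbers are hypothetical, and the model is deliberately simple: it assumes every call re-sends the full history and ignores the growth from assistant replies, so it understates the real total.

```python
def cumulative_input_tokens(base_tokens: int, payload_tokens: int, calls_after_tool: int) -> int:
    """Total input tokens across model calls while the payload stays in history."""
    per_call = base_tokens + payload_tokens
    return per_call * calls_after_tool


# Hypothetical: a 500-token conversation and a 50,000-token tool result,
# followed by five more model calls.
with_payload = cumulative_input_tokens(base_tokens=500, payload_tokens=50_000, calls_after_tool=5)

# Same conversation, but only a 200-token preview plus reference stays in context.
with_pointer = cumulative_input_tokens(base_tokens=500, payload_tokens=200, calls_after_tool=5)

print(with_payload)  # 252500
print(with_pointer)  # 3500
```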

The first post quantified this with IBM Research (Solving Context Window Overflow in AI Agents, 2025): a materials-science workflow that consumed 20,822,181 tokens and failed dropped to 1,234 tokens and succeeded once large data was stored outside context and referenced by a pointer.

The fix, then and now: stop putting data in the conversation

The original post stored large data by hand: a tool wrote it to agent.state and returned a short pointer string; the next tool read it back by that key. It works, but the offloading logic lived inside every tool.
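
For readers who have not seen the first post, the manual version looks roughly like this. It is a sketch, not the original code: a plain dict stands in for agent.state, and the tool names, key format, and log shape are made up for illustration.

```python
import json

# Stand-in for agent.state: a plain dict shared by the tools (assumption).
state: dict[str, str] = {}


def fetch_logs(service: str) -> str:
    """Tool 1: stores the large result in state and returns only a pointer."""
    events = [
        {"service": service, "level": "ERROR" if i % 4 == 0 else "INFO"}
        for i in range(200)
    ]
    key = f"logs:{service}"
    state[key] = json.dumps(events)
    return f"Stored {len(events)} events at pointer '{key}'"


def count_errors(pointer: str) -> str:
    """Tool 2: reads the data back by the exact key and returns a short summary."""
    events = json.loads(state[pointer])
    errors = sum(1 for e in events if e["level"] == "ERROR")
    return f"{errors} errors out of {len(events)} events"


pointer_message = fetch_logs("api-gateway")
summary = count_errors("logs:api-gateway")
```

Only the two short strings ever reach the model. The cost is that every tool has to know about the shared state and the key convention.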

Strands now ships that exact pattern as a first-class plugin, ContextOffloader, so your tools go back to being ordinary functions:

from strands import Agent
from strands.vended_plugins.context_offloader import ContextOffloader, FileStorage

# Ordinary tools — no pointer logic, no agent.state inside them
agent = Agent(
    model=MODEL,
    tools=[fetch_application_logs, count_errors_by_service],
    plugins=[ContextOffloader(storage=FileStorage("./artifacts"),
                              max_result_tokens=800, preview_tokens=200)],
)
agent("Fetch 2 hours of logs for 'api-gateway' and tell me the top error service.")

When a tool result is larger than max_result_tokens, the plugin intercepts it, stores each block in the backend, and leaves a small preview plus a reference in context. The agent gets a retrieve_offloaded_content(reference) tool to pull the full data back by exact reference when it actually needs it.
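
The behavior described above can be sketched in a few lines of plain Python. This is not the plugin's implementation: the 4-characters-per-token estimate, the mem:// reference format, and the dict backend are all assumptions made for the example.

```python
import uuid

MAX_RESULT_TOKENS = 800  # same threshold as max_result_tokens in the example
PREVIEW_TOKENS = 200     # same size as preview_tokens in the example

storage: dict[str, str] = {}  # hypothetical in-memory backend


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); not a real tokenizer."""
    return len(text) // 4


def offload_if_large(tool_result: str) -> str:
    """Return small results untouched; replace large ones with preview + reference."""
    if estimate_tokens(tool_result) <= MAX_RESULT_TOKENS:
        return tool_result
    reference = f"mem://{uuid.uuid4().hex}"
    storage[reference] = tool_result
    preview = tool_result[: PREVIEW_TOKENS * 4]
    return f"{preview}\n[truncated; full content at reference {reference}]"


def retrieve_offloaded_content(reference: str) -> str:
    """Exact-reference lookup of the full, unmodified result."""
    return storage[reference]


small = offload_if_large("ok")
large_result = "ERROR api-gateway timeout\n" * 2000
in_context = offload_if_large(large_result)

# Parse the reference back out of the in-context message.
reference = in_context.rsplit("reference ", 1)[1].rstrip("]")
```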

What is the native Memory Pointer Pattern in Strands?

The native Memory Pointer Pattern is ContextOffloader, a plugin that intercepts oversized tool results at execution time, stores each block in a storage backend, and replaces the in-context result with a preview plus a reference. Large data never floods the context window, and your tools never touch pointer logic.

Measured results

I ran the same query through three strategies. Same query, gpt-4o-mini, 2 hours of logs:

Strategy	Tokens in context
No management	~18,000 to 20,000
ContextOffloader (FileStorage)	~490
context_manager="auto"	~1,000

That is roughly 97% fewer tokens with ContextOffloader (about 94 to 95% fewer with context_manager="auto") for the same answer. Numbers vary per run because the log data is randomized; test_native_pointer.py reproduces them.
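
The percentages follow directly from the table. A small sketch of the arithmetic, using the approximate token counts reported above as inputs (the helper name is mine):

```python
def reduction_percent(baseline_tokens: int, managed_tokens: int) -> float:
    """Percentage of context tokens saved relative to the unmanaged baseline."""
    return 100 * (1 - managed_tokens / baseline_tokens)


# Baseline range from the table: ~18,000 to ~20,000 tokens with no management.
offloader_low = reduction_percent(18_000, 490)
offloader_high = reduction_percent(20_000, 490)
auto_low = reduction_percent(18_000, 1_000)
auto_high = reduction_percent(20_000, 1_000)

print(f"ContextOffloader: {offloader_low:.1f}% to {offloader_high:.1f}%")
print(f'context_manager="auto": {auto_low:.1f}% to {auto_high:.1f}%')
```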

One honest caveat: the offloader is a safety net, not the whole win. The big savings come from pairing it with a selective tool. My count_errors_by_service computes the answer server-side and returns a small summary, so the agent answers from the summary and the logs stay offloaded. Without a selective tool, an agent that needs the full dataset will just call retrieve_offloaded_content and bring it all back. The offloader guarantees you won't overflow; selective tools are what keep the token count low.
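
What "selective" means in practice: the tool does the aggregation and hands back a few dozen tokens instead of the raw events. The sketch below is a hypothetical tool in that spirit, not the repo's count_errors_by_service; the function name, the event shape, and the summary fields are assumptions.

```python
from collections import Counter


def summarize_errors(log_events: list[dict]) -> dict:
    """Aggregate ERROR events per service and return only a compact summary."""
    errors = Counter(e["service"] for e in log_events if e["level"] == "ERROR")
    return {
        "total_events": len(log_events),
        "total_errors": sum(errors.values()),
        "top_error_services": [
            {"service": service, "errors": count}
            for service, count in errors.most_common(3)
        ],
    }


# Hypothetical log events; the real demo's log format may differ.
events = (
    [{"service": "api-gateway", "level": "ERROR"}] * 7
    + [{"service": "auth", "level": "ERROR"}] * 3
    + [{"service": "auth", "level": "INFO"}] * 90
)
summary = summarize_errors(events)
```

The agent can answer "which service has the most errors?" from summary alone, so it never has a reason to pull the full log set back into context.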

One line for most agents

For a typical multi-turn agent you don't wire up offloading and summarization separately:

agent = Agent(model=MODEL, tools=[...], context_manager="auto")

This composes a SummarizingConversationManager (summarizes old history with proactive compression) and a ContextOffloader (in-memory) with benchmark-validated defaults. Anything you pass explicitly takes precedence.
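
The summarization half is the conversation-side counterpart of offloading: old turns are compressed, recent ones stay verbatim. A rough sketch of the idea follows. It is not how SummarizingConversationManager is implemented; compress_history, keep_recent, and the placeholder summarizer are hypothetical, and a real summarizer would call a model.

```python
from typing import Callable


def compress_history(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything except the last `keep_recent` messages with one summary."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= keep_recent:
        return list(messages)
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[summary of {len(old)} earlier messages] {summarize(old)}"] + recent


def naive_summarize(old: list[str]) -> str:
    """Placeholder: keeps the first 40 characters of each message."""
    return " | ".join(m[:40] for m in old)


history = [f"turn {i}: " + "details " * 50 for i in range(10)]
compressed = compress_history(history, keep_recent=3, summarize=naive_summarize)

chars_before = sum(len(m) for m in history)
chars_after = sum(len(m) for m in compressed)
```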

The same idea, on real Amazon S3 storage

FileStorage writes to local disk. Swap one line and large tool outputs land in a real S3 bucket, recalled by exact reference, never in the window:

from strands.vended_plugins.context_offloader import ContextOffloader, S3Storage

agent = Agent(
    model=MODEL,
    tools=[fetch_application_logs, count_errors_by_service],
    plugins=[ContextOffloader(S3Storage(bucket=CONTEXT_BUCKET, prefix="log-artifacts/"))],
)

An 83KB log dataset was stored in S3, ~486 tokens stayed in context, and the data came back byte-for-byte by its exact reference:

📊 Tokens left in LLM context:  486
📦 Objects offloaded to S3:     1
   pointer in context:  s3://…/log-artifacts/1781569100199_1_call_…_0
   storage.retrieve()  → 77,050 bytes  (text/plain)
   verified: 200 log events recovered verbatim — exact data, no loss

That is the second row of the table, in production form: exact-identifier recall. You don't want "the logs most similar to my query." You want those logs, exactly. That's object storage, not semantic search.
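
If you want to check "byte-for-byte" yourself, a digest comparison is enough. The sketch below uses a dict as a stand-in for the bucket so it runs without AWS; put_object, get_object, and the key are hypothetical names for the example, not the S3Storage API.

```python
import hashlib

bucket: dict[str, bytes] = {}  # stand-in for an object store; not S3


def put_object(key: str, data: bytes) -> str:
    """Store bytes under an exact key and return their SHA-256 digest."""
    bucket[key] = data
    return hashlib.sha256(data).hexdigest()


def get_object(key: str) -> bytes:
    """Exact-key lookup: these bytes or a KeyError, never a near match."""
    return bucket[key]


original = "".join(
    f"event {i}: ERROR api-gateway timeout\n" for i in range(200)
).encode("utf-8")
digest_at_write = put_object("log-artifacts/example-0", original)

restored = get_object("log-artifacts/example-0")
is_verbatim = hashlib.sha256(restored).hexdigest() == digest_at_write
event_count = restored.decode("utf-8").count("\n")
```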

Production: two memories, on purpose

In production the split becomes architecture. An agent on Amazon Bedrock AgentCore keeps each memory where it belongs:

Conversation memory → AgentCore Memory. Turns, preferences, and extracted facts, recalled by semantic similarity (RetrieveMemoryRecords: embeddings, top_k, relevance score), scoped per user with actor_id. Wired in through the Strands AgentCoreMemorySessionManager.
Context memory → Amazon S3. The same ContextOffloader, with S3Storage instead of FileStorage. Recalled by exact reference.

Why not put the logs in AgentCore Memory too? Because AgentCore Memory recalls the semantically most similar memory, which is exactly wrong for "return this dataset verbatim by id." Conversation wants meaning; data wants an exact key. One agent, two memories, each doing what it's good at.

agent = Agent(
    model=BedrockModel(region_name=REGION),
    tools=[fetch_application_logs, count_errors_by_service],
    session_manager=AgentCoreMemorySessionManager(memory_config, REGION),     # conversation
    plugins=[ContextOffloader(S3Storage(bucket=CONTEXT_BUCKET, prefix="…"))],  # data
)

Observability and evaluation come for free

On AgentCore, full observability is built in. You add the instrumentation library and get traces, metrics, and logs for every invocation without writing any monitoring code. The deploy already enabled it: the agent emits OpenTelemetry (OTEL) traces and metrics under the bedrock-agentcore namespace, and a CloudWatch GenAI Observability dashboard shows agent, session, and trace views (latency, error rate, token usage, tool calls) out of the box.

That is how I diagnosed a ListEvents permission error in seconds: the failing trace was right there in CloudWatch, no extra setup. See View observability data for AgentCore agents.

The same instrumentation feeds AgentCore Evaluations: automated, LLM-as-a-Judge scoring of task completion and tool-call accuracy from the same traces, so you can measure agent quality continuously instead of only at launch.

Which memory, when

Just the data problem, locally? ContextOffloader(FileStorage(...)). Ordinary tools, no pointer code.
A typical multi-turn agent? context_manager="auto". Summarization plus offloading in one line.
Production? AgentCore Memory for the conversation, ContextOffloader(S3Storage(...)) for the data. Keep them separate.
Either way: pair the offloader with selective tools that return summaries, not raw blobs. The offloader prevents overflow; selective tools keep the token count low.

Try it yourself

You need Python 3.11+, uv, and an OPENAI_API_KEY (or swap the model for BedrockModel). The S3 and AgentCore steps also need AWS credentials.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/01-context-overflow-demo
uv venv && uv pip install -r requirements.txt

uv run python test_native_pointer.py              # local, measured token comparison
AWS_PROFILE=you uv run python test_s3_offload_local.py   # offload to real S3 (needs AWS credentials)
# Production deploy + two-memory walkthrough: setup_agentcore_s3.ipynb

Notebooks: test_native_pointer.ipynb (local) and setup_agentcore_s3.ipynb (provision + deploy + invoke on AWS).

Key takeaways

An agent has two memories. Conversation (semantic) and data (exact reference). Most context problems are one put where the other belongs.
You don't build the data side by hand anymore. ContextOffloader is the Memory Pointer Pattern as a plugin; tools stay ordinary functions.
Measured ~97% fewer tokens in this demo, and verified an 83KB dataset offloaded to real S3 and recovered byte-for-byte by reference.
In production, keep the two memories separate. AgentCore Memory for conversation, S3 for data. Logs recalled by meaning is the wrong design.
The offloader is a safety net; selective tools are the win. Return summaries, not blobs.
On AgentCore, observability and evaluation are free. Add the library, get traces, metrics, and LLM-as-a-Judge scoring with no monitoring code.

FAQ

Does ContextOffloader need AWS? No. With FileStorage or InMemoryStorage it runs fully local. You only need AWS when you choose S3Storage or deploy to AgentCore.

Can I store large files in AgentCore Memory instead of S3? You can, but you shouldn't. AgentCore Memory recalls by semantic similarity, so it returns the most similar memory, not an exact file. Large tool outputs need exact-identifier retrieval, which is what S3 (via ContextOffloader) gives you.

Do I need Docker to deploy to AgentCore? No. The starter toolkit builds the image in the cloud with AWS CodeBuild by default. Docker is only needed for a local build.

What is the difference between agent.state and ContextOffloader? agent.state is the manual Memory Pointer Pattern: you write and read pointers inside your tools. ContextOffloader is the same idea as a plugin: tools stay ordinary and the framework offloads large results for you.

Which of my two memories is costing me tokens? The data one. Conversation memory is small text; the token blowups come from large tool outputs riding along in context. That is the memory ContextOffloader fixes.

Which of your agent's two memories is leaking tokens? Tell me in the comments.

References

Research

Solving Context Window Overflow in AI Agents — IBM Research, 2025
Towards Effective GenAI Multi-Agent Collaboration — Amazon, 2024 (payload referencing between agents)


Thanks!


Elizabeth Fuentes L

I help developers build production-ready AI applications through hands-on tutorials and open-source projects.
