Your agent blows its token budget on a single tool call, or forgets what the user said three turns ago. Same root cause: it has two kinds of memory and they got mixed up. One holds the conversation; the other holds large tool outputs like logs. They need different storage and different retrieval, and treating them as one store is what makes agents slow, expensive, and wrong.
This post shows how to keep them separate: the framework now offloads large data for you (no more pointer code by hand), and in production the two memories map to two AWS services. I deployed it and measured the difference.
Builds on AI Context Window Overflow: Memory Pointer Fix. Code uses Strands Agents; the patterns carry over to other frameworks. Repo: sample-why-agents-fail.
What are the two kinds of agent memory?
An AI agent has two kinds of memory: conversation memory holds what was said (turns, preferences, facts) and is recalled by meaning, while context memory holds large tool outputs (logs, datasets, documents) and is recalled by an exact identifier. They are different stores with different retrieval, and using one where the other belongs is the root cause of both "my agent forgets things" and "my agent blew the token budget."
Before any code, get the distinction straight:
| Conversation memory | Context memory | |
|---|---|---|
| Holds | Turns, preferences, extracted facts | Large tool outputs (logs, datasets) |
| Recalled by | Meaning (semantic similarity) | Exact identifier (a reference) |
| Question it answers | "What did the user tell me earlier?" | "Give me that 5MB log file back, exactly" |
| Wrong fit for | A 5MB log blob | "What's the user's name again?" |
That table is the whole article. Everything below is just where each row lives in code.
Why context memory overflows first
Large tool outputs overflow the context window because they are indivisible and re-sent on every model call. A tool that returns 200KB of logs doesn't just cost 200KB once. That payload rides along in the input of every subsequent turn until it pushes the original question out of the window.
The first post quantified this with IBM Research (Solving Context Window Overflow in AI Agents, 2025): a materials-science workflow that consumed 20,822,181 tokens and failed dropped to 1,234 tokens and succeeded once large data was stored outside context and referenced by a pointer.
The fix, then and now: stop putting data in the conversation
The original post stored large data by hand: a tool wrote it to agent.state and returned a short pointer string; the next tool read it back by that key. It works, but the offloading logic lived inside every tool.
Strands now ships that exact pattern as a first-class plugin, ContextOffloader, so your tools go back to being ordinary functions:
from strands import Agent
from strands.vended_plugins.context_offloader import ContextOffloader, FileStorage
# Ordinary tools โ no pointer logic, no agent.state inside them
agent = Agent(
model=MODEL,
tools=[fetch_application_logs, count_errors_by_service],
plugins=[ContextOffloader(storage=FileStorage("./artifacts"),
max_result_tokens=800, preview_tokens=200)],
)
agent("Fetch 2 hours of logs for 'api-gateway' and tell me the top error service.")
When a tool result is larger than max_result_tokens, the plugin intercepts it, stores each block in the backend, and leaves a small preview plus a reference in context. The agent gets a retrieve_offloaded_content(reference) tool to pull the full data back by exact reference when it actually needs it.
What is the native Memory Pointer Pattern in Strands?
The native Memory Pointer Pattern is ContextOffloader, a plugin that intercepts oversized tool results at execution time, stores each block in a storage backend, and replaces the in-context result with a preview plus a reference. Large data never floods the context window, and your tools never touch pointer logic.
Measured results
I ran the same query through three strategies. Same query, gpt-4o-mini, 2 hours of logs:
| Strategy | Tokens in context |
|---|---|
| No management | ~18,000 to 20,000 |
ContextOffloader (FileStorage) |
~490 |
context_manager="auto" |
~1,000 |
That is roughly 97% fewer tokens for the same answer. Numbers vary per run because the log data is randomized; test_native_pointer.py reproduces them.
One honest caveat: the offloader is a safety net, not the whole win. The big savings come from pairing it with a selective tool. My count_errors_by_service computes the answer server-side and returns a small summary, so the agent answers from the summary and the logs stay offloaded. Without a selective tool, an agent that needs the full dataset will just call retrieve_offloaded_content and bring it all back. The offloader guarantees you won't overflow; selective tools are what keep the token count low.
One line for most agents
For a typical multi-turn agent you don't wire up offloading and summarization separately:
agent = Agent(model=MODEL, tools=[...], context_manager="auto")
This composes a SummarizingConversationManager (summarizes old history with proactive compression) and a ContextOffloader (in-memory) with benchmark-validated defaults. Anything you pass explicitly takes precedence.
The same idea, on real Amazon S3 storage
FileStorage writes to local disk. Swap one line and large tool outputs land in a real S3 bucket, recalled by exact reference, never in the window:
from strands.vended_plugins.context_offloader import ContextOffloader, S3Storage
agent = Agent(
model=MODEL,
tools=[fetch_application_logs, count_errors_by_service],
plugins=[ContextOffloader(S3Storage(bucket=CONTEXT_BUCKET, prefix="log-artifacts/"))],
)
An 83KB log dataset was stored in S3, ~486 tokens stayed in context, and the data came back byte-for-byte by its exact reference:
๐ Tokens left in LLM context: 486
๐ฆ Objects offloaded to S3: 1
pointer in context: s3://โฆ/log-artifacts/1781569100199_1_call_โฆ_0
storage.retrieve() โ 77,050 bytes (text/plain)
verified: 200 log events recovered verbatim โ exact data, no loss
That is the second row of the table, in production form: exact-identifier recall. You don't want "the logs most similar to my query." You want those logs, exactly. That's object storage, not semantic search.
Production: two memories, on purpose
In production the split becomes architecture. An agent on Amazon Bedrock AgentCore keeps each memory where it belongs:
-
Conversation โ AgentCore Memory. Turns, preferences, and extracted facts, recalled by semantic similarity (
RetrieveMemoryRecords: embeddings,top_k, relevance score), scoped per user withactor_id. Wired in through the StrandsAgentCoreMemorySessionManager. -
Context memory โ Amazon S3. The same
ContextOffloader, withS3Storageinstead ofFileStorage. Recalled by exact reference.
Why not put the logs in AgentCore Memory too? Because AgentCore Memory recalls the semantically most similar memory, which is exactly wrong for "return this dataset verbatim by id." Conversation wants meaning; data wants an exact key. One agent, two memories, each doing what it's good at.
agent = Agent(
model=BedrockModel(region_name=REGION),
tools=[fetch_application_logs, count_errors_by_service],
session_manager=AgentCoreMemorySessionManager(memory_config, REGION), # conversation
plugins=[ContextOffloader(S3Storage(bucket=CONTEXT_BUCKET, prefix="โฆ"))], # data
)
Observability and evaluation come for free
On AgentCore, full observability is built in. You add the instrumentation library and get traces, metrics, and logs for every invocation without writing any monitoring code. The deploy already enabled it: the agent emits OpenTelemetry (OTEL) traces and metrics under the bedrock-agentcore namespace, and a CloudWatch GenAI Observability dashboard shows agent, session, and trace views (latency, error rate, token usage, tool calls) out of the box.
That is how I diagnosed the ListEvents permission error from earlier in seconds: the failing trace was right there in CloudWatch, no extra setup. See View observability data for AgentCore agents.
The same instrumentation feeds AgentCore Evaluations: automated, LLM-as-a-Judge scoring of task completion and tool-call accuracy from the same traces, so you can measure agent quality continuously instead of only at launch.
Which memory, when
-
Just the data problem, locally?
ContextOffloader(FileStorage(...)). Ordinary tools, no pointer code. -
A typical multi-turn agent?
context_manager="auto". Summarization plus offloading in one line. -
Production? AgentCore Memory for the conversation,
ContextOffloader(S3Storage(...))for the data. Keep them separate. - Either way: pair the offloader with selective tools that return summaries, not raw blobs. The offloader prevents overflow; selective tools keep the token count low.
Try it yourself
You need Python 3.11+, uv, and an OPENAI_API_KEY (or swap the model for BedrockModel). The S3 and AgentCore steps also need AWS credentials.
git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/01-context-overflow-demo
uv venv && uv pip install -r requirements.txt
uv run python test_native_pointer.py # local, measured token comparison
AWS_PROFILE=you uv run python test_s3_offload_local.py
# Production deploy + two-memory walkthrough: setup_agentcore_s3.ipynb
Notebooks: test_native_pointer.ipynb (local) and setup_agentcore_s3.ipynb (provision + deploy + invoke on AWS).
Key takeaways
- An agent has two memories. Conversation (semantic) and data (exact reference). Most context problems are one put where the other belongs.
-
You don't build the data side by hand anymore.
ContextOffloaderis the Memory Pointer Pattern as a plugin; tools stay ordinary functions. - Measured ~97% fewer tokens in this demo, and verified an 83KB dataset offloaded to real S3 and recovered byte-for-byte by reference.
- In production, keep the two memories separate. AgentCore Memory for conversation, S3 for data. Logs recalled by meaning is the wrong design.
- The offloader is a safety net; selective tools are the win. Return summaries, not blobs.
- On AgentCore, observability and evaluation are free. Add the library, get traces, metrics, and LLM-as-a-Judge scoring with no monitoring code.
FAQ
Does ContextOffloader need AWS? No. With FileStorage or InMemoryStorage it runs fully local. You only need AWS when you choose S3Storage or deploy to AgentCore.
Can I store large files in AgentCore Memory instead of S3? You can, but you shouldn't. AgentCore Memory recalls by semantic similarity, so it returns the most similar memory, not an exact file. Large tool outputs need exact-identifier retrieval, which is what S3 (via ContextOffloader) gives you.
Do I need Docker to deploy to AgentCore? No. The starter toolkit builds the image in the cloud with AWS CodeBuild by default. Docker is only needed for a local build.
What is the difference between agent.state and ContextOffloader? agent.state is the manual Memory Pointer Pattern: you write and read pointers inside your tools. ContextOffloader is the same idea as a plugin: tools stay ordinary and the framework offloads large results for you.
Which of my two memories is costing me tokens? The data one. Conversation memory is small text; the token blowups come from large tool outputs riding along in context. That is the memory ContextOffloader fixes.
Which of your agent's two memories is leaking tokens? Tell me in the comments.
References
Research
- Solving Context Window Overflow in AI Agents โ IBM Research, 2025
- Towards Effective GenAI Multi-Agent Collaboration โ Amazon, 2024 (payload referencing between agents)
Implementation
- Strands ยท Context Management
- Strands ยท Conversation Management
- Strands ยท Agent State
- Amazon Bedrock AgentCore Memory โ Get started
- AgentCore Runtime โ IAM permissions
- AgentCore โ Observability in CloudWatch
- AgentCore โ Evaluations
- Code: 01-context-overflow-demo
Gracias!
๐ป๐ช๐จ๐ฑ Dev.to Linkedin GitHub Twitter Instagram Youtube




';" />
';" />
';" />