The Light Bulb
Newsletter by Paper Lantern
👋 Hey Thinkers & Tinkerers,
Ever feel like building AI agents is more art than science? This week, research brings the receipts. We've got four incredible papers that offer concrete blueprints for solving the biggest agent challenges: control, scaling, memory, and cost. These aren't just ideas—they're tools you can use today.
📋 In This Issue:
Agent Control: Steer LLM Beliefs with 'Belief Box' Prompts
🔍 Paper Lantern analyzed 108 papers to find this trend
TL;DR
Add a 'Belief Box' section to your agent's system prompt to explicitly define its beliefs and make its behavior more predictable.
Current State:

Getting LLM agents to behave predictably and adhere to a consistent persona is a major challenge in multi-agent systems.¹ ²
Current approaches often focus on improving the agent's information diet, either through massive context windows or more precise retrieval-augmented generation (RAG). While useful, simply providing better data doesn't guarantee an agent will adopt a specific viewpoint or personality.³ ⁴ ⁵ ⁶ ⁷
This Paper:
This paper introduces a startlingly simple, training-free technique to directly control an agent's convictions. Instead of hoping the model infers its beliefs, you explicitly state them in the system prompt inside a 'Belief Box'. This is a dedicated text block where you list propositions and assign a confidence score. For example: Belief: Social media is beneficial for society (Confidence: 2/5). Alongside this, you provide an 'open-mindedness' score, such as Open-mindedness: 5/5, which instructs the agent on how easily it should be persuaded by new arguments.
Results:
To test this, the authors had two GPT-4 agents debate controversial topics. A 'subject' agent was equipped with a Belief Box and a specific open-mindedness level. The results were clear: the technique works. Agents instructed to be stubborn (open-mindedness: 1/5) changed their initial belief in only 12.5% of debates. In contrast, agents told to be open-minded (5/5) changed their minds 87.5% of the time. A baseline agent with no instructions was persuaded in 50% of cases, showing the prompts provide strong, steerable control over agent behavior.
This means you can create agents with stable, predictable personas without any fine-tuning. It's a powerful and cheap method for building more reliable multi-agent simulations, game characters, or specialized chatbots.
🔧 Use This Today
In your agent's system prompt, add a section like --- YOUR BELIEF BOX ---. Inside, list key beliefs as plain text statements, each with a confidence score (e.g., Confidence: 4/5). Then, add a line defining its persuadability, like You are moderately open-minded. This structure gives you a direct control knob for agent personality.
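Here's a minimal sketch of that structure in Python. The helper name and the closing instruction sentence are our own illustrative choices; the Belief Box fields themselves (a proposition with a Confidence: n/5 score, plus an Open-mindedness: n/5 line) follow the format described above.

```python
# Minimal sketch of a Belief Box system prompt. The wrapper text around the
# Belief Box is an illustrative assumption, not the paper's exact prompt.

def build_belief_box_prompt(beliefs: dict[str, int], open_mindedness: int) -> str:
    """Render a Belief Box section for an agent's system prompt.

    beliefs maps a proposition to a confidence score from 1 (weak) to 5 (strong);
    open_mindedness (1-5) sets how easily the agent should be persuaded.
    """
    lines = ["--- YOUR BELIEF BOX ---"]
    for proposition, confidence in beliefs.items():
        lines.append(f"Belief: {proposition} (Confidence: {confidence}/5)")
    lines.append(f"Open-mindedness: {open_mindedness}/5")
    lines.append("Defend your beliefs in proportion to their confidence scores, "
                 "and update them only as your open-mindedness allows.")
    return "\n".join(lines)

system_prompt = (
    "You are a debate participant. Argue from your stated beliefs.\n\n"
    + build_belief_box_prompt(
        beliefs={"Social media is beneficial for society": 2},
        open_mindedness=5,
    )
)
print(system_prompt)
```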
⚖️ Limitation:
This technique's effectiveness is proven in debates but not yet validated for complex, real-world agentic tasks.
Citations:
¹ Yifu Qiu et al. (2025) - Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models
² Yuchen Yan et al. (2025) - InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
³ Zhuowan Li et al. (2024) - Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
⁴ Chejian Xu et al. (2025) - From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
⁵ Tan Yu et al. (2024) - In Defense of RAG in the Era of Long-Context Language Models
Agent Scaling Science: A Framework to Stop Over-Engineering Your AI Agents
🔍 Paper Lantern analyzed 212 papers to find this trend
TL;DR
Stop assuming more agents are better; use this paper's data to justify a simple single agent unless your task is proven to be highly parallelizable.
Current State:

The belief that complex problems require complex multi-agent systems is driving teams to build elaborate solutions without clear evidence they work.¹
The field has largely split between trying to scale single agents with massive context windows or building multi-agent collectives, leaving builders to guess which approach is right.² ³ ⁴ ⁵ ⁶
This Paper:
This paper introduces a scientific framework to end the guesswork. Researchers systematically evaluated five canonical agent architectures—from a single agent to hierarchical and centralized multi-agent systems—across benchmarks like WebShop and AgentBench. They measured not just task success but also the hidden costs of coordination, such as communication overhead and error amplification when one agent's mistake cascades through the system. Using data from 180 experimental configurations, they developed a predictive model that, given a task's properties, recommends the optimal agent architecture.
Results:
The results provide clear, data-backed rules. On highly parallelizable tasks like those in WebShop, a 'Centralized' multi-agent system with a manager agent boosted success rates by a staggering 80.9% compared to a single agent. However, for sequential tasks, adding more agents was actively harmful; a 'Hierarchical' system performed 17.8% worse than a simple single agent due to coordination overhead. The model predicted the best architecture with 87% accuracy, demonstrating that the trade-offs are quantifiable and predictable.
This provides a data-driven defense against hype. You can now justify sticking with a simple, cheap single agent for many tasks, and know precisely when the complexity of a multi-agent system is worth the cost.
🔧 Use This Today
Before you design an agent system, follow this paper's method (a rough decision helper is sketched below this list):
1. Analyze the task: Is it highly sequential or can it be broken into parallel sub-tasks?
2. Default to a single agent: If the task is sequential or involves many interdependent tools, start and likely stay with a single agent.
3. Justify complexity: Only for highly parallelizable tasks where a single agent struggles, test a 'Centralized' multi-agent architecture. Measure the performance gain against the increase in token cost and latency to prove its worth.
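As a starting point, here's that checklist as a small decision helper. The TaskProfile fields and the 0.5 success threshold are illustrative assumptions, not values from the paper's fitted model, so calibrate them against your own baseline measurements.

```python
# Rough decision helper encoding the three steps above. Thresholds and task
# attributes are illustrative assumptions; the paper's actual recommender is
# a predictive model fit on 180 experimental configurations.

from dataclasses import dataclass

@dataclass
class TaskProfile:
    parallel_subtasks: int       # how many sub-tasks can run independently
    interdependent_tools: bool   # do tool calls depend on earlier results?
    single_agent_success: float  # measured success rate of a single-agent baseline

def recommend_architecture(task: TaskProfile) -> str:
    # Step 2: default to a single agent for sequential / tightly coupled work.
    if task.parallel_subtasks < 2 or task.interdependent_tools:
        return "single-agent"
    # Step 3: only reach for a centralized multi-agent setup when the task is
    # highly parallelizable AND the single-agent baseline demonstrably struggles.
    if task.single_agent_success < 0.5:  # illustrative threshold
        return "centralized multi-agent (measure token cost and latency!)"
    return "single-agent"

print(recommend_architecture(
    TaskProfile(parallel_subtasks=8, interdependent_tools=False,
                single_agent_success=0.35)
))
```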
⚖️ Limitation:
The findings are based on five standard architectures; your custom system's coordination costs will vary and require their own measurement.
Citations:
¹ Sibo Xiao et al. (2025) - Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration
² Zhuowan Li et al. (2024) - Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
³ Xinze Li et al. (2024) - Long Context vs. RAG for LLMs: An Evaluation and Revisits
⁴ Sibo Xiao et al. (2025) - Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration
⁵ Tan Yu et al. (2024) - In Defense of RAG in the Era of Long-Context Language Models
Agent Memory: Stop Storing History, Start Distilling Wisdom
🔍 Paper Lantern analyzed 162 papers to find this trend
TL;DR
Build a post-task process where an LLM distills key lessons from agent actions, then feed those lessons back into future prompts to make your agent smarter over time.
Current State:

LLM agents often fail to learn from experience, repeating the same mistakes because their memory is a simple, static log.¹ ²
Current solutions involve either stuffing raw history into massive context windows³ or using RAG to retrieve verbatim past interactions.⁴ ⁵ Both approaches treat memory as a passive archive, not an active source of wisdom, leading to inefficiency and repeated failures.³ ⁴ ⁵
This Paper:
Zouying Cao et al. (2025-12-11)
This paper introduces a dynamic procedural memory framework that transforms agents from amnesiacs into lifelong learners. Instead of just logging events, it implements a full memory lifecycle. First, after a task, a 'distiller' LLM analyzes the trajectory to extract a core lesson, like 'Failure: process_refund tool was called without an order_id. Insight: Always ask for order_id first.' Second, this distilled insight is indexed in a vector database for context-aware retrieval. Finally, a 'refiner' process periodically prunes memories that are rarely used or have been superseded, keeping the knowledge base potent and relevant.
Results:
On the ToolBench benchmark, a 7B parameter model equipped with this framework achieved 68.5% task success. This significantly outperformed a standard 70B model using a basic memory log, which scored 62.1%. The baseline 7B model with simple RAG memory only reached 45.3%, demonstrating a massive 23.2 percentage point lift from the dynamic memory system alone.
This means you can build agents that measurably improve with every interaction. It opens a path to using smaller, cheaper models that gain 'experience' to outperform larger, more expensive ones, directly impacting your bottom line.
🔧 Use This Today
Implement a post-task memory subsystem. After an agent runs, feed its trajectory (actions, observations, outcome) to an LLM with a prompt like: "Extract the key lesson or reason for failure from this interaction." Store the resulting insight in a vector DB. Before the next task, perform a similarity search on the user's query against this memory and prepend the top 1-2 insights into the agent's system prompt.
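A minimal sketch of that loop follows, with a generic call_llm placeholder standing in for your chat client and a toy bag-of-words similarity standing in for a real embedding model and vector DB:

```python
# Sketch of the distill-store-retrieve loop described above. `call_llm` is a
# placeholder, and the word-count similarity is a toy stand-in for proper
# embeddings + a vector store.

from collections import Counter
import math

DISTILL_PROMPT = ("Extract the key lesson or reason for failure from this "
                  "interaction:\n\n{trajectory}")

def call_llm(prompt: str) -> str:
    # Placeholder: route this to your model of choice.
    raise NotImplementedError("wire up your LLM client here")

memory: list[str] = []  # distilled insights, one string each

def distill_and_store(trajectory: str) -> None:
    # Post-task: ask the 'distiller' LLM for one core lesson, then store it.
    memory.append(call_llm(DISTILL_PROMPT.format(trajectory=trajectory)))

def _similarity(a: str, b: str) -> float:
    # Toy cosine similarity over word counts; swap in real embeddings.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def recall(query: str, k: int = 2) -> list[str]:
    # Pre-task: retrieve the most relevant past lessons for this query.
    return sorted(memory, key=lambda m: _similarity(query, m), reverse=True)[:k]

def build_system_prompt(base: str, query: str) -> str:
    lessons = recall(query)
    if not lessons:
        return base
    return base + "\n\nLessons from past tasks:\n" + "\n".join(f"- {l}" for l in lessons)
```

In production you'd swap the toy similarity for proper embeddings and add the paper's 'refiner' step to prune stale or superseded insights.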
⚖️ Limitation:
The quality of the distilled 'wisdom' is critical and depends entirely on your distillation prompt and model, creating a new potential point of failure.
Citations:
¹ Weijie Liu et al. (2024) - MemLong: Memory-Augmented Retrieval for Long Text Modeling
² Hongjin Qian et al. (2024) - MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery
³ Xinze Li et al. (2024) - Long Context vs. RAG for LLMs: An Evaluation and Revisits
⁴ Yixiong Fang et al. (2025) - AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
⁵ Zhichao Xu et al. (2025) - RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation
Agent Reliability: A Dual-Strategy Framework Cuts Token Use 3-9x on Open-Source LLMs
🔍 Paper Lantern analyzed 188 papers to find this trend
TL;DR
Implement a multi-step prompting loop where your agent creates a high-level plan, proposes a next step, scores that step against the plan, and revises the plan if the score is low.
Current State:

Building reliable agents for complex tasks often forces a choice between expensive, massive-context models and less capable open-source alternatives.¹ ²
The field has largely pursued two paths: pushing context windows to millions of tokens or optimizing retrieval-augmented generation (RAG). Both approaches focus on providing more external data, but can still lead to high token counts and planning failures.³ ⁴ ⁵ ⁶ ⁷
This Paper:
Wentao Zhang et al. (2025-12-09)
This paper introduces a co-adaptive framework that improves reasoning instead of just adding context. At each step, the agent uses two strategies from a single frozen LLM: (1) A 'Holistic Strategy' maintains the high-level, multi-step plan. (2) A 'Local Strategy' proposes the immediate next action based on current observations. A 'Strategy Fitness Score' then evaluates the local action's alignment with the holistic plan. If the score is low, the agent reflects, revising the high-level plan before acting. A final module integrates both views to execute the best step.
Results:
On the ALFWorld benchmark, the framework using Llama-3.1-70B-Instruct achieved a 47.5% success rate, outperforming GPT-4o-based agents while using 3x fewer tokens. On the more complex Mind2Web benchmark, it surpassed prior state-of-the-art methods with a 35.2% success rate. Crucially, this performance came with a 9x reduction in token consumption compared to baseline reflection methods, demonstrating massive efficiency gains.
This architecture lets you build powerful agents on smaller, locally-hosted models. You can achieve state-of-the-art performance without relying on expensive proprietary APIs, saving costs and ensuring data privacy. This approach, alongside multi-agent frameworks, offers a new path for building smarter systems.⁸
🔧 Use This Today
Implement this as a control loop in your agent's orchestration code. At each turn, maintain a 'holistic_plan' string. First, call your LLM to generate a 'local_action' based on the current state. Second, make another LLM call with a specific prompt to generate a 'fitness_score' (1-10). If the score is below a threshold (e.g., 5), trigger a third call to update the 'holistic_plan' based on the failure. Finally, a fourth call synthesizes the plan and action.
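Here's what one turn of that loop can look like. The prompt wording and the threshold default are illustrative assumptions, not the paper's exact prompts, and call_llm is again a placeholder for your chat-completion client.

```python
# Compact sketch of the four-call loop described above. Prompts and the
# fitness threshold are illustrative assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def agent_step(holistic_plan: str, state: str, threshold: int = 5) -> tuple[str, str]:
    """Run one turn; returns (possibly revised plan, action to execute)."""
    # 1) Local strategy: propose the immediate next action from current state.
    local_action = call_llm(
        f"Plan:\n{holistic_plan}\n\nCurrent state:\n{state}\n\n"
        "Propose the single best next action."
    )
    # 2) Fitness score: how well does the proposed action serve the plan?
    fitness_score = int(call_llm(
        f"Plan:\n{holistic_plan}\n\nProposed action:\n{local_action}\n\n"
        "Rate the action's alignment with the plan from 1 to 10. "
        "Reply with a number only."
    ).strip())
    # 3) Reflection: if alignment is poor, revise the high-level plan.
    if fitness_score < threshold:
        holistic_plan = call_llm(
            f"The plan:\n{holistic_plan}\n\nled to a poorly aligned action:\n"
            f"{local_action}\n\nRevise the plan to fix this."
        )
    # 4) Synthesis: integrate plan and action into the step actually executed.
    final_action = call_llm(
        f"Plan:\n{holistic_plan}\n\nCandidate action:\n{local_action}\n\n"
        "Output the final action to execute."
    )
    return holistic_plan, final_action
```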
⚖️ Limitation:
The method has only been validated on academic benchmarks, not on complex, dynamic real-world websites or applications.
Citations:
¹ Xinze Li et al. (2024) - Long Context vs. RAG for LLMs: An Evaluation and Revisits
² Zhuowan Li et al. (2024) - Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
³ Chejian Xu et al. (2025) - From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
⁴ Zhenrui Yue et al. (2024) - Inference Scaling for Long-Context Retrieval Augmented Generation
⁵ Zhichao Xu et al. (2025) - RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation
That's it for this week.
Happy Researching!
Paper Lantern Team