Preprint

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

UCLA
5 memory abilities · 451 questions · up to 500 trajectories · up to 115M tokens

Overview

LongMemEval-V2 tests whether memory systems can turn long web-agent histories into reusable environment experience.

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. LongMemEval-V2 evaluates whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments.

The benchmark pairs manually curated questions with long histories of multimodal web-agent trajectories. Memory systems consume the trajectory history and return compact evidence for downstream question answering, making accuracy and query latency both central outcomes.

Figure: Examples of LongMemEval-V2 questions across memory abilities. Questions cover static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness.

Benchmark

The benchmark is organized around five abilities that distinguish an experienced colleague from a stateless web agent.

What defines an experienced colleague?

  • Static state recall: remembers important landmarks, page layouts, module affordances, and subtle state differences.
  • Dynamic state tracking: understands how states and actions change the environment over time.
  • Workflow knowledge: knows the steps needed to complete recurring tasks in customized environments.
  • Environment gotchas: recognizes recurring local failure modes and avoids environment-specific traps.
  • Premise awareness: detects assumptions that are valid elsewhere but wrong in the current deployment.

Benchmark construction

1. Collect web-agent trajectories
2. Annotate memory questions
3. Label answer evidence
4. Assemble sparse haystacks
Figure: Question distribution statistics. Questions span domains, types, and answer formats.
Figure: Haystack trajectory and token statistics. History haystacks are large, while answer-bearing trajectories remain sparse.
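To make step 4 concrete, here is a minimal sketch of sparse-haystack assembly: the few answer-bearing trajectories are mixed into a much larger pool of distractors. The names and fields are illustrative assumptions, not the released pipeline.

import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    traj_id: str
    tokens: int
    answers_question: bool  # True if this trajectory carries labeled answer evidence

def assemble_haystack(pool: list[Trajectory], target_size: int, seed: int = 0) -> list[Trajectory]:
    """Mix the few answer-bearing trajectories into a long distractor history."""
    evidence = [t for t in pool if t.answers_question]
    distractors = [t for t in pool if not t.answers_question]
    rng = random.Random(seed)
    haystack = evidence + rng.sample(distractors, target_size - len(evidence))
    rng.shuffle(haystack)  # answer evidence stays sparse and unordered in the history
    return haystack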
Group | Benchmark | Domain | Sessions | Tokens | MM Context | Questions | MM Questions | Static | Dynamic | Workflow | Gotchas | Premise
General long context | LongBench V2 | Mixed | N/A | 260K | No | 503 | No | Yes | No | No | No | No
General long context | MemoryAgentBench | Mixed | N/A | 285K | No | 2,071 | No | Yes | No | No | No | No
General long context | CL-Bench | Mixed | N/A | 10K | No | 1,899 | No | Yes | No | Yes | No | No
Conversational long context | LoCoMo | User-user chat | 28 | ~16K | Yes | 7,512 | No | Yes | Yes | No | No | Yes
Conversational long context | LongMemEval-V1 | User-assistant chat | 48-475 | 115K-1.5M | No | 500 | No | Yes | Yes | No | No | Yes
Conversational long context | PersonaMem | User-assistant chat | 5-60 | 26K-951K | No | 5,990 | No | Yes | Yes | No | No | No
Conversational long context | PersonaMem-v2 | User-assistant chat | 10-20 | 33K-124K | Yes | 5,000 | No | Yes | Yes | No | No | No
Conversational long context | BEAM | User-assistant chat | 4.5-100 | 124K-10M | No | 2,000 | No | Yes | Yes | No | No | Yes
Agentic long context | MemoryArena | Agent mixed | 7 | 40K+ | No | 766 | No | Yes | No | Yes | Yes | No
Agentic long context | AgentLongBench | Game agent | 1 | 31K-4M | No | 6,400 | No | Yes | Yes | No | No | No
Agentic long context | EMemBench | Game agent | 1 | 2K-∞ | Yes | 1,280+ | No | Yes | No | Yes | Yes | Yes
Agentic long context | FileGramBench | File-system agent | 12 | 11K | Yes | 4,333 | No | Yes | No | Yes | No | No
Agentic long context | AMA-Bench | Agent mixed | 1 | 57K | No | 2,496 | No | Yes | Yes | Yes | Yes | No
Agentic long context | LongMemEval-V2 | Web agent | 100-498 | 25M-115M | Yes | 451 | Yes | Yes | Yes | Yes | Yes | Yes
MM = multimodal; the final five columns mark which memory abilities each benchmark covers.

Evaluation

Memory systems are evaluated by the evidence they retrieve and the latency they add before the reader answers.
1. Insert: stream trajectories
2. Memory: store experience
3. Query: gather evidence
4. Reader: answer from context

LME-V2 uses a context-gathering formulation. A memory system sequentially inserts trajectories, then returns a bounded multimodal memory context for each question. A fixed reader answers from that context.

Insert(h); Query(q) → context; Reader(q, context) → answer

The evaluation reports answer accuracy and query latency, so memory modules are measured by both evidence quality and operational cost.
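A minimal sketch of this protocol, assuming a hypothetical MemorySystem interface; the class and function names below are ours, not the benchmark's released API. Only the query step is timed, matching the reported query-latency metric.

import time
from abc import ABC, abstractmethod
from typing import Callable

class MemorySystem(ABC):
    """Hypothetical interface for a submitted memory module."""

    @abstractmethod
    def insert(self, trajectory: dict) -> None:
        """Consume one multimodal web-agent trajectory from the history."""

    @abstractmethod
    def query(self, question: str) -> str:
        """Return a bounded memory context for the question."""

def run_episode(memory: MemorySystem,
                history: list[dict],
                question: str,
                reader: Callable[[str, str], str]) -> tuple[str, float]:
    for trajectory in history:              # steps 1-2: stream and store experience
        memory.insert(trajectory)
    start = time.perf_counter()
    context = memory.query(question)        # step 3: gather evidence
    latency = time.perf_counter() - start   # query latency is a reported metric
    answer = reader(question, context)      # step 4: fixed reader answers
    return answer, latency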

Pilot Studies

Pilot studies show that the questions depend on environment-specific evidence rather than parametric model knowledge.
Figure: No-context pilot study accuracy by model. Frontier LLMs perform poorly without trajectory evidence; the best no-context result reaches 14.1% overall accuracy.
  • 14.1%: best no-context accuracy
  • 65.3%: GPT-5.4-mini with oracle trajectories
  • 86.3%: oracle slices plus notes
  • 89.7%: Codex with oracle trajectory files

The pilot studies show that LME-V2 questions require environment-specific experience, and that reading long multimodal trajectories is itself difficult even when answer-bearing trajectories are known. This motivates the AgentRunbook designs evaluated below.

AgentRunbook

AgentRunbook provides two initial memory designs for turning trajectory histories into query-time evidence.
Figure: AgentRunbook memory module overview. AgentRunbook explores two memory designs: a structured retrieval pipeline and a file-based coding-agent memory controller.

AgentRunbook-R

RAG memory with three knowledge pools for targeted recall; a retrieval sketch follows the list.

  • Raw state slices for fine-grained UI evidence.
  • State transition events for environment dynamics.
  • Procedure and hint notes for workflows and gotchas.
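A minimal sketch of query-time evidence gathering over the three pools, assuming each pool exposes a simple top-k retriever. The Retriever interface and the merge policy are assumptions, not the released implementation.

from typing import Callable

# Each pool maps (query, k) to its top-k evidence snippets.
Retriever = Callable[[str, int], list[str]]

def gather_evidence(query: str,
                    slice_pool: Retriever,   # raw state slices: fine-grained UI evidence
                    event_pool: Retriever,   # state transition events: environment dynamics
                    note_pool: Retriever,    # procedure and hint notes: workflows and gotchas
                    k: int = 5) -> str:
    """Query all three pools and merge hits into one bounded context."""
    sections = [
        ("STATE SLICES", slice_pool(query, k)),
        ("TRANSITION EVENTS", event_pool(query, k)),
        ("PROCEDURE NOTES", note_pool(query, k)),
    ]
    parts = []
    for name, hits in sections:
        parts.append(f"## {name}")
        parts.extend(hits)
    return "\n".join(parts)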

AgentRunbook-C

File-based memory that invokes a coding agent to gather evidence; an example helper script follows the list.

  • Trajectory files stored directly on disk.
  • Workflow documents and memory manifests.
  • Helper scripts for state and trajectory inspection.
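For flavor, a hypothetical helper script of the kind AgentRunbook-C could expose to its coding agent: it scans trajectory files on disk for steps mentioning a keyword, so the agent can open only the matching files. The on-disk layout and naming scheme are assumptions.

import json
from pathlib import Path

def grep_trajectories(root: Path, keyword: str) -> list[str]:
    """Scan trajectory files for steps that mention a keyword, so the coding
    agent can locate candidate evidence without reading every file."""
    hits = []
    for path in sorted(root.glob("trajectory_*.json")):  # assumed naming scheme
        steps = json.loads(path.read_text())
        for i, step in enumerate(steps):
            if keyword.lower() in json.dumps(step).lower():
                hits.append(f"{path.name}:step{i}")
    return hits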

Results

AgentRunbook-C currently gives the strongest average accuracy while improving the accuracy-latency frontier.
Method | Family | Small Overall Acc. | Small Latency | Medium Overall Acc. | Medium Latency
No retrieval | Reader only | 1.3% | 0s | 1.3% | 0s
RAG: query to slice | RAG | 42.8% | 0.1s | 38.1% | 0.1s
RAG: query to slice + notes | RAG | 51.0% | 0.2s | 45.9% | 0.3s
AgentRunbook-R | RAG | 58.6% | 26.9s | 57.0% | 25.8s
Codex | Coding agent | 69.9% | 177.2s | 68.7% | 185.8s
AgentRunbook-C | Coding agent | 74.9% | 108.3s | 70.1% | 139.9s
Figure: Accuracy-latency tradeoff for memory methods. AgentRunbook improves the accuracy-latency frontier for long-term agent memory.

Data Viewer

The viewer samples real benchmark questions and trajectory excerpts from the released dataset.


Leaderboard

Leaderboard entries will summarize submitted memory systems across accuracy, latency, and benchmark tier.

Leaderboard shell

The public leaderboard will use a sortable table with filters for method family, model, memory setting, tier, accuracy, and query latency. Entries and submission instructions will be added after the leaderboard schema is finalized.

Rank | System | Memory Type | Small Acc. | Medium Acc. | Avg. Latency | Status
(Leaderboard entries coming soon.)

Citation

Use the citation below as a placeholder until the final preprint metadata is available.
@article{wu2026longmemevalv2,
  title = {LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues},
  author = {Wu, Di and Ji, Zixiang and Kawatkar, Asmi and Kwan, Bryan and Gu, Jia-Chen and Peng, Nanyun and Chang, Kai-Wei},
  year = {2026},
  note = {Preprint}
}