Research paper

EGOSTREAM

A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision

Rosario Forte, Giuseppe Lando, Antonino Furnari

University of Catania

EGOSTREAM asks what models remember, for how long, and when an answer stops being valid as the observed world changes.

Egocentric video · Evidence grounded · Streaming only

GitHub (coming soon)

Benchmark

2,250 validated memory questions

8,528 recall-conditioned evaluations

7 episodic memory dimensions

45.3h maximum Answer Validity Window

Ego4D · EgoLife · EgoTempo · Multi-Hop EgoQA · HD-EPIC

Memory dimensions

The benchmark organizes questions into detail, spatial, temporal, event, social, causal, and prospective memory.

Answer Validity Window

AVW marks the span during which a past answer remains factually correct, separating model forgetting from natural scene changes.
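As a rough illustration of how an Answer Validity Window could gate evaluation, the sketch below checks whether a recall time falls inside the window. The `MemoryQuestion` record and its field names are hypothetical and are not taken from the benchmark's data format; the window is assumed to start when the evidence is observed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryQuestion:
    """Hypothetical record; field names are illustrative, not the benchmark schema."""
    evidence_time_s: float  # when the supporting evidence was observed
    avw_end_s: float        # last moment the answer is still factually correct


def is_answer_valid(question: MemoryQuestion, recall_time_s: float) -> bool:
    """Return True if the stored answer is still correct at recall_time_s.

    Assumption: the window is closed on both ends and begins at the evidence time.
    """
    return question.evidence_time_s <= recall_time_s <= question.avw_end_s


# Evidence seen at 2 minutes; the answer stays correct until the 1-hour mark.
q = MemoryQuestion(evidence_time_s=120.0, avw_end_s=3600.0)
inside = is_answer_valid(q, 600.0)    # recall inside the window
outside = is_answer_valid(q, 7200.0)  # recall after the scene has changed
```

Under this reading, an error made at a recall time inside the window can be attributed to the model's memory, while a query outside the window would not be scored against the old answer.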

Recall regimes

Questions are evaluated across recall regimes from 0 seconds through short, mid, long, and ultra-long horizons.

Instant: 0 s

Short: 0 to 5 s

Short-mid: 5 to 30 s

Mid: 30 s to 3 min

Mid-long: 3 to 30 min

Long: 30 min to 8 h

Ultra-long: 8 h+
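The regime boundaries listed above can be turned into a small lookup. This is a minimal sketch: the function name is hypothetical, and treating each upper bound as inclusive is an assumption, since the page does not say which regime a delay of exactly 5 s or 30 s belongs to.

```python
# (regime name, inclusive upper bound in seconds); inclusivity is an assumption.
REGIME_UPPER_BOUNDS = [
    ("Instant", 0.0),
    ("Short", 5.0),
    ("Short-mid", 30.0),
    ("Mid", 3 * 60.0),
    ("Mid-long", 30 * 60.0),
    ("Long", 8 * 3600.0),
]


def recall_regime(delay_s: float) -> str:
    """Map the delay between evidence and question to a recall regime name."""
    if delay_s < 0:
        raise ValueError("recall delay must be non-negative")
    for name, upper_bound_s in REGIME_UPPER_BOUNDS:
        if delay_s <= upper_bound_s:
            return name
    return "Ultra-long"  # anything beyond 8 hours
```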

Protocol

Questions are curated from Ego4D Episodic Memory VQA, EgoLife, EgoTempo, Multi-Hop EgoQA, and HD-EPIC.

Models process video sequentially and answer using only visual content observed up to the recall time.
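The streaming constraint can be pictured as a filter over a time-ordered frame stream. The sketch below is illustrative only: `Frame` and `frames_visible_at` are hypothetical names, the payloads are placeholders, and the stream is assumed to be sorted by timestamp.

```python
from typing import Iterable, Iterator, Tuple

Frame = Tuple[float, str]  # (timestamp in seconds, placeholder frame payload)


def frames_visible_at(stream: Iterable[Frame], recall_time_s: float) -> Iterator[Frame]:
    """Yield frames in order and stop before any frame later than the recall time."""
    for timestamp_s, payload in stream:
        if timestamp_s > recall_time_s:
            break  # relies on the stream being sorted by timestamp
        yield timestamp_s, payload


stream = [(0.0, "f0"), (1.0, "f1"), (2.0, "f2"), (3.0, "f3")]
visible = list(frames_visible_at(stream, recall_time_s=2.0))
```

Here the frame at 3.0 s is never handed to the model, so an answer given at 2.0 s cannot draw on future content.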

Performance is measured as multiple-choice accuracy, with breakdowns by memory category and recall regime.
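A per-category, per-regime accuracy table could be computed along the following lines. The record layout (category, regime, predicted option, correct option) is an assumption made for this sketch and is not the benchmark's official output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str, str]  # (category, regime, predicted option, correct option)


def accuracy_breakdown(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Return multiple-choice accuracy for each (category, regime) pair."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for category, regime, predicted, correct in records:
        key = (category, regime)
        totals[key] += 1
        hits[key] += int(predicted == correct)
    return {key: hits[key] / totals[key] for key in totals}


records = [
    ("spatial", "Short", "A", "A"),
    ("spatial", "Short", "B", "C"),
    ("causal", "Long", "D", "D"),
]
breakdown = accuracy_breakdown(records)
```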

The unified framework compares sliding windows, attention sinks, KV-cache pruning, merging, and offloading.
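To make two of these mechanisms concrete, the sketch below computes which token positions survive when a sliding window is combined with attention sinks, that is, a few initial tokens kept permanently alongside the most recent ones. It shows the general idea only and is not the framework's implementation; the function name and parameters are hypothetical.

```python
from typing import List


def retained_positions(num_tokens: int, num_sinks: int, window: int) -> List[int]:
    """Return cache positions kept by a sink-plus-sliding-window policy.

    The first num_sinks tokens are always kept, together with the most recent
    `window` tokens; everything in between is evicted.
    """
    if num_tokens < 0 or num_sinks < 0 or window < 0:
        raise ValueError("arguments must be non-negative")
    sinks = list(range(min(num_sinks, num_tokens)))
    # Start the recent window after the sinks so no position is listed twice.
    recent_start = max(len(sinks), num_tokens - window)
    return sinks + list(range(recent_start, num_tokens))


kept = retained_positions(num_tokens=10, num_sinks=2, window=3)
```

With ten tokens, two sinks, and a window of three, positions 0, 1, 7, 8, and 9 remain, so cache size stays bounded however long the stream runs. Pruning, merging, and offloading differ in how they treat the evicted middle positions.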

Current gap

Aggregate scores can hide different memory profiles across semantic categories and recall horizons.

Token pruning better preserves detail and temporal structure, while quantized offloading helps ultra-long recall.

Even the top-performing mechanisms plateau at roughly 45% accuracy and run slower than real time.