PermaVid

Consistent Video Generation Across Edits via Disentangled Context Memory

1 Shanghai Jiao Tong University · 2 Stanford University · 3 S-Lab, NTU · 4 CUHK · 5 Shanghai Innovation Institute

Abstract: Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

What is PermaVid?

PermaVid Demo
PermaVid maintains a disentangled multi-modal context memory with an RGB bank for semantic appearance and a depth bank for geometric structure. Given target camera poses and editing operations, it updates and retrieves memory in an edit-aware manner, then fuses mixed-modality references to guide consistent video generation across time, viewpoints, and edits.
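
To make the retrieval step concrete, here is a minimal sketch of pulling mixed-modality references from two banks for a target camera pose. It is an illustration under stated assumptions, not the PermaVid implementation: `MemoryEntry`, `retrieve`, and `mixed_references` are hypothetical names, and ranking stored views by Euclidean distance between camera positions is a simple stand-in for whatever pose-aware retrieval the method actually uses.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MemoryEntry:
    """One stored view. All fields are hypothetical, for illustration only."""
    position: Tuple[float, float, float]  # camera position of the stored view
    modality: str                         # "rgb" (appearance) or "depth" (geometry)
    frame_id: int
    valid: bool = True                    # False once an edit has outdated this entry


def retrieve(bank: List[MemoryEntry],
             target_position: Tuple[float, float, float],
             k: int) -> List[MemoryEntry]:
    """Return the k still-valid entries whose cameras are closest to the target.

    Assumption: plain Euclidean distance between camera positions; a real
    system would likely also account for viewing direction and overlap.
    """
    candidates = [entry for entry in bank if entry.valid]
    candidates.sort(key=lambda entry: math.dist(entry.position, target_position))
    return candidates[:k]


def mixed_references(rgb_bank: List[MemoryEntry],
                     depth_bank: List[MemoryEntry],
                     target_position: Tuple[float, float, float],
                     k: int = 2) -> List[MemoryEntry]:
    """Collect references from both banks; RGB entries come first."""
    return (retrieve(rgb_bank, target_position, k)
            + retrieve(depth_bank, target_position, k))
```

In this sketch, an RGB entry invalidated by an edit is skipped during retrieval, while the depth entry for the same view can still be returned as a geometry reference.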

Consistency Across Time, Viewpoints, Edits

Global Edits Occur

(e.g., style transformation)
  • Preserve stable geometry
  • Propagate edited semantic appearance

Local Edits Occur

(e.g., object-level editing)
  • Recall edited local content
  • Preserve unchanged geometric structure
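
The two edit cases above can be read as an update policy over the two memory banks. The sketch below is one possible toy version, not the paper's algorithm: `Entry`, `ContextMemory`, the string region labels, and the `stale` flag are all hypothetical. It also assumes that edits change appearance only, so depth entries are never invalidated; an edit that reshapes an object would need to refresh the overlapping depth entries as well.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Entry:
    frame_id: int
    regions: Set[str]    # hypothetical labels of scene regions visible in this view
    stale: bool = False  # True once an edit has outdated the stored content


@dataclass
class ContextMemory:
    rgb: List[Entry] = field(default_factory=list)    # appearance bank
    depth: List[Entry] = field(default_factory=list)  # geometry bank

    def record(self, frame_id: int, regions: Set[str]) -> None:
        """Store a newly generated view in both banks."""
        self.rgb.append(Entry(frame_id, set(regions)))
        self.depth.append(Entry(frame_id, set(regions)))

    def apply_global_edit(self) -> None:
        """Global edit (e.g., a style change): all stored appearance is outdated.

        Depth entries are left untouched so stable geometry is preserved.
        """
        for entry in self.rgb:
            entry.stale = True

    def apply_local_edit(self, region: str) -> None:
        """Local edit: outdate only the RGB views that see the edited region.

        Assumption: the edit does not change geometry, so depth stays valid.
        """
        for entry in self.rgb:
            if region in entry.regions:
                entry.stale = True

    def usable(self) -> Tuple[List[int], List[int]]:
        """Frame ids that may still serve as (rgb, depth) references."""
        return ([e.frame_id for e in self.rgb if not e.stale],
                [e.frame_id for e in self.depth if not e.stale])
```

After an edit, frames generated from the edited scene are stored again with `record`, which is how this toy version would later recall edited local content or propagate an edited global appearance.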

Long-term Consistency after Global Edits

Long-term Consistency after Local Edits

Citation

@article{yang2026permavid,
  title={PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory},
  author={Yang, Shuai and Gao, Bingjie and Liu, Ziwei and Wang, Jiaqi and Lin, Dahua and Wu, Tong},
  journal={arXiv preprint arXiv:26xx.xxxxx},
  year={2026}
}