World Models for Embodied AI

In embodied AI, a world model is a learned internal simulator: it compresses observations and actions into a predictive latent state and rolls out future states to support perception, prediction, planning, control, and counterfactual reasoning. A Comprehensive Survey on World Models for Embodied AI explicitly limits its scope to models that produce actionable predictions, excluding purely static scene descriptors and visual generators that are not controlled by actions.

Mathematical Structure

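The paper describes embodied interaction with a POMDP formalization. The variables are: $o_{t}$ is the observation at step $t$, $a_{t}$ is the action, $s_{t}$ is the true state, which cannot be observed directly, $z_{t}$ is the learned latent state, $\theta$ denotes the generative model parameters, and $\phi$ denotes the inference model parameters.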

The core world model has three components:

Dynamics Prior: $p_{\theta}(z_{t} \mid z_{t-1}, a_{t-1})$

Filtered Posterior: $q_{\phi}(z_{t} \mid z_{t-1}, a_{t-1}, o_{t})$

Reconstruction: $p_{\theta}(o_{t} \mid z_{t})$

The joint distribution is written as a product of action-conditioned latent transitions and observation decoders:

$$p_{\theta}(o_{1:T}, z_{0:T} \mid a_{0:T-1}) = p_{\theta}(z_{0}) \prod_{t=1}^{T} p_{\theta}(z_{t} \mid z_{t-1}, a_{t-1})\, p_{\theta}(o_{t} \mid z_{t}).$$
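
One way to read the factorization above is as a recipe for ancestral sampling: draw $z_{0}$, then alternate a latent transition and an observation decode. The following is a minimal sketch under loud assumptions: scalar latents, linear-Gaussian transition and decoder, and made-up gains and noise scales (`trans_gain`, `act_gain`, `trans_std`, `obs_std` are hypothetical names, not quantities from the survey).

```python
import random


def sample_trajectory(actions, rng, trans_gain=0.9, act_gain=0.5,
                      trans_std=0.1, obs_std=0.2):
    """Ancestral sampling from p(z_0) * prod_t p(z_t | z_{t-1}, a_{t-1}) p(o_t | z_t).

    The scalar linear-Gaussian forms and every default value here are
    illustrative assumptions, not the parameterization of any real model.
    """
    z = rng.gauss(0.0, 1.0)  # z_0 ~ p(z_0), assumed standard normal
    latents, observations = [z], []
    for a_prev in actions:  # actions = [a_0, ..., a_{T-1}]
        # z_t ~ p(z_t | z_{t-1}, a_{t-1})
        z = rng.gauss(trans_gain * z + act_gain * a_prev, trans_std)
        # o_t ~ p(o_t | z_t), identity decoder plus noise
        o = rng.gauss(z, obs_std)
        latents.append(z)
        observations.append(o)
    return latents, observations


rng = random.Random(0)
actions = [1.0, 1.0, -1.0, 0.0]
latents, observations = sample_trajectory(actions, rng)
```

The index bookkeeping mirrors the formula: `latents` holds $z_{0:T}$ (one more entry than there are actions), while `observations` holds $o_{1:T}$.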

The true posterior $p_{\theta}(z_{0:T} \mid o_{1:T}, a_{0:T-1})$ is intractable, so the paper uses a time-factorized variational posterior:

$$q_{\phi}(z_{0:T} \mid o_{1:T}, a_{0:T-1}) = q_{\phi}(z_{0} \mid o_{1}) \prod_{t=1}^{T} q_{\phi}(z_{t} \mid z_{t-1}, a_{t-1}, o_{t}).$$

The training objective is the ELBO (evidence lower bound):

$$\log p_{\theta}(o_{1:T} \mid a_{0:T-1}) \geq \mathbb{E}_{q_{\phi}}\left[\log \frac{p_{\theta}(o_{1:T}, z_{0:T} \mid a_{0:T-1})}{q_{\phi}(z_{0:T} \mid o_{1:T}, a_{0:T-1})}\right] = \mathcal{L}(\theta, \phi).$$

Under the Markov factorization, the ELBO can be read as a reconstruction objective with a KL regularizer subtracted:

$$\mathcal{L}(\theta, \phi) = \sum_{t=1}^{T} \mathbb{E}_{q_{\phi}(z_{t})}\left[\log p_{\theta}(o_{t} \mid z_{t})\right] - D_{\mathrm{KL}}\left(q_{\phi}(z_{0:T} \mid o_{1:T}, a_{0:T-1}) \,\|\, p_{\theta}(z_{0:T} \mid a_{0:T-1})\right).$$
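
As a worked step (derived here from the two factorizations above rather than quoted from the survey), the trajectory-level KL term splits into per-step terms. This assumes the latent prior is the joint distribution with observations marginalized out, $p_{\theta}(z_{0:T} \mid a_{0:T-1}) = p_{\theta}(z_{0}) \prod_{t=1}^{T} p_{\theta}(z_{t} \mid z_{t-1}, a_{t-1})$, and writes $q_{\phi}(z_{t-1})$ for the marginal of the variational posterior at step $t-1$:

```latex
\begin{aligned}
D_{\mathrm{KL}}\bigl(q_{\phi}(z_{0:T} \mid o_{1:T}, a_{0:T-1}) \,\|\, p_{\theta}(z_{0:T} \mid a_{0:T-1})\bigr)
&= D_{\mathrm{KL}}\bigl(q_{\phi}(z_{0} \mid o_{1}) \,\|\, p_{\theta}(z_{0})\bigr) \\
&\quad + \sum_{t=1}^{T} \mathbb{E}_{q_{\phi}(z_{t-1})}\Bigl[
D_{\mathrm{KL}}\bigl(q_{\phi}(z_{t} \mid z_{t-1}, a_{t-1}, o_{t}) \,\|\, p_{\theta}(z_{t} \mid z_{t-1}, a_{t-1})\bigr)\Bigr].
\end{aligned}
```

Each summand compares the filtered posterior with the dynamics prior at a single step, which is why the KL term can be read as pulling the two together step by step.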

Intuition

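The filtered posterior $q_{\phi}$ is the recognition side: it sees the current observation $o_{t}$ and compresses the history into the latent state $z_{t}$. The dynamics prior $p_{\theta}$ is the imagination side: without access to future observations, it advances the latent state from $z_{t-1}$ and action $a_{t-1}$. The reconstruction term $p_{\theta}(o_{t} \mid z_{t})$ keeps the latent state from being an arbitrary embedding by forcing it to retain the information needed to predict observations.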
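
The split between recognition and imagination can be sketched with a toy scalar model. Everything here is an assumption made for illustration: the linear transition, the fixed correction weight `obs_weight`, and the identity decoder are hypothetical stand-ins for learned networks, and only distribution means are tracked.

```python
def prior_step(z_prev, a_prev, trans_gain=0.9, act_gain=0.5):
    """Imagination side: mean of p(z_t | z_{t-1}, a_{t-1}); uses no observation."""
    return trans_gain * z_prev + act_gain * a_prev


def posterior_step(z_prev, a_prev, o_t, obs_weight=0.7):
    """Recognition side: mean of q(z_t | z_{t-1}, a_{t-1}, o_t).

    The prior prediction is corrected toward the current observation.
    """
    predicted = prior_step(z_prev, a_prev)
    return predicted + obs_weight * (o_t - predicted)


def decode(z_t):
    """Reconstruction mean of p(o_t | z_t); an identity decoder is assumed."""
    return z_t


def filter_then_imagine(z0, past_actions, past_obs, future_actions):
    """Filter over the observed history, then roll out without observations."""
    z = z0
    for a_prev, o_t in zip(past_actions, past_obs):
        z = posterior_step(z, a_prev, o_t)
    z_filtered = z
    imagined_obs = []
    for a in future_actions:
        z = prior_step(z, a)
        imagined_obs.append(decode(z))
    return z_filtered, imagined_obs


z_filtered, imagined_obs = filter_then_imagine(
    z0=0.0, past_actions=[1.0, 1.0], past_obs=[0.5, 1.0], future_actions=[0.0, 0.0]
)
```

The second loop never touches an observation, which is the property that makes the prior usable for planning.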

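The two ELBO terms encode a tension: the reconstruction term wants $z_{t}$ to be informative enough about the observations, while the KL term wants the filtered posterior to stay close to the action-conditioned dynamics prior. If the KL term is too weak, the model may learn only a posterior encoding and fail to roll out; if the reconstruction term is too weak, the latent dynamics may roll out but lose interpretable state fidelity.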
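
The tension can be made concrete for a single step with scalar Gaussians, where both terms have closed forms. This is a sketch under assumptions: an identity decoder with noise `obs_std`, and a hypothetical `kl_weight` knob that is not part of the ELBO above (setting it to 1.0 recovers the plain per-step term).

```python
import math


def expected_log_lik(o_t, post_mean, post_std, obs_std):
    """E_{z ~ N(post_mean, post_std^2)}[log N(o_t; z, obs_std^2)], closed form."""
    return (-0.5 * math.log(2.0 * math.pi * obs_std ** 2)
            - ((o_t - post_mean) ** 2 + post_std ** 2) / (2.0 * obs_std ** 2))


def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL(N(mean_q, std_q^2) || N(mean_p, std_p^2)) for scalar Gaussians."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)


def step_terms(o_t, post_mean, post_std, prior_mean, prior_std,
               obs_std=0.5, kl_weight=1.0):
    """Return (reconstruction, kl, weighted objective) for one time step."""
    recon = expected_log_lik(o_t, post_mean, post_std, obs_std)
    kl = gaussian_kl(post_mean, post_std, prior_mean, prior_std)
    return recon, kl, recon - kl_weight * kl


# Posterior A follows the observation; posterior B just copies the prior.
recon_a, kl_a, obj_a = step_terms(o_t=1.0, post_mean=1.0, post_std=0.5,
                                  prior_mean=0.0, prior_std=1.0)
recon_b, kl_b, obj_b = step_terms(o_t=1.0, post_mean=0.0, post_std=1.0,
                                  prior_mean=0.0, prior_std=1.0)
```

Posterior A scores better on reconstruction but pays a positive KL cost, while posterior B pays no KL cost and reconstructs poorly. Raising `kl_weight` shifts the preference toward B.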

flowchart LR
  A["history<br/>o_1:t, a_0:t-1"] --> B["filtered posterior<br/>q_phi(z_t given z_{t-1}, a_{t-1}, o_t)"]
  B --> C["latent state z_t<br/>predictive memory"]
  C --> D["dynamics prior<br/>p_theta(z_{t+1} given z_t, a_t)"]
  D --> E["imagined future<br/>z_{t+1:T}"]
  C --> F["reconstruction<br/>p_theta(o_t given z_t)"]
  E --> G["planning / policy optimization / MPC / counterfactuals"]

As a Visual Subgoal Generator

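π0.7 gives the world model a narrower but very practical role: the world model does not output robot actions directly, and it does not necessarily roll out a long-horizon trajectory. Instead, it turns the current observation $o_{t}$, the semantic subtask $\hat{\ell}_{t}$, and metadata $m$ into a near-future visual goal: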

$$g^{\star} \sim p_{\psi}(g^{\star} \mid o_{t}, \hat{\ell}_{t}, m).$$

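This $g^{\star}$ consists of multi-view subgoal images, which then enter the VLA's context $C_{t}$ and condition action chunk prediction. Intuitively, it converts spatial details that are hard to specify in language into a visual target: for example, how the gripper should approach a handle, what shape a cloth should be folded into, or in which view an object should appear.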

This shows that a world model can serve as a decision-coupled intermediate representation: it need not do the planning itself, but it changes the policy's action distribution. Evaluation therefore cannot rely on generated image fidelity alone; it must also check whether the subgoal images improve closed-loop instruction following, cross-embodiment transfer, or compositional generalization.
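
The evaluation point can be illustrated with a deliberately tiny closed-loop toy. It is not π0.7's interface: the 1-D position standing in for an image, the clipped-step `propose_subgoal`, the proportional `policy`, and the success tolerance are all hypothetical choices. What it shows is the shape of the measurement, which is task success with and without the subgoal in the policy's context.

```python
def propose_subgoal(obs, subtask_target, step_limit=0.2):
    """Stand-in for sampling g* given (o_t, subtask): a near-future target.

    Here the "image" is just a 1-D position at most step_limit away from obs.
    """
    delta = max(-step_limit, min(step_limit, subtask_target - obs))
    return obs + delta


def policy(obs, subgoal, gain=0.5):
    """Subgoal-conditioned action: move a fraction of the way to the subgoal."""
    return gain * (subgoal - obs)


def closed_loop_success(start, subtask_target, use_subgoal, steps=50, tol=0.05):
    """Run the loop and report task success, not subgoal quality."""
    obs = start
    for _ in range(steps):
        # Ablation: without a generated subgoal the policy sees no target.
        subgoal = propose_subgoal(obs, subtask_target) if use_subgoal else obs
        obs = obs + policy(obs, subgoal)
    return abs(obs - subtask_target) <= tol


with_subgoal = closed_loop_success(0.0, 1.0, use_subgoal=True)
without_subgoal = closed_loop_success(0.0, 1.0, use_subgoal=False)
```

The metric compares task outcomes between the two conditions, so a subgoal generator is credited only when it changes what the policy achieves.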

As Latent Dynamics Pretraining

LDA-1B offers another decision-coupled role: the world model neither generates RGB subgoal images nor performs MPC on its own. Instead, it co-trains policy, forward dynamics, inverse dynamics, and visual forecasting in DINO latent space. The future observation target is represented as $z_{t+1:t+k} = f_{\mathrm{DINO}}(o_{t+1:t+k})$ and enters a diffusion-style denoising objective together with the action chunk $a_{t+1:t+k}$.

This design places the value of the world model in representation learning and policy pretraining. High-quality demonstrations can train the action policy; low-quality trajectories can still train action-conditioned dynamics; actionless egocentric videos train visual forecasting. Compared with pixel-space UWM, the central claim of the LDA-1B source is that a structured DINO latent reduces the burden of appearance modeling and widens the range of usable mixed-quality embodied data.
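
The data-to-objective mapping in the paragraph above can be written as a small routing function. This is a hypothetical sketch, not LDA-1B's data pipeline: the `Clip` fields and objective names are invented for illustration, and the nesting (every clip also feeds visual forecasting, every clip with actions also feeds dynamics) is an assumption that goes beyond what the text states.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """Hypothetical description of one training clip."""
    has_actions: bool
    high_quality: bool


def active_objectives(clip: Clip) -> List[str]:
    """Pick which training objectives a clip can supervise."""
    # Assumption: any clip with future frames can supervise forecasting.
    objectives = ["visual_forecasting"]
    if clip.has_actions:
        # Action labels make action-conditioned dynamics learnable,
        # even when the behavior itself is poor.
        objectives.append("action_conditioned_dynamics")
        if clip.high_quality:
            # Only good demonstrations are imitated by the policy head.
            objectives.append("policy")
    return objectives


egocentric_video = Clip(has_actions=False, high_quality=False)
noisy_teleop = Clip(has_actions=True, high_quality=False)
expert_demo = Clip(has_actions=True, high_quality=True)
routing = [active_objectives(c) for c in (egocentric_video, noisy_teleop, expert_demo)]
```

Under this routing no clip is discarded. Lower-quality data simply supervises fewer heads.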

Failure Modes

Long-horizon error accumulation: Sequential Simulation and Inference rolls out step by step, so early state errors feed into later inputs and cause temporal drift.
Weak physical consistency: pixel-level metrics such as FID, FVD, and LPIPS may report good scores without checking dynamics, causality, or physical constraints.
Real-time latency: Transformer and Diffusion backbones perform well, but their inference cost may not fit the time budget of a robot control loop or autonomous driving planning.
Dataset fragmentation: manipulation, navigation, driving, and video pretraining use different modalities, scales, and protocols, which limits cross-domain generalization.
Spatial bottleneck: Global Latent Vector is efficient but loses detail; Token Feature Sequence is expressive but pays for it in sequence length; Spatial Latent Grid depends on geometry priors; NeRF/3DGS-style Decomposed Rendering Representation is high-fidelity but scales poorly to dynamic scenes.
Evaluation heterogeneity: benchmark comparisons are often confounded by differences in input modality, auxiliary supervision, resolution, episode budget, and task subset.
Frozen latent bottleneck: LDA-1B shows that DINO latents help scaling, but it also acknowledges that fixed DINO visual features are a limitation; if the force, tactile, or material state needed for downstream control is absent from the latent, the world model may predict plausible future features yet lack the variables that control depends on.
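
The first failure mode in this list, long-horizon error accumulation, is easy to reproduce in a toy setting. The sketch below assumes scalar linear dynamics and a learned model whose only flaw is a slightly wrong transition gain (0.97 instead of 0.95); both numbers are arbitrary illustrative choices.

```python
def rollout(z0, actions, trans_gain, act_gain=0.5):
    """Open-loop rollout: each predicted state becomes the next input."""
    z, trajectory = z0, []
    for a in actions:
        z = trans_gain * z + act_gain * a
        trajectory.append(z)
    return trajectory


actions = [1.0] * 20
true_traj = rollout(1.0, actions, trans_gain=0.95)   # assumed ground truth
model_traj = rollout(1.0, actions, trans_gain=0.97)  # slightly wrong model
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]
```

The one-step error is 0.02, but because each prediction is fed back as input, the gap widens at every step of the 20-step horizon. A one-step accuracy metric would not reveal this.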

Practical Implications

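For MPC and model-based RL, the value of a world model lies in a transition model that can be rolled out, not in attractive reconstructions. Evaluation should check whether imagined futures can change the action choice and whether the model remains stable in a closed-loop setting.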
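
As a minimal sketch of how a rollable transition model drives action choice, the code below runs random-shooting MPC on a toy scalar latent. The assumptions are substantial: `model_step` is a hypothetical stand-in for a learned dynamics mean, the environment is taken to match the model exactly, and random shooting is just one simple planner, not the method of any system discussed here.

```python
import random


def model_step(z, a):
    """Hypothetical learned transition mean for a scalar latent."""
    return 0.9 * z + 0.5 * a


def plan_first_action(z, goal, rng, horizon=5, candidates=64):
    """Random shooting: imagine candidate action sequences, keep the best."""
    best_cost, best_action = float("inf"), 0.0
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z_imagined, cost = z, 0.0
        for a in seq:
            z_imagined = model_step(z_imagined, a)  # rollout, no observations
            cost += (z_imagined - goal) ** 2
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action


def run_closed_loop(z0, goal, steps=15, seed=0):
    """Replan at every step and execute only the first planned action."""
    rng = random.Random(seed)
    z = z0
    for _ in range(steps):
        a = plan_first_action(z, goal, rng)
        z = model_step(z, a)  # assumption: environment equals the model
    return z


final_state = run_closed_loop(z0=0.0, goal=2.0)
```

The decoder never appears here: the planner only needs the transition model, which is the sense in which rollout quality matters more than reconstruction quality. Swapping the environment step for different dynamics would turn this into a closed-loop stability check.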

For robotics sim-to-real, a learned world model may reduce the mismatch of hand-designed simulators, but it may also turn dataset bias or pixel-level artifacts into a new simulation-to-reality gap. Real-robot validation, physical consistency, and causal intervention metrics therefore belong in the evaluation loop.

For foundation-model-style embodied agents, WorldModelTaxonomy is a reminder not to call every video predictor a world model. A model is a world model in the embodied AI sense only when its representation, temporal rollout, and action coupling can support downstream decisions.
