MoNaVLA: Mobile Navigation Vision-Language-Action

Decomposition-based VLA for mobile robot basket navigation.
96.6% closed-loop success (PG2 grounding · SigLIP action pipeline)
vs. 0% for E2E Kosmos-2, and ×9.4 over a simple MLP baseline (10.3%).

96.6% · Closed-Loop (SOTA) · Exp66: PG2 grounding + SigLIP
0% · E2E VLA Baseline · Exp11: Kosmos-2 LoRA (text attn 0%)
0.080 m · Best FPE (↓) · LSTM w=16 (MLP w=4: 0.094 m)
×9.4 · Pipeline Gap · vs simple MLP 10.3%

Abstract

MoNaVLA investigates Vision-Language-Action (VLA) models for mobile robot basket navigation. End-to-end fine-tuning of Kosmos-2 with LoRA collapses to 0% closed-loop success because of a structural text-attention failure in the Google-robot post-trained backbone. We instead adopt a decomposition pipeline: PaliGemma2's frozen SigLIP vision encoder extracts frame features (L2-normalized, 256-dim), which are concatenated with an 8-frame bbox history (32-dim) and fed to a 3-layer ActionMLP that predicts one of 8 discrete actions. PaliGemma2 also serves as the grounder: detect gray basket → 4-filter validated bbox → temporal N=3 goal trigger. This pipeline achieves 96.6% closed-loop success (FPE 0.094 m), a ×9.4 improvement over a simple MLP baseline (10.3%). Ablation confirms that the L2-norm + bbox augmentation pipeline is the sole performance driver; the grounding source (HSV heuristic, base PG2, fine-tuned LoRA) makes no difference once the pipeline is correct. A zero-shot linear probe (96.6%) and a basket masking ablation (9/9 action flips) provide causal evidence that the frozen SigLIP encoder independently localizes the basket.
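The input construction described in the abstract can be sketched as follows. This is a minimal illustration, not the project's code: only the dimensions (256-dim feature, 8-frame history, 32-dim bbox block) come from the text. The `(cx, cy, w, h)` box layout, the zero-filling of missed detections, and the names `l2_normalize` and `build_input` are assumptions made for the example.

```python
import math

FEAT_DIM = 256   # SigLIP feature size stated in the text
HISTORY = 8      # bbox history length stated in the text
BBOX_DIM = 4     # assumed (cx, cy, w, h) layout: 8 * 4 = 32 dims


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def build_input(frame_feat, bbox_history):
    """Return a 288-dim vector: L2-normalized feature + flattened bbox history."""
    if len(frame_feat) != FEAT_DIM:
        raise ValueError(f"expected {FEAT_DIM}-dim feature, got {len(frame_feat)}")
    zero_box = (0.0,) * BBOX_DIM
    recent = list(bbox_history)[-HISTORY:]
    # Assumption: missed detections (None) and missing early frames become zero boxes.
    boxes = [zero_box] * (HISTORY - len(recent))
    boxes += [zero_box if b is None else tuple(b) for b in recent]
    flat = [float(v) for box in boxes for v in box]
    return l2_normalize(frame_feat) + flat
```

Under these assumptions, the returned vector (256 + 32 = 288 values) is what an action head such as the ActionMLP would consume. The feature part always has unit norm, so the bbox block keeps a comparable scale regardless of the encoder's raw magnitude.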

Key Findings

Main Results

Method | Architecture | CL ↑ | FPE ↓ | Note
E2E VLA (Exp11) | Kosmos-2 + LoRA | 0.0% | 1.454 m | Text attn 0%, structural failure
Decomp v1 (Exp14) | CLIP + BBox MLP | 66.7% | 0.555 m | First decomposition baseline
Simple MLP (Exp65b) | SigLIP + plain MLP | 10.3% | - | No L2-norm, no aug → pipeline ablation
Ours (Exp66) ★ | PG2 grounding + SigLIP + L2-aug | 96.6% | 0.094 m | SOTA · MLP w=4
Ours (Exp66 LSTM) | PG2 grounding + SigLIP + L2-aug | 96.6% | 0.080 m | Best FPE · LSTM w=16

CL = closed-loop success rate; an episode counts as a success when FPE < 0.5 m AND TLD ∈ [0.7, 1.5]. Evaluated on the V5 dataset (30 validation episodes, 9 path types).
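The success criterion above reduces to a two-part check per episode. The sketch below is illustrative only: the function names are hypothetical, and FPE (in metres) and TLD are taken as precomputed per-episode values, since the page does not define how either is measured.

```python
def closed_loop_success(fpe_m, tld, fpe_max=0.5, tld_range=(0.7, 1.5)):
    """True when FPE < 0.5 m (strict) and TLD lies in [0.7, 1.5] (inclusive)."""
    lo, hi = tld_range
    return fpe_m < fpe_max and lo <= tld <= hi


def success_rate(episodes):
    """Fraction of (fpe_m, tld) pairs that pass the closed-loop criterion."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    passed = sum(closed_loop_success(fpe, tld) for fpe, tld in episodes)
    return passed / len(episodes)
```

For example, `success_rate([(0.094, 1.0), (1.454, 1.0)])` evaluates to 0.5, because the second episode fails the FPE threshold.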

Pipeline Ablation (Exp66 series)

Head | Window | CL ↑ | FPE ↓
Linear | 4 | 69.0% | -
FCHead | 4 | 93.1% | -
MLP ★ | 4 | 96.6% | 0.094 m
LSTM | 16 | 96.6% | 0.080 m

Architecture

[Figure: MoNaVLA architecture diagram]

PaliGemma2 SigLIP (frozen) → 256-dim L2-norm → concat BBox History (32-dim) → ActionMLP → 8 discrete actions.
Grounder: PaliGemma2 detect gray basket → 4-filter bbox validation → temporal N=3 goal trigger.
Proximity STOP: area ≥ 0.25 AND |cx − 0.5| ≤ 0.35 for 3 consecutive frames (GOAL_CONSEC_FRAMES=3).
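The proximity STOP rule can be sketched as a small stateful check. This is a hedged illustration rather than the project's implementation: the thresholds and `GOAL_CONSEC_FRAMES` come from the text, while the class name `ProximityStop`, the normalized `(cx, cy, w, h)` box layout, computing area as `w * h`, and resetting the counter on any failing frame are assumptions.

```python
GOAL_CONSEC_FRAMES = 3  # consecutive-frame requirement stated in the text


class ProximityStop:
    """Fires once the basket bbox is large and centered for N frames in a row."""

    def __init__(self, area_min=0.25, cx_tol=0.35, consec=GOAL_CONSEC_FRAMES):
        self.area_min = area_min
        self.cx_tol = cx_tol
        self.consec = consec
        self.count = 0

    def update(self, bbox):
        """bbox is a normalized (cx, cy, w, h) tuple, or None if nothing was detected."""
        hit = False
        if bbox is not None:
            cx, _cy, w, h = bbox
            # Assumption: "area" is the bbox area as a fraction of the image.
            hit = (w * h >= self.area_min) and (abs(cx - 0.5) <= self.cx_tol)
        # Assumption: any failing frame resets the streak.
        self.count = self.count + 1 if hit else 0
        return self.count >= self.consec
```

Calling `update` once per frame returns `False` until the third consecutive qualifying frame. Requiring a streak rather than a single frame is a common way to keep one spurious detection from ending the episode early.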

Key Documents