MoNaVLA: Mobile Navigation VLA

Abstract

MoNaVLA investigates Vision-Language-Action (VLA) models for mobile-robot navigation to a basket. End-to-end fine-tuning of Kosmos-2 with LoRA collapses to 0% closed-loop success because of a structural text-attention failure in the Google-robot post-trained backbone. We instead adopt a decomposition pipeline: PaliGemma2's frozen SigLIP vision encoder extracts frame features (L2-normalized, 256-dim), which are concatenated with an 8-frame bbox history (32-dim) and fed to a 3-layer ActionMLP that predicts 8 discrete actions. PaliGemma2 also serves as the grounder: detect gray basket → 4-filter validated bbox → temporal N=3 goal trigger. This pipeline achieves 96.6% closed-loop success (FPE 0.094 m), a ×9.4 improvement over a simple MLP baseline (10.3%). Ablation confirms that the L2-norm + bbox augmentation pipeline is the sole performance driver; the grounding source (HSV heuristic, base PG2, or fine-tuned LoRA) is irrelevant once the pipeline is correct. A zero-shot linear probe (96.6%) and a basket-masking ablation (9/9 action flips) provide causal evidence that the frozen SigLIP encoder independently localizes the basket.

Key Findings

✅

Decomposition reaches 96.6% CL, far ahead of E2E at 0%

Exp54/66/67: PG2 grounding + SigLIP ActionMLP + L2-norm + aug → 96.6% vs. Exp11 E2E 0%. Benchmark: V5, 150 episodes, 30 validation episodes.
✅

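The pipeline (L2-norm + bbox augmentation) is the sole determinant of performance

Simple MLP (Exp65b) 10.3% vs. L2+aug (Exp66) 96.6%: with the same cx source, changing only the pipeline yields a ×9.4 improvement. The grounding source (where cx comes from) is irrelevant.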
✅

Grounding source irrelevance confirmed: LoRA improvements contribute nothing to action

HSV (Exp54) = base PG2 (Exp66) = Exp59 LoRA (Exp67) = 96.6%. The grounding LoRA work contributes 0 to action performance (a clear negative result).
⚠️

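Structural death of text attention, attributable to the Google-robot backbone

Text attention is 0.000% in both Exp11 and Exp15. Neither LoRA nor head-only training recovers it. Pure HF Kosmos-2 behaves normally (22.6%). This is the root cause of the E2E failure.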
🔬

Basket localization demonstrated twice: zero-shot probe 96.6% + masking 9/9 flips

Frozen SigLIP (PG2) classifies the basket position at 96.6% without any training. Basket masking → the action flips in 9/9 frames (Exp66, base PG2 grounding). The SigLIP image path recognizes the basket pixels independently.
🤖

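Real-robot PG2 grounding: 0% → 51.4% (S1→S6, 6 sessions, 512 frames)

Fixes were applied in this order: camera decode bug fix → 4 bbox filters → PNG encoding → temporal N=3 patch. S6 produced 0 noisy bboxes at a mean latency of 1264 ms. A real-robot closed-loop (CL) test is the next step.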

Main Results

Method	Architecture	CL ↑	FPE ↓	Note
E2E VLA (Exp11)	Kosmos-2 + LoRA	0.0%	1.454 m	Text attn 0%, structural failure
Decomp v1 (Exp14)	CLIP + BBox MLP	66.7%	0.555 m	First decomposition baseline
Simple MLP (Exp65b)	SigLIP + plain MLP	10.3%	—	No L2-norm, no aug → pipeline ablation
Ours (Exp66) ★	PG2 grounding + SigLIP + L2-aug	96.6%	0.094 m	SOTA · MLP w=4
Ours (Exp66 LSTM)	PG2 grounding + SigLIP + L2-aug	96.6%	0.080 m	Best FPE · LSTM w=16

CL = Closed-Loop success (FPE < 0.5 m AND TLD ∈ [0.7, 1.5]). Evaluated on the V5 dataset (30 validation episodes, 9 path types).
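The success criterion can be written as a small predicate. This is a minimal sketch rather than the project's evaluation code: the function names and arguments are illustrative, and it assumes FPE is supplied in meters and TLD as a unitless value, as the thresholds suggest.

```python
def is_closed_loop_success(fpe_m: float, tld: float) -> bool:
    """Return True when an episode meets the stated CL criterion.

    Hypothetical inputs: fpe_m is the episode's FPE in meters,
    tld is the episode's TLD value.
    """
    # FPE must be strictly below 0.5 m; TLD must lie in the closed interval [0.7, 1.5].
    return fpe_m < 0.5 and 0.7 <= tld <= 1.5


def closed_loop_rate(episodes: list[tuple[float, float]]) -> float:
    """Fraction of (fpe_m, tld) episodes that count as successes."""
    if not episodes:
        raise ValueError("need at least one episode")
    successes = sum(is_closed_loop_success(f, t) for f, t in episodes)
    return successes / len(episodes)
```

Both conditions must hold at once, so an episode that stops close to the goal but falls outside the TLD range still counts as a failure.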

Pipeline Ablation (Exp66 series)

Head	Window	CL ↑	FPE ↓
Linear	4	69.0%	—
FCHead	4	93.1%	—
MLP ★	4	96.6%	0.094 m
LSTM	16	96.6%	0.080 m

Architecture

PaliGemma2 SigLIP (frozen) → 256-dim L2-norm → concat BBox History (32-dim) → ActionMLP → 8 discrete actions.
Grounder: PaliGemma2 detect gray basket → 4-filter bbox validation → temporal N=3 goal trigger.
Proximity STOP: area ≥ 0.25 AND |cx − 0.5| ≤ 0.35 for 3 consecutive frames (GOAL_CONSEC_FRAMES=3).
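The input assembly and the proximity STOP rule described above can be sketched with the Python standard library alone. The function and class names are illustrative, not taken from the project. Two things are assumptions: the 32-dim history is laid out as 8 frames × 4 values per bbox (inferred from the stated dimensions), and `area` and `cx` are normalized to the image size.

```python
import math
from collections import deque

FEATURE_DIM = 256        # SigLIP feature size stated above
HISTORY_FRAMES = 8       # bbox history length
BBOX_DIM = 4             # assumption: 8 frames x 4 values = 32 dims
GOAL_CONSEC_FRAMES = 3   # consecutive frames required for STOP


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def build_action_input(feature, bbox_history):
    """Concatenate the L2-normalized feature with the flattened bbox history."""
    if len(feature) != FEATURE_DIM:
        raise ValueError("expected a 256-dim feature")
    if len(bbox_history) != HISTORY_FRAMES or any(
        len(b) != BBOX_DIM for b in bbox_history
    ):
        raise ValueError("expected 8 bboxes of 4 values each")
    flat_history = [v for bbox in bbox_history for v in bbox]
    return l2_normalize(feature) + flat_history  # 256 + 32 = 288 values


class ProximityStop:
    """Signal STOP once the proximity condition holds for 3 consecutive frames."""

    def __init__(self, consec_frames=GOAL_CONSEC_FRAMES):
        # Sliding window over the most recent per-frame decisions.
        self.recent = deque(maxlen=consec_frames)

    def update(self, area, cx):
        """Record one frame; return True when every frame in a full window qualifies."""
        near = area >= 0.25 and abs(cx - 0.5) <= 0.35
        self.recent.append(near)
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

Calling `update` once per frame returns True on the third consecutive qualifying frame. A single failing frame stays in the window until three new qualifying frames have pushed it out, which is equivalent to resetting a consecutive-frame counter.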

Key Documents

Research Story

Full Research Journey (CH1→CH36)

VIS: Paper Appendix Visuals

Grounding Analysis (7 Models)

Closed-Loop Evaluation: Full History

Paper Table Drafts (6/12)

Real-Robot PG2 Grounding Session Analysis

Legacy Archive
