MoNaVLA V5 Paper-Ready Results

Representative Results Table

Exp / Method	Family	PM (%)	Closed-loop (%)	Main Interpretation
Exp04	End-to-end VLM	0.0	N/A	Loss improved, but inference collapsed.
Exp10 + rule	Grounding proxy	34.4	N/A	Perception was strong, but policy transfer remained weak.
Exp11	End-to-end VLM	58.6	0.0	Current policy baseline, but real rollout failed.
Exp14 Step1	Decomposition	68.4	N/A	BBox history alone beat Exp11 in PM.
Exp14 Step2	Decomposition	75.9	66.7	Current strongest practical baseline.
Exp17	End-to-end VLM	76.95	11.1	PM improved, but rollout still failed.
Exp18	End-to-end VLM + text fusion	27.62	11.1	Lower val loss did not translate to rollout improvement.

1. Booktabs LaTeX Version

\begin{table}[t]
\centering
\caption{Main experimental results on V5 navigation. PM denotes frame-level Perfect Match, while Closed-loop denotes episode-level navigation success rate.}
\begin{tabular}{l l c c l}
\toprule
Method & Family & PM (\%) & Closed-loop (\%) & Interpretation \\
\midrule
Exp04 & End-to-end VLM & 0.0 & N/A & Loss improved, but inference collapsed \\
Exp10 + rule & Grounding proxy & 34.4 & N/A & Perception strong, transfer weak \\
Exp11 & End-to-end VLM & 58.6 & 0.0 & Current policy baseline, rollout failure \\
Exp14 Step1 & Decomposition & 68.4 & N/A & BBox history alone outperforms Exp11 in PM \\
Exp14 Step2 & Decomposition & 75.9 & 66.7 & Strongest practical baseline \\
Exp17 & End-to-end VLM & 76.95 & 11.1 & PM improved, but rollout still failed \\
Exp18 & End-to-end VLM + text fusion & 27.62 & 11.1 & Lower val loss, but rollout still failed \\
\bottomrule
\end{tabular}
\end{table}
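
For reference, the two metrics in the table above can be written out explicitly. This is a sketch that assumes Perfect Match means the predicted action $\hat{a}_t$ exactly equals the reference action $a_t$ at frame $t$. The symbols $N$, $M$, and $s_j$ are introduced here for illustration and are not taken from the experiment configs.

\begin{equation}
\mathrm{PM} = \frac{100}{N} \sum_{t=1}^{N} \mathbf{1}\left[\hat{a}_t = a_t\right],
\qquad
\text{Closed-loop} = \frac{100}{M} \sum_{j=1}^{M} s_j
\end{equation}

Here $N$ is the number of evaluated frames, $M$ is the number of rollout episodes, and $s_j \in \{0, 1\}$ indicates whether episode $j$ reached its goal. PM averages over frames scored independently against logged actions, whereas Closed-loop averages over whole episodes in which the policy acts on its own earlier outputs, so the two quantities need not move together.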

Root-cause update. The unmodified Hugging Face (HF) Kosmos-2 checkpoint keeps about 22.6\% text attention, while the Google-Robot policy variants, including the head-only Exp15, collapse to 0.000\%. Because Exp15 uses no LoRA and still collapses, LoRA is not a necessary condition for the observed text-path collapse.
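
The exact procedure behind the text-attention percentages is documented in the root-cause analysis, not here. As a rough illustration only, one plausible way to obtain such a number is to measure what fraction of a query's attention mass lands on text-token positions. The function name `text_attention_share` and its inputs are hypothetical and are not taken from the project code.

```python
def text_attention_share(weights, is_text):
    """Return the percentage of attention mass placed on text-token positions.

    weights: non-negative attention weights of one query over all key positions.
    is_text: parallel booleans, True where the key position is a text token.
    Both arguments are illustrative assumptions, not the project's data format.
    """
    if len(weights) != len(is_text):
        raise ValueError("weights and is_text must have the same length")
    total = sum(weights)
    if total == 0:
        return 0.0  # no attention mass at all; report zero instead of dividing by zero
    text_mass = sum(w for w, t in zip(weights, is_text) if t)
    return 100.0 * text_mass / total


# Toy example: two image positions followed by two text positions.
share = text_attention_share([0.5, 0.25, 0.125, 0.125], [False, False, True, True])
```

Under this reading, a value of 0.000\% would mean that essentially no attention mass reaches any text position, which matches the description of a collapsed text path.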

\begin{table}[t]
\centering
\caption{Representative design differences of the main V5 experiments. Only settings directly supported by configs/docs are reported.}
\begin{tabular}{l l c l l}
\toprule
Method & Backbone / Core & LoRA & Text conditioning & Notes \\
\midrule
Exp11 & Google-Robot pretrained Kosmos-2 & Yes & Raw instruction tokens & Current policy baseline \\
Exp14 Step2 & BBox history + 16x16 image MLP & No & None (explicitly bypassed) & Strongest practical baseline \\
Exp15 & Frozen Google-Robot VLM + action head & No & Frozen text path & Head-only control ablation \\
Exp17 & Exp11 line + balanced 33/33/34 sampling & Yes & Raw instruction tokens & Professor Step-3 test \\
Exp18 & Google-Robot Kosmos-2 + text embedding fusion & Yes & Raw text + frozen text embedding & Gate evaluation failed \\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[t]
\centering
\caption{Evidence supporting the current working conclusion.}
\begin{tabular}{l l l}
\toprule
Claim & Status & Direct evidence \\
\midrule
PM alone is insufficient for model selection & Strongly supported & Exp17: 76.95 PM vs 11.1 Closed-loop \\
Decomposition is the strongest practical route & Supported & Exp14 Step2: 66.7 Closed-loop \\
End-to-end VLM still suffers rollout instability & Supported & Exp11: 0.0, Exp17: 11.1 Closed-loop \\
V5 does not directly supervise STOP & Supported & Raw V5 action count: STOP = 0 \\
Goal-near / stop-near is a proxy-signal problem & Supported & V5 proxy analysis documents \\
Text-path collapse is real & Strongly supported & Attention analysis in root-cause document \\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[t]
\centering
\caption{Current research decision table.}
\begin{tabular}{l l}
\toprule
Decision item & Current decision \\
\midrule
Mainline baseline & Exp14 Step2 \\
Promotion criterion for end-to-end VLM & Must exceed 66.7\% Closed-loop \\
Role of Exp18 & Gate evaluation branch (evaluation failed) \\
Next mainline training & Exp19 = Step2 + proxy features \\
If Exp19 is insufficient & Exp20 = Step2 + proxy auxiliary head \\
Use goal\_near\_v0 as hard stop rule now? & No \\
\bottomrule
\end{tabular}
\end{table}

2. Korean Caption Version

\caption{V5 내비게이션 실험의 주요 결과. PM은 프레임 단위 Perfect Match를, Closed-loop는 에피소드 단위 주행 성공률을 의미한다.}

\caption{주요 V5 실험들의 대표 설계 차이. 표에는 config와 문서에서 직접 확인 가능한 설정만 포함했다.}

\caption{현재 중간 결론을 지지하는 근거 요약.}

\caption{현재 시점 연구 의사결정 표.}

3. Experimental Results Draft

Although end-to-end VLM training improved frame-level PM in some cases, those gains did not reliably translate into successful closed-loop navigation. Exp17 is the clearest example: it reached 76.95\% PM but only 11.1\% closed-loop success. In contrast, the decomposition-based policy in Exp14 Step2 achieved 75.9\% PM and 66.7\% closed-loop success, making it the strongest practical baseline on V5 at the current stage.

This gap indicates that frame-level action matching is not sufficient for model selection in mobile navigation. The result is consistent with the root-cause analyses: the current end-to-end policy remains heavily image-dominant, while the text pathway often collapses during fine-tuning. In parallel, the V5 dataset analysis shows that STOP is not directly supervised in the raw action distribution, which suggests that late-phase stopping behavior is better treated as a proxy-signal problem than as a naturally emergent property of the current end-to-end objective.

Accordingly, our current research direction keeps Exp14 Step2 as the mainline baseline, while Exp18 is now treated as a failed gate branch. Although Exp18 reached a best validation loss of 1.325, its PM was only 27.62\% and its closed-loop success remained 11.1\%. Until a single-stage policy exceeds the 66.7\% closed-loop result of Exp14 Step2, decomposition remains the most defensible practical approach.
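
The promotion criterion above can be expressed as a small check. The following is a minimal sketch, not the project's evaluation code: the names `BASELINE_CLOSED_LOOP`, `CANDIDATES`, `passes_gate`, and `promoted` are hypothetical, and the numbers are copied from the results table.

```python
# Hypothetical promotion gate; all names are illustrative.
BASELINE_CLOSED_LOOP = 66.7  # Exp14 Step2 closed-loop success (%)

# (PM %, closed-loop %) for each end-to-end candidate, taken from the results table.
CANDIDATES = {
    "Exp11": (58.6, 0.0),
    "Exp17": (76.95, 11.1),
    "Exp18": (27.62, 11.1),
}


def passes_gate(closed_loop, baseline=BASELINE_CLOSED_LOOP):
    """Promote only if closed-loop success strictly exceeds the baseline.

    A value of None stands for "not evaluated" (N/A) and never passes.
    """
    return closed_loop is not None and closed_loop > baseline


def promoted(candidates):
    """Return the names of candidates that clear the gate; PM is deliberately ignored."""
    return [name for name, (_pm, closed_loop) in candidates.items()
            if passes_gate(closed_loop)]


result = promoted(CANDIDATES)  # no current end-to-end run clears 66.7
```

The gate ignores PM on purpose: Exp17 has the highest PM in the table and still fails, which is the reason PM is not used for model selection here.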

Current paper-safe takeaway: on V5, decomposition is still the strongest public result, and Exp18 should now be reported as ``training finished, evaluation failed.''
