
Experimental Results Tables

Paper-ready tables, Korean captions, and an Experimental Results draft built only from currently checked V5 documents.

Strongest Baseline: Exp14 Step2. PM 75.9%, closed-loop 66.7%. Current practical baseline.
Best Checked End-to-End: Exp17. PM 76.95%, closed-loop 11.1%. PM improved, rollout still weak.
Exp18 Status: Gate Failed. Best val loss 1.325, but PM 27.62% and closed-loop 11.1%.
Current Conclusion: Decomposition > End-to-End. On V5, decomposition remains the strongest practical route until a single-stage policy beats 66.7% closed-loop.

Representative Results Table

Exp / Method  Family                        PM (%)  Closed-loop (%)  Main Interpretation
Exp04         End-to-end VLM                0.0     N/A              Loss improved, but inference collapsed.
Exp10 + rule  Grounding proxy               34.4    N/A              Perception was strong, but policy transfer remained weak.
Exp11         End-to-end VLM                58.6    0.0              Current policy baseline, but real rollout failed.
Exp14 Step1   Decomposition                 68.4    N/A              BBox history alone beat Exp11 in PM.
Exp14 Step2   Decomposition                 75.9    66.7             Current strongest practical baseline.
Exp17         End-to-end VLM                76.95   11.1             PM improved, but rollout still failed.
Exp18         End-to-end VLM + text fusion  27.62   11.1             Lower val loss did not translate to rollout improvement.

1. Booktabs LaTeX Version

\begin{table}[t]
\centering
\caption{Main experimental results on V5 navigation. PM denotes frame-level Perfect Match, while Closed-loop denotes episode-level navigation success rate.}
\begin{tabular}{l l c c l}
\toprule
Method & Family & PM (\%) & Closed-loop (\%) & Interpretation \\
\midrule
Exp04 & End-to-end VLM & 0.0 & N/A & Loss improved, but inference collapsed \\
Exp10 + rule & Grounding proxy & 34.4 & N/A & Perception strong, transfer weak \\
Exp11 & End-to-end VLM & 58.6 & 0.0 & Current policy baseline, rollout failure \\
Exp14 Step1 & Decomposition & 68.4 & N/A & BBox history alone outperforms Exp11 in PM \\
Exp14 Step2 & Decomposition & 75.9 & 66.7 & Strongest practical baseline \\
Exp17 & End-to-end VLM & 76.95 & 11.1 & PM improved, but rollout still failed \\
Exp18 & End-to-end VLM + text fusion & 27.62 & 11.1 & Lower val loss, but rollout still failed \\
\bottomrule
\end{tabular}
\end{table}
Root-cause update. The pure HF Kosmos-2 model keeps about 22.6\% text attention, while the Google-Robot policy variants collapse to 0.000\%, including Exp15, which trains only the action head and uses no LoRA. Because the collapse also appears without LoRA, LoRA is not a necessary condition for the observed text-path collapse.
\begin{table}[t]
\centering
\caption{Representative design differences of the main V5 experiments. Only settings directly supported by configs/docs are reported.}
\begin{tabular}{l l c l l}
\toprule
Method & Backbone / Core & LoRA & Text conditioning & Notes \\
\midrule
Exp11 & Google-Robot pretrained Kosmos-2 & Yes & Raw instruction tokens & Current policy baseline \\
Exp14 Step2 & BBox history + $16\times16$ image MLP & No & None (explicitly bypassed) & Strongest practical baseline \\
Exp15 & Frozen Google-Robot VLM + action head & No & Frozen text path & Head-only control ablation \\
Exp17 & Exp11 line + balanced 33/33/34 sampling & Yes & Raw instruction tokens & Professor Step-3 test \\
Exp18 & Google-Robot Kosmos-2 + text embedding fusion & Yes & Raw text + frozen text embedding & Gate evaluation failed \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[t]
\centering
\caption{Evidence supporting the current working conclusion.}
\begin{tabular}{l l l}
\toprule
Claim & Status & Direct evidence \\
\midrule
PM alone is insufficient for model selection & Strongly supported & Exp17: 76.95 PM vs 11.1 Closed-loop \\
Decomposition is the strongest practical route & Supported & Exp14 Step2: 66.7 Closed-loop \\
End-to-end VLM still suffers rollout instability & Supported & Exp11: 0.0, Exp17: 11.1 Closed-loop \\
V5 does not directly supervise STOP & Supported & Raw V5 action count: STOP = 0 \\
Goal-near / stop-near is a proxy-signal problem & Supported & V5 proxy analysis documents \\
Text-path collapse is real & Strongly supported & Attention analysis in root-cause document \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[t]
\centering
\caption{Current research decision table.}
\begin{tabular}{l l}
\toprule
Decision item & Current decision \\
\midrule
Mainline baseline & Exp14 Step2 \\
Promotion criterion for end-to-end VLM & Must exceed 66.7\% Closed-loop \\
Role of Exp18 & Gate evaluation branch \\
Next mainline training & Exp19 = Step2 + proxy features \\
If Exp19 is insufficient & Exp20 = Step2 + proxy auxiliary head \\
Use goal\_near\_v0 as hard stop rule now? & No \\
\bottomrule
\end{tabular}
\end{table}
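
The tables above use \toprule, \midrule, and \bottomrule, which are provided by the booktabs package. The source does not show its preamble, so the following is only a minimal sketch of a compilable wrapper, assuming the article class. The tabularx variant of the first table is an optional suggestion, not something taken from the checked V5 documents: it lets the long Interpretation column wrap instead of running past the text width.

```latex
% Minimal wrapper sketch; the document class and package choice are assumptions.
\documentclass{article}
\usepackage{booktabs}  % provides \toprule, \midrule, \bottomrule
\usepackage{tabularx}  % optional: the X column type wraps long cell text

\begin{document}

\begin{table}[t]
\centering
\caption{Main experimental results on V5 navigation. PM denotes frame-level Perfect Match, while Closed-loop denotes episode-level navigation success rate.}
% Same columns as the original table, with the last column changed from l to X.
\begin{tabularx}{\textwidth}{l l c c X}
\toprule
Method & Family & PM (\%) & Closed-loop (\%) & Interpretation \\
\midrule
Exp14 Step2 & Decomposition & 75.9 & 66.7 & Strongest practical baseline \\
Exp17 & End-to-end VLM & 76.95 & 11.1 & PM improved, but rollout still failed \\
% Remaining rows as in the original table.
\bottomrule
\end{tabularx}
\end{table}

\end{document}
```

If the target paper template already fixes the table width some other way, the original tabular environments can be kept unchanged and only the booktabs line is needed.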

2. Korean Caption Version

\caption{V5 내비게이션 실험의 주요 결과. PM은 프레임 단위 Perfect Match를, Closed-loop는 에피소드 단위 주행 성공률을 의미한다.}

\caption{주요 V5 실험들의 대표 설계 차이. 표에는 config와 문서에서 직접 확인 가능한 설정만 포함했다.}

\caption{현재 중간 결론을 지지하는 근거 요약.}

\caption{현재 시점 연구 의사결정 표.}
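
Korean captions will not typeset under a default English-only LaTeX setup. One common option is the kotex package with a UTF-8 source file. This is an assumption about the build, since the source does not state which engine or Korean package the paper uses. The sketch below swaps the first Korean caption into a shortened copy of the main results table.

```latex
% Sketch only: kotex is one common way to typeset Korean; the actual paper template may differ.
\documentclass{article}
\usepackage{kotex}     % Korean typesetting support; the source file must be UTF-8
\usepackage{booktabs}  % provides \toprule, \midrule, \bottomrule

\begin{document}

\begin{table}[t]
\centering
\caption{V5 내비게이션 실험의 주요 결과. PM은 프레임 단위 Perfect Match를, Closed-loop는 에피소드 단위 주행 성공률을 의미한다.}
\begin{tabular}{l l c c}
\toprule
Method & Family & PM (\%) & Closed-loop (\%) \\
\midrule
Exp14 Step2 & Decomposition & 75.9 & 66.7 \\
Exp17 & End-to-end VLM & 76.95 & 11.1 \\
\bottomrule
\end{tabular}
\end{table}

\end{document}
```

The other three Korean captions can be substituted into their tables in the same way, in the order listed above.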

3. Experimental Results Draft

Although end-to-end VLM training improved frame-level PM in some cases, these gains did not reliably translate into successful closed-loop navigation. Exp17 is the clearest example: it reached 76.95\% PM but only 11.1\% closed-loop success. In contrast, the decomposition-based policy in Exp14 Step2 achieved 75.9\% PM and 66.7\% closed-loop success, making it the strongest practical baseline on V5 at the current stage.
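
The two metrics compared in this paragraph can be written down explicitly. The definitions below are an assumed formalization based only on the caption wording (frame-level Perfect Match, episode-level success rate). The symbols are introduced here for illustration: $N$ is the number of evaluated frames, $a_t$ and $\hat{a}_t$ are the reference and predicted actions at frame $t$, $M$ is the number of rollout episodes, and $s_j$ marks whether episode $j$ succeeded. The exact matching rule used in the V5 evaluation may differ. The second block is plain arithmetic on the reported numbers.

```latex
% Assumed formalization; symbols are illustrative and not taken from the V5 documents.
% Requires \usepackage{amsmath} for align and \text.
\begin{align}
\mathrm{PM} &= \frac{100}{N} \sum_{t=1}^{N} \mathbf{1}\!\left[\hat{a}_t = a_t\right], &
\mathrm{CL} &= \frac{100}{M} \sum_{j=1}^{M} s_j, \quad s_j \in \{0, 1\}.
\end{align}
% Gap between frame-level and episode-level performance, in percentage points:
\begin{align}
\text{Exp17:} \quad & 76.95 - 11.1 = 65.85, \\
\text{Exp14 Step2:} \quad & 75.9 - 66.7 = 9.2.
\end{align}
```

Under this reading, the two policies differ by about one point in PM but by more than 55 points in closed-loop success.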

This gap indicates that frame-level action matching is not sufficient for model selection in mobile navigation. The result is consistent with the root-cause analyses: the current end-to-end policy remains heavily image-dominant, while the text pathway often collapses during fine-tuning. In parallel, the V5 dataset analysis shows that STOP is not directly supervised in the raw action distribution, which suggests that late-phase stopping behavior is better treated as a proxy-signal problem than as a naturally emergent property of the current end-to-end objective.

Accordingly, our current research direction keeps Exp14 Step2 as the mainline baseline, while Exp18 is now treated as a failed gate branch. Although Exp18 reached a best validation loss of 1.325, its PM was only 27.62\% and its closed-loop success remained 11.1\%. Until a single-stage policy exceeds the 66.7\% closed-loop result of Exp14 Step2, decomposition remains the most defensible practical approach.
Current paper-safe takeaway: on V5, decomposition is still the strongest public result, and Exp18 should now be reported as ``training finished, evaluation failed.''