MoNaVLA / V5 / Paper-Ready Summary
Paper-ready tables, Korean captions, and an Experimental Results draft built only from currently checked V5 documents.
| Exp / Method | Family | PM (%) | Closed-loop (%) | Main Interpretation |
|---|---|---|---|---|
| Exp04 | End-to-end VLM | 0.0 | N/A | Loss improved, but inference collapsed. |
| Exp10 + rule | Grounding proxy | 34.4 | N/A | Perception was strong, but policy transfer remained weak. |
| Exp11 | End-to-end VLM | 58.6 | 0.0 | Current policy baseline, but real rollout failed. |
| Exp14 Step1 | Decomposition | 68.4 | N/A | BBox history alone beat Exp11 in PM. |
| Exp14 Step2 | Decomposition | 75.9 | 66.7 | Current strongest practical baseline. |
| Exp17 | End-to-end VLM | 76.95 | 11.1 | PM improved, but rollout still failed. |
| Exp18 | End-to-end VLM + text fusion | 27.62 | 11.1 | Lower val loss did not translate to rollout improvement. |
\begin{table}[t]
\centering
\caption{Main experimental results on V5 navigation. PM denotes frame-level Perfect Match, while Closed-loop denotes episode-level navigation success rate.}
\begin{tabular}{l l c c l}
\toprule
Method & Family & PM (\%) & Closed-loop (\%) & Interpretation \\
\midrule
Exp04 & End-to-end VLM & 0.0 & N/A & Loss improved, but inference collapsed \\
Exp10 + rule & Grounding proxy & 34.4 & N/A & Perception strong, transfer weak \\
Exp11 & End-to-end VLM & 58.6 & 0.0 & Current policy baseline, rollout failure \\
Exp14 Step1 & Decomposition & 68.4 & N/A & BBox history alone outperforms Exp11 in PM \\
Exp14 Step2 & Decomposition & 75.9 & 66.7 & Strongest practical baseline \\
Exp17 & End-to-end VLM & 76.95 & 11.1 & PM improved, but rollout still failed \\
Exp18 & End-to-end VLM + text fusion & 27.62 & 11.1 & Lower val loss, but rollout still failed \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[t]
\centering
\caption{Representative design differences of the main V5 experiments. Only settings directly supported by configs/docs are reported.}
\begin{tabular}{l l c l l}
\toprule
Method & Backbone / Core & LoRA & Text conditioning & Notes \\
\midrule
Exp11 & Google-Robot pretrained Kosmos-2 & Yes & Raw instruction tokens & Current policy baseline \\
Exp14 Step2 & BBox history + $16\times16$ image MLP & No & None (explicitly bypassed) & Strongest practical baseline \\
Exp15 & Frozen Google-Robot VLM + action head & No & Frozen text path & Head-only control ablation \\
Exp17 & Exp11 line + balanced 33/33/34 sampling & Yes & Raw instruction tokens & Professor Step-3 test \\
Exp18 & Google-Robot Kosmos-2 + text embedding fusion & Yes & Raw text + frozen text embedding & Gate evaluation failed \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[t]
\centering
\caption{Evidence supporting the current working conclusion.}
\begin{tabular}{l l l}
\toprule
Claim & Status & Direct evidence \\
\midrule
PM alone is insufficient for model selection & Strongly supported & Exp17: 76.95 PM vs 11.1 Closed-loop \\
Decomposition is the strongest practical route & Supported & Exp14 Step2: 66.7 Closed-loop \\
End-to-end VLM still suffers rollout instability & Supported & Exp11: 0.0, Exp17: 11.1 Closed-loop \\
V5 does not directly supervise STOP & Supported & Raw V5 action count: STOP = 0 \\
Goal-near / stop-near is a proxy-signal problem & Supported & V5 proxy analysis documents \\
Text-path collapse is real & Strongly supported & Attention analysis in root-cause document \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[t]
\centering
\caption{Current research decision table.}
\begin{tabular}{l l}
\toprule
Decision item & Current decision \\
\midrule
Mainline baseline & Exp14 Step2 \\
Promotion criterion for end-to-end VLM & Must exceed 66.7\% Closed-loop \\
Role of Exp18 & Gate evaluation branch \\
Next mainline training & Exp19 = Step2 + proxy features \\
If Exp19 is insufficient & Exp20 = Step2 + proxy auxiliary head \\
Use goal\_near\_v0 as hard stop rule now? & No \\
\bottomrule
\end{tabular}
\end{table}
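\caption{Main results of the V5 navigation experiments. PM denotes frame-level Perfect Match, and Closed-loop denotes the episode-level navigation success rate.}
\caption{Representative design differences among the main V5 experiments. The table includes only settings that can be verified directly from the configs and documents.}
\caption{Summary of the evidence supporting the current interim conclusion.}
\caption{Research decision table as of the current stage.}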
Although end-to-end VLM training improved frame-level PM in some cases, the gains did not reliably translate into successful closed-loop navigation. Exp17 is the clearest example: it reached the highest PM in the main results table (76.95\%), yet only 11.1\% closed-loop success. In contrast, the decomposition-based policy in Exp14 Step2 achieved a slightly lower PM of 75.9\% but 66.7\% closed-loop success, making it the strongest practical baseline on V5 at the current stage.
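The disagreement between the two metrics can be checked directly from the reported numbers. The following is a minimal sketch, not part of the V5 tooling: the `RESULTS` dictionary and the `best_by` helper are illustrative names, and the values are copied from the main results table, with `None` standing in for N/A.

```python
# (PM %, closed-loop %) per method, copied from the main results table.
# None marks a closed-loop entry reported as N/A.
RESULTS = {
    "Exp04": (0.0, None),
    "Exp10 + rule": (34.4, None),
    "Exp11": (58.6, 0.0),
    "Exp14 Step1": (68.4, None),
    "Exp14 Step2": (75.9, 66.7),
    "Exp17": (76.95, 11.1),
    "Exp18": (27.62, 11.1),
}


def best_by(metric_index, results=RESULTS):
    """Return the method with the highest value of one metric, skipping N/A."""
    scored = {
        name: values[metric_index]
        for name, values in results.items()
        if values[metric_index] is not None
    }
    return max(scored, key=scored.get)


best_pm = best_by(0)           # selection by frame-level PM
best_closed_loop = best_by(1)  # selection by episode-level success
print(best_pm, best_closed_loop)
```

Under these values, selecting by PM picks Exp17, whereas selecting by closed-loop success picks Exp14 Step2.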
This gap indicates that frame-level action matching is not sufficient for model selection in mobile navigation: selecting by PM alone would have favored Exp17 over Exp14 Step2. The result is consistent with the root-cause analyses, which find that the current end-to-end policy remains heavily image-dominant while the text pathway often collapses during fine-tuning. In parallel, the V5 dataset analysis shows that STOP never appears in the raw action distribution (count of 0), so it is not directly supervised. Late-phase stopping behavior is therefore better treated as a proxy-signal problem than as a naturally emergent property of the current end-to-end objective.
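A check of this kind on the action labels can be sketched as follows. This is only an illustration under stated assumptions: the action names other than STOP, the toy label list, and the helper `missing_action_classes` are hypothetical and do not reflect the actual V5 action vocabulary or data format.

```python
from collections import Counter


def missing_action_classes(action_labels, expected_classes):
    """Return the expected classes that never occur in the label stream."""
    counts = Counter(action_labels)
    # Counter returns 0 for keys it has never seen.
    return sorted(c for c in expected_classes if counts[c] == 0)


# Hypothetical toy labels; real V5 labels would be loaded from the dataset.
labels = ["FORWARD", "LEFT", "FORWARD", "RIGHT", "FORWARD"]
# Hypothetical action vocabulary; only STOP is named in the source.
expected = {"FORWARD", "LEFT", "RIGHT", "STOP"}

missing = missing_action_classes(labels, expected)
print(missing)
```

A class that appears in `missing` receives no direct supervision from the labels, which is the situation reported for STOP in V5.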
Accordingly, our current research direction keeps Exp14 Step2 as the mainline baseline and treats Exp18 as a failed gate branch. Although Exp18 reached a best validation loss of 1.325, its PM was only 27.62\% and its closed-loop success stayed at 11.1\%, the same as Exp17. Until a single-stage policy exceeds the 66.7\% closed-loop result of Exp14 Step2, decomposition remains the most defensible practical approach.
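The promotion criterion can be written as a simple gate. The sketch below assumes a strict comparison against the Exp14 Step2 closed-loop result and treats N/A as non-promotable; the function name `should_promote` and the constant name are illustrative, not taken from the project code.

```python
# Closed-loop success (%) of Exp14 Step2, the current mainline baseline.
BASELINE_CLOSED_LOOP = 66.7


def should_promote(closed_loop, baseline=BASELINE_CLOSED_LOOP):
    """Promote only if closed-loop success strictly exceeds the baseline.

    A value of None (closed-loop not evaluated) never qualifies.
    """
    return closed_loop is not None and closed_loop > baseline


# Closed-loop results of the end-to-end runs in the main results table.
candidates = {"Exp11": 0.0, "Exp17": 11.1, "Exp18": 11.1}
promoted = [name for name, cl in candidates.items() if should_promote(cl)]
print(promoted)
```

With the reported values, no end-to-end run passes the gate, so the mainline baseline stays unchanged.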