Text Generation Comparison of Three VLMs: MoNaVLA (2026-04-11)

🔑 Key finding: the cause of "Ring Ring Ring"

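The Google-robot checkpoint is the cause: training for navigation action prediction completely contaminated image_to_text_projection. Vision features are mapped into an action feature space instead of the text token space, so the text decoder repeats meaningless tokens ("Old Faithful Faithful", "Tin Tin Tin Roof", "Sie Sie bin Laden").
V4 LoRA retains text generation ability: under the same conditions, V4 LoRA correctly generates the <phrase> and <patch_index_NNNN> structure. If "Ring Ring Ring" did come from V4, the likely cause was the loading path used at the time (through MobileVLATrainer, with the LoRA layers applied as-is).
Pure HF Kosmos-2 is fully normal: it generates meaningful text plus BBox coordinates for every grounding prompt (P1 to P3) and meaningful text for the non-grounding prompt (P4). It is valid as the baseline for the grounding pipeline.
Difference between Google-robot and V4: Google-robot is a full fine-tune (all parameters updated, no LoRA). V4 updates the language layers only partially through LoRA and fully fine-tunes only image_to_text_projection. This is why V4 preserved text generation ability better.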

🔵 Pure HF Kosmos-2

Vision Encoder → image_to_text_projection → Text Decoder (LM) → ✅ "The gray basket is located in the corner of a room.<phrase>..."

🟡 Google-robot Pretrained

Vision Encoder → image_to_text_projection ❌ → Text Decoder (confused) → ❌ "video of the mud-cake Tin Tin Tin Roof Tin Tin roof..."

🔴 V4 LoRA Fine-tuned

Vision Encoder → image_to_text_projection ⚠️ → Text Decoder (LoRA applied) → ⚠️ "the end of the room, and<phrase> the black box</phrase>..." (inaccurate content)

P1

Grounding completion (adopted prompt)

<grounding>The gray basket is at

Pure HF Kosmos-2 BASELINE

Generated text

the center of the image, with<phrase> the white wall</phrase><object><patch_index_0000><patch_index_0735></object> and<phrase> floor</phrase><object><patch_index_0640><patch_index_1023></object> in the background.</s>

Parsed BBox

(0.00,0.00) → (0.97,0.69)

(0.00,0.62) → (0.97,0.97)

Token IDs (first 29)

5111595164862364k+5687131464k+64k+64k+64k+64k+864k+104264k+64k+64k+64k+64k+125260442

✅ Normal: meaningful text and BBoxes generated
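The parsed BBox values in the card above are consistent with mapping each <patch_index_NNNN> token to the top-left corner of its cell on a 32 × 32 patch grid. The sketch below works under that assumption; the grid size and the helper names `patch_to_xy` and `parse_bboxes` are illustrative, not taken from the project code. It handles only the two-index `<object>` form seen in this report, and the Hugging Face processor's own post-processing may place coordinates differently (for example at cell centers).

```python
import re

GRID = 32  # assumed patches per side (32 x 32 = 1024 indices, 0000..1023)

OBJECT_RE = re.compile(
    r"<object><patch_index_(\d{4})><patch_index_(\d{4})></object>"
)


def patch_to_xy(index, grid=GRID):
    """Top-left corner of a patch cell, normalized to [0, 1)."""
    row, col = divmod(index, grid)
    return col / grid, row / grid


def parse_bboxes(text, grid=GRID):
    """Return one (x1, y1, x2, y2) tuple per <object>...</object> span."""
    boxes = []
    for first, second in OBJECT_RE.findall(text):
        x1, y1 = patch_to_xy(int(first), grid)
        x2, y2 = patch_to_xy(int(second), grid)
        boxes.append((x1, y1, x2, y2))
    return boxes


# Generated text of the Pure HF Kosmos-2 card for prompt P1.
generated = (
    "the center of the image, with<phrase> the white wall</phrase>"
    "<object><patch_index_0000><patch_index_0735></object> and<phrase> floor</phrase>"
    "<object><patch_index_0640><patch_index_1023></object> in the background.</s>"
)

for x1, y1, x2, y2 in parse_bboxes(generated):
    print(f"({x1:.2f},{y1:.2f}) -> ({x2:.2f},{y2:.2f})")
```

An output with no `<object>` span, such as the Google-robot generations, yields an empty list, which corresponds to the "No BBox" badge.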

Google-robot Pretrained BROKEN

No BBox

Generated text

the back of the Old Faithful Faithful Faithful photo--- video- video</s>

Parsed BBox

None

Token IDs (first 16)

51039529065281552815528151121191919567195672

🔵 Text is generated, but it is degenerate repetition with no grounding structure

V4 LoRA Fine-tuned PARTIAL

Generated text

the end of the room, and<phrase> the black box</phrase><object><patch_index_0456><patch_index_0847></object> is on the floor.</s>

Parsed BBox

(0.25,0.44) → (0.47,0.81)

Token IDs (first 22)

5232953706864k+5637140664k+64k+64k+64k+64k+17205104242

⚠️ Partially normal: BBox structure is preserved but the content is inaccurate

P2

Grounding QA

<grounding>An image of a robot. Where is the gray basket? Answer:

Pure HF Kosmos-2 BASELINE

Generated text

The gray basket is located in the corner of a room.<phrase> The room</phrase><object><patch_index_0000><patch_index_1023></object> is empty.</s>

Parsed BBox

(0.00,0.00) → (0.97,0.97)

Token IDs (first 24)

24620064721714541251815910370464k+2437064k+64k+64k+64k+64k+17275842

✅ Normal: meaningful text and BBoxes generated

Google-robot Pretrained BROKEN

No BBox

Generated text

video of the mud-cake Tin Tin Tin Roof Tin Tin roof Tin Tin Duc Duc Duc Til Til Til Tin Tin Du Du Du Duc Duc Du Du Geng Duc Duc Tin Tin tin Duc Duc. Du Duc Til Tin Duc Til Du Duc Du Duc Tin Duc Du Tin Duc Tin Du Duc. Tin Duc duct duct duct Til

Parsed BBox

None

Token IDs (first 30)

5679569651926045138161381613816200701381613816353913816138163652536525365251716917169171691381613816492249224922365253652549224922

🔵 Text is generated, but it is degenerate repetition with no grounding structure

V4 LoRA Fine-tuned PARTIAL

Generated text

In the corner of the room.<phrase> The image</phrase><object><patch_index_0000><patch_index_1023></object> shows a room with a concrete floor.</s>

Parsed BBox

(0.00,0.00) → (0.97,0.97)

Token IDs (first 24)

835181595370464k+24164864k+64k+64k+64k+64k+8161037023104558104242

⚠️ Partially normal: BBox structure is preserved but the content is inaccurate

P3

Direct use of the instruction

<grounding>Navigate toward the gray basket until it gets closer

Pure HF Kosmos-2 BASELINE

Generated text

to the wall.<phrase> The gray basket</phrase><object><patch_index_0456><patch_index_0848></object> is a small air conditioner.</s>

Parsed BBox

(0.25,0.44) → (0.50,0.81)

Token IDs (first 20)

751314464k+246200647264k+64k+64k+64k+64k+17103687162657942

✅ Normal: meaningful text and BBoxes generated

Google-robot Pretrained BROKEN

No BBox

Generated text

to the the the Godiva Magna Magna Magna prototype prototype prototype.</s>

Parsed BBox

None

Token IDs (first 14)

75555441159135921359213592112090120901209042

🔵 Text is generated, but it is degenerate repetition with no grounding structure
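The Google-robot outputs share one visible symptom: the same word repeated several times in a row. A rough way to flag this automatically is sketched below. The helper names and the threshold of three consecutive repeats are assumptions chosen for illustration, and the word pattern only covers Latin-script text like the samples in this report.

```python
import re


def words(text):
    """Lowercased word tokens; markup such as </s> or <phrase> is dropped."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9']+", without_tags.lower())


def longest_run(tokens):
    """Length of the longest run of identical consecutive tokens."""
    best = current = 0
    previous = None
    for token in tokens:
        current = current + 1 if token == previous else 1
        best = max(best, current)
        previous = token
    return best


def looks_degenerate(text, max_run=3):
    """Hypothetical heuristic: True if one word repeats max_run or more times in a row."""
    return longest_run(words(text)) >= max_run


# P3 outputs copied from the cards in this report.
broken = "to the the the Godiva Magna Magna Magna prototype prototype prototype.</s>"
baseline = (
    "to the wall.<phrase> The gray basket</phrase>"
    "<object><patch_index_0456><patch_index_0848></object> is a small air conditioner.</s>"
)

print(looks_degenerate(broken), looks_degenerate(baseline))
```

A check like this only catches immediate repetition. It does not judge whether grammatical text is factually right, which is the failure mode of the V4 LoRA cards.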

V4 LoRA Fine-tuned PARTIAL

Generated text

to the wall.<phrase> The gray basket</phrase><object><patch_index_0456><patch_index_0848></object> is located on the floor.</s>

Parsed BBox

(0.25,0.44) → (0.50,0.81)

Token IDs (first 20)

751314464k+246200647264k+64k+64k+64k+64k+171454205104242

⚠️ Partially normal: BBox structure is preserved but the content is inaccurate

P4

General VQA (no grounding)

What do you see in this image?

Pure HF Kosmos-2 BASELINE

No BBox

Generated text

The image features a small gray air conditioner sitting on the floor in a room. The air conditioner is positioned next to a wall, and it appears to be plugged into an electrical outlet. There is also a chair in the room, positioned near the air conditioner. The room appears to have a tiled floor, and a window

Parsed BBox

None

Token IDs (first 30)

241648133810368620071626579128020510421210370424716265791799841977101314682216847

✅ Normal: meaningful text generated (no BBox, since the prompt has no <grounding> tag)

Google-robot Pretrained BROKEN

No BBox

Generated text

Sie Sie Sie sin bin Laden Laden Laden Fil Fil Fil fil fil fil p p psaksaksak pan pan panini titanium titanium titanium duct duct duct system by the Tin Tin Tin Hat Tin Tin Sie Sie Tin Tin tin Tin Tin Ho Ho Ho Sie Sie Ho Sie Tin Ho Sie Ho Ho Tin Tin To Tin Ho Tin

Parsed BBox

None

Token IDs (first 30)

254392543925439651256391856118561185611905119051190514626446264462645665665662988229882298826394639463946045390833908339083203612036120361

🔵 Text is generated, but it is degenerate repetition with no grounding structure

V4 LoRA Fine-tuned PARTIAL

No BBox

Generated text

The image features a small, gray, square-shaped air conditioner sitting on a concrete floor. The air conditioner is positioned towards the left side of the room, and it is connected to a power cord. There is a chair positioned in the room as well, positioned towards its right side. The room appears to be empty

Parsed BBox

None

Token IDs (first 30)

2416481338103686620062756199554716265791280201045581042424716265791799841028523533995

🔵 Text is generated (no grounding structure)

Prompt	🔵 Pure HF	🟡 Google-robot	🔴 V4 LoRA
Grounding completion (adopted prompt)	✅ the center of the image, with<phrase> the white wall</phrase...	⚠️ the back of the Old Faithful Faithful Faithful photo--- vide...	✅ the end of the room, and<phrase> the black box</phrase><obje...
Grounding QA	✅ The gray basket is located in the corner of a room.<phrase> ...	⚠️ video of the mud-cake Tin Tin Tin Roof Tin Tin roof Tin Tin ...	✅ In the corner of the room.<phrase> The image</phrase><object...
Direct use of the instruction	✅ to the wall.<phrase> The gray basket</phrase><object><patch_...	⚠️ to the the the Godiva Magna Magna Magna prototype prototype ...	✅ to the wall.<phrase> The gray basket</phrase><object><patch_...
General VQA (no grounding)	⚠️ The image features a small gray air conditioner sitting on t...	⚠️ Sie Sie Sie sin bin Laden Laden Laden Fil Fil Fil fil fil fi...	⚠️ The image features a small, gray, square-shaped air conditio...
Note: the marks in this table appear to track whether a BBox was parsed (✅) or not (⚠️), so they differ from the per-card status badges above.

Why image_to_text_projection is the key

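Role: the only bridge layer that converts patch embeddings from the Vision Encoder (CLIP) into the token embedding space of the Text Decoder. Generating text from an image works only if this layer is intact.
Google-robot contamination mechanism: navigation pre-training retrained this layer toward producing action features. As a result, vision features are mapped into an entirely different space rather than the text token space, which confuses the decoder.
Why V4 LoRA is partially preserved: LoRA updates only the language layers (Q/K/V/O/FC). image_to_text_projection was updated for action prediction, but the LoRA base_layer weights (oriented toward text generation) remain, so the ability is not lost entirely.
Conclusion: grounding requires loading Pure HF separately. Training Exp04 (Google-robot based) on the V4 base is advantageous for action learning, but that model's generate() is unusable.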
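One way to test the contamination claim quantitatively is to measure how far image_to_text_projection moved away from the original weights in each checkpoint. The sketch below computes a relative L2 change per parameter. To stay dependency-free it takes plain dictionaries of flat float lists; with PyTorch one would flatten each tensor first, for example with `tensor.flatten().tolist()`. The function names, the example parameter names, and the toy numbers are assumptions for illustration only, not measurements from this report.

```python
import math


def relative_change(base, tuned):
    """||tuned - base|| / ||base|| for two equal-length flat sequences of floats."""
    if len(base) != len(tuned):
        raise ValueError("parameter shapes differ")
    diff = math.sqrt(sum((t - b) ** 2 for b, t in zip(base, tuned)))
    norm = math.sqrt(sum(b * b for b in base))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else float("inf")
    return diff / norm


def projection_drift(base_weights, tuned_weights, prefix="image_to_text_projection."):
    """Relative change for every parameter under `prefix` present in both dicts."""
    return {
        name: relative_change(values, tuned_weights[name])
        for name, values in base_weights.items()
        if name.startswith(prefix) and name in tuned_weights
    }


# Toy values only: a projection weight that doubled and an untouched LM weight.
base = {
    "image_to_text_projection.dense.weight": [3.0, 4.0],
    "text_model.lm_head.weight": [1.0],
}
tuned = {
    "image_to_text_projection.dense.weight": [6.0, 8.0],
    "text_model.lm_head.weight": [1.0],
}

print(projection_drift(base, tuned))
```

Running the same comparison for the Google-robot and V4 checkpoints against Pure HF would show whether the projection drift is in fact larger for Google-robot, as the explanation above predicts.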

Pure HF Kosmos-2

Load method: AutoModelForVision2Seq.from_pretrained(HF_PATH)

Key pattern: none

image_to_text_projection: ✅ original weights kept

LoRA: none

Original Hugging Face weights of microsoft/kosmos-2-patch14-224. No action prediction training. A baseline with an intact text generation path.
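A minimal sketch of loading the untouched Hugging Face weights as a separate grounding model follows. `completion_prompt` and `ground_with_pure_hf` are hypothetical helper names, and `max_new_tokens=64` is an assumed setting, since the generation parameters used for this report are not stated. The `transformers` import is placed inside the function so that the prompt helper can be used without the library installed.

```python
def completion_prompt(target):
    """Build the completion-style grounding prompt adopted in P1."""
    return f"<grounding>The {target} is at"


def ground_with_pure_hf(image, prompt,
                        hf_path="microsoft/kosmos-2-patch14-224",
                        max_new_tokens=64):
    """Run the original Kosmos-2 weights on one PIL image.

    Returns (clean_text, entities) from the processor's post-processing.
    """
    # Lazy import: only needed when the model is actually run.
    from transformers import AutoModelForVision2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(hf_path)
    model = AutoModelForVision2Seq.from_pretrained(hf_path)

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        max_new_tokens=max_new_tokens,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # Strips the grounding markup and extracts phrase / box entities.
    return processor.post_process_generation(raw)


print(completion_prompt("gray basket"))
```

The decoded string contains the prompt followed by the continuation, so the continuation alone (as shown in the cards) is obtained by removing the prompt prefix. Keeping this model separate from the action model follows the conclusion that the fine-tuned checkpoints should not be used for generate().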

Google-robot Pretrained

Load method: HF architecture + ckpt['state_dict'] (model.backbone.* keys)

Key pattern: model.backbone.text_model.*

image_to_text_projection: ❌ contaminated by navigation training

LoRA: none (full fine-tune)

kosmos_ph_google-robot-post-train.pt. Navigation pre-training done with the RoboVLMs framework. image_to_text_projection was retrained toward action features.

V4 LoRA Fine-tuned

Load method: HF architecture + extracted LoRA base_layer weights (lora_A/B excluded)

Key pattern: model.backbone.base_model.model.*.base_layer.*

image_to_text_projection: ⚠️ partially contaminated by action regression

LoRA: rank=32, alpha=64 (q/k/v/o/fc1/fc2)

mobile_vla_v4_regression_v2 checkpoint. LoRA (rank=32) plus full fine-tune of image_to_text_projection. Trained on action regression.
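The two non-baseline load methods both come down to renaming checkpoint keys so that they match a plain HF Kosmos-2 model. The sketch below follows the key patterns listed in the cards: strip `model.backbone.`, strip the PEFT wrapper prefix `base_model.model.` when present, drop `lora_A`/`lora_B` entries, and collapse `.base_layer.`. The function names and the concrete layer names in the example are illustrative assumptions. Other PEFT layouts, such as a `modules_to_save` wrapper around image_to_text_projection, are not handled here.

```python
BACKBONE_PREFIX = "model.backbone."
PEFT_PREFIX = "base_model.model."


def to_hf_key(key):
    """Map one checkpoint key to a plain HF key, or None if it should be skipped."""
    if not key.startswith(BACKBONE_PREFIX):
        return None  # anything outside the backbone, e.g. an action head
    key = key[len(BACKBONE_PREFIX):]
    if "lora_A" in key or "lora_B" in key:
        return None  # adapter weights are excluded; only base_layer is kept
    if key.startswith(PEFT_PREFIX):
        key = key[len(PEFT_PREFIX):]
    return key.replace(".base_layer.", ".")


def remap_state_dict(state_dict):
    """Return a new dict containing only backbone weights under HF key names."""
    remapped = {}
    for key, value in state_dict.items():
        hf_key = to_hf_key(key)
        if hf_key is not None:
            remapped[hf_key] = value
    return remapped


# Illustrative keys in the two layouts (values stand in for tensors).
example = {
    "model.backbone.text_model.model.layers.0.self_attn.q_proj.weight": "full fine-tune",
    "model.backbone.base_model.model.text_model.model.layers.1.self_attn.q_proj.base_layer.weight": "lora base",
    "model.backbone.base_model.model.text_model.model.layers.1.self_attn.q_proj.lora_A.default.weight": "adapter",
    "model.act_head.fc.weight": "action head",
}

for name in remap_state_dict(example):
    print(name)
```

The remapped dictionary could then be passed to `load_state_dict` on the HF model, typically with `strict=False` while checking the reported missing and unexpected keys. Note that dropping `lora_A`/`lora_B` means the loaded model reflects the base weights, not the merged LoRA update.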
