🔑 Key finding: the cause of "Ring Ring Ring"

🔵 Pure HF Kosmos-2

Vision Encoder
image_to_text_projection
Text Decoder (LM)
✅ "The gray basket is located in the corner of a room.<phrase>..."

🟡 Google-robot Pretrained

Vision Encoder
image_to_text_projection ❌
Text Decoder (confused)
❌ "video of the mud-cake Tin Tin Tin Roof Tin Tin roof..."

🔴 V4 LoRA Fine-tuned

Vision Encoder
image_to_text_projection ⚠️
Text Decoder (LoRA applied)
⚠️ "the end of the room, and<phrase> the black box</phrase>..." (content inaccurate)
P1
Grounding completion (adopted prompt)
<grounding>The gray basket is at
Pure HF Kosmos-2 BASELINE
the center of the image, with<phrase> the white wall</phrase><object><patch_index_0000><patch_index_0735></object> and<phrase> floor</phrase><object><patch_index_0640><patch_index_1023></object> in the background.</s>
(0.00,0.00) → (0.97,0.69)
(0.00,0.62) → (0.97,0.97)
✅ Normal: meaningful text + BBox generated
Google-robot Pretrained BROKEN
No BBox
the back of the Old Faithful Faithful Faithful photo--- video- video</s>
None
🔵 Text generation possible (no grounding structure)
V4 LoRA Fine-tuned PARTIAL
the end of the room, and<phrase> the black box</phrase><object><patch_index_0456><patch_index_0847></object> is on the floor.</s>
(0.25,0.44) → (0.47,0.81)
⚠️ Partially normal: BBox structure is preserved, but the content is inaccurate
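The BBox lines in these cards can be reproduced from the <patch_index_NNNN> pairs in the generated text. The sketch below assumes a 32 x 32 location grid (patch_index_0000 through patch_index_1023) and reports the top-left corner of each patch cell, a convention inferred from the numbers shown here rather than taken from any library. The function names are hypothetical, and only the simple case of one index pair per <object> is handled.

```python
import re

GRID = 32  # assumed: 1024 location tokens laid out as a 32 x 32 grid

# Matches one <object> span holding exactly one (top-left, bottom-right) pair.
_OBJECT_RE = re.compile(
    r"<object><patch_index_(\d{4})><patch_index_(\d{4})></object>"
)


def patch_to_xy(index, grid=GRID):
    """Normalized (x, y) of the top-left corner of patch cell `index`."""
    row, col = divmod(index, grid)
    return (col / grid, row / grid)


def extract_bboxes(text, grid=GRID):
    """Return a list of (x1, y1, x2, y2) tuples, one per <object> span."""
    boxes = []
    for top_left, bottom_right in _OBJECT_RE.findall(text):
        boxes.append(
            patch_to_xy(int(top_left), grid) + patch_to_xy(int(bottom_right), grid)
        )
    return boxes


sample = (
    "the end of the room, and<phrase> the black box</phrase>"
    "<object><patch_index_0456><patch_index_0847></object> is on the floor.</s>"
)
for x1, y1, x2, y2 in extract_bboxes(sample):
    print(f"({x1:.2f},{y1:.2f}) -> ({x2:.2f},{y2:.2f})")
```

For the V4 LoRA output above this prints (0.25,0.44) -> (0.47,0.81), which matches the BBox line of that card. Other tools may map a patch index to the cell center or cell edge instead, so their coordinates can differ slightly.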
P2
Grounding QA
<grounding>An image of a robot. Where is the gray basket? Answer:
Pure HF Kosmos-2 BASELINE
The gray basket is located in the corner of a room.<phrase> The room</phrase><object><patch_index_0000><patch_index_1023></object> is empty.</s>
(0.00,0.00) → (0.97,0.97)
✅ Normal: meaningful text + BBox generated
Google-robot Pretrained BROKEN
No BBox
video of the mud-cake Tin Tin Tin Roof Tin Tin roof Tin Tin Duc Duc Duc Til Til Til Tin Tin Du Du Du Duc Duc Du Du Geng Duc Duc Tin Tin tin Duc Duc. Du Duc Til Tin Duc Til Du Duc Du Duc Tin Duc Du Tin Duc Tin Du Duc. Tin Duc duct duct duct Til
None
🔵 Text generation possible (no grounding structure)
V4 LoRA Fine-tuned PARTIAL
In the corner of the room.<phrase> The image</phrase><object><patch_index_0000><patch_index_1023></object> shows a room with a concrete floor.</s>
(0.00,0.00) → (0.97,0.97)
⚠️ Partially normal: BBox structure is preserved, but the content is inaccurate
P3
Direct use of the instruction
<grounding>Navigate toward the gray basket until it gets closer
Pure HF Kosmos-2 BASELINE
to the wall.<phrase> The gray basket</phrase><object><patch_index_0456><patch_index_0848></object> is a small air conditioner.</s>
(0.25,0.44) → (0.50,0.81)
✅ Normal: meaningful text + BBox generated
Google-robot Pretrained BROKEN
No BBox
to the the the Godiva Magna Magna Magna prototype prototype prototype.</s>
None
🔵 Text generation possible (no grounding structure)
V4 LoRA Fine-tuned PARTIAL
to the wall.<phrase> The gray basket</phrase><object><patch_index_0456><patch_index_0848></object> is located on the floor.</s>
(0.25,0.44) → (0.50,0.81)
⚠️ Partially normal: BBox structure is preserved, but the content is inaccurate
P4
General VQA (no grounding)
What do you see in this image?
Pure HF Kosmos-2 BASELINE
No BBox
The image features a small gray air conditioner sitting on the floor in a room. The air conditioner is positioned next to a wall, and it appears to be plugged into an electrical outlet. There is also a chair in the room, positioned near the air conditioner. The room appears to have a tiled floor, and a window
None
✅ Normal: meaningful text generated (no BBox, since this prompt has no grounding)
Google-robot Pretrained BROKEN
No BBox
Sie Sie Sie sin bin Laden Laden Laden Fil Fil Fil fil fil fil p p psaksaksak pan pan panini titanium titanium titanium duct duct duct system by the Tin Tin Tin Hat Tin Tin Sie Sie Tin Tin tin Tin Tin Ho Ho Ho Sie Sie Ho Sie Tin Ho Sie Ho Ho Tin Tin To Tin Ho Tin
None
🔵 Text generation possible (no grounding structure)
V4 LoRA Fine-tuned PARTIAL
No BBox
The image features a small, gray, square-shaped air conditioner sitting on a concrete floor. The air conditioner is positioned towards the left side of the room, and it is connected to a power cord. There is a chair positioned in the room as well, positioned towards its right side. The room appears to be empty
None
🔵 Text generation possible (no grounding structure)
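The Google-robot outputs above share one visible symptom: the same word repeated back to back ("Faithful Faithful Faithful", "Tin Tin Tin", "Sie Sie Sie"). The report does not say how BROKEN was assigned, so the following is only a hypothetical heuristic for flagging that pattern automatically. The function names and the threshold of 3 are assumptions made for illustration.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")  # strips <phrase>, <object>, </s>, ...


def longest_repeat_run(text):
    """Length of the longest run of identical consecutive words.

    Words are compared case-insensitively after removing punctuation,
    so "Tin Tin tin" counts as a run of 3.
    """
    words = []
    for token in _TAG_RE.sub(" ", text).split():
        word = re.sub(r"\W", "", token).lower()
        if word:
            words.append(word)

    longest = current = 0
    previous = None
    for word in words:
        current = current + 1 if word == previous else 1
        longest = max(longest, current)
        previous = word
    return longest


def looks_degenerate(text, threshold=3):
    """True when some word repeats `threshold` or more times in a row."""
    return longest_repeat_run(text) >= threshold


broken = "the back of the Old Faithful Faithful Faithful photo--- video- video</s>"
normal = "The gray basket is located in the corner of a room."
print(looks_degenerate(broken), looks_degenerate(normal))
```

A check like this only catches word-level loops. It says nothing about whether a fluent answer is factually right, which is the failure mode of the PARTIAL cards.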
Prompt | 🔵 Pure HF | 🟡 Google-robot | 🔴 V4 LoRA
Grounding completion (adopted prompt) | the center of the image, with<phrase> the white wall</phrase... | ⚠️ the back of the Old Faithful Faithful Faithful photo--- vide... | the end of the room, and<phrase> the black box</phrase><obje...
Grounding QA | The gray basket is located in the corner of a room.<phrase> ... | ⚠️ video of the mud-cake Tin Tin Tin Roof Tin Tin roof Tin Tin ... | In the corner of the room.<phrase> The image</phrase><object...
Direct use of the instruction | to the wall.<phrase> The gray basket</phrase><object><patch_... | ⚠️ to the the the Godiva Magna Magna Magna prototype prototype ... | to the wall.<phrase> The gray basket</phrase><object><patch_...
General VQA (no grounding) | ⚠️ The image features a small gray air conditioner sitting on t... | ⚠️ Sie Sie Sie sin bin Laden Laden Laden Fil Fil Fil fil fil fi... | ⚠️ The image features a small, gray, square-shaped air conditio...

Why image_to_text_projection is the key

Pure HF Kosmos-2
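Loading method: AutoModelForVision2Seq.from_pretrained(HF_PATH)
Key pattern: (none given)
image_to_text_projection: ✅ original preserved
LoRA: none
Original HuggingFace weights for microsoft/kosmos-2-patch14-224. No action-prediction training. This is the baseline whose text generation path is intact.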
Google-robot Pretrained
Loading method: HF architecture + ckpt['state_dict'] (model.backbone.* keys)
Key pattern: model.backbone.text_model.*
image_to_text_projection: ❌ contaminated by navigation training
LoRA: none (full fine-tune)
kosmos_ph_google-robot-post-train.pt. Pre-trained for navigation with the RoboVLMs framework. image_to_text_projection was retrained toward action features.
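Loading a checkpoint of this shape into the plain HF architecture comes down to renaming keys. The sketch below works on an ordinary dict so it runs without PyTorch. The function name and the sample keys are hypothetical, and the real checkpoint layout beyond the model.backbone.* pattern noted above is an assumption.

```python
def strip_backbone_prefix(state_dict, prefix="model.backbone."):
    """Return a new dict with `prefix` removed from every matching key.

    Keys that do not start with `prefix` (for example an action head
    outside the backbone) are dropped, because the HF model has no
    parameter to receive them.
    """
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy stand-in for ckpt['state_dict']; values would be tensors in practice.
toy_state_dict = {
    "model.backbone.text_model.model.layers.0.self_attn.q_proj.weight": 1,
    "model.backbone.image_to_text_projection.dense.weight": 2,
    "model.act_head.mlp.weight": 3,  # hypothetical non-backbone key
}

remapped = strip_backbone_prefix(toy_state_dict)
for key in sorted(remapped):
    print(key)
```

With PyTorch, the remapped dict would typically be passed to the model's load_state_dict with strict=False, and the returned missing and unexpected key lists checked. An image_to_text_projection key showing up as missing would mean the projection silently kept its original weights.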
V4 LoRA Fine-tuned
Loading method: HF architecture + extracted LoRA base_layer weights (lora_A/B excluded)
Key pattern: model.backbone.base_model.model.*.base_layer.*
image_to_text_projection: ⚠️ partially contaminated by action regression
LoRA: rank=32, alpha=64 (q/k/v/o/fc1/fc2)
mobile_vla_v4_regression_v2 checkpoint. LoRA (rank=32) plus a full fine-tune of image_to_text_projection. Trained on action regression.
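The loading method on this card (keep base_layer weights, drop lora_A/B) can be sketched as a key filter. This is an illustration under assumptions: the function name and sample keys are hypothetical, and keys for modules not wrapped by LoRA are assumed to carry the same prefix without a base_layer segment.

```python
def extract_base_weights(state_dict, prefix="model.backbone.base_model.model."):
    """Map LoRA-wrapped checkpoint keys back to plain HF parameter names.

    - keys outside `prefix` are dropped
    - lora_A / lora_B adapter matrices are skipped
    - the ".base_layer." segment is removed from the remaining keys
    """
    base = {}
    for key, value in state_dict.items():
        if not key.startswith(prefix):
            continue
        if ".lora_A." in key or ".lora_B." in key:
            continue
        base[key[len(prefix):].replace(".base_layer.", ".")] = value
    return base


# Toy stand-in for the checkpoint; values would be tensors in practice.
_P = "model.backbone.base_model.model."
toy_lora_state_dict = {
    _P + "text_model.model.layers.0.self_attn.q_proj.base_layer.weight": "W",
    _P + "text_model.model.layers.0.self_attn.q_proj.lora_A.default.weight": "A",
    _P + "text_model.model.layers.0.self_attn.q_proj.lora_B.default.weight": "B",
    _P + "image_to_text_projection.dense.weight": "P",
}

base_weights = extract_base_weights(toy_lora_state_dict)
for key in sorted(base_weights):
    print(key)
```

Because the adapter matrices are discarded, a model loaded this way carries the frozen base weights for the LoRA-targeted layers and the fully fine-tuned image_to_text_projection. Under that reading, any difference from Pure HF in the text outputs would come mainly from the projection, which fits the partial-contamination label on this card.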