
Exp14 Feature Ablation: BBox vs Image

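With the MLP capacity held fixed (256→128→64), only the input feature combination is varied. This separates whether the +7.5 percentage-point gain from Step 1 to Step 2 comes from the image features or from the increase in architectural capacity. Each condition is reported as mean ± std over 5 split seeds.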
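The mean ± std reporting can be sketched as below. This is a minimal sketch that assumes one accuracy value (in %) per split seed. The report does not say whether the std is the sample (n-1) or population (n) version, so the use of `statistics.stdev` (sample std) is an assumption, and `mean_std` / `format_mean_std` are hypothetical helper names.

```python
import statistics


def mean_std(values: list[float]) -> tuple[float, float]:
    """Mean and sample std (n-1) over per-seed results.

    Assumption: the report does not state n vs n-1; use statistics.pstdev for n.
    """
    return statistics.mean(values), statistics.stdev(values)


def format_mean_std(values: list[float]) -> str:
    """Render per-seed accuracies (in %) as 'mean% ± std%' with one decimal."""
    m, s = mean_std(values)
    return f"{m:.1f}% ± {s:.1f}%"
```

With 5 seeds the sample and population std differ by a factor of sqrt(5/4) ≈ 1.12, which matters when comparing spreads like ± 9.8% and ± 0.8%.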

BBox-only: 67.4% ± 9.8% (5 seeds)
Image-only: 75.6% ± 0.8% (5 seeds)
BBox+Image: 76.7% ± 1.3% (5 seeds)

PM per Path Type (seed-averaged)

Path Type        BBox-only  Image-only  BBox+Image
center_straight  90.0%      78.6%       81.4%
center_left      45.6%      64.4%       71.1%
center_right     43.3%      57.8%       47.8%
left_straight    83.3%      90.0%       91.1%
left_left        70.5%      69.5%       74.7%
left_right       67.4%      83.2%       82.1%
right_straight   83.3%      85.6%       88.9%
right_left       65.5%      77.5%       80.5%
right_right      62.1%      74.1%       72.7%
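To read the table as an ablation, one can compute, per path type, how much BBox+Image changes PM relative to BBox-only. The sketch below uses only the seed-averaged values from the table. The per-seed numbers are not listed, so these deltas carry no std, and `PM_TABLE` / `image_gain` are illustrative names rather than part of the experiment code.

```python
# Seed-averaged PM (%) per path type, copied from the table above:
# (BBox-only, Image-only, BBox+Image)
PM_TABLE = {
    "center_straight": (90.0, 78.6, 81.4),
    "center_left": (45.6, 64.4, 71.1),
    "center_right": (43.3, 57.8, 47.8),
    "left_straight": (83.3, 90.0, 91.1),
    "left_left": (70.5, 69.5, 74.7),
    "left_right": (67.4, 83.2, 82.1),
    "right_straight": (83.3, 85.6, 88.9),
    "right_left": (65.5, 77.5, 80.5),
    "right_right": (62.1, 74.1, 72.7),
}


def image_gain(table: dict[str, tuple[float, float, float]]) -> dict[str, float]:
    """BBox+Image minus BBox-only, in percentage points, sorted largest first."""
    gains = {path: round(both - bbox, 1) for path, (bbox, _img, both) in table.items()}
    return dict(sorted(gains.items(), key=lambda kv: kv[1], reverse=True))
```

On these numbers the largest gain is center_left (+25.5 pp), and center_straight is the only path type where BBox+Image is below BBox-only (-8.6 pp).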
Setup
MLP backbone: Linear(d, 256)→ReLU→Dropout(0.25)→Linear(256,128)→ReLU→Dropout(0.2)→Linear(128,64)→ReLU→Linear(64,8), where d is the input feature dimension of the condition (BBox-only, Image-only, or BBox+Image).
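A minimal PyTorch sketch of the backbone as listed, assuming the layers are applied in exactly the order given and keep PyTorch's default initialization. `build_mlp` is a hypothetical name, and the per-condition value of d is not given in this report, so any d passed in is a placeholder.

```python
import torch
from torch import nn


def build_mlp(d: int, n_out: int = 8) -> nn.Sequential:
    """Backbone from the setup; only d (input feature size) changes between conditions."""
    return nn.Sequential(
        nn.Linear(d, 256), nn.ReLU(), nn.Dropout(0.25),
        nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, n_out),  # raw logits; softmax is left to the loss
    )
```

Because the hidden widths are fixed, only the first Linear layer's parameter count depends on the feature set, which is what keeps the capacity comparison fair across the three conditions.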
lr=2e-3, epochs=220, batch=128, AdamW, inverse-frequency class weights.
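Inverse-frequency class weights can be computed as in the sketch below. The report says only "inverse-frequency", so the normalization w_c = N / (K · n_c) (the "balanced" convention, which gives weight 1 to every class when classes are equally frequent) is an assumption, and `inverse_frequency_weights` is a hypothetical name.

```python
from collections import Counter


def inverse_frequency_weights(labels: list[int], num_classes: int) -> list[float]:
    """w_c = N / (K * n_c); classes absent from the training labels get weight 0.0.

    Assumption: the exact normalization used in the experiment is not stated.
    """
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]
```

In PyTorch these weights would typically be passed as `nn.CrossEntropyLoss(weight=torch.tensor(w))`, with the optimizer created as `torch.optim.AdamW(model.parameters(), lr=2e-3)`. Computing the weights from the training split only avoids leaking test-set label frequencies.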
Dataset: bbox_dataset.json (45 episodes, 794 frames, Pure HF Kosmos-2 grounding). Split: episode-level stratified 80/20, repeated over 5 seeds.
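An episode-level stratified split means that all frames of an episode land on the same side, so no episode leaks between the two splits. The sketch below assumes each episode carries a single stratum label (for example, its path type). The report does not say what the stratification key is, and `episode_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def episode_split(
    episode_labels: dict[str, str], test_frac: float = 0.2, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Split episode ids 80/20 per stratum; frames then follow their episode.

    Assumption: one stratum label per episode (the real key is not stated).
    """
    rng = random.Random(seed)
    by_stratum: dict[str, list[str]] = defaultdict(list)
    for ep, label in sorted(episode_labels.items()):
        by_stratum[label].append(ep)
    train, test = [], []
    for label in sorted(by_stratum):
        eps = by_stratum[label]
        rng.shuffle(eps)
        # Keep at least one test episode per stratum when the stratum has 2+ episodes.
        n_test = max(1, round(len(eps) * test_frac)) if len(eps) > 1 else 0
        test.extend(eps[:n_test])
        train.extend(eps[n_test:])
    return sorted(train), sorted(test)
```

Repeating the split with 5 different `seed` values and retraining each time gives the per-seed results behind each mean ± std. With only 45 episodes, a single split can shift noticeably from seed to seed.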