[Refactor] Multimodal data processing for VLM (#6659)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Xinyuan Tong
2025-06-04 11:22:33 -07:00
committed by GitHub
parent bd75690f4e
commit cf9815ba69
11 changed files with 248 additions and 167 deletions

@@ -132,7 +132,7 @@
 "\n",
 "mm_item = dict(\n",
 " modality=\"IMAGE\",\n",
-" image_grid_thws=processed_prompt[\"image_grid_thw\"],\n",
+" image_grid_thw=processed_prompt[\"image_grid_thw\"],\n",
 " precomputed_features=precomputed_features,\n",
 ")\n",
 "out = llm.generate(input_ids=input_ids, image_data=[mm_item])\n",