Getting Started: How to Fine-Tune a Vision-Language Model (VLM) for a Specific Task

May 15, 2025

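Vision-Language Models (VLMs) are an active research area in multimodal AI. They process visual input (images or video) together with language, and are widely used for visual question answering, image-to-text generation, and image-text retrieval. As open-source models have matured, ordinary developers can also use fine-tuning to adapt a pretrained model to a specific domain quickly.

This tutorial uses e-commerce image-text data as an example. By fine-tuning Qwen2.5-VL-3B-Instruct, we shift the model from general visual understanding toward product recognition and description, so that it performs multi-level classification of product images and generates product descriptions. Qwen2.5-VL-3B-Instruct is a small vision-language model with a compact footprint, efficient inference, and low resource consumption. It is well suited to fast deployment in small and medium-scale scenarios, where it can noticeably reduce compute and cost.

Before fine-tuning, a vision-language model can understand images in a general way, but it tends to produce vague, generic descriptions. The goal of this fine-tuning run is for the model to identify the product category in an e-commerce image accurately and to generate a finer-grained product description that matches the platform's style, so that an e-commerce system can fill in structured product information automatically.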

Environment Setup

Make sure the following dependencies are installed in your development environment:

pip install torch torchvision transformers datasets accelerate bitsandbytes trl peft qwen-vl-utils flash-attn scikit-learn

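You also need a CUDA-capable GPU. This tutorial was run on a single A100 40G card on Spader.AI.

Hands-On Steps: Fine-Tuning a VLM for Product Recognition and Product Description Generation

Step 1: Load and Split the Dataset

The dataset used in this tutorial, deepfashion-multimodal, contains e-commerce product images together with multi-level category labels and a detailed product description text field. These image-text pairs serve as training samples that teach the model to identify the product category and generate matching text.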

from datasets import load_dataset

# Load the DeepFashion-Multimodal dataset from Hugging Face (image + categories + description)
dataset_id = "Marqo/deepfashion-multimodal"
dataset = load_dataset(dataset_id)
data = dataset["data"]

# Split into train/validation/test at 80/10/10 using the datasets library's own
# train_test_split method (no scikit-learn import is needed for this)
train_valid_test = data.train_test_split(test_size=0.2, seed=42)
valid_test = train_valid_test["test"].train_test_split(test_size=0.5, seed=42)
train = train_valid_test["train"]
valid = valid_test["train"]
test = valid_test["test"]
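The two-stage split above first holds out 20% of the data and then halves that held-out part, which is how 80/10/10 comes about. The following is a minimal arithmetic sketch of that logic. The function name is hypothetical, and rounding the held-out part up is an assumption here; the exact rounding inside a given library version may differ by a sample.

```python
import math


def two_stage_split_sizes(n_samples, holdout_fraction=0.2, test_share_of_holdout=0.5):
    """Return (train, valid, test) sizes for a two-stage split.

    Assumption: each held-out part is rounded up, the remainder goes to the other side.
    """
    n_holdout = math.ceil(holdout_fraction * n_samples)    # first split: 20% held out
    n_train = n_samples - n_holdout                         # remaining 80% for training
    n_test = math.ceil(test_share_of_holdout * n_holdout)   # second split: half of the held-out part
    n_valid = n_holdout - n_test                            # the other half for validation
    return n_train, n_valid, n_test


n_train, n_valid, n_test = two_stage_split_sizes(1000)
print(n_train, n_valid, n_test)
```

Comparing these numbers with `len(train)`, `len(valid)`, and `len(test)` is a quick way to confirm that the split behaved as intended.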

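Each sample in the dataset contains:

A product image
Two levels of category labels: category1 and category2
A natural-language product description, text

These fields are used to train the model to identify the product type and generate a structured description.

Step 2: Build Multimodal Conversation-Format Inputs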

system_message = "You are a fashion product expert assistant who excels at identifying product types from images and generating concise descriptions."
prompt = "What is this product?"

def format_data(sample):
    generation = f"{sample['category1']}/{sample['category2']}: {sample['text']}"
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": sample["image"],
                },
                {
                    "type": "text",
                    "text": prompt,
                },
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": generation}],
        },
    ]

train_dataset = [format_data(sample) for sample in train]
valid_dataset = [format_data(sample) for sample in valid]
test_dataset = [format_data(sample) for sample in test]
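Because a malformed sample can waste a long training run, it may be worth checking the conversations before training. The sketch below is one possible check, not part of the original tutorial: `check_conversation` is a hypothetical helper, and it assumes the three-turn layout and the `category1/category2: text` target format produced by `format_data` above (it also assumes category names contain no slash).

```python
def check_conversation(messages):
    """Return a list of problems found in one system/user/assistant conversation."""
    problems = []
    roles = [turn.get("role") for turn in messages]
    if roles != ["system", "user", "assistant"]:
        problems.append(f"unexpected role order: {roles}")
        return problems

    # The user turn should carry exactly one image and a text prompt.
    user_types = [item.get("type") for item in messages[1]["content"]]
    if user_types.count("image") != 1:
        problems.append("user turn should contain exactly one image item")
    if "text" not in user_types:
        problems.append("user turn should contain a text prompt")

    # The assistant turn should look like 'category1/category2: description'.
    answer = messages[2]["content"][0].get("text", "")
    head, sep, body = answer.partition(": ")
    if not sep or head.count("/") != 1 or not body.strip():
        problems.append("assistant text should look like 'category1/category2: description'")
    return problems


# Illustrative conversation; the image value is a placeholder, not a real PIL image.
example = [
    {"role": "system", "content": [{"type": "text", "text": "You are a fashion product expert."}]},
    {"role": "user", "content": [
        {"type": "image", "image": "<PIL image placeholder>"},
        {"type": "text", "text": "What is this product?"},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "women/sweatshirts: A long-sleeve sweater."}]},
]
print(check_conversation(example))
```

Running the check over `train_dataset` and counting non-empty results gives a rough picture of data quality before any GPU time is spent.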

Step 3: Load the Model and Processor

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Load the model and processor (supports image + text input)
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(model_id)
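The loading code above requests `flash_attention_2`, which only works when the `flash_attn` package is installed. As a hedged sketch, the attention backend could be chosen at runtime instead. The helper name `pick_attn_implementation` is hypothetical; `"sdpa"` is the PyTorch scaled-dot-product-attention backend name that transformers accepts for `attn_implementation`.

```python
import importlib.util


def pick_attn_implementation(is_installed=None):
    """Return 'flash_attention_2' if flash_attn is importable, otherwise 'sdpa'."""
    if is_installed is None:
        # Default probe: look the package up without importing it.
        is_installed = lambda name: importlib.util.find_spec(name) is not None
    return "flash_attention_2" if is_installed("flash_attn") else "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as `attn_implementation=attn_implementation` in the `from_pretrained` call.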

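Step 4: Configure Training Parameters

from peft import LoraConfig, get_peft_model
from trl import SFTConfig

# Configure LoRA (lightweight, parameter-efficient fine-tuning)
peft_config = LoraConfig(
    lora_alpha=16,             # LoRA scaling factor
    lora_dropout=0.05,         # LoRA dropout rate
    r=8,                       # LoRA rank; a larger rank means more trainable parameters
    bias="none",               # Do not fine-tune bias terms
    target_modules=["q_proj", "v_proj"],  # Modules that receive LoRA adapters
    task_type="CAUSAL_LM",     # Task type: autoregressive language modeling
)

# Apply the LoRA configuration
model = get_peft_model(model, peft_config)

# Configure training arguments
training_args = SFTConfig(
    output_dir="./qwen2.5-3b-instruct-trl-sft-deepfashion",  # Where checkpoints are saved
    num_train_epochs=3,                 # Number of training epochs
    per_device_train_batch_size=4,      # Training batch size per device
    per_device_eval_batch_size=4,       # Evaluation batch size per device
    gradient_accumulation_steps=8,      # Gradient accumulation steps
    gradient_checkpointing=True,        # Enable gradient checkpointing to save GPU memory
    optim="adamw_torch_fused",          # Optimizer
    learning_rate=2e-4,                 # Learning rate
    lr_scheduler_type="constant",       # Learning-rate scheduler type
    logging_steps=10,                   # Log every 10 steps
    eval_steps=10,                      # Evaluate every 10 steps
    eval_strategy="steps",              # Evaluate on a step schedule
    save_strategy="steps",              # Save on a step schedule
    save_steps=20,                      # Save every 20 steps
    metric_for_best_model="eval_loss",  # Metric used to pick the best model
    greater_is_better=False,            # Lower is better for this metric
    load_best_model_at_end=True,        # Load the best checkpoint when training ends
    bf16=True,                          # Use bfloat16 precision
    tf32=True,                          # Enable TensorFloat-32
    max_grad_norm=0.3,                  # Gradient clipping threshold
    warmup_ratio=0.03,                  # Warmup ratio (ignored by the "constant" scheduler; "constant_with_warmup" applies it)
    push_to_hub=False,                  # Do not push to the Hugging Face Hub
    report_to="none",                   # Disable external logging platforms
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Checkpointing options
    dataset_text_field="",              # Left empty because the data uses the messages format
    dataset_kwargs={"skip_prepare_dataset": True},  # Skip the default dataset preparation
)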
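To make the numbers in this configuration more concrete, here is a small arithmetic sketch. The layer dimensions and the dataset size are illustrative assumptions, not values read from the model config or the dataset, and the step count is approximate because the exact rounding depends on the Trainer version.

```python
import math


def lora_scaling(lora_alpha, rank):
    """LoRA scales its low-rank update by alpha / r."""
    return lora_alpha / rank


def lora_params_per_linear(d_in, d_out, rank):
    """LoRA adds two matrices per adapted layer: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device * grad_accum * num_devices


def approx_optimizer_steps(num_examples, epochs, per_device, grad_accum, num_devices=1):
    """Approximate total optimizer updates (assumes partial batches are rounded up)."""
    batches_per_epoch = math.ceil(num_examples / (per_device * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


print(lora_scaling(16, 8))                     # values taken from the LoraConfig above
print(lora_params_per_linear(2048, 2048, 8))   # hypothetical 2048 x 2048 projection layer
print(effective_batch_size(4, 8))              # single GPU, as in this tutorial
print(approx_optimizer_steps(10_000, 3, 4, 8)) # hypothetical 10,000 training samples
```

With `eval_steps=10` and `save_steps=20`, dividing the estimated step count by those intervals gives a rough idea of how many evaluations and checkpoints the run will produce.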

Step 5: Start Fine-Tuning

from trl import SFTTrainer

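# Note: collate_fn is not defined in this article. It must be defined before this point
# and must turn a list of conversations into batched model inputs with labels.
trainer = SFTTrainer(
    model=model,                  # already wrapped with LoRA by get_peft_model above,
                                  # so peft_config is not passed a second time
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=collate_fn,
)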

trainer.train()
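The trainer call above relies on a `collate_fn` that the article does not show. One piece such a collator commonly needs is label masking: labels start as a copy of the input token ids, and positions that should not contribute to the loss (padding and image placeholder tokens) are set to -100, the default ignore index of PyTorch's cross-entropy loss. The sketch below shows only that masking step on plain Python lists. The token ids are made-up placeholders; real ids would have to be read from the model's tokenizer.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_labels(input_ids, ignored_token_ids):
    """Copy input ids into labels, replacing ignored tokens with IGNORE_INDEX."""
    ignored = set(ignored_token_ids)
    return [
        [IGNORE_INDEX if token in ignored else token for token in row]
        for row in input_ids
    ]


# Hypothetical ids for illustration only: 0 = padding, 9 = image placeholder.
PAD_ID, IMAGE_ID = 0, 9
batch_input_ids = [
    [5, 9, 9, 7, 8],
    [5, 9, 7, 0, 0],
]
labels = mask_labels(batch_input_ids, [PAD_ID, IMAGE_ID])
print(labels)
```

A full collator would additionally apply the processor's chat template, preprocess the images, and return tensors; those steps depend on the processor and are left out of this sketch.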

Step 6: Compare and Validate the Fine-Tuning Results

The base model, before fine-tuning, usually understands an image only in broad terms. Take the image below as an example.

[Image: sample]

Output before fine-tuning (generic):

This product appears to be a hoodie with a unique design. The hoodie features a black and white pattern that resembles a tie-dye effect, combined with a hood and possibly some decorative elements like pom-poms or other embellishments. The overall style suggests it might be a casual, trendy piece of clothing suitable for cooler weather or as a fashion statement.

Output after fine-tuning (customized):

women/sweatshirts: The person wears a long-sleeve sweater with graphic patterns. The sweater is with cotton fabric. It has a round neckline. The person wears a three-point pants. The pants are with cotton fabric and solid color patterns. There is an accessory on her wrist. This lady wears a ring.

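Note: the actual output depends on the quality of the training data and on the output format you design. The original annotation for this image is shown below. The product descriptions in this dataset were most likely generated by an AI tool, but that does not affect the purpose of this tutorial.

[Image: sample label]

You can load the model before and after fine-tuning in an inference script to verify the difference; see GitHub or Gitee for the full code.

In the end, you get product categories and descriptions for e-commerce images that are more precise and consistent in style, which enables high-quality structured image-to-text generation for e-commerce.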
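Beyond inspecting single examples, the category part of the output can be scored over the whole test set. The sketch below is one simple way to do that, assuming predictions and references both follow the `category1/category2: text` format used in this tutorial. The function names are hypothetical, and the strings in the example are made up for illustration.

```python
def category_of(generation):
    """Return the 'category1/category2' prefix of an output, or None if there is no colon."""
    head, sep, _ = generation.partition(":")
    return head.strip().lower() if sep else None


def category_accuracy(predictions, references):
    """Fraction of predictions whose category prefix matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for prediction, reference in zip(predictions, references):
        predicted = category_of(prediction)
        if predicted is not None and predicted == category_of(reference):
            hits += 1
    return hits / len(references)


# Made-up strings for illustration; real inputs would come from model inference.
predictions = ["women/sweatshirts: a", "men/pants: b", "no category here"]
references = ["women/sweatshirts: x", "men/shorts: y", "women/dresses: z"]
print(category_accuracy(predictions, references))
```

Exact matching is strict by design: an output without a parsable category counts as wrong, which also penalizes format drift.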

Note that prompt engineering can bring a model that has not been fine-tuned to a similar result, but it requires a carefully written prompt, for example:

system_message = (
    "You are an expert e-commerce assistant. "
    "Your task is to analyze a product image and generate a structured output. "
    "The output must follow this format:\n"
    "[category1/category2]: [description]\n\n"
    "category1 is the high-level gender category (e.g., 'men' or 'women').\n"
    "category2 is the specific clothing type (e.g., 'dresses', 'hoodies', 'jeans').\n"
    "The description should be concise, informative, and suitable for an online product listing."
)

Result:

[category1: women]: [category2: hoodies]\nThis product is a women's hoodie with a unique tie-dye pattern on the front. The hoodie features a long, loose fit with a drawstring hood and a front pocket. The sleeves are long and the bottom hem is slightly flared, giving it a casual and comfortable look. The color scheme is predominantly black and white, with a contrasting mustard yellow section at the bottom of the hoodie. This style is perfect for layering under dresses or skirts during cooler weather.
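The prompted result above does not follow the requested `category1/category2: description` layout, while the fine-tuned output does. A format check makes that difference measurable. The regular expression below is an assumed encoding of the tutorial's target format, not something defined in the original text, and the sample strings are shortened versions of the outputs quoted above.

```python
import re

# Assumed target format: category1/category2: description (no brackets around the categories)
TARGET_FORMAT = re.compile(
    r"^(?P<category1>[^/\[\]:\s]+)/(?P<category2>[^/\[\]:]+): (?P<description>.+)$",
    re.DOTALL,
)


def parse_structured_output(text):
    """Return the parsed fields as a dict, or None if the text does not match the format."""
    match = TARGET_FORMAT.match(text.strip())
    return match.groupdict() if match else None


fine_tuned = "women/sweatshirts: The person wears a long-sleeve sweater with graphic patterns."
prompted = "[category1: women]: [category2: hoodies]\nThis product is a women's hoodie."

print(parse_structured_output(fine_tuned))
print(parse_structured_output(prompted))
```

Counting how often `parse_structured_output` returns `None` over a test set gives a format-compliance rate that can be compared between the prompted and the fine-tuned model.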

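Summary

In this tutorial we walked through the full process of data preparation, input formatting, model fine-tuning, and result validation, and adapted the general-purpose vision-language model Qwen2.5-VL-3B-Instruct to the e-commerce domain. Fine-tuning gave the model the ability to identify product categories and generate e-commerce-style descriptions, making the generated text more targeted and more usable. The same workflow also applies to vertical scenarios such as healthcare, industrial quality inspection, and educational assessment, and can help bring multimodal AI into industry use.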
