Building a Function Call Inference Service with Qwen3 and vLLM

July 8, 2025

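1. Background

Function Call is an important capability of large language models (LLMs): it lets the model invoke external tool functions based on user input and thereby interact with external systems. This tutorial shows how to build an inference service that supports Function Call using the Qwen3 model and the vLLM framework.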

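2. How Qwen3 Function Call Works

Qwen3 implements Function Call through a special prompt mechanism combined with tool definitions (tools schema). At inference time, the model uses the user input and the predefined tool schemas to identify which tool function to call and with which arguments.

The overall flow is as follows: [Figure: function call flow chart]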

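Key step in detail: the first Qwen3 inference pass

The core task of Qwen3's first inference pass is to parse the natural-language input and decide whether a tool function needs to be called. If one is needed, the model outputs the call as structured JSON.

For example, suppose the user sends this request:

Please read the contents of the file /data/example.txt.

The server builds a prompt that includes the tool schemas. Conceptually, it carries the following information: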

{
  "role": "user",
  "content": "Please read the contents of the file /data/example.txt.",
  "tools": [
    {
      "name": "filesystem_read_file",
      "description": "Read the contents of the file at the given path.",
      "parameters": {
        "type": "object",
        "properties": {
          "file_path": {"type": "string", "description": "File path"},
          "max_size": {"type": "integer", "default": 102400}
        },
        "required": ["file_path"]
      }
    }
  ]
}

After inference, Qwen3 recognizes the user's intent and generates a structured tool call like the following:

{
  "tool_calls": [
    {
      "name": "filesystem_read_file",
      "arguments": {
        "file_path": "/data/example.txt"
      }
    }
  ]
}

When the server receives this instruction, it parses it and calls the corresponding external tool function.
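
The parsing step can be sketched as follows. This is a minimal illustration, not the tutorial's own helper: it assumes the raw model output wraps each call in <tool_call>...</tool_call> tags containing a JSON object with "name" and "arguments" keys, and parse_tool_calls is a hypothetical function name introduced here.

```python
import json
import re

# Assumption: each call appears as <tool_call>{"name": ..., "arguments": {...}}</tool_call>.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(generated_text: str) -> list[dict]:
    """Return every well-formed tool call found in the model output."""
    calls = []
    for payload in TOOL_CALL_PATTERN.findall(generated_text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing the whole request
        if not isinstance(call, dict) or not isinstance(call.get("name"), str):
            continue  # a usable call must at least name a tool
        arguments = call.get("arguments") or {}
        if not isinstance(arguments, dict):
            continue  # arguments must be a JSON object to be passed as keyword arguments
        calls.append({"name": call["name"], "arguments": arguments})
    return calls
```

An empty list means the model answered directly, so the server can return the text without running any tool.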

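How does Qwen3 carry out this inference?

Prompt guidance: the prompt explicitly defines the tool functions and their parameters, and the model reasons over this structured prompt.
Structured output: during pre-training and fine-tuning, Qwen3 learned to recognize when a call is needed and to generate structured call instructions.
Semantic parsing: the model's natural-language understanding lets it extract the relevant details from a user request and map them onto the corresponding tool call.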

3. Environment Setup

Core components used:

Qwen3 8B model
vLLM inference framework
FastAPI service framework

Install dependencies

pip install fastapi uvicorn transformers vllm pydantic

Model preparation

Load the Qwen3 model from Hugging Face or a local directory:

from transformers import AutoTokenizer
from vllm import LLM

model_path = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
llm = LLM(model=model_path, dtype="float16", trust_remote_code=True)

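4. Defining Tool Functions (Tools)

Tool functions are external functions the model can call, such as file system operations. This section uses a file system tool as the example.

Tool function implementation

Tool functions live in filesystem_agent.py. For example, the function that reads a file's contents: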

import os

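def read_file_content(file_path: str, max_size: int = 100 * 1024) -> dict:
    """Read a UTF-8 text file and return its content, or a dict with an "error" key."""
    # Reject paths that do not point to a regular file.
    if not os.path.isfile(file_path):
        return {"error": f"File not found: {file_path}"}
    # Refuse files larger than max_size bytes (the default matches the schema's 102400).
    size = os.path.getsize(file_path)
    if size > max_size:
        return {"error": f"File too large ({size} bytes); the maximum supported size is {max_size} bytes."}
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    return {"content": content, "size": size}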

Tool schema definition

Define the schemas for the tool functions in tool_schemas.py. These schemas are passed to the model to guide tool calls:

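from filesystem_agent import read_file_content

# Schemas that describe each tool to the model.
tool_schemas = [
    {
        "name": "filesystem_read_file",
        "description": "Read the contents of the file at the given path.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "File path"},
                "max_size": {"type": "integer", "default": 102400}
            },
            "required": ["file_path"]
        }
    },
    # more tools...
]

# Maps each schema name to the Python function that implements it.
tool_map = {
    "filesystem_read_file": read_file_content,
    # mappings for other tool functions
}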
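
Because the schema and the function share parameter names, a parsed call can be checked against the schema before the function runs. The sketch below is one possible way to do that, assuming schemas shaped like the ones above; dispatch_tool_call is a hypothetical helper, not part of the tutorial's code base.

```python
def dispatch_tool_call(call: dict, tool_map: dict, tool_schemas: list) -> dict:
    """Validate one parsed tool call against its schema, then run the tool."""
    name = call.get("name")
    func = tool_map.get(name)
    schema = next((s for s in tool_schemas if s["name"] == name), None)
    if func is None or schema is None:
        return {"error": f"unknown tool: {name}"}

    arguments = call.get("arguments") or {}
    params = schema.get("parameters", {})
    # Every required parameter must be present.
    missing = [key for key in params.get("required", []) if key not in arguments]
    if missing:
        return {"error": f"missing required arguments: {missing}"}
    # Reject arguments the schema does not declare, so they never reach the function.
    unexpected = [key for key in arguments if key not in params.get("properties", {})]
    if unexpected:
        return {"error": f"unexpected arguments: {unexpected}"}

    try:
        return func(**arguments)
    except Exception as exc:
        # Report tool failures as data so the model can explain them to the user.
        return {"error": f"{name} failed: {exc}"}
```

Returning an error dict instead of raising keeps the result in the same shape as read_file_content's own error responses.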

5. Building the Inference Service

The inference service is built with FastAPI and implemented in inference_server_vllm.py.

Service initialization

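from fastapi import FastAPI
from pydantic import BaseModel
from vllm import SamplingParams
from tool_schemas import tool_schemas, tool_map
from utils.tool_call_utils import extract_tool_calls, clean_output_strict
from utils.formatters import format_function_response

# tokenizer and llm are the objects created in the model preparation step.
app = FastAPI()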

Endpoint implementation

The endpoint handles the user query, model inference, tool calls, and returning the result:

class QueryInput(BaseModel):
    user_query: str

@app.post("/chat")
async def chat(input_data: QueryInput):
    messages = [{
        "role": "system",
        "content": "You are a file system assistant."
    }, {
        "role": "user",
        "content": input_data.user_query
    }]

    # First pass: render the prompt with the tool schemas so the model can decide whether to call a tool.
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=tool_schemas,
        add_generation_prompt=True,
        tokenize=False,
        enable_thinking=False,
        output_tool_calls=True
    )

    # The same sampling settings are used for both inference passes.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=512, stop=["<|im_end|>"])
    outputs = llm.generate(prompt, sampling_params)
    generated_text = outputs[0].outputs[0].text
    tool_calls = extract_tool_calls(generated_text)

    # No tool call requested: return the model's answer directly.
    if not tool_calls:
        return {"response": clean_output_strict(generated_text)}

    # Run each requested tool and collect the results as "tool" messages.
    tool_messages = []
    for call in tool_calls:
        func = tool_map.get(call["name"])
        if func is None:
            # The model named a tool that is not registered; report it instead of raising.
            response = {"error": f"Unknown tool: {call['name']}"}
        else:
            response = func(**call.get("arguments", {}))
        tool_messages.append({
            "role": "tool",
            "name": call["name"],
            "content": format_function_response(call["name"], response)
        })

    messages += [{"role": "assistant", "tool_calls": tool_calls}] + tool_messages

    # Second pass: let the model turn the tool results into the final answer.
    followup_prompt = tokenizer.apply_chat_template(
        messages,
        tools=[],
        add_generation_prompt=True,
        tokenize=False
    )

    final_outputs = llm.generate(followup_prompt, sampling_params)
    final_text = final_outputs[0].outputs[0].text

    return {"response": clean_output_strict(final_text)}

The complete code is available on GitHub or Gitee.
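
The endpoint relies on two helpers from the utils package whose source is not shown here: clean_output_strict and format_function_response. As a rough idea of what such helpers might do, here is a sketch with hypothetical stand-ins, strip_model_markup and render_tool_result. The tag names and the JSON layout of the tool message are assumptions, and the real helpers may behave differently.

```python
import json
import re

# Assumption: reasoning and tool calls are delimited by these tags in the raw output.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
TOOL_CALL_BLOCK = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def strip_model_markup(text: str) -> str:
    """Remove reasoning and tool-call blocks, leaving only the user-facing reply."""
    text = THINK_BLOCK.sub("", text)
    text = TOOL_CALL_BLOCK.sub("", text)
    return text.replace("<|im_end|>", "").strip()


def render_tool_result(name: str, result: dict) -> str:
    """Serialize a tool result as JSON text for the content of a "tool" message."""
    # ensure_ascii=False keeps non-ASCII file content readable in the prompt.
    return json.dumps({"tool": name, "result": result}, ensure_ascii=False)
```

Serializing the result as JSON keeps error dicts and successful results in one uniform format for the second inference pass.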

6. Starting the Service

Start the service with the following command:

uvicorn inference_server_vllm:app --host 0.0.0.0 --port 8000

7. Testing the Service

Send a request to test it:

curl -X POST "http://localhost:8000/chat" -d '{"user_query": "Read the file /data/example.txt"}' -H "Content-Type: application/json"
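
The same request can be sent from Python using only the standard library. This is a small sketch that assumes the service is reachable at http://localhost:8000 as in the command above; build_chat_request and ask are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(user_query: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for the /chat endpoint without sending it."""
    body = json.dumps({"user_query": user_query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_query: str, timeout: float = 60.0) -> str:
    """Send the query to the running service and return the "response" field."""
    with urllib.request.urlopen(build_chat_request(user_query), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, ask("Read the file /data/example.txt") returns the model's final answer as a string.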

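8. Summary

With the steps above, you can set up a working Function Call inference service: Qwen3 decides when a tool is needed, the server executes it, and a second inference pass turns the result into the final answer. This lets your large language model interact with external systems and broadens the scenarios your application can cover.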
