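1. Core Capabilities and Characteristics of Multimodal Models

1.1. Cross-Modal Understanding and Association

This is the most fundamental capability. The model understands not only the information within each modality but also the relationships between information in different modalities.

Understanding what an image depicts and describing it in text (image captioning).
Generating an image that matches a text description (text-to-image).
Understanding what happens in a video and answering questions about it (video question answering).
Analyzing medical images (image modality) together with medical records (text modality) to suggest a diagnosis.
Understanding a voice command (audio modality) and controlling smart home devices (which may also involve the sensor modality).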

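1.2. Cross-Modal Generation

The model can generate content in one modality from information in another.

Text-to-image: takes a text description and generates an image.
Image-to-text: takes an image and generates a description or a story, or answers questions about it.
Text-to-video: takes a text description and generates a short video.
Speech synthesis: takes text and generates realistic speech (text-to-audio).
Music generation: generates music from a description or a mood.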

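1.3. Complementary and Mutually Reinforcing Information

Information from different modalities complements each other and yields a fuller, more accurate understanding. For example, a video accompanied by a text commentary is easier to understand than the video alone or the text alone. A multimodal model exploits this complementarity automatically.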

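1.4. Closer to How Humans Perceive the World

Humans are multimodal by nature. We perceive and understand the world by seeing (images and video), hearing (audio), and speaking and reading (text). The goal of a multimodal large model is to imitate this more natural, more complete way of perceiving and understanding.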

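2. Image Content Recognition

This section uses the "Omni-modal | Tongyi Qianwen-Omni-Turbo" (qwen-omni-turbo) model on Alibaba Cloud Bailian, which is compatible with the OpenAI API calling convention.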

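2.1. Configuration

spring:
  ai:
    openai:
      chat:
        options:
          model: qwen-omni-turbo # model name
          temperature: 0.7 # sampling temperature

Under the /resources directory, create an /images folder and add an image to it. The code below loads /images/multimodal-test.png.

2.2. Invocation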

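import jakarta.annotation.Resource;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.model.Generation;
import org.springframework.ai.chat.prompt.Prompt;
// Assumes Spring AI 1.0.x; earlier milestone releases kept Media in org.springframework.ai.model
import org.springframework.ai.content.Media;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.MimeTypeUtils;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

@RestController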
@RequestMapping("/v9/ai")
public class MultimodalityController {

    @Resource
    private OpenAiChatModel chatModel;

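    /**
     * Streaming chat with an image attached to the user message.
     * @param message the user's question about the image
     * @return the model's reply as a stream of text chunks
     */
    @GetMapping(value = "/generateStream", produces = "text/html;charset=utf-8")
    public Flux<String> generateStream(@RequestParam(value = "message") String message) {
        // 1. Create the media resource (a PNG loaded from the classpath)
        Media image = new Media(
            MimeTypeUtils.IMAGE_PNG,
            new ClassPathResource("/images/multimodal-test.png")
        );

        // 2. Attach message metadata (optional). This map is stored on the message itself;
        //    the temperature the model actually uses comes from the chat options in the YAML configuration.
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("temperature", 0.7);

        // 3. Build the multimodal message
        UserMessage userMessage = UserMessage.builder()
            .text(message)
            .media(image)
            .metadata(metadata)
            .build();

        // 4. Build the prompt
        Prompt prompt = new Prompt(List.of(userMessage));
        // 5. Call the model in streaming mode
        return chatModel.stream(prompt)
            .mapNotNull(chatResponse -> {
                Generation generation = chatResponse.getResult();
                // Skip chunks that carry no generation
                return generation == null ? null : generation.getOutput().getText();
            });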
    }
}
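
The controller above hard-codes MimeTypeUtils.IMAGE_PNG, which only fits PNG files. If other image formats end up in the /images folder, the MIME type could be derived from the file name instead. The following is a minimal sketch using only the JDK. ImageMimeTypes and its extension table are hypothetical and not part of Spring AI, and whether the model accepts every format listed here is an assumption that should be checked against the model documentation.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Spring AI): maps a file extension to a MIME type string.
final class ImageMimeTypes {

    // Assumed set of formats; adjust it to what the model actually supports.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "png", "image/png",
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "gif", "image/gif",
            "webp", "image/webp");

    private ImageMimeTypes() {
    }

    /** Returns the MIME type for the given file name, or throws if the extension is missing or unknown. */
    static String fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + fileName);
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String mimeType = BY_EXTENSION.get(extension);
        if (mimeType == null) {
            throw new IllegalArgumentException("Unsupported image type: " + extension);
        }
        return mimeType;
    }

    public static void main(String[] args) {
        System.out.println(fromFileName("/images/multimodal-test.png"));
    }
}
```

In the controller, the returned string could be converted with MimeTypeUtils.parseMimeType(...) and passed to the Media constructor in place of the constant.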

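3. Text-to-Image

The "Tongyi Wanxiang 2.1 text-to-image" model is not compatible with the OpenAI calling convention, so it cannot be called through Spring AI the way the previous model was. The Alibaba Cloud Bailian SDK has to be added as an extra dependency.

3.1. Dependency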

<!-- Alibaba Cloud Bailian SDK -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>dashscope-sdk-java</artifactId>
    <version>2.20.6</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
        </exclusion>
    </exclusions>
</dependency>

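The dependency above excludes the slf4j-simple package. Without the exclusion, the project reports at startup that multiple SLF4J implementations are present.

3.2. Invocation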

import com.alibaba.dashscope.aigc.imagesynthesis.ImageSynthesis;
import com.alibaba.dashscope.aigc.imagesynthesis.ImageSynthesisParam;
import com.alibaba.dashscope.aigc.imagesynthesis.ImageSynthesisResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.JsonUtils;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/v10/ai")
@Slf4j
public class Text2ImgController {

    @Value("${spring.ai.openai.api-key}")
    private String apiKey;

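    /**
     * Calls the Alibaba Cloud Bailian text-to-image model.
     * @param prompt the text prompt describing the image
     * @return the synthesis result as JSON
     */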
    @GetMapping("/text2img")
    public String text2Image(@RequestParam(value = "prompt") String prompt) {
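        // Build the text-to-image parameters
        ImageSynthesisParam param = ImageSynthesisParam.builder()
                        .apiKey(apiKey) // Alibaba Cloud Bailian API key
                        .model("wanx2.1-t2i-plus") // model name
                        .prompt(prompt) // prompt
                        .n(1) // number of images to generate; one here
                        .size("1024*1024") // output resolution
                        .build();

        // Call the model synchronously to generate the image
        ImageSynthesis imageSynthesis = new ImageSynthesis();
        ImageSynthesisResult result = null;
        try {
            log.info("## Synchronous call in progress, please wait...");
            result = imageSynthesis.call(param);
        } catch (ApiException | NoApiKeyException e){
            log.error("## Text-to-image call failed", e);
        }

        // Return the result (it contains the URL of the generated image);
        // if the call failed, result is still null here
        return JsonUtils.toJson(result);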
    }

}
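
The size parameter above is a plain string in "width*height" form, so a typo only surfaces once the remote call fails. A small check before building the request could catch malformed values earlier. This is a minimal sketch using only the JDK. ImageSize is a hypothetical helper, not part of the SDK, and it only checks the format, because the range of resolutions the model accepts is not stated here.

```java
// Hypothetical helper (not part of the DashScope SDK): parses and validates a "width*height" string.
final class ImageSize {

    final int width;
    final int height;

    private ImageSize(int width, int height) {
        this.width = width;
        this.height = height;
    }

    /** Parses strings such as "1024*1024"; throws IllegalArgumentException on malformed input. */
    static ImageSize parse(String size) {
        // A limit of -1 keeps trailing empty parts, so "1024*1024*" is rejected.
        String[] parts = size.split("\\*", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected width*height but got: " + size);
        }
        int width;
        int height;
        try {
            width = Integer.parseInt(parts[0].trim());
            height = Integer.parseInt(parts[1].trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Width and height must be integers: " + size, e);
        }
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("Width and height must be positive: " + size);
        }
        return new ImageSize(width, height);
    }

    /** Formats the size back into the "width*height" form used by the size parameter. */
    String toParam() {
        return width + "*" + height;
    }

    public static void main(String[] args) {
        System.out.println(ImageSize.parse("1024*1024").toParam());
    }
}
```

The controller could then pass ImageSize.parse(userInput).toParam() to .size(...), for instance if the resolution were exposed as a request parameter.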

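4. Text-to-Audio

A text-to-audio (text-to-speech) large model is a deep learning model with a very large number of parameters, trained on massive amounts of text and audio data. Its core function is to generate high-quality audio that matches the text or prompt the user provides.

This section uses the cosyvoice-v2 model from the Model Plaza.

4.1. Choosing a Voice

The available voices are listed at:
https://help.aliyun.com/zh/model-studio/text-to-speech#3a8c7759a4yyx

4.2. Invocation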

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

@RestController
@RequestMapping("/v11/ai")
@Slf4j
public class Text2AudioController {

    @Value("${spring.ai.openai.api-key}")
    private String apiKey;

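    /**
     * Calls the Alibaba Cloud Bailian speech synthesis model.
     * @param prompt the text to convert to speech
     * @return "success" once the request has been processed
     */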
    @GetMapping("/text2audio")
    public String text2audio(@RequestParam(value = "prompt") String prompt) {

        // Build the speech synthesis parameters
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                        .apiKey(apiKey) // Alibaba Cloud Bailian API key
                        .model("cosyvoice-v2") // model name
                        .voice("longanran") // voice
                        .build();

        // Call the speech synthesis model synchronously and get the audio bytes
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call(prompt);

        // Path where the audio file is stored (hard-coded Windows path; adjust it for your system)
        String path = "E:\\result-audio.mp3";
        File file = new File(path);

        log.info("## requestId: {}", synthesizer.getLastRequestId());

        // Write the bytes to the file
        try (FileOutputStream fos = new FileOutputStream(file)) {
            fos.write(audio.array());
        } catch (IOException e) {
            log.error("## Failed to write audio file {}", path, e);
        }
        }

        return "success";
    }

}
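
The controller above writes audio.array() straight to disk. In general, ByteBuffer.array() throws for read-only buffers and for buffers without an accessible backing array, and it ignores the buffer's position and limit. Whether the SDK ever returns such a buffer is not known here, so the following is only a defensive sketch using the JDK alone. AudioFiles is a hypothetical helper, not part of the SDK.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of the DashScope SDK): copies a ByteBuffer's readable bytes and saves them.
final class AudioFiles {

    private AudioFiles() {
    }

    /** Copies the bytes between position and limit without changing the caller's buffer. */
    static byte[] toBytes(ByteBuffer buffer) {
        // duplicate() shares the content but has its own position and limit
        ByteBuffer view = buffer.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return bytes;
    }

    /** Writes the audio bytes to the target file, replacing any existing content. */
    static Path save(ByteBuffer audio, Path target) throws IOException {
        if (audio == null) {
            throw new IllegalArgumentException("No audio data to save");
        }
        return Files.write(target, toBytes(audio));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in the controller this would be the buffer returned by the synthesizer
        Path target = Files.createTempFile("result-audio", ".mp3");
        save(ByteBuffer.wrap(new byte[] {1, 2, 3}), target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Taking a Path instead of a hard-coded string also makes it easier to move the output location into configuration.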

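5. Text-to-Video

Text-to-video is a technique that uses AI to automatically generate a video matching a text description (prompt) supplied by the user. It marks a major advance for generative AI in video creation.

This section uses the "Tongyi Wanxiang 2.1-Image-to-Video-Plus" model from the Model Plaza, which turns a static image into a video guided by the text prompt.

5.1. Prepare a Static Image

Save the image to a local path. The code below reads it from E:/xiaojiejie.png.

5.2. Invocation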

import com.alibaba.dashscope.aigc.videosynthesis.VideoSynthesis;
import com.alibaba.dashscope.aigc.videosynthesis.VideoSynthesisParam;
import com.alibaba.dashscope.aigc.videosynthesis.VideoSynthesisResult;
import com.alibaba.dashscope.utils.JsonUtils;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/v12/ai")
@Slf4j
public class Text2VideoController {

    @Value("${spring.ai.openai.api-key}")
    private String apiKey;

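    /**
     * Calls the Alibaba Cloud Bailian image-to-video model.
     * @param prompt the text prompt that guides the generated video
     * @return the synthesis result as JSON
     */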
    @GetMapping("/text2video")
    public String text2video(@RequestParam(value = "prompt") String prompt) {
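        // Video processing parameters, such as the output resolution or the video duration.
        Map<String, Object> parameters = new HashMap<>();
        // Whether to enable prompt rewriting. When enabled, a large model rewrites the input prompt.
        // This noticeably improves results for short prompts but increases latency.
        parameters.put("prompt_extend", true);

        // Path of the static image that will be turned into a video
        String imgUrl = "file:///" + "E:/xiaojiejie.png"; // Windows path

        // Build the parameters for the model call
        VideoSynthesisParam param =
            VideoSynthesisParam.builder()
                .apiKey(apiKey) // API key
                .model("wanx2.1-i2v-plus") // model name
                .prompt(prompt) // prompt
                .imgUrl(imgUrl) // path of the static image
                .parameters(parameters) // video processing parameters
                .build();

        log.info("## Generating, please wait...");

        // Call the model to generate the video
        VideoSynthesis vs = new VideoSynthesis();
        VideoSynthesisResult result = null;
        try {
            result = vs.call(param);
        } catch (Exception e) {
            log.error("## Video generation call failed", e);
        }

        // Return the result; it is still null here if the call failed
        return JsonUtils.toJson(result);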
    }
}
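
The image URL above is assembled by concatenating "file:///" with a Windows path, which ties the code to one operating system. A more portable way to obtain a file URL is to let the JDK derive it from a Path. The sketch below uses only the JDK. LocalImageUrl is a hypothetical helper, and it is an assumption that the SDK accepts the percent-encoded form that Path.toUri() produces for names containing spaces or non-ASCII characters.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the DashScope SDK): builds a file URL from a local path.
final class LocalImageUrl {

    private LocalImageUrl() {
    }

    /** Returns a file URL such as file:///E:/xiaojiejie.png (Windows) or file:///home/user/xiaojiejie.png (Linux). */
    static String of(Path imagePath) {
        // toUri() adds the file scheme and percent-encodes characters that are not allowed in a URI
        return imagePath.toAbsolutePath().normalize().toUri().toString();
    }

    public static void main(String[] args) {
        // A relative path is resolved against the current working directory
        System.out.println(of(Paths.get("xiaojiejie.png")));
    }
}
```

The result could be passed to .imgUrl(...) in place of the concatenated string, with the image location read from configuration rather than hard-coded.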
