1. Introduction: The Lightweight Revolution in Speech Synthesis

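As AI technology iterates rapidly, text-to-speech (TTS) has become core infrastructure for human-computer interaction, content creation, and related fields. Traditional TTS models, however, tend to be large, compute-hungry, and limited in language coverage, which makes them hard to deploy efficiently on mobile and edge devices.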

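Kokoro-TTS breaks this deadlock. Developed by hexgrad, this ultra-lightweight model has only 82 million parameters. It is built on a hybrid of StyleTTS 2 and ISTFTNet and uses a decoder-only design, which greatly reduces computational cost. This article walks through deploying the Kokoro service with Docker on CentOS 7.9 and calling it from C#.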

2. Kokoro Technical Overview

2.1 Core Features

Kokoro-TTS balances a small footprint with strong output quality thanks to the following characteristics:

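  • Ultra-lightweight design: the parameter count is less than 1/10 that of traditional models, and the full-precision weights are only about 320 MB

  • Multilingual support: native support for American English, British English, Spanish, French, Italian, and Portuguese; Chinese, Japanese, and Hindi are supported through custom phonemization

  • Multiple voice styles: more than 10 preset voice packs covering different genders and vocal characteristics, including special styles such as whispering

  • Real-time capability: the optimized architecture supports very low-latency synthesis, suitable for real-time interaction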

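2.2 Technical Architecture

Conceptually, Kokoro's pipeline can be viewed as a three-stage "encoder, style controller, decoder" design:

  • Encoder: a Transformer-based text feature extractor supporting character embeddings and context awareness

  • Style controller: produces a style embedding vector through a conditional variational autoencoder, adjusting speaking rate, pitch, and emotional intensity

  • Decoder: a lightweight vocoder (ISTFTNet, as noted in the introduction) that generates audio efficiently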

Architecture diagram

┌──────────────────────────┐
│  Client                  │
│  (Web / API / C# caller) │
└─────────────┬────────────┘
              │ HTTP JSON
              ▼
┌──────────────────────────┐
│  FastAPI service layer   │
│  (kokoro-fastapi-cpu)    │
└─────────────┬────────────┘
              │
              ▼
┌──────────────────────────┐
│  Text preprocessing      │
│  - Tokenization          │
│  - Language detection    │
│  - Voice parameters      │
└─────────────┬────────────┘
              │
              ▼
┌──────────────────────────┐
│  Kokoro Acoustic Model   │
│  (text → Mel spectrogram)│
└─────────────┬────────────┘
              │ Mel Spectrogram
              ▼
┌──────────────────────────┐
│  Vocoder                 │
│  (ISTFTNet or similar)   │
│  Mel → PCM waveform      │
└─────────────┬────────────┘
              │
              ▼
┌──────────────────────────┐
│  Audio encoding          │
│  PCM → WAV / MP3         │
└─────────────┬────────────┘
              │
              ▼
┌──────────────────────────┐
│  HTTP Response           │
│  (audio/wav or raw PCM)  │
└──────────────────────────┘

Figure: Typical Kokoro-TTS service pipeline

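3. Docker Deployment on CentOS 7.9

3.1 Prerequisites

First make sure Docker is installed on the CentOS 7.9 server.

3.2 Running the Docker Image

I used the CPU build of the image, which Docker pulls automatically on first run: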

docker run -d --restart=always -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu

GitHub repository:
https://github.com/remsky/Kokoro-FastAPI

(See the repository for details.)

Once the container is running, the service can be checked at the following URLs.

3.3 Web Page

http://localhost:8880/web/

3.4 API Endpoints

http://192.168.0.101:8880/docs#/
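Right after `docker run`, the container may need some time to load the model before the URLs above respond. The following is a minimal sketch, written as a .NET 8 top-level program, of how a C# client might wait for the service. The helper names `WaitUntilReadyAsync` and `MakeHttpProbe` are hypothetical, and treating a 2xx answer from the docs page as "ready" is an assumption rather than a documented health check.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Retries an arbitrary async probe until it succeeds or the attempts run out.
async Task<bool> WaitUntilReadyAsync(Func<Task<bool>> probe, int maxAttempts, TimeSpan delay)
{
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        if (await probe()) return true;
        if (attempt < maxAttempts) await Task.Delay(delay);
    }
    return false;
}

// Probe for the real service: true when the URL answers with a 2xx status.
Func<Task<bool>> MakeHttpProbe(HttpClient client, string url) => async () =>
{
    try
    {
        using var response = await client.GetAsync(url);
        return response.IsSuccessStatusCode;
    }
    catch (HttpRequestException) { return false; }  // container still starting
    catch (TaskCanceledException) { return false; } // request timed out
};

// Demo with a stand-in probe that succeeds on the third call (no network needed).
int calls = 0;
bool ready = await WaitUntilReadyAsync(
    () => Task.FromResult(++calls >= 3), 5, TimeSpan.FromMilliseconds(10));
Console.WriteLine($"ready={ready} after {calls} probe(s)");
```

Against a real deployment, the probe would be something like `MakeHttpProbe(new HttpClient(), "http://192.168.0.101:8880/docs")`, with a delay of a second or two between attempts.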

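4. Audio Playback in C# via the API

This step took me quite a while. I tested on both Windows and Linux.

4.1 Integration with .NET 8

1. Call the text-to-speech API to obtain the audio data

2. Play the audio, with separate implementations for Windows and Linux. The core code is as follows: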

using Alsa.Net;
using audioplay.Options;
using NAudio.CoreAudioApi;
using NAudio.Wave;
using NAudio.Wave.Compression;
using System.IO;
using Serilog;

namespace audioplay.Util
{
    public class AudioPlayer
    {
        /// <summary>
        /// Plays WAV audio through ALSA (for Linux).
        /// </summary>
        /// <param name="wavBytes">Complete WAV file contents.</param>
        public static void AlsaWavPlay(byte[] wavBytes)
        {
            using var alsaDevice = AlsaDeviceBuilder.Create(new SoundDeviceSettings());
            Serilog.Log.Information("Current volume: {0}", alsaDevice.PlaybackVolume);
            using var inputStream = new MemoryStream(wavBytes);
            alsaDevice.Play(inputStream);
        }

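        /// <summary>
        /// Plays raw PCM audio through NAudio (for Windows).
        /// </summary>
        /// <param name="audioBytes">Raw PCM samples (16-bit, 24 kHz, mono assumed).</param>
        /// <param name="stoppingToken">Token used to stop playback early.</param>
        /// <param name="sysConfOption">Options providing the directory for error dumps.</param>
        /// <returns>A task that completes when playback ends or is cancelled.</returns>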
        public static async Task PlayPCMAudioAsync(byte[] audioBytes, CancellationToken stoppingToken, SysConfOption sysConfOption)
        {
            try
            {
                if (audioBytes.Length == 0)
                {
                    Serilog.Log.Error("Empty audio data");
                    return;
                }

                Serilog.Log.Information("Playing PCM format audio");

                // Create PCM wave format (assuming 16-bit, 24kHz, mono)
                var waveFormat = new WaveFormat(24000, 16, 1);

                using (var memoryStream = new System.IO.MemoryStream(audioBytes))
                using (var waveStream = new RawSourceWaveStream(memoryStream, waveFormat))
                using (var waveOut = new WaveOutEvent())
                {
                    Serilog.Log.Information("Audio format: {0}, {1} channels, {2} Hz",
                        waveStream.WaveFormat.Encoding,
                        waveStream.WaveFormat.Channels,
                        waveStream.WaveFormat.SampleRate);

                    waveOut.Init(waveStream);
                    waveOut.Play();
                    while (waveOut.PlaybackState == PlaybackState.Playing && !stoppingToken.IsCancellationRequested)
                    {
                        await Task.Delay(100, stoppingToken);
                    }
                }
            }
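            catch (OperationCanceledException)
            {
                // Task.Delay throws when stoppingToken fires; this is expected, not an error
                Serilog.Log.Information("Audio playback cancelled");
            }
            catch (Exception ex)
            {
                Serilog.Log.Error(ex, "Error playing audio");

                // Try to save the data to a file for inspection
                Directory.CreateDirectory(sysConfOption.Path); // no-op if it already exists
                var filePath = Path.Combine(sysConfOption.Path, $"audio_error_{DateTime.Now:yyyyMMddHHmmss}.pcm");
                // Use CancellationToken.None so the dump is still written during shutdown
                await System.IO.File.WriteAllBytesAsync(filePath, audioBytes, CancellationToken.None);
                Serilog.Log.Information("Saved problematic audio data to {0} for inspection", filePath);
            }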
        }

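        /// <summary>
        /// Plays WAV audio through NAudio (for Windows).
        /// </summary>
        /// <param name="wavBytes">WAV data, possibly with a streaming header (length FF FF FF FF).</param>
        /// <param name="stoppingToken">Token used to stop playback early.</param>
        /// <param name="sysConfOption">Options providing the directory for error dumps.</param>
        /// <returns>A task that completes when playback ends or is cancelled.</returns>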
        public static async Task PlayWavAudioAsync(byte[] wavBytes, CancellationToken stoppingToken, SysConfOption sysConfOption)
        {
            try
            {
                if (wavBytes.Length == 0)
                {
                    Serilog.Log.Error("Empty audio data");
                    return;
                }

                // Create a copy of the wavBytes to avoid modifying the original
                byte[] wavData = new byte[wavBytes.Length];
                Array.Copy(wavBytes, wavData, wavBytes.Length);

                // Check and fix WAV length field if it's FF FF FF FF
                if (wavData.Length >= 8)
                {
                    // Read the current length field (bytes 4-7)
                    var currentLength = BitConverter.ToInt32(wavData, 4);
                    if (currentLength == -1) // 0xFFFFFFFF in two's complement
                    {
                        Serilog.Log.Information("Detected FF FF FF FF length field, calculating actual length");
                        // Calculate actual length (total bytes - 8 for RIFF header)
                        int actualLength = wavData.Length - 8;
                        // Ensure the length is within the valid range for 32-bit integer
                        if (actualLength < 0 || actualLength > int.MaxValue)
                        {
                            Serilog.Log.Error("Invalid WAV length: {0}", actualLength);
                            actualLength = Math.Max(0, Math.Min(int.MaxValue, wavData.Length - 8));
                            Serilog.Log.Information("Adjusted WAV length to: {0} bytes", actualLength);
                        }
                        // Convert to little-endian bytes and update the length field
                        byte[] lengthBytes = BitConverter.GetBytes(actualLength);
                        Array.Copy(lengthBytes, 0, wavData, 4, 4);
                        Serilog.Log.Information("Updated WAV length field to: {0} bytes", actualLength);

                        // Also check and fix data chunk length if present
                        if (wavData.Length >= 44)
                        {
                            // Find data chunk
                            int dataChunkOffset = 12;
                            while (dataChunkOffset < wavData.Length - 4)
                            {
                                if (wavData[dataChunkOffset] == 0x64 && wavData[dataChunkOffset + 1] == 0x61 &&
                                    wavData[dataChunkOffset + 2] == 0x74 && wavData[dataChunkOffset + 3] == 0x61) // data
                                {
                                    // Calculate actual data size
                                    int dataStart = dataChunkOffset + 8;
                                    int actualDataSize = wavData.Length - dataStart;
                                    if (actualDataSize < 0) actualDataSize = 0;
                                    if (actualDataSize > int.MaxValue) actualDataSize = int.MaxValue;

                                    // Update data chunk length
                                    byte[] dataLengthBytes = BitConverter.GetBytes(actualDataSize);
                                    Array.Copy(dataLengthBytes, 0, wavData, dataChunkOffset + 4, 4);
                                    Serilog.Log.Information("Updated data chunk length to: {0} bytes", actualDataSize);
                                    break;
                                }
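                                // Advance to the next chunk; stop on a truncated header or a negative size
                                if (dataChunkOffset + 8 > wavData.Length) break;
                                int chunkSize = BitConverter.ToInt32(wavData, dataChunkOffset + 4);
                                if (chunkSize < 0) break;
                                dataChunkOffset += 8 + chunkSize;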
                            }
                        }
                    }
                }

                // Try to parse WAV header to get format information
                WaveFormat waveFormat = null;
                byte[] pcmData = null;

                try
                {
                    // Basic WAV header parsing
                    if (wavData.Length >= 44) // Minimum WAV header size
                    {
                        // Check if it's a WAV file
                        if (wavData[0] == 0x52 && wavData[1] == 0x49 && wavData[2] == 0x46 && wavData[3] == 0x46) // RIFF
                        {
                            // Find format chunk
                            int formatChunkOffset = 12;
                            while (formatChunkOffset < wavData.Length - 4)
                            {
                                if (wavData[formatChunkOffset] == 0x66 && wavData[formatChunkOffset + 1] == 0x6d &&
                                    wavData[formatChunkOffset + 2] == 0x74 && wavData[formatChunkOffset + 3] == 0x20) // fmt
                                {
                                    // Read format information
                                    short audioFormat = BitConverter.ToInt16(wavData, formatChunkOffset + 8);
                                    short numChannels = BitConverter.ToInt16(wavData, formatChunkOffset + 10);
                                    int sampleRate = BitConverter.ToInt32(wavData, formatChunkOffset + 12);
                                    int byteRate = BitConverter.ToInt32(wavData, formatChunkOffset + 16);
                                    short blockAlign = BitConverter.ToInt16(wavData, formatChunkOffset + 20);
                                    short bitsPerSample = BitConverter.ToInt16(wavData, formatChunkOffset + 22);

                                    // Create wave format
                                    waveFormat = new WaveFormat(sampleRate, bitsPerSample, numChannels);
                                    Serilog.Log.Information("Parsed WAV format: {0} channels, {1} Hz, {2} bits",
                                        numChannels, sampleRate, bitsPerSample);

                                    // Find data chunk
                                    int dataChunkOffset = formatChunkOffset + 8 + BitConverter.ToInt32(wavData, formatChunkOffset + 4);
                                    while (dataChunkOffset < wavData.Length - 4)
                                    {
                                        if (wavData[dataChunkOffset] == 0x64 && wavData[dataChunkOffset + 1] == 0x61 &&
                                            wavData[dataChunkOffset + 2] == 0x74 && wavData[dataChunkOffset + 3] == 0x61) // data
                                        {
                                            int dataSize = BitConverter.ToInt32(wavData, dataChunkOffset + 4);
                                            int dataStart = dataChunkOffset + 8;
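                                            // A negative size (e.g. an unpatched FF FF FF FF) must not reach new byte[dataSize]
                                            if (dataSize >= 0 && dataStart + dataSize <= wavData.Length)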
                                            {
                                                // Extract PCM data
                                                pcmData = new byte[dataSize];
                                                Array.Copy(wavData, dataStart, pcmData, 0, dataSize);
                                                Serilog.Log.Information("Extracted PCM data, size: {0} bytes", dataSize);
                                            }
                                            else
                                            {
                                                // If data size is invalid, use all remaining data
                                                int actualDataSize = wavData.Length - dataStart;
                                                pcmData = new byte[actualDataSize];
                                                Array.Copy(wavData, dataStart, pcmData, 0, actualDataSize);
                                                Serilog.Log.Information("Data size invalid, using all remaining data: {0} bytes", actualDataSize);
                                            }
                                            break;
                                        }
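                                        // Advance to the next chunk; stop on a truncated header or a negative size
                                        if (dataChunkOffset + 8 > wavData.Length) break;
                                        int chunkSize = BitConverter.ToInt32(wavData, dataChunkOffset + 4);
                                        if (chunkSize < 0) break;
                                        dataChunkOffset += 8 + chunkSize;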
                                    }
                                    break;
                                }
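                                // Not the fmt chunk: skip it; stop on a truncated header or a negative size
                                if (formatChunkOffset + 8 > wavData.Length) break;
                                int skippedChunkSize = BitConverter.ToInt32(wavData, formatChunkOffset + 4);
                                if (skippedChunkSize < 0) break;
                                formatChunkOffset += 8 + skippedChunkSize;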
                            }
                        }
                    }

                    // If we couldn't parse the WAV header, use default format
                    if (waveFormat == null || pcmData == null)
                    {
                        Serilog.Log.Warning("Could not parse WAV header, using default PCM format");
                        waveFormat = new WaveFormat(24000, 16, 1); // Default format: 24kHz, 16-bit, mono
                        pcmData = wavData;
                    }
                }
                catch (Exception ex)
                {
                    Serilog.Log.Error(ex, "Error parsing WAV header");
                    // Use default format if parsing fails
                    waveFormat = new WaveFormat(24000, 16, 1);
                    pcmData = wavData;
                }

                // Play PCM data directly using RawSourceWaveStream
                using var memoryStream = new MemoryStream(pcmData);
                using var waveStream = new RawSourceWaveStream(memoryStream, waveFormat);
                using var output = new WaveOutEvent();

                Serilog.Log.Information("Playing PCM data, format: {0}, {1} channels, {2} Hz",
                    waveStream.WaveFormat.Encoding,
                    waveStream.WaveFormat.Channels,
                    waveStream.WaveFormat.SampleRate);

                output.Init(waveStream);
                output.Play();

                while (output.PlaybackState == PlaybackState.Playing && !stoppingToken.IsCancellationRequested)
                {
                    await Task.Delay(100, stoppingToken);
                }
            }
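            catch (OperationCanceledException)
            {
                // Task.Delay throws when stoppingToken fires; this is expected, not an error
                Serilog.Log.Information("WAV playback cancelled");
            }
            catch (Exception ex)
            {
                Serilog.Log.Error(ex, "Error playing wav audio");

                // Try to save the data to a file for inspection
                Directory.CreateDirectory(sysConfOption.Path); // no-op if it already exists
                var filePath = Path.Combine(sysConfOption.Path, $"audio_error_{DateTime.Now:yyyyMMddHHmmss}.wav");
                // Use CancellationToken.None so the dump is still written during shutdown
                await System.IO.File.WriteAllBytesAsync(filePath, wavBytes, CancellationToken.None);
                Serilog.Log.Information("Saved problematic audio data to {0} for inspection", filePath);
            }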
        }
    }
}
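The class above exposes separate entry points for Windows (NAudio) and Linux (ALSA), so the caller has to choose one at runtime. A minimal sketch of that choice follows, as a .NET 8 top-level program. The helper `SelectPlaybackBackend` and the backend names "naudio" and "alsa" are illustrative assumptions, not part of the original project.

```csharp
using System;
using System.Runtime.InteropServices;

// Maps the detected OS to a playback backend name.
// The flags are parameters so the mapping can be exercised on any machine.
string SelectPlaybackBackend(bool isWindows, bool isLinux) =>
    isWindows ? "naudio"
    : isLinux ? "alsa"
    : throw new PlatformNotSupportedException("No audio backend for this OS");

try
{
    string backend = SelectPlaybackBackend(
        RuntimeInformation.IsOSPlatform(OSPlatform.Windows),
        RuntimeInformation.IsOSPlatform(OSPlatform.Linux));
    Console.WriteLine($"Selected backend: {backend}");
}
catch (PlatformNotSupportedException ex)
{
    Console.WriteLine(ex.Message);
}
```

In the application, "naudio" would dispatch to `AudioPlayer.PlayWavAudioAsync` and "alsa" to `AudioPlayer.AlsaWavPlay`.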

4.2 Deploying the .NET 8 App with Docker

The Dockerfile is as follows:

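# Based on the ASP.NET 8.0 runtime image
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS base
WORKDIR /app

# Install audio dependencies (needed for voice playback).
# The image uses apt, where the ALSA library is packaged as libasound2;
# "alsa-lib" is the RPM package name and is not available through apt.
RUN apt-get update && apt-get install -y --no-install-recommends \
    libasound2 \
    libasound2-dev \
    alsa-utils \
    && rm -rf /var/lib/apt/lists/*

# Expose port 5005
EXPOSE 5005

# Set the listening port through an environment variable
ENV ASPNETCORE_URLS=http://+:5005

# Copy the published application files (output of VS 2022 publish)
COPY . .

# Make the executable runnable
RUN chmod +x ./audioplay

# Create the directory for stored files
RUN mkdir -p /home/audioplay/wwwroot/mp3

# Ensure directory permissions
RUN chmod -R 755 /home/audioplay

# Entry point
ENTRYPOINT ["./audioplay"]

Note: when running the container, the host audio device must be passed through to support voice playback:

docker run -d --device /dev/snd:/dev/snd -p 5005:5005 -v /home/audioplay:/home/audioplay --add-host=host.docker.internal:host-gateway audioplay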

5. Notes

I ran into quite a few pitfalls along the way; the main ones are covered here.

5.1 Request Parameters

{"input":"第一关,一号靶一枪,6.5秒完成。","voice":"zm_yunyang","response_format":"wav","download_format":"wav","stream":true,"speed":1,"return_download_link":true,"lang_code":"z"}
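A minimal sketch of how this request body might be built and wrapped in a POST request from C#, written as a .NET 8 top-level program. The field names come from the JSON above. The helper names `BuildSpeechPayload` and `BuildSpeechRequest` are hypothetical, and the endpoint path `/v1/audio/speech` is an assumption that should be confirmed on the `/docs` page of your deployment.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Serializes the request body using the field names from the JSON above.
string BuildSpeechPayload(string text, string voice, double speed = 1.0) =>
    JsonSerializer.Serialize(new
    {
        input = text,
        voice,
        response_format = "wav",
        download_format = "wav",
        stream = true,
        speed,
        return_download_link = true,
        lang_code = "z"
    });

// Wraps the payload in a POST request.
// Assumption: the speech endpoint lives at /v1/audio/speech.
HttpRequestMessage BuildSpeechRequest(string baseUrl, string payload) =>
    new HttpRequestMessage(HttpMethod.Post, new Uri(new Uri(baseUrl), "/v1/audio/speech"))
    {
        Content = new StringContent(payload, Encoding.UTF8, "application/json")
    };

string json = BuildSpeechPayload("第一关,一号靶一枪,6.5秒完成。", "zm_yunyang");
using (HttpRequestMessage request = BuildSpeechRequest("http://192.168.0.101:8880", json))
{
    Console.WriteLine($"{request.Method} {request.RequestUri}");
    Console.WriteLine(json); // non-ASCII characters appear as \uXXXX escapes, which is valid JSON
}
```

To actually fetch audio, the request would be sent with `HttpClient.SendAsync` and the body read with `response.Content.ReadAsByteArrayAsync()`; the resulting bytes are what the playback methods in section 4.1 expect.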

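For the audio format, WAV is the best choice. MP3 caused a lot of trouble when playing on Linux, because I play audio directly from C# rather than calling external tools such as aplay.

The returned WAV is 1 channel, 24000 Hz, 16 bits; inspecting the file with ffmpeg shows this information.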
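With this format, one second of audio is 24000 × 2 × 1 = 48000 bytes of PCM, which is handy for sanity-checking a response before playing it. The sketch below uses a hypothetical helper, `EstimatePcmDuration`, whose defaults are the format reported above.

```csharp
using System;

// Estimates playback time from the size of the raw PCM payload (header excluded).
TimeSpan EstimatePcmDuration(long pcmByteCount, int sampleRate = 24000,
                             int bitsPerSample = 16, int channels = 1)
{
    int bytesPerSecond = sampleRate * (bitsPerSample / 8) * channels;
    return TimeSpan.FromSeconds((double)pcmByteCount / bytesPerSecond);
}

// 96000 bytes at 48000 bytes per second is two seconds of audio.
Console.WriteLine(EstimatePcmDuration(96000)); // prints 00:00:02
```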

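5.2 The Returned Audio Stream

Because the audio is returned as a stream, the WAV header does not contain the actual length. You either need to play it as a stream or fill in the length yourself. The code in section 4.1 already does this and can be used as a reference.

Audio bytes header: "52-49-46-46-FF-FF-FF-FF-57-41"

This is the header of the WAV data actually received; FF FF FF FF means the length is unspecified.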
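The header repair can also be isolated into one small function. The following is a condensed sketch of the same idea as the longer code in section 4.1, not a replacement for it. The helper name `FixStreamingWavHeader` is hypothetical, and it assumes the whole response has already been buffered into a byte array.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Returns a copy of the WAV bytes with unknown (0xFFFFFFFF) sizes filled in.
byte[] FixStreamingWavHeader(byte[] wavBytes)
{
    byte[] patched = (byte[])wavBytes.Clone();
    if (patched.Length < 12 || Encoding.ASCII.GetString(patched, 0, 4) != "RIFF")
        return patched; // not a RIFF container, leave untouched

    // RIFF chunk size = total length minus the 8-byte "RIFF" + size prefix.
    if (BinaryPrimitives.ReadUInt32LittleEndian(patched.AsSpan(4)) == 0xFFFFFFFF)
        BinaryPrimitives.WriteUInt32LittleEndian(patched.AsSpan(4), (uint)(patched.Length - 8));

    // Walk the chunks after "WAVE" until the data chunk is found.
    int offset = 12;
    while (offset + 8 <= patched.Length)
    {
        string chunkId = Encoding.ASCII.GetString(patched, offset, 4);
        uint chunkSize = BinaryPrimitives.ReadUInt32LittleEndian(patched.AsSpan(offset + 4));
        if (chunkId == "data")
        {
            uint remaining = (uint)(patched.Length - offset - 8);
            if (chunkSize > remaining) // covers 0xFFFFFFFF and any oversized value
                BinaryPrimitives.WriteUInt32LittleEndian(patched.AsSpan(offset + 4), remaining);
            break;
        }
        long next = (long)offset + 8 + chunkSize + (chunkSize & 1); // chunks are padded to even sizes
        if (next > patched.Length) break; // malformed chunk, stop scanning
        offset = (int)next;
    }
    return patched;
}

// Demo: a 44-byte header with unknown sizes followed by 4 bytes of PCM.
byte[] sample = new byte[48];
Encoding.ASCII.GetBytes("RIFF").CopyTo(sample, 0);
BinaryPrimitives.WriteUInt32LittleEndian(sample.AsSpan(4), 0xFFFFFFFF);
Encoding.ASCII.GetBytes("WAVEfmt ").CopyTo(sample, 8);
BinaryPrimitives.WriteUInt32LittleEndian(sample.AsSpan(16), 16); // fmt chunk body is 16 bytes
Encoding.ASCII.GetBytes("data").CopyTo(sample, 36);
BinaryPrimitives.WriteUInt32LittleEndian(sample.AsSpan(40), 0xFFFFFFFF);

byte[] repaired = FixStreamingWavHeader(sample);
Console.WriteLine(BitConverter.ToString(repaired, 0, 10)); // prints 52-49-46-46-28-00-00-00-57-41
```

Returning a patched copy keeps the original response bytes intact, so they can still be dumped to disk for inspection if playback fails.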

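5.3 Volume Control Issue

On Windows, NAudio volume control works fine. On Linux with ALSA, however, once the volume has been set too low, raising it again does not restore it. I have not found the cause, so for now I simply run at 100% volume. Behavior may differ between machines.

amixer -D hw:0 set Master 100%

This sets the Linux machine's audio output to 100%.

Querying the mixer, as shown below, reveals a raw volume range of 0-87; this will likely differ between machines.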

[root@localhost ~]# amixer -D hw:0 get Master
Simple mixer control 'Master',0
  Capabilities: pvolume pvolume-joined pswitch pswitch-joined
  Playback channels: Mono
  Limits: Playback 0 - 87
  Mono: Playback 87 [100%] [0.00dB] [on]
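Since the raw range differs between machines, a C# service may want to read it from this output rather than hard-code 87. The sketch below parses text in the format shown above. The helper name `TryParseAmixerPlayback` is hypothetical, and the regular expressions assume this exact layout; on a stereo control it picks up the first channel line only.

```csharp
using System;
using System.Text.RegularExpressions;

// Extracts the raw limits and the current level from `amixer get` output.
bool TryParseAmixerPlayback(string amixerOutput,
                            out int min, out int max, out int raw, out int percent)
{
    min = max = raw = percent = 0;
    Match limits = Regex.Match(amixerOutput, @"Limits:\s*Playback\s+(\d+)\s*-\s*(\d+)");
    Match level = Regex.Match(amixerOutput, @"Playback\s+(\d+)\s+\[(\d+)%\]");
    if (!limits.Success || !level.Success) return false;

    min = int.Parse(limits.Groups[1].Value);
    max = int.Parse(limits.Groups[2].Value);
    raw = int.Parse(level.Groups[1].Value);
    percent = int.Parse(level.Groups[2].Value);
    return true;
}

string amixerText = @"Simple mixer control 'Master',0
  Capabilities: pvolume pvolume-joined pswitch pswitch-joined
  Playback channels: Mono
  Limits: Playback 0 - 87
  Mono: Playback 87 [100%] [0.00dB] [on]";

if (TryParseAmixerPlayback(amixerText, out int lo, out int hi, out int current, out int pct))
    Console.WriteLine($"raw {current} of {lo}-{hi} ({pct}%)"); // prints raw 87 of 0-87 (100%)
```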

The results in practice are very good; Kokoro deserves a thumbs-up.
