name: azure-ai-voicelive-py
description: Build real-time voice AI applications with the Azure AI Voice Live SDK (azure-ai-voicelive). Use for Python applications that need real-time, bidirectional audio communication with Azure AI, e.g. voice assistants, voice chatbots, live speech translation, voice-driven avatars, or any application that streams audio to an AI model over WebSocket. Supports server-side voice activity detection (VAD), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.
package: azure-ai-voicelive
Build real-time voice AI applications over a bidirectional WebSocket connection.
pip install azure-ai-voicelive aiohttp azure-identity
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key authentication (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
DefaultAzureCredential (recommended):
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
API key:
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential
async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...
import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update the session configuration
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })
        # Listen for events
        async for event in conn:
            print(f"Event type: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
VoiceLiveConnection exposes the following resources:
| Resource | Purpose | Key methods |
|---|---|---|
| conn.session | Session configuration | update(session=...) |
| conn.response | Model responses | create(), cancel() |
| conn.input_audio_buffer | Audio input | append(), commit(), clear() |
| conn.output_audio_buffer | Audio output | clear() |
| conn.conversation | Conversation state | item.create(), item.delete(), item.truncate() |
| conn.transcription_session | Transcription configuration | update(session=...) |
from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get the current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
import base64

# Read an audio chunk (16-bit PCM, 24 kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)

async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)
    elif event.type == "response.audio.done":
        print("Audio playback finished")
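The append() call takes base64-encoded PCM, so raw microphone audio has to be sliced and encoded first. A minimal helper for that, illustrative only and not part of the SDK, assuming 16-bit mono PCM:

```python
import base64

def pcm16_to_b64_chunks(pcm: bytes, sample_rate: int = 24000, chunk_ms: int = 100) -> list[str]:
    """Split raw 16-bit mono PCM into base64-encoded chunks of chunk_ms each."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample, mono
    return [
        base64.b64encode(pcm[i:i + bytes_per_chunk]).decode("ascii")
        for i in range(0, len(pcm), bytes_per_chunk)
    ]

# One second of silence at 24 kHz yields ten 100 ms chunks
chunks = pcm16_to_b64_chunks(b"\x00" * 48000)
```

Each element of the returned list can then be passed to conn.input_audio_buffer.append(audio=...) in turn.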
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session created: {event.session}")
        case "session.updated":
            print("Session updated")
        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")
        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial transcript: {event.delta}")
        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response done: {event.response.status}")
        # Function calling
        case "response.function_call_arguments.done":
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()
        # Errors
        case "error":
            print(f"Error: {event.error.message}")
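handle_function above is left to the application: the service delivers the tool name plus a JSON string of arguments, and the app must decode and route them. A hypothetical dispatcher (the handler names and return shapes are illustrative, not part of the SDK):

```python
import json

def get_weather(location: str) -> dict:
    # Placeholder implementation; a real app would call a weather API here.
    return {"location": location, "forecast": "sunny"}

# Map tool names (as declared in the session's tools list) to Python callables
TOOL_HANDLERS = {"get_weather": get_weather}

def handle_function(name: str, arguments: str) -> dict:
    """Decode the JSON arguments string and dispatch to the matching tool."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**json.loads(arguments))
```

The dict it returns is what gets serialized with json.dumps into the function_call_output item.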
await conn.session.update(session={"turn_detection": None})

# Manual turn control
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # end of the user's turn
await conn.response.create()  # trigger a response
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User barge-in - cancel the current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
# Add a system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Answer concisely."}]
})

# Add a user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user",
    "content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
| Voice | Description |
|---|---|
| alloy | Neutral, balanced |
| echo | Warm, conversational |
| shimmer | Clear, professional |
| sage | Calm, authoritative |
| coral | Friendly, upbeat |
| ash | Deep, steady |
| ballad | Expressive |
| verse | Suited to storytelling |
Azure voices: use the AzureStandardVoice, AzureCustomVoice, or AzurePersonalVoice models.
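As a sketch, selecting an Azure neural voice in the session config might look like the fragment below. Both the voice name and the exact dict shape are assumptions to verify against the AzureStandardVoice model, which defines the authoritative fields:

```python
# Hypothetical config fragment: pass as the "voice" field of the session update
session_config = {
    "voice": {"type": "azure-standard", "name": "en-US-AvaNeural"}
}
```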
| Format | Sample rate | Use case |
|---|---|---|
| pcm16 | 24 kHz | Default, high quality |
| pcm16-8000hz | 8 kHz | Telephony |
| pcm16-16000hz | 16 kHz | Voice assistants |
| g711_ulaw | 8 kHz | Telephony (US) |
| g711_alaw | 8 kHz | Telephony (Europe) |
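Buffer and chunk sizing follows directly from the table: the PCM formats are 16-bit mono, i.e. 2 bytes per sample. A small illustrative helper (not part of the SDK):

```python
# Sample rates for the PCM formats from the table above
PCM_SAMPLE_RATES = {
    "pcm16": 24000,
    "pcm16-16000hz": 16000,
    "pcm16-8000hz": 8000,
}

def pcm_bytes_per_second(fmt: str) -> int:
    """Byte rate for 16-bit mono PCM: sample_rate * 2 bytes per sample."""
    return PCM_SAMPLE_RATES[fmt] * 2

# Default pcm16 at 24 kHz streams 48000 bytes per second
rate = pcm_bytes_per_second("pcm16")
```

(The g711_ulaw and g711_alaw formats are companded to 1 byte per sample, so their byte rate equals their 8 kHz sample rate.)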
# Server-side VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English-optimized
{"type": "azure_semantic_vad_multilingual"}
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
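A ConnectionClosed is often worth retrying with exponential backoff. The reconnect wrapper around connect() is application-specific, but the delay schedule itself can be sketched in isolation (an illustrative pattern, not an SDK feature):

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0, retries: int = 5) -> list[float]:
    """Exponential backoff schedule: base, base*factor, ..., capped at cap."""
    return [min(base * factor ** i, cap) for i in range(retries)]

# Five attempts spaced 1, 2, 4, 8, 16 seconds apart
delays = backoff_delays()
```

A reconnect loop would sleep for each delay in turn between connect() attempts and give up when the list is exhausted.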