name: azure-ai-voicelive-py
description: Build real-time voice AI applications with the Azure AI Voice Live SDK (azure-ai-voicelive). Use for Python applications that need real-time, bidirectional audio communication with Azure AI, e.g. voice assistants, voice chatbots, live speech translation, voice-driven avatars, or any application that streams audio to an AI model over WebSocket. Supports server-side voice activity detection (VAD), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.
package: azure-ai-voicelive
Build real-time voice AI applications over a bidirectional WebSocket connection.
pip install azure-ai-voicelive aiohttp azure-identity
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key authentication (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
DefaultAzureCredential (recommended):
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
API key:
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential
async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...
import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update the session configuration
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })
        # Listen for events
        async for event in conn:
            print(f"Event type: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
VoiceLiveConnection exposes the following resources:
| Resource | Purpose | Key methods |
|---|---|---|
| conn.session | Session configuration | update(session=...) |
| conn.response | Model responses | create(), cancel() |
| conn.input_audio_buffer | Audio input | append(), commit(), clear() |
| conn.output_audio_buffer | Audio output | clear() |
| conn.conversation | Conversation state | item.create(), item.delete(), item.truncate() |
| conn.transcription_session | Transcription configuration | update(session=...) |
from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get the current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
import base64

# Read an audio chunk (16-bit PCM, 24 kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)

async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)
    elif event.type == "response.audio.done":
        print("Audio playback finished")
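The append() call takes base64-encoded PCM, so raw microphone audio has to be sliced and encoded first. A minimal helper for that, illustrative only and not part of the SDK, assuming 16-bit mono PCM:

```python
import base64

def pcm16_to_b64_chunks(pcm: bytes, sample_rate: int = 24000, chunk_ms: int = 100) -> list[str]:
    """Split raw 16-bit mono PCM into base64-encoded chunks of chunk_ms each."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample, mono
    return [
        base64.b64encode(pcm[i:i + bytes_per_chunk]).decode("ascii")
        for i in range(0, len(pcm), bytes_per_chunk)
    ]

# One second of silence at 24 kHz yields ten 100 ms chunks
chunks = pcm16_to_b64_chunks(b"\x00" * 48000)
```

Each element of the returned list can then be passed to conn.input_audio_buffer.append(audio=...) in turn.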
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session created: {event.session}")
        case "session.updated":
            print("Session updated")
        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")
        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial transcript: {event.delta}")
        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response done: {event.response.status}")
        # Function calling
        case "response.function_call_arguments.done":
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()
        # Errors
        case "error":
            print(f"Error: {event.error.message}")
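handle_function above is left to the application: the service delivers the tool name plus a JSON string of arguments, and the app must decode and route them. A hypothetical dispatcher (the handler names and return shapes are illustrative, not part of the SDK):

```python
import json

def get_weather(location: str) -> dict:
    # Placeholder implementation; a real app would call a weather API here.
    return {"location": location, "forecast": "sunny"}

# Map tool names (as declared in the session's tools list) to Python callables
TOOL_HANDLERS = {"get_weather": get_weather}

def handle_function(name: str, arguments: str) -> dict:
    """Decode the JSON arguments string and dispatch to the matching tool."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**json.loads(arguments))
```

The dict it returns is what gets serialized with json.dumps into the function_call_output item.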
await conn.session.update(session={"turn_detection": None})

# Manual turn control
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # end of the user's turn
await conn.response.create()  # trigger a response
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User barge-in - cancel the current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
# Add a system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Answer concisely."}]
})

# Add a user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user",
    "content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
| Voice | Description |
|---|---|
| alloy | Neutral, balanced |
| echo | Warm, conversational |
| shimmer | Clear, professional |
| sage | Calm, authoritative |
| coral | Friendly, upbeat |
| ash | Deep, steady |
| ballad | Expressive |
| verse | Suited to storytelling |
Azure voices: use the AzureStandardVoice, AzureCustomVoice, or AzurePersonalVoice models.
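As a sketch, selecting an Azure neural voice in the session config might look like the fragment below. Both the voice name and the exact dict shape are assumptions to verify against the AzureStandardVoice model, which defines the authoritative fields:

```python
# Hypothetical config fragment: pass as the "voice" field of the session update
session_config = {
    "voice": {"type": "azure-standard", "name": "en-US-AvaNeural"}
}
```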
| Format | Sample rate | Use case |
|---|---|---|
| pcm16 | 24 kHz | Default, high quality |
| pcm16-8000hz | 8 kHz | Telephony |
| pcm16-16000hz | 16 kHz | Voice assistants |
| g711_ulaw | 8 kHz | Telephony (US) |
| g711_alaw | 8 kHz | Telephony (Europe) |
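Buffer and chunk sizing follows directly from the table: the PCM formats are 16-bit mono, i.e. 2 bytes per sample. A small illustrative helper (not part of the SDK):

```python
# Sample rates for the PCM formats from the table above
PCM_SAMPLE_RATES = {
    "pcm16": 24000,
    "pcm16-16000hz": 16000,
    "pcm16-8000hz": 8000,
}

def pcm_bytes_per_second(fmt: str) -> int:
    """Byte rate for 16-bit mono PCM: sample_rate * 2 bytes per sample."""
    return PCM_SAMPLE_RATES[fmt] * 2

# Default pcm16 at 24 kHz streams 48000 bytes per second
rate = pcm_bytes_per_second("pcm16")
```

(The g711_ulaw and g711_alaw formats are companded to 1 byte per sample, so their byte rate equals their 8 kHz sample rate.)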
# Server-side VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English-optimized
{"type": "azure_semantic_vad_multilingual"}
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
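A ConnectionClosed is often worth retrying with exponential backoff. The reconnect wrapper around connect() is application-specific, but the delay schedule itself can be sketched in isolation (an illustrative pattern, not an SDK feature):

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0, retries: int = 5) -> list[float]:
    """Exponential backoff schedule: base, base*factor, ..., capped at cap."""
    return [min(base * factor ** i, cap) for i in range(retries)]

# Five attempts spaced 1, 2, 4, 8, 16 seconds apart
delays = backoff_delays()
```

A reconnect loop would sleep for each delay in turn between connect() attempts and give up when the list is exhausted.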