Text-to-Speech (TTS)
The tts modality (formerly "audio generation") covers text-to-speech synthesis. Each native vendor lives under tts/.
Basic usage
ElevenLabs:
import (
    "context"
    "log"
    "os"

    "github.com/joakimcarlsson/ai/model"
    "github.com/joakimcarlsson/ai/tts"
    ttselevenlabs "github.com/joakimcarlsson/ai/tts/elevenlabs"
)
ctx := context.Background()

client := ttselevenlabs.NewGeneration(
    ttselevenlabs.WithAPIKey(os.Getenv("ELEVENLABS_API_KEY")),
    ttselevenlabs.WithModel(model.ElevenLabsAudioModels[model.ElevenTurboV2_5]),
    ttselevenlabs.WithVoiceID("EXAVITQu4vr4xnSDxMaL"), // Rachel
)

resp, err := client.GenerateAudio(ctx, "Hello, how are you today?")
if err != nil {
    log.Fatal(err)
}
if err := os.WriteFile("output.mp3", resp.AudioData, 0644); err != nil {
    log.Fatal(err)
}
OpenAI:
import ttsopenai "github.com/joakimcarlsson/ai/tts/openai"
client := ttsopenai.NewGeneration(
    ttsopenai.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
    ttsopenai.WithModel(model.OpenAIAudioModels[model.TTS1HD]),
    ttsopenai.WithVoice("nova"),
    ttsopenai.WithOutputFormat("mp3"),
)
Google Cloud, Azure Speech, and Deepgram Aura follow the same shape.
Streaming
ElevenLabs and Deepgram stream chunked audio:
chunks, err := client.StreamAudio(ctx, "Hello world",
    tts.WithOptimizeStreamingLatency(3),
)
if err != nil {
    log.Fatal(err)
}
for chunk := range chunks {
    if chunk.Error != nil {
        log.Fatal(chunk.Error)
    }
    if chunk.Done {
        break
    }
    output.Write(chunk.Data)
}
The other vendors (tts/openai, tts/google, tts/azure) buffer the
non-streaming response into a single chunk for API parity.
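Because the loop above is pure channel plumbing, it works identically against the buffering vendors. A minimal self-contained sketch (audioChunk is a local stand-in for the library's chunk type, assuming only the Data/Done/Error fields used above):

```go
package main

import (
	"bytes"
	"fmt"
)

// audioChunk is a local stand-in for the library's streaming chunk type,
// mirroring only the Data/Done/Error fields used in the loop above.
type audioChunk struct {
	Data  []byte
	Done  bool
	Error error
}

// collectAudio drains a chunk channel into one buffer, stopping on the
// first error or on the Done marker.
func collectAudio(chunks <-chan audioChunk) ([]byte, error) {
	var buf bytes.Buffer
	for chunk := range chunks {
		if chunk.Error != nil {
			return nil, chunk.Error
		}
		if chunk.Done {
			break
		}
		buf.Write(chunk.Data)
	}
	return buf.Bytes(), nil
}

func main() {
	// Simulate a stream of two data chunks followed by a Done marker.
	ch := make(chan audioChunk, 3)
	ch <- audioChunk{Data: []byte("abc")}
	ch <- audioChunk{Data: []byte("def")}
	ch <- audioChunk{Done: true}
	close(ch)

	audio, err := collectAudio(ch)
	fmt.Println(string(audio), err == nil) // prints "abcdef true"
}
```

A helper like this keeps calling code identical whether the vendor streams many chunks or buffers everything into one.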
Voice listing
voices, err := client.ListVoices(ctx)
if err != nil {
    log.Fatal(err)
}
for _, v := range voices {
    fmt.Printf("%s — %s (%s)\n", v.VoiceID, v.Name, v.Category)
}
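The voice list is plain data, so resolving a voice ID from a display name is a small filter. A sketch with a local voice struct mirroring the VoiceID/Name/Category fields shown above (findVoice is a hypothetical helper, not part of the library; the category value is illustrative):

```go
package main

import "fmt"

// voice is a local stand-in mirroring the VoiceID/Name/Category fields
// printed in the listing loop above.
type voice struct {
	VoiceID  string
	Name     string
	Category string
}

// findVoice returns the ID of the first voice whose display name matches.
func findVoice(voices []voice, name string) (string, bool) {
	for _, v := range voices {
		if v.Name == name {
			return v.VoiceID, true
		}
	}
	return "", false
}

func main() {
	// Sample data; the Rachel ID comes from the basic-usage example,
	// the category value is illustrative.
	voices := []voice{
		{VoiceID: "EXAVITQu4vr4xnSDxMaL", Name: "Rachel", Category: "premade"},
	}
	if id, ok := findVoice(voices, "Rachel"); ok {
		fmt.Println(id) // prints "EXAVITQu4vr4xnSDxMaL"
	}
}
```

The resolved ID can then be passed to ttselevenlabs.WithVoiceID when constructing the client.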
ElevenLabs voice settings
resp, err := client.GenerateAudio(ctx, "Expressive line",
    tts.WithStability(0.75),
    tts.WithSimilarityBoost(0.85),
    tts.WithStyle(0.5),
    tts.WithSpeakerBoost(true),
)
if err != nil {
    log.Fatal(err)
}
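ElevenLabs documents stability, similarity boost, and style as values in [0, 1], so clamping user-supplied settings before passing them avoids out-of-range requests. A sketch (clamp01 is a hypothetical helper, not part of the library):

```go
package main

import "fmt"

// clamp01 pins a voice setting into the [0, 1] range ElevenLabs expects;
// hypothetical helper, not part of the library.
func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

func main() {
	// Out-of-range inputs are pinned; in-range values pass through.
	for _, x := range []float64{-0.2, 0.75, 1.5} {
		fmt.Println(clamp01(x))
	}
	// prints 0, 0.75, 1
}
```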
ElevenLabs alignment
tts/elevenlabs.Client also implements tts.ForcedAlignmentProvider. The
canonical alignment-enabled call:
resp, err := client.GenerateAudio(ctx, "Hello world",
    tts.WithAlignmentEnabled(true),
)
if err != nil {
    log.Fatal(err)
}
for i, ch := range resp.Alignment.Characters {
    fmt.Printf("%s: %.2fs - %.2fs\n",
        ch,
        resp.Alignment.CharacterStartTimesSeconds[i],
        resp.Alignment.CharacterEndTimesSeconds[i],
    )
}
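If you need word-level timings but only have the character alignment, grouping on spaces is enough for simple transcripts. A self-contained sketch (wordSpan and wordsFromChars are hypothetical helpers; the three slices mirror Characters, CharacterStartTimesSeconds, and CharacterEndTimesSeconds above):

```go
package main

import "fmt"

// wordSpan is a hypothetical word-level timing derived from the
// per-character alignment.
type wordSpan struct {
	Text       string
	Start, End float64
}

// wordsFromChars groups per-character timings into word spans,
// splitting on space characters.
func wordsFromChars(chars []string, starts, ends []float64) []wordSpan {
	var out []wordSpan
	var cur wordSpan
	for i, c := range chars {
		if c == " " {
			if cur.Text != "" {
				out = append(out, cur)
				cur = wordSpan{}
			}
			continue
		}
		if cur.Text == "" {
			cur.Start = starts[i] // first character opens the word
		}
		cur.Text += c
		cur.End = ends[i] // last character seen closes it
	}
	if cur.Text != "" {
		out = append(out, cur)
	}
	return out
}

func main() {
	words := wordsFromChars(
		[]string{"h", "i", " ", "y", "o"},
		[]float64{0.0, 0.1, 0.2, 0.3, 0.4},
		[]float64{0.1, 0.2, 0.3, 0.4, 0.5},
	)
	for _, w := range words {
		fmt.Printf("%s: %.2fs - %.2fs\n", w.Text, w.Start, w.End)
	}
	// prints "hi: 0.00s - 0.20s" then "yo: 0.30s - 0.50s"
}
```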
For aligning existing audio against a transcript:
if fap, ok := client.(tts.ForcedAlignmentProvider); ok {
    audio, err := os.ReadFile("recording.mp3")
    if err != nil {
        log.Fatal(err)
    }
    align, err := fap.GenerateForcedAlignment(ctx, audio,
        "the spoken transcript")
    if err != nil {
        log.Fatal(err)
    }
    for _, w := range align.Words {
        fmt.Printf("%s: %.2fs - %.2fs (loss=%.4f)\n",
            w.Text, w.Start, w.End, w.Loss)
    }
}
The type assertion succeeds against the wrapper returned by
ttselevenlabs.NewGeneration because the wrapper preserves the optional
sub-interface when the inner concrete client implements it.
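One way to get that behavior in Go is to embed the concrete inner client in the wrapper, which promotes its optional methods into the wrapper's method set. A toy sketch with assumed names (speaker and aligner stand in for the real generation interface and tts.ForcedAlignmentProvider):

```go
package main

import "fmt"

// speaker is the primary interface a wrapper exposes.
type speaker interface{ Speak() string }

// aligner is an optional sub-interface, analogous to
// tts.ForcedAlignmentProvider.
type aligner interface{ Align() string }

// innerClient implements both the primary and the optional interface.
type innerClient struct{}

func (innerClient) Speak() string { return "audio" }
func (innerClient) Align() string { return "timings" }

// wrapper embeds the concrete inner client, so Align is promoted into
// wrapper's method set and the optional interface survives the wrapping.
type wrapper struct{ innerClient }

func main() {
	var s speaker = wrapper{}
	if a, ok := s.(aligner); ok {
		fmt.Println(a.Align()) // prints "timings"
	}
}
```

Had wrapper stored the inner client in a plain field of interface type instead of embedding it, the assertion on the wrapper would fail even though the inner client implements aligner.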