Speech-to-Text (STT)
The stt modality. Each native vendor lives under stt/.
Basic transcription
OpenAI Whisper:
import (
"github.com/joakimcarlsson/ai/model"
"github.com/joakimcarlsson/ai/stt"
sttopenai "github.com/joakimcarlsson/ai/stt/openai"
)
client := sttopenai.NewSpeechToText(
sttopenai.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
sttopenai.WithModel(model.OpenAITranscriptionModels[model.Whisper1]),
)
audio, _ := os.ReadFile("audio.mp3")
resp, err := client.Transcribe(ctx, audio,
stt.WithLanguage("en"),
)
fmt.Println(resp.Text)
Deepgram, AssemblyAI, Google Cloud Speech, ElevenLabs Scribe follow the same shape via their respective vendor packages.
Translation (OpenAI only)
Streaming transcription
Deepgram, AssemblyAI, and ElevenLabs support real-time streaming over
WebSocket. The stt.SpeechToText interface exposes StreamTranscribe:
import sttdeepgram "github.com/joakimcarlsson/ai/stt/deepgram"
client := sttdeepgram.NewSpeechToText(
sttdeepgram.WithAPIKey(os.Getenv("DEEPGRAM_API_KEY")),
sttdeepgram.WithModel(model.DeepgramTranscriptionModels[model.DeepgramNova3]),
sttdeepgram.WithSmartFormat(true),
sttdeepgram.WithStreamInterimResults(true),
)
audioCh := make(chan []byte, 16)
go feedAudio(audioCh) // your PCM frame source
results, err := client.StreamTranscribe(ctx, audioCh,
stt.WithSampleRate(16000),
stt.WithChannels(1),
)
if err != nil {
log.Fatal(err)
}
for r := range results {
if r.Error != nil {
log.Fatal(r.Error)
}
if r.IsFinal {
fmt.Println("FINAL:", r.Text)
} else {
fmt.Println("interim:", r.Text)
}
}
SupportsStreaming
For vendors that don't stream:
if !client.SupportsStreaming() {
// fall back to batch Transcribe
resp, _ := client.Transcribe(ctx, audio)
}
stt/openai and stt/google return false. The streaming-capable vendors
(stt/deepgram, stt/assemblyai, stt/elevenlabs) return true.
Per-call options
stt.WithLanguage("en")
stt.WithPrompt("Domain-specific words: Claude, Anthropic, ...")
stt.WithResponseFormat("verbose_json") // OpenAI
stt.WithTimestampGranularities("word", "segment")
stt.WithFilename("audio.wav") // for format detection
stt.WithSampleRate(16000) // streaming only
stt.WithChannels(1) // streaming only
Vendor-specific options
Deepgram:
sttdeepgram.WithPunctuate(true)
sttdeepgram.WithDiarize(true)
sttdeepgram.WithSmartFormat(true)
sttdeepgram.WithKeyterms("Claude", "Anthropic")
sttdeepgram.WithStreamEndpointingMs(300)
AssemblyAI:
sttassemblyai.WithSpeakerLabels(true)
sttassemblyai.WithStreamEndOfTurnSilenceMs(700)
sttassemblyai.WithStreamFormatTurns(true)
ElevenLabs Scribe: