Building SpeechSDK: a tiny, provider-agnostic TTS library for TypeScript

Why I built a free, open-source TypeScript toolkit that lets developers easily swap between any major TTS provider.


I recently built SpeechSDK, a free, open-source toolkit for building AI audio applications with any text-to-speech model (github).

By changing just a model string, developers can easily switch between all major providers without breaking their application code.

Imagine Vercel's AI SDK but dedicated to audio.

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: '[laugh] Oh that is so funny!',
  voice: 'EXAVITQu4vr4xnSDxMaL',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

Inspiration

Over the last 3+ years, Jellypod has generated tens of thousands of hours of audio across multiple TTS models. We learned the hard way that each provider's API is slightly different, supports varying functionality, and lacks standardization, which makes model changes pretty difficult. ElevenLabs expects requests one way, Gemini responds in another, etc.

Recently, Mistral released Voxtral TTS, a 4B-parameter text-to-speech model that I really wanted to test out. However, even if I liked the model, it wouldn't be an easy switch in production. I also didn't know of a good tool for comparing audio models side by side, which is why I built Audio Playground.

Although there were some existing libraries that could help out, they were either undermaintained, experimental (like the Vercel AI SDK's generateSpeech()), or deprioritized.

So I Built SpeechSDK

I needed SpeechSDK to be lightweight and accept any supported text-to-speech model string, a voice ID, and some input text. Under the hood, the SDK defines a standard SpeechProvider interface that each provider implements. The abstraction exposes a standard, provider-agnostic API to consumers, while each provider handles its own unique behaviors.
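To make that concrete, here's a minimal sketch of what such a provider abstraction might look like. The names and shapes below are illustrative assumptions, not the SDK's actual internals:

```typescript
// Hypothetical sketch of a provider-agnostic TTS interface.
interface SpeechRequest {
  model: string;   // e.g. "elevenlabs/eleven_v3"
  text: string;
  voice: string;
}

interface SpeechResult {
  audio: Uint8Array;
  mediaType: string; // e.g. "audio/mpeg"
}

// Every provider implements the same contract...
interface SpeechProvider {
  generate(request: SpeechRequest): Promise<SpeechResult>;
}

// ...so a consumer-facing entry point can dispatch on the
// "provider/model" prefix of the model string.
function resolveProvider(
  model: string,
  registry: Record<string, SpeechProvider>,
): SpeechProvider {
  const [providerId] = model.split('/');
  const provider = registry[providerId];
  if (!provider) throw new Error(`Unknown provider: ${providerId}`);
  return provider;
}
```

The key design point is that application code only ever depends on the shared contract, so swapping models is just a string change.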

The SDK had to be production-ready with proper error handling and retries, include minimal dependencies (if any), and work everywhere, whether that was in Node.js or the browser.

For devs, calling generateSpeech() just works. If you need any provider-specific options, they can be added under a providerOptions key as an untyped passthrough.
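One way such a passthrough might work under the hood is a simple spread over the standard fields. This is a sketch of the pattern, not the SDK's implementation, and the field names (voice_id, stability) are illustrative:

```typescript
// Sketch: merging an untyped providerOptions passthrough into a
// provider-specific request body.
type ProviderOptions = Record<string, unknown>;

function buildRequestBody(
  text: string,
  voice: string,
  providerOptions: ProviderOptions = {},
): Record<string, unknown> {
  // Standard fields first; provider-specific extras are spread on top,
  // so unknown options flow straight through to the provider's API.
  return { text, voice_id: voice, ...providerOptions };
}
```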

The SDK also includes lazy Base64 encoding (so it doesn't create a large Base64 version of the audio unless actually needed), no Node built-ins (so it works in the browser), and support for standardized model tags.
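The lazy Base64 idea can be sketched with a cached getter, using only Web-standard APIs (no Node built-ins). This is an illustrative assumption about the approach, not the SDK's actual code:

```typescript
// Sketch: lazy, cached Base64 encoding of generated audio.
class GeneratedAudio {
  #base64?: string;

  constructor(
    public readonly uint8Array: Uint8Array,
    public readonly mediaType: string,
  ) {}

  // Encoded on first access, then cached: a large audio buffer isn't
  // duplicated as a Base64 string unless someone actually reads it.
  get base64(): string {
    if (this.#base64 === undefined) {
      let binary = '';
      for (const byte of this.uint8Array) binary += String.fromCharCode(byte);
      this.#base64 = btoa(binary); // btoa is available in browsers and modern Node
    }
    return this.#base64;
  }
}
```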

Standardized model tags are pretty cool because some models support inline words (i.e. audio tags) that shape the delivery of the generated speech. Some models, like ElevenLabs V3 or Fish S2-pro, use brackets; others, like OpenAI's gpt-4o-mini-tts, use a separate instructions input or SSML tags. SpeechSDK standardizes all of these to brackets (example: [whispers] Hello there.) and strips the tags out for models that don't support the feature, like Mistral's Voxtral TTS.
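The stripping half of that is the easy part to illustrate. A minimal sketch, assuming bracketed tags are the canonical form (the regex and function name are mine, not the SDK's):

```typescript
// Sketch: removing bracketed audio tags for models that don't support them.
function stripAudioTags(text: string): string {
  // Drop each "[tag]" marker plus any whitespace after it,
  // e.g. "[whispers] Hello there." -> "Hello there."
  return text.replace(/\[[^\]]+\]\s*/g, '');
}
```

Translating brackets into each provider's native format (instructions, SSML, etc.) would then live inside the per-provider implementations.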

The SDK's opinionated defaults improve the developer experience.

For example, model strings auto-resolve their API keys from the .env without you setting them manually. You can still create the client directly (which lets engineers set the baseUrl or API key), but if a developer just passes openai/tts-1-hd to generateSpeech(), the SDK automatically uses OPENAI_API_KEY and initializes the provider under the hood.
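A rough sketch of how that resolution could work, keying the environment-variable name off the model's provider prefix (the mapping and function are illustrative assumptions, not the SDK's code):

```typescript
// Sketch: resolving a provider's API key from the environment
// based on the "provider/model" prefix of the model string.
const ENV_KEYS: Record<string, string> = {
  openai: 'OPENAI_API_KEY',
  elevenlabs: 'ELEVENLABS_API_KEY',
};

function resolveApiKey(
  model: string,
  env: Record<string, string | undefined>,
): string {
  const [providerId] = model.split('/');
  const envVar = ENV_KEYS[providerId];
  const key = envVar ? env[envVar] : undefined;
  if (!key) throw new Error(`Missing API key for provider "${providerId}"`);
  return key;
}
```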

It just works.

Up Next

I don't think text-to-speech has been given enough love recently, except maybe in AI call centers. But with the growing demand for generative media and creative workflows, the ability to easily switch between providers becomes increasingly important.

Next time you're thinking about building with speech, I hope SpeechSDK becomes your go-to library. Any and all feedback welcome.