Google has released new text-to-speech models. It’s the same tech powering NotebookLM’s conversational podcasts. The models are called Gemini 2.5 Flash Preview TTS and Gemini 2.5 Pro Preview TTS.
You can try both in Google AI Studio or call them through the API.
First, you choose between single-speaker and multi-speaker audio. Then you pick one or two of the 30 available voices. If you want two speakers, you need to label their lines “Speaker 1” and “Speaker 2” in the input. You can also give the model instructions like this:
"Make Speaker 1 sound tired and bored, and Speaker 2 sound excited and happy."
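To make the input format concrete, here is a minimal sketch of assembling a two-speaker script as one text prompt. The helper name `build_dialogue_prompt` and the sample lines are hypothetical, and the "label, colon, text" layout is an assumption about how the speaker labels are written, not something the models strictly require.

```python
def build_dialogue_prompt(style_instruction, turns):
    """Return one prompt string: a style instruction, a blank line, then labelled lines.

    `turns` is a list of (speaker_label, text) pairs, for example
    ("Speaker 1", "Hello there.").
    """
    lines = [style_instruction.strip(), ""]
    for speaker, text in turns:
        lines.append(f"{speaker}: {text.strip()}")
    return "\n".join(lines)


# Illustrative input only; the dialogue lines are made up for this example.
prompt = build_dialogue_prompt(
    "Make Speaker 1 sound tired and bored, and Speaker 2 sound excited and happy.",
    [
        ("Speaker 1", "Is it Monday already?"),
        ("Speaker 2", "Yes, and I can't wait to get started!"),
    ],
)
print(prompt)
```

The resulting text can be pasted into Google AI Studio or sent as the input of an API request.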
There is a limit to how much Gemini will take in one go. The documentation says 8,000 tokens (or 32,000 for a TTS session). That didn’t work for me: for some reason, and without any warning, my input was cut off after 5,418 characters, which is 890 words or 1,233 tokens.
But using the API, you can write a script that chunks the input and stitches the audio files back together, which is what I did using Python and Google Colab.
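The chunk-and-stitch idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: `synthesize` is a hypothetical callable that turns one text chunk into WAV bytes (in practice it would wrap the TTS API call), and the 4,000-character default is an assumed safety margin below the cutoff observed above. Only the Python standard library is used.

```python
import io
import re
import wave


def chunk_text(text, max_chars=4000):
    """Split text into chunks of at most max_chars, breaking between sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split a single oversized sentence so no chunk exceeds the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def stitch_wav(wav_blobs):
    """Concatenate WAV files (given as bytes) that share one audio format."""
    out = io.BytesIO()
    writer = None
    audio_format = None
    for blob in wav_blobs:
        with wave.open(io.BytesIO(blob), "rb") as reader:
            params = reader.getparams()
            this_format = (params.nchannels, params.sampwidth, params.framerate)
            if writer is None:
                audio_format = this_format
                writer = wave.open(out, "wb")
                writer.setparams(params)
            elif this_format != audio_format:
                raise ValueError("all chunks must share channels, sample width and rate")
            writer.writeframes(reader.readframes(reader.getnframes()))
    if writer is not None:
        writer.close()  # finalizes the WAV header; the BytesIO buffer stays open
    return out.getvalue()


def text_to_wav(text, synthesize, max_chars=4000):
    """Chunk the text, synthesize each chunk, and return one combined WAV.

    `synthesize` is a hypothetical callable: text chunk -> WAV bytes.
    """
    return stitch_wav([synthesize(chunk) for chunk in chunk_text(text, max_chars)])
```

Splitting at sentence boundaries keeps each request under the limit without cutting a word or sentence in half, so the joins between audio segments fall on natural pauses.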
The result? A complete TTS version of my latest newsletter issue—no editing, no tweaking, nothing. Just run and play. I spent less than 2 cents for this. Google charges $10 for 1 million tokens of output.