Instant voice cloning, on a MacBook Air, for free, no ElevenLabs
Chinese e-commerce giant Alibaba released new Qwen models for generating and cloning voices earlier this year. Which means: with only a couple of seconds of recorded material, we can generate a cloned voice recording. On a four-year-old MacBook Air. Instantly. At no cost.
This used to be the domain of ElevenLabs. The company has built security features into its platform to make stealing voices without consent harder. Now it’s just this one simple Terminal command.
If you want to try it for yourself, you’ll need a Mac with Apple Silicon, a voice recording, and a transcript of that recording. I’ve successfully used 25 seconds of historical audio and 10 seconds of clean studio audio.
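If your source clip is an .m4a or .mp3, the afconvert tool that ships with macOS can turn it into a WAV, and the transcript is just a plain text file. A minimal sketch, with my own file names; the mono 24 kHz format is a guessed safe default, not a documented requirement:

```
# Convert a voice memo to mono 16-bit WAV (afconvert ships with macOS)
afconvert -f WAVE -d LEI16@24000 -c 1 recording.m4a example.wav

# The transcript: verbatim what is said in the clip
echo "The exact words spoken in example.wav go here." > example.txt
```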
uvx --from mlx-audio --prerelease=allow mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-4bit --ref_audio example.wav --lang_code English --ref_text "$(cat example.txt)" --text 'You cannot escape the future, but you can unsubscribe anytime.'
Okay, scary, what is happening here?
- “uvx” tells the system to download our tool into a temporary Python environment, run it, and then clean up afterward.
- Tool? We are looking for an audio package named “mlx-audio”: “--from mlx-audio”. It uses Apple’s MLX framework to run hardware-accelerated on Apple Silicon. And we allow experimental versions of the software with “--prerelease=allow”.
- Next, we reach out to Hugging Face to download the actual model: Qwen3-TTS with 1.7 billion parameters, about 1.5 GB of data. We are using the quantized version, which makes it easier to run on Macs with less RAM (like an 8 GB or 16 GB Air) without losing much quality. If you have more RAM, change the model to “Qwen3-TTS-12Hz-1.7B-Base-bf16”; the download then grows to 4.5 GB.
- We are doing Zero-Shot Voice Cloning, so we supply the reference audio with “--ref_audio example.wav”.
- Languages supported: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian. Just change the “--lang_code”.
- We have to tell the model what is being said in the example: “--ref_text "$(cat example.txt)"”.
- What do we want to hear? We put it in “--text 'this'”. Or we leave this part out; then the Terminal becomes interactive: the tool waits for our input, generates audio, waits for more input, generates more audio, and so on (see the example after this list).
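This is what the interactive variant looks like, here with German output. (That “German” is passed verbatim is my assumption, following the pattern of “--lang_code English” above.)

```
# No --text flag: the tool keeps prompting for new input (interactive mode)
uvx --from mlx-audio --prerelease=allow mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-4bit \
  --ref_audio example.wav \
  --lang_code German \
  --ref_text "$(cat example.txt)"
```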
Because we are using “uvx”, we clone a voice, generate a file, and clean up everything. If we want to generate more later, we start from scratch. Only the downloaded models are kept in a central cache.
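If you are curious what actually stays on disk, you can check the two caches; the Hugging Face path is the usual default, not something specific to this tool:

```
# uv's central cache directory
uv cache dir

# downloaded models end up in the default Hugging Face cache
ls ~/.cache/huggingface/hub
```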
The result is stunningly good, but it won’t fool everyone. Which comes as a relief, honestly. But given the current pace of these releases, I’ll give it only a couple more months until we’re all fooled twice before we can even say “hi”.
Thanks to AI tinkerer Jan Eggers, who presented Qwen at the AI in Media event in Hamburg. Since then, he has chained two Qwen models together to automate the transcription step: you only need the example.wav, no example.txt necessary. The command looks like this:
uvx --from mlx-audio --prerelease=allow mlx_audio.stt.generate --model mlx-community/Qwen3-ASR-1.7B-5bit --audio "example.wav" --format txt --language en --output-path test --verbose && rm audio_000.wav || true && uvx --from mlx-audio --prerelease=allow mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 --ref_audio "example.wav" --lang_code English --ref_text "$(cat test.txt)" --text 'My voice is my passport. Verify me.'
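In case the one-liner is hard to parse, here it is split into its two steps: first the ASR (speech-to-text) model transcribes example.wav into test.txt, then the familiar TTS command consumes that transcript. (Functionally the same, minus the “rm audio_000.wav || true” in the middle, which as far as I can tell just clears a leftover output file from a previous run.)

```
# Step 1: transcribe the reference audio with Qwen3-ASR into test.txt
uvx --from mlx-audio --prerelease=allow mlx_audio.stt.generate \
  --model mlx-community/Qwen3-ASR-1.7B-5bit \
  --audio "example.wav" --format txt --language en \
  --output-path test --verbose

# Step 2: clone the voice, using the fresh transcript as --ref_text
uvx --from mlx-audio --prerelease=allow mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 \
  --ref_audio "example.wav" --lang_code English \
  --ref_text "$(cat test.txt)" \
  --text 'My voice is my passport. Verify me.'
```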