Complete Guide to Running Qwen3-TTS on Mac Mini M4 | From Zero to Hero in 5 Steps


Qwen3-TTS is Alibaba's latest text-to-speech model, but it's designed for NVIDIA GPUs by default. This guide walks you through running it successfully on a Mac Mini M4—from installing system dependencies and setting up Python, to adapting the code for Apple GPU (MPS). Perfect for Mac users, AI beginners, and developers who want to try out TTS models.


Who is this for? Anyone with a Mac Mini M4 who can open Terminal. No AI background needed—just follow along!

📝 This guide is based on the author’s actual testing on a Mac Mini M4. All steps have been verified to work.


📋 Before We Begin

Qwen3-TTS is Alibaba’s latest text-to-speech model, and it performs impressively well. However, it’s designed for NVIDIA GPUs by default, so running it on a Mac requires a few small adjustments.

The good news: only a few changes are needed, and I’ve already worked through all the pitfalls for you 🎉

The entire process takes 5 steps, approximately 15–30 minutes (most of which is spent waiting for the model to download).




Step 1: Install System Dependencies

macOS is missing some audio processing tools. Let’s install them using Homebrew.

Open Terminal and paste the following command:

brew install portaudio ffmpeg sox

⚠️ If you skip this step, you’ll hit the error “/bin/sh: sox: command not found” later. You can come back and install then, but it’s better to get it done now.
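If you want to confirm the command-line tools actually landed on your PATH, here’s a quick sketch (portaudio is a library rather than a command, so only ffmpeg and sox can be checked this way):

```shell
# Report whether each CLI tool from Step 1 is on the PATH.
status=""
for tool in ffmpeg sox; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="${status}${tool}: found\n"
  else
    status="${status}${tool}: missing (brew install ${tool})\n"
  fi
done
printf "%b" "$status"
```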


Step 2: Set Up a Python Environment

Pick a directory for your project, then use Conda to create a clean Python 3.12 environment to avoid conflicts with other projects on your system.

# Create and activate the environment (only need to do this once)
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install core libraries
pip install -U qwen-tts

# Clone the official repository
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .

💡 What’s a Conda environment? Think of it as a “separate room” where all this project’s dependencies live. They won’t interfere with other programs on your computer.


Step 3: Adapt the Code for M4 Chip ⭐ (Crucial!)

Up to this point, everything matches the GitHub instructions. But if you’re using a Mac with an M-series chip, things get a bit different from here.

This is where Mac users most often run into trouble. The official script assumes NVIDIA GPUs, so we need to modify two places to use Apple’s GPU (MPS) instead.

Open the file examples/test_model_12hz_base.py, look around line 50, and make these two changes:

Change A: Specify MPS as the device

# ❌ Original code (designed for NVIDIA)
# tts = Qwen3TTSModel.from_pretrained(..., attn_implementation="flash_attention_2")

# ✅ Change to this (adapted for Mac M4)
tts = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",   # Note: removed trailing slash /
    torch_dtype=torch.bfloat16,          # M4 fully supports bfloat16, balancing precision and speed
    attn_implementation="sdpa",          # Mac-compatible attention mechanism, replaces flash_attention_2
    device_map="mps",                    # Force Apple GPU usage
)

Change B: Adapt synchronization for MPS

# ❌ Original code (NVIDIA only, will crash on Mac)
# torch.cuda.synchronize()

# ✅ Change to this (auto-detect which GPU to use)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elif torch.backends.mps.is_available():
    torch.mps.synchronize()   # Correct command for Mac

🔧 Why make these changes? The M4 chip uses Apple’s own Metal framework (MPS), which is completely different from NVIDIA’s CUDA. The first change tells the model “use Apple GPU,” and the second ensures the sync wait command uses the correct Apple version.
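The branch order of Change B can be captured in a tiny helper that you can reason about without any GPU present; pick_sync_backend below is a hypothetical name used purely for illustration:

```python
def pick_sync_backend(cuda_available: bool, mps_available: bool) -> str:
    """Mirror Change B's branch order: CUDA first, then MPS, else nothing."""
    if cuda_available:
        return "cuda"   # would call torch.cuda.synchronize()
    if mps_available:
        return "mps"    # would call torch.mps.synchronize()
    return "none"       # plain CPU runs need no explicit synchronization

# On a Mac M4, CUDA is absent and MPS is present:
print(pick_sync_backend(cuda_available=False, mps_available=True))  # → mps
```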


Step 4: Download and Run the Model

The model file is approximately 4 GB, so make sure you have a stable internet connection.

cd examples
python test_model_12hz_base.py

🐢 Download too slow? Try a mirror:

export HF_ENDPOINT=https://hf-mirror.com
python test_model_12hz_base.py
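If you’d rather keep the mirror setting inside the script than in the shell, the same environment variable can be set from Python, as a sketch of the idea; note it must happen before the Hugging Face libraries are imported, since they typically read HF_ENDPOINT at import time:

```python
import os

# Fall back to the mirror only if HF_ENDPOINT isn't already set in the shell.
# This line must run before importing huggingface_hub / transformers.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

print(os.environ["HF_ENDPOINT"])
```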

❌ Getting a SafetensorError?

This means the previous download was interrupted and the file is corrupted. The solution is simple:

  1. Open Finder and go to ~/.cache/huggingface/hub
  2. Delete the Qwen folder
  3. Re-run the script to re-download
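If you prefer to do the cleanup from code, the Hub cache stores each repository in a folder named models--&lt;org&gt;--&lt;name&gt;; hub_cache_dir below is a hypothetical helper sketching how to locate it:

```python
import os

def hub_cache_dir(repo_id: str, cache_root: str = "~/.cache/huggingface/hub") -> str:
    """Map a repo id like 'Qwen/Qwen3-TTS-12Hz-1.7B-Base' to its cache folder."""
    folder = "models--" + repo_id.replace("/", "--")
    return os.path.join(os.path.expanduser(cache_root), folder)

print(hub_cache_dir("Qwen/Qwen3-TTS-12Hz-1.7B-Base"))
```

Passing the resulting path to shutil.rmtree(..., ignore_errors=True) removes the corrupted files and forces a clean re-download on the next run.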

Step 5: Verify the GPU is Working

At any point, you can quickly confirm that the M4’s GPU is recognized correctly:

import torch
print(torch.backends.mps.is_available())  # If True, you're all set ✅

🎉 Success!

If everything goes well, after running the script, a new folder will be created in the examples/ directory containing the generated audio files.


📎 Complete Reference Code

Below is the full code with all Mac adaptations, plus multi-language merged output and speed control features. You can save it as a .py file and use it directly:

import os
import torch
import soundfile as sf
import numpy as np
# Ensure 'qwen_tts' is installed/present in the environment
from qwen_tts import Qwen3TTSModel

# ================= 1. Initialization (Setup) =================

# Auto-detect the hardware.
# "mps" = Mac (Apple Silicon), "cuda" = NVIDIA GPU, "cpu" = Standard Processor
if torch.backends.mps.is_available():
    device = "mps"   # Mac M1/M2/M3/M4...
elif torch.cuda.is_available():
    device = "cuda"  # NVIDIA GPU
else:
    device = "cpu"   # Standard CPU

print(f"Using device: {device}")

# Define where to save the results
OUT_DIR = "qwen3_slow_output"
os.makedirs(OUT_DIR, exist_ok=True)

print("Loading model... (This might take a minute)")

# Loading the model from Hugging Face
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map=device,
)
print("Model loaded successfully!")

# ================= 2. Reference Audio Settings =================
# This is the voice the model will mimic (clone).

# Option A: Use a URL (Official Qwen Example)
ref_audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav"

# Option B: Use a local file (Uncomment the line below to use your own file)
# ref_audio_url = "./my_voice.wav"

# CRITICAL: This text MUST match what is said in the reference audio exactly.
# If this is wrong, the quality will be bad.
ref_text_content = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

# ================= 3. Content to Generate =================
# Tip: To make the speech slower and clearer, we add punctuation (like , . ...)
# This forces the model to pause between words.

segments = [
    {
        "lang": "Chinese",
        # Original: 大家好...
        # Trick: Added commas to slow it down.
        "text": "大家好,这个视频是,分享如何在Mac Mini上,部署Qwen.3-TTS,运行官方例子程序,希望你们喜欢。",
        "temp": 0.7,
    },
    {
        "lang": "English",
        # Original: This video is about...
        # Trick: Added "..." and extra commas for a relaxed pace.
        "text": "Hello everyone! In this video, I'll share how to deploy Qwen.3-TTS on a Mac Mini and run the official demos. I hope you enjoy it.",
        "temp": 0.7,
    },
    {
        "lang": "Japanese",
        # Trick: Added extra Japanese commas (、)
        "text": "皆さん、こんにちは。この動画では、Mac MiniでQwen.3-TTSを導入し、公式デモを動かす方法をシェアします。気に入っていただけると嬉しいです。",
        "temp": 0.7,
    },
    {
        "lang": "Korean",
        # Trick: Added breaks between concepts.
        "text": "안녕하세요 여러분. 이번 영상에서는 맥 미니(Mac Mini)에 Qwen.3-TTS를 구축하고, 공식 예제를 실행하는 방법을 공유해 드리겠습니다. 유익한 시간이 되시길 바랍니다.",
        "temp": 0.7,
    },
    {
        "lang": "German",
        "text": "Hallo zusammen! In diesem Video zeige ich euch, wie man Qwen.3-TTS auf einem Mac Mini deployt und die offiziellen Demos ausführt. Ich hoffe, es gefällt euch.",
        "temp": 0.6,
    },
    {
        "lang": "French",
        "text": "Bonjour à tous ! Dans cette vidéo, je vais partager comment déployer Qwen.3-TTS sur un Mac Mini et lancer les démos officielles. J'espère qu'elle vous plaira.",
        "temp": 0.8,
    }
]

# ================= 4. Generation Loop =================
all_audio_parts = []
final_sr = None # Sample rate

print("Starting audio generation...")

for i, seg in enumerate(segments):
    print(f"[{i+1}/{len(segments)}] Generating {seg['lang']} segment...")

    # Try to use the 'speed' parameter if the model supports it
    try:
        wavs, sr = model.generate_voice_clone(
            text=seg['text'],
            language=seg['lang'],
            ref_audio=ref_audio_url,
            ref_text=ref_text_content,
            temperature=seg['temp'],
            speed=0.85,  # 0.85 = 85% speed (Slower)
        )
    except TypeError:
        # If 'speed' causes an error, remove it and just use the text tricks
        print(f"  (Note: Speed parameter not supported, using standard speed for {seg['lang']})")
        wavs, sr = model.generate_voice_clone(
            text=seg['text'],
            language=seg['lang'],
            ref_audio=ref_audio_url,
            ref_text=ref_text_content,
            temperature=seg['temp'],
        )

    # Process the audio data
    audio_data = wavs[0]
    if isinstance(audio_data, torch.Tensor):
        audio_data = audio_data.cpu().numpy()

    all_audio_parts.append(audio_data)
    if final_sr is None: final_sr = sr

# ================= 5. Merging Audio =================
print("Merging all segments...")

# Create a silence gap between languages
# For better listening experience, duration is set to 0.3s (adjust as needed)
silence_duration = 0.3
silence_samples = int(silence_duration * final_sr)
silence_data = np.zeros(silence_samples, dtype=np.float32)

final_sequence = []
for part in all_audio_parts:
    final_sequence.append(part)
    final_sequence.append(silence_data) # Add silence after each part

# Remove the very last silence block
if final_sequence:
    final_sequence.pop()

full_audio = np.concatenate(final_sequence)

# ================= 6. Save Output =================
final_path = os.path.join(OUT_DIR, "Final_Slow_Mix.wav")
sf.write(final_path, full_audio, final_sr)

print("="*30)
print(f"Done! Audio saved to:\n{final_path}")
print("="*30)
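The silence-gap arithmetic in step 5 is easy to sanity-check in isolation; this toy example uses dummy arrays in place of real audio, with an assumed 24 kHz sample rate:

```python
import numpy as np

sr = 24000                                              # assumed sample rate, for illustration
gap = np.zeros(int(round(0.3 * sr)), dtype=np.float32)  # 0.3 s of silence

a = np.ones(1000, dtype=np.float32)                     # stand-ins for two generated segments
b = np.ones(500, dtype=np.float32)
merged = np.concatenate([a, gap, b])                    # silence only *between* segments

print(len(gap), len(merged))  # → 7200 8700
```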

🛠️ Quick Troubleshooting Guide

Symptom | Cause | Solution
sox: command not found | Missing system dependencies | Run the brew install command from Step 1
SafetensorError | Model file download interrupted | Delete the Qwen folders under ~/.cache/huggingface/hub and retry
torch.cuda crash error | NVIDIA-specific commands on Mac | Check that Change B in Step 3 was applied
Download slow / timeout | HuggingFace access restricted | Set the mirror (Step 4) and retry
Weird driver errors | Occasional Apple Silicon driver issues | Restart your computer (fixes 90% of odd issues)