For years, "speech-to-text" meant uploading your voice to a server. Not anymore. Apple Silicon Macs are fast enough to run state-of-the-art models locally — which changes the math on privacy, cost, and reliability. Here's how local speech-to-text actually works.
What "local" really means
Local (or "on-device") speech-to-text runs the recognition model on your own machine. Your microphone audio is processed by the Mac's own hardware (CPU, GPU, or Neural Engine, depending on the runtime) and turned into text without ever leaving the machine. No upload, no server copy, no per-minute meter running.
The models that make it possible
- Whisper: OpenAI's open model, accurate across 90+ languages.
- Parakeet: NVIDIA's model, often faster and very accurate on English.
Both run comfortably on Apple Silicon. Why they run so quickly is covered in whisper.cpp on Apple Silicon, and we compare the two models in Whisper vs Parakeet.
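To make "local" concrete, here is a minimal Python sketch that transcribes a file by calling the whisper.cpp command-line tool. It is an illustration under assumptions, not how any particular app does it: it assumes whisper.cpp is already built and its `whisper-cli` binary is on your PATH, that a ggml model file has been downloaded, and that the audio is a 16 kHz WAV file. The file names and the helper functions `build_whisper_command` and `transcribe_locally` are hypothetical.

```python
import shutil
import subprocess


def build_whisper_command(audio_path, model_path, language="en", binary="whisper-cli"):
    """Build the argument list for a local whisper.cpp run.

    Every argument points at something on the local disk; nothing here
    references a network endpoint.
    """
    return [
        binary,                  # assumed name of the whisper.cpp CLI binary
        "-m", str(model_path),   # ggml model file already on disk
        "-f", str(audio_path),   # input audio (assumed 16 kHz WAV)
        "-l", language,          # spoken-language code, e.g. "en"
        "-nt",                   # print plain text without timestamps
    ]


def transcribe_locally(audio_path, model_path, language="en", binary="whisper-cli"):
    """Run whisper.cpp on this machine and return whatever text it prints."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH; build whisper.cpp first")
    cmd = build_whisper_command(audio_path, model_path, language, binary)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical paths; replace with your own recording and model file.
    print(transcribe_locally("memo.wav", "models/ggml-base.en.bin"))
```

The whole pipeline is one local process reading one local file, so it works with Wi-Fi turned off.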
Why on-device beats the cloud
| | Local | Cloud |
|---|---|---|
| Privacy | Audio stays on device | Uploaded |
| Cost | No per-minute fee | Metered / subscription |
| Offline | Works anywhere | Needs internet |
| Longevity | Can't be shut off | Service can change |
The full trade-off is in local Whisper vs cloud transcription and the benefits of offline transcription.
Setting it up
The easiest path to local speech-to-text on a Mac is an app that bundles the models and a hotkey workflow. Sotto runs Whisper and Parakeet on-device, types into any app when you press a hotkey, and saves every recording so you can re-transcribe with a better model later. It's $49 one-time, no subscription. For a step-by-step, see the voice dictation setup guide.
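To weigh a one-time price against a metered cloud service, a quick break-even calculation helps. The sketch below is only arithmetic: the $49 figure comes from above, while the per-minute cloud rate and the daily dictation time are assumed placeholders, not quotes for any particular service.

```python
def break_even_minutes(one_time_price, cloud_rate_per_minute):
    """Minutes of transcription at which a metered service costs as much as the one-time price."""
    if cloud_rate_per_minute <= 0:
        raise ValueError("cloud_rate_per_minute must be positive")
    return one_time_price / cloud_rate_per_minute


def months_to_break_even(one_time_price, cloud_rate_per_minute,
                         minutes_per_day, days_per_month=30):
    """Months of regular use before the one-time price is the cheaper option."""
    if minutes_per_day <= 0:
        raise ValueError("minutes_per_day must be positive")
    minutes = break_even_minutes(one_time_price, cloud_rate_per_minute)
    return minutes / (minutes_per_day * days_per_month)


if __name__ == "__main__":
    price = 49.00         # one-time price from the article
    assumed_rate = 0.006  # ASSUMED cloud price in dollars per audio minute
    assumed_usage = 60    # ASSUMED minutes of dictation per day
    print(f"Break-even after {break_even_minutes(price, assumed_rate):.0f} minutes")
    print(f"About {months_to_break_even(price, assumed_rate, assumed_usage):.1f} months at this usage")
```

Swap in the rate your cloud provider actually charges and your own usage. Subscription plans with a flat monthly fee need a simpler comparison: the one-time price divided by the monthly fee.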
Bottom line
Local speech-to-text is no longer a compromise; it's usually the better option. You get privacy, no recurring cost, and reliability that doesn't depend on someone else's servers staying online. If you want it set up in minutes, try Sotto.