Whisper comes in five sizes. Bigger isn't always better: the right choice depends on your hardware, audio quality, and accuracy needs.
Model Sizes at a Glance
| Model | File size | Speed (relative to Large) | Accuracy |
|---|---|---|---|
| Tiny | 75 MB | ~32x | Basic |
| Base | 142 MB | ~16x | Good |
| Small | 466 MB | ~6x | Better |
| Medium | 1.5 GB | ~2x | Great |
| Large | 2.9 GB | ~1x | Best |
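To turn the speed column into a rough time estimate, here is a minimal sketch. It assumes the multipliers are relative to Large and that you have timed Large once on your own machine. The function name `estimate_seconds` and the 600-second measurement are illustrative, not part of Whisper or Sotto.

```python
# Relative speed multipliers taken from the table above (Large = 1x).
RELATIVE_SPEED = {"tiny": 32, "base": 16, "small": 6, "medium": 2, "large": 1}


def estimate_seconds(model: str, large_seconds: float) -> float:
    """Estimate processing time for `model` from a measured time for Large.

    Assumption: time scales inversely with the table's multiplier.
    Real results vary with hardware and audio.
    """
    return large_seconds / RELATIVE_SPEED[model]


# Hypothetical measurement: Large took 600 s on some recording.
for name in RELATIVE_SPEED:
    print(f"{name:>6}: ~{estimate_seconds(name, 600):.0f} s")
```

Treat the numbers as ballpark figures for comparing sizes, not as a benchmark.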
When to Use Each
Tiny & Base
Quick drafts, real-time-ish transcription, or when accuracy isn't critical. Good for clear audio with single speakers in quiet environments.
Small
Best balance for most users. Handles moderate noise and accents well. Fast enough for comfortable use, accurate enough for most needs.
Medium
Challenging audio: accents, background noise, multiple speakers. Worth the extra time when accuracy matters and audio quality is imperfect.
Large
Maximum accuracy for difficult audio. Professional transcription, legal/medical content, or when every word must be correct.
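The guidance above can be sketched as a small decision helper. The categories ("clean", "moderate", "challenging") and the function `recommend_model` are assumptions made for illustration; they simplify the descriptions above and are not a Sotto or Whisper API.

```python
def recommend_model(audio: str = "moderate",
                    critical: bool = False,
                    draft: bool = False) -> str:
    """Map the guidance above to a model size (illustrative rules only).

    audio:    "clean", "moderate", or "challenging" (hypothetical labels)
    critical: every word must be correct (legal, medical, professional)
    draft:    a quick draft where accuracy isn't critical
    """
    if audio not in ("clean", "moderate", "challenging"):
        raise ValueError(f"unknown audio label: {audio!r}")
    if critical:
        return "large"    # maximum accuracy
    if audio == "challenging":
        return "medium"   # accents, background noise, multiple speakers
    if draft and audio == "clean":
        return "base"     # or "tiny" when speed matters even more
    return "small"        # balanced default


print(recommend_model())                       # small
print(recommend_model("challenging"))          # medium
print(recommend_model("clean", draft=True))    # base
print(recommend_model("clean", critical=True)) # large
```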
Hardware Considerations
- Apple Silicon: All models run well; even Large is practical
- Intel Mac: Stick to Small or Medium
- RAM: Large needs at least 4 GB free
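As a rough sketch, the hardware notes above can be used to cap whichever size you would otherwise pick. It assumes a native Python build on macOS, where `platform.machine()` reports `"arm64"` on Apple Silicon and `"x86_64"` on Intel. The helper names are hypothetical, and the only memory figure used is the 4 GB threshold for Large stated above.

```python
import platform

SIZES = ["tiny", "base", "small", "medium", "large"]  # smallest to largest


def largest_practical_model(machine: str, free_ram_gb: float) -> str:
    """Return the largest size the hardware notes above would suggest."""
    if machine != "arm64":
        return "medium"   # Intel Mac: stick to Small or Medium
    if free_ram_gb < 4:
        return "medium"   # Large needs 4 GB+ free
    return "large"        # Apple Silicon with enough free memory


def cap_model(wanted: str, machine: str, free_ram_gb: float) -> str:
    """Pick the smaller of the wanted size and the hardware limit."""
    limit = largest_practical_model(machine, free_ram_gb)
    return min(wanted, limit, key=SIZES.index)


# free_ram_gb is supplied by hand here; measuring it is out of scope.
print(cap_model("large", platform.machine(), free_ram_gb=8.0))
```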
The .en Models
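English-only variants (tiny.en, base.en, small.en, medium.en) are slightly more accurate for English and slightly faster. The accuracy gain is most noticeable at the Tiny and Base sizes. Use these if you only transcribe English. Large has no English-only variant.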
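A minimal sketch of how the naming works: the `.en` suffix exists for every size except Large, so a helper that builds a model name has to fall back to the multilingual model there. `model_name` is a hypothetical helper, not a Whisper or Sotto function.

```python
SIZES_WITH_EN = {"tiny", "base", "small", "medium"}  # Large has no .en variant
ALL_SIZES = SIZES_WITH_EN | {"large"}


def model_name(size: str, english_only: bool) -> str:
    """Build a Whisper model name, using the .en variant when one exists."""
    if size not in ALL_SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if english_only and size in SIZES_WITH_EN:
        return f"{size}.en"
    return size  # multilingual model


print(model_name("small", english_only=True))   # small.en
print(model_name("small", english_only=False))  # small
print(model_name("large", english_only=True))   # large
```

In the open-source `openai-whisper` Python package, strings like these are what `whisper.load_model()` takes as the model name. How Sotto names its models internally is not covered here.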
Choose Your Model
Sotto lets you switch models based on your needs. All sizes included. $29 once.