Not all transcription AI is equal. Understanding model architectures helps you choose the right tool and settings for your use case.
How Whisper Works
Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual audio. It converts audio into a spectrogram, then generates text tokens one at a time, using attention to take the surrounding context into account.
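To make the input side concrete, here is a minimal sketch of how a recording is divided up before the model sees it. It assumes the constants used by the open-source Whisper reference implementation (16 kHz audio, 30-second windows, a 10 ms hop between spectrogram frames). The function name `whisper_input_shape` is hypothetical and only for illustration.

```python
import math

SAMPLE_RATE = 16_000   # Hz; audio is resampled to 16 kHz (assumed from the open-source reference)
CHUNK_SECONDS = 30     # the model works on fixed 30-second windows
HOP_SAMPLES = 160      # 10 ms between spectrogram frames at 16 kHz


def whisper_input_shape(duration_seconds: float) -> dict:
    """Estimate how many windows and spectrogram frames a recording yields."""
    # Short clips are padded up to one full window, so the minimum is 1.
    chunks = max(1, math.ceil(duration_seconds / CHUNK_SECONDS))
    samples_per_chunk = SAMPLE_RATE * CHUNK_SECONDS
    frames_per_chunk = samples_per_chunk // HOP_SAMPLES
    return {
        "chunks": chunks,
        "samples_per_chunk": samples_per_chunk,
        "frames_per_chunk": frames_per_chunk,
    }


if __name__ == "__main__":
    print(whisper_input_shape(95))
```

For a 95-second recording this gives 4 windows of 3,000 frames each. Treat the window count as an estimate: a real decoder may shift window boundaries based on the timestamps it predicts.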
Model Sizes Explained
| Model | Parameters | RAM | Speed |
|---|---|---|---|
| Tiny | 39M | ~1GB | Fastest |
| Base | 74M | ~1GB | Fast |
| Small | 244M | ~2GB | Moderate |
| Medium | 769M | ~5GB | Slower |
| Large | 1.5B | ~10GB | Slowest |
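The RAM column is easier to read next to the raw size of the weights. The sketch below multiplies each parameter count from the table by an assumed number of bytes per parameter (4 for 32-bit floats, 2 for 16-bit). The function name `weight_size_gb` is hypothetical, and the result covers the weights only. The table's figures are larger, presumably because they also include activations and runtime overhead.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1500,
}


def weight_size_gb(model: str, bytes_per_param: int = 4) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return PARAMS_MILLIONS[model] * 1e6 * bytes_per_param / 1e9


if __name__ == "__main__":
    for name in PARAMS_MILLIONS:
        fp32 = weight_size_gb(name, 4)
        fp16 = weight_size_gb(name, 2)
        print(f"{name:>6}: {fp32:.2f} GB at 32-bit, {fp16:.2f} GB at 16-bit")
```

Under these assumptions, Large needs about 6 GB for 32-bit weights alone, which is consistent with the ~10GB working figure in the table.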
Choosing the Right Model
- Real-time dictation: Tiny or Base for speed
- Transcribing recordings: Medium or Large for accuracy
- Non-English languages: Medium or Large for best results
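The guidance above can be sketched as a small selection rule: start from the preferred model for the use case, then step down in size until it fits in available memory. This is an illustrative sketch, not how Sotto itself chooses a model. The use-case labels and the `choose_model` function are hypothetical, and the memory figures come from the table above.

```python
# Approximate memory needs in GB, copied from the table above.
MODEL_RAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
SIZE_ORDER = ["tiny", "base", "small", "medium", "large"]

# Preferred starting model per use case (labels are hypothetical).
PREFERRED = {
    "dictation": "base",       # real-time dictation: favor speed
    "recording": "large",      # transcribing recordings: favor accuracy
    "non_english": "medium",   # non-English: Medium as the baseline
}


def choose_model(use_case: str, free_ram_gb: float) -> str:
    """Return the preferred model, stepping down until it fits in memory."""
    start = SIZE_ORDER.index(PREFERRED[use_case])
    # Walk from the preferred size down toward Tiny.
    for name in reversed(SIZE_ORDER[: start + 1]):
        if MODEL_RAM_GB[name] <= free_ram_gb:
            return name
    raise ValueError("not enough memory for any model")


if __name__ == "__main__":
    print(choose_model("recording", 8))
    print(choose_model("dictation", 16))
```

With 8 GB free, a recording job falls back from Large to Medium. Stepping down trades accuracy for fit, so the fallback is a compromise rather than an equivalent choice.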
Multiple Models Built-in
Sotto includes multiple Whisper models, so you can switch between them based on your needs. $29 one-time.
Get Sotto