Sotto

Understanding AI Models for Speech Transcription

Learn how AI transcription models work. Compare Whisper variants, understand model sizes, and choose the right model for your needs.

November 26, 2025 · 7 min read

Not all transcription AI is equal. Understanding model architectures helps you choose the right tool and settings for your use case.

How Whisper Works

Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual audio. It converts audio into a log-Mel spectrogram, then predicts text tokens one at a time, using attention to weigh the surrounding audio and the words it has already transcribed.
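To make "attention" a little more concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside transformer models. This is a toy in plain Python, not Whisper's actual code: the function names (softmax, attend) and the tiny two-dimensional "frames" are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Scores the query against every key, converts the scores to weights,
    and returns the weights plus the weighted average of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy example: three "audio frames" with made-up 2-dimensional features.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]  # most similar to the third key

weights, output = attend(query, keys, values)
```

The third frame gets the largest weight because its key matches the query best. In a real model the same idea runs over many frames and tokens at once, which is roughly how context ends up influencing each predicted word.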

Model Sizes Explained

Model    Parameters   RAM      Speed
Tiny     39M          ~1GB     Fastest
Base     74M          ~1GB     Fast
Small    244M         ~2GB     Moderate
Medium   769M         ~5GB     Slower
Large    1.5B         ~10GB    Slowest
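One way to use these numbers is to pick the largest model that fits your memory budget. The sketch below assumes the approximate RAM figures from the table; MODELS and largest_model_that_fits are hypothetical names for this example, not part of Whisper or Sotto.

```python
# Approximate figures copied from the table above.
MODELS = [
    # (name, parameters in millions, approximate RAM in GB), smallest first
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1500, 10),
]

def largest_model_that_fits(ram_budget_gb):
    """Return the name of the biggest model whose RAM estimate fits, or None."""
    fitting = [name for name, _params, ram in MODELS if ram <= ram_budget_gb]
    # MODELS is ordered smallest to largest, so the last match is the biggest.
    return fitting[-1] if fitting else None
```

For example, largest_model_that_fits(4) returns "small", since Medium needs roughly 5GB. Treat the result as a starting point: the RAM column is an estimate, and real usage depends on the runtime and on what else is running on your machine.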

Choosing the Right Model

  • Real-time dictation: Tiny or Base for speed
  • Transcribing recordings: Medium or Large for accuracy
  • Non-English languages: Medium or Large for best results
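The guidelines above can be sketched as a small lookup. The names here (RECOMMENDED, recommend_models) are hypothetical, and the rule that a non-English language takes priority over speed is an assumption of this sketch, not something the guidelines spell out.

```python
# Use-case recommendations taken from the list above.
RECOMMENDED = {
    "dictation": ["tiny", "base"],        # real-time dictation: speed first
    "recordings": ["medium", "large"],    # transcribing recordings: accuracy first
}

# Assumption: non-English audio always gets Medium or larger.
NON_ENGLISH_MODELS = ["medium", "large"]

def recommend_models(use_case, non_english=False):
    """Return candidate model names for a use case, smallest first."""
    if use_case not in RECOMMENDED:
        raise ValueError(f"unknown use case: {use_case!r}")
    if non_english:
        return list(NON_ENGLISH_MODELS)
    return list(RECOMMENDED[use_case])
```

Calling recommend_models("dictation") gives ["tiny", "base"], while recommend_models("dictation", non_english=True) gives ["medium", "large"]. If real-time speed matters more to you than accuracy in another language, flip that priority.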

Multiple Models Built-in

Sotto includes multiple Whisper models, so you can switch between them based on your needs. $29, one-time purchase.


About Kitze

Creator of Sotto and indie developer building tools for productivity. Passionate about local AI and privacy-first software.
