STT works best on desktop
Speech recognition runs on WebGPU or WASM, both of which perform best in desktop browsers. For the best experience, use a desktop or laptop computer.
About Whisper STT
Whisper is OpenAI's state-of-the-art speech recognition model, now running entirely in your browser. It supports 99 languages with automatic language detection, and produces highly accurate transcriptions with word-level timestamps.
Choose from three model sizes: Tiny (~40MB, fastest), Base (~76MB, good balance), or Small (~240MB, best quality). The model automatically uses WebGPU when available, falling back to WASM for broad compatibility. Audio is processed in a background thread so the UI stays responsive.
Audio never leaves your browser: all processing happens locally. Export transcriptions as plain text, SRT subtitles, or WebVTT captions.
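To make the export formats concrete, here is a minimal sketch of how timestamped text could be turned into SRT and WebVTT. The `Segment` shape and the function names `formatTimestamp`, `toSrt`, and `toVtt` are assumptions made for illustration, not this tool's actual code. The main difference between the two formats is that SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period.

```typescript
// Hypothetical shape for one timed piece of text (times in seconds).
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as HH:MM:SS<sep>mmm. SRT uses "," and WebVTT uses ".".
function formatTimestamp(seconds: number, msSeparator: "," | "."): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${msSeparator}${pad(ms, 3)}`;
}

// SRT: numbered cues separated by a blank line.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => {
      const range = `${formatTimestamp(seg.start, ",")} --> ${formatTimestamp(seg.end, ",")}`;
      return `${i + 1}\n${range}\n${seg.text.trim()}\n`;
    })
    .join("\n");
}

// WebVTT: "WEBVTT" header, then cues separated by a blank line.
function toVtt(segments: Segment[]): string {
  const cues = segments.map((seg) => {
    const range = `${formatTimestamp(seg.start, ".")} --> ${formatTimestamp(seg.end, ".")}`;
    return `${range}\n${seg.text.trim()}\n`;
  });
  return "WEBVTT\n\n" + cues.join("\n");
}
```

For word-level captions, the same functions would work if each word (or small group of words) were passed as its own `Segment`.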
Getting Started with Whisper STT
Transcribe audio to text directly in your browser. No API key, no signup: just upload audio and get accurate transcription with word-level timestamps.
1. Choose Model Size
Tiny (~40MB) for quick tests, Base (~76MB) for balanced speed and accuracy, Small (~240MB) for best quality. Start with Tiny to verify your setup.
2. Upload or Record Audio
Upload an audio file (WAV, MP3, WebM, etc.) or record directly in the browser. The tool decodes audio in a background worker for smooth performance.
3. Transcribe
Whisper auto-detects the spoken language and produces word-level timestamps. Streaming mode shows results in real time as transcription progresses.
4. Export Results
Download as plain text, SRT subtitles, or WebVTT captions. SRT and VTT include word-level timestamps for video captioning.
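As background for step 2: Whisper models take 16 kHz mono audio, so decoded audio generally has to be mixed down and resampled before transcription. The sketch below shows one simple way to do that on raw sample arrays. The function names `mixToMono` and `resampleLinear` are hypothetical, and how this tool actually prepares audio is not stated here; it may well rely on the browser's own audio APIs instead.

```typescript
// Average all channels into a single mono channel.
// Assumes every channel has the same length as the first one.
function mixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const ch of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += ch[i] / channels.length;
    }
  }
  return mono;
}

// Resample by linear interpolation between neighbouring samples.
// This is a simplification: it applies no low-pass filter, so
// downsampling content with strong high frequencies can alias.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  if (fromRate === toRate) return input;
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Target sample rate expected by Whisper models.
const WHISPER_SAMPLE_RATE = 16000;
```

A typical call would be `resampleLinear(mixToMono(channels), sourceRate, WHISPER_SAMPLE_RATE)`, run inside a worker so the page stays responsive.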
Tips for Accurate Transcription
Use clear audio. Low background noise and clear speech produce the best results. If possible, use a good microphone and record in a quiet environment.
Choose the right model. Tiny works well for quick drafts. For production use (subtitles, meeting notes, accessibility), use Small for the highest accuracy.
Let auto-detection work. Whisper detects the spoken language automatically. If you have multilingual audio, let it auto-detect rather than forcing a language.
Use WebGPU. Chrome and Edge with WebGPU enabled transcribe significantly faster than the WASM fallback. The tool auto-selects the best available backend.
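The backend auto-selection can be pictured with a small sketch like the one below. It assumes the check is based on the standard `navigator.gpu.requestAdapter()` call; the names `Backend`, `GpuLike`, and `pickBackend` are hypothetical, and the tool's real selection logic may differ.

```typescript
// Hypothetical names for illustration only.
type Backend = "webgpu" | "wasm";

// Minimal slice of the WebGPU entry point (navigator.gpu) used here.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

// Prefer WebGPU when the API exists and returns an adapter.
// Fall back to WASM if the API is missing, no adapter is
// available, or the request fails.
async function pickBackend(gpu: GpuLike | undefined): Promise<Backend> {
  if (!gpu) return "wasm";
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

In a browser this could be called as `pickBackend((navigator as { gpu?: GpuLike }).gpu)`. Checking for an adapter, not just for the presence of `navigator.gpu`, matters because a browser can expose the API while the device has no usable GPU.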