STT works best on desktop

Speech recognition runs on WebGPU or WASM and performs best in desktop browsers. For the best experience, use a desktop or laptop computer.

Model size: 40-240 MB (tiny/base/small)
Languages supported: 99
Whisper parameters: 39M-244M
Backend: WebGPU with WASM fallback (GPU + CPU)

About Whisper STT

Whisper is OpenAI's state-of-the-art speech recognition model, now running entirely in your browser. It supports 99 languages with automatic language detection, and produces highly accurate transcriptions with word-level timestamps.

Choose from three model sizes: Tiny (~40MB, fastest), Base (~76MB, good balance), or Small (~240MB, best quality). The model automatically uses WebGPU when available, falling back to WASM for broad compatibility. Audio is processed in a background thread so the UI stays responsive.
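The WebGPU-first, WASM-fallback choice described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: `chooseBackend` and `detectWebGPU` are hypothetical names, and the only platform fact relied on is that browsers implementing WebGPU expose `navigator.gpu`.

```typescript
// Hypothetical names; a sketch of the fallback logic only.
type Backend = "webgpu" | "wasm";

// Pure decision: prefer WebGPU when it is available, otherwise use WASM.
function chooseBackend(hasWebGPU: boolean): Backend {
  return hasWebGPU ? "webgpu" : "wasm";
}

// Feature check: browsers that implement WebGPU expose `navigator.gpu`.
// Reading it through globalThis keeps the sketch runnable outside a browser.
function detectWebGPU(): boolean {
  const nav = (globalThis as unknown as { navigator?: { gpu?: unknown } }).navigator;
  return nav !== undefined && nav.gpu !== undefined;
}

const backend: Backend = chooseBackend(detectWebGPU());
console.log(`Using backend: ${backend}`);
```

A stricter check would also await `navigator.gpu.requestAdapter()`, which can resolve to `null` on machines where the API exists but no suitable adapter is found.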

Audio never leaves your browser: all processing happens locally. Export transcriptions as plain text, SRT subtitles, or WebVTT captions.

Try our TTS tools: Kokoro TTS (54 voices · Best quality) · Kitten TTS (8 voices · Lightest) · Piper TTS (25 voices · Fastest CPU)

Getting Started with Whisper STT

Transcribe audio to text directly in your browser. No API key, no signup: just upload audio and get accurate transcription with word-level timestamps.

1. Choose Model Size

Tiny (~40MB) for quick tests, Base (~76MB) for balanced speed and accuracy, Small (~240MB) for best quality. Start with Tiny to verify your setup.
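If you want to pick a size programmatically, one simple approach is to take the largest model that fits a download budget. The sketch below uses the approximate sizes quoted above; the function name and the idea of a budget in megabytes are illustrative assumptions, not part of the tool.

```typescript
type ModelSize = "tiny" | "base" | "small";

// Approximate download sizes in MB, taken from the descriptions above.
const MODEL_DOWNLOAD_MB: Record<ModelSize, number> = {
  tiny: 40,
  base: 76,
  small: 240,
};

// Hypothetical helper: return the largest model whose download fits the
// budget, falling back to "tiny" when even the smallest one does not fit.
function largestModelWithin(budgetMb: number): ModelSize {
  const largestFirst: ModelSize[] = ["small", "base", "tiny"];
  for (const size of largestFirst) {
    if (MODEL_DOWNLOAD_MB[size] <= budgetMb) return size;
  }
  return "tiny";
}

console.log(largestModelWithin(100));
```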

2. Upload or Record Audio

Upload an audio file (WAV, MP3, WebM, etc.) or record directly in the browser. The tool decodes audio in a background worker for smooth performance.
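Whisper models take 16 kHz mono audio, so decoded audio usually has to be downmixed and resampled before transcription. How this tool does that is not stated; the sketch below shows one straightforward way, with hypothetical function names and plain linear interpolation standing in for whatever resampler is really used.

```typescript
// Average all channels into a single mono channel.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Resample by linear interpolation between neighbouring input samples.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Example: stereo 48 kHz samples become mono 16 kHz samples.
const stereo = [new Float32Array([0, 1, 2, 3, 4, 5]), new Float32Array([0, 1, 2, 3, 4, 5])];
const whisperInput = resampleLinear(downmixToMono(stereo), 48000, 16000);
console.log(whisperInput.length);
```

Linear interpolation applies no low-pass filter before downsampling, so it can alias. In a browser, decoding through an `AudioContext` created with `sampleRate: 16000` typically handles resampling for you and is usually the better choice.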

3. Transcribe

Whisper auto-detects the spoken language and produces word-level timestamps. Streaming mode shows results in real time as transcription progresses.
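Word-level timestamps are most useful once they are grouped into readable caption lines. The sketch below assumes each word arrives as a `{ text, start, end }` record with times in seconds; that shape, the type names, and the 42-character default are illustrative assumptions, not the tool's documented output format.

```typescript
// Assumed shape of one recognized word; times are in seconds.
interface TimedWord {
  text: string;
  start: number;
  end: number;
}

interface CaptionCue {
  start: number;
  end: number;
  text: string;
}

// Greedily pack consecutive words into cues of at most maxChars characters.
function groupWordsIntoCues(words: TimedWord[], maxChars: number = 42): CaptionCue[] {
  const cues: CaptionCue[] = [];
  let current: CaptionCue | null = null;
  for (const word of words) {
    if (current !== null && current.text.length + 1 + word.text.length <= maxChars) {
      // The word fits: extend the current cue's text and end time.
      current.text += " " + word.text;
      current.end = word.end;
    } else {
      // Start a new cue at this word.
      current = { start: word.start, end: word.end, text: word.text };
      cues.push(current);
    }
  }
  return cues;
}

const sampleWords: TimedWord[] = [
  { text: "Hello", start: 0.0, end: 0.4 },
  { text: "world", start: 0.5, end: 0.9 },
  { text: "again", start: 1.0, end: 1.5 },
];
console.log(groupWordsIntoCues(sampleWords, 11));
```

With a limit of 11 characters, the first two words share one cue and the third starts a new one. A production captioner would likely also break on pauses and punctuation rather than on length alone.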

4. Export Results

Download as plain text, SRT subtitles, or WebVTT captions. SRT and VTT include word-level timestamps for video captioning.
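The two caption formats differ mainly in their header and in the millisecond separator: SRT numbers each cue and writes `00:00:01,250`, while WebVTT starts with a `WEBVTT` line and writes `00:00:01.250`. A minimal sketch of both serializers, assuming cues are `{ start, end, text }` records with times in seconds (the type and function names are hypothetical):

```typescript
// Assumed cue shape; times are in seconds.
interface Cue {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function zeroPad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as HH:MM:SS<sep>mmm; SRT uses "," and WebVTT uses ".".
function formatTimestamp(seconds: number, separator: "," | "."): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return `${zeroPad(h, 2)}:${zeroPad(m, 2)}:${zeroPad(s, 2)}${separator}${zeroPad(ms, 3)}`;
}

// SRT: numbered cues separated by blank lines.
function toSrt(cues: Cue[]): string {
  return cues
    .map((cue, i) =>
      `${i + 1}\n${formatTimestamp(cue.start, ",")} --> ${formatTimestamp(cue.end, ",")}\n${cue.text}\n`)
    .join("\n");
}

// WebVTT: "WEBVTT" header, blank line, then cues separated by blank lines.
function toVtt(cues: Cue[]): string {
  const body = cues
    .map((cue) =>
      `${formatTimestamp(cue.start, ".")} --> ${formatTimestamp(cue.end, ".")}\n${cue.text}\n`)
    .join("\n");
  return `WEBVTT\n\n${body}`;
}

console.log(toSrt([{ start: 0, end: 1.25, text: "Hello" }]));
```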

Tips for Accurate Transcription

1. Use clear audio. Low background noise and clear speech produce the best results. If possible, use a good microphone and record in a quiet environment.

2. Choose the right model. Tiny works well for quick drafts. For production use (subtitles, meeting notes, accessibility), use Small for highest accuracy.

3. Let auto-detection work. Whisper detects the spoken language automatically. If you have multilingual audio, let it auto-detect rather than forcing a language.

4. Use WebGPU. Chrome and Edge with WebGPU support transcribe significantly faster. The tool auto-selects the best available backend.