Browser STT workspace

Whisper Speech to Text

Upload audio or video, record from your microphone, or load a direct media URL. Transcribe privately in your browser and export TXT, SRT, or VTT.

Private transcription · Subtitle exports · Whisper models


40-240 MB model (tiny/base/small)

99 languages supported

TXT/SRT + WebVTT exports

WebGPU + WASM fallback (GPU+CPU)

STT works best on desktop

Speech recognition uses WebGPU/WASM. Desktop Chrome or Edge gives the most reliable results.

About Whisper STT

Whisper is OpenAI's state-of-the-art speech recognition model, now running entirely in your browser. It supports 99 languages with automatic language detection, and produces highly accurate transcriptions with word-level timestamps.

Choose from three model sizes: Tiny (~40MB, fastest), Base (~76MB, good balance), or Small (~240MB, best quality). The model automatically uses WebGPU when available, falling back to WASM for broad compatibility. Audio is processed in a background thread so the UI stays responsive.
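As a rough illustration of the selection logic described above, the sketch below picks the largest model that fits a download budget and chooses a backend from a feature flag. The sizes are the approximate figures quoted on this page; the names `MODELS`, `pickModel`, and `pickBackend` are hypothetical and are not taken from this tool's source.

```typescript
type ModelName = "tiny" | "base" | "small";
type Backend = "webgpu" | "wasm";

// Approximate download sizes quoted on this page (MB), smallest first.
const MODELS: ReadonlyArray<{ name: ModelName; sizeMB: number }> = [
  { name: "tiny", sizeMB: 40 },
  { name: "base", sizeMB: 76 },
  { name: "small", sizeMB: 240 },
];

// Hypothetical helper: largest model whose download fits the budget,
// falling back to the smallest model if nothing fits.
function pickModel(budgetMB: number): ModelName {
  let choice: ModelName = "tiny";
  for (const m of MODELS) {
    if (m.sizeMB <= budgetMB) choice = m.name;
  }
  return choice;
}

// Hypothetical helper: prefer WebGPU when the browser exposes it, else WASM.
function pickBackend(hasWebGPU: boolean): Backend {
  return hasWebGPU ? "webgpu" : "wasm";
}

// In a browser, the flag could be derived from `"gpu" in navigator`.
```

For example, `pickModel(100)` selects Base, and `pickBackend(false)` selects the WASM fallback.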

Audio never leaves your browser — all processing happens locally. Export transcriptions as plain text, SRT subtitles, or WebVTT captions.

Try our TTS tools: Kokoro TTS (54 voices · Best quality) · Kitten TTS (8 voices · Lightest) · Piper TTS (25 voices · Fastest CPU) · Supertonic TTS (5 languages · Local)

Getting Started with Whisper STT

Transcribe audio to text directly in your browser. No API key, no signup — just upload audio and get accurate transcription with word-level timestamps.

1. Choose Model Size

Tiny (~40MB) for quick tests, Base (~76MB) for balanced speed and accuracy, Small (~240MB) for best quality. Start with Tiny to verify your setup.

2. Upload or Record Audio

Upload an audio file (WAV, MP3, WebM, etc.) or record directly in the browser. The tool decodes audio in a background worker for smooth performance.
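Whisper models take 16 kHz mono audio as input, so decoded audio generally has to be downmixed and resampled before transcription. The sketch below shows one simple way to do that with linear interpolation. How this tool actually decodes and resamples is not described here, so the approach and the names `downmixToMono` and `resampleLinear` are assumptions for illustration only.

```typescript
const WHISPER_SAMPLE_RATE = 16000; // Whisper models expect 16 kHz mono input

// Average all channels into a single mono channel.
function downmixToMono(channels: Float32Array[]): Float32Array {
  const first = channels[0];
  if (!first) return new Float32Array(0);
  const mono = new Float32Array(first.length);
  for (const ch of channels) {
    for (let i = 0; i < first.length; i++) mono[i] += ch[i] / channels.length;
  }
  return mono;
}

// Resample by linear interpolation (simple, but with no anti-aliasing filter).
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = WHISPER_SAMPLE_RATE,
): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

In a page like this one, both steps would plausibly run inside the background worker so the main thread stays free. A production resampler would normally low-pass filter before downsampling.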

3. Transcribe

Whisper auto-detects the spoken language and produces word-level timestamps. Streaming mode shows results in real time as transcription progresses.
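Word-level timestamps are usually grouped into short caption cues before they are displayed or exported. The sketch below shows one possible grouping rule: start a new cue when the text would get too long or the cue would span too much time. The `Word` shape, the `groupWords` name, and the default limits are assumptions, not this tool's actual output format.

```typescript
interface Word { text: string; start: number; end: number } // times in seconds
interface Cue { text: string; start: number; end: number }

// Hypothetical helper: merge consecutive words into cues, closing a cue
// when adding the next word would exceed maxChars or maxSeconds.
function groupWords(words: Word[], maxChars = 42, maxSeconds = 5): Cue[] {
  const cues: Cue[] = [];
  let current: Cue | null = null;
  for (const w of words) {
    if (
      current !== null &&
      current.text.length + 1 + w.text.length <= maxChars &&
      w.end - current.start <= maxSeconds
    ) {
      // The word fits: extend the open cue.
      current.text += " " + w.text;
      current.end = w.end;
    } else {
      // The word does not fit (or no cue is open): close and start a new cue.
      if (current !== null) cues.push(current);
      current = { text: w.text, start: w.start, end: w.end };
    }
  }
  if (current !== null) cues.push(current);
  return cues;
}
```

Because each cue only depends on the words seen so far, the same rule could be applied incrementally as streaming results arrive.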

4. Export Results

Download as plain text, SRT subtitles, or WebVTT captions. SRT and VTT include word-level timestamps for video captioning.
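To make the difference between the two subtitle formats concrete, here is a minimal sketch of how timed segments could be serialized. SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header line and uses a period. The `Segment` type and the function names are hypothetical; this is not the tool's own exporter.

```typescript
interface Segment { text: string; start: number; end: number } // times in seconds

// Format seconds as HH:MM:SS<sep>mmm; SRT uses "," and WebVTT uses ".".
function formatTimestamp(seconds: number, sep: "," | "."): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${sep}${pad(ms, 3)}`;
}

// SRT: numbered cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${formatTimestamp(seg.start, ",")} --> ${formatTimestamp(seg.end, ",")}\n${seg.text}\n`)
    .join("\n");
}

// WebVTT: "WEBVTT" header, a blank line, then cues separated by blank lines.
function toVtt(segments: Segment[]): string {
  const cues = segments.map((seg) =>
    `${formatTimestamp(seg.start, ".")} --> ${formatTimestamp(seg.end, ".")}\n${seg.text}\n`);
  return ["WEBVTT\n", ...cues].join("\n");
}
```

A plain-text export is the simplest case: it would just join the segment texts and drop the timing.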

Tips for Accurate Transcription

Use clear audio. Low background noise and clear speech produce the best results. If possible, use a good microphone and record in a quiet environment.

Choose the right model. Tiny works well for quick drafts. For production use — subtitles, meeting notes, accessibility — use Small for highest accuracy.

Let auto-detection work. Whisper detects the spoken language automatically. If you have multilingual audio, let it auto-detect rather than forcing a language.

Use WebGPU. Chrome and Edge with WebGPU support transcribe significantly faster. The tool auto-selects the best available backend.