Tired of seven-minute voice notes from family or colleagues? Drop the file into a free Windows app and read the transcript in seconds. It supports 96 languages, and in the default Local Mode the audio never leaves your PC.
No signup, no upload, no per-minute fee.
If you only use WhatsApp on your phone, grab the desktop client from whatsapp.com or the Microsoft Store. Open it, scan the QR code with your phone, and your full chat history syncs in. WhatsApp Desktop gives you a reliable right-click Save As menu on voice notes that WhatsApp Web in a browser often does not.
Find the voice message in your chat. Right-click on it and pick Save As. WhatsApp Desktop will offer a file name and save the audio to your Downloads folder as either .opus or .ogg. Both are standard Opus-codec files and StarWhisper handles them natively. You do not need to convert anything.
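If you want to confirm that a saved file really is an Ogg Opus recording before dropping it in, a quick header check is enough. The sketch below is illustrative and not part of StarWhisper; the function names `looks_like_ogg_opus` and `find_voice_notes` are hypothetical. It relies on two properties of the format: every Ogg page begins with the bytes `OggS`, and the first packet of an Opus stream begins with `OpusHead`.

```python
from pathlib import Path


def looks_like_ogg_opus(path):
    """Return True if the file starts like an Ogg container carrying Opus audio."""
    with open(path, "rb") as f:
        head = f.read(64)  # first Ogg page header plus the start of the first packet
    # Every Ogg page starts with the capture pattern "OggS";
    # the first packet of an Opus stream starts with "OpusHead".
    return head.startswith(b"OggS") and b"OpusHead" in head


def find_voice_notes(folder):
    """List .opus and .ogg files in a folder that pass the header check."""
    folder = Path(folder)
    candidates = [p for p in folder.iterdir() if p.suffix.lower() in {".opus", ".ogg"}]
    return sorted(p for p in candidates if looks_like_ogg_opus(p))
```

Calling `find_voice_notes(Path.home() / "Downloads")` would list the saved voice notes, assuming your browser or WhatsApp Desktop saves to the standard Downloads location.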
Download StarWhisper from the homepage. The installer is small and the setup walks you through a one-time model download so the app can work offline afterward. The free tier covers 500 words per day and 3,500 per week, enough for typical personal use without a Pro plan.
Open StarWhisper and drag the .ogg or .opus file from File Explorer onto the window. The app picks the language automatically and starts transcribing. A typical 30-second voice note finishes in two to five seconds on a modern CPU. With an NVIDIA GPU it is effectively instant.
The text appears in the StarWhisper window. Copy it to clipboard, paste it into a chat or doc, or save it to a .txt file. The voice note is now searchable, skimmable, quotable text. You never had to listen to the whole thing.
Specific reasons, not vague benefits.
Default Local Mode runs OpenAI Whisper on your own machine. No upload, no third-party storage, no servers seeing your family group chat.
Whether the voice note is in Spanish, Hindi, Arabic, Mandarin, Polish, or any of 96 supported languages, StarWhisper picks the language automatically.
WhatsApp's .opus and .ogg files load directly. No third-party converter, no online MP3 ripper, no pasted command-line ffmpeg invocations.
One-time model download, then full offline operation. Useful for flights, sensitive recordings, or anywhere you do not trust the network.
Covers around 5 to 10 typical voice notes a day with no signup wall, no credit card, no trial countdown.
NVIDIA GPU owners get effectively instant transcription via CUDA.
WhatsApp voice messages have a particular problem. They are convenient for the sender, who can monologue while walking, but they are inefficient for the receiver, who has to find headphones, put them in, and listen at real-time speed to extract maybe twenty seconds of actual information. A six-minute voice note from a relative often contains one date, one question, and a lot of context. Reading the transcript in fifteen seconds is a strictly better experience.
The other reason: searchability. Once a voice note is transcribed, you can search your chat history for the words inside it. WhatsApp's own search only indexes text messages, so months of voice notes become an opaque black box. Saving transcripts to a notes app or document means your voice-note information becomes retrievable later. People who get a lot of voice notes from a particular contact (a parent, a manager, a project lead) report that converting them to text changes the relationship with the chat itself.
Cloud transcription services exist, but most charge per minute, ask you to upload sensitive personal audio to their servers, and require a signup with a credit card. The math gets bad quickly: at 10 cents per minute and ten voice notes a week averaging two minutes each, that is about 8 dollars a month for what is genuinely a small task. StarWhisper is a free local install that transcribes at no cost up to the daily word cap. For most casual WhatsApp users that cap is rarely reached.
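The cost figure above is simple arithmetic, sketched here so you can plug in your own numbers. The function name and the four-weeks-per-month simplification are assumptions made for illustration; the 10-cent rate is the example rate from the text, not a quote from any particular service.

```python
def monthly_cloud_cost(notes_per_week, minutes_per_note, dollars_per_minute, weeks_per_month=4):
    """Estimate a monthly bill for per-minute cloud transcription."""
    minutes_per_week = notes_per_week * minutes_per_note
    return minutes_per_week * dollars_per_minute * weeks_per_month


# Ten two-minute voice notes a week at 10 cents per minute.
cost = monthly_cloud_cost(notes_per_week=10, minutes_per_note=2, dollars_per_minute=0.10)
print(f"${cost:.2f} per month")  # $8.00 per month
```

Passing `weeks_per_month=52 / 12` instead gives a slightly higher figure, since an average month is a little longer than four weeks.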
The fastest path is WhatsApp Desktop on the same Windows PC as StarWhisper. Once linked, every voice note in every chat is right-clickable to save. This is the recommended setup for anyone who plans to transcribe voice notes more than occasionally.
Already covered in the steps above. Right-click, Save As, drag into StarWhisper. Two clicks of friction. This works for every voice note in any chat, individual or group, as long as you have the desktop app linked.
On Android, long-press the voice note, tap the three-dot menu, choose Share, and send to your own email address as an attachment. On iPhone, long-press the voice note, tap Forward, then the share-arrow icon, then choose Mail. Open Gmail or Outlook on Windows, download the attachment, and drag the resulting file into StarWhisper. The file usually arrives as .opus on Android or .m4a on iPhone. StarWhisper handles both.
For batch transcription of months of voice notes, open the chat on your phone, go to chat settings, choose Export Chat, and pick the option to include media. WhatsApp produces a zip file with every audio attachment as .opus. Transfer the zip to your PC, extract it, and drop the folder onto StarWhisper. The app will process every voice note in sequence and label each transcript by file name. This is what people use when migrating years of family chat audio into searchable text.
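If the export zip also contains the chat log and photos, you may want to pull out only the audio before dropping the folder onto StarWhisper. The following is a minimal sketch using Python's standard `zipfile` module; the function name `extract_voice_notes` and the set of extensions are assumptions chosen for illustration, and the script is not part of StarWhisper.

```python
import zipfile
from pathlib import Path

# Extensions treated as voice-note audio (an assumption; adjust as needed).
AUDIO_SUFFIXES = {".opus", ".ogg", ".m4a"}


def extract_voice_notes(zip_path, out_dir):
    """Extract only the audio attachments from a chat export zip."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Keep only the file name so entries cannot write outside out_dir.
            name = Path(info.filename).name
            if Path(name).suffix.lower() not in AUDIO_SUFFIXES:
                continue
            target = out_dir / name
            with archive.open(info) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(target)
    return sorted(extracted)
```

The returned list is sorted by file name, which makes it easy to match each transcript back to its source file afterward. Entries that share a file name would overwrite each other in this simple version.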
StarWhisper's free plan gives you 500 words per day, capped at 3,500 words per week. A typical 60-second WhatsApp voice note transcribes to around 150 words of text. That works out to roughly 3 one-minute voice notes per day on the free tier, or about 23 per week; shorter notes stretch further. For most personal WhatsApp use, this is enough.
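The free-tier arithmetic can be spelled out in a few lines. This sketch uses only the figures stated above (500 words per day, 3,500 per week, about 150 words per minute of speech); the function name is hypothetical, and real notes will vary with how fast the speaker talks.

```python
DAILY_WORD_CAP = 500
WEEKLY_WORD_CAP = 3500
WORDS_PER_MINUTE = 150  # from the estimate that a 60-second note is about 150 words


def notes_within_cap(word_cap, note_seconds):
    """How many whole voice notes of a given length fit under a word cap."""
    words_per_note = WORDS_PER_MINUTE * note_seconds / 60
    return int(word_cap // words_per_note)


print(notes_within_cap(DAILY_WORD_CAP, 60))   # 3 one-minute notes per day
print(notes_within_cap(WEEKLY_WORD_CAP, 60))  # 23 one-minute notes per week
print(notes_within_cap(DAILY_WORD_CAP, 30))   # 6 thirty-second notes per day
```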
If you run a small business through WhatsApp Business, get a high volume of voice notes from clients, or do bulk historical transcription, the limits will start to bite. The Pro plan is 10 dollars per month or 80 dollars per year and removes the word cap entirely. Pro plan details and pricing are on the dedicated page. There is also a free 7-day trial that unlocks unlimited access if you want to verify it works for your workload before paying.
Free Local Mode and Pro Local Mode produce identical transcripts. The Pro plan does not get a different or smarter model. It just removes the word cap and adds some workflow features (custom hotkeys, vocabulary, priority cloud fallback if you opt in). For anyone who only wants to read the occasional long voice note from a parent, the free tier is genuinely sufficient.
Voice notes from friends and family are some of the most personal audio data on your phone. They contain medical complaints, relationship drama, opinions about coworkers, family secrets, and offhand comments people would not want preserved on a server somewhere. Uploading that audio to a cloud transcription service means a third party gets a copy.
StarWhisper runs in Local Mode by default. The audio file you drag in is decoded on your CPU or GPU, the Whisper model on your hard drive does the transcription, and the resulting text appears on screen. Nothing is uploaded. Nothing is logged on a remote server. Nothing is reviewed by humans for quality assurance. You can verify this yourself by unplugging your network connection before processing a file; the transcription still works.
Cloud Mode exists as an opt-in toggle in Settings if you specifically want to use the OpenAI Whisper API for a small accuracy improvement on edge cases. It is clearly labeled, off by default, and never silently switched on. For sensitive personal voice notes, just leave the default settings alone. For the deeper privacy story, see the privacy and offline architecture page.
Transcription speed depends on your hardware and the length of the voice note. Rough numbers from the Whisper medium model on common machines:
| Hardware | 30-sec voice note | 2-min voice note | 10-min voice note |
|---|---|---|---|
| Modern laptop CPU (i7 or Ryzen 7) | 2 to 5 sec | 10 to 20 sec | 1 to 2 min |
| NVIDIA RTX 3060 (CUDA) | under 1 sec | 2 to 4 sec | 10 to 20 sec |
| NVIDIA RTX 4090 (CUDA) | under 1 sec | under 1 sec | 5 to 8 sec |
| Older CPU (5+ years) | 5 to 10 sec | 30 to 60 sec | 3 to 6 min |
The Whisper model size also matters. StarWhisper defaults to a balanced choice (medium) but you can switch to the smaller (faster, slightly less accurate) or larger (slower, more accurate) models in Settings. For voice notes, the default is almost always fine. The big quality gap is between built-in Windows dictation and Whisper, not between Whisper model sizes.
Honest disclosure of where it works less well. First, very noisy audio. Voice notes recorded outdoors in heavy traffic or wind will see accuracy drop from 95-plus percent to maybe 80 percent. The transcript will still be readable, but you might see a few wrong words. Second, heavy code-switching mid-sentence. If a voice note flips between two languages every other word, Whisper sometimes picks one and transliterates the other. Third, very strong regional dialects in certain languages. Standard Spanish from Spain, Mexico, and Argentina all work well; very thick rural dialects can confuse the model.
For all of these, the workaround is the same: try the transcription and accept that the result will be a useful first draft rather than a perfect record. For most personal voice notes the accuracy is well past good enough.
There is also no built-in speaker diarization for group-chat voice notes that have multiple voices in one recording (rare, but it happens). StarWhisper transcribes everything as a single block of text. You can manually split it after the fact if you need that.
If you found this useful, the same pipeline works for other audio types. Many people install StarWhisper to handle WhatsApp voice notes and then discover they also want it for interview transcription, podcast transcription, or meeting transcription. The drag-and-drop file flow is the same; only the audio source changes. There is also a real-time dictation mode for typing into any app by voice, which is a separate use case but the same install.