Windows Voice Typing (Win+H) uses Microsoft's pre-transformer speech models. Accuracy on clear English hovers around 88 percent. Accents break it. Other languages break it. OpenAI Whisper is the modern alternative: accuracy around 97 to 98 percent on clear English, strong on accents and 96 languages, and it runs free, locally, on the same Windows PC.
Same microphone, same Windows PC, two different speech models.
Microsoft's built-in speech recognition is convenient but runs on an older recognition stack. Accuracy on clear American English benchmarks around 88 percent (roughly one error every eight words). On accented English it falls into the 70s. On most non-English languages it is unusable for actual writing. It is free, it is built in, it works for grocery lists.
Whisper is a modern transformer speech recognition model from OpenAI, trained on 680,000 hours of audio. Independent benchmarks put accuracy around 97 to 98 percent on clear English, with strong performance on accents and 96 languages. StarWhisper bundles Whisper into a free Windows app that runs locally on your PC. Same microphone. Substantially better text.
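To make those percentages concrete, here is a minimal sketch that converts an accuracy figure into errors you would actually have to fix. The function names are illustrative, and 0.975 is an assumed midpoint of the 97 to 98 percent range quoted above, not a measured value.

```python
def errors_per_thousand_words(accuracy: float) -> int:
    """Expected number of wrong words in a 1,000-word dictation."""
    return round(1000 * (1.0 - accuracy))


def words_per_error(accuracy: float) -> float:
    """Average number of words dictated between transcription errors."""
    return 1.0 / (1.0 - accuracy)


# 0.88 is the Win+H figure from the text; 0.975 is an assumed midpoint
# of the 97-98 percent range quoted for Whisper.
for label, accuracy in [("Win+H", 0.88), ("Whisper", 0.975)]:
    print(
        f"{label}: ~{errors_per_thousand_words(accuracy)} errors per 1,000 words, "
        f"one error every ~{words_per_error(accuracy):.0f} words"
    )
```

At 88 percent that works out to about 120 corrections per 1,000 words, against about 25 at 97.5 percent. A ten-point accuracy gap means nearly five times as much cleanup.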
Specific accuracy differences you will notice on day one
Indian English, Scottish, Singaporean, South African, Caribbean, Australian. Whisper was trained on all of them. Win+H was trained primarily on American English and shows it. The gap is much larger than the headline 10 points.
Whisper handles 96 languages. Win+H supports a much shorter list and accuracy varies widely. For German, French, Spanish, Mandarin, Japanese, Korean, Hindi, Arabic, Russian, and most others, the gap is functionally the difference between usable and unusable.
Whisper handles programming terms, medical vocabulary, legal language, and scientific terminology more accurately because the training corpus included that content. Win+H tends to autocorrect technical words into common English equivalents.
Names of people, places, brands, products. Whisper preserves more of them. Win+H frequently mangles non-English names or substitutes a phonetic guess.
Whisper holds context across sentences and produces more coherent paragraphs. Win+H is optimized for short utterances and tends to lose the thread on multi-sentence dictation.
Whisper inserts punctuation contextually and respects sentence boundaries. By default, Win+H requires you to say "comma" and "period" explicitly, which slows down natural speech and produces awkward transcripts.
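To see why spoken punctuation commands feel clumsy, it helps to look at what command-based punctuation amounts to. The sketch below is an illustration only: the command table and the `apply_spoken_punctuation` helper are hypothetical and do not describe how Windows implements it.

```python
import re

# Hypothetical command table: spoken phrase -> punctuation mark.
SPOKEN_PUNCTUATION = {
    "comma": ",",
    "period": ".",
    "question mark": "?",
}


def apply_spoken_punctuation(raw: str) -> str:
    """Replace dictated punctuation commands with the marks themselves."""
    text = raw
    for spoken, mark in SPOKEN_PUNCTUATION.items():
        # Swallow the whitespace before the command so that
        # "world period" becomes "world." rather than "world ."
        pattern = rf"\s*\b{re.escape(spoken)}\b"
        text = re.sub(pattern, mark, text, flags=re.IGNORECASE)
    return text


print(apply_spoken_punctuation("hello comma world period"))
```

A lookup like this cannot tell a command from an ordinary word, so a phrase such as "the trial period" would be rewritten too. A model that punctuates from context avoids that class of mistake.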
Microsoft has shipped speech recognition on Windows for over twenty years. Windows Vista had Windows Speech Recognition (WSR), the keyboard-driven dictation tool that almost nobody used. Windows 10 added a Voice Typing redesign in 2017, accessible via the Win+H hotkey. Windows 11 polished the UI further. What has not changed in any meaningful way is the underlying speech model.
The underlying acoustic model in Windows Voice Typing dates to the pre-transformer era. It uses recurrent neural network architectures trained on a relatively small corpus of mostly American English. By contrast, the field has moved on twice over: first to transformer-based models, then to massive-scale multilingual pretraining. Whisper is the most prominent open example of the second wave, with 680,000 hours of training data across 96 languages.
The accuracy gap is structural, not a tuning problem. Microsoft is presumably working on next-generation speech, but for now, the built-in Windows tool sits on top of older tech. If you have ever wondered why dictation on your Pixel phone or your iPhone feels more accurate than on your Windows laptop, it is the same explanation: those phones run newer models.
The accuracy difference shows up immediately on real sentences. Below are typical examples from user reports. The spoken column is what was said. The Win+H column is the verbatim output. The Whisper column is what StarWhisper produced from identical audio.
| Spoken | Win+H output | Whisper (StarWhisper) output |
|---|---|---|
| "The deployment went to staging at 3 PM" | the deployment went to staging at three p m | The deployment went to staging at 3 PM. |
| "Schedule a meeting with Aoife on Thursday" | schedule a meeting with eva on Thursday | Schedule a meeting with Aoife on Thursday. |
| "The patient reported intermittent dyspnea" | the patient reported intermittent disney | The patient reported intermittent dyspnea. |
| "Refactor the auth middleware to use JWT tokens" | refactor the off middleware to use jay w t tokens | Refactor the auth middleware to use JWT tokens. |
| "Send the contract to [email protected]" | send the contract to monara at example dot com | Send the contract to [email protected]. |
These examples are not cherry-picked. They are representative of the kind of error you see if you dictate for any length of time with anything other than the most generic American English vocabulary.
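If you want to score transcripts like these yourself, the usual metric is word error rate (WER): the word-level edit distance between what was said and what was typed, divided by the number of words spoken. The sketch below is a minimal version. The `normalize` step (lowercasing and stripping punctuation) is an assumption, and published benchmarks normalize text in their own ways.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


spoken = "Schedule a meeting with Aoife on Thursday"
print(word_error_rate(spoken, "schedule a meeting with eva on Thursday"))
```

One wrong name in a seven-word sentence gives a WER of 1/7, about 14 percent. Formatting differences count as errors too. Under this normalization "three p m" against "3 PM" scores three edits, which is one reason raw WER figures depend on how the text is normalized.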
The accuracy difference is not magic; it is architecture and scale. Whisper is a sequence-to-sequence transformer trained end-to-end on a massive, diverse audio corpus. StarWhisper bundles the Whisper model and runs it on your Windows PC locally.
OpenAI trained Whisper on roughly 680,000 hours of audio collected from the web, including 117,000 hours of multilingual data and 125,000 hours of translation data. This is roughly two orders of magnitude more than what the older Microsoft stack was trained on. Larger and more diverse training data is the single biggest reason Whisper handles accents, technical vocabulary, and non-English languages well.
Whisper uses an encoder-decoder transformer, the same general architecture as modern translation models and a close relative of the decoder-only transformers behind GPT. This architecture is much better at long-range context than the recurrent models that dominated speech recognition through the 2010s. It is why Whisper produces coherent paragraphs while older systems produce coherent sentences and lose the thread between them.
Whisper was trained jointly on multiple speech tasks: transcription, translation, language identification, voice activity detection. This multitask setup produces a model that is robust in conditions where any single-task model would degrade. In practice it means Whisper handles silent gaps, background noise, and language switching gracefully.
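In the open-source Whisper release, the task is selected by a short run of special tokens placed at the start of the decoder's input, which is how one model covers several jobs. The sketch below builds that prefix as plain strings. `build_decoder_prompt` is a hypothetical helper, and the token spellings follow the published Whisper format as best recalled, so check them against the model's tokenizer before relying on them.

```python
def build_decoder_prompt(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Assemble the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    prompt = [
        "<|startoftranscript|>",
        f"<|{language}|>",  # language tag, e.g. <|en|> or <|de|>
        f"<|{task}|>",      # same-language transcript, or translation to English
    ]
    if not timestamps:
        prompt.append("<|notimestamps|>")  # request plain text without time marks
    return prompt


print(build_decoder_prompt("en", "transcribe"))
print(build_decoder_prompt("de", "translate", timestamps=True))
```

Because every task shares one decoder and differs only in this prefix, what the model learns from one task can carry over to the others.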
Because Whisper is open source and reasonably sized, it fits on a consumer Windows machine and runs at usable speeds on CPU. That is why StarWhisper can package it as a free local tool. No cloud subscription is involved, no audio leaves your PC, and the accuracy advantage applies regardless of internet connectivity. The full detail of how the model runs locally is on the privacy and offline features page.
Windows Voice Typing is free, it is built in, it ships on every Windows 10 and 11 machine, and it requires zero setup. For the case where you want to dictate a single sentence into a text box and you do not care about accents, technical vocabulary, or non-English, it works. Many users get the same kind of value from the built-in dictation on their phones, which is also good enough for short messages.
If your dictation needs are limited to "occasional short sentence in Notepad, in clear American English, with no proper nouns," there is no reason to install anything else. The friction of installing a separate app is not worth it for one sentence every few weeks.
| Capability | Windows Voice Typing (Win+H) | StarWhisper (Whisper) |
|---|---|---|
| Clear English accuracy | ~88% | ~97-98% |
| Accented English | Weak | Strong |
| Non-English languages | Limited | 96 languages |
| Technical / medical / legal vocabulary | Mangled | Preserved |
| Auto punctuation | Manual ("comma", "period") | Automatic |
| Auto numerals (3 PM vs three p m) | No | Yes |
| Audio leaves your device | Yes (Microsoft cloud) | No (Local Mode) |
| Works offline | No | Yes |
| GPU acceleration | No | NVIDIA CUDA + Vulkan |
| Cost | Free, built-in | Free up to 500 words/day, $10/mo unlimited |
| Hotkey | Win+H (fixed) | Configurable |
| Works in any text field | Most | All |
You do not have to choose. Both can coexist. Here is the simplest path.
Most users find that within a week they stop pressing Win+H entirely because the accuracy difference is large enough that the built-in tool becomes annoying by comparison. If you want a deeper comparison of the two tools side by side, the dedicated StarWhisper vs Windows Voice Typing page covers the trade-offs in more detail.
Whisper is a real neural network and it does want some compute to run quickly, but the requirements are modest by 2026 standards.
For older or lower-spec machines, StarWhisper picks the right Whisper model size automatically. The small model runs in real time on basically any modern Windows laptop, even one with integrated graphics. The medium and large models are slower but more accurate and benefit from a GPU. Vulkan is available as a cross-vendor GPU path for AMD and Intel cards.
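For a feel of what size selection involves, here is a rough sketch using the open-source `openai-whisper` Python package. It is not StarWhisper's actual logic. `pick_model_size` is a hypothetical heuristic, and the memory figures are approximate values recalled from the Whisper README, so treat them as assumptions.

```python
from typing import Optional

# Approximate memory needed per model size, in GB, smallest first.
# Assumed values; check the current Whisper README before relying on them.
APPROX_MEMORY_GB = {"small": 2, "medium": 5, "large": 10}


def pick_model_size(gpu_memory_gb: Optional[float]) -> str:
    """Hypothetical heuristic: largest model that fits, else 'small'."""
    if gpu_memory_gb is None:  # no usable GPU, stay on the CPU-friendly model
        return "small"
    fitting = [name for name, need in APPROX_MEMORY_GB.items() if need <= gpu_memory_gb]
    return fitting[-1] if fitting else "small"


def transcribe_locally(audio_path: str, gpu_memory_gb: Optional[float] = None) -> str:
    """Transcribe one audio file on this machine; nothing is uploaded."""
    import whisper  # pip install openai-whisper (also needs ffmpeg on PATH)

    model = whisper.load_model(pick_model_size(gpu_memory_gb))
    result = model.transcribe(audio_path)
    return result["text"].strip()


print(pick_model_size(None))  # CPU-only laptop
print(pick_model_size(8))     # mid-range GPU
```

Calling `transcribe_locally("memo.wav", gpu_memory_gb=8)` would load the medium model under these assumptions. The file name is a placeholder.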
If your reason for asking "why is Windows dictation so bad" is that you want a free local fix that respects your hardware, the short answer is that this works on machines you already own. There is more detail on the professional accuracy features page.
This is a common Win+H complaint. The fix from Microsoft's support docs is usually to reset speech permissions or reinstall language packs. If you have hit this multiple times and want a more stable tool, installing a separate dictation app is a reasonable workaround. StarWhisper runs independently of the Windows speech stack, so it does not break in the same ways.
Win+H does not auto-punctuate by default. You can enable a setting called "auto-punctuation" in some recent Windows builds but the behavior is inconsistent. Whisper handles punctuation contextually based on sentence structure, so spoken pauses become commas, ends become periods, and so on, without manual intervention.
This is the single most common complaint and the one with the largest fix. Whisper transcribes accented English almost as accurately as it transcribes native speakers. If your accent is anything other than American, the gap is large enough that switching to a Whisper-based tool feels like getting glasses for the first time.
Win+H works in most standard Windows text fields but has edge cases in particular apps. StarWhisper uses the same paste mechanism as any other Windows IME, so it works wherever your keyboard works, including in apps where Win+H fails. This applies to Word, Outlook, Chrome address bars, Slack, and so on. The dedicated offline voice dictation FAQ walks through the compatibility list.
The free plan covers 500 words per day, which is enough to evaluate the accuracy difference on real work for a week or two. If you find yourself using dictation heavily (writers, researchers, content creators, anyone who produces more than a few thousand words per day), Pro is $10 per month or $80 per year. There is no per-seat math and no upsell tier. Pricing details are in the homepage pricing section.
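The plan arithmetic is simple enough to check in a few lines. The prices and the daily quota come from the paragraph above. The helper names are illustrative only.

```python
MONTHLY_PRICE_USD = 10
ANNUAL_PRICE_USD = 80
FREE_WORDS_PER_DAY = 500


def annual_savings(monthly: int = MONTHLY_PRICE_USD, annual: int = ANNUAL_PRICE_USD) -> int:
    """Dollars saved per year by paying annually instead of monthly."""
    return monthly * 12 - annual


def days_on_free_plan(total_words: int, daily_quota: int = FREE_WORDS_PER_DAY) -> int:
    """Days needed to dictate `total_words` within the free daily quota."""
    return -(-total_words // daily_quota)  # ceiling division


print(annual_savings())         # yearly saving from the annual plan
print(days_on_free_plan(5000))  # days to dictate a 5,000-word draft for free
```

Paying annually saves $40 a year over twelve monthly payments, and a 5,000-word draft takes ten days on the free quota.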
For writers in particular, the speed of Whisper-based dictation is the main attraction once accuracy is no longer the blocker. See voice to text for writers for the long-form writing workflow specifically.