Dictation means speaking text in real time into an app. Transcription means converting an existing audio recording to text. They are different jobs that often need different tools. StarWhisper handles both with the same Whisper engine, on Windows, locally, free for personal use.
Both turn speech into text. The difference is when the speech happens.
In dictation, you speak right now and the words appear in your active app at that moment. The audio is produced live, transcribed in real time, and you see the result immediately. The point is the text in the field, not the audio.
In transcription, you already have an audio file (a meeting recording, a podcast episode, a voice memo, an interview MP3) and you need it as a text document for searching, editing, or reading. The audio is the input; the text file is the output.
Six concrete things to know about how dictation and transcription coexist in one app.
Both dictation and file transcription use the same OpenAI Whisper model running locally on your PC. Accuracy is identical, language support is the same 96 languages, and the offline guarantee is the same. You do not need a second account or an upgrade to use both.
Bind a push-to-talk hotkey once, and you can dictate anywhere you can type. Hold to talk, release to stop. The transcribed text is pasted into the active text field via the Windows IME mechanism. It works in Word, Outlook, Slack, browsers, and any other app that accepts typed text.
For file transcription, drag any audio file (MP3, WAV, M4A, FLAC, OGG, WebM) into the StarWhisper window. The app processes the file and produces a text transcript. You can save it as TXT, copy to clipboard, or open in your editor.
Both modes run locally by default. Audio never leaves your PC. Optional cloud mode exists for users who want it, but the default is full local processing. This matters more for transcription (you may be processing sensitive meeting audio) but it also covers dictation.
The free tier's limits of 500 words per day and 3,500 words per week cover dictation and transcription combined. There is no separate quota or paywall for transcription. Pro at $10 per month removes the cap entirely for both modes.
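To make the shared quota concrete, here is a minimal sketch of how a combined daily and weekly cap could be checked. The two limits come from the text above; the function names, and the idea that both modes draw from one pair of counters, are illustrative assumptions and not StarWhisper's actual accounting.

```python
DAILY_CAP = 500      # free-tier words per day (figure from the article)
WEEKLY_CAP = 3_500   # free-tier words per week (figure from the article)


def words_remaining(used_today: int, used_this_week: int) -> int:
    """Words still available when both caps apply; the tighter cap wins."""
    daily_left = max(DAILY_CAP - used_today, 0)
    weekly_left = max(WEEKLY_CAP - used_this_week, 0)
    return min(daily_left, weekly_left)


def can_process(word_count: int, used_today: int, used_this_week: int) -> bool:
    """True if a dictation or a transcript of `word_count` words fits the quota.

    Hypothetical helper: it treats dictated words and transcribed words
    the same way, because the free tier covers both modes combined.
    """
    return word_count <= words_remaining(used_today, used_this_week)
```

For example, after 120 words today and 900 this week, 380 words remain for either mode. Late in a heavy week the weekly cap can become the binding one even on a quiet day.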
Transcription speeds up dramatically on NVIDIA GPUs via CUDA. A one-hour recording transcribes in 2 to 5 minutes on a mid-range RTX card, versus 10 to 20 minutes on CPU only. Dictation feels instant on either path.
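As a rough check on those numbers, the quoted times can be turned into a real-time factor, meaning minutes of audio processed per minute of waiting. The sketch below uses only the figures from the paragraph above; the function name is illustrative, and actual speed will depend on your hardware and the model size.

```python
def realtime_factor(audio_minutes: float, processing_minutes: float) -> float:
    """Minutes of audio processed per minute of wall-clock time."""
    if processing_minutes <= 0:
        raise ValueError("processing_minutes must be positive")
    return audio_minutes / processing_minutes


AUDIO_MINUTES = 60  # the one-hour recording from the text

# (slowest, fastest) processing times quoted above, in minutes
quoted_times = {
    "GPU": (5, 2),
    "CPU": (20, 10),
}

for label, (slowest, fastest) in quoted_times.items():
    low = realtime_factor(AUDIO_MINUTES, slowest)
    high = realtime_factor(AUDIO_MINUTES, fastest)
    print(f"{label}: {low:.0f}x to {high:.0f}x real time")
```

On the quoted figures, the GPU path works out to roughly 12x to 30x real time and the CPU path to roughly 3x to 6x.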
Most of the confusion between dictation and transcription dissolves if you ask the right starting question.
That is the entire decision tree. Almost every confusion between the two terms comes from missing one of those three categories.
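The same decision can be sketched as a small function. The two questions and all names here are one reading of the three categories this article describes (dictation, transcription, and meeting bots), not a rule taken from StarWhisper itself.

```python
from enum import Enum


class Mode(Enum):
    DICTATION = "dictation"
    TRANSCRIPTION = "transcription"
    MEETING_BOT = "meeting bot"


def pick_mode(audio_already_recorded: bool, you_are_the_speaker: bool) -> Mode:
    """Pick a category from two yes/no questions (illustrative only)."""
    if audio_already_recorded:
        # A file exists: file in, text out.
        return Mode.TRANSCRIPTION
    if you_are_the_speaker:
        # Live speech from you, going into the app you are typing in.
        return Mode.DICTATION
    # Live speech from other people, for example a meeting you attend.
    return Mode.MEETING_BOT
```

Drafting an email is `pick_mode(False, True)`, a saved Zoom recording is `pick_mode(True, False)`, and live captions for a meeting where others are talking is `pick_mode(False, False)`.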
Historically, "dictation" and "transcription" were used somewhat interchangeably because the same human (often a secretary) might do both. A boss would dictate a letter to a secretary, who would type it as the boss spoke. That is dictation, but the secretary's act of writing it down was sometimes called transcription. Later, a recording device captured the boss's voice and the secretary "transcribed" the recording. That is transcription proper.
In the software era, the two became architecturally distinct. Dictation software runs in the background, listens for a hotkey, captures live audio, transcribes it instantly, and pastes the text where you are typing. Transcription software opens a saved audio file and produces a transcript document. The user experience is completely different even if the underlying speech recognition engine is the same.
Modern tools have re-merged the two in the sense that a single app can offer both modes. StarWhisper is one example. Wispr Flow on Mac is another. But the underlying tasks are still different: one is a live input replacement, the other is a batch audio-to-text conversion.
Common scenarios people ask about, and which mode they actually want.
| Scenario | Mode | Why |
|---|---|---|
| Drafting an email or article | Dictation | You are producing the text now |
| Writing notes during your own meeting | Dictation | Your own commentary, your own choice of words |
| Turning a saved Zoom recording into notes | Transcription | The audio already exists, you want a text file |
| Getting an interview MP3 into a transcript | Transcription | File in, text out |
| Live meeting captions for everyone speaking | Meeting bot | You are a listener, not the speaker |
| Voice memos converted to text | Transcription | Recording exists already |
| Sending a Slack message hands-free | Dictation | Real-time output into the chat field |
| Podcast episode to show notes | Transcription | Audio file in, text out |
| Writing code comments by voice | Dictation | Live, into your IDE |
| YouTube video to article draft | Transcription | Video has audio, you want text |
There is a third category that does not fit cleanly into either dictation or transcription, and it is the source of a lot of confusion. Meeting transcription tools like Otter and Fireflies join a meeting as an attendee, and Zoom's own live transcription runs inside the meeting itself. Either way, they listen to everyone speak and produce a live transcript with speaker labels. They are arguably "live transcription," but they are not dictation because you are not the speaker.
Meeting bots are the right tool when you are attending a meeting you do not control, you want a record of what everyone said, and speaker labels matter so you can attribute quotes correctly. They are the wrong tool when you are dictating your own thoughts, drafting an email, or processing a single audio file you already have.
StarWhisper does not join meetings as a bot. It is push-to-talk dictation plus file transcription. For meeting bot functionality, you need a separate category of tool. There is a fuller breakdown at StarWhisper vs Otter if you want to see the side-by-side. If you mostly need meeting capture, Otter or a similar tool is a better fit. If you mostly need to draft your own text and occasionally transcribe a recording, StarWhisper covers both.
OpenAI Whisper is a sequence-to-sequence model that takes audio as input and produces text as output. It does not care whether the audio came from a microphone right now or from a file you recorded last week. The model is the same, the processing is the same, and the accuracy is the same. The difference is the wrapper around the model.
For dictation, the wrapper is a hotkey listener, a real-time audio capture pipeline, and a Windows IME hook that pastes the result. For transcription, the wrapper is a file picker, an audio decoder that handles MP3 or M4A or whatever format, and a save-to-file step. The model in the middle is identical.
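The "same model, different wrapper" idea can be sketched in a few lines. Everything here is hypothetical: the function names, the callables standing in for the hotkey capture, the paste step, the decoder, and the save step, and the stub engine that replaces Whisper so the example runs on its own. It is not StarWhisper's code.

```python
from typing import Callable, Sequence

Samples = Sequence[float]
Engine = Callable[[Samples], str]  # audio samples in, text out


def dictate(engine: Engine,
            capture_while_held: Callable[[], Samples],
            paste: Callable[[str], None]) -> str:
    """Dictation wrapper: live capture -> shared engine -> paste into the app."""
    samples = capture_while_held()   # stands in for the hotkey + microphone
    text = engine(samples)
    paste(text)                      # stands in for the paste into the field
    return text


def transcribe_file(engine: Engine,
                    path: str,
                    decode: Callable[[str], Samples],
                    save: Callable[[str], None]) -> str:
    """Transcription wrapper: decode a file -> shared engine -> save text."""
    samples = decode(path)           # stands in for the MP3/M4A/... decoder
    text = engine(samples)
    save(text)                       # stands in for the save-to-file step
    return text


def fake_engine(samples: Samples) -> str:
    """Stub used instead of a real speech model so the sketch is runnable."""
    return f"[{len(samples)} samples transcribed]"


pasted: list[str] = []
saved: list[str] = []

dictate(fake_engine, lambda: [0.0] * 16000, pasted.append)
transcribe_file(fake_engine, "interview.mp3",
                lambda path: [0.0] * 48000, saved.append)
```

Both wrappers call the same `engine`; only the steps before and after it differ, which is the point the paragraph above makes.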
This is why StarWhisper can ship both modes for the same price and with the same accuracy guarantees. It is also why, when Whisper improves (and it does, regularly), both modes get better at the same time. The economics work because Whisper is open source and the model runs on your hardware, not in someone else's cloud.
StarWhisper produces a single continuous transcript. It does not identify which person said what. If you transcribe a panel discussion or a multi-person interview, you will get all the words in order but without "Speaker 1:" or "Speaker 2:" labels. The technical name for that feature is speaker diarization, and Whisper does not include it natively. Some tools layer a separate diarization model on top of Whisper to add labels, and that is a reasonable workflow for multi-speaker content, but StarWhisper itself does not. For single-speaker dictation (you) and single-track recordings (your own voice memo, a one-person podcast), this does not matter. For multi-speaker meetings where attribution matters, you want either a tool with diarization built in or a meeting bot that captures separate audio streams per speaker.
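For readers curious what "layering a diarization model on top" involves, here is a minimal sketch of the merge step. It assumes you already have timestamped transcript segments and, from a separate diarization tool, a list of speaker turns; each segment is given the speaker whose turn overlaps it most. The data classes, names, and sample values are hypothetical, and this is not something StarWhisper does.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A timestamped piece of transcript (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A span of time that a diarization tool attributed to one speaker."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: list[Segment], turns: list[Turn],
                   unknown: str = "Unknown") -> list[tuple[str, str]]:
    """Attach to each segment the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(seg.start, seg.end, best.start, best.end) == 0.0:
            labeled.append((unknown, seg.text))   # no turn covers this segment
        else:
            labeled.append((best.speaker, seg.text))
    return labeled


# Made-up sample data for illustration.
segments = [
    Segment(0.0, 4.0, "Welcome to the panel."),
    Segment(4.2, 9.0, "Thanks for having me."),
    Segment(20.0, 22.0, "One last thing."),
]
turns = [
    Turn(0.0, 4.1, "Speaker 1"),
    Turn(4.1, 9.5, "Speaker 2"),
]

for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```

The quality of the labels depends entirely on the diarization tool that produces the turns; the merge itself is just interval arithmetic.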
For the use cases StarWhisper is designed for (personal dictation and single-track file transcription), none of these limits matter. For multi-speaker meeting analysis with summaries, a different category of tool fits better.
The most productive users tend to combine the two modes in a single working day.
Throughout the day, dictate emails, Slack messages, notes, and draft documents using the push-to-talk hotkey. The text goes straight into the relevant app. At the end of the day, batch-transcribe any recordings you collected (your own voice memos, a Zoom recording you exported, a phone call you captured). The transcripts go into your notes app for later reference. One app, two modes, zero context switching.
Dictate first drafts of blog posts and social posts using the hotkey. Record podcast episodes separately, then drop the audio file into StarWhisper to generate show notes and a full transcript for SEO. See voice-to-text for content creators for more on this workflow. Both modes feed the same content pipeline.
Conduct interviews with a recorder app, then transcribe the audio files with StarWhisper to get text you can search, quote, and code. See how to transcribe meetings for a step-by-step on the file transcription side, and how to convert MP3 to text for the file format walkthrough. Use dictation for your own research notes during the interview write-up.
If you landed here, you are probably in one of these situations.
Dictation and transcription are two different jobs that sometimes share an engine and sometimes do not. Dictation is for producing new text right now, at the speed of your speech, into whatever app you are using. Transcription is for converting a recording you already have into a text document. Most knowledge workers need both at some point, which is why StarWhisper ships both modes in one app with one license.
If you only need one of the two, that is fine. Use dictation if you spend your day producing text. Use transcription if your day is full of recorded meetings, interviews, or voice memos that need to become text. If both are part of your work, you only need one tool. StarWhisper runs on Windows 10 and 11, is free for personal use, costs $10 per month for unlimited Pro, and uses the same Whisper engine for both modes.
Step-by-step guide to turning recorded meetings into text.
File transcription walkthrough for MP3 and other audio formats.
Dictation app vs meeting bot. Different categories, different fits.
How creators use both modes for show notes and drafts.