Two Different Jobs, Often Confused

Dictation vs Transcription: What's the Difference?

Dictation means speaking text in real time into an app. Transcription means converting an existing audio recording to text. They are different jobs that often need different tools. StarWhisper handles both with the same Whisper engine, on Windows, locally, free for personal use.

Dictation
Hold hotkey, speak, text appears
Real time. Into any text field. For email, notes, drafts.
Transcription
Drop audio file, get text out
Batch. From MP3, WAV, M4A. For meetings, interviews, podcasts.

The Quick Difference

Both turn speech into text. The difference is when the speech happens.

Dictation

Live speech, into a text field

You speak right now and the words appear in your active app at that moment. The audio is produced live, transcribed in real time, and you see the result immediately. The point is the text in the field, not the audio.

  • Trigger: push-to-talk hotkey
  • Input: your microphone, right now
  • Output: text in whatever app has focus
  • Use cases: email, notes, Slack, drafts, articles, chat
  • Audio is usually not saved (the text is the artifact)
Transcription

Existing recording, into a transcript

You have an audio file already (a meeting recording, a podcast episode, a voice memo, an interview MP3). You need it as a text document for searching, editing, or reading. The audio is the input, the text file is the output.

  • Trigger: drag the audio file into the app
  • Input: a saved audio or video file
  • Output: a text or markdown transcript file
  • Use cases: meeting notes, interview transcripts, podcast show notes
  • The audio file is preserved alongside the transcript

StarWhisper Does Both

Six concrete things to know about how dictation and transcription coexist in one app.

Same Whisper engine

Both dictation and file transcription use the same OpenAI Whisper model running locally on your PC. Accuracy is identical, language support is the same 96 languages, and the offline guarantee is the same. There is no second account or upgrade to use both.

One hotkey for dictation

Bind a push-to-talk hotkey once. Anywhere you can type, you can dictate. Hold to talk, release to stop. The transcribed text is pasted into the active text field via the Windows IME mechanism. It works in Word, Outlook, Slack, browsers, and any other app with a text field.

Drag and drop for transcription

For file transcription, drag any audio file (MP3, WAV, M4A, FLAC, OGG, WebM) into the StarWhisper window. The app processes the file and produces a text transcript. You can save it as TXT, copy to clipboard, or open in your editor.

Local and offline

Both modes run locally by default, so audio never leaves your PC unless you opt in to the optional cloud mode. Local processing matters most for transcription, where you may be handling sensitive meeting audio, but it covers dictation too.

Free tier covers both

The 500 words per day and 3,500 per week free tier covers dictation and transcription combined. There is no separate quota or paywall for transcription. Pro at $10 per month removes the cap entirely for both modes.
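
The shared cap can be pictured with a small sketch. The class name, the trailing 7-day window, and the whitespace word count are assumptions made for illustration; the page does not say how StarWhisper counts words or when the weekly window resets.

```python
from collections import defaultdict
from datetime import date, timedelta

DAILY_LIMIT = 500     # free-tier words per day (figure from the text above)
WEEKLY_LIMIT = 3500   # free-tier words per week (figure from the text above)


class FreeTierQuota:
    """Hypothetical tracker: one word budget shared by dictation and transcription."""

    def __init__(self):
        self.words_by_day = defaultdict(int)

    def used_today(self, today):
        return self.words_by_day.get(today, 0)

    def used_this_week(self, today):
        # Assumption: "week" means the trailing 7 days, including today.
        return sum(
            self.words_by_day.get(today - timedelta(days=i), 0) for i in range(7)
        )

    def remaining(self, today):
        daily_left = DAILY_LIMIT - self.used_today(today)
        weekly_left = WEEKLY_LIMIT - self.used_this_week(today)
        return max(0, min(daily_left, weekly_left))

    def record(self, text, today):
        """Count words from either mode against the same budget."""
        count = len(text.split())
        self.words_by_day[today] += count
        return count


quota = FreeTierQuota()
day = date(2024, 1, 1)
quota.record("hello world from dictation", day)   # 4 words, dictated
quota.record("a transcript of five words", day)   # 5 words, from a file
print(quota.remaining(day))  # 491
```

Under the trailing-window assumption, 7 × 500 = 3,500, so the weekly figure simply restates the daily one; it would only bind separately if the app counts weeks differently.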

GPU acceleration if you have one

Transcription speeds up dramatically on NVIDIA GPUs via CUDA. A one-hour recording transcribes in 2 to 5 minutes on a mid-range RTX card, versus 10 to 20 minutes on CPU only. Dictation feels instant on either path.
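
Those figures can be restated as a real-time factor (minutes of audio processed per minute of waiting) and used for a rough estimate on other recording lengths. This is back-of-the-envelope arithmetic that assumes processing time scales linearly with audio length; the function names are illustrative.

```python
def realtime_factor(audio_minutes, processing_minutes):
    """Minutes of audio processed per minute of wall-clock time."""
    return audio_minutes / processing_minutes


def estimate_minutes(audio_minutes, factor_low, factor_high):
    """Return (best_case, worst_case) processing time in minutes."""
    return audio_minutes / factor_high, audio_minutes / factor_low


# Figures quoted above for a one-hour recording.
GPU_FACTORS = (realtime_factor(60, 5), realtime_factor(60, 2))    # 12x to 30x
CPU_FACTORS = (realtime_factor(60, 20), realtime_factor(60, 10))  # 3x to 6x

# Extrapolate to a 90-minute recording under the linear-scaling assumption.
print(estimate_minutes(90, *GPU_FACTORS))  # (3.0, 7.5)
print(estimate_minutes(90, *CPU_FACTORS))  # (15.0, 30.0)
```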

The decision tree, in one question

Most of the confusion between dictation and transcription dissolves if you ask the right starting question.

Ask yourself: is the audio happening right now, or did it already happen?

  1. Right now (you are about to speak): you want dictation. The text is the goal, the audio is just the input method. Hold a hotkey, speak, the text appears in your active app. Use cases: drafting an email, writing notes, sending a Slack message, drafting an article.
  2. Already happened (you have a recording): you want transcription. The audio is the input, the text file is the output. Drag the file into a transcription tool. Use cases: turning a meeting recording into notes, getting an interview into a quotable transcript, generating podcast show notes from an episode.
  3. Right now but you cannot speak (you are listening, not talking): you want meeting-bot transcription. A tool like Otter joins the meeting as an attendee and transcribes everyone, live, with speaker labels. This is a third category that is closer to live transcription than to dictation. StarWhisper does not do this.

That is the entire decision tree. Almost every confusion between the two terms comes from missing one of those three categories.
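
For readers who think in code, the three branches fit in one small function. The enum and function names are purely illustrative and not part of any tool.

```python
from enum import Enum


class Mode(Enum):
    DICTATION = "dictation"
    TRANSCRIPTION = "transcription"
    MEETING_BOT = "meeting bot"


def pick_mode(audio_is_live: bool, you_are_the_speaker: bool) -> Mode:
    """Map the starting question onto the three categories."""
    if not audio_is_live:
        # The recording already exists: file in, text out.
        return Mode.TRANSCRIPTION
    if you_are_the_speaker:
        # You are about to speak and want text in the active app.
        return Mode.DICTATION
    # Live audio, but you are listening rather than talking.
    return Mode.MEETING_BOT


print(pick_mode(audio_is_live=True, you_are_the_speaker=True).value)    # dictation
print(pick_mode(audio_is_live=False, you_are_the_speaker=True).value)   # transcription
print(pick_mode(audio_is_live=True, you_are_the_speaker=False).value)   # meeting bot
```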

Where the confusion comes from

Historically, "dictation" and "transcription" were used somewhat interchangeably because the same human (often a secretary) might do both. A boss would dictate a letter to a secretary, who would type it as the boss spoke. That is dictation, but the secretary's act of writing it down was sometimes called transcription. Later, a recording device captured the boss's voice and the secretary "transcribed" the recording. That is transcription proper.

In the software era, the two became architecturally distinct. Dictation software runs in the background, listens for a hotkey, captures live audio, transcribes it instantly, and pastes the text where you are typing. Transcription software opens a saved audio file and produces a transcript document. The user experience is completely different even if the underlying speech recognition engine is the same.

Modern tools have re-merged the two in the sense that a single app can offer both modes. StarWhisper is one example. Wispr Flow on Mac is another. But the underlying tasks are still different: one is a live input replacement, the other is a batch audio-to-text conversion.

Case-by-case examples

Common scenarios people ask about, and which mode they actually want.

Scenario | Mode | Why
Drafting an email or article | Dictation | You are producing the text now
Writing notes during your own meeting | Dictation | Your own commentary, your own choice of words
Turning a saved Zoom recording into notes | Transcription | The audio already exists, you want a text file
Getting an interview MP3 into a transcript | Transcription | File in, text out
Live meeting captions for everyone speaking | Meeting bot | You are listening, not the speaker
Voice memos converted to text | Transcription | Recording exists already
Sending a Slack message hands-free | Dictation | Real-time output into the chat field
Podcast episode to show notes | Transcription | Audio file in, text out
Writing code comments by voice | Dictation | Live, into your IDE
YouTube video to article draft | Transcription | Video has audio, you want text

The third category: meeting bots

There is a third category that does not fit cleanly into either dictation or transcription, and it is the source of a lot of confusion. Meeting transcription bots like Otter and Fireflies join a meeting as an attendee, listen to everyone speak, and produce a live transcript with speaker labels; Zoom's built-in live transcription does a similar job from inside the call. They are arguably "live transcription," but they are not dictation because you are not the speaker.

Meeting bots are the right tool when you are attending a meeting you do not control, you want a record of what everyone said, and speaker labels matter so you can attribute quotes correctly. They are the wrong tool when you are dictating your own thoughts, drafting an email, or processing a single audio file you already have.

StarWhisper does not join meetings as a bot. It is push-to-talk dictation plus file transcription. For meeting bot functionality, you need a separate category of tool. There is a fuller breakdown at StarWhisper vs Otter if you want to see the side-by-side. If you mostly need meeting capture, Otter or a similar tool is a better fit. If you mostly need to draft your own text and occasionally transcribe a recording, StarWhisper covers both.

Why the same engine works for both

OpenAI Whisper is a sequence-to-sequence model that takes audio as input and produces text as output. It does not care whether the audio came from a microphone right now or from a file you recorded last week. The model is the same, the processing is the same, and the accuracy is the same. The difference is the wrapper around the model.

For dictation, the wrapper is a hotkey listener, a real-time audio capture pipeline, and a Windows IME hook that pastes the result. For transcription, the wrapper is a file picker, an audio decoder that handles MP3 or M4A or whatever format, and a save-to-file step. The model in the middle is identical.
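
A minimal sketch of that shape follows. It is not StarWhisper's actual code: the class names are hypothetical, and `fake_engine` stands in for the speech model so the example runs without one.

```python
from typing import Callable, List

# Assumption: the engine is any function mapping raw audio bytes to text.
Engine = Callable[[bytes], str]


def fake_engine(audio: bytes) -> str:
    """Placeholder for the speech model; a real app would call Whisper here."""
    return f"<{len(audio)} bytes of speech>"


class DictationWrapper:
    """Live path: captured microphone audio in, text delivered to the focused app."""

    def __init__(self, engine: Engine, deliver: Callable[[str], None]):
        self.engine = engine
        self.deliver = deliver

    def on_hotkey_released(self, captured_audio: bytes) -> None:
        self.deliver(self.engine(captured_audio))


class TranscriptionWrapper:
    """Batch path: decoded file audio in, transcript text out."""

    def __init__(self, engine: Engine, decode: Callable[[str], bytes]):
        self.engine = engine
        self.decode = decode

    def transcribe_file(self, path: str) -> str:
        return self.engine(self.decode(path))


# Both wrappers share one engine; only the plumbing around it differs.
typed: List[str] = []
dictation = DictationWrapper(fake_engine, typed.append)
dictation.on_hotkey_released(b"\x00" * 16)

batch = TranscriptionWrapper(fake_engine, decode=lambda path: b"\x00" * 32)
transcript = batch.transcribe_file("interview.mp3")

print(typed)       # ['<16 bytes of speech>']
print(transcript)  # <32 bytes of speech>
```

Passing the engine in as a parameter is what lets both paths improve together when the model is swapped for a better one.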

This is why StarWhisper can ship both modes for the same price and with the same accuracy. It is also why, when a newer Whisper model is released, both modes get better at the same time. The economics work because Whisper is open source and the model runs on your hardware, not in someone else's cloud.

Speaker labels, diarization, and what StarWhisper does not do

Single-track audio, no speaker identification

StarWhisper produces a single continuous transcript. It does not identify which person said what. If you transcribe a panel discussion or a multi-person interview, you will get all the words in order but without "Speaker 1:" or "Speaker 2:" labels. The technical name for that feature is speaker diarization, and Whisper does not include it natively. Some tools layer a separate diarization model on top of Whisper to add labels, and that is a reasonable workflow for multi-speaker content, but StarWhisper itself does not. For single-speaker dictation (you) and single-track recordings (your own voice memo, a one-person podcast), this does not matter. For multi-speaker meetings where attribution matters, you want either a tool with diarization built in or a meeting bot that captures separate audio streams per speaker.
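
The layering approach mentioned above usually comes down to matching timestamps. The sketch below is one simple way to do it outside StarWhisper, assuming the speech model returns timestamped text segments and a separate diarization model returns speaker turns; the sample data and function names are made up for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Attach to each transcript segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append(f'{best_speaker}: {seg["text"].strip()}')
    return labeled


# Hypothetical inputs, with times in seconds.
segments = [
    {"start": 0.0, "end": 4.0, "text": " Welcome to the panel."},
    {"start": 4.0, "end": 9.0, "text": " Thanks for having me."},
]
turns = [
    {"start": 0.0, "end": 4.2, "speaker": "Speaker 1"},
    {"start": 4.2, "end": 10.0, "speaker": "Speaker 2"},
]

for line in label_segments(segments, turns):
    print(line)
# Speaker 1: Welcome to the panel.
# Speaker 2: Thanks for having me.
```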

Specific limits of StarWhisper's transcription

  • No speaker labels. Single continuous text output, no diarization.
  • No separate punctuation cleanup pass. Whisper adds punctuation as it transcribes, and there is no second editing step afterward.
  • No live meeting bot. StarWhisper does not join Zoom, Teams, or Google Meet calls.
  • No automatic summary or chapter generation. You get the raw transcript, not an LLM-summarized version.
  • No real-time translation during dictation. Translate after the fact if needed.

For the use cases StarWhisper is designed for (personal dictation and single-track file transcription), none of these limits matter. For multi-speaker meeting analysis with summaries, a different category of tool fits better.

Common workflows that use both

The most productive users tend to combine the two modes in a single working day.

The hybrid knowledge worker

Throughout the day, dictate emails, Slack messages, notes, and draft documents using the push-to-talk hotkey. The text goes straight into the relevant app. At the end of the day, batch-transcribe any recordings you collected (your own voice memos, a Zoom recording you exported, a phone call you captured). The transcripts go into your notes app for later reference. One app, two modes, zero context switching.

The content creator

Dictate first drafts of blog posts and social posts using the hotkey. Record podcast episodes separately, then drop the audio file into StarWhisper to generate show notes and a full transcript for SEO. See voice-to-text for content creators for more on this workflow. Both modes feed the same content pipeline.

The researcher

Conduct interviews with a recorder app, then transcribe the audio files with StarWhisper to get text you can search, quote, and code. See how to transcribe meetings for a step-by-step on the file transcription side, and how to convert MP3 to text for the file format walkthrough. Use dictation for your own research notes during the interview write-up.

Who this page is for

If you landed here, you are probably in one of these situations.

  • You have been hearing "dictation" and "transcription" used interchangeably and you want to know if they are the same thing. They are not. Dictation is live, transcription is from a recording.
  • You are picking a tool and you do not know which feature to look for. Decide whether the audio is live or saved, then pick the matching mode.
  • You are evaluating whether StarWhisper covers your use case. If you need either or both, yes.
  • You are deciding between Otter (meeting bot) and a dictation tool. They are different categories; see StarWhisper vs Otter for the full breakdown.

The honest verdict

Dictation and transcription are two different jobs that sometimes share an engine and sometimes do not. Dictation is for producing new text right now, at the speed of your speech, into whatever app you are using. Transcription is for converting a recording you already have into a text document. Most knowledge workers need both at some point, which is why StarWhisper ships both modes in one app with one license.

If you only need one of the two, that is fine. Use dictation if you spend your day producing text. Use transcription if your day is full of recorded meetings, interviews, or voice memos that need to become text. If both are part of your work, you only need one tool. StarWhisper runs on Windows 10 and 11, is free for personal use within the free-tier word caps, costs $10 per month for unlimited Pro, and uses the same Whisper engine for both modes.

Frequently Asked Questions

Which do I need, dictation or transcription?
Ask one question: are you producing new text right now or processing audio that already exists? If the text is appearing as you speak (writing an email, drafting notes, sending a Slack message), that is dictation and you want a push-to-talk tool. If you have a meeting recording, a podcast file, or an interview MP3 and you need it as text, that is transcription and you want a file-based tool. StarWhisper does both, with the same Whisper engine.
Can I use both dictation and transcription?
Yes, and many users do. A typical pattern: dictate emails and notes live during the day using a push-to-talk hotkey, then transcribe any meeting recordings or voice memos at the end of the day in batch. The two workflows are independent, the same app handles both, and the same Whisper engine produces the text. There is no extra cost or configuration to enable one or the other.
Does StarWhisper do both?
Yes. Real-time dictation works in every Windows text field through a push-to-talk hotkey. File transcription works by dragging an audio file (MP3, WAV, M4A, FLAC, OGG, WebM, and others) into the app, which produces a text transcript. The same Whisper model handles both, so accuracy is the same and language support is the same 96 languages. There is no separate tier or upgrade required to use both.
What audio file formats can I transcribe?
StarWhisper accepts MP3, WAV, M4A, FLAC, OGG, WebM, AAC, WMA, and most other common audio formats. It will also extract audio from video files like MP4 and MOV. There is no per-file size limit imposed by the app, though very large files take longer to process. Processing speed depends on the model size and whether you have a GPU. Most one-hour recordings transcribe in 5 to 15 minutes on a typical PC.
Can I record AND transcribe at the same time, live?
Yes, that is what push-to-talk dictation is. You speak, the audio is recorded and transcribed in real time, and the text appears in your active text field. The audio is not saved by default because the transcription is what you wanted. If you also want to keep the audio, you can use a separate voice recorder app alongside StarWhisper or configure StarWhisper to retain audio files for later batch transcription.
What about speaker labels?
StarWhisper does not do speaker diarization, which is the technical name for identifying which person said what. Whisper produces a single continuous transcript. If you need speaker labels (for example, when transcribing a panel discussion or interview with multiple voices), you need a tool with diarization built in. For single-speaker dictation or single-track audio, speaker labels are not needed.
What about Otter? Is that the same thing?
Otter is a different category. It is primarily a meeting transcription bot that joins your Zoom, Google Meet, or Teams calls as an attendee and transcribes the meeting in real time, with speaker labels and a summary. It is built for meetings you are participating in but not driving. StarWhisper is dictation plus file transcription, not a meeting bot. For meetings you attend but do not speak in, Otter or a similar bot fits better. For dictating your own notes during a meeting, StarWhisper is the right tool.
What does it cost?
StarWhisper is free for personal use up to 500 words per day and 3,500 per week, covering both dictation and file transcription. Pro is $10 per month or $80 per year for unlimited. There is a 7-day free trial of Pro. Compared to Otter at $17 per month or Rev at $0.10 per audio minute, the pricing is substantially lower because Whisper is open source and runs locally on your PC instead of consuming cloud compute.
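
The price comparison reduces to simple break-even arithmetic. The sketch below uses only the prices quoted on this page, which may change, and compares list prices rather than features; amounts are held in cents to avoid floating-point rounding.

```python
PRO_MONTHLY_CENTS = 1000      # $10 per month
PRO_YEARLY_CENTS = 8000       # $80 per year
OTTER_MONTHLY_CENTS = 1700    # $17 per month
REV_CENTS_PER_MINUTE = 10     # $0.10 per audio minute


def rev_cost_cents(audio_minutes):
    """Pay-per-minute cost for a given amount of audio."""
    return audio_minutes * REV_CENTS_PER_MINUTE


def break_even_minutes(flat_monthly_cents):
    """Audio minutes per month at which per-minute pricing equals a flat fee."""
    return flat_monthly_cents / REV_CENTS_PER_MINUTE


print(break_even_minutes(PRO_MONTHLY_CENTS))                # 100.0 minutes on monthly Pro
print(round(break_even_minutes(PRO_YEARLY_CENTS / 12), 1))  # 66.7 minutes on yearly Pro
print((12 * PRO_MONTHLY_CENTS - PRO_YEARLY_CENTS) / 100)    # 40.0 dollars saved by paying yearly
print((OTTER_MONTHLY_CENTS - PRO_MONTHLY_CENTS) / 100)      # 7.0 dollars per month below Otter
```

In other words, at these prices a flat monthly Pro plan costs less than per-minute billing once you transcribe more than about 100 minutes of audio a month.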

One app, both modes, free

Dictation and transcription powered by the same local Whisper engine. $0 for personal use, $10/month Pro.
