Have a podcast, recording, or voice memo as MP3? Drag the file into a free Windows app, get the transcript in minutes. No upload, no per-minute fees, no file length limit.
Drag, wait, copy. Works on any MP3 of any length.
Grab the free installer for Windows 10 or 11 from the StarWhisper homepage. The setup takes about two minutes and includes a one-time download of the Whisper model so the app can work fully offline after install. No signup, no credit card, no email confirmation.
Open StarWhisper. Open File Explorer and find your MP3. Drag the file onto the StarWhisper window. The app auto-detects the language. If you prefer, you can also click an Import button and browse to the file. Both methods produce the same result.
StarWhisper processes the audio locally on your CPU or GPU. Rough speed: 5 to 10 times faster than real time on a modern laptop CPU and about 50 times faster on an NVIDIA GPU with CUDA enabled. A one-hour MP3 transcribes in about 6 to 12 minutes on CPU or 1 to 2 minutes on a mid-range GPU. Progress shows in real time.
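The timing figures above are plain division: audio length over the speed-up relative to real time. A minimal sketch of that arithmetic follows. The function name is illustrative, the 10x and 50x factors are the rough figures from the text, and the 5x lower bound is inferred from the 12-minute CPU estimate; none of these are measurements.

```python
def estimate_processing_minutes(audio_minutes: float, speedup: float) -> float:
    """Estimate transcription time for audio of a given length.

    `speedup` is how many times faster than real time the hardware runs
    (an assumed figure; real throughput depends on the machine and model).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_minutes / speedup


# One-hour MP3 at the rough speed factors discussed in the text.
for label, speedup in [("CPU, slower end", 5), ("CPU, faster end", 10), ("NVIDIA GPU", 50)]:
    minutes = estimate_processing_minutes(60, speedup)
    print(f"{label}: about {minutes:.1f} minutes")
```

Treat the output as a planning estimate only; noisy audio or an older machine can shift the numbers noticeably.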
The transcript appears in the StarWhisper window when processing finishes. Copy the full text to clipboard, save it as a .txt file, or export with optional timestamps if you need to cross-reference the original audio. Paste into Notion, Word, Google Docs, your CMS, or wherever the text needs to go.
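To illustrate what a timestamped export can look like, here is a small sketch that turns a list of transcript segments into plain text with optional `[HH:MM:SS]` prefixes. The segment shape (a dict with `start` in seconds and `text`) mirrors what the open-source Whisper library returns; StarWhisper's actual export layout may differ, and the sample data is made up.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments: list[dict], with_timestamps: bool = True) -> str:
    """Join segments into plain text, optionally prefixing each with its start time.

    Each segment is assumed to be a dict with "start" (seconds) and "text" keys.
    """
    lines = []
    for segment in segments:
        text = segment["text"].strip()
        if with_timestamps:
            lines.append(f"[{format_timestamp(segment['start'])}] {text}")
        else:
            lines.append(text)
    return "\n".join(lines)


# Hypothetical sample data, not real StarWhisper output.
sample = [
    {"start": 0.0, "text": " Welcome to the show."},
    {"start": 3725.4, "text": " Thanks for listening."},
]
print(format_transcript(sample))
```

With timestamps on, each line can be matched back to a position in the original audio; with them off, the result is clean text ready to paste.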
Specific advantages, not vague benefits.
Rev charges 25 cents per minute for human transcription and 10 cents per minute for AI. A 90-minute podcast episode costs 9 to 22.50 dollars. StarWhisper is free and, on clear audio, matches the accuracy of Rev's AI tier.
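The 9 to 22.50 dollar range is the per-minute rate multiplied by the episode length. A short sketch of that calculation, using only the rates quoted above (the constant and function names are illustrative):

```python
REV_AI_CENTS_PER_MINUTE = 10      # rate quoted in the text
REV_HUMAN_CENTS_PER_MINUTE = 25   # rate quoted in the text


def cost_in_dollars(audio_minutes: int, cents_per_minute: int) -> float:
    """Per-minute pricing: whole cents multiplied out, then converted to dollars."""
    return audio_minutes * cents_per_minute / 100


episode_minutes = 90
print(f"AI tier:    ${cost_in_dollars(episode_minutes, REV_AI_CENTS_PER_MINUTE):.2f}")
print(f"Human tier: ${cost_in_dollars(episode_minutes, REV_HUMAN_CENTS_PER_MINUTE):.2f}")
```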
Cloud services often cap upload size (Otter.ai limits free tier to 30 minutes per file). StarWhisper handles audiobooks, full-day conference recordings, and multi-hour interviews with no cap.
MP3, WAV, M4A, AAC, OGG, OPUS, FLAC, WMA, plus audio extraction from MP4 and other video formats. No conversion step needed.
Default Local Mode runs Whisper on your own CPU or GPU. The MP3 is decoded locally, transcribed locally, and the result is written to your drive. Nothing is uploaded.
Whisper supports 96 languages including all major European, Asian, and Middle Eastern languages. Language detection is automatic.
NVIDIA GPU owners can install the CUDA pack for a roughly 5x jump in transcription speed. A one-hour MP3 becomes a two-minute job.
MP3 is still the universal portable audio format. Podcasts publish MP3 feeds, lectures end up as MP3 downloads, voice memos export as MP3 from many recorders, and historical archives of interviews and conference recordings exist almost exclusively as MP3. Anyone trying to read what is inside one of those files runs into the same problem: there is no built-in MP3-to-text tool on Windows.
The path most people take, often by default, is to upload the MP3 to a cloud transcription service. Otter, Trint, Rev, Sonix, Happy Scribe, and Descript all offer this. They work, but they have three common downsides: per-minute fees that add up fast on long files, upload time and bandwidth cost on slow connections, and privacy concerns for sensitive recordings. A fourth downside is rarer but worse: some services have file-size or file-length limits on free tiers that quietly cap what you can actually transcribe without paying.
The alternative most technical users discover and then never go back from is local transcription. Install a small app, drag the MP3 in, get the text. StarWhisper is the most popular Windows option for this. The model is OpenAI Whisper, which is the same underlying technology powering many cloud transcription services; you just run it locally instead of paying a vendor to run it for you.
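For readers curious what running the model locally looks like outside a packaged app, here is a minimal sketch using the open-source `openai-whisper` Python package (installed separately, and it needs ffmpeg available). This is not StarWhisper's own code; the `base` model size and the file name `episode.mp3` are assumptions chosen for illustration.

```python
def load_local_model(size: str = "base"):
    """Load an open-source Whisper model onto this machine.

    Imported lazily so the rest of the sketch can be read and tested
    without the `openai-whisper` package installed.
    """
    import whisper  # pip install openai-whisper (also needs ffmpeg on PATH)

    return whisper.load_model(size)


def transcribe_file(model, audio_path: str) -> str:
    """Run a loaded model on one audio file and return the transcript text."""
    result = model.transcribe(audio_path)  # dict with "text", "segments", "language"
    return result["text"].strip()


def main() -> None:
    model = load_local_model("base")
    print(transcribe_file(model, "episode.mp3"))  # hypothetical file name
```

Calling `main()` with a real file in place runs everything on the local machine; the first run downloads the model weights, after which no network access is needed.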
Concrete numbers from common situations:
Drop a 60-minute MP3 onto StarWhisper. Wait 6 to 12 minutes on a typical laptop CPU, or 1 to 2 minutes with an NVIDIA GPU. Copy the resulting 8,000-word transcript into your notes app or content management system. Total time including setup: under 15 minutes for a first-time user. A returning user spends under 2 minutes of hands-on time, plus the processing wait.
Drop a 90-minute interview MP3. Processing time on a modern CPU: 9 to 18 minutes. Output: roughly 12,000 words of plain-text transcript. Edit lightly, paste into your draft article. No per-minute fees; the same file would cost 9 to 22.50 dollars on Rev.
Drop a folder of 30 old recordings. StarWhisper queues them. Walk away. Come back to 30 transcripts, each saved with the matching file name. For freelance archivists or researchers digitizing audio, this is the workflow that replaces sitting at a transcription pedal for weeks.
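The batch behaviour described above (queue every recording, write one transcript per file under the matching name) can be sketched as follows. This is an illustration of the pattern, not StarWhisper's implementation: the extension set is a small subset picked for the example, and `transcribe` stands in for whatever engine produces the text.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}  # illustrative subset


def plan_batch(folder: Path) -> list[tuple[Path, Path]]:
    """Pair each audio file in `folder` with a .txt path that shares its name."""
    jobs = []
    for audio in sorted(folder.iterdir()):
        if audio.is_file() and audio.suffix.lower() in AUDIO_EXTENSIONS:
            jobs.append((audio, audio.with_suffix(".txt")))
    return jobs


def run_batch(folder: Path, transcribe) -> int:
    """Transcribe every queued file in order; returns how many were written.

    `transcribe` is any callable that takes a file path and returns text.
    """
    jobs = plan_batch(folder)
    for audio, transcript in jobs:
        transcript.write_text(transcribe(audio), encoding="utf-8")
    return len(jobs)
```

Keeping the planning step separate from the writing step makes it easy to preview which files will be processed before starting a long run.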
Drop a recorded voice memo (often M4A from a phone, but MP3 works the same). A 5-minute memo becomes roughly 700 words of text in under a minute. Useful for capturing ideas while walking and then having a searchable record.
Honest numbers. On clear English audio from a quality microphone (podcast, professional interview), Whisper achieves roughly 95 to 99 percent accuracy. This matches or beats the AI tier of Rev, the automated transcription of Otter, and the standard tier of most cloud services.
Accuracy drops on:
For comparison, human transcription (Rev human at 25 cents per minute) sits around 99 percent on clear audio and degrades less on edge cases. The trade-off is the cost: free local transcription handles 90 percent of real-world MP3 use cases at quality good enough to publish or search. For the remaining 10 percent where edge-case accuracy is critical, paid human transcription still has a role.
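The text does not say how its accuracy percentages are measured. One common convention is one minus the word error rate (WER): the word-level edit distance between a trusted reference transcript and the machine output, divided by the reference length. The sketch below assumes that convention, so it is a way to check a transcript yourself rather than a statement of how the figures above were produced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Classic dynamic-programming edit distance, kept to one rolling row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the quick brown fox", "the quick brown box")
print(f"WER {wer:.2f}, accuracy {1 - wer:.0%}")
```

Under this convention, 95 percent accuracy means about one wrong, missing, or extra word in every twenty.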
StarWhisper does not require you to convert the file before transcribing. The supported formats:
| Format | Common source | Supported |
|---|---|---|
| MP3 | Podcasts, downloads | Yes |
| WAV | Pro audio, studio recordings | Yes |
| M4A | iPhone Voice Memos, Zoom audio_only | Yes |
| AAC | iTunes, some podcasts | Yes |
| OGG / OPUS | WhatsApp, Telegram voice notes | Yes |
| FLAC | Lossless archives | Yes |
| WMA | Older Windows recordings | Yes |
| MP4 (video) | YouTube, Zoom video | Yes, audio extracted |
| MOV / AVI / MKV | Other video | Yes, audio extracted |
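The table can be read as a simple lookup from file extension to handling. A minimal sketch under that assumption follows; how StarWhisper actually identifies formats is not stated (it may inspect file contents rather than names), and the function name is hypothetical.

```python
from pathlib import Path

# Extensions taken from the table above.
AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".opus", ".flac", ".wma"}
VIDEO_FORMATS = {".mp4", ".mov", ".avi", ".mkv"}


def classify_input(filename: str) -> str:
    """Return "audio", "video" (audio track gets extracted), or "unsupported"."""
    suffix = Path(filename).suffix.lower()
    if suffix in AUDIO_FORMATS:
        return "audio"
    if suffix in VIDEO_FORMATS:
        return "video"
    return "unsupported"


for name in ["Episode 12.MP3", "standup.mkv", "notes.pdf"]:
    print(f"{name}: {classify_input(name)}")
```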
For related conversion workflows, see how to convert M4A to text for iPhone Voice Memos specifically, or how to convert WAV to text for studio recordings.
Many MP3 files contain content people would not want sitting on a third-party server. Recorded interviews under NDA. Customer support calls. Therapy session recordings. Personal voice memos with private thoughts. Researcher recordings of human subjects who consented to local processing only. Investigative-journalism source recordings.
Cloud transcription services upload all of this to their infrastructure. Even with strong privacy policies, the audio sits on someone else's hardware. For the categories above, that is often unacceptable.
StarWhisper Local Mode keeps the entire pipeline on your device. Decoding the MP3 happens on your CPU. The Whisper model runs on your CPU or GPU. The resulting text is written to your hard drive. Nothing leaves the device unless you choose to share it. This satisfies the privacy requirement for the use cases above and removes the legal and ethical question marks that come with cloud transcription of sensitive content.
For full privacy architecture details, see the privacy and offline architecture page. For working with audio in regulated industries specifically, see the HIPAA compliance FAQ, the voice-to-text for therapists page, or the voice-to-text for researchers page.
The free tier of StarWhisper gives you 500 words per day and 3,500 words per week of transcribed output. A typical 60-minute MP3 produces roughly 8,000 words. That means a single long episode exceeds both the daily and the weekly free cap. You can still process the file (no length limit on the file itself), but only about the first 500 words are covered by today's free allocation.
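To check in advance whether a recording fits the free tier, the numbers above reduce to a quick estimate. The sketch below uses the text's own figures (500 words per day, 3,500 per week, roughly 8,000 words per hour of speech); the function names are illustrative, and actual word counts vary with how fast people talk.

```python
FREE_WORDS_PER_DAY = 500      # from the text
FREE_WORDS_PER_WEEK = 3_500   # from the text
WORDS_PER_HOUR = 8_000        # rough speech rate used in the text


def estimate_words(audio_minutes: int) -> int:
    """Rough transcript length for a recording, at about 8,000 words per hour."""
    return audio_minutes * WORDS_PER_HOUR // 60


def fits_free_tier(audio_minutes: int) -> dict:
    """Compare an estimated transcript length against both free caps."""
    words = estimate_words(audio_minutes)
    return {
        "estimated_words": words,
        "within_daily_cap": words <= FREE_WORDS_PER_DAY,
        "within_weekly_cap": words <= FREE_WORDS_PER_WEEK,
    }


print(fits_free_tier(3))   # a short voice memo
print(fits_free_tier(60))  # a full podcast episode
```

At this speech rate, anything longer than about three and a half minutes is likely to pass the daily cap.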
For casual users who transcribe one short MP3 every few days, the free tier is enough. For anyone who routinely processes long-form audio (podcasters, journalists, researchers, content creators), the Pro plan removes the cap. It is 10 dollars per month or 80 dollars per year. There is also a 7-day free trial that unlocks unlimited use if you want to verify the workflow on a long file before paying.
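For a rough sense of where a flat plan pays off, the prices above can be compared with per-minute billing. The sketch uses only figures from this page (10 dollars per month, 80 dollars per year, and the Rev rates of 10 and 25 cents per minute); the function names are illustrative.

```python
PRO_MONTHLY_DOLLARS = 10   # from the text
PRO_YEARLY_DOLLARS = 80    # from the text


def yearly_saving() -> int:
    """Dollars saved per year by paying annually instead of month by month."""
    return PRO_MONTHLY_DOLLARS * 12 - PRO_YEARLY_DOLLARS


def break_even_minutes(monthly_dollars: int, cents_per_minute: int) -> int:
    """Minutes of per-minute-billed audio that cost the same as one month of a flat plan."""
    return monthly_dollars * 100 // cents_per_minute


print(f"Annual plan saves ${yearly_saving()} per year")
print(f"Break-even vs 10 cents/min: {break_even_minutes(PRO_MONTHLY_DOLLARS, 10)} minutes per month")
print(f"Break-even vs 25 cents/min: {break_even_minutes(PRO_MONTHLY_DOLLARS, 25)} minutes per month")
```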
Free and Pro use the same Whisper model and produce identical transcripts. Pro just removes the word cap and adds workflow features like custom vocabulary and priority cloud fallback (if you opt in). For pure MP3-to-text use, the only practical difference is the daily limit.
The MP3 workflow described above is the same pattern as several adjacent guides. If your file is an iPhone voice memo, see how to convert M4A to text. If it is a recorded interview, see how to transcribe interviews. If it is a podcast episode, see how to transcribe podcasts. If it is a Zoom call recording, see the Zoom call transcription guide. If it is a sermon or lecture, see how to transcribe sermons or how to transcribe lectures. All of these use the same drag-and-drop flow; only the source audio changes.