Studio sessions, podcast recordings, voiceover masters, audiobook stems. Drag the .wav file into a free Windows app, get the transcript locally. Handles 96k, 48k, 24-bit, multi-channel. No upload, no per-minute fees.
Drag, wait, copy. Any sample rate, any bit depth, any duration.
Grab the free installer for Windows 10 or 11 from the StarWhisper homepage. Setup takes about two minutes and includes a one-time download of the Whisper model so the app can work fully offline after install. No signup, no credit card.
Open StarWhisper. Drag your .wav file from File Explorer onto the window. StarWhisper reads WAV at any sample rate (8k through 96k), any bit depth (16, 24, or 32-bit float), in mono, stereo, or multi-channel layouts. No need to render down to a specific format first; the app handles whatever your DAW exported.
StarWhisper processes the audio locally on your CPU or GPU. Speed: up to roughly 10 times faster than real time on a modern laptop CPU and roughly 50 times faster on an NVIDIA GPU with CUDA. A one-hour WAV transcribes in about 6 to 12 minutes on CPU or 1 to 2 minutes on a mid-range GPU. Progress shows in real time.
The transcript appears in the StarWhisper window when processing finishes. Copy the full text to clipboard, save it as a .txt file, or export as SRT or VTT with timestamps for subtitle workflows. Paste into your show notes, episode page, audiobook script, or wherever the transcript needs to go.
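If you post-process the exported subtitle files, it helps to know how the two formats differ: SRT numbers each cue and uses a comma before the milliseconds, while VTT starts with a WEBVTT header and uses a period. The following is a minimal sketch of both layouts, assuming a hypothetical list of (start_seconds, end_seconds, text) segments. The function names are illustrative and this is not StarWhisper's own export code.

```python
def format_timestamp(seconds: float, separator: str = ",") -> str:
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(segments):
    """Build SRT text: numbered cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(segments):
    """Build WebVTT text: a WEBVTT header followed by unnumbered cues."""
    cues = []
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        cues.append(f"{timing}\n{text}\n")
    return "WEBVTT\n\n" + "\n".join(cues)


# Hypothetical segments, for illustration only.
segments = [(0.0, 2.5, "Welcome to the show."), (2.5, 6.0, "Today we talk about WAV files.")]
print(to_srt(segments))
print(to_vtt(segments))
```

Most video editors and web players accept one of these two formats, so converting between them is usually a matter of changing the header and the separator.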
Specific advantages for pro audio workflows.
Cloud services often re-encode large WAV uploads to a lossy intermediate to save bandwidth. StarWhisper reads the original 24-bit lossless file directly off disk.
Many cloud transcription services reject WAVs over a few hundred megabytes on free tiers or charge extra above a threshold. StarWhisper handles files up to the 4 GB WAV format limit.
96k, 88.2k, 48k, 44.1k, 32k, 24k, 16k, 8k all supported. Internal resampling to 16k is high-quality, so transcription accuracy is the same for any source rate of 16k or above. An 8k source is upsampled and carries less high-frequency speech detail to begin with.
Default Local Mode runs Whisper on your own CPU or GPU. The WAV is decoded locally, transcribed locally, the result is written to your drive. Useful for NDA studio sessions. Privacy details.
Generate caption-ready files directly from the WAV. Useful for adding accessible transcripts to podcast video versions or building subtitle tracks. Subtitle guide.
NVIDIA GPU owners can install the CUDA pack for a roughly 5x jump in transcription speed over CPU. A one-hour WAV becomes a two-minute job. GPU details.
WAV (Waveform Audio File Format) is the uncompressed, lossless container that audio professionals have used since the early 1990s. Every DAW exports to WAV. Every studio archive stores masters as WAV. Every voiceover deliverable spec asks for WAV. Every podcast production chain ends in a WAV master before it gets encoded to MP3 or AAC for distribution.
The reason is simple: WAV preserves the original audio bit-for-bit. There is no codec, no compression artifact, no quality decay. When you record a podcast episode, voiceover session, or audiobook narration into your DAW and bounce a 48 kHz 24-bit WAV, that file is the canonical reference for every future decision. You make MP3s for distribution from the WAV, never the other way around.
The problem with transcribing WAVs is that the same properties that make them ideal for pro audio make them awkward for online tools. Files are large (a one-hour 48k stereo 24-bit WAV is around 1.0 GB). Upload times are long. Many cloud transcription services either cap file size on free tiers, charge extra for large files, or silently re-encode the upload to a lossy intermediate. StarWhisper handles WAV as a first-class format on your own machine, free, offline, with no upload step.
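The file sizes quoted here follow directly from the format parameters: uncompressed PCM takes sample rate times bytes per sample times channel count for every second of audio. The sketch below works through that arithmetic and also shows how long a recording can run before it reaches the 4 GB ceiling of a standard WAV header. The function names are illustrative, and the header and any metadata chunks (a small amount of extra data) are ignored.

```python
RIFF_MAX_BYTES = 2**32 - 1  # 32-bit size field in a standard RIFF/WAV header


def pcm_bytes(duration_s: float, sample_rate: int, bit_depth: int, channels: int) -> int:
    """Size of the uncompressed audio payload in bytes (header excluded)."""
    return int(duration_s * sample_rate * (bit_depth // 8) * channels)


def max_duration_hours(sample_rate: int, bit_depth: int, channels: int) -> float:
    """Longest recording that fits under the 4 GB limit at a given format."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return RIFF_MAX_BYTES / bytes_per_second / 3600


# One-hour 48k stereo 24-bit master: about 1.04 GB.
print(pcm_bytes(3600, 48_000, 24, 2) / 1e9)

# The same format hits the 4 GB limit after a little over four hours.
print(max_duration_hours(48_000, 24, 2))
```

Halving the channel count or dropping to 16-bit shrinks the file proportionally, which is why a mono voiceover master is so much smaller than a stereo episode of the same length.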
Pro audio workflows obsess over sample rate, bit depth, and channel layout for good reason: those choices affect mastering, mixing, and the final listener experience. For transcription specifically, the picture is much simpler.
Whisper was trained on 16 kHz audio. StarWhisper internally resamples your WAV to 16 kHz using a high-quality resampler before running the model. A 96 kHz WAV produces the same transcript as a 16 kHz version of the same audio. Human speech occupies a frequency range from roughly 100 Hz to 8 kHz, and a 16 kHz sample rate captures everything up to 8 kHz. What lies above that is consonant detail and ambient air, none of which Whisper uses for word recognition. Higher sample rates are wasted on transcription.
16-bit, 24-bit, and 32-bit float all transcribe identically. Bit depth controls dynamic range, which matters for music production but not for word recognition. Whisper internally works in 32-bit float regardless of source bit depth.
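To make the bit-depth point concrete: once integer samples are normalized to floating point, the same signal level lands on the same value whether it was stored in 16 or 24 bits. The sketch below assumes little-endian signed integer PCM, as used in standard 16-bit and 24-bit WAV files. The pcm_to_float helper is illustrative and is not the decoder StarWhisper or Whisper actually uses.

```python
def pcm_to_float(raw: bytes, sample_width: int) -> list[float]:
    """Convert little-endian signed PCM (2 or 3 bytes per sample) to floats in [-1.0, 1.0)."""
    if sample_width not in (2, 3):
        raise ValueError("this sketch handles 16-bit and 24-bit PCM only")
    full_scale = 1 << (8 * sample_width - 1)  # 32768 for 16-bit, 8388608 for 24-bit
    return [
        int.from_bytes(raw[i:i + sample_width], "little", signed=True) / full_scale
        for i in range(0, len(raw) - sample_width + 1, sample_width)
    ]


# The same half-scale sample stored at two bit depths.
half_scale_16 = (0x4000).to_bytes(2, "little", signed=True)
half_scale_24 = (0x400000).to_bytes(3, "little", signed=True)

print(pcm_to_float(half_scale_16, 2))
print(pcm_to_float(half_scale_24, 3))
```

The extra bits in a 24-bit file only add finer steps between values, far below the level that matters for recognizing words.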
Whisper is a mono model. StarWhisper downmixes stereo and multi-channel WAVs to mono before processing. For most podcast and voiceover content this is fine. For interview recordings where each speaker is on a separate channel, the cleanest approach is to split the WAV into per-channel mono files using any audio editor, then transcribe each separately. This gives you per-speaker transcripts that you can label and combine in post.
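If you would rather script the per-channel split than do it in an audio editor, Python's standard wave module is enough for integer PCM files. The following is a minimal sketch, assuming a 16-bit or 24-bit PCM WAV. The wave module does not read 32-bit float WAVs, and the split_channels name and the output file naming are illustrative choices.

```python
import wave
from pathlib import Path


def split_channels(source, out_dir, chunk_frames=65536):
    """Write one mono WAV per channel of `source` into `out_dir`; return the new paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(source).stem

    with wave.open(str(source), "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()      # bytes per sample
        rate = src.getframerate()
        frame_size = channels * width   # bytes per interleaved frame

        paths = [out / f"{stem}_ch{c + 1}.wav" for c in range(channels)]
        writers = []
        for path in paths:
            writer = wave.open(str(path), "wb")
            writer.setnchannels(1)
            writer.setsampwidth(width)
            writer.setframerate(rate)
            writers.append(writer)

        try:
            # Read in chunks so a multi-gigabyte master never sits fully in memory.
            while True:
                frames = src.readframes(chunk_frames)
                if not frames:
                    break
                for c, writer in enumerate(writers):
                    offset = c * width
                    mono = b"".join(
                        frames[i + offset:i + offset + width]
                        for i in range(0, len(frames), frame_size)
                    )
                    writer.writeframesraw(mono)
        finally:
            for writer in writers:
                writer.close()  # finalizes each header with the correct length

    return paths
```

Calling split_channels("interview.wav", "split") on a three-channel recording would produce interview_ch1.wav through interview_ch3.wav, each ready to transcribe on its own.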
Common pro audio scenarios with realistic timing:
Bounce a 60-minute episode as a 48k stereo 24-bit WAV (roughly 1 GB on disk). Drop on StarWhisper. CPU processing time: 6 to 12 minutes. GPU processing time: 1 to 2 minutes. Output: roughly 8,000 words of plain-text transcript suitable for show notes. Light edit, publish. For more detail on the podcast-specific workflow, see voice-to-text for podcasters or how to transcribe podcasts.
Voiced chapter, exported as 48k mono 24-bit WAV, runtime 45 minutes (about 390 MB). Drop on StarWhisper. Output: roughly 6,500 words of chapter text. Useful for cross-checking the narration against the original script to catch fluffs or missed lines.
Three-host interview, each host on a separate WAV channel after the session. Export each channel as a mono WAV. Drop each onto StarWhisper individually. Result: three per-speaker transcripts that you can combine with timecode and speaker labels in post. Total processing time scales linearly: three 60-minute mono WAVs take about 18 to 36 minutes on CPU or 3 to 6 minutes on GPU.
Voice actor records 10 takes of a 30-second commercial spot. Each take exported as a separate WAV. Drop the folder on StarWhisper. The app queues all takes and produces a labeled transcript file for each. Useful for the agency review pass to compare take-to-take wording.
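The timings in these scenarios are simple arithmetic: audio duration divided by how many times faster than real time the machine runs. The sketch below reproduces them, assuming speed ranges derived from the quoted per-hour figures (6 to 12 minutes on CPU, 1 to 2 minutes on GPU). Actual speed depends on your hardware and the audio itself, and the names are illustrative.

```python
# Assumed speedups over real time, derived from the quoted per-hour timings.
CPU_SPEEDUP = (5, 10)    # 60 min / 12 min and 60 min / 6 min
GPU_SPEEDUP = (30, 60)   # 60 min / 2 min and 60 min / 1 min


def estimate_minutes(audio_minutes: float, speedup: tuple) -> tuple:
    """Return (best_case, worst_case) processing time in minutes."""
    slowest, fastest = speedup
    return audio_minutes / fastest, audio_minutes / slowest


# One 60-minute episode.
print(estimate_minutes(60, CPU_SPEEDUP))
print(estimate_minutes(60, GPU_SPEEDUP))

# Three 60-minute per-speaker files, processed one after another.
print(estimate_minutes(3 * 60, CPU_SPEEDUP))
print(estimate_minutes(3 * 60, GPU_SPEEDUP))
```

Because the estimate scales linearly with total audio length, a batch of short takes costs the same as one file of the combined duration.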
Many people end up with both: a WAV master and an MP3 distribution copy. For transcription, the answer is almost always the WAV.
The WAV is the closest to source, so it has the cleanest signal Whisper can work with. The MP3 has been through a lossy codec, which mildly muddies high-frequency consonant edges and adds compression artifacts in transient sections. Whisper handles both well, but on borderline audio (heavy accents, fast speech, technical vocabulary) the WAV can produce a noticeably better transcript.
The exception is when the WAV is large and inconvenient and the MP3 is good enough. For a clean studio podcast episode where the source mic was solid and the room was treated, the MP3 transcript is functionally identical to the WAV transcript. Use whichever is faster to grab. If you want the transcript for archival or accessibility purposes, use the WAV. For a related workflow starting from MP3, see how to convert MP3 to text. If your source is an iPhone voice memo or a QuickTime export, see how to convert M4A to text.
Studio WAV files often contain content that should not sit on a third-party server:
Cloud transcription services upload the WAV to their infrastructure regardless of what is in it. Their privacy policies may be strong, but the file still leaves your control. For the categories above, that is often unacceptable on contract, ethics, regulatory, or legal grounds.
StarWhisper Local Mode keeps everything on your device. The WAV is decoded by the app, the Whisper model runs on your CPU or GPU, the transcript is written to your hard drive. Nothing leaves the machine. For deeper detail, see privacy and offline architecture and how to transcribe audio offline. For specific regulated industries, see HIPAA compliance FAQ or voice-to-text for lawyers.
The free tier of StarWhisper provides 500 words per day and 3,500 words per week of transcribed output. A typical 60-minute WAV produces roughly 8,000 words, so a single long episode exceeds both the daily and the weekly free cap. There is no limit on file size or duration itself, so you can still process the file, but only the first ~500 words count toward today's allocation.
For occasional studio work (one episode every few weeks, the rare voiceover session), the free tier is enough. For production-volume work (weekly podcast episodes, daily audiobook recording sessions, regular interview transcription), the Pro plan removes the cap. It is 10 dollars per month or 80 dollars per year. Full Pro details and pricing. A 7-day free trial of Pro is available if you want to verify the workflow on a long WAV before paying.
Free and Pro use the same Whisper model and produce identical transcripts. Pro just removes the word cap and adds workflow features like custom vocabulary (useful for industry-specific terminology) and priority cloud fallback (if you opt in). For pure WAV transcription, the only practical difference is the daily output ceiling.
The same workflow for the world's most common compressed audio format.
For iPhone Voice Memos, QuickTime audio, and YouTube downloads.
Use SRT/VTT export to generate captions for podcast video versions.
The full studio podcast transcription workflow, end to end.