WAV to Text Guide

How to Convert WAV to Text for Free
Studio Audio, Lossless, Offline

Studio sessions, podcast recordings, voiceover masters, audiobook stems. Drag the .wav file into a free Windows app, get the transcript locally. Handles 96k, 48k, 24-bit, multi-channel. No upload, no per-minute fees.

Four Steps from WAV to Text

Drag, wait, copy. Any sample rate, any bit depth, any duration.

1. Download StarWhisper

Grab the free installer for Windows 10 or 11 from the StarWhisper homepage. Setup takes about two minutes and includes a one-time download of the Whisper model so the app can work fully offline after install. No signup, no credit card.

2. Drag the WAV onto the StarWhisper window

Open StarWhisper. Drag your .wav file from File Explorer onto the window. StarWhisper reads WAV at sample rates from 8k through 96k, at 16-bit, 24-bit, or 32-bit float depth, in mono, stereo, or multi-channel layouts. No need to render down to a specific format first; the app handles whatever your DAW exported.
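
If you want to confirm what your DAW actually exported before dropping the file in, the header can be read in a few lines. This is a minimal sketch using Python's standard `wave` module, not anything StarWhisper ships; the function name `describe_wav` is hypothetical. One caveat: `wave` reads integer PCM only, so it will raise an error on 32-bit float WAVs.

```python
import wave

def describe_wav(path):
    """Return basic format facts read from an integer-PCM WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }
```

Calling `describe_wav` on a session bounce returns a small dictionary such as sample rate 48000, bit depth 24, channels 2, plus the runtime in seconds.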

3. Wait for the transcription

StarWhisper processes the audio locally on your CPU or GPU. It runs up to roughly 10 times faster than real time on a modern laptop CPU and up to roughly 50 times faster on an NVIDIA GPU with CUDA. A one-hour WAV transcribes in about 6 to 12 minutes on CPU or 1 to 2 minutes on a mid-range GPU. Progress shows in real time.

4. Copy or export the transcript

The transcript appears in the StarWhisper window when processing finishes. Copy the full text to clipboard, save it as a .txt file, or export as SRT or VTT with timestamps for subtitle workflows. Paste into your show notes, episode page, audiobook script, or wherever the transcript needs to go.

Why Local WAV Transcription Beats Cloud Tools

Specific advantages for pro audio workflows.

Full quality, no upload re-encode

Cloud services often re-encode large WAV uploads to a lossy intermediate to save bandwidth. StarWhisper reads the original 24-bit lossless file directly off disk.

No file-size cap

Many cloud transcription services reject WAVs over a few hundred megabytes on free tiers or charge extra above a threshold. StarWhisper handles files up to the 4 GB WAV format limit.
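
To see what the 4 GB format limit means in hours of audio, here is a rough sketch of the arithmetic. It assumes uncompressed PCM and ignores the few dozen header bytes; the function name is hypothetical.

```python
MAX_WAV_BYTES = 2**32 - 1  # largest value a 32-bit size field can hold

def max_duration_hours(sample_rate_hz, bit_depth, channels):
    """Longest PCM recording that fits under the 32-bit size limit."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return MAX_WAV_BYTES / bytes_per_second / 3600

print(round(max_duration_hours(48_000, 24, 2), 1))  # 4.1 hours of 48k stereo 24-bit
print(round(max_duration_hours(48_000, 24, 1), 1))  # 8.3 hours of 48k mono 24-bit
```

Halving the channel count or the sample rate doubles the runtime that fits in a single file.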

Handles any sample rate

96k, 88.2k, 48k, 44.1k, 32k, 24k, 16k, and 8k are all supported. Everything is resampled internally to 16k with a high-quality resampler, so transcription accuracy does not depend on the source rate.

Audio stays on your machine

Default Local Mode runs Whisper on your own CPU or GPU. The WAV is decoded locally, transcribed locally, the result is written to your drive. Useful for NDA studio sessions. Privacy details.

SRT and VTT export for subtitles

Generate caption-ready files directly from the WAV. Useful for adding accessible transcripts to podcast video versions or building subtitle tracks. Subtitle guide.
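
For readers who have not seen the format, an SRT file is a numbered list of cues, each with a start and end timestamp and the caption text. The sketch below shows that layout only. The `(start, end, text)` tuples are a hypothetical stand-in for transcript segments, not StarWhisper's internal data or export code.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues

print(to_srt([(0.0, 2.5, "Welcome back to the show."), (2.5, 5.0, "Today: WAV files.")]))
```

VTT is close to the same layout: the file starts with a `WEBVTT` header line and the timestamps use a period instead of a comma before the milliseconds.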

GPU acceleration if you have it

NVIDIA GPU owners install the CUDA pack and transcription speed jumps roughly 5x. A one-hour WAV becomes a two-minute job. GPU details.

Why WAV Is the Format Studios and Pro Workflows Default To

WAV (Waveform Audio File Format) is the uncompressed, lossless container that audio professionals have used since the early 1990s. Every DAW exports to WAV. Every studio archive stores masters as WAV. Every voiceover deliverable spec asks for WAV. Every podcast production chain ends in a WAV master before it gets encoded to MP3 or AAC for distribution.

The reason is simple: WAV preserves the original audio bit-for-bit. There is no codec, no compression artifact, no quality decay. When you record a podcast episode, voiceover session, or audiobook narration into your DAW and bounce a 48 kHz 24-bit WAV, that file is the canonical reference for every future decision. You make MP3s for distribution from the WAV, never the other way around.

The problem with transcribing WAVs is that the same properties that make them ideal for pro audio make them awkward for online tools. Files are large (a one-hour 48k stereo 24-bit WAV is around 1.0 GB). Upload times are long. Many cloud transcription services either cap file size on free tiers, charge extra for large files, or silently re-encode the upload to a lossy intermediate. StarWhisper handles WAV as a first-class format on your own machine, free, offline, with no upload step.
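
The file-size figure above follows directly from the format settings. As a quick sketch (assuming uncompressed PCM; the function name is hypothetical), size is duration times sample rate times bytes per sample times channel count:

```python
def wav_size_bytes(duration_s, sample_rate_hz, bit_depth, channels):
    """Size of the raw PCM payload; the WAV header adds a few dozen bytes on top."""
    return int(duration_s * sample_rate_hz * (bit_depth // 8) * channels)

one_hour = wav_size_bytes(3600, 48_000, 24, 2)
print(one_hour)        # 1036800000 bytes
print(one_hour / 1e9)  # 1.0368, about 1.0 GB
```

The same function gives a quick estimate for any other export setting before you bounce the file.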

Sample Rate, Bit Depth, and Channel Count: What Actually Matters for Transcription

Pro audio workflows obsess over sample rate, bit depth, and channel layout for good reason: those choices affect mastering, mixing, and the final listener experience. For transcription specifically, the picture is much simpler.

Sample rate does not affect transcription accuracy

Whisper was trained on 16 kHz audio. StarWhisper internally downsamples your WAV to 16 kHz using a high-quality resampler before running the model, so a 96 kHz WAV produces effectively the same transcript as a 16 kHz version of the same audio. A 16 kHz signal carries frequencies up to 8 kHz, and the speech content that matters for word recognition sits roughly between 100 Hz and 8 kHz. Everything above 8 kHz is fine consonant detail and ambient air that is removed during resampling, so Whisper never sees it. Higher sample rates are wasted on transcription.

Bit depth does not affect transcription accuracy

16-bit, 24-bit, and 32-bit float all transcribe identically. Bit depth controls dynamic range, which matters for music production but not for word recognition. Whisper internally works in 32-bit float regardless of source bit depth.
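
A small illustration of why bit depth washes out: once integer samples are scaled to floating point, a 16-bit and a 24-bit sample of the same signal level land on the same value, and the 24-bit file simply has finer steps between values. This is a generic sketch of PCM normalization, assuming signed little-endian samples as used by 16- and 24-bit WAV. It is not StarWhisper's decoder.

```python
def pcm_to_float(sample_bytes, bit_depth):
    """Decode one signed little-endian PCM sample to a float in [-1.0, 1.0)."""
    value = int.from_bytes(sample_bytes, byteorder="little", signed=True)
    return value / float(2 ** (bit_depth - 1))

# The same half-scale level stored at two bit depths.
half_16 = pcm_to_float((16384).to_bytes(2, "little", signed=True), 16)
half_24 = pcm_to_float((4194304).to_bytes(3, "little", signed=True), 24)
print(half_16, half_24)  # 0.5 0.5
```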

Channel count matters slightly

Whisper is a mono model. StarWhisper downmixes stereo and multi-channel WAVs to mono before processing. For most podcast and voiceover content this is fine. For interview recordings where each speaker is on a separate channel, the cleanest approach is to split the WAV into per-channel mono files using any audio editor, then transcribe each separately. This gives you per-speaker transcripts that you can label and combine in post.
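
If you would rather script the split than open an audio editor, the sketch below writes one mono WAV per channel using Python's standard `wave` module. Treat it as a starting point under stated assumptions: it handles integer PCM only (not 32-bit float), it loads the whole file into memory, and the `split_channels` name and `{ch}` filename placeholder are hypothetical.

```python
import wave

def split_channels(src_path, dst_pattern):
    """Write one mono WAV per channel; dst_pattern must contain a {ch} placeholder."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    frame_size = n_channels * width  # one interleaved frame holds a sample per channel
    out_paths = []
    for ch in range(n_channels):
        mono = bytearray()
        # Pick this channel's sample out of every interleaved frame.
        for start in range(ch * width, len(frames), frame_size):
            mono += frames[start:start + width]
        path = dst_pattern.format(ch=ch + 1)
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(mono))
        out_paths.append(path)
    return out_paths
```

For a stereo interview file, `split_channels("interview.wav", "interview_ch{ch}.wav")` would produce `interview_ch1.wav` and `interview_ch2.wav`, each ready to transcribe on its own.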

What the WAV Workflow Looks Like in Practice

Common pro audio scenarios with realistic timing:

Single podcast episode

Bounce a 60-minute episode as a 48k stereo 24-bit WAV (roughly 1 GB on disk). Drop on StarWhisper. CPU processing time: 6 to 12 minutes. GPU processing time: 1 to 2 minutes. Output: roughly 8,000 words of plain-text transcript suitable for show notes. Light edit, publish. For more detail on the podcast-specific workflow, see voice-to-text for podcasters or how to transcribe podcasts.

Audiobook chapter

Voiced chapter, exported as 48k mono 24-bit WAV, runtime 45 minutes (about 390 MB). Drop on StarWhisper. Output: roughly 6,500 words of chapter text. Useful for cross-checking the narration against the original script to catch fluffs or missed lines.
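
The cross-check itself can be automated once you have the transcript as text. The following is a rough sketch using Python's standard `difflib`; it compares lowercase words split on whitespace, so punctuation differences will show up as mismatches unless you strip them first. The function name is hypothetical.

```python
import difflib

def script_differences(script_text, transcript_text):
    """List (operation, script_words, transcript_words) where the two texts diverge."""
    script_words = script_text.lower().split()
    spoken_words = transcript_text.lower().split()
    matcher = difflib.SequenceMatcher(a=script_words, b=spoken_words, autojunk=False)
    return [
        (tag, " ".join(script_words[i1:i2]), " ".join(spoken_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

print(script_differences("the quick brown fox jumps", "the quick fox jumps"))
# [('delete', 'brown', '')]
```

A `delete` entry marks words in the script that the narrator skipped, `insert` marks words spoken that are not in the script, and `replace` marks a substitution.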

Interview recording with multi-track stems

Three-host interview, each host on a separate WAV channel after the session. Export each channel as a mono WAV. Drop each onto StarWhisper individually. Result: three per-speaker transcripts that you can combine with timecode and speaker labels in post. Total processing time scales linearly: three 60-minute mono WAVs take about 18 to 36 minutes on CPU or 3 to 6 minutes on GPU.
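
The linear scaling can be checked with simple arithmetic. In this sketch the speed factors (5x to 10x real time on CPU, 30x to 60x on GPU) are assumptions back-derived from the timings quoted in this guide, not measured benchmarks, and the function name is hypothetical.

```python
def batch_minutes(durations_min, realtime_factor):
    """Total processing time when transcription runs realtime_factor times faster than playback."""
    return sum(durations_min) / realtime_factor

stems = [60, 60, 60]  # three 60-minute mono stems

print(batch_minutes(stems, 10), batch_minutes(stems, 5))   # 18.0 36.0 minutes on CPU
print(batch_minutes(stems, 60), batch_minutes(stems, 30))  # 3.0 6.0 minutes on GPU
```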

Voiceover session deliverable

Voice actor records 10 takes of a 30-second commercial spot. Each take exported as a separate WAV. Drop the folder on StarWhisper. The app queues all takes and produces a labeled transcript file for each. Useful for the agency review pass to compare take-to-take wording.

WAV vs MP3: When to Transcribe From Which

Many people end up with both: a WAV master and an MP3 distribution copy. For transcription, the answer is almost always the WAV.

The WAV is the closest to source, so it has the cleanest signal Whisper can work with. The MP3 has been through a lossy codec, which mildly muddies high-frequency consonant edges and adds compression artifacts in transient sections. Whisper handles both well, but on borderline audio (heavy accents, fast speech, technical vocabulary) the WAV can produce a slightly more accurate transcript.

The exception is when the WAV is large and inconvenient and the MP3 is good enough. For a clean studio podcast episode where the source mic was solid and the room was treated, the MP3 transcript is functionally identical to the WAV transcript. Use whichever is faster to grab. If you want the transcript for archival or accessibility purposes, use the WAV. For a related workflow starting from MP3, see how to convert MP3 to text. If your source is an iPhone voice memo or a QuickTime export, see how to convert M4A to text.

Privacy: Why Local WAV Transcription Matters for Studio Work

Studio WAV files often contain content that should not sit on a third-party server:

  • Unreleased music or podcast pre-mixes under NDA with the label or network
  • Interview source recordings where the subject signed a confidentiality agreement
  • Voiceover and audiobook narration in production for a client who has not approved release
  • Legal and corporate deposition audio recorded by a court reporter or paralegal
  • Medical and therapy session recordings handled by a clinician
  • Field recordings of human subjects collected under an IRB protocol that prohibits cloud upload

Cloud transcription services upload the WAV to their infrastructure regardless of what is in it. Their privacy policies may be strong, but the file still leaves your control. For the categories above, that is often unacceptable on contract, ethics, regulatory, or legal grounds.

StarWhisper Local Mode keeps everything on your device. The WAV is decoded by the app, the Whisper model runs on your CPU or GPU, the transcript is written to your hard drive. Nothing leaves the machine. For deeper detail, see privacy and offline architecture and how to transcribe audio offline. For specific regulated industries, see HIPAA compliance FAQ or voice-to-text for lawyers.

Pricing: When the Free Tier Is Enough and When Pro Makes Sense

The free tier of StarWhisper provides 500 words per day and 3,500 words per week of transcribed output. A typical 60-minute WAV produces roughly 8,000 words. That means a single long episode exceeds the daily free cap. You can still process the file (no limit on file size or duration itself), but only the first ~500 words count toward today's allocation.

For occasional studio work (one episode every few weeks, the rare voiceover session), the free tier is enough. For production-volume work (weekly podcast episodes, daily audiobook recording sessions, regular interview transcription), the Pro plan removes the cap. It is 10 dollars per month or 80 dollars per year. Full Pro details and pricing. A 7-day free trial of Pro is available if you want to verify the workflow on a long WAV before paying.

Free and Pro use the same Whisper model and produce identical transcripts. Pro just removes the word cap and adds workflow features like custom vocabulary (useful for industry-specific terminology) and priority cloud fallback (if you opt in). For pure WAV transcription, the only practical difference is the daily output ceiling.

Frequently Asked Questions

What sample rates does StarWhisper support for WAV files?
All common professional sample rates: 96 kHz, 88.2 kHz, 48 kHz, 44.1 kHz, 32 kHz, 24 kHz, 16 kHz, and 8 kHz. StarWhisper automatically downsamples to 16 kHz internally (the rate Whisper was trained on) using a high-quality resampler. The downsampling is lossless from a transcription standpoint: Whisper does not gain accuracy from higher sample rates because the speech content it relies on sits below 8 kHz, the highest frequency a 16 kHz signal can carry. Bringing in a 96k studio file produces the same transcript as bringing in a 16k file of the same audio.
What about multi-channel WAVs (stereo, 5.1, ambisonic)?
StarWhisper handles stereo WAV files by downmixing to mono before transcription, since Whisper is a mono model. For straight dialogue recordings, this is fine and often actually improves accuracy because crosstalk between channels averages out. For surround formats (5.1, 7.1, ambisonic), StarWhisper takes the center channel where dialogue normally lives. If your multi-channel WAV has different speakers on different channels and you want per-speaker transcripts, the cleanest approach is to split the WAV into per-channel mono files (any audio editor can do this) and transcribe each separately.
Does StarWhisper handle 24-bit vs 16-bit WAV any differently?
Both work, no extra steps. Bit depth affects the dynamic range that can be represented in the file but does not affect Whisper's transcription accuracy in any meaningful way. A 24-bit studio recording transcribes identically to a 16-bit version of the same audio (Whisper internally converts to 32-bit float anyway during processing). 32-bit float WAV files are also supported. The format choice on the recording side should be driven by your audio engineering needs, not by transcription concerns.
How long does it take to transcribe a one-hour WAV file?
Roughly 6 to 12 minutes on a typical Windows laptop CPU, and 1 to 2 minutes on an NVIDIA GPU with CUDA enabled. The format (WAV, MP3, M4A) does not affect processing time; only the audio duration does. WAV files are larger on disk (a one-hour 48k stereo 24-bit WAV is about 1.0 GB) but read just as fast from a local SSD. Cloud transcription services often charge extra or refuse to accept WAVs over a certain size; local transcription has no such limit.
Will this handle pro studio recordings (podcasts, voiceover, audiobook sessions)?
Yes, and they are the best-case scenario for accuracy. Studio recordings made in a treated room with a quality condenser mic and a single clear speaker are exactly what Whisper handles best. Expect 97 to 99 percent accuracy on standard English. Multi-host podcasts and interview recordings are also strong, especially when each speaker is on a separate channel that you can transcribe individually. Voiceover sessions, audiobook narration, and clean podcast dialogue produce near-publication-quality transcripts that often need only light copy editing.
Can I export SRT or VTT for subtitles?
Yes. StarWhisper supports SRT and VTT subtitle export with per-segment timestamps. This is useful for adding captions to video, building accessible podcast pages with synchronized transcripts, or generating subtitle tracks for streaming-platform uploads. The timestamps are derived from the same processing pass that produces the transcript, so they line up correctly with the audio. For a workflow that focuses specifically on the video-subtitle case, see the related guide on how to add subtitles to video for free.
Is there a file-size limit on the WAV?
No hard limit imposed by StarWhisper. WAV files have a 4 GB cap built into the format itself (because of the 32-bit size field in the WAV header), so a single WAV cannot exceed roughly 4 hours at 48k stereo 24-bit, or roughly 8 hours at 48k mono 24-bit. For longer single recordings, use RF64, the 64-bit extension of Broadcast Wave Format (plain BWF shares the 4 GB limit), or split the file. StarWhisper handles files up to the format limit without issue; only free-tier word-count caps on the transcript output apply.
Does the audio leave my computer?
No, not in default Local Mode. The WAV is decoded by the app, processed by the Whisper model on your CPU or GPU, and the resulting transcript is written to your hard drive. Nothing is uploaded to OpenAI, to StarWhisper, or to any third party. You can verify this by disconnecting from the network before processing a file. This makes StarWhisper a strong fit for studio recordings under NDA, confidential interviews, unreleased music or podcast pre-mixes, and any session audio that should not sit on a third-party server.
Is StarWhisper really free for WAV transcription?
Yes. The free tier provides 500 words per day and 3,500 words per week of transcribed output, with no credit card and no signup wall. For occasional WAV transcription (a podcast episode here and there, a single voiceover session) the free tier is enough. For routine long-form studio work (full episodes weekly, multi-hour audiobook sessions, daily journalism), the Pro plan removes the cap at 10 dollars per month or 80 dollars per year. Free and Pro produce identical transcripts using the same Whisper model.
Will the transcript handle sound effects, music, or non-speech audio?
Whisper transcribes the speech and tends to ignore pure music and ambient sound. If a WAV contains stretches of background music with speech over it, the speech will still be transcribed (sometimes with slightly lower accuracy if the music is loud). Pure music sections often produce empty transcript output or occasional speculative text. For studio podcast or interview recordings, this is rarely a problem since the music typically sits at low volume under the dialogue. Sound effects and non-speech audio are similarly mostly ignored.
What if my WAV has been recorded with a noise floor or hum?
Whisper is fairly robust to mild noise, electrical hum, room tone, and HVAC noise. Accuracy on lightly noisy studio audio (clean mic with manageable background) typically lands in the 92 to 97 percent range, only a few points below pristine. Heavy noise (loud music, multiple loud speakers crosstalking, traffic) drops to 80 to 90 percent. For best results, clean up obvious noise in your DAW before exporting the WAV. A simple noise gate, high-pass filter, and notch on the hum frequency usually buys 1 to 3 points of transcription accuracy at zero quality cost to the audio.
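
This guide does not state how its accuracy percentages are measured. One common convention, assumed here, is word accuracy: 1 minus the word error rate, where the error count is the word-level edit distance between a hand-checked reference and the transcript. The sketch below lets you measure your own recordings that way; the function name is hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 minus word error rate, using word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # reference word missing from transcript
                               current[j - 1] + 1,       # extra word in transcript
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return 1 - previous[-1] / len(ref)

print(word_accuracy("a b c d", "a b x d"))  # 0.75
```

Run it on a short passage you have corrected by hand, before and after cleaning up noise, to see how many points the cleanup buys on your own material.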

Convert Any WAV to Text in Minutes

Free download. Drag studio recordings in, get a full transcript locally at any sample rate. No upload, no per-minute fees.

Download StarWhisper for Windows