We measured whisper.cpp transcription speed across every Whisper model size and three compute engines (CPU, NVIDIA CUDA, and Vulkan) on Windows. The short version: on a modern GPU, offline speech-to-text is effectively instant, and you do not need an NVIDIA card to get there.
On a modern GPU, every model size runs faster than real time, from roughly 35x for tiny down to about 3x for large-v3, so live dictation never lags. Notice how close CUDA and Vulkan are: the cross-vendor Vulkan path is not a downgrade.
| Model | CUDA time (11s clip) | CUDA speed | Vulkan time (11s clip) | Vulkan speed |
|---|---|---|---|---|
| tiny | 0.32s | 34.3x | 0.31s | 35.5x |
| base | 0.41s | 27.2x | 0.42s | 26.5x |
| small (default) | 0.84s | 13.1x | 0.82s | 13.4x |
| medium | 1.91s | 5.7x | 2.02s | 5.4x |
| large-v3 | 3.90s | 2.8x | 3.73s | 3.0x |
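
To put a number on how close the two backends are, the short sketch below computes Vulkan's processing time relative to CUDA's from the table above. The names `GPU_TIMES` and `vulkan_gap_percent` are illustrative only and are not part of whisper.cpp or StarWhisper.

```python
# Seconds to transcribe the 11 s clip, copied from the table above
# as (CUDA, Vulkan) pairs.
GPU_TIMES = {
    "tiny": (0.32, 0.31),
    "base": (0.41, 0.42),
    "small": (0.84, 0.82),
    "medium": (1.91, 2.02),
    "large-v3": (3.90, 3.73),
}


def vulkan_gap_percent(cuda_s: float, vulkan_s: float) -> float:
    """Percent by which Vulkan is slower (+) or faster (-) than CUDA."""
    return (vulkan_s - cuda_s) / cuda_s * 100.0


gaps = {model: vulkan_gap_percent(c, v) for model, (c, v) in GPU_TIMES.items()}
worst_gap = max(abs(g) for g in gaps.values())
vulkan_wins = sum(1 for g in gaps.values() if g < 0)

for model, gap in gaps.items():
    print(f"{model:>8}: {gap:+.1f}%")
print(f"largest gap: {worst_gap:.1f}%, Vulkan faster on {vulkan_wins} of {len(gaps)} models")
```

On these figures the gap stays under 6% in either direction, and Vulkan is the faster backend on three of the five models.
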
A fast CPU handles the tiny and base models comfortably, but the compute cost climbs steeply with model size: large-v3 takes about 67 times as long as tiny on the same clip. This is exactly why StarWhisper defaults to a right-sized model and uses your GPU when one is available.
| Model | CPU time (11s clip) | CPU speed | Verdict for live dictation |
|---|---|---|---|
| tiny | 1.98s | 5.6x | Comfortably real time |
| base | 4.54s | 2.4x | Real time |
| small | 20.4s | 0.5x | Slower than real time |
| medium | 73.6s | 0.1x | GPU recommended |
| large-v3 | 133.5s | 0.1x | GPU required in practice |
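
As a rough illustration of how much the GPU buys you at each size, the sketch below divides the CPU time by the CUDA time for the same model, using only the figures in the two tables above. The dictionary names are illustrative.

```python
# Seconds to transcribe the 11 s clip, copied from the tables above.
CPU_S = {"tiny": 1.98, "base": 4.54, "small": 20.4, "medium": 73.6, "large-v3": 133.5}
CUDA_S = {"tiny": 0.32, "base": 0.41, "small": 0.84, "medium": 1.91, "large-v3": 3.90}

# Speedup = how many times faster the GPU finishes the same clip.
speedups = {model: CPU_S[model] / CUDA_S[model] for model in CPU_S}

for model, factor in speedups.items():
    print(f"{model:>8}: GPU is {factor:.1f}x faster than CPU")
```

The advantage widens with model size: about 6x for tiny, about 24x for small, and over 30x for medium and large-v3.
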
For interactive voice typing, the small model on a GPU is the sweet spot: near-perfect accuracy for everyday dictation with sub-second latency you never feel. The larger models are worth it for difficult audio or file transcription, but only with a GPU. If you are on a laptop with no discrete GPU, the tiny and base models keep dictation responsive, and StarWhisper picks a sensible default for your machine automatically. Because Vulkan performs like CUDA here, StarWhisper can accelerate on NVIDIA, AMD, and Intel GPUs, not just one vendor.
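One way to reason about a "right-sized" default is to pick the largest model that still clears a chosen speed margin on the available engine. The sketch below does that with the measured speeds from the tables above. It is a hypothetical heuristic for illustration, not StarWhisper's actual selection logic, and the function name `pick_model` and the thresholds used are assumptions.

```python
# Measured real-time speeds from the tables above (GPU column uses the CUDA figures).
SPEED = {
    "cpu": {"tiny": 5.6, "base": 2.4, "small": 0.5, "medium": 0.1, "large-v3": 0.1},
    "gpu": {"tiny": 34.3, "base": 27.2, "small": 13.1, "medium": 5.7, "large-v3": 2.8},
}
MODELS_BY_SIZE = ["tiny", "base", "small", "medium", "large-v3"]


def pick_model(engine: str, min_speed: float) -> str:
    """Return the largest model whose speed is at least min_speed.

    Hypothetical heuristic: falls back to the smallest model if none qualifies.
    """
    choice = MODELS_BY_SIZE[0]
    for model in MODELS_BY_SIZE:
        if SPEED[engine][model] >= min_speed:
            choice = model
    return choice


# Assumed margins: 2x real time on CPU, 10x on GPU for a snappy feel.
print(pick_model("cpu", 2.0))
print(pick_model("gpu", 10.0))
```

With those assumed margins the heuristic lands on base for CPU-only machines and small for GPUs, which matches the recommendations above. A different margin would shift the choice, for example a 1x floor on the GPU admits large-v3.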
The test audio is the 11-second jfk.wav clip shipped with whisper.cpp, so anyone can run the same test. The engine is whisper-cli, the same engine StarWhisper bundles, in its CPU, CUDA, and Vulkan builds. Timings come from whisper_print_timings, taking the best of two runs per configuration to exclude one-time load variance. Real-time factor = 11.0s of audio divided by processing seconds.
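The real-time factor calculation, including the best-of-two rule, can be sketched as follows. The function name is illustrative, and the second run time in each example call is a made-up slower run included only to show how the best run is selected.

```python
CLIP_SECONDS = 11.0  # length of the jfk.wav clip used in the tests


def real_time_factor(run_seconds: list[float], clip_seconds: float = CLIP_SECONDS) -> float:
    """Audio duration divided by the fastest processing time across runs."""
    if not run_seconds or min(run_seconds) <= 0:
        raise ValueError("need at least one positive run time")
    best = min(run_seconds)  # best of N runs discards one-time load variance
    return clip_seconds / best


# First value is the reported time; the second is a hypothetical slower run.
gpu_small = real_time_factor([0.84, 0.91])
cpu_small = real_time_factor([20.4, 21.0])

print(f"small on CUDA: {gpu_small:.1f}x real time")
print(f"small on CPU:  {cpu_small:.1f}x real time")
```

A factor above 1.0 means the model keeps up with speech as it arrives, and a factor below 1.0 means transcription falls behind the speaker.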
StarWhisper runs Whisper entirely on your own machine, so no audio leaves your device, and it picks the right model and engine for your hardware automatically.