Everything happening between your microphone and the text appearing at your cursor. No hand-waving — just the actual signal chain, algorithms, and design decisions.
// No vibe-coded slop. Every stage of this pipeline is purpose-built — from the DSP chain to the injection layer. No Electron wrappers, no cloud dependencies, no half-measures.
Multi-channel input from any mic configuration is mixed to mono with proper gain staging. The channels are summed and the total is divided by the channel count, preventing the amplitude inflation that degrades transcription on stereo or surround microphones.
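A minimal sketch of that averaging mixdown. The function name and the interleaved sample layout are assumptions for illustration, not details taken from Speakey's source.

```rust
/// Averages interleaved frames down to one mono sample per frame.
/// Assumes samples are interleaved (L, R, L, R, ...); this layout is an assumption.
pub fn mix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        // Sum the frame, then divide by the channel count so the level does not inflate.
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```

Dividing by the channel count keeps a signal that is identical on every channel at its original level instead of doubling it on a stereo mic.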
A slow-moving average estimator (alpha 0.999) continuously tracks and subtracts the DC component. This eliminates the constant voltage offset present in many consumer USB audio interfaces, which biases silence detection and reduces effective dynamic range.
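One way to write such a tracker is a one-pole running mean that is subtracted from each sample. The sketch below uses the alpha of 0.999 quoted above; the struct and method names are hypothetical.

```rust
/// One-pole DC tracker: `dc` follows the long-term mean and is subtracted out.
pub struct DcBlocker {
    alpha: f32, // smoothing factor, e.g. 0.999
    dc: f32,    // current estimate of the DC offset
}

impl DcBlocker {
    pub fn new(alpha: f32) -> Self {
        Self { alpha, dc: 0.0 }
    }

    pub fn process(&mut self, x: f32) -> f32 {
        // Exponential moving average of the input.
        self.dc = self.alpha * self.dc + (1.0 - self.alpha) * x;
        x - self.dc
    }
}
```

With alpha at 0.999 the estimate takes a few thousand samples to settle, which is slow enough that speech itself is not treated as offset.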
A recurrent neural network trained on thousands of noise environments runs on every audio frame in real time. Strips keyboard clatter, fan hum, room reverb, and background conversation while preserving speech. Based on the nnnoiseless Rust port of Xiph.org's RNNoise — no cloud, no GPU needed, negligible CPU cost. Toggleable in Audio settings if you're in a quiet room and want zero processing.
An envelope-following AGC normalizes speech volume after noise suppression. Targets -20 dBFS with up to 10x gain, using separate attack (15 ms) and release (150 ms) time constants. Quiet late-night mumbling gets boosted to the same level as energetic daytime speech — the transcription model sees a consistent signal regardless of how loudly you're speaking. Running post-denoise means the gain calculation operates on clean signal rather than amplified background noise, which is the conventional order for ASR pipelines.
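The AGC described above could look roughly like this. The target, gain cap, and time constants come from the text; the one-pole envelope follower, the coefficient formula, and all names are assumptions made for the sketch.

```rust
/// Envelope-following AGC sketch: -20 dBFS target, 10x gain cap.
pub struct Agc {
    target: f32,   // linear target level (-20 dBFS is 0.1)
    max_gain: f32, // gain ceiling
    attack: f32,   // smoothing coefficient while the level rises
    release: f32,  // smoothing coefficient while the level falls
    envelope: f32, // smoothed absolute level
}

impl Agc {
    pub fn new(sample_rate: f32) -> Self {
        // Standard one-pole coefficient for a time constant given in milliseconds.
        let coeff = |ms: f32| (-1.0 / (ms * 0.001 * sample_rate)).exp();
        Self {
            target: 10f32.powf(-20.0 / 20.0),
            max_gain: 10.0,
            attack: coeff(15.0),
            release: coeff(150.0),
            envelope: 0.0,
        }
    }

    pub fn process(&mut self, x: f32) -> f32 {
        let level = x.abs();
        let c = if level > self.envelope { self.attack } else { self.release };
        self.envelope = c * self.envelope + (1.0 - c) * level;
        // Gain needed to reach the target, never more than the cap.
        let gain = (self.target / self.envelope.max(1e-9)).min(self.max_gain);
        x * gain
    }
}
```

The cap is what stops near-silence from being amplified without limit: a very quiet input only ever gets 10x.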
A first-order high-pass filter removes energy below the speech band — mechanical keyboard vibration, desk rumble, HVAC hum, wind noise, handling noise. Applied at the native device sample rate before resampling for full frequency resolution. The 80 Hz cutoff sits below even the deepest human voice fundamentals (~85 Hz bass).
Power line interference at 50 Hz (Europe, Asia) or 60 Hz (Americas) couples into nearly every consumer microphone — especially USB mics with cheap power circuitry. Two narrow biquad notch filters (Q=20, ~3 Hz wide each) surgically remove both fundamentals while leaving the speech band completely untouched. The 80 Hz HPF rolls off mains hum but doesn't kill it; the notches do. Always-on, both regions covered, no configuration.
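For reference, a notch of this kind can be built from the Bristow-Johnson cookbook notch formulas. The sketch below assumes f64 state and a transposed direct form II structure; neither is confirmed by the text, and the type name is hypothetical.

```rust
use std::f64::consts::PI;

/// Biquad with cookbook notch coefficients, normalized by a0.
pub struct Biquad {
    b0: f64,
    b1: f64,
    b2: f64,
    a1: f64,
    a2: f64,
    z1: f64,
    z2: f64,
}

impl Biquad {
    pub fn notch(sample_rate: f64, freq: f64, q: f64) -> Self {
        let w0 = 2.0 * PI * freq / sample_rate;
        let alpha = w0.sin() / (2.0 * q);
        let cos_w0 = w0.cos();
        let a0 = 1.0 + alpha;
        Self {
            b0: 1.0 / a0,
            b1: -2.0 * cos_w0 / a0,
            b2: 1.0 / a0,
            a1: -2.0 * cos_w0 / a0,
            a2: (1.0 - alpha) / a0,
            z1: 0.0,
            z2: 0.0,
        }
    }

    /// Transposed direct form II update for one sample.
    pub fn process(&mut self, x: f64) -> f64 {
        let y = self.b0 * x + self.z1;
        self.z1 = self.b1 * x - self.a1 * y + self.z2;
        self.z2 = self.b2 * x - self.a2 * y;
        y
    }
}
```

Running `Biquad::notch(fs, 50.0, 20.0)` and `Biquad::notch(fs, 60.0, 20.0)` in series covers both mains frequencies. With Q=20 the bandwidth is roughly f0/Q, about 2.5 Hz at 50 Hz and 3 Hz at 60 Hz.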
A Bristow-Johnson cookbook high-shelf biquad applies +6 dB of gain above 2.5 kHz, the frequency band where plosive bursts (P, T, K at 2–5 kHz), fricatives (F, S at 3–4 kHz), and sibilants live. Consumer webcam and headset mics roll off in exactly this range, which is why dictation tools confuse 'feeds' with 'beads' or 'speed' with 'feed'. The shelf boost restores what cheap mics lose.
A 2:1 ratio compressor with -20 dBFS threshold replaces the simple tanh limiter. Attack is 5 ms (fast enough to catch plosives), release is 50 ms (smooth enough to avoid pumping). Unlike a hard limiter that flattens everything above a threshold, the compressor preserves the natural dynamics of speech while preventing clipping from coughs, desk bumps, or sudden volume changes. A +3 dB makeup gain after compression restores the signal level the compressor pulls down, so the model receives audio at a consistent target level without the squashed sound of aggressive limiting.
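A feed-forward compressor with these settings might be sketched as follows. The ratio, threshold, time constants, and makeup gain are the ones quoted above; the hard-knee gain computer, the envelope detector, and every name are assumptions.

```rust
/// 2:1 compressor sketch with a hard knee and fixed makeup gain.
pub struct Compressor {
    threshold_db: f32,
    ratio: f32,
    makeup_db: f32,
    attack: f32,
    release: f32,
    envelope: f32,
}

impl Compressor {
    pub fn new(sample_rate: f32) -> Self {
        let coeff = |ms: f32| (-1.0 / (ms * 0.001 * sample_rate)).exp();
        Self {
            threshold_db: -20.0,
            ratio: 2.0,
            makeup_db: 3.0,
            attack: coeff(5.0),
            release: coeff(50.0),
            envelope: 0.0,
        }
    }

    pub fn process(&mut self, x: f32) -> f32 {
        let level = x.abs();
        let c = if level > self.envelope { self.attack } else { self.release };
        self.envelope = c * self.envelope + (1.0 - c) * level;
        let level_db = 20.0 * self.envelope.max(1e-9).log10();
        // Amount above the threshold; at ratio R only 1/R of it is let through.
        let over_db = (level_db - self.threshold_db).max(0.0);
        let gain_db = self.makeup_db - over_db * (1.0 - 1.0 / self.ratio);
        x * 10f32.powf(gain_db / 20.0)
    }
}
```

A sustained full-scale input sits 20 dB over the threshold, so it is pulled down 10 dB and then lifted 3 dB by the makeup stage, landing near -7 dBFS. Anything below the threshold only sees the +3 dB makeup.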
A first-order FIR filter (y[n] = x[n] - 0.65·x[n-1]) tilts the spectrum toward high frequencies, boosting the consonant band relative to vowel energy. Standard ASR pre-emphasis uses α=0.97, but Whisper and Parakeet are trained on natural speech — that aggressive a coefficient destroys the signal. The 0.65 coefficient was tuned through iterative live testing to maximize consonant discrimination without degrading overall recognition.
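The difference equation above translates almost directly into code. This sketch carries the previous input sample across blocks so streaming audio stays continuous; the struct and method names are hypothetical.

```rust
/// Streaming pre-emphasis: y[n] = x[n] - coeff * x[n-1].
pub struct PreEmphasis {
    coeff: f32,
    prev: f32, // last input sample of the previous block
}

impl PreEmphasis {
    pub fn new(coeff: f32) -> Self {
        Self { coeff, prev: 0.0 }
    }

    /// Filters a block in place.
    pub fn process_block(&mut self, samples: &mut [f32]) {
        for s in samples.iter_mut() {
            let x = *s;
            *s = x - self.coeff * self.prev;
            self.prev = x;
        }
    }
}
```

At DC the filter gain is 1 - 0.65 = 0.35 and at Nyquist it is 1 + 0.65 = 1.65, a tilt of about 13.5 dB. With 0.97 the same calculation gives roughly 36 dB, which shows how much gentler the 0.65 setting is.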
Audio is resampled from the device's native rate (44.1/48 kHz) to 16 kHz using a windowed-sinc interpolation kernel with a Blackman window. Same class of resampler used in professional audio software. The Blackman window provides -58 dB sidelobe attenuation, eliminating aliasing artifacts that degrade recognition of sibilants and fricatives. For comparison: linear interpolation (used by most dictation tools) aliases at -13 dB.
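To make the idea concrete, here is a straightforward (not optimized) Blackman-windowed sinc resampler. The kernel half-width in input samples, the function names, and the direct per-sample kernel evaluation are assumptions for illustration; a production resampler would typically precompute its kernel.

```rust
use std::f64::consts::PI;

/// Blackman window evaluated at `pos` in [-half_width, half_width].
fn blackman(pos: f64, half_width: f64) -> f64 {
    if pos.abs() >= half_width {
        return 0.0;
    }
    let u = (pos + half_width) / (2.0 * half_width); // 0..1 across the window
    0.42 - 0.5 * (2.0 * PI * u).cos() + 0.08 * (4.0 * PI * u).cos()
}

fn sinc(x: f64) -> f64 {
    if x.abs() < 1e-12 { 1.0 } else { (PI * x).sin() / (PI * x) }
}

/// Resamples `input` from `in_rate` to `out_rate` with a windowed-sinc kernel.
pub fn resample(input: &[f32], in_rate: f64, out_rate: f64, half_taps: usize) -> Vec<f32> {
    let ratio = out_rate / in_rate;
    // When downsampling, lower the cutoff to the new Nyquist to avoid aliasing.
    let cutoff = ratio.min(1.0);
    let half = half_taps as f64;
    let out_len = (input.len() as f64 * ratio).floor() as usize;
    let mut out = Vec::with_capacity(out_len);
    for m in 0..out_len {
        let center = m as f64 / ratio; // output position measured in input samples
        let first = (center - half).ceil().max(0.0) as usize;
        let last = ((center + half).floor() as usize).min(input.len().saturating_sub(1));
        let mut acc = 0.0;
        for k in first..=last {
            let d = k as f64 - center;
            acc += input[k] as f64 * cutoff * sinc(cutoff * d) * blackman(d, half);
        }
        out.push(acc as f32);
    }
    out
}
```

For example, `resample(&samples, 48_000.0, 16_000.0, 48)` keeps a 1 kHz tone essentially intact while strongly attenuating a 12 kHz tone that would otherwise fold back to 4 kHz.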
Before every transcription pass, audio is normalized to a consistent RMS level. This compensates for mic gain variation, speaking distance, and voice volume — whispering close to the mic produces the same signal to the model as speaking normally from two feet away. Gain is capped at 10x to avoid amplifying ambient noise.
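A sketch of capped RMS normalization. The 10x cap is from the text; the target RMS is not stated there, so it is left as a parameter, and the function name is hypothetical.

```rust
/// Scales `samples` in place toward `target_rms`, never applying more than `max_gain`.
pub fn normalize_rms(samples: &mut [f32], target_rms: f32, max_gain: f32) {
    if samples.is_empty() {
        return;
    }
    let mean_sq = samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32;
    let rms = mean_sq.sqrt();
    if rms <= f32::EPSILON {
        return; // pure silence: nothing to scale
    }
    let gain = (target_rms / rms).min(max_gain);
    for s in samples.iter_mut() {
        *s *= gain;
    }
}
```

A buffer at RMS 0.05 normalized to a target of 0.1 gets a gain of 2, while a buffer at RMS 0.001 would need 100x and is held to 10x instead.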
Speakey ships two transcription backends. Faster-whisper uses a Python subprocess with CTranslate2 and CUDA for NVIDIA GPUs. Parakeet TDT uses ONNX Runtime with DirectML for any DX12 GPU (AMD, Intel, NVIDIA) or falls back to CPU. The setup wizard auto-detects your hardware and picks the best engine — you can override it in Settings anytime.
CTranslate2 is a C++ inference engine optimized for Transformer models. All computation runs on the GPU in float16 precision — typically 4-8x faster than OpenAI's reference Whisper implementation. Runs in a separate Python subprocess with a binary multiplexed protocol (stdin for audio + commands, stdout for JSON events). Process isolation means a crash in transcription can't take down the app.
NVIDIA's Parakeet TDT 0.6B v3 model runs via ONNX Runtime with DirectML, supporting any DX12 GPU — AMD, Intel, or NVIDIA — or pure CPU. Written entirely in Rust with no Python dependency. Model files are ~370 MB, downloaded once on first use. Delivers accurate transcription across a wide range of hardware.
Both engines run partial transcription passes while you speak. Faster-whisper runs partials at 150–400 ms intervals depending on model size. Parakeet runs at 250 ms on GPU, 400 ms on CPU. Each pass re-transcribes the full audio buffer, so later partials benefit from increasing context and are progressively more accurate.
Every Whisper segment includes a no_speech_prob score — segments scoring above 0.6 are filtered, eliminating phantom words from ambient noise, breathing, or keyboard sounds. Faster-whisper also runs Silero VAD as a pre-filter to identify speech regions before invoking the full model. Parakeet handles silence through its streaming architecture.
Both partial (200 ms) and final (300 ms) transcription passes append silence padding to the audio buffer before inference. This gives the model room to properly finalize the last spoken word — without it, trailing words are frequently truncated, especially those ending in unvoiced consonants (t, k, p, s).
Audio chunks are only forwarded to the transcription engine when their RMS exceeds a threshold, plus a 150 ms grace period after the level drops so trailing syllables are still captured. This prevents silent audio from accumulating in the buffer, which degrades accuracy when speech finally starts.
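The gating logic can be sketched as a small state machine. The 150 ms hold is from the text; the threshold value, the per-chunk granularity, and the names are assumptions.

```rust
/// RMS gate with a hold period after the level drops below the threshold.
pub struct SpeechGate {
    threshold: f32,
    hold_samples: usize, // 150 ms expressed in samples
    remaining: usize,    // samples of hold time still available
}

impl SpeechGate {
    pub fn new(threshold: f32, sample_rate: usize) -> Self {
        Self { threshold, hold_samples: sample_rate * 150 / 1000, remaining: 0 }
    }

    /// Returns true when the chunk should be forwarded to the engine.
    pub fn should_forward(&mut self, chunk: &[f32]) -> bool {
        let rms = if chunk.is_empty() {
            0.0
        } else {
            (chunk.iter().map(|s| s * s).sum::<f32>() / chunk.len() as f32).sqrt()
        };
        if rms > self.threshold {
            self.remaining = self.hold_samples; // re-arm the hold period
            true
        } else if self.remaining > 0 {
            self.remaining = self.remaining.saturating_sub(chunk.len());
            true
        } else {
            false
        }
    }
}
```

Because the decision is made per chunk, the effective hold rounds up to whole chunks: with 100 ms chunks, two quiet chunks pass after speech ends before the gate closes.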
In toggle mode, words are only injected when they appear at the same position in two consecutive partial transcriptions. This prevents early-word instability ("No" flickering to "Now") from causing incorrect injection while still maintaining real-time output.
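A minimal sketch of this two-pass agreement rule. It assumes agreement is checked as a common prefix (comparison stops at the first mismatch) and that already-injected words are never retracted; the type and method names are hypothetical.

```rust
/// Releases a word only once two consecutive partials agree on it.
#[derive(Default)]
pub struct StableInjector {
    previous: Vec<String>, // words of the last partial
    injected: usize,       // number of words already released
}

impl StableInjector {
    /// Feeds one partial transcript and returns the newly stable words.
    pub fn push_partial(&mut self, partial: &str) -> Vec<String> {
        let current: Vec<String> = partial.split_whitespace().map(String::from).collect();
        // Length of the prefix on which the last two partials agree.
        let agreed = self
            .previous
            .iter()
            .zip(&current)
            .take_while(|(a, b)| a == b)
            .count();
        let new_words = if agreed > self.injected {
            current[self.injected..agreed].to_vec()
        } else {
            Vec::new()
        };
        self.injected = self.injected.max(agreed);
        self.previous = current;
        new_words
    }
}
```

Feeding "No", "Now we", "Now we go" releases nothing for the first two partials and then "Now we" on the third, so the unstable "No" never reaches the cursor.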
Window duration, warmup periods, and silence thresholds are tuned per model size. Tiny models get 15-second windows with 1-partial warmup. Large models get 8-second windows with 2-partial warmup. These profiles balance latency against the model's processing speed.
In toggle mode, speech pauses trigger automatic window flush and reset, keeping transcription windows sentence-sized for optimal accuracy. A separate max-duration cap forces reset on continuous speech to prevent quality degradation from long contexts. A 3-second silence reset prevents dead air from bloating the buffer.
Text is only injected from the final flush result, never from in-flight partial transcripts. This eliminates a class of race conditions where a partial arrives just before the flush response, causing the same text to be typed twice.
Your vocabulary lives in plain text files — one .txt per category, stored in your AppData folder. Add words by typing them, dropping files onto the dictionary panel, or clicking 'learn' on a correction. Categories are just files, so you can share domain-specific word lists (medical, legal, gaming) by copying a single file.
A CSV-based replacement engine lets you define exact text substitutions — fix persistent mistranscriptions, expand abbreviations, or enforce house style. Rules are applied after transcription, before text injection. Supports case-sensitive and whole-word matching.
Click any transcribed word to see correction suggestions from your dictionary. Matching uses three algorithms in parallel: Levenshtein edit distance for typo-like errors, Double Metaphone for phonetic similarity, and Soundex as a fallback. This catches errors that pure string matching misses: 'clod' matches 'Claude' because the two sound alike, even though their spellings differ.
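To show why 'clod' and 'Claude' collide phonetically, here is a sketch of the classic American Soundex encoding, the simplest of the three matchers. It is a textbook version, not Speakey's implementation, and the function names are hypothetical.

```rust
/// Soundex digit for a lowercase consonant; vowels, h, w, and y have no code.
fn code(c: char) -> Option<char> {
    match c {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None,
    }
}

/// Returns the four-character Soundex key, or an empty string if there are no letters.
pub fn soundex(word: &str) -> String {
    let letters: Vec<char> = word
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase())
        .collect();
    let first = match letters.first() {
        Some(&c) => c,
        None => return String::new(),
    };
    let mut out = String::new();
    out.push(first.to_ascii_uppercase());
    let mut last = code(first);
    for &c in &letters[1..] {
        let current = code(c);
        if let Some(digit) = current {
            if current != last {
                out.push(digit);
            }
        }
        // 'h' and 'w' are transparent: they do not separate equal codes.
        if c != 'h' && c != 'w' {
            last = current;
        }
        if out.len() == 4 {
            break;
        }
    }
    while out.len() < 4 {
        out.push('0');
    }
    out
}
```

Both `soundex("clod")` and `soundex("Claude")` produce `C430`, since vowels are dropped and only the C, L, D skeleton is encoded.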
Both transcription engines return per-word confidence scores. Words scoring below 0.7 are automatically checked against your dictionary using tight Levenshtein matching. If a match is found, the word is silently replaced before injection. This means once you add a word to your dictionary, the engine self-corrects without you lifting a finger.
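The correction step could be sketched like this. The 0.7 confidence floor is from the text; the maximum edit distance of 1 standing in for "tight" matching, the case-insensitive comparison, and the names are assumptions.

```rust
/// Classic dynamic-programming Levenshtein distance over characters.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut row = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = row[j] + 1;
            row.push(substitute.min(delete).min(insert));
        }
        prev = row;
    }
    prev[b.len()]
}

const CONFIDENCE_FLOOR: f32 = 0.7; // from the text
const MAX_DISTANCE: usize = 1; // assumed meaning of "tight" matching

/// Replaces a low-confidence word with the closest dictionary entry, if any is close enough.
pub fn correct_word<'a>(word: &'a str, confidence: f32, dictionary: &[&'a str]) -> &'a str {
    if confidence >= CONFIDENCE_FLOOR {
        return word;
    }
    let lowered = word.to_lowercase();
    dictionary
        .iter()
        .copied()
        .map(|entry| (levenshtein(&lowered, &entry.to_lowercase()), entry))
        .filter(|(dist, _)| *dist <= MAX_DISTANCE)
        .min_by_key(|(dist, _)| *dist)
        .map(|(_, entry)| entry)
        .unwrap_or(word)
}
```

Keeping the distance bound small matters here because the replacement is silent: a loose bound would rewrite correct but low-confidence words into unrelated dictionary entries.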
ASR models sometimes generate phantom text from silence or noise — repetitive phrases, YouTube outros, or gibberish. Speakey runs every transcript through multiple detection heuristics before injection: trigram repetition analysis (rejects if >40% of 3-word sequences repeat), bigram diversity scoring (rejects if unique bigram ratio < 0.15), single-word artifact detection, and known phrase matching. This catches the subtle hallucinations that simpler filters miss, like 'Thank you for watching. Thank you for watching. Thank you for...'
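Two of those heuristics can be sketched with plain n-gram counting. The 40% and 0.15 thresholds are from the text; how "repeat" is counted (one minus the share of unique trigrams), the punctuation stripping, and the names are assumptions. The single-word and known-phrase checks are omitted.

```rust
use std::collections::HashSet;

/// Share of distinct n-grams among all n-grams, or None if the text is too short.
fn unique_ratio(words: &[String], n: usize) -> Option<f32> {
    if words.len() < n {
        return None;
    }
    let total = words.len() - n + 1;
    let unique: HashSet<&[String]> = words.windows(n).collect();
    Some(unique.len() as f32 / total as f32)
}

/// Flags transcripts dominated by repeated word sequences.
pub fn looks_hallucinated(text: &str) -> bool {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect();
    // More than 40% of trigrams are repeats of an earlier trigram.
    let trigram_repeat = unique_ratio(&words, 3).map_or(false, |u| 1.0 - u > 0.40);
    // Fewer than 15% of bigrams are distinct.
    let low_bigram_diversity = unique_ratio(&words, 2).map_or(false, |u| u < 0.15);
    trigram_repeat || low_bigram_diversity
}
```

For "Thank you for watching." repeated three times there are 10 trigrams but only 4 distinct ones, so 60% are repeats and the transcript is rejected.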
Inverse Text Normalization converts spoken forms to written forms using regex-based rules. "twenty three dollars and fifty cents" becomes "$23.50", "three forty five PM" becomes "3:45 PM", "sixty five percent" becomes "65%". Handles compound numbers up to 999,999, ordinals, decimals, time expressions, currency, percentages, **emails** ("geisler dot kyle at gmail dot com" → "geisler.kyle@gmail.com"), **URLs with paths** ("https colon slash slash speakey dot io slash buy" → "https://speakey.io/buy"), and **dates with year-words** ("May fifteenth nineteen seventy-seven" → "May 15th, 1977"). Tuned to Parakeet's hybrid output format: even when the model emits literal periods attached to the next word with a leading space, post-processing reconstructs the intended written form. No external model needed; pure pattern matching that runs in microseconds. Toggleable in Dictionary settings.
Common speech disfluencies — "um", "uh", "hmm", "like", "you know", "I mean" — are stripped from the transcript before injection. Configurable in settings. Keeps your dictated text clean without you having to consciously avoid filler words.
A filesystem watcher monitors your dictionary folder for changes. Edit a .txt file in Notepad, drop a word list in, or sync from another machine — Speakey picks up changes within seconds. No restart, no re-import. Cross-category deduplication runs automatically on every change.
CapsLock is the single most useless key on a standard keyboard. It occupies prime home-row real estate — the same width as a Shift key, roughly 2% of total key area — and its only contribution to computing is making people accidentally yell in group chats. It's been freeloading on the home row since the IBM Model M in 1984. Speakey gives it a worthy purpose: hold it to talk. It's right where your left pinky already rests, it's easy to hold, and nobody was using it. The key is fully suppressed by default — it never toggles caps, never lights the LED, just activates dictation. No more accidental SHOUTING IN SLACK. You can remap the hotkey to any key if you want, but honestly, CapsLock had this coming.
The hotkey system uses a raw Win32 low-level keyboard hook via direct FFI to SetWindowsHookExW. Unlike rdev::grab() or RIDEV_INPUTSINK, which get blocked or cause crashes with kernel-level anti-cheat (EAC, BattlEye, Vanguard), this hook coexists peacefully: apart from the dictation hotkey itself, it only observes key events and passes them on unchanged, so it never interferes with game input. Speakey won't get you flagged or banned.
When CapsLock suppression is enabled (it's on by default, because why wouldn't it be), CapsLock key events are consumed by the hook and never reach the OS. The LED never toggles, the state never changes. The key is fully neutralized as a caps toggle and repurposed entirely as a dictation trigger. On startup, if CapsLock was left on from before Speakey was running, the hook forces it off with a synthetic key press/release pair.
For the three people who actually need CapsLock to toggle caps: when suppression is disabled, a fallback mechanism detects CapsLock LED state after key release and fires a synthetic keybd_event pair to reset it. An AtomicBool flag prevents the hook from re-processing its own synthetic events.
Text is posted directly to the focused control's message queue via PostMessageW with WM_CHAR messages. Uses GetGUIThreadInfo to resolve the actual text input control within the foreground window, not just the window itself. Full Unicode support through native UTF-16.
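The UTF-16 side of this is easy to show in isolation. The sketch below only produces the code units; in a real injector each unit would be passed as the wParam of one WM_CHAR message, and that Win32 call is left out here. The function name is hypothetical.

```rust
/// Returns the UTF-16 code units for `text`, one per WM_CHAR message.
/// Characters outside the Basic Multilingual Plane become a surrogate pair.
pub fn wm_char_units(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}
```

An accented letter such as U+00E9 is a single unit, while an emoji such as U+1F600 becomes two units (0xD83D, 0xDE00) that must be posted in order.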
The Rust backend handles audio capture, keyboard hooks, window management, and text injection. When using faster-whisper, CUDA transcription runs in a separate Python subprocess — a crash cannot take down the app, and GPU memory is cleanly freed on subprocess restart. When using Parakeet, the ONNX model runs in a Rust worker thread with the same isolation guarantees.
Audio and control commands share a single stdin pipe using a magic-byte framing protocol (0xCAFEBABE + 4-byte command ID). Sub-millisecond command delivery without interrupting audio flow. Transcription events return as newline-delimited JSON on stdout.
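As an illustration of the framing idea, the sketch below splits a byte stream into raw audio runs and 8-byte command frames. Only the magic value and the 4-byte ID width come from the text; the byte order of both fields, the absence of a payload, and all names are assumptions. It also ignores frames split across reads and the chance of the magic bytes occurring inside audio data, both of which a real protocol has to handle.

```rust
/// Magic marker 0xCAFEBABE, assumed here to be sent big-endian.
const MAGIC: [u8; 4] = [0xCA, 0xFE, 0xBA, 0xBE];

#[derive(Debug, PartialEq)]
pub enum Frame {
    Audio(Vec<u8>),
    Command(u32),
}

/// Encodes a command frame: magic bytes followed by the ID (little-endian, assumed).
pub fn encode_command(id: u32) -> Vec<u8> {
    let mut frame = MAGIC.to_vec();
    frame.extend_from_slice(&id.to_le_bytes());
    frame
}

/// Splits a complete buffer into audio runs and commands.
pub fn demux(stream: &[u8]) -> Vec<Frame> {
    let mut frames = Vec::new();
    let mut audio = Vec::new();
    let mut i = 0;
    while i < stream.len() {
        if stream.len() - i >= 8 && stream[i..i + 4] == MAGIC {
            if !audio.is_empty() {
                // Flush the audio collected so far before emitting the command.
                frames.push(Frame::Audio(std::mem::take(&mut audio)));
            }
            let id = u32::from_le_bytes([
                stream[i + 4],
                stream[i + 5],
                stream[i + 6],
                stream[i + 7],
            ]);
            frames.push(Frame::Command(id));
            i += 8;
        } else {
            audio.push(stream[i]);
            i += 1;
        }
    }
    if !audio.is_empty() {
        frames.push(Frame::Audio(audio));
    }
    frames
}
```

Because commands are recognized inline, the sender can drop a control frame between two audio writes without opening a second pipe.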
The Rust audio callback processes samples in-place through a 12-stage processing chain (mono mixdown, DC removal, RNNoise, AGC, HPF, 50/60 Hz notches, high-shelf EQ, compressor, makeup gain, pre-emphasis, sinc resampler, silence pre-padding) with minimal allocation. The resampler pre-allocates its output buffer. Processed samples are sent to the transcription thread via a lock-free channel.
The entire pipeline — microphone capture through GPU transcription to text insertion — runs locally with zero network calls. No audio data, transcripts, or telemetry leave the machine. The app functions identically with the network adapter disabled.
Both engines support 25 languages: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Swedish, Norwegian, Finnish, Hungarian, Chinese, Japanese, Korean, Arabic, Turkish, Hindi, Vietnamese, Thai, and Indonesian. Language selection is passed to the model at inference time — no separate model download required.
All of this runs on your machine, on your hardware, with your data staying yours.