On-Device Hotword Detection on macOS With CoreML: A Practical Guide
How to ship a private, battery-friendly wake-word detector on macOS. Data collection, a compact CNN, CoreML conversion, AVAudioEngine plumbing, and the gotchas nobody warns you about.
A wake word is a deceptively simple feature. Listen for a phrase, trigger when you hear it, ignore everything else. The moment you try to ship one you realise the challenge is not the model. It is the data, the runtime plumbing, and the tail of weird environments real users have.
This is a practical guide to shipping an on-device hotword detector on macOS with CoreML. It is the same pipeline we use inside Voxlit, adapted to stand alone as a tutorial. If you want the product context, see Voxlit Under the Hood.
The model is the easy part. The hard parts are data, battery, and route changes.
You need more examples than you think.
Good hotword data has three ingredients.
- Positive examples. The wake phrase, said many ways. Different speakers, accents, speaking speeds, distances from the mic. Aim for at least 2,000 clips. More if you can.
- Hard negatives. Phrases that sound almost like the wake word. If your wake word is "Hey Voxlit", collect "Hey, let's", "Hey, box it", "Hey folks, it's...". These are the clips that destroy precision.
- Background. Hours of it. Keyboard typing, cafe noise, music, video calls. The model will spend 99 percent of its runtime on background, so it had better know what background is.
We use a data augmentation pipeline that randomly mixes positives with background at varying SNRs. One real positive turns into twenty usable training samples.
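The mixing step can be sketched in a few lines. This is a minimal illustration, not the Voxlit pipeline: the function names, the 0 to 20 dB SNR range, and the fixed seed are all assumptions made for the example, and clips are plain lists of float samples.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(positive, background, snr_db):
    """Mix a positive clip with equally long background at a target SNR in dB."""
    if len(positive) != len(background):
        raise ValueError("clips must have the same length")
    noise_rms = rms(background)
    if noise_rms == 0.0:
        return list(positive)  # silent background: nothing to mix in
    # Scale the background so 20*log10(rms(speech) / rms(noise)) equals snr_db.
    gain = rms(positive) / (noise_rms * 10 ** (snr_db / 20))
    return [p + gain * b for p, b in zip(positive, background)]

def augment(positive, backgrounds, copies=20, snr_range=(0.0, 20.0), seed=0):
    """Produce `copies` noisy variants of one positive clip.

    Assumes every background clip is at least as long as the positive.
    The SNR range is an illustrative choice, not a value from the article.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(copies):
        bg = rng.choice(backgrounds)
        start = rng.randrange(len(bg) - len(positive) + 1)
        snr_db = rng.uniform(*snr_range)
        segment = bg[start:start + len(positive)]
        variants.append(mix_at_snr(positive, segment, snr_db))
    return variants
```

Calling `augment(clip, backgrounds)` with the default `copies=20` gives the one-to-twenty expansion described above. Drawing a fresh background offset and SNR for each copy is what keeps the variants from being near-duplicates.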
Mel spectrograms, computed at the edge.
Raw audio is a bad input for a small model. Mel filterbank features compress the signal into something a tiny network can learn from.
- Window size 25ms, hop 10ms. Standard for speech.
- 40 mel bins, log compressed.
- Per-utterance mean and variance normalisation.
- Final input is a 40 x 120 spectrogram, roughly 1.2 seconds of context.
The exact same feature extraction runs on the client at inference time. Any mismatch between training and inference features will tank accuracy invisibly.
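The numbers above are easy to sanity-check. The sketch below works out how much audio 120 frames cover, shows one way to do per-utterance normalisation, and adds a helper for comparing training-side and client-side features. The 16 kHz sample rate is an assumption (the article does not state one), and normalising over the whole utterance rather than per mel bin is one common reading of "per-utterance", not necessarily the one used in production.

```python
import math

SAMPLE_RATE = 16_000  # assumption: not stated in the article
WINDOW_MS, HOP_MS, N_FRAMES, N_MELS = 25, 10, 120, 40

def samples_needed(n_frames=N_FRAMES):
    """Samples required to produce n_frames windows with no padding."""
    window = SAMPLE_RATE * WINDOW_MS // 1000  # 400 samples at 16 kHz
    hop = SAMPLE_RATE * HOP_MS // 1000        # 160 samples at 16 kHz
    return (n_frames - 1) * hop + window

def normalise(log_mel):
    """Mean/variance normalisation over one utterance ([frames][mels])."""
    values = [v for frame in log_mel for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # constant input: avoid dividing by zero
    return [[(v - mean) / std for v in frame] for frame in log_mel]

def max_abs_diff(a, b):
    """Largest element-wise gap between two feature matrices of equal shape."""
    return max(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))
```

At 16 kHz, `samples_needed()` is 19,440 samples, or 1.215 seconds, which matches the "roughly 1.2 seconds" figure. Running the same clip through the training and client extractors and asserting that `max_abs_diff` stays below a small tolerance is a cheap way to catch feature drift before it shows up as an accuracy regression.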
Small CNN, no transformer.
For a single keyword, a small depthwise-separable CNN beats anything bigger for this workload. You want low latency, low power, and a model that fits on the Neural Engine.
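import torch.nn as nn

class HotwordCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            # Standard 3x3 conv lifts the single mel channel to 32 feature maps.
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Depthwise 3x3 conv (groups=32) followed by a pointwise 1x1 conv.
            nn.Conv2d(32, 64, 3, padding=1, groups=32), nn.ReLU(),
            nn.Conv2d(64, 64, 1), nn.ReLU(),
            # Global average pool collapses time and frequency to one vector.
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        # x: (batch, 1, 40, 120) log-mel input.
        x = self.backbone(x).flatten(1)
        # Returns one raw logit per clip; no sigmoid is applied here.
        return self.head(x).squeeze(-1)

- Total parameters: around 60K.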
- Quantised size: 220KB.
- Inference on the Neural Engine: under 2ms per window.
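A parameter budget is worth checking by hand before training anything. The sketch below counts weights and biases for the layers in the listing, using the standard formulas for grouped convolutions and linear layers. The helper names are made up for this example.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Weights plus biases of a square-kernel 2D convolution."""
    weights = c_out * (c_in // groups) * kernel * kernel
    return weights + (c_out if bias else 0)

def linear_params(n_in, n_out, bias=True):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

# Layer shapes copied from the HotwordCNN listing above.
layers = {
    "conv 1->32, 3x3": conv2d_params(1, 32, 3),
    "conv 32->64, 3x3, groups=32": conv2d_params(32, 64, 3, groups=32),
    "conv 64->64, 1x1": conv2d_params(64, 64, 1),
    "linear 64->1": linear_params(64, 1),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name:30s} {count:6d}")
print(f"{'total':30s} {total:6d}")
```

For the listing exactly as printed this comes to 5,185 parameters, so the 60K figure presumably describes a wider configuration than the one shown. That reading is an inference, not something the article states. The same helpers make it quick to see how widening the channels moves the total.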
PyTorch to .mlpackage.
import coremltools as ct
import torch

# `model` is a trained HotwordCNN instance.
model.eval()
example = torch.randn(1, 1, 40, 120)
traced = torch.jit.trace(model, example)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="mel", shape=example.shape)],
    compute_units=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.macOS14,
)
mlmodel.save("Hotword.mlpackage")

Set compute_units=ct.ComputeUnit.ALL so CoreML can route to the Neural Engine where available and fall back to the GPU or CPU on older machines.
Quantisation
8-bit weight quantisation shrinks the weights roughly fourfold relative to float32, with no measurable accuracy drop on a binary classifier at this size. Apply it once training has converged, not at the start.
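To see where the fourfold figure comes from, here is a minimal sketch of symmetric linear quantisation: each float32 weight (4 bytes) becomes a single int8 code (1 byte) plus one shared scale. This is a generic illustration of the idea, not the algorithm coremltools uses, and the function names are invented for the example.

```python
def quantise_int8(weights):
    """Map float weights to int8 codes in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantise(codes, scale):
    """Recover approximate float weights from the codes."""
    return [c * scale for c in codes]

def size_bytes(n_weights, bits):
    """Storage for n_weights values at the given bit width."""
    return n_weights * bits // 8
```

The round-trip error of each weight is bounded by half the scale, which is why a small classifier with well-behaved weight ranges tends to survive the conversion. `size_bytes(n, 32)` against `size_bytes(n, 8)` gives the 4:1 ratio directly.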
From audio buffer to trigger.
import AVFoundation
import CoreML

final class HotwordDetector {
    // Load the model once and reuse it for every prediction.
    private let model = try! Hotword(configuration: MLModelConfiguration())
    // MelBuffer is the app's own feature extractor (not shown here).
    private var melBuffer = MelBuffer(windowMs: 1200, hopMs: 200)

    func process(_ buffer: AVAudioPCMBuffer) {
        // Bail out until the mel buffer yields a window to score.
        guard let frames = melBuffer.consume(buffer) else { return }
        let input = HotwordInput(mel: frames.toMLMultiArray())
        if let out = try? model.prediction(input: input), out.score > 0.85 {
            NotificationCenter.default.post(name: .hotwordFired, object: nil)
        }
    }
}

- Run the model on every 200ms hop. Overlapping windows catch phrases that straddle frame boundaries.
- Debounce triggers. After a fire, ignore new triggers for 2 seconds so a single utterance does not stack.
- Cache the MLModel instance. Reloading it per call is expensive and unnecessary.
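The debounce rule is small enough to sketch in full. The logic is written in Python here to keep it easy to test; the class name and the explicit timestamp argument are assumptions for the example, and a Swift port would be a straightforward translation. The 0.85 threshold and 2-second window come from the text above.

```python
class Debouncer:
    """Turn per-window scores into at most one trigger per quiet period."""

    def __init__(self, threshold=0.85, refractory_s=2.0):
        self.threshold = threshold
        self.refractory_s = refractory_s
        self._last_fire = None  # timestamp of the most recent trigger

    def update(self, score, now_s):
        """Return True if this window should fire a trigger."""
        if score <= self.threshold:
            return False
        recently_fired = (
            self._last_fire is not None
            and now_s - self._last_fire < self.refractory_s
        )
        if recently_fired:
            return False  # still inside the quiet period after the last fire
        self._last_fire = now_s
        return True
```

With a 200ms hop, one utterance usually produces several consecutive windows above threshold. Only the first of them returns `True`; the rest fall inside the quiet period and are dropped. Passing the timestamp in explicitly, rather than reading a clock inside the class, keeps the behaviour deterministic in tests.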
What nobody tells you.
- Bluetooth headsets change the input sample rate on connect and disconnect. AVAudioSession is not available to native macOS apps, so observe the AVAudioEngineConfigurationChange notification instead and rebuild the engine when the rate changes.
- Microphone permission on macOS is per-app and cached by tccd. If you change the app bundle ID mid-development, you will have to reset permissions, for example with tccutil reset Microphone.
- The Neural Engine is not available on Intel Macs. Your CPU path matters. Benchmark it.
- Screen savers and sleep states pause AVAudioEngine without telling you. Observe NSWorkspace.didWakeNotification alongside the engine's configuration-change notification, and restart on wake.
- Users with thick accents will hate a single-threshold detector. Ship a "sensitivity" slider that maps to the decision threshold and let them tune it.
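One way to wire up the sensitivity slider is a piecewise-linear map from slider position to decision threshold. In this sketch the middle of the slider lands on the 0.85 threshold used earlier; the strict and lenient endpoints (0.95 and 0.60) are illustrative assumptions and should be tuned against your own false-accept data.

```python
def threshold_for(sensitivity, strict=0.95, default=0.85, lenient=0.60):
    """Map a slider value in [0, 1] to a decision threshold.

    0.0 maps to `strict`, 0.5 to `default`, 1.0 to `lenient`,
    with linear interpolation in between. Out-of-range input is clamped.
    """
    s = min(1.0, max(0.0, sensitivity))
    if s <= 0.5:
        return strict + (default - strict) * (s / 0.5)
    return default + (lenient - default) * ((s - 0.5) / 0.5)
```

Higher sensitivity lowers the threshold, so the detector fires more readily at the cost of more false accepts. Pinning the midpoint to the shipped default means users who never touch the slider get exactly the behaviour you tested.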
Two numbers, always.
Track two metrics: false accepts per hour of background audio, and miss rate on real positive utterances. Everything else is noise.
- Target: fewer than 1 false accept per 24 hours of background. Users find frequent false triggers unforgivable.
- Target: under 5 percent miss rate on clean positive samples. Users forgive the occasional miss because they can simply repeat.
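Both numbers are simple ratios, which makes them easy to compute in an evaluation script. A minimal sketch follows; the function names are made up, and the two constants encode the targets listed above.

```python
FA_TARGET_PER_HOUR = 1 / 24  # fewer than 1 false accept per 24 hours
MISS_TARGET = 0.05           # under 5 percent miss rate

def false_accepts_per_hour(false_accepts, background_seconds):
    """False triggers per hour of background-only audio."""
    if background_seconds <= 0:
        raise ValueError("need a positive amount of background audio")
    return false_accepts / (background_seconds / 3600)

def miss_rate(detected, total_positives):
    """Fraction of real positive utterances the detector failed to fire on."""
    if total_positives <= 0:
        raise ValueError("need at least one positive utterance")
    return 1 - detected / total_positives

def meets_targets(false_accepts, background_seconds, detected, total_positives):
    """True only if both targets are met."""
    fa = false_accepts_per_hour(false_accepts, background_seconds)
    return fa < FA_TARGET_PER_HOUR and miss_rate(detected, total_positives) < MISS_TARGET
```

For example, 2 false accepts over 72 hours of background is about 0.028 per hour, inside the target; 4 over the same 72 hours is not. Because the false-accept target is so low, the background set needs to run to many hours before the estimate means anything.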
You have a hotword detector.
The model is small, private, and fast. The data is where you will spend most of your time, and the runtime plumbing is where you will ship most of your bug fixes. If you are building voice-first tools on macOS and want a second opinion on your pipeline, write to hello@eagerhq.com.