
On-Device Hotword Detection on macOS With CoreML: A Practical Guide

How to ship a private, battery-friendly wake-word detector on macOS. Data collection, a compact CNN, CoreML conversion, AVAudioEngine plumbing, and the gotchas nobody warns you about.

By Rajdeep Chaudhari

A wake word is a deceptively simple feature. Listen for a phrase, trigger when you hear it, ignore everything else. The moment you try to ship one you realise the challenge is not the model. It is the data, the runtime plumbing, and the tail of weird environments real users have.

This is a practical guide to shipping an on-device hotword detector on macOS with CoreML. It is the same pipeline we use inside Voxlit, adapted to stand alone as a tutorial. If you want the product context, see Voxlit Under the Hood.

The model is the easy part. The hard parts are data, battery, and route changes.

01 / Data

You need more examples than you think.

Good hotword data has three ingredients.

  • Positive examples. The wake phrase, said many ways. Different speakers, accents, speaking speeds, distances from the mic. Aim for at least 2,000 clips. More if you can.
  • Hard negatives. Phrases that sound almost like the wake word. If your wake word is "Hey Voxlit", collect "Hey, let's", "Hey, box it", "Hey folks, it's...". These are the clips that destroy precision.
  • Background. Hours of it. Keyboard typing, cafe noise, music, video calls. The model will spend 99 percent of its runtime on background, so it had better know what background is.

We use a data augmentation pipeline that randomly mixes positives with background at varying SNRs. One real positive turns into twenty usable training samples.
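The mixing step can be sketched in a few lines. This is a minimal, dependency-free illustration rather than our production pipeline: the function names, the 0 to 20 dB SNR range, and the plain-list sample format are all assumptions made for the example (real code would typically work on NumPy arrays).

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if noise_rms == 0.0:
        return list(speech)  # silent background: nothing to mix in
    # SNR(dB) = 20 * log10(speech_rms / noise_rms), solved for the noise level.
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, noise_pool, copies=20, snr_range=(0.0, 20.0), rng=None):
    """Turn one positive clip into `copies` noisy variants (range is hypothetical)."""
    rng = rng or random.Random()
    variants = []
    for _ in range(copies):
        noise = rng.choice(noise_pool)
        # Pick a random offset so each copy sees a different slice of background.
        start = rng.randrange(len(noise) - len(speech) + 1)
        snr_db = rng.uniform(*snr_range)
        variants.append(mix_at_snr(speech, noise[start:start + len(speech)], snr_db))
    return variants
```

Drawing a fresh offset and SNR per copy is what makes the twenty variants genuinely different from each other instead of twenty near-duplicates.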

02 / Features

Mel spectrograms, computed at the edge.

Raw audio is a bad input for a small model. Mel filterbank features compress the signal into something a tiny network can learn from.

  • Window size 25ms, hop 10ms. Standard for speech.
  • 40 mel bins, log compressed.
  • Per-utterance mean and variance normalisation.
  • Final input is a 40 x 120 spectrogram, roughly 1.2 seconds of context.

The exact same feature extraction runs on the client at inference time. Any mismatch between training and inference features will tank accuracy invisibly.
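Two of the bullets above are easy to get subtly wrong, so here is a small sketch of the frame arithmetic and the normalisation. It rests on assumptions the list does not spell out: that the tail of the clip is padded so every hop starts a frame, and that normalisation is applied per mel bin. The function names are illustrative.

```python
import math


def num_frames(duration_ms, window_ms=25, hop_ms=10, pad_end=True):
    """Number of analysis frames for a clip of `duration_ms`.

    With end padding every hop starts a frame; without it, only windows
    that fit entirely inside the clip are counted.
    """
    if pad_end:
        return math.ceil(duration_ms / hop_ms)
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // hop_ms


def cmvn(frames, eps=1e-5):
    """Per-utterance mean and variance normalisation, one statistic per mel bin.

    `frames` is a list of frames, each a list of log-mel values.
    """
    n, bins = len(frames), len(frames[0])
    means = [sum(f[b] for f in frames) / n for b in range(bins)]
    stds = [
        math.sqrt(sum((f[b] - means[b]) ** 2 for f in frames) / n)
        for b in range(bins)
    ]
    return [
        [(f[b] - means[b]) / (stds[b] + eps) for b in range(bins)]
        for f in frames
    ]


print(num_frames(1200))                 # 120
print(num_frames(1200, pad_end=False))  # 118
```

The two-frame difference between the padded and unpadded counts is exactly the kind of mismatch that stays invisible until accuracy drops on device, so pin the convention down once and share it between training and client.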

03 / Model

Small CNN, no transformer.

For a single keyword, a small depthwise-separable CNN beats anything bigger for this workload. You want low latency, low power, and a model that fits on the Neural Engine.

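python
import torch.nn as nn

class HotwordCNN(nn.Module):
    """Tiny CNN over a (batch, 1, 40, 120) log-mel input; returns one logit per clip."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            # Standard 3x3 conv lifts the single input channel to 32 feature maps.
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Depthwise 3x3 conv: groups=32 gives two filters per input channel.
            nn.Conv2d(32, 64, 3, padding=1, groups=32), nn.ReLU(),
            # Pointwise 1x1 conv mixes information across channels.
            nn.Conv2d(64, 64, 1), nn.ReLU(),
            # Global average pooling collapses the time and frequency axes.
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        x = self.backbone(x).flatten(1)
        # Raw logit; apply a sigmoid to turn it into a probability.
        return self.head(x).squeeze(-1)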

  • Total parameters: around 60K.
  • Quantised size: 220KB.
  • Inference on the Neural Engine: under 2ms per window.

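It is worth sanity-checking a parameter budget by hand. The sketch below counts the weights and biases of the layers exactly as they appear in the listing; the helper names are made up for the example. As written, the listing comes to about 5K parameters, so the 60K figure presumably describes a wider or deeper variant than the one shown.

```python
def conv2d_params(in_ch, out_ch, kernel, groups=1, bias=True):
    """Parameter count of a square-kernel 2D convolution."""
    weights = out_ch * (in_ch // groups) * kernel * kernel
    return weights + (out_ch if bias else 0)


def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


layers = {
    "conv 1->32, 3x3": conv2d_params(1, 32, 3),
    "depthwise 32->64, 3x3, groups=32": conv2d_params(32, 64, 3, groups=32),
    "pointwise 64->64, 1x1": conv2d_params(64, 64, 1),
    "linear 64->1": linear_params(64, 1),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count}")
print("total:", total)  # total: 5185
```

Almost all of the weight sits in the pointwise convolution, which is typical for depthwise-separable designs and tells you where extra width will cost the most.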
04 / CoreML conversion

PyTorch to .mlpackage.

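python
import coremltools as ct
import torch

# `model` is the trained HotwordCNN. It emits a raw logit, so append a sigmoid
# to export a probability that can be compared against a 0-1 threshold.
export_model = torch.nn.Sequential(model, torch.nn.Sigmoid()).eval()
example = torch.randn(1, 1, 40, 120)  # (batch, channel, mel bins, frames)
traced = torch.jit.trace(export_model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="mel", shape=example.shape)],
    # Name the output so the generated Swift class exposes it as `score`.
    outputs=[ct.TensorType(name="score")],
    compute_units=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.macOS14,
)
mlmodel.save("Hotword.mlpackage")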

Set compute_units=ct.ComputeUnit.ALL so CoreML can route to the Neural Engine where available and fall back to the GPU or CPU on machines without one.

Quantisation

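8-bit weight quantisation shrinks the weights roughly fourfold relative to float32 (twofold relative to the float16 precision an ML Program uses by default), with no measurable accuracy drop on a binary classifier at this size. Do it at the end of training, not the start.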
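To make the size claim concrete, here is what symmetric linear 8-bit quantisation does to a weight tensor. This is a plain-Python illustration of the idea, not the coremltools API, and the sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric linear quantisation: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [scale * v for v in q]


weights = [0.5, -0.25, 0.125, -1.0, 0.0]  # hypothetical float32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding moves each weight by at most half a quantisation step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))

float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)  # plus one float scale per tensor
print(float32_bytes // int8_bytes)  # 4
```

The per-weight error is bounded by half a step, which is why a small binary classifier with well-behaved weight ranges tends to survive the compression untouched.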

05 / Swift plumbing

From audio buffer to trigger.

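swift
import AVFoundation
import CoreML

extension Notification.Name {
  static let hotwordFired = Notification.Name("hotwordFired")
}

final class HotwordDetector {
  // Load once and reuse. `Hotword` is the class Xcode generates from Hotword.mlpackage.
  private let model = try! Hotword(configuration: MLModelConfiguration())
  // `MelBuffer` is a feature-extraction helper (not shown): 1.2 s windows every 200 ms.
  private var melBuffer = MelBuffer(windowMs: 1200, hopMs: 200)
  private var lastFire = Date.distantPast

  func process(_ buffer: AVAudioPCMBuffer) {
    guard let frames = melBuffer.consume(buffer) else { return }
    let input = HotwordInput(mel: frames.toMLMultiArray())
    // Assumes the model output is a one-element MLMultiArray named `score`.
    guard let out = try? model.prediction(input: input),
          out.score[0].floatValue > 0.85,
          Date().timeIntervalSince(lastFire) > 2 else { return }  // 2 s debounce
    lastFire = Date()
    NotificationCenter.default.post(name: .hotwordFired, object: nil)
  }
}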

  • Run the model on every 200ms hop. Overlapping windows catch phrases that straddle frame boundaries.
  • Debounce triggers. After a fire, ignore new triggers for 2 seconds so a single utterance does not stack.
  • Cache the MLModel instance. Reloading it per call is expensive and unnecessary.

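The hop and debounce logic is independent of Swift, so it is easy to prototype and unit-test off device. Here is a minimal Python sketch, assuming 10ms feature frames (so a 1.2 second window is 120 frames and a 200ms hop is 20). The class names are hypothetical and do not mirror any real API.

```python
from collections import deque


class SlidingWindow:
    """Keeps the latest `window` frames and emits a full window every `hop` frames."""

    def __init__(self, window=120, hop=20):
        self.window = window
        self.hop = hop
        self.frames = deque(maxlen=window)  # old frames fall off the left
        self.since_emit = 0

    def push(self, frame):
        """Add one frame; return a full window when one is due, else None."""
        self.frames.append(frame)
        self.since_emit += 1
        if len(self.frames) == self.window and self.since_emit >= self.hop:
            self.since_emit = 0
            return list(self.frames)
        return None


class Debouncer:
    """Fires on a score above `threshold`, then stays quiet for `hold_off_s`."""

    def __init__(self, threshold=0.85, hold_off_s=2.0):
        self.threshold = threshold
        self.hold_off_s = hold_off_s
        self.last_fire = None

    def update(self, score, now_s):
        if score <= self.threshold:
            return False
        if self.last_fire is not None and now_s - self.last_fire < self.hold_off_s:
            return False  # still inside the hold-off from the previous trigger
        self.last_fire = now_s
        return True
```

Passing the timestamp in, instead of reading the clock inside `update`, keeps the debounce deterministic under test.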
06 / Gotchas

What nobody tells you.

  • Bluetooth headsets change the input sample rate on connect and disconnect. AVAudioSession and its routeChangeNotification are not available to native macOS apps, so observe the engine's AVAudioEngineConfigurationChange notification instead and rebuild the engine when the rate changes.
  • Microphone permission on macOS is per-app and cached by tccd. If you change the app bundle ID mid-development, you will have to reset permissions, for example with tccutil reset Microphone followed by the bundle ID.
  • The Neural Engine is not available on Intel Macs. Your CPU path matters. Benchmark it.
  • Screen savers and sleep states pause AVAudioEngine without telling you. Observe NSWorkspace.didWakeNotification (posted on NSWorkspace.shared.notificationCenter) alongside AVAudioEngineConfigurationChange, and restart the engine on wake.
  • Users with thick accents will hate a single-threshold detector. Ship a "sensitivity" slider that maps to the decision threshold and let them tune it.

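The slider itself can be a one-line mapping. A minimal sketch, where the 0.95 and 0.60 endpoints are hypothetical and should be replaced with values taken from your own evaluation data:

```python
def threshold_for_sensitivity(sensitivity, strict=0.95, lenient=0.60):
    """Map a slider value in [0, 1] to a decision threshold.

    Higher sensitivity means a lower threshold, so the detector fires more
    readily. Out-of-range slider values are clamped.
    """
    s = min(1.0, max(0.0, sensitivity))
    return strict - s * (strict - lenient)
```

With these endpoints, the 0.85 threshold used earlier sits a little under a third of the way along the slider, which leaves most of the travel for users who need the detector to be more forgiving.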
07 / Measuring

Two numbers, always.

Track two metrics. False accepts per hour of background audio, and miss rate on real positive utterances. Everything else is noise.

  • Target: fewer than 1 false accept per 24 hours of background, which is about 0.04 per hour. Users find frequent false triggers unforgivable.
  • Target: under 5 percent miss rate on clean positive samples. Users forgive the occasional miss because they can simply repeat.

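Both metrics are simple ratios, which makes them easy to compute in an evaluation script. The sketch below uses made-up counts purely to show the arithmetic; the function names are illustrative.

```python
def false_accepts_per_hour(num_false_accepts, background_seconds):
    """False triggers per hour of background-only audio."""
    if background_seconds <= 0:
        raise ValueError("need a positive amount of background audio")
    return num_false_accepts / (background_seconds / 3600)


def miss_rate(num_positives, num_detected):
    """Fraction of real wake-word utterances the detector failed to catch."""
    if num_positives <= 0:
        raise ValueError("need at least one positive utterance")
    return 1 - num_detected / num_positives


# Hypothetical evaluation run: 100 hours of background, 2,000 positive clips.
fa_rate = false_accepts_per_hour(3, 100 * 3600)
mr = miss_rate(2000, 1920)

# Targets from the list above: < 1 false accept per 24 h, < 5 percent misses.
meets_targets = fa_rate < 1 / 24 and mr < 0.05
print(meets_targets)  # True
```

Note how much background audio the false-accept target demands: a rate below one per 24 hours cannot be measured with any confidence on just a few hours of recordings.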
08 / Ship it

You have a hotword detector.

The model is small, private, and fast. The data is where you will spend most of your time, and the runtime plumbing is where you will ship most of your bug fixes. If you are building voice-first tools on macOS and want a second opinion on your pipeline, write to hello@eagerhq.com.
