Fixing the “Static of Doom”: A Deep Dive into Real-time PCM Streaming
At Intervu, the goal is to provide a seamless, high-fidelity AI interview experience. But recently, production users reported a critical issue: instead of the clear, natural voice of the AI interviewer, they were met with loud, heavy static.
I called it the “Static of Doom.”
What made it particularly frustrating was that it sounded perfect on localhost. It only broke in production. Here is the post-mortem of how I tracked down a sneaky bug in the audio pipeline (and yes, AI was a great help in debugging it).
The Symptoms
- Clear audio on localhost.
- Immediate “white noise” or heavy static on the production domain (intervu.dev).
- Playback was noticeably slower or “choppy.”
Investigation Phase 1: The Sample Rate & Header Suspicion
The initial theory was that the data format was incorrect. Intervu uses Deepgram for low-latency Text-to-Speech (TTS), requesting raw linear16 PCM at 24kHz.
In production, the AI sometimes sounded like it was speaking in half-speed slow motion, a classic symptom of a 48kHz stream incorrectly being played at 24kHz.
The stream appeared to include an unexpected header—specifically the 44 bytes of a standard WAV header. The hypothesis was that this metadata was being played as audio (causing a static burst) and potentially disrupting the stream alignment or sample rate.
The Fix (That Didn’t Work): I hardened the backend provider to verify stream headers and strip them if present. However, it didn’t solve the core problem: the overwhelming wall of static remained.
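For reference, the header check itself is simple. Below is a minimal sketch of the idea, assuming the backend sees the first TTS chunk as a Node.js Buffer; the helper name stripWavHeader is hypothetical, not Intervu’s actual code.

// Hypothetical helper: drop a canonical 44-byte WAV header if one sneaks in
function stripWavHeader(firstChunk: Buffer): Buffer {
  const looksLikeWav =
    firstChunk.length >= 44 &&
    firstChunk.toString("ascii", 0, 4) === "RIFF" &&
    firstChunk.toString("ascii", 8, 12) === "WAVE";

  // Keep only raw linear16 PCM; 44 bytes is the size of a standard WAV header
  return looksLikeWav ? firstChunk.subarray(44) : firstChunk;
}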
Investigation Phase 2: The “One Byte” Disaster
The static noise persisted as the stream continued. The real mystery: why would it work locally in my dev environment but fail over the internet?
The answer lies in Network MTU and Proxy Buffering.
Real-time audio is sent in chunks. For 16-bit PCM, every “sample” is exactly 2 bytes.
- On localhost, packets are large and arrive intact. Chunks are almost always an even number of bytes (e.g., 4096).
- In production, data passes through Nginx and multiple internet routers. These proxies often re-buffer or “slice” data to optimize throughput.
If Nginx—or any other intervening layer—splits a chunk at an odd byte offset (e.g., sending 1025 bytes), catastrophe strikes.
Because the frontend was expecting pairs of bytes, receiving 1025 bytes meant the last byte was half of a sample. The next chunk would then start with the second half of that sample. Every single sample for the rest of the stream was now “byte-shifted.”
In PCM, if you swap the high and low bytes of a 16-bit integer, you don’t get “slightly worse” audio—you get pure, unbridled noise.
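A tiny illustration (not Intervu code) shows why: reading the very same bytes just one position off turns small sample values into near full-scale garbage.

// Four quiet 16-bit samples and their underlying bytes
const samples = new Int16Array([100, 120, 90, 110]);
const view = new DataView(samples.buffer);

// Correctly aligned read: 100, as expected
const aligned = view.getInt16(0, true);

// Read shifted by one byte: pairs the high byte of one sample with the low byte
// of the next, yielding 30720 on typical little-endian hardware: pure noise
const shifted = view.getInt16(1, true);

console.log(aligned, shifted);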
The Solution: The leftoverBuffer
The fix was a small but crucial piece of logic in the React frontend. I implemented a “residue” or leftoverBuffer for the stream reader.
Now, instead of the audio dissolving into static, the console logs show the system actively “healing” the stream:
⚠️ 🎵 [TTS] Detected odd chunk size (16321 bytes). Buffering 1 byte for alignment.
🎵 [TTS] Prefilling chunk with 1 leftover byte (Alignment Fix)
When a chunk like 16321 bytes arrives (one that ends with a partial 16-bit sample), the trailing byte is sliced off, stored, and “stitched” onto the front of the next chunk.
// If the current chunk has an odd number of bytes, it splits a 16-bit sample
if (value.length % 2 !== 0) {
  // Save the last byte for the next chunk
  leftoverBuffer = value.slice(-1);
  value = value.slice(0, -1);
}
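In context, the reader loop looks roughly like the sketch below. The structure and names are illustrative rather than the exact production code; playPcmChunk stands in for whatever actually hands the bytes to the Web Audio API.

// Sketch of the stream reader with the alignment fix
async function streamPcm(
  response: Response,
  playPcmChunk: (pcm: Uint8Array) => void, // placeholder for the real playback call
): Promise<void> {
  const reader = response.body!.getReader();
  let leftoverBuffer: Uint8Array | null = null;

  while (true) {
    const { done, value } = await reader.read();
    if (done || !value) break;

    let chunk: Uint8Array = value;

    // Stitch the byte saved from the previous chunk onto the front of this one
    if (leftoverBuffer) {
      const stitched = new Uint8Array(leftoverBuffer.length + chunk.length);
      stitched.set(leftoverBuffer, 0);
      stitched.set(chunk, leftoverBuffer.length);
      chunk = stitched;
      leftoverBuffer = null;
    }

    // If the chunk still ends mid-sample, hold the trailing byte for the next one
    if (chunk.length % 2 !== 0) {
      leftoverBuffer = chunk.slice(-1);
      chunk = chunk.slice(0, -1);
    }

    playPcmChunk(chunk); // always an even number of bytes from here on
  }
}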
This ensures that the Web Audio API always receives perfectly aligned pairs of bytes, keeping the “Static of Doom” at bay even when the internet is slicing the data into irregular pieces.
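On the playback side, “perfectly aligned pairs of bytes” simply means each even-length chunk can be decoded as little-endian 16-bit samples and scaled into the float range the Web Audio API expects. A simplified sketch of such a playPcmChunk (scheduling details omitted, and again not the exact production code):

const audioCtx = new AudioContext({ sampleRate: 24000 }); // matches the Deepgram request

function playPcmChunk(pcm: Uint8Array): void {
  if (pcm.byteLength === 0) return;

  const view = new DataView(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  const floats = new Float32Array(pcm.byteLength / 2);

  for (let i = 0; i < floats.length; i++) {
    // linear16 is little-endian; scale samples into the [-1, 1] range
    floats[i] = view.getInt16(i * 2, true) / 32768;
  }

  const buffer = audioCtx.createBuffer(1, floats.length, 24000);
  buffer.copyToChannel(floats, 0);

  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  source.start(); // real code schedules chunks back-to-back instead of starting immediately
}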
Lessons Learned
- Proxies are not transparent: Your local network environment is a lie. Production proxies will slice your streams in ways you didn’t anticipate.
- Alignment is everything: When dealing with raw binary data, a single 1-byte shift is the difference between a voice and a vacuum cleaner.
The “Static of Doom” is now officially dead. Happy interviewing!