VoIP Call Quality Issues: How to Diagnose the Five Failure Areas and Fix Them

When a customer reports a bad call, the first useful thing to know is which kind of bad. A call that never connected is a signaling problem. A call that connected but had no audio is a media path problem. A call that connected, had audio in both directions, and still sounded wrong is what this article covers: an audio quality issue, where every layer underneath the audio appears to be functioning and the audible result is still poor.

Audio quality complaints almost always resolve to a single underlying cause: one of five quality areas is failing along the call path. Each area leaves a different fingerprint in the per-call media stats, each sounds different to the listener, and each has its own fix. The diagnostic discipline is to match the symptom to the area, then apply the fix that addresses that specific cause.

This article picks up where the broader VoIP troubleshooting guide hands off and goes deep on each area. For the operational side of catching these issues before subscribers do, the VoIP monitoring best practices guide is the companion reference.

Key Terms and Concepts
A quick-reference glossary for terms used throughout this article.
MOS (Mean Opinion Score): The 1-to-5 perceptual quality score derived from jitter, packet loss, and one-way delay using the E-model in ITU-T G.107. The companion monitoring guide covers thresholds and per-trunk dashboards.
RTCP (RTP Control Protocol): Carries the per-call quality reports the endpoints and SBC exchange alongside the media. RTCP fields are usually the first evidence to read on a quality complaint. RTCP-XR (extended reports) is used by modern endpoints to pass actual MOS scores and burst metrics.
Jitter: The variation in packet arrival time on the receiving side. Voice codecs expect packets at regular intervals, and variation forces the receiver to compensate or skip.
Jitter buffer: The small receive-side queue that absorbs variation in packet arrival by holding packets briefly before playing them out. Sizing it is a trade-off between dropout and added delay.
PLC (Packet Loss Concealment): The receive-side technique of synthesizing a replacement for a missing packet from the surrounding audio. G.711 typically uses simple PLC methods such as waveform repetition, while low-bitrate predictive codecs often implement more advanced model-based PLC that can give better perceptual results under packet loss.
Tandem encoding: What happens when audio is encoded, decoded, and re-encoded multiple times along the path. Each pass through a lossy codec compounds quality loss.
DSCP (Differentiated Services Code Point): Marks IP packets so intermediate hops can prioritize voice over other traffic. DSCP only helps when every hop along the path honors it.
One-way delay: The mouth-to-ear latency in a single direction. ITU-T G.114 sets the conversational comfort threshold at 150 ms one-way.
ERL (Echo Return Loss): Quantifies how much the unwanted echo signal is attenuated relative to the original. Low ERL on a leg with an analog hybrid is the most common echo cause.
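
To make the MOS entry above more concrete, here is a minimal sketch of how an E-model style estimate can combine delay and loss into a single score. It is a simplification, not the full G.107 calculation: the default rating of 93.2 and the R-to-MOS mapping follow the E-model, the delay term is a commonly used approximation, and the codec constants `ie` and `bpl` are assumed illustrative values (the defaults approximate G.711 with concealment). Jitter only enters indirectly here, through the buffer delay and late-packet loss it causes.

```python
def r_to_mos(r: float) -> float:
    """Map an E-model R-factor to an estimated MOS (G.107 style mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def estimate_mos(one_way_delay_ms: float, loss_pct: float,
                 ie: float = 0.0, bpl: float = 25.1) -> float:
    """Simplified E-model estimate.

    ie  : assumed codec equipment impairment (0 for G.711)
    bpl : assumed packet-loss robustness factor for the codec
    """
    d = one_way_delay_ms
    # Simplified delay impairment: linear, with a steeper slope past ~177 ms.
    delay_impairment = 0.024 * d + (0.11 * (d - 177.3) if d > 177.3 else 0.0)
    # Effective equipment impairment under random (non-bursty) loss.
    ie_eff = ie + (95.0 - ie) * loss_pct / (loss_pct + bpl)
    return r_to_mos(93.2 - delay_impairment - ie_eff)
```

For example, `estimate_mos(50, 0.0)` and `estimate_mos(50, 3.0)` differ only in loss, so comparing them shows how much of a score drop loss alone would account for on an otherwise short path.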

What “Bad Audio” Actually Means: The Five Areas

A complaint of “the call sounded bad” almost always resolves to one of five distinct quality problems. Treating them as separate is the difference between a diagnostic that works and a diagnostic that randomly changes settings until the next ticket arrives.

The five areas are packet loss, jitter, one-way delay, codec and transcoding artifacts, and echo. Each has a different audible signature, a different evidence trail in RTCP and CDR, and a different remediation path. More than one area can fail on the same call, but in operational practice one area is almost always the dominant contributor and fixing it brings the call back to acceptable quality on its own.

The rest of this article walks each area in order: what produces the failure, how it sounds, how to confirm it from the evidence on hand, and what to actually change.

Area 1: Packet Loss

Voice is real-time. There is no retransmission. A packet that does not arrive in time is gone, and the receiver has to either synthesize a replacement or play silence for that interval. Any sustained loss above roughly 1 percent becomes audible, and above 3 percent the call is hard to follow.

The audible signature is clipped words, short silent gaps, occasional pops or onset artifacts, and a “robotic” feel as PLC tries to fill in. Subscribers describe it as “the call kept cutting out” or “every few seconds I lost a word.”

The evidence to read first is the RTCP receiver report from each side, or the equivalent loss percentage in the SBC’s per-call CDR. If loss is above 1 percent on one direction and near zero on the other, the path in the loss-affected direction is the suspect. If loss is bilateral, both directions share a congested hop, usually somewhere in the middle of the carrier path.
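
As a sketch of reading that evidence directly, the snippet below parses one RTCP report block and compares the two directions. The 24-byte layout and the 8-bit fixed-point "fraction lost" field follow RFC 3550; the function names, the direction labels, and the 1 percent default threshold are illustrative choices, and the cumulative-lost field is treated as unsigned for simplicity.

```python
import struct


def parse_report_block(block: bytes) -> dict:
    """Parse one 24-byte RTCP report block (field layout per RFC 3550)."""
    ssrc, lost_word, ext_seq, jitter, lsr, dlsr = struct.unpack("!6I", block[:24])
    return {
        "ssrc": ssrc,
        # Fraction lost is an 8-bit fixed-point value: lost / 256.
        "loss_pct": (lost_word >> 24) / 256.0 * 100.0,
        # Low 24 bits: cumulative packets lost (treated as unsigned here).
        "cumulative_lost": lost_word & 0xFFFFFF,
        # Interarrival jitter, in RTP timestamp units (not milliseconds).
        "jitter_ts_units": jitter,
    }


def loss_pattern(a_to_b_pct: float, b_to_a_pct: float,
                 threshold: float = 1.0) -> str:
    """Classify loss as one-directional, bilateral, or below threshold."""
    a_bad = a_to_b_pct > threshold
    b_bad = b_to_a_pct > threshold
    if a_bad and b_bad:
        return "bilateral: suspect a shared congested hop"
    if a_bad:
        return "one-way: suspect the A-to-B path"
    if b_bad:
        return "one-way: suspect the B-to-A path"
    return "below threshold"
```

A fraction-lost byte of 8, for instance, corresponds to 8/256, or about 3.1 percent loss over the reporting interval.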

The causes cluster into a small set in real operations. WAN congestion on a specific carrier route is the most common, and it usually shows up as loss spiking during business hours on calls routed through that trunk while other trunks stay clean. Microbursts on an oversubscribed link produce the same audible result but with loss bursts too short to see in averages, only visible in PCAP. MTU mismatches and IP fragmentation tend to destroy specific packet sizes systematically. Route flap on the upstream BGP path produces brief but severe loss windows that recur. Wi-Fi on the LAN side is its own loss source and is almost never visible from the SBC.

The fix paths are roughly in this order. Verify DSCP marking is being applied on the SBC interface and that the marking is actually honored by every intermediate hop. Many operators discover during a quality investigation that DSCP was configured years ago and a router upgrade somewhere along the path stopped respecting it. Re-route the affected calls through an alternate trunk to confirm the loss is path-specific, not platform-wide. If the SBC supports it, enable forward error correction (FEC) on the leg facing the lossy network. Look at the load on the upstream link and shed non-voice traffic if voice and bulk transfer share a queue. For Wi-Fi loss, the only real fix is wired Ethernet on the affected handsets, which is a LAN-team conversation rather than an SBC change.
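
One way to check whether marking survives the path is to send marked test traffic from one end and capture it at the other. The sketch below is not SBC configuration; it only shows how a test sender might set DSCP on a UDP socket. It assumes a platform that exposes `socket.IP_TOS` (Linux does), and the choice of EF (DSCP 46) reflects the class commonly used for voice media.

```python
import socket

EF_DSCP = 46  # Expedited Forwarding, commonly used for voice media


def dscp_to_tos(dscp: int) -> int:
    """DSCP occupies the upper six bits of the IP TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def open_marked_udp_socket(dscp: int = EF_DSCP) -> socket.socket:
    """Open a UDP socket whose outgoing packets carry the given DSCP.

    This marks packets at the sender only. Whether intermediate hops
    preserve or honor the marking has to be verified from a capture
    taken at the far end.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

If the far-end capture shows the TOS byte as 0 rather than 184 (0xB8), some hop along the path is re-marking the traffic.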

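A specific note on codec behaviour under loss. G.711 (PCM) sends companded waveform samples with no inter-frame prediction and no concealment built into the bitstream itself, so a lost packet removes exactly that packet’s audio, and any concealment has to come from the receiver (for example, the waveform-repetition PLC described in G.711 Appendix I). G.729 and similar predictive codecs build concealment into the decoder to maintain continuity of the speech signal when frames are lost, but because the decoder carries state from frame to frame, the effect of a lost frame can persist briefly beyond the frame itself.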

Area 2: Jitter

If packet loss is about packets that never arrive, jitter is about packets that arrive at the wrong time. At the typical 20 ms ptime, the sender emits one packet every 20 milliseconds, and the receiver expects them on that cadence. When inter-arrival time varies, the receive-side jitter buffer absorbs the variation up to its configured size and then either drops late packets or stretches the audio to wait for them.

The audible signature is warbly, unstable audio with occasional dropouts, sometimes described as “robotic” or “underwater.” It often coexists with low-level packet loss because a jitter buffer that has exceeded its capacity drops the late-arriving packets, which look like loss in the RTCP report.

The evidence to read is the jitter field in the RTCP receiver report and the CDR. Anything below 20 ms is comfortable for nearly any buffer. Between 20 and 50 ms, perceived quality depends heavily on buffer configuration. Above 50 ms, MOS will drop below 3.5 for most codecs regardless of buffer tuning.
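
The jitter figure in an RTCP report is a running estimate expressed in RTP timestamp units, so it has to be converted before comparing it against millisecond thresholds. The sketch below follows the interarrival jitter estimator described in RFC 3550 (a smoothed average with gain 1/16); the function names and the 8000 Hz default clock rate (narrowband audio) are assumptions, and RTP timestamp wraparound is ignored for brevity.

```python
def rtcp_jitter_to_ms(jitter_ts_units: float, clock_rate: int = 8000) -> float:
    """Convert an RTCP jitter field (timestamp units) to milliseconds."""
    return jitter_ts_units / clock_rate * 1000.0


def interarrival_jitter_ms(packets, clock_rate: int = 8000) -> float:
    """Running jitter estimate in the style of RFC 3550.

    packets: iterable of (arrival_time_seconds, rtp_timestamp) tuples,
             in order of arrival.
    """
    jitter = 0.0
    prev = None
    for arrival_s, rtp_ts in packets:
        arrival_units = arrival_s * clock_rate
        if prev is not None:
            # Difference in transit time between consecutive packets.
            d = (arrival_units - prev[0]) - (rtp_ts - prev[1])
            jitter += (abs(d) - jitter) / 16.0
        prev = (arrival_units, rtp_ts)
    return rtcp_jitter_to_ms(jitter, clock_rate)
```

Because the estimator is smoothed, a short burst of severe jitter moves the reported value only partway; a per-packet view from a capture is still needed to see the peaks.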

The causes are different from packet loss causes, which is part of why the two areas separate cleanly. Variable queue depth on an intermediate hop is the largest source: a router that prioritizes voice when its queue is short but ignores priority when the queue fills. Inconsistent QoS scheduling on a virtualized hop produces the same effect inside a hypervisor. CPU starvation on a virtualized PBX, SBC, or media gateway introduces jitter even when the underlying network is clean, because packets sit in the host’s network stack waiting for vCPU time. Asymmetric routing where the two halves of the call take different paths can produce one-way jitter that the bidirectional RTCP report makes hard to attribute.

The jitter buffer is itself a frequent contributor to perceived quality, in both directions. Under-sized, it drops late packets and produces audible dropouts. Over-sized, it adds latency that the listener then experiences as Area 3 (one-way delay) and reports as a separate problem on a subsequent ticket. Most modern receivers run adaptive jitter buffers that resize within configured bounds. Take a buffer configured with a 200 ms ceiling as an example: it will not simply sit at that ceiling; it adjusts to the jitter actually observed and tends to hover around 60 to 80 ms, the initial depth plus some margin. The ceiling is a cap, not a target, and unnecessary latency only appears when the buffer is poorly tuned or the algorithm aggressively over-buffers.

The fix paths begin with the jitter buffer itself if the receiver is under operator control. Set adaptive bounds that match the actual path profile measured over a representative sample. If the variable hop is identifiable from traceroute or per-hop telemetry, apply or fix QoS on that hop. If the SBC or PBX is virtualized and CPU starvation is suspected, pin the vCPUs and reserve memory, or move the workload to dedicated hardware. If the path crosses a known asymmetric routing region (multi-homed enterprise customers are the usual offender), anchoring media on the SBC splits the call into two distinct, manageable network legs, which isolates the jitter to a specific segment and prevents it from compounding end to end.

Area 3: Latency (One-Way Delay)

End-to-end delay is the quietest of the five areas because it does not necessarily make the audio sound bad. What it does is make the conversation unworkable. Above 150 ms one-way, callers begin to talk over each other; above 250 ms one-way, the conversation feels like a satellite call; above 400 ms, normal speech is impossible. ITU-T G.114 is the reference for the outer figures: it recommends keeping one-way delay at or below 150 ms and treats 400 ms as the upper limit for general network planning. These figures apply to the entire path from mouth to ear, not just the operator’s segment.

The audible signature is not distortion. The caller and callee report awkward pauses, talkover, repeated “are you there?”, and a general sense the conversation is out of sync. They will rarely describe the audio itself as bad.

The evidence to look for is the one-way delay field in the SBC’s quality stats. The operator only sees and controls the segment of the path that crosses the SBC. The remaining segments (LAN-side codec processing, jitter buffer playout, far-end carrier path) have to be estimated or measured separately.

The causes are largely additive. Geographic distance contributes physical propagation delay (roughly 0.5 ms per 100 km one-way on fiber, or about 1 ms per 100 km round trip, and more if the path makes routing detours). Encoding and decoding cycles add 10 to 30 ms per pass depending on the codec, with each transcoding hop adding another full cycle. The jitter buffer itself is a delay element, typically 40 to 100 ms on a well-tuned path and substantially more if it is over-sized. Satellite legs add 240 to 280 ms each way. Mobile access (3G especially, but LTE and 5G as well) adds 50 to 150 ms on the radio side that the wireline operator cannot reduce.
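
Because the contributors add, a simple budget is often enough to see which one dominates. The sketch below is a back-of-the-envelope calculator under stated assumptions: about 5 microseconds per km for one-way propagation in fiber, a per-pass codec delay supplied by the caller, and a 260 ms default per satellite leg (the middle of the range above). All parameter names are illustrative.

```python
FIBER_US_PER_KM = 5.0  # assumed one-way propagation delay in fiber


def one_way_delay_budget(path_km: float, codec_passes: int,
                         codec_ms_per_pass: float, jitter_buffer_ms: float,
                         access_ms: float = 0.0, satellite_legs: int = 0,
                         satellite_ms_per_leg: float = 260.0):
    """Return (total_ms, breakdown) for the additive delay contributors."""
    breakdown = {
        "propagation": path_km * FIBER_US_PER_KM / 1000.0,
        "codec": codec_passes * codec_ms_per_pass,
        "jitter_buffer": jitter_buffer_ms,
        "access": access_ms,
        "satellite": satellite_legs * satellite_ms_per_leg,
    }
    return sum(breakdown.values()), breakdown


# Hypothetical call: 6000 km of fiber, two codec passes at 20 ms each,
# a 60 ms jitter buffer, and 50 ms of mobile access delay.
total_ms, parts = one_way_delay_budget(6000, 2, 20.0, 60.0, access_ms=50.0)
dominant = max(parts, key=parts.get)
```

In this hypothetical case the total comes to 180 ms, over the 150 ms comfort threshold, and the jitter buffer is the largest single contributor, which points at buffer tuning and transcoding hops before anything else.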

The fix paths are about shortening the contributors the operator controls. Reduce transcoding hops by adjusting per-NAP codec policy so the SBC negotiates a common codec where possible instead of bridging two different ones. Deploy a regional SBC closer to the customer base so the operator’s segment of the path is short. Re-tune the jitter buffer if it has been measured as the dominant contributor. For a hosted contact center serving callers across continents, splitting traffic into multiple regional SBCs is often the only path to acceptable conversational delay.

Area 4: Codec and Transcoding Artifacts

Sometimes the call has clean network conditions, normal jitter, low loss, reasonable one-way delay, and the audio still sounds wrong. The remaining suspect is the codec itself, or the chain of codecs the audio has passed through.

The audible signatures vary. Tandem encoding through a low-bitrate codec produces a hollow, processed quality that subscribers describe as “tinny” or “phone-y.” Wideband collapse to narrowband produces a sudden drop in fidelity that is obvious to anyone who has been hearing the same caller on a wideband path previously. CS-ACELP codecs (G.729 family) handle clean voice well and degrade sharply on background noise, music on hold, sibilants, and any in-band signaling like fax tones or in-band DTMF.

The evidence is in the SDP and the CDR. The SBC’s call trace shows the codec negotiated on each leg of the call. If the inbound leg is G.711, the outbound leg is G.729, and a third hop downstream re-encodes back to G.711, that is two transcoding steps on one call, each a full decode and re-encode. The MOS field in the CDR will reflect the compounding loss even though every other quality metric reads clean. The G.711 vs G.729 comparison goes deep on the per-codec MOS arithmetic.

The causes follow from how SDP negotiation works. When each side offers an overlapping codec, the call proceeds without transcoding. When the offers do not overlap, the SBC has to transcode, which means a full decode and re-encode cycle for every 20 ms of audio for the duration of the call. Tandem encoding occurs when the same call is transcoded again at a downstream hop, usually because a different operator’s policy forces a different codec on the next leg. Wideband collapse happens when any narrowband leg is anywhere in the path; Opus or AMR-WB on the endpoints does not survive a G.711 transit segment in the middle.
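
The overlap check described above can be sketched in a few lines. This is a simplified parser for illustration only: it reads the first `m=audio` line, resolves payload types through `a=rtpmap` attributes with a small table of well-known static payload types as fallback, ignores `telephone-event`, and assumes a single audio section per SDP body. The function names are hypothetical and not part of any SBC API.

```python
STATIC_PAYLOADS = {
    "0": "PCMU/8000", "8": "PCMA/8000",
    "9": "G722/8000", "18": "G729/8000",
}


def audio_codecs(sdp: str) -> list[str]:
    """Codec names on the first m=audio line, in preference order."""
    payloads, rtpmap = [], {}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m=audio") and not payloads:
            # m=audio <port> <proto> <payload types...>
            payloads = line.split()[3:]
        elif line.startswith("a=rtpmap:"):
            pt, _, encoding = line[len("a=rtpmap:"):].partition(" ")
            rtpmap[pt] = encoding.upper()
    names = [rtpmap.get(pt, STATIC_PAYLOADS.get(pt)) for pt in payloads]
    return [n for n in names if n and not n.startswith("TELEPHONE-EVENT")]


def needs_transcoding(sdp_leg_a: str, sdp_leg_b: str) -> bool:
    """True when the two legs have no audio codec in common."""
    return not set(audio_codecs(sdp_leg_a)) & set(audio_codecs(sdp_leg_b))
```

Running the same check across every leg shown in a call trace is a quick way to count how many transcoding hops a single call has accumulated.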

The fix paths center on codec policy. Configure per-NAP codec preferences so that the SBC negotiates a common codec wherever both sides support one, eliminating the transcoding hop entirely. For premium routes where bandwidth is not the constraint, force G.711 end-to-end to avoid the perceptual loss of compressed codecs. Where transcoding is unavoidable, push it to a single point in the path and prevent downstream legs from re-encoding. For mobile-to-IP paths, the SBC AMR to G.711 transcoding guide covers the specific quality and DSP considerations.

Area 5: Echo

Echo is the area most likely to be misdiagnosed as something else. Subscribers complain that they hear themselves, or that the far party reports hearing themselves, and the first instinct is often to look at codec or network metrics. None of them will show anything wrong, because echo is a media-path artifact that originates in a different part of the call path than the four areas already covered.

The audible signature is the caller hearing a delayed copy of their own voice. It only affects one direction of the call at a time: the party whose audio is being echoed back to them experiences the problem, and the other party does not. Echo that was always present at low levels becomes audible when latency on the path increases, because the human ear stops perceiving echo when the delay drops below roughly 25 ms (the audio fuses with the original speech) and starts perceiving it sharply above 50 ms.

The evidence is harder to read from RTCP and CDR. MOS may or may not reflect the echo, depending on whether the per-call quality estimation includes echo measurement. The clearer evidence is the one-sided nature of the complaint: party A hears their own voice, party B hears nothing wrong, and the metrics for both halves of the call look normal.

The causes are almost always at the analog boundary. A 2-wire to 4-wire conversion at an analog hybrid produces electrical echo that should be cancelled by an echo canceller on the trunk-side gateway. When the canceller is missing, mis-tuned, or has insufficient tail length for the path, residual echo leaks through. The other source is acoustic echo on the endpoint: an open speakerphone, a poorly-tuned headset, or a handset held away from the ear, with the far end’s audio leaking back into the local microphone.

The fix paths begin with identifying the leg that introduces the hybrid. PSTN gateways and TDM-to-IP boundaries are the usual offenders. Verify the echo canceller is enabled on that leg, that its tail length is configured for the path delay, and that ERL (echo return loss) and ERLE (echo return loss enhancement) measurements are within acceptable ranges. If the echo only appeared on calls that were previously clean, look for a recent change that added latency to the path; the echo was always there at low levels and the new latency unmasked it. For acoustic echo, the only real fix is changing how the endpoint is used or replacing the headset.
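
The ERL and ERLE checks mentioned above reduce to level ratios in decibels. The sketch below assumes RMS levels have already been measured on the affected leg (the signal sent toward the hybrid, the echo coming back before cancellation, and the residual after cancellation); the function and parameter names are illustrative, and no pass or fail thresholds are implied.

```python
import math


def level_ratio_db(reference_rms: float, measured_rms: float) -> float:
    """Attenuation of measured relative to reference, in dB."""
    return 20.0 * math.log10(reference_rms / measured_rms)


def echo_report(sent_rms: float, echo_rms: float, residual_rms: float,
                canceller_tail_ms: float, tail_circuit_rtt_ms: float) -> dict:
    """Summarize echo attenuation and tail-length coverage for one leg."""
    erl = level_ratio_db(sent_rms, echo_rms)        # loss across the hybrid
    erle = level_ratio_db(echo_rms, residual_rms)   # added by the canceller
    return {
        "erl_db": erl,
        "erle_db": erle,
        "total_db": erl + erle,
        # The canceller can only remove echo that returns within its tail.
        "tail_covers_path": canceller_tail_ms >= tail_circuit_rtt_ms,
    }
```

An echo that comes back at one tenth of the sent level corresponds to 20 dB of ERL; if `tail_covers_path` is false, the canceller cannot remove that echo regardless of how well it is tuned otherwise.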

The Diagnostic Workflow

Reading the five areas in order gives a reliable decision tree for any active complaint.

Start with the RTCP and CDR evidence for the affected call. Read four numbers: loss percentage, jitter, one-way delay (or, as an approximation, RTT divided by two), and MOS. If loss is above 1 percent in either direction, the problem is Area 1 and the next step is finding the path the lossy direction takes. If jitter is above 20 ms and trending toward 50 ms, the problem is Area 2 and the next step is identifying the variable hop or the under-tuned jitter buffer. If one-way delay is above 150 ms, the problem is Area 3 and the next step is decomposing the delay into its additive contributors. If loss, jitter, and delay are all normal but MOS is below 3.5, the problem is Area 4, and the call trace will show the codec negotiation on each leg. If the complaint specifies that one party hears themselves while the other does not, and the other metrics are clean, the problem is Area 5, and the analog boundary on the affected direction is the suspect.
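
The decision tree above can be written down as a small function. This is a sketch using the thresholds stated in this article; the function name and arguments are illustrative, checks run in the same order as the paragraph, and `loss_pct_max` is assumed to be the worse of the two directions.

```python
def classify_complaint(loss_pct_max: float, jitter_ms: float,
                       one_way_delay_ms: float, mos: float,
                       one_sided_echo: bool = False) -> str:
    """Map per-call stats to the first area whose threshold is exceeded."""
    if loss_pct_max > 1.0:
        return "Area 1: packet loss"
    if jitter_ms > 20.0:
        return "Area 2: jitter"
    if one_way_delay_ms > 150.0:
        return "Area 3: one-way delay"
    # Network metrics are clean from here on.
    if mos < 3.5:
        return "Area 4: codec and transcoding"
    if one_sided_echo:
        return "Area 5: echo"
    return "Inconclusive: capture PCAP"
```

For example, `classify_complaint(0.2, 8.0, 90.0, 3.1)` points at Area 4, while the same numbers with a MOS of 4.2 and a one-sided "I hear myself" complaint point at Area 5.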

When more than one area is above threshold, fix the dominant contributor first and re-measure. A call with both 2 percent loss and 30 ms jitter will read better after the loss is fixed even if the jitter is unchanged; chasing both simultaneously confuses the evidence.

When the evidence on hand cannot conclusively identify an area, the next escalation is a per-call PCAP on the SBC interface for the duration of the complaint, opened in Wireshark with the RTP analysis tools. PCAP is the ground truth when CDR and RTCP disagree.
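
For cases where a scripted check is more convenient than opening the capture by hand, the sketch below finds loss bursts from RTP sequence numbers, which is the kind of detail that averages in the CDR hide. It assumes the UDP payloads of one RTP stream have already been extracted from the capture by some other tool; the header layout follows RFC 3550, and the function names are illustrative. A packet that arrives late, after its gap has already been counted, still counts as missing here.

```python
import struct


def rtp_sequence(udp_payload: bytes) -> int:
    """Return the 16-bit sequence number from an RTP version 2 header."""
    if len(udp_payload) < 12 or udp_payload[0] >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    return struct.unpack("!H", udp_payload[2:4])[0]


def loss_bursts(seqs) -> list[int]:
    """Lengths of each run of missing packets, in order of occurrence."""
    bursts = []
    prev = None
    for seq in seqs:
        if prev is None:
            prev = seq
            continue
        step = (seq - prev) & 0xFFFF  # handles 16-bit wraparound
        if step == 0 or step >= 0x8000:
            continue  # duplicate, or a late/reordered packet
        if step > 1:
            bursts.append(step - 1)
        prev = seq
    return bursts
```

A stream with low average loss but bursts of three or more consecutive packets will sound considerably worse than the percentage suggests, because concealment works best on isolated single-packet gaps.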

What Does NOT Live in the SBC

The SBC has visibility into the segment of the call that crosses its interfaces and very little visibility into anything else. That boundary matters for two practical reasons.

LAN-side endpoint problems are outside the SBC’s measurement window. A handset on a congested Wi-Fi link, a softphone running on a CPU-starved laptop, a headset with degraded acoustic echo cancellation, or a misconfigured AGC on the endpoint will all produce quality complaints that look unattributable from the SBC. The SBC can prove the call was clean on its interface; the next step is on the LAN team or the endpoint vendor.

Far-side carrier paths are also outside the measurement window. If a carrier downstream is dropping packets in their core, the SBC will see the loss in the inbound RTCP report but will have no detail beyond that. The best package to give a downstream carrier when escalating a core network loss issue is a dual-ended PCAP, or a capture taken right at the SBC ingress/egress boundary. Vague tickets sit in the queue; tickets with concrete evidence move faster. For the broader picture of when and how to package an escalation, the VoIP troubleshooting guide covers carrier escalation in depth.

How ProSBC Helps Resolve VoIP Call Quality Issues Faster

ProSBC is built around the operational reality that quality complaints arrive and have to be diagnosed quickly. Per-call MOS is calculated natively at the SBC, with the underlying jitter, loss, and one-way delay fields exposed in every CDR record for both text and RADIUS export. The breakdown into areas is what makes the diagnostic workflow above usable without external probes or synthesis tools.

Live Wireshark-compatible packet capture can be enabled per call or per NAP without restarting anything, which means the PCAP evidence is available the moment a ticket comes in. SIP call trace shows the negotiated codec on each leg, surfacing tandem encoding and codec mismatches that produce Area 4 complaints. The programmable API routing engine supports per-NAP codec policy, which is the practical control point for managing transcoding hops across a multi-carrier network. Hardware transcoding via Tmedia is available for environments where DSP-grade quality on AMR, Opus, or G.729 transcodes is required at scale.

For providers who want the dashboard without building it themselves, Monitoring as a Service brings the per-call quality metrics into a managed dashboard with real-time alerts. For operators who want the diagnostic work taken off their plate entirely, the Managed Service tier includes ProSBC+ with 1+1 high availability, 24×7 Level 3 support, ongoing configuration changes, and continuous monitoring.

Resolve VoIP Quality Issues Faster with ProSBC

ProSBC is a carrier-grade, software-based Session Border Controller with the diagnostic surface that voice operations need: per-call MOS with the underlying breakdown, live packet capture, full CDR output, web-based SIP trace, and programmable per-NAP codec policy through the API routing engine. The full diagnostic workflow above can be walked on every call without external probes.

ProSBC is available on AWS, Microsoft Azure, VMware, KVM/Proxmox, and bare metal. A 30-day free trial with 500 concurrent sessions provides a full diagnostic environment for evaluation, and the permanent 3-session ProSBC Lab license is available immediately for testing and validation work.

Prefer to evaluate on your own first? Start your 30-day free trial.