VoIP and AI: Voice AI Agents Through SBCs


Voice AI agents have moved from demo videos to live phone numbers. AI receptionists answer calls for dental offices, outbound dialers from Vapi, Retell, and Bland reach thousands of consumers per hour, and contact centers route the first leg of every customer call through a speech-to-text and language-model pipeline before a human ever speaks. The voice AI cluster is the part of the stack that gets the marketing attention. The Session Border Controller is the part that decides whether any of it actually works on the PSTN.

This article covers what changes when the endpoint on the other side of your SIP trunk is a language model instead of a human, where the SBC sits in that flow, and the practical configuration choices that affect latency, attestation, fraud exposure, and capacity planning. It is written for voice infrastructure operators evaluating how to bring AI agents into production, not for AI product teams choosing a voicebot vendor.

Key Terms and Concepts

A quick-reference glossary for terms used throughout this article.

Voice AI agent: An autonomous caller or callee driven by a speech-to-text engine, a language model, and a text-to-speech engine, connected to the phone network through a SIP or WebRTC session. Voice AI agents handle inbound IVR, outbound dialing, and human-agent assist workloads on commercial platforms such as Vapi, Retell, Bland, ElevenLabs, OpenAI Realtime, and LiveKit-class voice clusters.

ASR (Automatic Speech Recognition): The transcription stage of the voice AI pipeline. Converts inbound audio into tokens the language model can read. Round-trip latency from caller voice to ASR output is one of the three components of perceived response delay.

TTS (Text-to-Speech): The synthesis stage that turns the language model’s response into audio sent back to the caller. Modern TTS adds 200–600 ms to total response latency depending on streaming behavior.

Barge-in: Letting a caller interrupt the bot mid-sentence and have the bot stop speaking. Requires full-duplex media handling and voice-activity detection on both legs; the SBC must not anchor the media in a way that breaks the barge-in signal path.

WebRTC-to-SIP bridge: The translation between the AI platform’s preferred transport (often WebRTC with Opus and DTLS-SRTP) and the carrier’s SIP delivery (UDP or TLS, G.711, SDES-SRTP). The SBC performs both transport translation and codec handling at this boundary.

Concurrent session: A single active call leg consuming SBC capacity. For an AI dialer, concurrent session count tracks the burst peak of simultaneous calls, not the daily total. AI dialer bursts behave differently from human contact-center traffic and require different capacity headroom.

Attestation: The A, B, or C level signal carried in a STIR/SHAKEN PASSporT token that indicates how confident the originating provider is in the calling number. AI-originated outbound calls inherit the attestation level of whichever provider signs the call at the network edge.

NAP (Network Access Point): The TelcoBridges term for a configured SIP peer or trunk group inside ProSBC. Each carrier, each AI platform, and each tenant typically gets its own NAP so that codecs, header rules, and routing can be tuned per relationship.

Programmable routing query: An HTTP call ProSBC issues during INVITE processing to ask an external service which destination, which AI agent, or which human queue should handle the call. The query happens before media is anchored, so routing decisions have no audio impact.

Media anchoring: Holding the RTP/SRTP stream at the SBC for the lifetime of the call. Anchored media enables transcoding, fork-for-recording, and per-leg encryption. Pass-through media reduces SBC media load but limits what the SBC can do mid-call.

RFC 4733 / RFC 2833 DTMF: The out-of-band telephone-event encoding used to carry keypad presses. AI IVRs that accept “press 1” need DTMF to survive transcoding and codec changes intact, which is one of the things the SBC manages explicitly per trunk group.

How AI Voice Agents Change the Traffic Pattern

A traditional contact center has a fairly stable curve. Calls arrive over the day, agents log in for shifts, and concurrent-session counts move within a predictable band. AI voice agents introduce several traffic patterns that contact-center capacity planning does not naturally cover, and they all land on the SBC first.

Bursty outbound dialing

An AI outbound campaign can fire several hundred calls per second from a cold start, hold a peak of tens of thousands of concurrent sessions for an hour, then drop to zero. The dialer does not pace itself the way a roomful of humans does. Sizing the SBC against the daily average is the failure mode here; the burst peak is what determines whether your SIP trunk groups, NAP capacity, and CPS limits hold up. The per-NAP CPS limit configured against each carrier is usually the constraint that fails first in production, and it should be set before launch.
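One way to keep a dialer's offered rate under a carrier's CPS limit is to pace call starts with a token bucket. The sketch below is illustrative only: the `CpsLimiter` class and its parameters are hypothetical names, and it is not how ProSBC enforces its per-NAP limit, which is set in the SBC configuration.

```python
from dataclasses import dataclass


@dataclass
class CpsLimiter:
    """Token bucket that admits at most `cps` new calls per second on average."""

    cps: float          # sustained calls per second allowed toward one carrier
    burst: float        # bucket depth: how many calls may start back-to-back
    tokens: float = 0.0
    last: float = 0.0   # timestamp (seconds) of the previous admit() call

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def admit(self, now: float) -> bool:
        """Return True if a new call attempt may be sent at time `now`."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + elapsed * self.cps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Usage: a limiter for one carrier at 10 CPS with a burst allowance of 5.
limiter = CpsLimiter(cps=10, burst=5)
admitted = [limiter.admit(now=0.0) for _ in range(6)]
print(admitted)  # the sixth attempt in the same instant is refused
```

A refused attempt would normally be queued and retried by the dialer rather than dropped, so the campaign slows down instead of tripping the carrier's limit.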

Always-on inbound

An AI receptionist answers every call at every hour, with no after-hours queue and no overflow to voicemail. The SBC’s SIP OPTIONS heartbeat, registration health, and certificate validity all have to hold without a maintenance window, and the same trust-store hygiene that applies to a Teams Direct Routing deployment applies to any TLS endpoint a bot connects to.

Short utterances, long calls

AI agents talk in short, frequent turns. The media stream is many small RTP bursts separated by silence, with voice-activity detection cutting in often. Jitter-buffer behavior tuned for human conversation interacts badly with this pattern, especially when the SBC is also transcoding between codecs. Use the SBC’s per-NAP jitter and codec profile to match what the AI platform actually emits, and verify with packet capture before going live. Guessing at this layer is the most common cause of clipped barge-in.

Voice AI agent topology: ProSBC sits between the PSTN/SIP trunk carrier and the Voice AI platform, terminating TLS/SRTP on both legs, transcoding between G.711 and Opus where needed, and issuing a programmable routing query to the AI orchestrator at INVITE time. The ASR → LLM → TTS pipeline runs inside the voice AI platform.

The Latency Budget for a Conversational AI Call

Conversational AI feels natural when end-to-end response time stays below roughly 800 ms. Past that, the caller starts asking “are you still there?” or talking over the bot. Past 1.5 seconds, the call sounds broken. The latency budget is built from several stages, and the SBC is responsible for two of them.

Where the milliseconds go

A typical round-trip splits roughly as follows: 50–150 ms of network transport from caller to AI cluster, 150–350 ms of ASR to produce a usable transcript, 150–800 ms of language-model inference, 200–600 ms of TTS synthesis, and 50–150 ms back to the caller. The SBC contributes to the first and last stages plus whatever transcoding it performs on each leg.
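Adding up those ranges shows how little slack the 800 ms target leaves. The sketch below uses only the figures quoted above; the stage names and the `budget_range` helper are illustrative.

```python
# Stage ranges in milliseconds (low, high), taken from the breakdown above.
STAGES_MS = {
    "transport_in": (50, 150),
    "asr": (150, 350),
    "llm": (150, 800),
    "tts": (200, 600),
    "transport_out": (50, 150),
}

TARGET_MS = 800  # rough threshold for a natural-feeling response


def budget_range(stages: dict) -> tuple:
    """Return (best_case_ms, worst_case_ms) summed across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = budget_range(STAGES_MS)
print(f"best case {best} ms, worst case {worst} ms, target {TARGET_MS} ms")
print(f"slack at best case: {TARGET_MS - best} ms")
```

The best case sums to 600 ms and the worst case to 2050 ms, so only about 200 ms of slack remains even when every stage performs at its fastest.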

What the SBC controls in that budget

Three SBC settings matter most. Codec choice decides whether the SBC decodes and re-encodes every RTP packet (for example, Opus-to-G.711 transcoding on hardware) or passes the stream through (G.711 on both legs); hardware transcoding adds about 20 ms per leg, while pass-through adds only a few milliseconds. Jitter buffer depth trades latency against tolerance for jitter and late-arriving packets, and AI calls want the shallowest buffer that still survives your carrier’s actual jitter, which is usually shallower than the default setting. Media anchoring versus pass-through decides whether the SBC can fork audio for recording (anchored) or sheds the hop at the cost of giving up mid-call recording (pass-through).

What the SBC cannot fix is the AI platform itself. If a bot feels slow and the SBC is configured cleanly, the problem is upstream. Measure ASR-to-first-token and TTS-to-first-audio on the platform’s own metrics before blaming the network.

Codec Selection at the AI–PSTN Boundary

Codec mismatch is the single most common production issue for new voice AI deployments. The carrier offers G.711 or G.729. The AI platform prefers Opus, sometimes accepts L16 wideband, and may quietly degrade to G.711 if asked. The SBC sits in the middle and decides what gets negotiated where.

What the PSTN side wants

The vast majority of inbound PSTN traffic in North America arrives as G.711 PCMU or PCMA over RTP. SIP trunk providers also frequently offer G.729 to save bandwidth on the trunk side. Both codecs are narrowband and cheap to handle in terms of added delay. ProSBC offers software-native G.711 (ALAW and ULAW); Opus, G.729, and AMR transcoding require hardware DSP via the Ttrans product line. Software transcoding for additional codecs is on the roadmap for end of 2026, so cloud-only deployments needing Opus on the carrier side currently land in the “hardware transcoding required” bucket. Build the deployment plan around that constraint rather than around what is on a future roadmap.

What the AI side wants

Most commercial voice AI platforms accept G.711 directly on a SIP leg and handle the up-sampling internally to their preferred sample rate. WebRTC-based platforms expect Opus and DTLS-SRTP by default, but most also expose a SIP endpoint that takes G.711. The practical pattern is to terminate the carrier on G.711 with SDES-SRTP, terminate the AI on G.711 with TLS/SRTP, and let the SBC handle TLS and SRTP independently on each leg. Transcoding is only required when a specific AI platform refuses G.711 or when audio quality requirements push toward Opus end-to-end.

The DTMF question

AI IVRs that prompt for keypad input rely on DTMF surviving the codec hop. RFC 4733, which obsoletes RFC 2833, sends DTMF as named telephone events in their own RTP payload type rather than as in-band audio tones. If the SBC’s SDP negotiation drops the telephone-event payload type during codec offer/answer, the bot stops hearing “press 1.” Confirm on a test call before launch that the telephone-event payload is negotiated end-to-end and that the AI platform is configured to listen for it.
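A quick way to check a test call is to pull the SDP offer and answer out of a packet capture and confirm that both advertise telephone-event. The sketch below is a minimal check with hypothetical helper names and example addresses; it only looks at `a=rtpmap` lines and does not verify that the payload type also appears on the `m=audio` line.

```python
import re

# Matches lines such as "a=rtpmap:101 telephone-event/8000".
_TEL_EVENT = re.compile(r"^a=rtpmap:(\d+) telephone-event/(\d+)", re.IGNORECASE)


def telephone_event_payloads(sdp: str) -> dict:
    """Map payload type -> clock rate for each telephone-event rtpmap line."""
    found = {}
    for line in sdp.splitlines():
        match = _TEL_EVENT.match(line.strip())
        if match:
            found[int(match.group(1))] = int(match.group(2))
    return found


def dtmf_negotiated(offer: str, answer: str) -> bool:
    """True only when both the offer and the answer advertise telephone-event."""
    return bool(telephone_event_payloads(offer)) and bool(telephone_event_payloads(answer))


OFFER = """v=0
o=- 0 0 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 40000 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
"""

# An answer that silently dropped payload type 101.
ANSWER_WITHOUT_DTMF = """v=0
o=- 0 0 IN IP4 192.0.2.20
s=-
c=IN IP4 192.0.2.20
t=0 0
m=audio 50000 RTP/AVP 0
a=rtpmap:0 PCMU/8000
"""

print(dtmf_negotiated(OFFER, OFFER))                # both sides advertise it
print(dtmf_negotiated(OFFER, ANSWER_WITHOUT_DTMF))  # dropped in the answer
```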

Programmable Routing: Asking the Orchestrator Where the Call Goes

Voice AI deployments rarely use static routing tables. Which agent answers a given call depends on the dialed number, the campaign, time of day, language detection on the first utterance, the caller’s history with the brand, and increasingly, an AI orchestrator that decides in real time whether to route the call to a bot, escalate to a human, or hand off to a different bot specialized for the topic. The SBC needs a way to ask that question without rebuilding the routing table on every change.

The HTTP-query pattern

ProSBC’s REST API call routing integration handles this directly. When an INVITE arrives, the routing script issues an HTTP query to the orchestrator with the calling number, dialed number, NAP identifier, and any other call parameters that matter. The orchestrator returns a JSON response specifying the destination NAP, optional header rewrites, and routing priority. The whole exchange happens in the signaling phase before media negotiation, so it adds nothing to audio latency.
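The shape of that exchange can be sketched as a small request builder and response parser. Every field name here (`calling`, `called`, `nap`, `destination_nap`, `header_rewrites`, `priority`) and the NAP names are hypothetical placeholders for illustration, not ProSBC's actual schema; check the product documentation for the real parameter names.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RouteDecision:
    """Parsed orchestrator answer (field names are illustrative assumptions)."""

    destination_nap: str
    priority: int = 0
    header_rewrites: dict = field(default_factory=dict)


def build_query(calling: str, called: str, nap: str) -> str:
    """Serialize the call parameters the orchestrator needs to pick a route."""
    return json.dumps({"calling": calling, "called": called, "nap": nap})


def parse_decision(body: str) -> RouteDecision:
    """Turn the orchestrator's JSON reply into a RouteDecision.

    Raises KeyError if the mandatory destination is missing, so the caller
    can fall back to a safe default instead of routing blindly.
    """
    data = json.loads(body)
    return RouteDecision(
        destination_nap=data["destination_nap"],
        priority=int(data.get("priority", 0)),
        header_rewrites=dict(data.get("header_rewrites", {})),
    )


query = build_query("+12125550100", "+18005550199", "NAP_CARRIER_A")
reply = '{"destination_nap": "NAP_AI_BOT_1", "priority": 10}'
decision = parse_decision(reply)
print(query)
print(decision)
```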

What this enables for voice AI

A single SBC can serve a dozen AI agent campaigns, each with its own routing logic, without static configuration changes. New campaigns become new entries in the orchestrator’s database rather than new ProSBC NAPs. Failover from a bot to a human happens at the orchestrator without requiring the SBC to reload. A/B testing of two voice models becomes a routing rule, not a deployment change.

Boundaries to set

Keep the orchestrator query timeout tight (500–1500 ms) and define an explicit fallback when the query fails. The fallback should land somewhere safe: a generic IVR, a default human queue, or a busy treatment, depending on the use case. Letting calls hang while waiting for a slow orchestrator response is the worst failure mode here. ProSBC supports primary and secondary URLs for the HTTP query so the orchestrator itself can be deployed redundantly.
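The timeout-and-fallback behavior described above can be sketched as follows. This is an illustration of the pattern, not ProSBC's routing script: the URLs, the `destination_nap` field, and the `NAP_DEFAULT_IVR` fallback name are assumptions, and the `fetch` parameter exists only so the logic can be exercised without a network.

```python
import json
import urllib.request

FALLBACK_NAP = "NAP_DEFAULT_IVR"  # hypothetical safe destination


def http_fetch(url: str, payload: bytes, timeout_s: float) -> bytes:
    """POST the query and return the raw response body."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()


def route_call(urls, payload: bytes, timeout_s: float = 1.0, fetch=http_fetch) -> str:
    """Try each orchestrator URL in order; fall back if none answers usefully.

    Worst-case added signaling delay is len(urls) * timeout_s, so the
    per-URL timeout should be chosen with the number of URLs in mind.
    """
    for url in urls:
        try:
            body = fetch(url, payload, timeout_s)
            return json.loads(body)["destination_nap"]
        except (OSError, ValueError, KeyError, TypeError):
            # Timeout, connection failure, malformed JSON, or missing field:
            # move on to the next URL rather than letting the call hang.
            continue
    return FALLBACK_NAP


# Usage with a stand-in fetcher that simulates a failed primary.
def flaky_fetch(url: str, payload: bytes, timeout_s: float) -> bytes:
    if "primary" in url:
        raise TimeoutError("primary orchestrator too slow")
    return b'{"destination_nap": "NAP_AI_BOT_1"}'


urls = ["https://orchestrator.example/primary", "https://orchestrator.example/secondary"]
print(route_call(urls, b"{}", fetch=flaky_fetch))
```

Returning a fixed fallback rather than raising keeps the decision explicit: the call always lands somewhere, even when both orchestrator endpoints are unreachable.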

Fraud and Trust at the AI Voice Edge

Generative voice models introduce attack patterns the SBC layer has not historically needed to think about. Some are recycled telecom-fraud problems with a new face, some are genuinely new, and the SBC is the right control point for most of them.

Deepfake caller ID and voice cloning

An attacker who clones a CEO’s voice and uses a synthesized PASSporT-signed call against a finance team has a more convincing pretext than any pre-AI vishing campaign. The defense is the same as for any caller-ID spoofing: STIR/SHAKEN attestation at the SBC layer, paired with fraud-scoring partners that catch high-risk calls before they ring. ProSBC’s production STIR/SHAKEN integration uses SIP-based redirect to TransNexus ClearIP or Neustar, where the STI-AS acts as a SIP redirect server and returns a 302 with the Identity header on the success path. The terminating side gets a clear signal about how much the originator vouches for the calling number, which is the only honest answer to a deepfake at the carrier layer.

AI-originated outbound and attestation

Regulators care about how AI-originated outbound calls are attested. A call placed by an AI dialer on behalf of an unverified end customer should not carry A-level attestation; that is a misrepresentation that risks both reputational and FCC exposure. ProSBC’s programmable routing engine sets the attestation level per call, per campaign, or per NAP, so a single SBC handling traffic for multiple AI tenants can sign each tenant’s calls at the level appropriate to that tenant’s verification status. The attestation-level routing pattern covers this in depth.
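The decision itself follows the standard SHAKEN definitions: full attestation (A) when the provider has authenticated the customer and verified their right to use the calling number, partial (B) when only the customer is authenticated, and gateway (C) otherwise. The sketch below encodes that rule; the function name and its boolean inputs are illustrative, and how a given deployment establishes those two facts is outside its scope.

```python
from enum import Enum


class Attestation(Enum):
    FULL = "A"      # customer authenticated and number authorization verified
    PARTIAL = "B"   # customer authenticated, number authorization not verified
    GATEWAY = "C"   # neither could be established


def choose_attestation(customer_authenticated: bool, number_authorized: bool) -> Attestation:
    """Pick the attestation level for one outbound call."""
    if customer_authenticated and number_authorized:
        return Attestation.FULL
    if customer_authenticated:
        return Attestation.PARTIAL
    return Attestation.GATEWAY


# A known tenant dialing from a number that has not been verified for them.
print(choose_attestation(customer_authenticated=True, number_authorized=False).value)
```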

Prompt injection and AI-specific abuse

Voice AI agents have a unique attack surface: a caller can speak instructions that the language model interprets as system commands. The SBC does not solve prompt injection directly, but it does control which calls reach the bot in the first place. Dynamic blacklisting, registration scanning protection, and per-NAP CPS limits reduce the probing traffic that ever reaches the AI layer. The same controls apply to AI-specific toll fraud, where a hijacked dialer can generate large volumes of fraudulent international calls; per-NAP destination filters, premium-rate prefix blocks, and fraud-score integration with TransNexus, SecureLogix, or YouMail are the standard defenses.
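A premium-rate prefix block reduces to a prefix match on the normalized dialed number. The sketch below is a minimal illustration with hypothetical names; the single blocked prefix (NANP 1-900) is only an example, and a production list would come from the carrier or a fraud-scoring partner.

```python
import re

# Example blocklist only: 1-900 is the NANP premium-rate area code.
BLOCKED_PREFIXES = ("1900",)


def normalize(dialed: str) -> str:
    """Strip '+', spaces, and punctuation so only digits remain."""
    return re.sub(r"\D", "", dialed)


def is_blocked(dialed: str, blocked_prefixes=BLOCKED_PREFIXES) -> bool:
    """True when the dialed number starts with any blocked prefix."""
    digits = normalize(dialed)
    return any(digits.startswith(prefix) for prefix in blocked_prefixes)


print(is_blocked("+1 900 555 0100"))  # premium-rate destination
print(is_blocked("+1 212 555 0100"))  # ordinary geographic number
```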

Where Voice AI Agents Land on the SBC

Three deployment shapes account for almost all production voice AI traffic. Each has different SBC implications.

AI receptionist for an SMB or MSP customer

The simplest case: one inbound number, one AI agent, one tenant. A single NAP on the carrier side, a single NAP on the AI side, basic SIP and codec handling. For an MSP delivering AI voice agents per tenant, the multiplier is the number of customers, not the per-customer complexity. A single ProSBC instance with 1,024 available NAPs can host hundreds of AI receptionists side by side, each with its own routing, recording, and STIR/SHAKEN treatment.

Contact center with AI front line

The bot handles the first 30 seconds of every call, identifies the caller, triages the request, and resolves it or hands off to a human. This is the cloud communications pattern applied to AI agents. The SBC’s role is the same as for a traditional contact center, plus the orchestrator query at INVITE time to decide bot or human queue based on the dialed number, the time, or the caller’s prior history. Compared to a Teams Direct Routing deployment, the contact-center AI flow benefits more from media anchoring for recording and less from media bypass.

Outbound AI dialer

The most demanding shape. A campaign-driven dialer generates the burst patterns described earlier, and each call has to carry the right attestation level, caller ID, and opt-out handling. ProSBC’s programmable routing engine sets all three per call, and the per-NAP CPS limit protects the carrier from being overrun. If the AI platform is hosted by a third party, the dialer-to-SBC leg often uses WebRTC or SIP/TLS over the public internet, with TLS/SRTP terminating at the SBC. Capacity planning targets the campaign burst peak.

Recording, Compliance, and Two-Party Consent

AI-assisted call flows have to deal with consent rules in the jurisdictions where they operate. Two-party consent states require the caller to be told that the call is being recorded or processed by an AI system, and the SBC is often the layer that proves the disclosure happened. The practical pattern is to anchor media at the SBC, fork the audio (or just the inbound leg) to a secure recording target, and timestamp the consent prompt and the caller’s response. If a call gets challenged later, the recording shows the disclosure was made and the caller continued.

PII and PCI redaction live in the AI platform layer or a separate redaction service, but the SBC controls whether the original audio reaches them in the first place. Pass-through mode saves SBC media load but gives up the audit trail. For deployments hosted on a software SBC running on AWS, Azure, or KVM, the recording fork target can be a cloud storage endpoint or a dedicated recording service; the SBC does not care where the bytes land, only that the fork is configured.

Frequently Asked Questions

Does my voice AI platform need a Microsoft-certified SBC?

Only if the AI agents participate in Microsoft Teams calls. A standalone voice AI deployment that connects to the PSTN through a SIP trunk does not touch Teams Direct Routing and has no Microsoft certification requirement. ProSBC supports Teams Direct Routing for AI agents that need to join Teams meetings, with the same TLS, SRTP, and FQDN requirements that apply to any Teams DR deployment.

Which codec should I configure between the SBC and the voice AI platform?

G.711 on both legs is the simplest and lowest-latency option for most commercial voice AI platforms. Opus end-to-end gives better audio quality but currently requires hardware transcoding at the SBC if the PSTN side delivers G.711. Confirm the AI platform’s preferred codec on its SIP endpoint and match the SBC NAP configuration. Run a recorded test call before launch to verify barge-in and DTMF behavior.

How do I attest STIR/SHAKEN for AI-originated outbound calls?

Set the attestation level based on what you can verify about the calling number and the AI agent placing the call. A self-attested A-level call requires a verified relationship between your platform and the calling party. AI dialers placing calls on behalf of unverified end customers should attest at B or C level rather than A. ProSBC’s programmable routing engine sets the attestation level per call, per campaign, or per NAP, and integrates with TransNexus ClearIP or Neustar over SIP for the signing service exchange.

Can a single SBC serve multiple voice AI tenants?

Yes. ProSBC supports up to 1,024 NAPs per server, which is enough to isolate routing, codecs, recording, and STIR/SHAKEN treatment for hundreds of tenants on one instance. Per-tenant CPS limits, destination filters, and attestation rules are configured at the NAP level. The multi-tenant SBC pattern for Teams Direct Routing applies directly to multi-tenant AI voice deployments.

How should I size SBC capacity for an AI outbound dialer?

Size against the campaign burst peak, not the daily average. An AI dialer can ramp to thousands of concurrent sessions in seconds and hold that peak for an hour. Headroom of 20–30 percent above the highest observed burst is the working rule. Set the per-NAP CPS limit to protect each carrier from being overrun, and confirm the SIP trunk provider can absorb the offered call rate before launch.
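The sizing arithmetic can be sketched in a few lines. Steady-state concurrency follows Little's law (call rate multiplied by average hold time), and the headroom rule is applied on top of the observed peak. The 200 CPS and 45-second hold time below are made-up example inputs, and the function names are illustrative.

```python
def concurrent_sessions(cps: float, avg_hold_s: float) -> float:
    """Little's law: steady-state concurrency = arrival rate x mean hold time."""
    return cps * avg_hold_s


def required_capacity(observed_peak: int, headroom_pct: int = 25) -> int:
    """Observed burst peak plus headroom, rounded up to a whole session."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (observed_peak * (100 + headroom_pct) + 99) // 100


# Example inputs (assumptions): 200 calls per second, 45 s average call.
estimated_peak = concurrent_sessions(cps=200, avg_hold_s=45)
print(estimated_peak)

# Apply the 20-30 percent working rule to an observed peak of 20,000 sessions.
print(required_capacity(20_000, headroom_pct=20))
print(required_capacity(20_000, headroom_pct=30))
```

With these example numbers the estimate is 9,000 concurrent sessions, and a 20,000-session observed peak calls for 24,000 to 26,000 sessions of provisioned capacity.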

Where do I record AI-handled calls for compliance?

Anchor media at the SBC and fork the audio to a recording target that meets your encryption-at-rest and retention requirements. The recording target can live in the same cloud as the SBC or in a separate storage tier. PII and PCI redaction usually happens in a downstream service rather than at the SBC, but the SBC is the only point in the path that can capture the consent disclosure and the caller’s response in a single timestamped audio file.

Conclusion

Voice AI agents work in production when the layer between the AI platform and the phone network behaves like a serious piece of telecom infrastructure rather than a SIP demo. The SBC is what turns the AI cluster into something a carrier will trust, attest, and route to. It is also what protects the AI layer from the messier realities of PSTN traffic: codec variance, attestation drift, fraud probing, and bursts that no AI inference cluster wants to absorb directly.

The right SBC for voice AI is not an “AI SBC.” It is a programmable, multi-tenant SBC with a clean REST API, per-NAP codec and routing control, STIR/SHAKEN integration that fits real production patterns, and enough capacity headroom for the burst behavior AI workloads generate. The infrastructure question for a voice AI deployment is the same question for any serious voice deployment, just with sharper tolerances on latency, attestation, and trust.

Run Voice AI Agents Through ProSBC

ProSBC handles the carrier-to-AI bridge with the same B2BUA architecture, TLS and SRTP termination, and per-NAP configuration that production voice traffic has always required. The programmable routing engine queries an external orchestrator at INVITE time, so AI campaign logic, agent selection, and bot-to-human escalation happen without static config changes. STIR/SHAKEN integration with TransNexus ClearIP and Neustar covers the attestation side for AI-originated outbound; per-NAP CPS limits, destination filters, and fraud-score partner integration cover the abuse side.

The same instance can serve hundreds of AI tenants from a single deployment, scaling to 60,000 concurrent sessions per server with 1,024 available NAPs. ProSBC runs on Microsoft Azure, AWS, VMware, KVM/Proxmox, and bare metal, so the SBC can sit wherever the voice AI platform actually lives.

Prefer to evaluate on your own first? Start your 30-day free trial.