RTP: Real-time Transport Protocol
The standard protocol for delivering audio and video over IP networks. RTP provides timestamping, sequencing, and payload identification for real-time media streams over UDP.
Type
Application Layer
Ports
Dynamic (even, e.g., 5004)
Transport
UDP
Standard
RFC 3550
What is RTP?
RTP (Real-time Transport Protocol) is the standard protocol for delivering audio and video over IP networks. Defined in RFC 3550, published in 2003, RTP provides the mechanisms that applications need to stream media in real time: timestamping for synchronization, sequence numbering for reordering, and payload type identification so receivers know how to decode the data.
RTP is not a file transfer protocol. It is designed specifically for continuous media streams where timeliness matters more than perfect delivery. A dropped video frame or a brief gap in audio is far less noticeable than the stuttering caused by waiting for a retransmission. For this reason, RTP runs over UDP rather than TCP, avoiding the latency penalties of reliable delivery.
RTP does not work alone. It is always paired with RTCP (RTP Control Protocol), which provides quality feedback, participant identification, and synchronization information. Together, RTP and RTCP form the foundation of virtually every VoIP phone call, video conference, IP camera feed, and live broadcast on the internet today. Applications like Zoom, Microsoft Teams, Google Meet, and most SIP-based phone systems all rely on RTP to carry their media streams.
How RTP Streaming Works
The RTP streaming process begins when a sender captures media (audio from a microphone, video from a camera) and encodes it using a codec such as Opus for audio or H.264 for video. The encoded data is then divided into chunks that fit within a single UDP packet, typically staying below the network MTU of 1,500 bytes to avoid IP fragmentation.
Each chunk is wrapped in an RTP header that includes a sequence number, a timestamp, and a payload type identifier. The sequence number increments by one for every packet, allowing the receiver to detect lost packets and reorder any that arrive out of sequence. The timestamp reflects when the media was captured, enabling the receiver to play it back at the correct rate regardless of network delay variations.
On the receiving end, packets flow into a jitter buffer, a short queue that absorbs timing variations in packet arrival. The buffer collects incoming packets, sorts them by sequence number, and releases them to the decoder at a steady rate. This smooths out the irregular delivery pattern caused by network jitter and produces consistent playback.
Because RTP runs over UDP, there is no guarantee that every packet will arrive. Applications handle loss gracefully: audio codecs can interpolate missing samples using packet loss concealment, and video decoders can skip corrupted frames and wait for the next keyframe. The priority is always to keep the stream moving forward rather than stalling to recover lost data.
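The packetization loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a full RTP stack: the 20 ms mu-law frames are stand-in silence bytes, and a real sender would pace transmission and write each packet to a UDP socket.

```python
import struct

def make_rtp_packet(payload: bytes, seq: int, timestamp: int,
                    ssrc: int, payload_type: int, marker: bool = False) -> bytes:
    """Build a minimal 12-byte RTP header (V=2, no padding, extension, or CSRCs)."""
    byte0 = 2 << 6                       # version 2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | payload_type
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

# Simulate three 20 ms G.711 mu-law frames (PT 0, 8000 Hz clock):
# each frame carries 160 samples, so the timestamp advances by 160 per packet
# while the sequence number advances by 1.
ssrc, seq, ts = 0x12345678, 1000, 0
packets = []
for _ in range(3):
    frame = b"\xff" * 160                # 20 ms of mu-law silence (stand-in data)
    packets.append(make_rtp_packet(frame, seq, ts, ssrc, payload_type=0))
    seq += 1
    ts += 160
```

Note the two counters advancing at different rates: the sequence number counts packets, while the timestamp counts media samples.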
RTP Packet Header Structure
The RTP header is compact but information-rich, with a minimum size of 12 bytes. Every field serves a specific purpose in enabling real-time media delivery. Here is what each field does:
- Version (V, 2 bits): always set to 2 for the current version of RTP defined in RFC 3550.
- Padding (P, 1 bit): indicates whether the packet contains padding bytes at the end. Padding is sometimes needed for encryption algorithms that require fixed block sizes.
- Extension (X, 1 bit): signals that an extension header follows the fixed header, allowing applications to add custom metadata.
- CSRC Count (CC, 4 bits): the number of Contributing Source identifiers that follow the fixed header. Used when an audio mixer combines multiple streams.
- Marker (M, 1 bit): application-defined flag, commonly used to mark the first packet of a video frame or the end of a talk spurt in audio.
- Payload Type (PT, 7 bits): identifies the codec used to encode the media. The receiver uses this to select the correct decoder. Values 0 through 95 are statically assigned, while 96 through 127 are dynamically negotiated.
- Sequence Number (16 bits): increments by one for each packet sent. The receiver uses this to detect packet loss and restore the original order of packets that arrive out of sequence.
- Timestamp (32 bits): reflects the sampling instant of the first byte of media data in the packet. The clock rate depends on the codec: audio codecs typically use 8,000 Hz or 48,000 Hz, while video codecs use 90,000 Hz. This field is essential for smooth playback timing.
- SSRC (32 bits): Synchronization Source identifier. A randomly generated value that uniquely identifies this particular media stream. Each participant in a session has a different SSRC.
- CSRC List (0 to 15 items, 32 bits each): Contributing Source identifiers, present when a mixer combines multiple input streams into one output stream. Each CSRC identifies one of the original sources.
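A short Python sketch shows how these bit layouts are unpacked in practice. The example packet is hand-built for illustration; header extensions are detected but not parsed here.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header plus any CSRC list."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                                  # CSRC count, low 4 bits
    csrcs = struct.unpack(f"!{cc}I", packet[12:12 + 4 * cc]) if cc else ()
    return {
        "version":      b0 >> 6,                    # top 2 bits
        "padding":      bool(b0 & 0x20),
        "extension":    bool(b0 & 0x10),
        "csrc_count":   cc,
        "marker":       bool(b1 & 0x80),            # top bit of second byte
        "payload_type": b1 & 0x7F,                  # low 7 bits
        "sequence":     seq,
        "timestamp":    ts,
        "ssrc":         ssrc,
        "csrc_list":    list(csrcs),
        "payload":      packet[12 + 4 * cc:],
    }

# Hand-built example: V=2, marker set, dynamic PT 96, seq 7, ts 3000, SSRC 0xDEADBEEF
pkt = bytes([0x80, 0x80 | 96]) + struct.pack("!HII", 7, 3000, 0xDEADBEEF) + b"data"
fields = parse_rtp_header(pkt)
```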
RTP Payload Types
The Payload Type field in the RTP header tells the receiver which codec was used to encode the media data. RFC 3551 defines a set of static payload type assignments for common codecs. Modern codecs use dynamic payload types (96 through 127) that are negotiated during session setup, typically via SDP (Session Description Protocol) carried within SIP signaling.
| PT | Codec | Media | Clock Rate |
|---|---|---|---|
| 0 | PCMU (G.711 mu-law) | Audio | 8000 Hz |
| 8 | PCMA (G.711 A-law) | Audio | 8000 Hz |
| 9 | G.722 | Audio | 8000 Hz |
| 26 | JPEG | Video | 90000 Hz |
| 31 | H.261 | Video | 90000 Hz |
| 96-127 | Dynamic | Audio/Video | Varies |
Note the G.722 row above: its RTP clock rate is 8,000 Hz even though the codec samples audio at 16 kHz, an error in the original assignment that RFC 3551 preserves for backward compatibility. Modern codecs like Opus, VP8, VP9, H.264, H.265, and AV1 all use dynamic payload types in the 96 to 127 range. The specific number is agreed upon during session negotiation, so there is no fixed mapping. This approach allows new codecs to be deployed without updating the RTP specification itself.
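A receiver resolving payload types therefore needs two sources: the static RFC 3551 table and the per-session dynamic map. The sketch below uses a hypothetical `resolve_payload_type` helper; in a real implementation the dynamic map would be parsed from SDP `a=rtpmap` lines during session setup.

```python
# Static assignments from RFC 3551 (the subset shown in the table above).
STATIC_PAYLOAD_TYPES = {
    0:  ("PCMU", "audio", 8000),
    8:  ("PCMA", "audio", 8000),
    9:  ("G722", "audio", 8000),    # RTP clock stays 8000 Hz despite 16 kHz sampling
    26: ("JPEG", "video", 90000),
    31: ("H261", "video", 90000),
}

def resolve_payload_type(pt: int, negotiated: dict) -> tuple:
    """Map a PT to (encoding, media, clock rate); dynamic PTs come from negotiation."""
    if pt in STATIC_PAYLOAD_TYPES:
        return STATIC_PAYLOAD_TYPES[pt]
    if 96 <= pt <= 127:
        return negotiated[pt]       # e.g. from an SDP line "a=rtpmap:111 opus/48000/2"
    raise KeyError(f"unknown payload type {pt}")

# Example: a session that negotiated Opus onto dynamic PT 111.
session_map = {111: ("opus", "audio", 48000)}
```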
RTCP: Quality Feedback
RTP Control Protocol (RTCP) runs alongside every RTP session, using the next odd port number (if RTP is on port 5004, RTCP uses 5005). While RTP carries the media data, RTCP carries control information that enables quality monitoring, adaptive streaming, and synchronization.
RTCP defines several packet types, each serving a different purpose:
- Sender Reports (SR): sent by participants that are actively transmitting media. Each SR includes the total number of packets sent, total bytes sent, and an NTP timestamp that maps RTP timestamps to wall-clock time. This mapping is essential for synchronizing audio and video streams (lip sync).
- Receiver Reports (RR): sent by participants that are receiving media. Each RR reports the fraction of packets lost, cumulative packets lost, inter-arrival jitter, and information needed to calculate round-trip time. Senders use this data to adjust their bitrate and detect congestion.
- Source Description (SDES): carries textual information about participants, such as a canonical name (CNAME) that remains consistent even if the SSRC changes. The CNAME is used to associate multiple media streams (audio and video) from the same participant.
- BYE: signals that a participant is leaving the session.
- APP: application-specific data that extends RTCP without modifying the standard.
RTCP bandwidth is limited to roughly 5% of the total session bandwidth to avoid consuming too much capacity. The report interval scales with the number of participants, ensuring that RTCP traffic remains manageable even in large sessions.
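The inter-arrival jitter that Receiver Reports carry is not raw delay variation but the smoothed estimate defined in RFC 3550 section 6.4.1: for each packet, the transit-time difference D from the previous packet feeds the running estimate as J += (|D| - J) / 16. A sketch, with all times in RTP clock units:

```python
class JitterEstimator:
    """Interarrival jitter estimator from RFC 3550, section 6.4.1.

    D(i, j) = (Rj - Ri) - (Sj - Si) in RTP timestamp units;
    smoothed estimate: J += (|D| - J) / 16.
    """
    def __init__(self) -> None:
        self.jitter = 0.0
        self._prev_transit = None

    def update(self, rtp_timestamp: int, arrival_ts: int) -> float:
        """Both arguments must use the same RTP clock units (e.g. 8000 Hz ticks)."""
        transit = arrival_ts - rtp_timestamp          # relative transit time
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            self.jitter += (d - self.jitter) / 16.0   # 1/16 smoothing gain
        self._prev_transit = transit
        return self.jitter

# Packets sent every 160 ticks (20 ms at 8000 Hz) but arriving unevenly:
est = JitterEstimator()
for rtp_ts, arrival in [(0, 100), (160, 270), (320, 415), (480, 590)]:
    est.update(rtp_ts, arrival)
```

The 1/16 gain makes the estimate react gradually, so a single late packet does not dominate the report.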
Jitter Buffer
Network jitter is the variation in packet arrival times. Even if a sender transmits packets at perfectly regular intervals (for example, every 20 milliseconds for a VoIP call), network conditions cause packets to arrive with variable delays. One packet might arrive in 15 ms, the next in 35 ms, and the one after in 10 ms.
Without a jitter buffer, this variation would produce choppy, uneven playback. The jitter buffer solves this by introducing a small, controlled delay. Incoming packets are held in the buffer, sorted by sequence number, and released to the decoder at a constant rate matching the original transmission interval.
The size of the jitter buffer represents a fundamental trade-off. A larger buffer can absorb more jitter, producing smoother playback, but it adds latency to the stream. A smaller buffer keeps latency low but may not have enough headroom to smooth out bursts of jitter, leading to gaps in playback. For VoIP calls, a typical jitter buffer adds 20 to 60 milliseconds of delay.
Modern implementations use adaptive jitter buffers that dynamically adjust their size based on observed network conditions. When jitter is low, the buffer shrinks to minimize latency. When jitter increases, the buffer grows to maintain smooth playback. This approach provides the best balance between quality and responsiveness.
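A fixed-depth (non-adaptive) buffer is enough to show the reorder-and-release mechanics. This is a deliberately simplified sketch: a production buffer would release packets on a playout clock and resize itself from observed jitter rather than holding a fixed packet count.

```python
import heapq

class JitterBuffer:
    """Minimal fixed-depth playout buffer: holds up to `depth` packets,
    then releases them in sequence-number order."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth
        self._heap = []                       # min-heap of (sequence_number, payload)

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self):
        """Yield packets in sequence order once the buffer exceeds its depth."""
        while len(self._heap) > self.depth:
            yield heapq.heappop(self._heap)

# Packets arrive out of order; the buffer releases them sorted by sequence.
buf = JitterBuffer(depth=2)
released = []
for seq in [5, 3, 4, 7, 6, 8]:
    buf.push(seq, b"")
    released.extend(s for s, _ in buf.pop_ready())
```

The two packets still held at the end are the buffer's headroom: the delay that absorbs future jitter.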
RTP vs Other Streaming Approaches
RTP is optimized for real-time, low-latency delivery, but it is not the only way to stream media. HTTP-based protocols like HLS and DASH dominate video-on-demand and live broadcast, while WebRTC brings real-time communication directly to web browsers. Each approach makes different trade-offs.
| Feature | RTP/UDP | HLS/DASH (HTTP) | WebRTC |
|---|---|---|---|
| Transport | UDP | TCP (HTTP) | UDP (SRTP) |
| Latency | Very low (ms) | High (seconds) | Very low (ms) |
| Reliability | None (app handles) | Full (TCP) | Selective (NACK) |
| Use Case | VoIP, conferencing | VOD, live broadcast | Browser P2P |
| Firewall | May be blocked | Always works | ICE/STUN traversal |
| Encryption | SRTP | HTTPS | DTLS-SRTP |
RTP excels in scenarios where latency must be minimized and both endpoints are under the same administrative control (such as a private VoIP network or an IP camera system). HTTP-based streaming wins when content must traverse firewalls reliably and scale to millions of viewers through CDNs. WebRTC combines the low latency of RTP with browser compatibility and built-in NAT traversal, making it the preferred choice for browser-based communication.
Common Use Cases
- VoIP phone calls: SIP handles call signaling (dialing, ringing, hanging up), while RTP carries the actual voice audio between endpoints. Nearly every IP phone and softphone uses this combination.
- Video conferencing: applications like Zoom, Microsoft Teams, and Google Meet use RTP internally to transport audio and video streams between participants and media servers.
- IP cameras and surveillance: RTSP (Real-Time Streaming Protocol) controls the camera session, while RTP delivers the video feed. This is the standard architecture for network surveillance systems.
- Live broadcasting: contribution feeds from remote locations to broadcast studios often use RTP for its low latency and precise timing. Protocols like SMPTE 2110 use RTP for professional media over IP.
- WebRTC: browser-based real-time communication uses SRTP (Secure RTP) as its media transport. WebRTC adds encryption (DTLS-SRTP), congestion control, and NAT traversal on top of the core RTP framework.
- Online gaming voice chat: many gaming platforms use RTP or RTP-like protocols to carry voice audio between players with minimal delay.
Frequently Asked Questions About RTP
Why does RTP use UDP instead of TCP?
TCP retransmits lost packets, which introduces unpredictable delays. For real-time media, a late packet is worse than a lost packet. If a voice sample arrives 200 ms late, the conversation has already moved on and the data is useless. UDP allows RTP to deliver packets as quickly as possible and let the application decide how to handle any losses.
What is the difference between RTP and RTSP?
RTP and RTSP serve different roles. RTP is the data transport protocol that carries actual media packets. RTSP (Real-Time Streaming Protocol) is a control protocol, similar to a remote control for a media player. RTSP sends commands like PLAY, PAUSE, and TEARDOWN to manage the streaming session, while RTP delivers the audio and video data itself. They are commonly used together, especially in IP camera systems.
What is SRTP?
SRTP (Secure Real-time Transport Protocol), defined in RFC 3711, adds encryption, message authentication, and replay protection to RTP. Standard RTP sends media data in plaintext, which means anyone who can capture the network traffic can listen to calls or watch video feeds. SRTP encrypts the payload using AES, ensuring confidentiality. It is mandatory in WebRTC and increasingly required in VoIP deployments.
How does RTP handle packet loss?
RTP itself does not handle packet loss. It provides the sequence numbers that allow receivers to detect which packets are missing, but recovery is left to the application. Common strategies include: audio packet loss concealment (interpolating the missing sample from surrounding data), video error concealment (repeating the previous frame or using partial decoding), forward error correction (sending redundant data so lost packets can be reconstructed), and negative acknowledgments (NACK) where the receiver requests retransmission of specific packets if time allows.
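Of these strategies, forward error correction is the easiest to demonstrate. Single-loss XOR parity, the idea behind RFC 5109-style FEC, can be sketched in a few lines, assuming equal-length payloads and at most one loss per group (a real FEC scheme handles variable lengths and carries the parity in its own RTP packets):

```python
from functools import reduce

def xor_parity(packets: list[bytes]) -> bytes:
    """Parity over a group of equal-length payloads: the XOR of all of them."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def recover_lost(received: list[bytes], parity: bytes) -> bytes:
    """Reconstruct the single missing payload by XORing survivors with the parity."""
    return xor_parity(received + [parity])

group = [b"abcd", b"efgh", b"ijkl"]   # three media payloads of equal length
parity = xor_parity(group)            # sent as one extra packet

# Suppose group[1] is lost in transit; the receiver rebuilds it:
restored = recover_lost([group[0], group[2]], parity)
```

The cost is one extra packet per group of three, trading roughly 33% more bandwidth for recovery without any retransmission round trip.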
What is a jitter buffer?
A jitter buffer is a short queue at the receiver that collects incoming RTP packets and releases them at a steady rate. It absorbs the natural variation in packet arrival times (jitter) caused by network congestion, routing changes, and other factors. Without a jitter buffer, variable delays would cause choppy playback. The trade-off is that the buffer adds a small amount of latency, typically 20 to 60 ms for voice calls.
Can RTP stream both audio and video simultaneously?
Yes, but each media type uses a separate RTP stream with its own SSRC, sequence numbers, and timestamps. Audio and video are not multiplexed into a single RTP stream. Instead, they run in parallel, traditionally on different port pairs (WebRTC bundles all streams onto a single port pair, but each stream still keeps its own SSRC). RTCP Sender Reports provide the NTP timestamp mapping needed to synchronize the two streams for lip sync. The CNAME in RTCP SDES packets ties the audio and video streams together as belonging to the same participant.
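The SR-based synchronization can be illustrated with a small calculation. The anchor values below are hypothetical, and a real implementation must also handle 32-bit RTP timestamp wrap-around:

```python
def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_ntp_seconds: float,
                     clock_rate: int) -> float:
    """Map an RTP timestamp to wall-clock time using an RTCP Sender Report anchor.

    The SR pairs an RTP timestamp (sr_rtp_ts) with an NTP wall-clock time
    (sr_ntp_seconds); any other RTP timestamp on the same stream is then
    offset from that anchor at the stream's clock rate.
    """
    return sr_ntp_seconds + (rtp_ts - sr_rtp_ts) / clock_rate

# Hypothetical SR anchors: audio (8000 Hz) and video (90000 Hz) both report
# their RTP clocks against the same wall-clock instant, 100.0 seconds.
audio_time = rtp_to_wallclock(12000, 8000, 100.0, 8000)      # audio sample time
video_time = rtp_to_wallclock(135000, 90000, 100.0, 90000)   # video frame time
# Equal wall-clock times mean these samples should be rendered together.
```

This is why lip sync depends on RTCP: the RTP timestamps of the two streams use different clocks and random offsets, so only the SR's NTP mapping places them on a common timeline.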
Related Protocols
- UDP: the transport protocol that carries RTP packets with minimal overhead and no retransmission delay
- TCP: reliable transport protocol, used by HTTP-based streaming alternatives like HLS and DASH
- HTTP: application layer protocol used by HLS, DASH, and other adaptive streaming formats
- SSH: secure remote access protocol; its TCP-based tunnels are occasionally used to carry signaling or management traffic through firewalls, though they are a poor fit for latency-sensitive RTP media