Self-Hosted Session Replay with Live Tailing: DOM Mutation Streaming Architecture

When a user reports that "the button doesn't work," traditional debugging tools leave you guessing. Session replay bridges this gap by capturing the exact DOM state and user interactions leading up to an issue. The recent open-source tool Reploya demonstrates that self-hosted session replay with live tailing is not only feasible but can operate with sub-second latency, enabling real-time debugging of active user sessions.

Why DOM Mutations Beat Video Recording

Most teams initially assume session replay works like screen recording—capturing video frames. In practice, modern tools like rrweb (17,000+ GitHub stars) take a fundamentally different approach that yields dramatic efficiency gains. The process begins with an initial DOM snapshot: the library walks the entire document tree, assigns unique numeric IDs to each element, and serializes tag names, attributes, and text content into a JSON structure. For a typical React dashboard, this initial snapshot ranges from 50–200KB.
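The snapshot step can be sketched as a recursive walk that hands out numeric IDs. This is a simplified illustration over a plain tree type, not rrweb's actual serializer; SourceNode, SerializedNode, and snapshot are hypothetical names chosen for the example.

```typescript
// Simplified stand-in for a DOM node (hypothetical type, not the real DOM API).
interface SourceNode {
  tag?: string; // element tag name; absent for text nodes
  attributes?: Record<string, string>;
  text?: string; // text content for text nodes
  children?: SourceNode[];
}

// Serialized form: every node carries a unique numeric id.
interface SerializedNode {
  id: number;
  tag?: string;
  attributes?: Record<string, string>;
  text?: string;
  children: SerializedNode[];
}

// Walk the tree depth-first, assigning ids in visit order starting at 1.
function snapshot(root: SourceNode): SerializedNode {
  let nextId = 1;
  const visit = (node: SourceNode): SerializedNode => {
    const out: SerializedNode = { id: nextId++, children: [] };
    if (node.tag !== undefined) out.tag = node.tag;
    if (node.attributes !== undefined) out.attributes = { ...node.attributes };
    if (node.text !== undefined) out.text = node.text;
    for (const child of node.children ?? []) {
      out.children.push(visit(child));
    }
    return out;
  };
  return visit(root);
}

// Example: a tiny page with one button.
const page: SourceNode = {
  tag: "div",
  attributes: { class: "app" },
  children: [
    { tag: "button", attributes: { id: "save" }, children: [{ text: "Save" }] },
  ],
};
const serializedPage = snapshot(page);
const payload = JSON.stringify(serializedPage); // what would be sent as the snapshot
```

The IDs matter more than the JSON shape: every later mutation refers to nodes by ID, so the recorder and the player must agree on them.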

After the snapshot, MutationObserver takes over. Instead of re-recording the entire page, the library captures only incremental changes—node additions, attribute modifications, text updates. A typical user interaction like clicking a dropdown menu generates just 200–500 bytes of mutation data. This diff-based approach explains why a 5-minute session that would consume 50MB as video requires only 2–3MB of raw JSON, compressing down to 250–400KB with gzip.

The reconstruction process reverses this flow: the player renders the initial DOM snapshot into an isolated iframe, then replays mutations in timestamp order. The result is an interactive replay where you can inspect element states, view console logs, and scrub through the timeline—capabilities impossible with static video.
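A minimal sketch of the record-then-replay idea follows, assuming a simplified mutation format. The ReplayNode and Mutation shapes are hypothetical and much smaller than rrweb's real event schema; the point is only that playback is a snapshot plus mutations applied in timestamp order.

```typescript
// Hypothetical, simplified node and mutation shapes (not rrweb's wire format).
interface ReplayNode {
  id: number;
  tag: string;
  attributes: Record<string, string>;
  text: string;
  children: ReplayNode[];
}

type Mutation =
  | { timestamp: number; kind: "attribute"; id: number; name: string; value: string }
  | { timestamp: number; kind: "text"; id: number; value: string }
  | { timestamp: number; kind: "add"; parentId: number; node: ReplayNode }
  | { timestamp: number; kind: "remove"; parentId: number; id: number };

// Depth-first lookup by id; a real player would keep an id-to-node index instead.
function findNode(root: ReplayNode, id: number): ReplayNode | undefined {
  if (root.id === id) return root;
  for (const child of root.children) {
    const hit = findNode(child, id);
    if (hit) return hit;
  }
  return undefined;
}

// Apply mutations to the snapshot in timestamp order, stopping at `until`
// (the scrub position). Mutates and returns `root`.
function replay(root: ReplayNode, mutations: Mutation[], until: number): ReplayNode {
  const ordered = mutations.slice().sort((a, b) => a.timestamp - b.timestamp);
  for (const m of ordered) {
    if (m.timestamp > until) break;
    if (m.kind === "attribute") {
      const target = findNode(root, m.id);
      if (target) target.attributes[m.name] = m.value;
    } else if (m.kind === "text") {
      const target = findNode(root, m.id);
      if (target) target.text = m.value;
    } else if (m.kind === "add") {
      const parent = findNode(root, m.parentId);
      if (parent) parent.children.push(m.node);
    } else {
      const parent = findNode(root, m.parentId);
      if (parent) parent.children = parent.children.filter((c) => c.id !== m.id);
    }
  }
  return root;
}
```

Scrubbing backwards in this sketch would mean restoring the snapshot and replaying up to the earlier position, which is one reason real recorders take periodic full snapshots.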

Implementing Live Tailing for Real-Time Debugging

Live tailing extends session replay from post-hoc analysis to real-time observation. The architecture requires three components: a streaming transport, a buffering strategy, and a playback engine capable of handling incomplete sessions.

Streaming Transport: Server-Sent Events (SSE) provides the most straightforward implementation. The client establishes a persistent connection to /api/sessions/:id/live, and the server streams mutation events as they arrive from the recording client. Unlike WebSockets, SSE works over standard HTTP, handles reconnection automatically, and requires no protocol negotiation overhead.
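As an illustration of the wire format, the sketch below frames one event as a text/event-stream message and picks the events to resend after a reconnect. The ReplayEvent shape, the per-session seq counter, and the "mutation" event name are assumptions for the example, not part of any particular tool.

```typescript
// Hypothetical event shape; `seq` is a per-session sequence number (assumption).
interface ReplayEvent {
  seq: number;
  timestamp: number;
  data: unknown;
}

// Format one event as a text/event-stream frame.
// The `id:` field lets a reconnecting EventSource send Last-Event-ID,
// so the server can resume from the next sequence number.
function toSseFrame(event: ReplayEvent): string {
  const json = JSON.stringify(event); // single line, so one `data:` field suffices
  return `id: ${event.seq}\nevent: mutation\ndata: ${json}\n\n`;
}

// Events to resend after a reconnect, given the Last-Event-ID header value.
function eventsAfter(buffer: ReplayEvent[], lastEventId?: string): ReplayEvent[] {
  if (lastEventId === undefined || lastEventId === "") return buffer.slice();
  const last = Number(lastEventId);
  if (isNaN(last)) return buffer.slice(); // unusable id: replay the whole buffer
  return buffer.filter((e) => e.seq > last);
}
```

Because the frames use a named event, a browser viewer would subscribe with addEventListener("mutation", ...) on the EventSource rather than onmessage.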

Buffering Strategy: The recording client batches events in memory for 5–10 seconds before transmission, striking a balance between latency and network efficiency. That interval is also a floor on how stale a live view can be, so a session that is actively being watched needs a much shorter flush interval to approach sub-second latency. For live tailing, the server maintains a sliding window buffer of the most recent events, flushing to persistent storage while simultaneously pushing to connected viewers. When a support agent connects to an active session, they first receive the buffered history, then transition to real-time streaming.
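The server-side window can be sketched as a small in-memory class. LiveSessionBuffer and its method names are hypothetical, and persistence to storage is left out; the sketch only shows the "history first, then live" handoff.

```typescript
interface BufferedEvent {
  seq: number;
  timestamp: number; // ms
  data: unknown;
}

type Viewer = (event: BufferedEvent) => void;

class LiveSessionBuffer {
  private events: BufferedEvent[] = [];
  private viewers: Viewer[] = [];
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Called for each event received from the recording client.
  push(event: BufferedEvent): void {
    this.events.push(event);
    // Evict events older than the window, measured from the newest event.
    const cutoff = event.timestamp - this.windowMs;
    while (this.events.length > 0 && this.events[0].timestamp < cutoff) {
      this.events.shift();
    }
    for (const viewer of this.viewers) viewer(event);
  }

  // Attach a viewer: deliver buffered history first, then live events.
  // Returns an unsubscribe function.
  attach(viewer: Viewer): () => void {
    for (const event of this.events) viewer(event);
    this.viewers.push(viewer);
    return () => {
      this.viewers = this.viewers.filter((v) => v !== viewer);
    };
  }
}
```

One caveat the sketch does not handle: a viewer joining mid-session still needs the most recent full snapshot to render anything, so a real buffer would either pin that snapshot or fetch it from storage before streaming the window.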

Playback Engine: The rrweb player accepts events incrementally. Initialize the player with available history, then call addEvent() as new mutations arrive. This enables the viewer to see typing, scrolling, and clicking as they occur—critical for debugging issues that only manifest under specific timing conditions.
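On the viewer side, the handoff from buffered history to live events can overlap, so it helps to drop anything already delivered. The sketch below assumes each event carries a seq number (an assumption, not an rrweb field) and talks to the player through a minimal interface exposing only addEvent(); LiveFeed is a hypothetical helper.

```typescript
interface SequencedEvent {
  seq: number; // assumed per-session sequence number
  timestamp: number;
}

// Minimal structural view of a player that accepts events incrementally.
interface IncrementalPlayer<E> {
  addEvent(event: E): void;
}

class LiveFeed<E extends SequencedEvent> {
  private lastSeq = 0;
  private player: IncrementalPlayer<E>;

  // Feed the available history in sequence order, then accept live events.
  constructor(player: IncrementalPlayer<E>, history: E[]) {
    this.player = player;
    const ordered = history.slice().sort((a, b) => a.seq - b.seq);
    for (const event of ordered) this.onEvent(event);
  }

  // Forward an event unless it was already delivered. Returns true if forwarded.
  onEvent(event: E): boolean {
    if (event.seq <= this.lastSeq) return false;
    this.lastSeq = event.seq;
    this.player.addEvent(event);
    return true;
  }
}
```

rrweb's documentation describes a live mode for its Replayer (a liveMode option together with startLive()); whether and how that applies depends on the version you deploy, so treat the player wiring here as a placeholder.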

Storage Architecture at Scale

Production session replay systems face significant storage challenges. PostHog's architecture illustrates a battle-tested pattern: ingestion via Kafka, batch compression with Snappy, and persistence to object storage with metadata in ClickHouse.

The ingestion pipeline receives events from the capture service, buffers them by session ID, and compresses blocks every 10 seconds or 100MB—whichever comes first. Snappy compression achieves 85–92% size reduction while maintaining decompression speed suitable for real-time playback. Each compressed block receives a byte-range URL like s3://bucket/key?range=bytes=0-1000, enabling the frontend to fetch only the time ranges needed for playback.
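To illustrate the byte-range idea, the sketch below selects the blocks that overlap a playback window and produces HTTP Range header values for them. The BlockIndexEntry shape is an assumption about what block metadata might look like, not PostHog's actual schema.

```typescript
// Hypothetical index entry for one compressed block of a session.
interface BlockIndexEntry {
  startTime: number; // timestamp of the first event in the block (ms)
  endTime: number; // timestamp of the last event in the block (ms)
  byteStart: number; // inclusive byte offset within the stored object
  byteEnd: number; // inclusive byte offset within the stored object
}

// Pick the blocks overlapping [from, to] and return Range header values.
// HTTP byte ranges are inclusive on both ends.
function rangesForWindow(index: BlockIndexEntry[], from: number, to: number): string[] {
  return index
    .filter((b) => b.endTime >= from && b.startTime <= to)
    .map((b) => `bytes=${b.byteStart}-${b.byteEnd}`);
}
```

A player that seeks to minute four of a long session would then request only the matching ranges instead of downloading the whole recording.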

For self-hosted deployments, the storage math is manageable. At 1,000 sessions per day with 30-day retention, and assuming sessions near the 250–400KB compressed size cited above for a 5-minute session, expect roughly 7.5–12GB of compressed data retained at any time. Sampling at 10% reduces this to about 1GB, making even modest VPS instances viable for small teams.
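The estimate is simple enough to keep as a function and rerun with your own numbers. The sketch below uses the 250–400KB compressed size per 5-minute session quoted earlier in the article; retainedStorageGb and its input names are hypothetical.

```typescript
interface StorageInputs {
  sessionsPerDay: number;
  retentionDays: number;
  compressedKbPerSession: number;
  sampleRate: number; // fraction of sessions recorded, 0..1
}

// Steady-state compressed storage in decimal GB (1 GB = 1e6 KB).
function retainedStorageGb(inputs: StorageInputs): number {
  const retainedSessions =
    inputs.sessionsPerDay * inputs.sampleRate * inputs.retentionDays;
  return (retainedSessions * inputs.compressedKbPerSession) / 1e6;
}

// 1,000 sessions/day, 30-day retention, everything recorded.
const lowEstimate = retainedStorageGb({
  sessionsPerDay: 1000, retentionDays: 30, compressedKbPerSession: 250, sampleRate: 1,
});
const highEstimate = retainedStorageGb({
  sessionsPerDay: 1000, retentionDays: 30, compressedKbPerSession: 400, sampleRate: 1,
});
```

With these inputs the range works out to 7.5–12GB, and a sampleRate of 0.1 scales both ends down by a factor of ten. Longer sessions or heavier pages will shift the per-session figure, so measure it on your own traffic.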

Privacy Controls and Compliance

Session replay captures everything visible on the page—user names in navigation bars, email addresses in profile sections, even partial credit card numbers displayed in confirmation screens. Self-hosting mitigates GDPR's third-party data transfer concerns, but masking remains essential.

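rrweb provides three mechanisms:

maskAllInputs: true replaces all input field values with asterisks during recording
Text masking hides rendered text within container elements while preserving structure. rrweb's default marker is the rr-mask class; a custom marker such as a data-rr-mask attribute can be matched through the maskTextSelector option
Blocking excludes entire sections from recording, replacing them with empty placeholders of the same size. rrweb's default marker is the rr-block class; a custom marker such as data-rr-block can be matched through the blockSelector option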

A production-ready masking strategy masks all inputs by default, applies data-rr-mask to user-specific content areas, and blocks payment forms and medical records entirely. Audit every route by recording test sessions and verifying no PII leaks through in the replay.
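A masking setup along these lines might look like the options object below. It assumes the data-rr-mask and data-rr-block markers are wired up through rrweb's maskTextSelector and blockSelector record options; the MaskingOptions type is a local subset for illustration, and the extra selectors are hypothetical names for this example.

```typescript
// Local subset of rrweb record options used for masking (illustrative type).
interface MaskingOptions {
  maskAllInputs: boolean;
  maskTextSelector: string;
  blockSelector: string;
}

const maskingOptions: MaskingOptions = {
  // Mask every input value by default.
  maskAllInputs: true,
  // Mask rendered text in user-specific areas.
  // ".account-name" and ".profile-email" are hypothetical app selectors.
  maskTextSelector: "[data-rr-mask], .account-name, .profile-email",
  // Exclude sensitive sections entirely.
  // "#payment-form" and ".medical-record" are hypothetical app selectors.
  blockSelector: "[data-rr-block], #payment-form, .medical-record",
};

// With rrweb loaded, these would be spread into the record() call, e.g.
// record({ emit: sendEvent, ...maskingOptions }), where sendEvent is your
// own transport function.
```

Keeping the selectors in one object makes the route-by-route audit easier, since there is a single place to check when a leak shows up in a test recording.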

Production Deployment Checklist

Before enabling session replay in production, address these edge cases:

Sampling Strategy: Not every session needs recording. Implement smart sampling—10% of sessions by default, 100% for users who trigger errors, specific pages, or users matching support criteria. This cuts storage costs while preserving debugging capability for problematic sessions.
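One way to sketch this policy is a deterministic decision keyed on the session ID, so the same session always gets the same answer across page loads. The SessionContext fields, shouldRecord, and the choice of an FNV-1a hash are all assumptions for illustration.

```typescript
interface SessionContext {
  sessionId: string;
  hadError: boolean; // an error has been observed in this session
  path: string; // current route
  flaggedBySupport: boolean; // user matches support criteria
}

// 32-bit FNV-1a over UTF-16 code units; returns an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Record everything that matches a forced rule; otherwise sample by hash.
function shouldRecord(
  ctx: SessionContext,
  sampleRate: number,
  alwaysRecordPaths: string[]
): boolean {
  if (ctx.hadError || ctx.flaggedBySupport) return true;
  if (alwaysRecordPaths.some((prefix) => ctx.path.indexOf(prefix) === 0)) return true;
  // Map the hash to [0, 1) and compare against the sample rate.
  return fnv1a(ctx.sessionId) / 4294967296 < sampleRate;
}
```

Capturing the lead-up to an error in a session that was not sampled implies buffering events locally until the error occurs; that part is not shown here.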

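CORS and Keepalive: If your replay endpoint runs on a different subdomain, configure CORS headers properly. Browsers cap the combined body size of in-flight keepalive fetch requests at 64KiB; a request that exceeds the cap is rejected, which on page unload usually means the final batch is lost without anyone noticing. navigator.sendBeacon() is a useful fallback where keepalive is unavailable, but it is subject to the same kind of size limit, so the reliable fix is to flush often enough that the final batch stays small.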
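A sketch of one way to keep the final batch within the keepalive budget: track the serialized size of pending events and hand back a body to send early (with an ordinary request) whenever the next event would push the batch over the limit. UnloadSafeBatch and its methods are hypothetical, and the sketch assumes no other keepalive requests are in flight at unload.

```typescript
const KEEPALIVE_BUDGET_BYTES = 64 * 1024;

// UTF-8 byte length of a string. JSON.stringify output contains no lone
// surrogates, so every high surrogate here starts a 4-byte pair.
function utf8Length(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) { bytes += 4; i++; }
    else bytes += 3;
  }
  return bytes;
}

class UnloadSafeBatch {
  private items: string[] = [];
  private bytes = 2; // "[" and "]"
  private budget: number;

  constructor(budgetBytes: number = KEEPALIVE_BUDGET_BYTES) {
    this.budget = budgetBytes;
  }

  // Add an event. Returns JSON bodies that should be sent right away with a
  // normal request; whatever stays pending always fits within the budget.
  add(event: object): string[] {
    const json = JSON.stringify(event);
    const size = utf8Length(json);
    const sendNow: string[] = [];
    const separator = this.items.length > 0 ? 1 : 0;
    if (this.bytes + separator + size > this.budget) {
      const pending = this.drain();
      if (pending !== undefined) sendNow.push(pending);
    }
    if (size + 2 > this.budget) {
      sendNow.push("[" + json + "]"); // too large for keepalive on its own
      return sendNow;
    }
    this.bytes += (this.items.length > 0 ? 1 : 0) + size;
    this.items.push(json);
    return sendNow;
  }

  // Serialize and clear the pending batch, e.g. from a pagehide handler.
  drain(): string | undefined {
    if (this.items.length === 0) return undefined;
    const body = "[" + this.items.join(",") + "]";
    this.items = [];
    this.bytes = 2;
    return body;
  }
}
```

In a page, the unload handler would pass the result of drain() to fetch with keepalive: true or to navigator.sendBeacon(), while bodies returned from add() go out through the normal batching path.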

Cross-Origin Iframes: MutationObserver cannot see inside cross-origin iframes. Third-party widgets, embedded videos, and chat widgets appear as blank rectangles in replay. Document which iframe content is opaque and set user expectations accordingly.

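CSS-in-JS Compatibility: Libraries like styled-components inject styles into <style> tags at runtime. The replay must capture these injected styles or playback appears broken. rrweb's inlineStylesheet: true option helps, but test thoroughly with your specific CSS-in-JS solution. In production builds, styled-components typically inserts rules through the CSSOM (insertRule) rather than as text, so the <style> tag's text content can be empty; run these tests against a production build, not just development mode.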

Performance Monitoring: Initial DOM snapshots on complex dashboards (5,000+ nodes) take 50–200ms. While this doesn't block rendering, monitor this latency and consider disabling recording on low-end devices when navigator.hardwareConcurrency < 4.

Conclusion

Self-hosted session replay with live tailing transforms how teams debug user-reported issues. The rrweb ecosystem provides mature, open-source primitives for recording and playback, while streaming architectures enable real-time observation of active sessions. With proper masking, sampling, and storage management, teams can deploy replay infrastructure that rivals commercial tools like FullStory while keeping all data within their own infrastructure and jurisdiction.

The key insight is that session replay is not video; it is structured data about DOM mutations. This distinction enables compression ratios approaching 90%, interactive inspection capabilities, and real-time streaming that would be impossible with frame-based recording. For teams building complex web applications, the debugging value justifies the modest infrastructure investment.

References

PostHog Session Replay Architecture Handbook: ingestion → processing → serving pipeline with Kafka, Snappy compression, and ClickHouse metadata
rrweb Documentation: DOM mutation recording, MutationObserver integration, and privacy masking APIs

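Content statement: this article contains no advertising and no paid placements.

If you find a factual error, please send corrections to i@hotdrydog.com.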