Virtual private networks remain a cornerstone of secure remote connectivity for businesses, site operators, and developers who need predictable, private links across public networks. On low-power platforms such as ARM-based routers, single-board computers, and cloud instances, running an L2TP VPN (typically paired with IPsec for encryption) presents unique performance considerations. This article lays out a rigorous benchmarking approach for L2TP on ARM: how to measure throughput, latency, and CPU overhead; which hardware and software variables matter most; and practical tuning strategies that maximize performance without sacrificing security.

Why ARM is different for VPN workloads

ARM architectures (Cortex-A series, Neoverse, etc.) power a wide range of appliances, from SOHO routers to cloud server instances. Compared with x86, the key distinctions that affect VPN performance include:

  • Crypto acceleration availability: ARM CPUs commonly expose hardware crypto via the ARMv8 Crypto Extensions (AES, SHA), but implementations and driver support vary widely. Many low-end ARM SoCs lack robust crypto engines or offload engines.
  • Memory bandwidth and cache topology: L2TP/IPsec packet processing is memory- and cache-sensitive; smaller caches and narrower memory buses can become bottlenecks.
  • Interrupt and networking stack scaling: Packet processing on Linux often begins in interrupt context. How IRQs are routed (affinity) and whether the NIC supports RSS/multiple queues matters for multicore scaling.
  • Clock speed and per-core IPC: ARM cores typically have lower single-thread IPC/core frequency than modern x86 server CPUs, which impacts single-flow throughput.

Testbed and methodology

A repeatable, transparent benchmark is essential. The following testbed and steps reflect a methodology that produces meaningful, comparable results.

Hardware and software baseline

  • ARM devices: choose representative platforms (e.g., Raspberry Pi 4 – Cortex-A72, ODROID-N2+ – Cortex-A73/A53, and a cloud ARM instance such as AWS Graviton2/3). Include an x86 reference for context.
  • Network interface: Gigabit or 10GbE NICs depending on platform. Make sure firmware/drivers are up-to-date and support features like checksum offload, GRO/TSO, and multi-queue if available.
  • OS: Recent Linux kernel (5.10+ recommended), iproute2, strongSwan/libreswan for IPsec, xl2tpd/NetworkManager-l2tp for L2TP. The kernel config should enable the relevant CONFIG_CRYPTO_DEV_* hardware-crypto drivers and the ARMv8 CE modules (e.g., CONFIG_CRYPTO_AES_ARM64_CE).
  • Testing tools: iperf3 and netperf for throughput and latency; ping for baseline RTT; perf/top/htop and vmstat/iostat for CPU/memory; sar for sustained metrics.
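
Before any measurement, it is worth confirming what crypto the platform actually exposes. A minimal sanity check, assuming an arm64 Linux device (flag and module names vary by SoC and kernel build):

    # Look for the ARMv8 Crypto Extension flags (aes, pmull, sha1, sha2)
    grep -m1 -o 'aes\|pmull\|sha1\|sha2' /proc/cpuinfo | sort -u

    # List the AES-GCM implementations the kernel registered; entries whose
    # driver names end in -ce are backed by the crypto extensions
    grep -A4 'gcm(aes)' /proc/crypto

    # Confirm CE-accelerated modules are loaded (names vary by kernel build)
    lsmod | grep -E 'aes_ce|ghash_ce|sha2_ce'

If the aes flag is missing (as on the Raspberry Pi 4's Cortex-A72, which ships without the crypto extensions), expect software crypto to dominate CPU time.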

Topology

Typical topology uses two endpoints: client and server. To isolate VPN processing costs, ensure the link between the machines is not the bottleneck: use a direct switch, or a loopback arrangement where possible. Tests should include both local (within same LAN/VPC) and WAN-like scenarios (introducing controlled packet loss and latency using tc netem).
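
The WAN-like scenarios can be emulated directly on the egress interface. A minimal netem sketch, assuming the test NIC is eth0 and using illustrative delay and loss values:

    # Emulate a WAN path: 30 ms delay with 5 ms jitter and 0.1% loss
    tc qdisc add dev eth0 root netem delay 30ms 5ms loss 0.1%

    # Inspect, then remove the emulation when the run is finished
    tc qdisc show dev eth0
    tc qdisc del dev eth0 root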

Test parameters

  • Throughput: iperf3 TCP and UDP tests across varying parallel streams (1, 2, 4, 8, 16); sample command lines follow this list. Measure peak and sustained throughput over 60–300 second runs.
  • Latency: ICMP and TCP connection latency measurements before and after establishing the L2TP/IPsec tunnel; measure jitter.
  • CPU overhead: record per-core CPU user/system/softirq time. Use perf to capture cycles and instructions per packet to compute cycles-per-byte and cycles-per-packet.
  • Packet sizes and MTU: test with large (1400–1500B) and small (128B–512B) payloads. Include fragmentation scenarios to observe overhead when packets exceed tunnel MTU.
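
The following iperf3 invocations cover the sweeps above; a sketch that assumes the server end of the tunnel answers at 10.8.0.1 (an illustrative address):

    # On the server side of the tunnel
    iperf3 -s

    # TCP throughput sweep across parallel stream counts, 120 s sustained runs
    for p in 1 2 4 8 16; do
      iperf3 -c 10.8.0.1 -P "$p" -t 120 --json > "tcp_${p}streams.json"
    done

    # UDP with small payloads to expose per-packet overhead (-l sets datagram size)
    iperf3 -c 10.8.0.1 -u -b 0 -l 200 -t 120 --json > udp_small.json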

Key metrics and interpretation

Collecting metrics is only useful if interpreted correctly. Here’s what to prioritize and how to read the results.

Throughput vs CPU utilization

Plot throughput on the Y axis and CPU utilization on the secondary axis. Ideal behavior shows linearly increasing throughput until CPU saturates. When throughput stalls while only some cores are saturated, the bottleneck may be single-threaded processing in kernel crypto or in the L2TP handling path. Multi-stream tests reveal whether scaling across cores is effective.
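
Per-core utilization, including the softirq share, can be captured alongside the run. A sketch using sysstat's mpstat with an interval matched to a 60 s test:

    # Record per-core stats (watch the %soft column) once per second for 60 s
    mpstat -P ALL 1 60 > mpstat_during_test.log &
    iperf3 -c 10.8.0.1 -P 4 -t 60
    wait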

Latency impact

VPN encapsulation adds processing and potentially extra hops. Expect an increase in RTT proportional to per-packet CPU time and packet queueing. Small-packet workloads are disproportionately impacted because overhead per packet is higher; measure median, 95th, and 99th percentiles to capture tail latency.
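
One low-tech way to extract those percentiles from ping output, assuming GNU grep/awk and the illustrative tunnel peer at 10.8.0.1 (intervals below 200 ms may require root):

    # 1000 probes at 50 ms spacing; sort RTTs and read off p50/p95/p99
    ping -c 1000 -i 0.05 10.8.0.1 \
      | grep -oP 'time=\K[0-9.]+' \
      | sort -n \
      | awk '{ r[NR] = $1 }
             END { printf "p50=%s p95=%s p99=%s\n",
                   r[int(NR*0.50)], r[int(NR*0.95)], r[int(NR*0.99)] }'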

Cycles-per-byte and cycles-per-packet

Use perf to measure CPU cycles consumed in the crypto and networking stacks. A useful derived metric is cycles-per-byte for encryption and authentication. Lower values indicate more efficient processing (e.g., hardware-accelerated AES-GCM might yield an order of magnitude lower cycles/byte than pure-software AES-CBC+HMAC-SHA1 on some ARM cores).
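
A rough way to derive that figure, assuming perf is available and the machine is otherwise idle, is to count cycles system-wide during a fixed transfer and divide by the bytes iperf3 reports:

    # System-wide cycle count for the duration of a 60 s run
    perf stat -a -e cycles -o perf_cycles.txt -- iperf3 -c 10.8.0.1 -t 60
    # cycles/byte ~= cycles (perf_cycles.txt) / bytes transferred (iperf3 output)

    # To see where the cycles go, sample live and look for crypto and
    # XFRM symbols (e.g., aes_ce_*, gcm_*, xfrm_*; names vary by kernel)
    perf top -g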

Common performance bottlenecks and remedies

Understanding typical failure modes lets you apply targeted fixes.

1. Missing or misconfigured crypto acceleration

  • Symptoms: high CPU usage at modest throughput, especially for AES-based ciphers.
  • Checks: verify the kernel crypto modules (e.g., aes_ce_blk, ghash_ce, sha2_ce on arm64) are loaded and that strongSwan or libreswan is using kernel crypto (XFRM) rather than a userspace fallback.
  • Fixes: enable AES-GCM (single-pass authenticated encryption) in IPsec SA to reduce cycles per byte; install SoC-specific crypto drivers where available.
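
With strongSwan's swanctl configuration, switching the ESP proposal to single-pass AES-GCM might look like the following sketch (the connection name, child name, and file path are illustrative):

    # /etc/swanctl/conf.d/l2tp.conf (fragment): single-pass AEAD for ESP
    connections {
        l2tp-transport {
            children {
                l2tp {
                    esp_proposals = aes128gcm16
                }
            }
        }
    }

    # Reload after editing
    swanctl --load-all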

2. MTU/MSS and fragmentation overhead

  • Symptoms: reduced throughput for large transfers, increased retransmits.
  • Fixes: set a proper tunnel MTU (e.g., 1400 bytes for typical L2TP/IPsec overhead) and enable MSS clamping on NAT/forwarding rules; UDP fragmentation offload was removed from modern kernels, so aim to avoid fragmentation rather than accelerate it.
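
A minimal sketch, assuming the L2TP session surfaces as ppp0 and iptables is in use:

    # Leave headroom for L2TP + ESP headers inside a 1500-byte path MTU
    ip link set dev ppp0 mtu 1400

    # Clamp TCP MSS on forwarded flows so endpoints never emit segments
    # that would fragment inside the tunnel
    iptables -t mangle -A FORWARD -o ppp0 -p tcp \
      --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu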

3. Single-threaded processing

  • Symptoms: one core pinned at 100% while others idle; throughput plateaus below aggregated NIC capability.
  • Fixes: enable and tune IRQ affinity, use multi-queue NICs with RSS, and configure software like strongSwan to use multiple worker threads or use kernel-level XFRM offload where possible.
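
A sketch of queue and affinity tuning, with an illustrative interface and IRQ numbers (check /proc/interrupts for the real ones):

    # Enable as many NIC queues as there are cores, if the driver allows
    ethtool -L eth0 combined 4

    # Find the NIC's IRQs, then pin one queue per core
    grep eth0 /proc/interrupts
    echo 0 > /proc/irq/55/smp_affinity_list   # queue 0 -> core 0
    echo 1 > /proc/irq/56/smp_affinity_list   # queue 1 -> core 1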

4. Small-packet inefficiency

  • Symptoms: high CPU use and poor throughput for VoIP or DNS over VPN.
  • Fixes: enable Generic Receive Offload (GRO) / Large Receive Offload (LRO) and TSO on NICs, reduce context switches via busy polling (SO_BUSY_POLL where appropriate), and prefer ciphers with low per-packet overhead (AES-GCM over AES-CBC+HMAC).
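
Offload state is worth checking both before and after tunnel setup, since some drivers silently disable offloads for encapsulated traffic. A sketch assuming eth0:

    # Inspect the current offload state
    ethtool -k eth0

    # Enable GRO/TSO/GSO, then re-check once the tunnel is up
    ethtool -K eth0 gro on tso on gso on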

Tuning checklist for production ARM VPN servers

  • Use AES-GCM ciphers for IPsec SAs where supported; prefer ChaCha20-Poly1305 on platforms with no AES acceleration but with fast scalar performance.
  • Validate kernel crypto modules and consider building a kernel with required drivers for hardware crypto engines.
  • Adjust MTU and MSS across tunnel endpoints and clients; test with realistic application payloads.
  • Enable NIC offloads and verify they remain enabled after tunnel is created (some drivers disable offloads for encapsulated traffic).
  • Configure IRQ affinity and irqbalance to distribute interrupts across cores aligned with RSS queues.
  • Use strongSwan’s multiple worker threads, or other user-space implementations that scale; prefer kernel-based XFRM processing to avoid user-space context switches where possible.
  • Monitor long-running tests (hours) to detect thermal throttling on embedded ARM devices which can reduce CPU frequency and throughput.
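
For the thermal point in particular, watching temperature and clock frequency together catches throttling as it happens; a sketch using typical sysfs paths (zone and policy names vary by SoC):

    # SoC temperature (millidegrees C) and current CPU frequency (kHz);
    # a frequency drop alongside rising temperature indicates throttling
    watch -n 5 "cat /sys/class/thermal/thermal_zone0/temp /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq"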

Sample results and what to expect

While absolute numbers vary by SoC and kernel version, general trends are consistent:

  • ARM SBC (Cortex-A53/A72 class) without AES hardware: TCP throughput over L2TP/IPsec often caps at 50–300 Mbps depending on core and memory speeds; CPU utilization may approach 100% on one or two cores.
  • ARM with AES extension and kernel crypto modules: throughput improves substantially—400–900 Mbps on higher-end Cortex-A72/A73 cores with AES-GCM offload.
  • Graviton2/3 instances: can approach multi-gigabit encrypted throughput with proper tuning, benefiting from higher memory bandwidth and full ARMv8 crypto extension support.
  • Small-packet workloads see much lower throughput due to per-packet overhead; consider tuning offloads and choosing low-overhead ciphers.

Documenting and sharing your results

For teams and clients, produce a clear benchmark report including:

  • Hardware/software inventory and kernel/config versions.
  • Exact VPN configuration (cipher suites, SA lifetimes, L2TP options, fragmentation/MTU settings).
  • Test scripts and iperf/netperf command lines (so others can reproduce).
  • Graphs: throughput vs streams, CPU usage over time, latency percentiles, cycles-per-byte plots.

Openly publishing reproducible setup (anonymized if necessary) accelerates troubleshooting and allows the community to verify claims.

Conclusion

Benchmarking L2TP on ARM reveals the interplay between crypto capabilities, kernel networking, and platform-specific constraints. With systematic testing—covering throughput, latency, and CPU overhead—you can identify whether the bottleneck is crypto, packet handling, or hardware limits. Simple configuration changes such as selecting AES-GCM, tuning MTU/MSS, enabling offloads, and distributing interrupts can unlock substantial gains. For production deployments, automate monitoring of CPU, temperature, and network metrics so you catch regressions early.

For detailed guides, configuration examples, and VPN plans tailored to both embedded ARM appliances and cloud ARM instances, visit Dedicated-IP-VPN.

Dedicated-IP-VPN: https://dedicated-ip-vpn.com/