WireGuard has rapidly become the de facto VPN protocol for performance-sensitive deployments thanks to its minimal codebase, modern cryptography, and tight integration with the Linux kernel. For operators, developers, and enterprise IT teams, understanding how different cryptographic primitives and platform characteristics affect real-world throughput and latency is critical when sizing systems or choosing hardware. This article digs into practical encryption benchmarking of WireGuard, exposing the factors that determine performance and offering actionable guidance for achieving optimal results.
Why benchmark WireGuard encryption?
Benchmarks deliver more than raw numbers: they reveal bottlenecks, inform hardware selection, and guide configuration choices. WireGuard’s cryptographic design centers on a small set of modern algorithms (Curve25519, ChaCha20, Poly1305, BLAKE2s), but their performance characteristics vary across CPU architectures, instruction set support, and kernel versus userspace implementations. For infrastructure providers and site owners, the key questions are:
- Which algorithm dominates CPU usage at a given throughput?
- How much does AES hardware acceleration (AES-NI) help?
- What is the overhead of handshakes and ephemeral keying?
- How do packet size and concurrency influence throughput and latency?
Benchmark methodology: repeatable, controlled, representative
Reliable results start with a consistent methodology. The following approach was used in our tests and is recommended for reproducing results:
- Test platforms: x86_64 server CPU with AES-NI (e.g., Intel Xeon / AMD EPYC), ARM64 server CPU with and without crypto extensions (e.g., AWS Graviton), and a low-power x86 CPU for comparison.
- Software stack: Linux kernel 5.6 or later with the native WireGuard kernel module (WireGuard is in the mainline kernel from 5.6); also include wireguard-go and boringtun for userspace comparisons.
- Tools: iperf3 for TCP/UDP throughput, pktgen/dpdk for microbenchmarks, perf/top for CPU profiling, and tcpdump for packet capture.
- Network setup: direct NIC-to-NIC connection (no switching), identical MTU and offload settings. Tests run with different MTUs (e.g., 1500 vs 9000) to show the effects of per-packet overhead and fragmentation.
- Workload variables: single vs multiple concurrent streams, varied packet sizes (64B, 512B, 1500B), and long-lived vs short-lived sessions to observe handshake amortization.
- Repeatability: each data point measured multiple times, with the mean and standard deviation reported (a minimal runner is sketched after this list).
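As a concrete starting point, here is a minimal runner that repeats an iperf3 UDP test and reports the mean and standard deviation. It assumes iperf3 is installed and an iperf3 server is already listening on the peer; the peer address, payload size, and run count below are placeholders to adjust for your environment.

```python
#!/usr/bin/env python3
"""Repeat an iperf3 UDP run and report mean/stddev throughput."""
import json
import statistics
import subprocess

PEER = "10.0.0.2"        # placeholder: peer's WireGuard tunnel address
PACKET_SIZE = 1400       # UDP payload size in bytes
DURATION = 10            # seconds per run
RUNS = 5                 # repetitions per data point

def run_once() -> float:
    """Run one iperf3 UDP test and return throughput in Gbit/s."""
    cmd = [
        "iperf3", "-c", PEER, "-u",
        "-b", "0",                  # unlimited rate: push as fast as possible
        "-l", str(PACKET_SIZE),
        "-t", str(DURATION),
        "-J",                       # JSON output for easy parsing
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    report = json.loads(out.stdout)
    # JSON layout can differ slightly between iperf3 versions; adjust if needed.
    return report["end"]["sum"]["bits_per_second"] / 1e9

samples = [run_once() for _ in range(RUNS)]
print(f"mean  = {statistics.mean(samples):.2f} Gbit/s")
print(f"stdev = {statistics.stdev(samples):.2f} Gbit/s")
```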
Measuring crypto vs non-crypto overhead
To isolate cryptographic cost, compare WireGuard with an equivalent UDP passthrough baseline (no encryption) and with kernel IPsec transport using AES-GCM. CPU cycles, context-switch rates, and cache effects can then be attributed to WireGuard’s crypto and packet-processing logic.
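One simple way to do that attribution is to sample aggregate CPU time from /proc/stat while each scenario (plain UDP, WireGuard, IPsec) carries the same offered load. The fragment below is a minimal sketch of such a sampler; run it on the machine under test while traffic is flowing.

```python
import time

def cpu_busy_fraction(interval: float = 10.0) -> float:
    """Sample aggregate CPU busy time over `interval` seconds using /proc/stat."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]          # idle + iowait columns
        return sum(fields), idle

    total0, idle0 = snapshot()
    time.sleep(interval)
    total1, idle1 = snapshot()
    busy = (total1 - total0) - (idle1 - idle0)
    return busy / (total1 - total0)

# Run this while a throughput test is active on each path
# (plain UDP, WireGuard, IPsec) and compare the busy fractions.
print(f"CPU busy: {cpu_busy_fraction() * 100:.1f}%")
```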
Key performance factors
Several dimensions control WireGuard’s real-world performance:
1. Cipher performance and hardware acceleration
WireGuard uses ChaCha20-Poly1305 for symmetric encryption and authentication and Curve25519 for key agreement; there is no cipher negotiation, so AES-GCM comparisons in this article refer to alternative stacks such as IPsec. ChaCha20-Poly1305 is designed for high performance on CPUs without AES hardware acceleration. On x86_64 CPUs with AES-NI, AES-GCM can be competitive or superior in well-optimized implementations, but WireGuard’s in-kernel implementation is optimized around ChaCha20.
Takeaways: On AES-NI-equipped x86 servers, an AES-GCM stack (e.g., IPsec) can offer very high throughput if the crypto library and kernel use assembly-optimized AES-NI routines. On ARM servers without AES acceleration, ChaCha20-Poly1305 typically outperforms AES-GCM.
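As a rough sanity check of relative cipher cost on a given machine, the microbenchmark below compares ChaCha20-Poly1305 and AES-256-GCM using Python’s `cryptography` package. It exercises OpenSSL’s userspace implementations rather than the kernel’s WireGuard code, so treat the absolute numbers as indicative only; the payload size and iteration count are arbitrary placeholders.

```python
"""Userspace AEAD microbenchmark: ChaCha20-Poly1305 vs AES-256-GCM."""
import os
import time
from cryptography.hazmat.primitives.ciphers.aead import AESGCM, ChaCha20Poly1305

PAYLOAD = os.urandom(1400)   # roughly one MTU-sized packet payload
NONCE = os.urandom(12)       # nonce reuse is fine for a throughput benchmark
ITERATIONS = 200_000

def bench(name: str, aead) -> None:
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        aead.encrypt(NONCE, PAYLOAD, None)
    elapsed = time.perf_counter() - start
    gbps = ITERATIONS * len(PAYLOAD) * 8 / elapsed / 1e9
    print(f"{name:20s} {gbps:6.2f} Gbit/s (single core)")

bench("ChaCha20-Poly1305", ChaCha20Poly1305(ChaCha20Poly1305.generate_key()))
bench("AES-256-GCM", AESGCM(AESGCM.generate_key(bit_length=256)))
```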
2. Kernel vs userspace implementations
WireGuard is available both as a kernel module and as userspace implementations (wireguard-go, boringtun). The kernel implementation avoids per-packet context switches and user/kernel copies and benefits from optimized in-kernel crypto routines and NIC offloads, yielding substantially higher throughput and lower latency in most cases.
Measured impact: In our tests, kernel WireGuard often provided >2x the throughput of wireguard-go on the same hardware for large-packet UDP streams, with even larger differences for small-packet, high-rate workloads due to reduced syscall overhead.
3. CPU microarchitecture and vector instruction sets
Crypto primitives benefit heavily from SIMD and vector instructions. On x86, AES-NI and PCLMULQDQ accelerate AES and GHASH respectively, while ChaCha20 gains from AVX2/AVX-512 optimizations where available. On ARM, NEON (ASIMD), the ARMv8 Cryptography Extensions (AES, PMULL, SHA2), and the newer SVE matter. The degree to which the kernel or library leverages these extensions directly affects per-packet cycle counts.
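A quick way to see which of these extensions a host actually exposes is to read the flag lists from /proc/cpuinfo. The sketch below checks a few crypto-relevant flags; flag names differ between x86 and ARM, and the list is illustrative rather than exhaustive.

```python
"""Check which crypto-relevant CPU features the host exposes."""

def cpu_flags() -> set:
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            # x86 reports "flags", ARM reports "Features"
            if line.lower().startswith(("flags", "features")):
                flags.update(line.split(":", 1)[1].split())
    return flags

INTERESTING = ["aes", "pclmulqdq", "avx2", "avx512f",
               "neon", "asimd", "pmull", "sha2", "sve"]
present = cpu_flags()
for flag in INTERESTING:
    print(f"{flag:10s} {'yes' if flag in present else 'no'}")
```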
4. Packet size, MTU and packet rate
Encryption overhead is both per-byte (stream cipher work) and per-packet (AEAD tag, header parsing). Small packets are dominated by per-packet overhead (interrupts, syscalls, context switching), whereas large packets are limited by per-byte cipher throughput and NIC bandwidth. Jumbo frames reduce per-packet overhead and can improve effective throughput markedly.
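To put rough numbers on this, the calculation below estimates the packet rate needed to fill a link at several payload sizes, assuming IPv4 and roughly 60 bytes of per-packet tunnel overhead (outer IP + UDP + WireGuard header + Poly1305 tag) plus Ethernet framing; the figures are back-of-envelope, not measurements.

```python
"""Back-of-envelope: packets per second needed to saturate a link at a
given payload size. Small frames shift the bottleneck from cipher
throughput to per-packet processing."""

LINK_GBPS = 10
for payload in (64, 512, 1400):
    # Ethernet header + FCS (18) + preamble/IFG (20) + ~60 bytes of
    # IPv4/UDP/WireGuard encapsulation for each tunneled packet.
    wire_bytes = payload + 18 + 20 + 60
    pps = LINK_GBPS * 1e9 / 8 / wire_bytes
    print(f"{payload:5d} B payload -> {pps / 1e6:5.2f} Mpps to fill {LINK_GBPS} G")
```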
5. Concurrency and parallelism
WireGuard processes packets independently, allowing straightforward parallelism. Multi-core scaling depends on RSS/queue-to-CPU mapping, lock contention in the networking stack, and memory bandwidth. Ensuring proper NIC queue configuration and distributing connections across cores is essential to avoid a single-core bottleneck.
Representative benchmark results (summary)
Below are representative observations from our test runs. Exact numbers vary with hardware and kernel version, but the trends are consistent:
- Kernel WireGuard on AES-NI x86 with 1500B UDP streams: sustained 8–10 Gbps per physical 10G link before CPU saturation. Optimized AES-GCM paths (e.g., IPsec with AES-NI) slightly outperform ChaCha20 in this scenario.
- Kernel WireGuard on ARM64 without AES extensions: ChaCha20-Poly1305 outperforms AES-GCM by ~20–40% on per-core throughput.
- Userspace implementations: wireguard-go shows significant overhead, typically limited to 1–2 Gbps on desktop/server hardware for large packets; boringtun (Rust) performs better and can approach kernel speeds for some workloads but still trails the kernel implementation in latency-sensitive scenarios.
- Small-packet workloads (64B): per-packet overhead dominates. Even on powerful CPUs, throughput is limited by packet-processing capacity (tens to hundreds of thousands of packets per second per core), making multi-queue NICs and the kernel module effectively mandatory for line-rate performance on 10G/25G links.
Profiling insights: where cycles go
Profiling with perf shows recurring hotspots:
- AEAD encrypt/decrypt: ChaCha20/Poly1305 or AES-GCM routines consume the largest fraction of cycles during high throughput tests.
- Packet copy and skb allocation: memory allocation and copying can be non-trivial, particularly when zero-copy or XDP are not used.
- Networking stack overhead: routing lookup, firewall rules (iptables/nftables), and socket handling can add latency and CPU usage.
- Cryptographic setup for handshakes: rare during steady-state but important for short-lived connections; Curve25519 operations are relatively cheap but not negligible for bursty connection patterns.
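A simple way to reproduce these observations is to capture a short system-wide profile while a throughput test is running. The fragment below wraps perf record/report; it assumes perf is installed, that it runs with sufficient privileges, and that traffic is already flowing during the capture window.

```python
"""Capture a short system-wide profile and print the top symbols."""
import subprocess

DURATION = 15  # seconds; run while iperf3 traffic is active

# System-wide, with call graphs, written to ./perf.data.
subprocess.run(
    ["perf", "record", "-a", "-g", "--", "sleep", str(DURATION)],
    check=True,
)

# Text report sorted by symbol; look for chacha20/poly1305/aes routines,
# memcpy/skb functions, and nftables/routing hooks near the top.
report = subprocess.run(
    ["perf", "report", "--stdio", "--sort", "symbol"],
    capture_output=True, text=True, check=True,
).stdout
print("\n".join(report.splitlines()[:40]))
```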
Practical configuration and tuning recommendations
To get the best real-world encryption performance in WireGuard, focus on the following areas:
Use the kernel implementation when possible
The in-kernel WireGuard module typically outperforms userspace alternatives in throughput and latency. For production VPN servers and gateways, prefer kernel WireGuard on modern kernels.
Pick hardware aligned with your workload
- If you expect heavy per-byte throughput on x86, choose CPUs with AES-NI and sufficient memory bandwidth and PCIe lanes for your NICs.
- For ARM deployments (e.g., edge or embedded), prefer SoCs with crypto acceleration or design for ChaCha20’s strengths.
Optimize NIC and kernel network stack
- Enable RSS and map queues to CPU cores handling WireGuard to spread load.
- Use jumbo frames where possible to reduce per-packet overhead.
- Disable unnecessary packet filtering or use nftables with optimized rulesets; pin interrupt affinity to avoid cross-core cache thrashing (a minimal affinity-pinning sketch follows this list).
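As an illustration of the interrupt-affinity point, the sketch below spreads a NIC’s queue interrupts across a chosen set of cores by writing /proc/irq/&lt;n&gt;/smp_affinity_list. The interface name and CPU list are placeholders, root is required, and irqbalance should be stopped first or it will revert the pinning.

```python
"""Pin a NIC's receive/transmit queue IRQs to a fixed set of CPUs."""
from pathlib import Path

NIC = "eth0"                 # placeholder interface name
CPUS = [0, 1, 2, 3]          # cores that should handle WireGuard traffic

# Collect IRQ numbers whose description mentions the interface
# (queue IRQs typically appear as e.g. "eth0-TxRx-0").
irqs = []
for line in Path("/proc/interrupts").read_text().splitlines():
    if NIC in line:
        irqs.append(line.split(":")[0].strip())

# Round-robin the queues across the chosen cores.
for i, irq in enumerate(irqs):
    cpu = CPUS[i % len(CPUS)]
    Path(f"/proc/irq/{irq}/smp_affinity_list").write_text(str(cpu))
    print(f"IRQ {irq} -> CPU {cpu}")
```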
Minimize unnecessary copying
Leverage features like GRO/TSO where appropriate. In advanced setups, consider XDP/AF_XDP or DPDK-based data paths when ultra-low latency and maximum throughput are required, but note this increases implementation complexity.
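To check what your NIC currently offloads, and to toggle a feature such as GRO, something like the following works on most Linux systems; the interface name is a placeholder, changes require root, and the effect should be verified with your own benchmarks rather than assumed to be a win.

```python
"""Inspect and optionally enable GRO on an interface via ethtool."""
import subprocess

DEV = "eth0"  # placeholder interface name

# Show current offload features (look for generic-receive-offload,
# tcp-segmentation-offload, generic-segmentation-offload).
print(subprocess.run(["ethtool", "-k", DEV],
                     capture_output=True, text=True, check=True).stdout)

# Example: turn GRO on for the underlying interface.
subprocess.run(["ethtool", "-K", DEV, "gro", "on"], check=True)
```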
Consider concurrency and connection patterns
For large numbers of small connections (e.g., IoT or mobile clients), handshake costs and per-connection state matter. Prefer longer-lived tunnels where appropriate so that handshake costs amortize; WireGuard re-keys automatically on a fixed schedule, so frequent tunnel setup and teardown, rather than rekeying itself, is usually what drives handshake overhead when balancing ephemeral-key security against performance.
Security vs performance: balancing trade-offs
Performance tuning should not undermine security. WireGuard’s choice of modern algorithms is deliberate: ChaCha20-Poly1305 and Curve25519 provide robust security with good performance across platforms. Changes that trade cryptographic strength for speed are not recommended. Instead, focus on platform-level optimizations (AES-NI, NEON, kernel integration, NIC offloads) that preserve algorithmic guarantees while improving throughput.
Checklist for benchmarking in your environment
- Document hardware, kernel version, WireGuard version, and testing tools.
- Run baseline network tests without encryption to identify non-crypto bottlenecks.
- Measure single-connection and multi-connection scenarios across packet sizes.
- Profile CPU to identify hotspots and validate that crypto is the limiting factor before buying hardware upgrades.
- Test both kernel and userspace implementations if portability or user-mode constraints exist.
Conclusion and recommended next steps
WireGuard delivers excellent performance across diverse environments, but real-world throughput depends on the interplay of cipher choice, CPU microarchitecture, kernel vs userspace implementation, and network configuration. For most enterprise and hosting environments, the kernel WireGuard module on AES-NI-equipped x86 servers yields the best throughput. For ARM or CPU-limited edge devices, ChaCha20-Poly1305 often provides superior per-core throughput.
Start by benchmarking in your environment using the methodology outlined above. Focus first on eliminating non-crypto bottlenecks (NIC configuration, offloads, kernel tuning), then target crypto acceleration appropriate to your hardware. For ongoing operations, monitor CPU utilization and packet rates to detect when scaling or reconfiguration is needed.
For more in-depth guides and tailored deployment advice, visit Dedicated-IP-VPN at https://dedicated-ip-vpn.com/.