This article explores realistic performance characteristics of Trojan-compatible VPN servers under production-like conditions, focusing on throughput, latency, and scalability. It is written for site operators, enterprise architects, and developers who need actionable measurements and configuration guidance to deploy robust Trojan servers. The tests and recommendations emphasize reproducibility and explain how system-level factors interact with protocol behavior to shape real-world performance.
Why measure Trojan server performance?
Trojan is designed to blend with normal TLS traffic and provide high-speed proxying with minimal fingerprinting. However, raw protocol design is only part of the story: the actual throughput, latency impact, and ability to handle many concurrent sessions depend heavily on implementation details, TLS overhead, OS network stack, and hardware acceleration. Accurate benchmarking helps answer operational questions such as:
- What per-core bandwidth can I expect for typical payloads?
- How much added latency does TLS and proxying introduce?
- Which kernel and application tunings yield the best scalability?
- What are practical limits for concurrent connections and C10k/C100k scenarios?
Testbed and methodology
To produce meaningful results you need a consistent environment. Below is the testbed and methodology used in our evaluation, which we recommend reproducing:
- Servers: Two physical machines or cloud instances interconnected over a 10 Gbps link. Server A runs the Trojan server; Server B runs clients and traffic generators.
- Hardware: Modern 8-core CPU (e.g., Intel Xeon or AMD EPYC) with AES-NI support, 64 GB RAM, and NICs with offload support (GSO/GRO/TSO).
- OS: Linux 5.x kernel. Tests compared default congestion control (Cubic) vs BBR v1/v2.
- TLS: TLS 1.3 using AES-128-GCM and ChaCha20-Poly1305 cipher suites. Hardware AES acceleration was enabled.
- Clients/tools: iperf3 for raw TCP throughput, wrk2 for HTTP-level throughput (pointed at the Trojan client's local HTTP proxy), ping and hping3 for latency and small-packet behavior, and tsung or custom socket flooders for concurrency stress tests.
- Measurement approach: Measure baseline (direct TCP) then through Trojan. Run multiple iterations, warm up TLS session caches, and measure both steady-state throughput and short-lived connection performance.
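In addition to iperf3, the direct-TCP baseline can be sanity-checked with a small script. The sketch below is a minimal Go sender (the target address is a placeholder) that pushes a fixed amount of incompressible data over a raw TCP connection and reports the achieved rate; it is illustrative, not part of any Trojan tooling.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"net"
	"time"
)

func main() {
	const target = "10.0.0.2:5201" // hypothetical receiver on Server A (e.g., iperf3 -s)
	const total = 1 << 30          // 1 GiB payload for a steady-state sample

	conn, err := net.Dial("tcp", target)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 64*1024)
	rand.Read(buf) // incompressible payload so offloads/compression don't skew results

	start := time.Now()
	sent := 0
	for sent < total {
		n, err := conn.Write(buf)
		if err != nil {
			panic(err)
		}
		sent += n
	}
	elapsed := time.Since(start)
	mbps := float64(sent) * 8 / 1e6 / elapsed.Seconds()
	fmt.Printf("sent %d bytes in %v (%.1f Mbps)\n", sent, elapsed, mbps)
}
```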
Key metrics collected
- Raw throughput (Mbps/Gbps) for long-lived TCP flows.
- Handshake and session establishment time (ms) for new TLS connections (see the timing sketch after this list).
- Round-trip latency increase (added ms) under idle and loaded conditions.
- CPU utilization per flow and per-core throughput.
- Maximum concurrent connections sustained and associated resource consumption.
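As a sketch of how the handshake metric can be sampled, the following Go snippet times the TCP connect and the TLS 1.3 handshake separately against a placeholder endpoint; it is a measurement aid under these assumptions, not part of any Trojan implementation.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	const addr = "vpn.example.com:443" // hypothetical Trojan endpoint

	t0 := time.Now()
	raw, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		panic(err)
	}
	tcpDone := time.Now()

	// Layer TLS 1.3 on top of the raw TCP connection and time the handshake alone.
	conn := tls.Client(raw, &tls.Config{
		ServerName: "vpn.example.com",
		MinVersion: tls.VersionTLS13,
	})
	if err := conn.Handshake(); err != nil {
		panic(err)
	}
	tlsDone := time.Now()
	defer conn.Close()

	fmt.Printf("TCP connect: %v, TLS 1.3 handshake: %v, total setup: %v\n",
		tcpDone.Sub(t0), tlsDone.Sub(tcpDone), tlsDone.Sub(t0))
}
```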
Throughput: real-world results and bottlenecks
In steady-state tests with long-lived TCP flows, Trojan server throughput is primarily constrained by three factors: TLS crypto performance, single-core packet processing capacity, and NIC/driver offloads.
Sample observations:
- With AES-NI and TLS 1.3, a single core of a modern dual-socket server can often sustain 700–1,200 Mbps of encrypted payload when using an optimized Trojan implementation (e.g., trojan-go built with Go 1.20+ and GOMAXPROCS tuned). Without hardware AES, throughput drops by 40–60%.
- Using multiple parallel connections and distributing them across cores (SO_REUSEPORT + multiple server worker processes) scales linearly until NIC or kernel scheduler saturation occurs.
- Enabling BBR typically improves throughput in high-BDP paths and under packet loss, while Cubic performs adequately for low-latency LAN-like links.
Common bottlenecks and mitigations:
- CPU crypto saturation: Ensure AES-NI is available and actually used; prefer AES-GCM on CPUs with AES acceleration, and consider ChaCha20-Poly1305 on mobile/ARM hardware without AES instructions.
- Single-threaded event loop limits: Run multiple server instances or worker processes with SO_REUSEPORT to spread accept() load across cores (see the listener sketch after this list).
- NIC and driver: Enable GSO/GRO/TSO and use recent drivers; check for dropped packets with ethtool -S and tune ring buffer sizes.
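A minimal Go sketch of the SO_REUSEPORT pattern for a Go-based server such as trojan-go, with an optional per-socket BBR opt-in (assuming the bbr module is loaded); the port and the handler body are placeholders, and a real deployment would run one such worker per core.

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort opens a TCP listener with SO_REUSEPORT so that several
// identical worker processes can bind the same port and the kernel spreads
// incoming connections across them. Optionally it also requests BBR for this
// socket via TCP_CONGESTION.
func listenReusePort(addr string, wantBBR bool) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var soErr error
			if err := c.Control(func(fd uintptr) {
				soErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
				if soErr == nil && wantBBR {
					// On Linux, accepted sockets should inherit the listener's congestion control.
					soErr = unix.SetsockoptString(int(fd), unix.IPPROTO_TCP, unix.TCP_CONGESTION, "bbr")
				}
			}); err != nil {
				return err
			}
			return soErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	ln, err := listenReusePort(":8443", true) // run one copy of this process per core
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println("accept:", err)
			continue
		}
		go func(c net.Conn) {
			defer c.Close() // TLS termination and proxying would go here
		}(conn)
	}
}
```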
Latency: handshake cost and per-packet delay
TLS introduces unavoidable handshake overhead. In practice the impact can be broken down into connection setup latency and per-packet processing latency:
Connection setup
With TLS 1.3 and TLS session resumption (session tickets or 0-RTT where supported), new connection handshake costs can be reduced but not eliminated.
- Cold handshake (no resumption): roughly 70–120 ms of extra setup time on typical Internet paths, dominated by the TCP three-way handshake plus the TLS 1.3 handshake. This cost is mostly network RTT; on low-latency links the cryptographic processing itself adds only 1–5 ms.
- Session resumption: TLS 1.3 session tickets skip the certificate exchange and most of the asymmetric crypto, and 0-RTT (where supported, for replay-safe traffic only) lets the client send data in its first flight, bringing setup time down significantly compared with a cold handshake.
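On the client side in Go's crypto/tls, enabling resumption is mostly a matter of attaching a session cache and letting the ticket arrive after the first handshake. The sketch below uses a placeholder hostname and assumes the server issues session tickets (the TLS 1.3 default).

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

func dialOnce(cfg *tls.Config, addr string) {
	start := time.Now()
	conn, err := tls.Dial("tcp", addr, cfg)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Printf("handshake %v, resumed=%v\n", time.Since(start), conn.ConnectionState().DidResume)

	// In TLS 1.3 the session ticket arrives after the handshake; a short read
	// lets the client process it so that the next dial can resume.
	conn.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
	var buf [1]byte
	conn.Read(buf[:])
}

func main() {
	const addr = "vpn.example.com:443" // hypothetical Trojan endpoint
	cfg := &tls.Config{
		ServerName:         "vpn.example.com",
		MinVersion:         tls.VersionTLS13,
		ClientSessionCache: tls.NewLRUClientSessionCache(128), // required for ticket reuse
	}
	dialOnce(cfg, addr) // first dial: full TLS 1.3 handshake
	dialOnce(cfg, addr) // second dial: should report resumed=true via a session ticket
}
```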
Per-packet latency
Per-packet added latency for established connections is modest: on well-tuned systems the extra cost is typically 0.2–3 ms due to TLS record processing and application-layer proxying overhead. Under high CPU load this can increase notably, so maintaining headroom is important.
Scalability: connections, file descriptors and kernel tuning
Scalability in Trojan deployments is shaped by application-level concurrency limits and OS-level resources. For C10k/C100k scenarios consider the following:
- File descriptor limits: Increase ulimit -n and /proc/sys/fs/file-max. For 100k concurrent connections, set ulimit -n to 200k+ per process and raise the system-wide limits accordingly (a programmatic sketch follows this list).
- Ephemeral ports and TIME_WAIT: Enable net.ipv4.tcp_tw_reuse where appropriate so outbound sockets in TIME_WAIT can be reused, and widen the ephemeral port range if the server originates many upstream connections. Use SO_REUSEPORT to allow multiple processes to bind the same port for better accept distribution.
- Netfilter/NFTables: Hardware offloading and batching reduce per-packet CPU. Minimize heavy firewall rules in the fast path; use connection tracking only where necessary.
- Memory per connection: Estimate per-socket memory (typically tens of KB including kernel buffers and application state). For 100k connections this adds up to multiple GB; ensure sufficient RAM and tune socket buffer limits (net.ipv4.tcp_mem, tcp_rmem/tcp_wmem) so the total stays within budget.
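For the file-descriptor point above, a Go-based server can also raise its own soft limit at startup, up to whatever hard limit systemd or limits.conf grants. A minimal sketch:

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	var rl unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}
	log.Printf("RLIMIT_NOFILE soft=%d hard=%d", rl.Cur, rl.Max)

	// Raise the soft limit to the hard limit; for ~100k connections the hard
	// limit itself must already be 200k+ (set via limits.conf or the unit file).
	rl.Cur = rl.Max
	if err := unix.Setrlimit(unix.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}
	log.Printf("raised soft limit to %d file descriptors", rl.Cur)
}
```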
Practical scaling patterns
For mid-size deployments (thousands of concurrent users), a few well-tuned instances with horizontal autoscaling work best. For large-scale setups (tens of thousands+), use these patterns:
- A stateless front-end layer of multiple Trojan server instances behind L4 (TCP) load balancers, with consistent hashing where session affinity is needed (a minimal forwarder sketch follows this list).
- Offloading TLS termination to a proxy or load balancer (e.g., HAProxy, NGINX) for some use cases, if Trojan's feature set permits; this reduces per-instance CPU but changes the TLS fingerprint exposed to observers.
- Connection pooling and HTTP/2 multiplexing at upper layers, where the traffic type allows it, to reduce concurrent connection counts.
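To make the L4 pattern concrete, the sketch below is a bare-bones TCP pass-through forwarder in Go with a hypothetical backend address: it splices bytes in both directions and never terminates TLS, so the Trojan instances behind it keep their own certificates.

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	const backend = "10.0.1.10:443" // hypothetical Trojan instance behind the balancer

	ln, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Println("accept:", err)
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			b, err := net.Dial("tcp", backend)
			if err != nil {
				log.Println("backend dial:", err)
				return
			}
			defer b.Close()
			// Splice both directions; TLS stays end-to-end between client and Trojan.
			go io.Copy(b, c)
			io.Copy(c, b)
		}(client)
	}
}
```

A production load balancer adds health checks, consistent hashing, and connection draining on top of this basic splice, but the data path is essentially the same.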
System and application tuning checklist
The following checklist summarizes the most impactful tunings to optimize a Trojan server instance for throughput and concurrency:
- Enable AES hardware acceleration and confirm the OpenSSL/Go TLS stack actually uses it (see the startup check after this list).
- Use TLS 1.3 with modern cipher suites; enable session tickets and configure 0-RTT carefully if applicable.
- Set GOMAXPROCS equal to the number of physical cores for Go-based implementations; consider running multiple trojan-go instances pinned to CPU sets.
- Use SO_REUSEPORT across multiple processes to distribute connections and reduce lock contention on accept().
- Tune kernel TCP parameters: tcp_rmem/tcp_wmem, net.core.rmem_max/wmem_max, net.ipv4.tcp_syncookies, and consider enabling BBR for high-BDP links.
- Increase file descriptor limits and ensure sufficient RAM for connection buffers.
- Enable NIC offloads (GSO/GRO/TSO) and set appropriate IRQ affinity.
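The AES and GOMAXPROCS items above can be made visible at startup. The sketch below uses golang.org/x/sys/cpu to check for AES instructions (x86 flags shown; other architectures expose equivalents) and logs the GOMAXPROCS setting; it is a convenience check, not something Trojan implementations require.

```go
package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/cpu"
)

func main() {
	// Go's crypto/tls already prefers AES-GCM when AES instructions are present;
	// this check simply makes that decision visible in the logs.
	if cpu.X86.HasAES && cpu.X86.HasPCLMULQDQ {
		log.Println("AES-NI present: AES-128-GCM is the sensible default")
	} else {
		log.Println("no AES acceleration detected: prefer ChaCha20-Poly1305")
	}

	// runtime.NumCPU reports logical CPUs; pinning to physical cores (taskset,
	// cgroup cpusets) is done outside the process.
	n := runtime.NumCPU()
	runtime.GOMAXPROCS(n)
	log.Printf("GOMAXPROCS set to %d", n)
}
```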
Workload-specific considerations
Different workloads will expose different behaviors:
- Large file downloads: Dominated by throughput; ensure TCP window sizes and buffer limits match the path's bandwidth-delay product (a worked BDP calculation follows this list).
- Many short-lived connections (API calls, small requests): Connection setup overhead and TLS handshake dominate; session reuse and connection pooling are crucial.
- UDP-like workloads: Trojan is TLS/TCP-centric; for true UDP you may need a parallel solution (e.g., WireGuard, QUIC-based proxies). Routing UDP over TCP introduces head-of-line blocking.
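To illustrate the BDP tuning mentioned above, the buffer target is simply bandwidth multiplied by round-trip time; the link speed and RTT below are example values, and the sysctl line in the comment is one possible setting for that path.

```go
package main

import "fmt"

func main() {
	// Example: a 1 Gbps path with 60 ms RTT.
	const bandwidthBitsPerSec = 1e9
	const rttSec = 0.060

	bdpBytes := bandwidthBitsPerSec * rttSec / 8 // bytes "in flight" at full rate
	fmt.Printf("BDP ≈ %.1f MB\n", bdpBytes/1e6)  // ≈ 7.5 MB

	// Socket buffer maxima should be at least the BDP (often 2x for headroom),
	// e.g. net.ipv4.tcp_rmem = 4096 87380 16777216 for this path.
	fmt.Printf("suggested socket buffer max ≥ %.0f bytes\n", 2*bdpBytes)
}
```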
Security and observability trade-offs
Because Trojan is designed to look like HTTPS, it naturally resists basic DPI detection. However, observability and logging can add CPU overhead. Consider:
- Offloading detailed logs to a separate aggregator rather than logging synchronously in the fast path (see the sketch after this list).
- Sampling connection metrics instead of logging everything.
- Monitoring TLS key and session ticket rotation impacts on handshake rates and cache behavior.
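A minimal Go sketch of the first two points: connection events are pushed into a bounded channel and drained off the hot path, and when the buffer is full the event is dropped rather than blocking the proxy. The event fields and the stdout "aggregator" are placeholders.

```go
package main

import (
	"log"
	"time"
)

type connEvent struct {
	remote  string
	bytesUp int64
	rtt     time.Duration
}

// events is deliberately bounded: the fast path never blocks on observability.
var events = make(chan connEvent, 4096)

// record is called from connection handlers; it drops rather than blocks.
func record(ev connEvent) {
	select {
	case events <- ev:
	default: // buffer full: drop the sample instead of stalling the data path
	}
}

func main() {
	// Drain goroutine: ship to a real aggregator here; stdout stands in for it.
	go func() {
		for ev := range events {
			log.Printf("conn %s up=%dB rtt=%v", ev.remote, ev.bytesUp, ev.rtt)
		}
	}()

	// Example usage from a handler:
	record(connEvent{remote: "203.0.113.7:52311", bytesUp: 1 << 20, rtt: 42 * time.Millisecond})
	time.Sleep(100 * time.Millisecond) // give the drain goroutine time to flush (demo only)
}
```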
Conclusions — practical expectations
In realistic deployments, a well-tuned Trojan server on modern hardware can serve multiple Gbps of aggregate throughput using multiple worker processes and proper kernel and NIC tuning. Per-core throughput typically ranges from several hundred Mbps to over 1 Gbps when AES-NI and TLS 1.3 are available. Latency overhead for established connections is modest, in the low milliseconds, but new-connection handshakes remain RTT-sensitive. Scalability to tens of thousands of concurrent connections is achievable with careful OS and application tuning, adequate RAM, and horizontal scaling.
Finally, always validate against representative traffic: synthetic iperf runs are useful for ceilings, but real-world mixes of short and long flows, TLS session reuse levels, and packet loss conditions will determine production performance.
For additional operational guides, configuration examples and follow-up benchmarks, visit Dedicated-IP-VPN at https://dedicated-ip-vpn.com/.