Secure Remote CI/CD Runners with IKEv2 VPN

Overview

Continuous Integration and Continuous Deployment (CI/CD) systems increasingly rely on distributed, ephemeral runners that execute build, test, and deployment jobs. When these runners operate outside a trusted data center (on remote servers, cloud instances, or developer machines), network security becomes a first-class concern. One robust approach is to place runners on a private network overlay built with an IKEv2-based IPsec VPN. This secures control-plane communications between runners and the CI controller, reduces the attack surface, and enables predictable network policies for sensitive pipelines.

Why choose IKEv2 for protecting remote runners

IKEv2 (Internet Key Exchange version 2) provides a modern, resilient IPsec control plane. It offers several advantages particularly relevant to CI/CD environments:

Strong Cryptography and PFS: IKEv2 supports AES-GCM, AES-CBC with HMAC, and Diffie-Hellman groups that provide Perfect Forward Secrecy.
Mobility and Multihoming (MOBIKE): Endpoint IP changes (common with cloud spot instances or mobile developers) are handled without tearing down sessions.
NAT Traversal and Fragmentation: NAT-T and packet fragmentation mechanisms allow IPsec to function across diverse networks.
Certificate-based Authentication: IKEv2 integrates cleanly with PKI for scalable, auditable identity management—critical for enterprise CI/CD governance.

Architectural patterns for secure runner connectivity

Below are practical architectures that teams use to secure remote executor infrastructure. Each pattern has trade-offs in complexity, latency, and manageability.

1. Centralized VPN gateway with private runner pool

All remote runners establish IKEv2 tunnels to a centralized VPN gateway located in a hardened VPC. The gateway enforces firewall rules and routes traffic into a secured CI/CD network segment. This model simplifies auditing and policy enforcement, and centralizes certificate distribution.

2. Mesh of gateways with route injection

For distributed workloads across regions, multiple gateway nodes (each running an IKEv2 server) form a peered backbone using static routes or a dynamic routing protocol. Runners attach to their nearest gateway, reducing latency while preserving a unified security posture.

3. On-demand ephemeral tunnels

Ephemeral runners bring up IKEv2 tunnels only for job duration. This minimizes exposure of worker nodes on the public Internet. Use automation (cloud-init, container entrypoint scripts) to provision certificates, initiate IKEv2 SAs, and tear down on job completion.

Security and operational best practices

Implementing a secure IKEv2 layer requires careful configuration choices and automation. Consider these guidelines:

Prefer certificate-based authentication over pre-shared keys (PSKs). Certificates scale better and allow revocation without reconfiguring runners.
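Harden cryptographic proposals: Use AES-GCM (an AEAD cipher) or AES-256-CBC with a SHA-2 HMAC for integrity, and choose Diffie-Hellman group 19 (ECP-256), 20 (ECP-384), or 31 (Curve25519) to ensure strong PFS. Group numbers are identifiers, not strength rankings, so a higher number is not automatically stronger.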
Tune SA lifetimes: Shorter IKE/Child SA lifetimes reduce exposure but increase rekey overhead. Typical values: IKE SA 24h, Child SA 1–8h, adjusted to workload patterns.
Enable Dead Peer Detection (DPD): Quickly remove stale tunnels from the gateway and recover resources for ephemeral runners.
Implement split-tunnel policy cautiously: Full-tunnel (route all traffic through VPN) simplifies egress filtering and auditing; split-tunnel reduces bandwidth and latency but requires strict host security controls.
Automate certificate lifecycle: Use an internal PKI (e.g., CFSSL, Vault PKI) and automated enrollment (SCEP/EST) so runners receive short-lived certificates on provisioning.
Network segmentation: Place runners in an isolated subnet with minimal inbound rules—only the CI controller and necessary artifact stores should be reachable.
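
The cipher guidance above can be checked mechanically before a configuration ships. The following Python sketch splits a strongSwan-style IKE proposal string (for example aes256gcm16-prfsha256-ecp256) into its keywords and flags anything on a deny list. The function name audit_proposal and the contents of both sets are illustrative assumptions, not a complete policy.

```python
# Hypothetical deny list: algorithm keywords this sketch treats as too weak.
WEAK_TOKENS = {"3des", "des", "md5", "sha1", "modp768", "modp1024", "modp1536"}

# Key-exchange keywords for the groups recommended above (19, 20, 31).
STRONG_GROUPS = {"ecp256", "ecp384", "curve25519"}


def audit_proposal(proposal: str) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    # A trailing "!" marks a strict proposal in ipsec.conf; it is not an algorithm.
    tokens = proposal.lower().rstrip("!").split("-")
    findings = [f"weak algorithm: {t}" for t in tokens if t in WEAK_TOKENS]
    if not any(t in STRONG_GROUPS for t in tokens):
        findings.append("no DH group from the recommended set (19/20/31)")
    return findings


if __name__ == "__main__":
    for p in ("aes256gcm16-prfsha256-ecp256", "aes128-sha1-modp1024"):
        print(p, "->", audit_proposal(p) or "ok")
```

A check like this fits naturally into the pipeline that builds runner images, so that a weak proposal fails the build rather than reaching production.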

Example deployment using strongSwan

strongSwan is a common open-source IKEv2 implementation used in production. The following describes conceptual configuration steps (not verbatim configs) to run a secure server and runner client:

Server: generate a CA certificate and a host certificate for the gateway; configure ipsec.conf with a connection profile that enforces AES-GCM, SHA-256 (used as the PRF, since AES-GCM provides its own integrity protection), and DH group 19; enable NAT-T and MOBIKE; configure leftsubnet to cover the protected network behind the gateway that runners are allowed to reach.
Client (runner): obtain a short-lived certificate from your PKI; configure the client connection to authenticate with that certificate; set right to the server's public IP; set leftsourceip=%config so the runner requests a virtual IP from the pool the gateway defines with rightsourceip.
Firewall/iptables: accept UDP/500 and UDP/4500 from runner pools, and allow ESP (IP protocol 50) where NAT-T is not used; implement MASQUERADE or another NAT rule as needed for egress.
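
As a rough illustration of the server step, the Python sketch below renders a gateway conn section in the legacy ipsec.conf syntax. The option keywords are standard ipsec.conf settings, but the connection name, subnets, certificate file name, and lifetime values are placeholders chosen for this example, and the helper render_gateway_conn is hypothetical. Treat the output as a starting point to review, not a vetted production config.

```python
def render_gateway_conn(name: str, protected_subnet: str, runner_pool: str,
                        cert_file: str) -> str:
    """Render a conn section for the gateway side (left = gateway, right = runners)."""
    options = {
        "keyexchange": "ikev2",
        "ike": "aes256gcm16-prfsha256-ecp256!",  # AES-GCM, SHA-256 PRF, DH group 19
        "esp": "aes256gcm16-ecp256!",
        "left": "%any",
        "leftauth": "pubkey",
        "leftcert": cert_file,
        "leftsubnet": protected_subnet,          # networks runners may reach
        "right": "%any",
        "rightauth": "pubkey",
        "rightsourceip": runner_pool,            # virtual IP pool handed to runners
        "mobike": "yes",
        "dpdaction": "clear",                    # drop state for dead peers
        "dpddelay": "30s",
        "ikelifetime": "24h",
        "lifetime": "1h",
        "auto": "add",
    }
    lines = [f"conn {name}"]
    lines += [f"    {key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    print(render_gateway_conn("ci-runners", "10.20.0.0/16", "10.10.0.0/24",
                              "gatewayCert.pem"))
```

Generating the section from code keeps the cipher and lifetime choices in one reviewed place instead of hand-edited files on each gateway.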

Operationally, embed a small bootstrap script into runner images that: retrieves a certificate, starts strongSwan, verifies the tunnel is up, and only then starts the CI job. At job termination, the script cleanly shuts down the VPN and deletes any ephemeral credentials.
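
The bootstrap flow described above might look like the following Python sketch. It assumes the legacy ipsec command-line wrapper is installed, that the daemon is already running, and that `ipsec status <conn>` prints the word ESTABLISHED for a live IKE SA. Certificate retrieval and credential deletion are left out because they depend on your PKI. The names run_cmd, tunnel_is_up, and run_job_over_vpn are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_cmd(args: Sequence[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


def tunnel_is_up(conn: str, run: Runner = run_cmd) -> bool:
    # Assumption: the status output contains "ESTABLISHED" for a live IKE SA.
    return "ESTABLISHED" in run(["ipsec", "status", conn])


def run_job_over_vpn(conn: str, job: Callable[[], int], run: Runner = run_cmd,
                     attempts: int = 10, delay: float = 1.0) -> int:
    """Bring the tunnel up, run the job only once the tunnel is verified, then tear down."""
    run(["ipsec", "up", conn])
    try:
        for _ in range(attempts):
            if tunnel_is_up(conn, run):
                return job()  # the CI job starts only after verification
            time.sleep(delay)
        raise RuntimeError(f"tunnel {conn!r} did not come up")
    finally:
        run(["ipsec", "down", conn])  # always tear down, even if the job fails
```

Passing the command runner as a parameter is a deliberate choice: it lets the flow be exercised with a fake runner in tests, without a VPN daemon present.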

Integrating with GitLab/GitHub Actions self-hosted runners

When integrating with common CI platforms, follow these recommendations:

Runner registration: Register runners in a way that ties them to specific projects/environments. Use tagging and runner groups to limit what jobs can execute on private runners.
Secure metadata and tokens: Store registration tokens and PKI enrollment credentials in a secret manager (e.g., HashiCorp Vault) and inject them into the runner only during provisioning.
Health checks and metrics: Monitor tunnel state, retransmits, and handshake failures. Export strongSwan counters to Prometheus for alerting on high rekey rates or DPD flaps.
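
One lightweight way to get tunnel state into Prometheus is to convert status output into the text exposition format and let a collector pick it up (the node_exporter textfile collector is one option). The Python sketch below covers tunnel counts only, not retransmits or handshake failures. The metric names are invented for this example, the sample text is illustrative rather than captured output, and the parsing assumes that established IKE SAs and installed child SAs appear on lines containing ESTABLISHED and INSTALLED respectively.

```python
def status_to_metrics(status_text: str, gateway: str) -> str:
    """Convert status output into Prometheus text exposition format."""
    lines = status_text.splitlines()
    ike_up = sum("ESTABLISHED" in line for line in lines)
    child_up = sum("INSTALLED" in line for line in lines)
    label = f'{{gateway="{gateway}"}}'
    return (
        "# TYPE vpn_ike_sas_established gauge\n"
        f"vpn_ike_sas_established{label} {ike_up}\n"
        "# TYPE vpn_child_sas_installed gauge\n"
        f"vpn_child_sas_installed{label} {child_up}\n"
    )


# Illustrative sample only; real output varies by version and configuration.
SAMPLE = """Security Associations (1 up, 0 connecting):
      runner[1]: ESTABLISHED 5 minutes ago, 203.0.113.10[gw]...198.51.100.7[runner-1]
      runner{1}:  INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: c1a2b3c4_i c5d6e7f8_o
"""

if __name__ == "__main__":
    print(status_to_metrics(SAMPLE, "gw1"), end="")
```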

Performance tuning and MTU considerations

IPsec adds overhead and can cause packet fragmentation. To minimize issues:

Adjust MTU on the VPN client interface (typically lower than 1500, often 1400–1420) to account for ESP encapsulation and NAT-T.
Enable MSS clamping on gateway egress if runners initiate TCP flows that traverse the tunnel to remote services.
Use AES-GCM AEAD ciphers, which often perform faster with hardware acceleration (AES-NI) and add fewer bytes per packet than separate encryption and authentication schemes.
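
The MTU and MSS figures can be estimated from the ESP packet layout. The Python sketch below assumes IPv4, ESP tunnel mode, an 8-byte ESP header (SPI plus sequence number), a 2-byte trailer (pad length and next header), and no IP or TCP options. The defaults model AES-GCM (8-byte IV, 16-byte ICV, 4-byte alignment); the second call models AES-CBC with HMAC-SHA-256-128 (16-byte IV, 16-byte ICV, 16-byte blocks). The function names are hypothetical.

```python
def esp_inner_mtu(path_mtu: int = 1500, nat_t: bool = True, iv_len: int = 8,
                  icv_len: int = 16, block: int = 4, outer_ip: int = 20) -> int:
    """Largest inner IP packet that fits in one ESP tunnel-mode packet."""
    udp = 8 if nat_t else 0   # NAT-T wraps ESP in UDP/4500
    esp_header = 8            # SPI (4 bytes) + sequence number (4 bytes)
    room = path_mtu - outer_ip - udp - esp_header - iv_len - icv_len
    room -= room % block      # encrypted part must align to the block size
    return room - 2           # minus pad-length and next-header bytes


def tcp_mss(inner_mtu: int) -> int:
    """MSS for IPv4 TCP without options: inner MTU minus 20 (IP) and 20 (TCP)."""
    return inner_mtu - 40


if __name__ == "__main__":
    gcm = esp_inner_mtu()                                 # AES-GCM over NAT-T
    cbc = esp_inner_mtu(iv_len=16, icv_len=16, block=16)  # AES-CBC + HMAC
    print(gcm, tcp_mss(gcm), cbc)  # prints: 1438 1398 1422
```

Under these assumptions a 1500-byte path leaves 1438 bytes for the inner packet with AES-GCM and 1422 with AES-CBC, which illustrates the smaller per-packet cost of AEAD. Setting the interface MTU somewhat below the computed value leaves headroom for paths whose underlying MTU is smaller than 1500.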

Monitoring, auditing and incident response

Operational visibility is essential for security and reliability:

Logs: Collect IKE logs (auth failures, certificate errors, rekey events). Send to a centralized log store and retain logs per compliance requirements.
Telemetry: Track session counts, bytes transferred, and tunnel uptime. Alert on anomalous spikes in outbound traffic from runners (possible data exfiltration).
Revocation process: Have a documented and automated certificate revocation workflow. Integrate CRL/OCSP checks into your VPN server configuration where possible.
Incident playbook: Define steps to isolate runner instances, revoke certificates, rotate gateway keys, and re-enroll new runners.
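
Alerting on anomalous outbound traffic can start with a simple statistical threshold. The Python sketch below flags a runner whose current egress byte count exceeds its historical mean by more than k standard deviations. The function name, the default k of 3, and the sample numbers are assumptions for illustration; a production system would likely use its monitoring platform's alerting rules instead.

```python
from statistics import mean, stdev


def is_egress_anomalous(history: list[float], current: float, k: float = 3.0) -> bool:
    """Flag `current` if it exceeds the historical mean by more than k standard deviations."""
    if len(history) < 2:
        return False  # not enough samples to estimate the spread
    mu, sigma = mean(history), stdev(history)
    return current > mu + k * sigma


if __name__ == "__main__":
    # Hypothetical per-job egress totals in megabytes for one runner.
    past_jobs = [100, 110, 90, 105, 95]
    print(is_egress_anomalous(past_jobs, 120))  # within the normal range
    print(is_egress_anomalous(past_jobs, 500))  # large spike
```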

Scaling and high availability

Large CI fleets require HA and load balancing:

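Multiple gateway frontends: Run several IKEv2 gateways behind DNS-based distribution or a UDP-capable layer-4 load balancer. IKE and NAT-T run over UDP ports 500 and 4500, so a TCP/HTTP-only proxy is not sufficient. Because IKEv2 is stateful, ensure session stickiness so a runner's packets always reach the gateway holding its SAs, or have the client reconnect to a different gateway on failure.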
Centralized route distribution: For multi-gateway architectures, use route redistribution via BGP or orchestrated route programming (e.g., via cloud route tables) to steer runner address spaces to the right region.
Auto-scaling: Provision CI runners with autoscaling hooks that include tunnel establishment as part of the bootstrapping lifecycle.
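
Client-side reconnection to a different gateway can be as simple as walking an ordered list. The Python sketch below tries each gateway in turn and returns the first one that accepts. The connect callback stands in for whatever actually initiates the tunnel, and the host names are hypothetical.

```python
from typing import Callable, Iterable, Optional


def connect_with_failover(gateways: Iterable[str],
                          connect: Callable[[str], bool]) -> Optional[str]:
    """Try each gateway in order; return the first that accepts, or None."""
    for gw in gateways:
        try:
            if connect(gw):
                return gw
        except OSError:
            continue  # treat network errors as "gateway unavailable"
    return None


if __name__ == "__main__":
    # Hypothetical gateways, nearest region first; only the second one "answers".
    order = ["gw-eu.example.internal", "gw-us.example.internal"]
    print(connect_with_failover(order, lambda gw: gw.startswith("gw-us")))
```

Ordering the list by region keeps runners on their nearest gateway in the normal case while still giving them somewhere to go when it is down.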

Common pitfalls and how to avoid them

Be aware of these frequent mistakes:

Using long-lived certificates or PSKs without automated rotation—this increases blast radius.
Split tunneling without host hardening—attackers on the runner host can bypass network controls.
Neglecting MTU and MSS settings—produces mysterious network timeouts during artifact transfer.
Lack of telemetry—without logs you cannot investigate failed deployments or exfiltration attempts.

Conclusion

Securing remote CI/CD runners with an IKEv2-based VPN balances strong cryptography, operational resilience, and scalability. By combining certificate-based authentication, careful cipher selection, automated credential lifecycle management, and robust monitoring, organizations can protect sensitive build pipelines while retaining the flexibility of distributed runners. Operational practices—such as ephemeral tunnels, network segmentation, and automated enrollment—reduce exposure and simplify incident response.

For teams running production CI systems, the next steps are to prototype with a small runner pool (using strongSwan or a managed IKEv2 service), measure performance, and iterate on automation for certificate issuance and teardown. Properly implemented, this approach provides an enterprise-grade network boundary for CI workloads without sacrificing developer velocity.

Published by Dedicated-IP-VPN