
A resilient, self-hosted mesh VPN with Nebula

Until now I ran a classic, centralized VPN: all my devices connected to a single server that relayed their traffic. It works, but it creates an annoying dependency. The day that server — or simply its internet link — goes down, nothing communicates anymore, including two devices that are otherwise perfectly reachable from one another. I find that a shame: a third party’s outage shouldn’t break a link that, technically, doesn’t need it.

That’s what led me to the peer-to-peer (mesh) model. Instead of a central relay point, each device establishes an encrypted tunnel directly with the others. The result is resilience: as long as two peers can reach each other on the network, they communicate, regardless of the state of the rest of the infrastructure.

This post retraces my approach: the few notions worth laying out (tunneling, NAT), a survey of the existing solutions, and the concrete setup of the one I picked.

The basics

Split tunnel vs full tunnel

A first question came to me while thinking about this p2p setup: where does the traffic that isn’t an exchange between my devices go — say, a request to some website? That’s the whole difference between two VPN operating modes.

  • Full tunnel: all of the device’s traffic goes into the VPN, including ordinary web browsing. That was the mode of my centralized setup: the server saw all my traffic and acted as the internet gateway. Useful for encrypting a connection on an untrusted network or masking your IP, but it routes everything through a third party — latency and throughput depend on it, and if that party goes down, you lose internet access altogether.
  • Split tunnel: only the traffic destined for the VPN network takes the tunnels; the rest goes out directly through the device’s local connection. A request to a website leaves via my normal internet access, and only the exchanges between my devices go through the mesh.

For my use case, the split tunnel is what makes sense. I want to link my devices to each other, not route all my browsing through them — it would be absurd to send a web request up through the mesh only to have it come back out somewhere else. And it fits the resilience goal: each device keeps its own internet access, the mesh only serves what it’s meant for.

NAT & NAT traversal

P2P raises a concrete problem: for two devices to communicate directly, each has to know how to reach the other. But most of my devices (phone on 4G, laptop on some random WiFi, RPi behind a client’s router) have neither a dedicated public IP nor an open port. They’re all hidden behind a NAT.

The role of NAT

NAT (Network Address Translation) lets several machines on a private network share a single public IP. When my laptop (192.168.1.20) contacts a server, the router rewrites the source address to its public IP and keeps the mapping in a table:

192.168.1.20:51820  ->  public_IP:42315  ->  server:443

The reply comes back to public_IP:42315, and the router knows to forward it to the right internal device. This works perfectly for outbound traffic… but nobody can initiate a connection toward my laptop: from the outside, there’s no entry in the table until it has spoken first.
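
To make the table mechanics concrete, here is a minimal Python sketch of that behavior. It is a toy model, not how any real router is implemented: the `Nat` class, the 203.0.113.7 documentation address and the starting port are assumptions chosen to mirror the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Nat:
    """Toy NAT: one public IP, a table filled only by outbound traffic."""
    public_ip: str
    next_port: int = 42315  # arbitrary starting port, matching the example
    table: dict = field(default_factory=dict)  # public port -> (private ip, port)

    def outbound(self, src_ip, src_port):
        """Rewrite the source of an outgoing packet and remember the mapping."""
        for public_port, private in self.table.items():
            if private == (src_ip, src_port):
                return self.public_ip, public_port
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (src_ip, src_port)
        return self.public_ip, public_port

    def inbound(self, public_port):
        """Return the internal destination, or None if nobody spoke first."""
        return self.table.get(public_port)


nat = Nat(public_ip="203.0.113.7")
print(nat.inbound(42315))                   # None: no entry yet
print(nat.outbound("192.168.1.20", 51820))  # ('203.0.113.7', 42315)
print(nat.inbound(42315))                   # ('192.168.1.20', 51820)
```

The first `inbound` call returns nothing because the laptop has not spoken yet; once it has, the same public port leads back to it.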

That’s the whole problem with the mesh: my two peers are each behind a NAT, and neither can “call” the other directly.

Hole punching

The trick is called hole punching, and it relies on a publicly reachable trusted third party (a coordination server, called a “lighthouse” in Nebula, a “control server” or “signal server” elsewhere).

The principle:

  1. Each peer opens an outbound connection to the coordinator. In doing so, it creates an entry in its router’s NAT table (the famous “hole”).
  2. The coordinator therefore knows each peer’s public IP + port, as seen from the outside.
  3. When A wants to talk to B, it asks the coordinator for B’s coordinates, and the coordinator also tells B where to find A.
  4. A and B then start sending each other packets simultaneously. A’s first packet opens the hole in A’s NAT; B’s first packet opens the hole in B’s NAT. Since each has already “spoken first,” the NATs let the other’s inbound packets through.

From there, the coordinator is out of the loop: traffic flows directly from A to B, peer-to-peer. This is exactly the resilience I’m after: even if the coordinator goes down afterward, the tunnels already established keep working.
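
The four steps can be replayed in a few lines of Python. This is only a simulation under stated assumptions: `ConeNat` is a hypothetical toy NAT that keeps one public endpoint for all destinations and only lets a packet in from an endpoint it has already sent to. The addresses come from documentation ranges and no real network traffic is involved.

```python
class ConeNat:
    """Toy endpoint-independent NAT with strict inbound filtering."""

    def __init__(self, public_ip, public_port):
        self.endpoint = (public_ip, public_port)  # same mapping for every destination
        self.allowed = set()                      # remote endpoints this host has sent to

    def send(self, dest):
        self.allowed.add(dest)   # sending opens the "hole" toward dest
        return self.endpoint     # source address as seen from the outside

    def accepts(self, source):
        return source in self.allowed


COORDINATOR = ("198.51.100.1", 4242)  # hypothetical public coordinator
registry = {}                          # peer name -> endpoint observed by the coordinator

nat_a = ConeNat("203.0.113.7", 42315)
nat_b = ConeNat("192.0.2.9", 50100)

# Steps 1-2: each peer contacts the coordinator, which records what it observes.
registry["A"] = nat_a.send(COORDINATOR)
registry["B"] = nat_b.send(COORDINATOR)

# Before punching, B's NAT would drop a packet coming from A.
blocked_before = not nat_b.accepts(registry["A"])

# Steps 3-4: each peer learns the other's endpoint and sends toward it.
nat_a.send(registry["B"])
nat_b.send(registry["A"])

direct_ok = nat_a.accepts(registry["B"]) and nat_b.accepts(registry["A"])
print(blocked_before, direct_ok)  # True True
```

After the last two `send` calls the coordinator plays no further part in the exchange, which is the property the paragraph above relies on.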

The cases that break

Hole punching works with the majority of home NATs, but not all. The deciding factor is how the router assigns the public port:

  • Cone NAT (endpoint-independent): the router reuses the same public port regardless of the destination. The port seen by the coordinator is therefore usable by the peer. Hole punching works.
  • Symmetric NAT: the router picks a different public port for each destination. The port the coordinator observed no longer matches the one used to reach the peer: the hole is in the wrong place. Hole punching fails.

You mostly run into symmetric NAT away from home: mobile 4G/5G networks (almost always behind CGNAT, which is my phone’s case), some ISP offerings short on IPv4, and corporate / hotel / public WiFi networks with a strict firewall. Consumer routers in France generally get a public IPv4 address and apply cone-type NAT: at home, hole punching works without a hitch. The tricky points in my fleet are therefore the phone on 4G and the RPi, sitting on a network I don’t control and whose NAT behavior I don’t know.

When both peers are behind a symmetric NAT (as is often the case behind CGNAT), hole punching fails and no direct connection can be established. A relay is then needed.
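
The difference between the two port-allocation policies can be sketched in Python. Everything here is an assumption made for illustration: the `Nat` class, the `punch` helper, the addresses, and above all the filtering model, in which a NAT accepts an inbound packet only on a port that has already sent to that exact remote endpoint. Real devices vary, and more permissive filtering changes the outcome of mixed cases.

```python
import itertools


class Nat:
    """Toy NAT: 'cone' reuses one public port, 'symmetric' picks one per destination."""

    def __init__(self, kind, public_ip, first_port):
        self.kind = kind
        self.public_ip = public_ip
        self.ports = itertools.count(first_port)
        self.mappings = {}    # destination (None for cone) -> public port
        self.allowed = set()  # (public port, remote endpoint) opened by outbound traffic

    def send(self, dest):
        key = dest if self.kind == "symmetric" else None
        if key not in self.mappings:
            self.mappings[key] = next(self.ports)
        port = self.mappings[key]
        self.allowed.add((port, dest))
        return (self.public_ip, port)

    def accepts(self, source, to_port):
        return (to_port, source) in self.allowed


def punch(kind_a, kind_b):
    """Return True if both peers can reach each other after a punch attempt."""
    coordinator = ("198.51.100.1", 4242)
    a = Nat(kind_a, "203.0.113.7", 40000)
    b = Nat(kind_b, "192.0.2.9", 50000)
    seen_a = a.send(coordinator)  # endpoint the coordinator reports for A
    seen_b = b.send(coordinator)
    real_a = a.send(seen_b)       # endpoint A actually uses toward B
    real_b = b.send(seen_a)
    a_to_b = b.accepts(real_a, seen_b[1])
    b_to_a = a.accepts(real_b, seen_a[1])
    return a_to_b and b_to_a


print(punch("cone", "cone"))            # True
print(punch("symmetric", "symmetric"))  # False
```

In the symmetric case each peer sends from a port the coordinator never saw, so the packets arrive at a mapping that does not expect them.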

Coordinator & relay

When hole punching fails, traffic transits through a public node that relays the packets between the two peers. It’s a safety net, not the nominal mode: everything goes through a third party, at the cost of latency and throughput. The ideal is therefore that only the badly-placed link (my phone on 4G) uses it, and only when its peer is also unreachable directly.

That’s what sets a real mesh solution apart from “bare” WireGuard: it provides both the coordinator (to attempt p2p) and the relay (to guarantee connectivity as a last resort).

My setup

Before comparing the solutions, here’s the fleet I want to link. It mixes two very different profiles, and it’s precisely that mix that drives my choices.

Nomadic devices, which by definition have no stable entry point:

  • an Android phone, most often on 4G/5G — hence behind the carrier’s CGNAT, the worst case for a direct connection;
  • a Linux laptop, which ends up on all sorts of networks (home, office, public WiFi), with no predictable IP or port from one moment to the next.

These two change networks constantly: impossible to configure a fixed endpoint for them. They are typically the ones that will need hole punching, and the relay when it fails.

Linux servers, more stable in operation, but very unequal when it comes to the network:

  • a home server behind my router, which has a public IP and where I can open a port: it is reachable from the outside. It’s the only one in my fleet in that situation;
  • a Raspberry Pi hosted at a client’s site, on a network I don’t control at all: I can’t open a port there and I don’t even know their public IP. Although it runs permanently, it is unreachable from the outside — network-wise, it behaves like a nomadic device behind NAT.

In other words, only one of my hosts, the home server, is publicly reachable. It’s an obvious coordinator/relay, but I specifically don’t want the mesh’s whole resilience to rest on that single machine (nor on my router). So I also rely on a second public node, hosted elsewhere, to have two redundant entry points. This need for redundant coordinators is the criterion that, as we’ll see, directly drives the choice of solution.

Solutions

Several solutions are on offer. I compare them mainly on three axes that follow from the previous section: presence of a coordinator and a relay, ability to make that coordinator redundant (my real resilience goal), and self-hosting effort.

WireGuard mesh

“Bare” WireGuard: you manually set up a tunnel between each pair of peers and configure the keys, IPs and endpoints yourself. The crypto and performance are excellent (it’s the building block several of the other solutions are built on), but there is neither coordinator nor relay: no automatic hole punching, and each peer has to know in advance an endpoint reachable on the other. On a mesh of N mobile devices behind NAT, it quickly becomes unmanageable (N(N-1)/2 tunnels to maintain, so growth on the order of N², plus endpoints that change). It’s the reference for protocol simplicity, but not a turnkey mesh solution.
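
As a quick order-of-magnitude check, the number of tunnels in a full mesh is the number of distinct peer pairs. The short Python sketch below (the function name `tunnel_count` is mine) shows how fast it grows.

```python
def tunnel_count(n: int) -> int:
    """Number of distinct peer pairs in a full mesh of n hosts."""
    return n * (n - 1) // 2


for n in (5, 10, 20):
    print(n, tunnel_count(n))
# 5 10
# 10 45
# 20 190
```

With my five hosts that is already ten tunnels, each with two ends to keep in sync.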

Link: wireguard.com

NetBird

WireGuard driven by a full control plane: a management server, a signal server for hole punching, and a TURN relay (coturn) as a fallback. Self-hostable, with ACLs, SSO/OIDC and a nice UI. It’s the most “product-like” solution of the lot, at the cost of a heavier stack to operate (several components + a database).

Links: netbird.io · repo (self-hosted)

Headscale

Open source re-implementation of the Tailscale control server: you get the Tailscale clients (very polished, on every platform) plugged into your own control server. Relaying goes through DERP servers (self-hostable), which only forward traffic that is already end-to-end encrypted. Excellent ease of use, but the control server is a central point whose redundancy remains laborious, which is exactly what I want to avoid.

Links: docs · repo

Nebula

Originally developed at Slack. In-house protocol based on the Noise Framework, with a certificate-based PKI (each host carries its certificate signed by a CA). The lighthouses play the coordinator role, and any public host can be designated as a relay. Key point for me: you can declare several lighthouses, which gives native coordinator redundancy, with no database or separate control plane: just a single binary and one config file per host.

Links: docs · repo

Comparison

| Criterion | WireGuard mesh | NetBird | Headscale | Nebula |
|---|---|---|---|---|
| Base | WireGuard | WireGuard | WireGuard (Tailscale clients) | In-house protocol (Noise) |
| Coordinator | none | ✅ management/signal | ✅ control server | ✅ lighthouse |
| Relay | none | ✅ TURN (coturn) | ✅ DERP | ✅ relay nodes |
| Redundant coordinator | n/a | ⚠️ limited | ⚠️ laborious | ✅ native multi-lighthouse |
| Self-hosting | trivial but all manual | heavy (several services + DB) | medium (control + DERP) | light (1 binary + config) |
| Clients | OS-native | good | excellent (Tailscale) | decent (incl. Android) |

Aside: Nebula’s “in-house protocol”. Nebula doesn’t use WireGuard but its own protocol, built on the Noise Framework (the same cryptographic foundation as WireGuard): Noise IX handshake, Curve25519 key exchange, AES-256-GCM or ChaCha20-Poly1305 encryption. What sets it apart is the layer above: a certificate-based PKI (identity, IP and groups signed by the CA) and a distributed firewall where each host locally applies rules expressed in terms of groups.

Why Nebula

My primary need isn’t onboarding comfort but coordinator resilience: that my devices keep reaching each other even when a router goes down. Nebula is the only option that natively offers several redundant lighthouses, while staying peer-to-peer by default (relaying is only a fallback) and lightweight to operate (one binary, no database). So it’s the solution I’m going with.

Architecture

Now that Nebula is chosen, here’s how I organize the network. A few Nebula-specific vocabulary notions first:

  • CA: a certificate authority I generate myself. It signs each host’s certificate. A host is only accepted into the network if its certificate is signed by my CA.
  • host certificate: it binds the machine’s key to its IP in the Nebula network and to groups (labels like serveurs or mobiles).
  • lighthouse: the coordinator. It helps peers discover each other and do hole punching. It must be publicly reachable.
  • relay: a public host that relays traffic when a direct connection fails.

Addressing plan

I reserve a subnet dedicated to the mesh, distinct from all my local networks to avoid route collisions. Here I use 10.42.0.0/24 (adjust as needed).

| Host | Role | Nebula IP | Groups |
|---|---|---|---|
| Home server | lighthouse + relay | 10.42.0.1 | lighthouse, serveurs |
| Anchor node (hosted elsewhere) | lighthouse + relay | 10.42.0.2 | lighthouse, serveurs |
| RPi (at the client’s) | client | 10.42.0.20 | serveurs |
| Laptop | client | 10.42.0.10 | mobiles |
| Android phone | client | 10.42.0.11 | mobiles |

The RPi runs as a server but, network-wise, it’s treated as a client: it isn’t reachable from the outside, it relies on the lighthouses like the others.
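
A plan like this is easy to sanity-check before signing any certificate. The Python sketch below uses the standard `ipaddress` module. The `LOCAL_NETWORKS` list is a made-up example of LANs the devices might sit on and should be replaced with real ones; `check_plan` is a helper name of my own.

```python
import ipaddress

MESH = ipaddress.ip_network("10.42.0.0/24")

# Hypothetical local networks; replace with the LANs your devices really use.
LOCAL_NETWORKS = [ipaddress.ip_network(n) for n in ("192.168.1.0/24", "10.0.0.0/24")]

HOSTS = {
    "home-server": "10.42.0.1",
    "ancrage": "10.42.0.2",
    "rpi": "10.42.0.20",
    "laptop": "10.42.0.10",
    "android": "10.42.0.11",
}


def check_plan(mesh, local_networks, hosts):
    """Return a list of problems: overlaps, out-of-range or duplicate IPs."""
    problems = []
    for net in local_networks:
        if mesh.overlaps(net):
            problems.append(f"mesh {mesh} overlaps local network {net}")
    seen = {}
    for name, ip in hosts.items():
        addr = ipaddress.ip_address(ip)
        if addr not in mesh:
            problems.append(f"{name}: {ip} is outside {mesh}")
        if addr in seen:
            problems.append(f"{name}: {ip} already used by {seen[addr]}")
        seen.setdefault(addr, name)
    return problems


print(check_plan(MESH, LOCAL_NETWORKS, HOSTS))  # []
```

An empty list means the mesh subnet collides with none of the listed LANs and every host has a unique address inside it.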

Redundant coordinators and relays

This is the heart of the architecture and the reason for choosing Nebula. My two public hosts — the home server and the anchor node — are declared as both lighthouses and relays. Each client device knows the public address of both (static_host_map) and registers with both.

Consequences:

  • If one of the two goes down (outage, router off, maintenance), the other keeps handling peer discovery and relaying. No single point of failure on the coordination side.
  • Once two peers are put in touch, their tunnel is direct: the lighthouses are no longer in the data path. They only become useful again to establish new links, or as a relay when direct is impossible (typically the phone on 4G facing a badly-placed peer).

Firewall policy

The Nebula firewall is distributed: each host carries its own rules, which rely on the groups proven by the certificates. I start from a restrictive posture (block everything inbound, then open case by case):

  • the servers expose to the mobiles only the intended services (e.g. SSH, and the specific application services);
  • the mobiles don’t need to be reachable from one another: they accept almost nothing inbound;
  • Nebula’s own coordination traffic is handled by the protocol, not by these application-level rules.
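
The "block everything, then open case by case" posture can be illustrated with a small Python model. This is a sketch of the idea only, not Nebula's actual rule engine: the `Rule` class and the `allowed` function are hypothetical, and I assume that when a rule lists groups, the peer must carry all of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    port: object                     # an int, or "any"
    proto: str                       # "tcp", "udp", "icmp" or "any"
    groups: frozenset = frozenset()  # empty means any peer


def allowed(rules, port, proto, peer_groups):
    """Default deny: a packet passes only if at least one rule matches."""
    for rule in rules:
        if rule.port not in ("any", port):
            continue
        if rule.proto not in ("any", proto):
            continue
        if not rule.groups <= set(peer_groups):  # every listed group is required
            continue
        return True
    return False


SERVER_INBOUND = [
    Rule(port="any", proto="icmp"),
    Rule(port=22, proto="tcp", groups=frozenset({"mobiles"})),
]
MOBILE_INBOUND = [Rule(port="any", proto="icmp")]

print(allowed(SERVER_INBOUND, 22, "tcp", {"mobiles"}))   # True
print(allowed(SERVER_INBOUND, 22, "tcp", {"serveurs"}))  # False
print(allowed(MOBILE_INBOUND, 22, "tcp", {"mobiles"}))   # False
```

The peer's groups come from its signed certificate, so a host cannot claim membership of `mobiles` on its own.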

The detailed rules and configuration files come in the next section, the step-by-step setup.

Setup

1. Install Nebula

Nebula boils down to two tools: the nebula daemon (brings up the tunnel, on every host) and nebula-cert (manages the PKI, only where I generate the certificates). Most distributions now package it:

# Debian / Ubuntu
sudo apt install nebula

# Fedora
sudo dnf install nebula

# Arch
sudo pacman -S nebula

# Alpine
sudo apk add nebula

# macOS
brew install nebula

Failing a package, you grab the binaries from the GitHub releases (github.com/slackhq/nebula/releases) for the desired architecture (Linux 64/32-bit, ARM for the RPi, etc.). On Android, everything goes through the Nebula application.

Depending on the packaging, nebula-cert is provided by the same package or separately; a which nebula-cert confirms it. The package generally also installs a systemd service nebula.service that reads /etc/nebula/config.yml.

2. Generate the CA

On my admin machine, in a dedicated folder, a single command:

nebula-cert ca -name "Mon Mesh"

This produces ca.crt (the public certificate, to distribute to all hosts) and ca.key (the CA’s private key — to keep offline and never distribute: anyone who has it can forge valid identities).

3. Sign a certificate per host

Each host receives a certificate that pins its IP in the mesh and its groups. The commands must be run in the folder containing ca.crt/ca.key:

nebula-cert sign -name "home-server" -ip "10.42.0.1/24"  -groups "lighthouse,serveurs"
nebula-cert sign -name "ancrage"     -ip "10.42.0.2/24"  -groups "lighthouse,serveurs"
nebula-cert sign -name "rpi"         -ip "10.42.0.20/24" -groups "serveurs"
nebula-cert sign -name "laptop"      -ip "10.42.0.10/24" -groups "mobiles"
nebula-cert sign -name "android"     -ip "10.42.0.11/24" -groups "mobiles"

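Each command generates <name>.crt and <name>.key. On each Linux host, I copy three files into /etc/nebula/: the shared ca.crt, plus the .crt and .key of that host only. A host’s private key is never copied anywhere but to that host. (To avoid moving it at all, the key pair can be generated on the host itself with nebula-cert keygen and only the public key signed, via the -in-pub option of nebula-cert sign.)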
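
To keep the signing commands in sync with the addressing plan, they can be generated from a single inventory. The Python sketch below only prints the commands, using the same `-name`, `-ip` and `-groups` flags as above. The `INVENTORY` layout and the `sign_command` helper are my own convention, not part of Nebula.

```python
import shlex

PREFIX_LEN = 24

# Inventory copied from the addressing plan: name -> (mesh IP, groups)
INVENTORY = {
    "home-server": ("10.42.0.1", ["lighthouse", "serveurs"]),
    "ancrage": ("10.42.0.2", ["lighthouse", "serveurs"]),
    "rpi": ("10.42.0.20", ["serveurs"]),
    "laptop": ("10.42.0.10", ["mobiles"]),
    "android": ("10.42.0.11", ["mobiles"]),
}


def sign_command(name, ip, groups):
    """Build the nebula-cert invocation for one host as a shell-safe string."""
    args = [
        "nebula-cert", "sign",
        "-name", name,
        "-ip", f"{ip}/{PREFIX_LEN}",
        "-groups", ",".join(groups),
    ]
    return shlex.join(args)


for name, (ip, groups) in INVENTORY.items():
    print(sign_command(name, ip, groups))
```

The output can be reviewed and then pasted into a shell in the folder that holds ca.crt and ca.key.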

4. Configure the lighthouses (home server + anchor node)

/etc/nebula/config.yml on the two public hosts (adjust cert/key to the host’s name):

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/host.crt
  key: /etc/nebula/host.key

static_host_map:
  "10.42.0.1": ["home.exemple.net:4242"]
  "10.42.0.2": ["ancrage.exemple.net:4242"]

lighthouse:
  am_lighthouse: true
  # a lighthouse does not list itself in "hosts"

listen:
  host: 0.0.0.0
  port: 4242

punchy:
  punch: true
  respond: true

relay:
  am_relay: true
  use_relays: false

tun:
  dev: nebula1

firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 22
      proto: tcp
      groups: [mobiles]

The static_host_map maps a public host’s mesh IP to its real endpoint (DNS name or public IP + UDP port). The two lighthouses thus know each other.

A network gotcha not to forget. UDP port 4242 must be reachable from the outside on these two hosts: port forwarding on my router for the home server, opening in the firewall / security group for the anchor node. Without it, no client will be able to contact them.

5. Configure the clients (laptop, RPi)

Same principle, but the host is neither lighthouse nor relay: it uses the two lighthouses. /etc/nebula/config.yml:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/host.crt
  key: /etc/nebula/host.key

static_host_map:
  "10.42.0.1": ["home.exemple.net:4242"]
  "10.42.0.2": ["ancrage.exemple.net:4242"]

lighthouse:
  am_lighthouse: false
  hosts:
    - "10.42.0.1"
    - "10.42.0.2"

listen:
  host: 0.0.0.0
  port: 0          # ephemeral port: these hosts are nomadic

punchy:
  punch: true
  respond: true

relay:
  relays:
    - 10.42.0.1
    - 10.42.0.2
  am_relay: false
  use_relays: true

tun:
  dev: nebula1

firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any

Two key differences from the lighthouses: listen.port: 0 (ephemeral port, since these hosts change networks and have no port to expose), and the relay block pointing at the two lighthouses as a fallback.

The firewall rule above (ICMP only) suits the laptop, in the mobiles group. The RPi, though, is in the serveurs group: if I want to administer it over SSH from my devices, I add the same inbound rule as on the lighthouses:

    - port: 22
      proto: tcp
      groups: [mobiles]
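
The two roles are easy to mix up when copying a file from one host to another, so a small consistency check helps. The Python sketch below works on a plain dict that mirrors the YAML structure (loading the real file would need a YAML parser, left out here). The function name and the exact wording of the checks are mine; they encode the two rules stated in this post: a lighthouse leaves `lighthouse.hosts` empty, and a client lists lighthouses that all have a `static_host_map` entry.

```python
def check_lighthouse_settings(cfg):
    """Return a list of inconsistencies in the lighthouse-related settings."""
    problems = []
    static_map = cfg.get("static_host_map", {})
    lighthouse = cfg.get("lighthouse", {})
    hosts = lighthouse.get("hosts", [])
    if lighthouse.get("am_lighthouse", False):
        if hosts:
            problems.append("lighthouse.hosts should be empty on a lighthouse")
        return problems
    if not hosts:
        problems.append("lighthouse.hosts is empty: the client cannot discover peers")
    for ip in hosts:
        if ip not in static_map:
            problems.append(f"{ip} is in lighthouse.hosts but missing from static_host_map")
    return problems


CLIENT = {
    "static_host_map": {
        "10.42.0.1": ["home.exemple.net:4242"],
        "10.42.0.2": ["ancrage.exemple.net:4242"],
    },
    "lighthouse": {"am_lighthouse": False, "hosts": ["10.42.0.1", "10.42.0.2"]},
}

print(check_lighthouse_settings(CLIENT))  # []
```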

6. The Android case

The phone uses the Nebula mobile application (Play Store / F-Droid). In it I create a “site” and provide the same elements as for a Linux client: the ca.crt, the android.crt certificate, its android.key key, and the configuration (lighthouses + static_host_map + relay). The app plays the role of the nebula daemon and manages the tunnel in the background.

7. Start and verify

The package provides a systemd service, which I enable on each Linux host:

sudo systemctl enable --now nebula

Without the package (or to debug in the foreground), you run the daemon by hand:

nebula -config /etc/nebula/config.yml

Useful checks:

# inspect a certificate (IP, groups, validity)
nebula-cert print -path /etc/nebula/host.crt

# once two hosts are up, test the overlay link
ping 10.42.0.1        # from a client, reach the home server

I start the two lighthouses first, then the clients. A ping that answers on the 10.42.0.x IPs confirms the tunnel is up. To validate resilience, I stop one lighthouse and check that the peers still find and reach each other with only the other one available.
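
Checking every host by hand gets tedious, so the ping test can be wrapped in a short Python script. This is a sketch under assumptions: it shells out to the system `ping` with Linux-style flags (`-c 1 -W 2`), and the helper names `ping_once` and `check_mesh` are mine. The `ping` parameter can be swapped for another probe.

```python
import subprocess

MESH_HOSTS = {
    "home-server": "10.42.0.1",
    "ancrage": "10.42.0.2",
    "rpi": "10.42.0.20",
    "laptop": "10.42.0.10",
    "android": "10.42.0.11",
}


def ping_once(ip, timeout_s=2):
    """Send one ICMP echo with the system ping (Linux flags assumed)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:  # ping binary missing or not executable
        return False
    return result.returncode == 0


def check_mesh(hosts, ping=ping_once):
    """Return {name: True/False} for each host of the mesh."""
    return {name: ping(ip) for name, ip in hosts.items()}


def report(results):
    """Format the results as one line per host."""
    return [f"{name:12} {'ok' if ok else 'unreachable'}" for name, ok in results.items()]
```

Running `print("\n".join(report(check_mesh(MESH_HOSTS))))` from a client, once with both lighthouses up and once with one of them stopped, gives a quick before/after comparison.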

Faradj Saadana
Author
Passionate about the Android platform and its multiple fields of application