
BGP Control Plane

Let me walk you through the complete Cilium BGP Control Plane setup from my recent study session. This is week 5 of gasida's Cilium study group, and I'll share every command and output so you can reproduce this setup exactly. The BGP integration is fascinating — it allows Kubernetes clusters to participate in traditional network routing protocols.

We're using the same four-VM topology, but this time the router will run FRR (Free Range Routing) for BGP:

  • k8s-ctr: Control plane at 192.168.10.100
  • k8s-w1: Worker at 192.168.10.101
  • k8s-w0: Worker at 192.168.20.100 (different network segment)
  • router: BGP router at 192.168.10.200/192.168.20.200 with FRR installed

The key difference in our Cilium installation is that we're disabling autoDirectNodeRoutes, since BGP will handle route distribution:

helm install cilium cilium/cilium --version 1.18.0 --namespace kube-system \
--set k8sServiceHost=192.168.10.100 --set k8sServicePort=6443 \
--set ipam.mode="cluster-pool" --set ipam.operator.clusterPoolIPv4PodCIDRList={"172.20.0.0/16"} --set ipv4NativeRoutingCIDR=172.20.0.0/16 \
--set routingMode=native --set autoDirectNodeRoutes=false --set bgpControlPlane.enabled=true \
--set kubeProxyReplacement=true --set bpf.masquerade=true --set installNoConntrackIptablesRules=true \
--set endpointHealthChecking.enabled=false --set healthChecking=false \
--set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort --set hubble.ui.service.nodePort=30003 \
--set prometheus.enabled=true --set operator.prometheus.enabled=true --set hubble.metrics.enableOpenMetrics=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}" \
--set operator.replicas=1 --set debug.enabled=true >/dev/null 2>&1

Note the critical settings:

  • autoDirectNodeRoutes=false: BGP will manage routes instead
  • bgpControlPlane.enabled=true: Enables Cilium's BGP speaker

Now let's configure FRR on the router. First, install and enable BGP daemon:

echo "[TASK 7] Configure FRR"
apt install frr -y >/dev/null 2>&1
sed -i "s/^bgpd=no/bgpd=yes/g" /etc/frr/daemons

NODEIP=$(ip -4 addr show eth1 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
cat << EOF >> /etc/frr/frr.conf
!
router bgp 65000
  bgp router-id $NODEIP
  bgp graceful-restart
  no bgp ebgp-requires-policy
  bgp bestpath as-path multipath-relax
  maximum-paths 4
  network 10.10.1.0/24
EOF

systemctl daemon-reexec >/dev/null 2>&1
systemctl restart frr >/dev/null 2>&1
systemctl enable frr >/dev/null 2>&1

Verify Cilium recognizes BGP is enabled:

cilium config view | grep bgp
"enable-bgp-control-plane": "true"

Check the initial routing tables on each node:

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.0.0/24 via 172.20.0.251 dev cilium_host proto kernel src 172.20.0.251
172.20.0.251 dev cilium_host proto kernel scope link
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.100
192.168.20.0/24 via 192.168.10.200 dev eth1 proto static

Check k8s-w1:

>> node : k8s-w1 
Warning: Permanently added 'k8s-w1' (ED25519) to the list of known hosts.

default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.1.0/24 via 172.20.1.115 dev cilium_host proto kernel src 172.20.1.115
172.20.1.115 dev cilium_host proto kernel scope link
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.101
192.168.20.0/24 via 192.168.10.200 dev eth1 proto static

Check k8s-w0:

>> node : k8s-w0 
Warning: Permanently added 'k8s-w0' (ED25519) to the list of known hosts.
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.2.0/24 via 172.20.2.116 dev cilium_host proto kernel src 172.20.2.116
172.20.2.116 dev cilium_host proto kernel scope link
192.168.10.0/24 via 192.168.20.200 dev eth1 proto static
192.168.20.0/24 dev eth1 proto kernel scope link src 192.168.20.100

Notice that nodes can't see each other's Pod CIDRs — this is because autoDirectNodeRoutes=false.
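To make the gap concrete, here's a tiny sketch that checks a routing-table dump for a route covering a neighbor's Pod CIDR (the table is a trimmed copy of the k8s-ctr output above; the target CIDR is k8s-w1's):

```shell
# Trimmed routing table from k8s-ctr (see output above)
routes='172.20.0.0/24 via 172.20.0.251 dev cilium_host
192.168.10.0/24 dev eth1
192.168.20.0/24 via 192.168.10.200 dev eth1'

# k8s-w1's Pod CIDR is 172.20.1.0/24; no matching line means no route
if echo "$routes" | grep -q '^172\.20\.1\.0/24'; then
  echo "route to k8s-w1 pods: present"
else
  echo "route to k8s-w1 pods: missing"   # prints this
fi
```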

Let's deploy our test application to see the problem:

cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webpod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webpod
  template:
    metadata:
      labels:
        app: webpod
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - webpod
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: webpod
        image: traefik/whoami
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webpod
  labels:
    app: webpod
spec:
  selector:
    app: webpod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
EOF

# Deploy curl pod on control plane
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
  labels:
    app: curl
spec:
  nodeName: k8s-ctr
  containers:
  - name: curl
    image: nicolaka/netshoot
    command: ["tail"]
    args: ["-f", "/dev/null"]
  terminationGracePeriodSeconds: 0
EOF

The problem is that curl from k8s-ctr only reaches pods on the same node because Pod CIDRs aren't in the routing table!

Cilium BGP Control Plane Configuration

Now let's configure BGP peering. First, check the FRR status on the router:

# SSH to router
sshpass -p 'vagrant' ssh vagrant@router

# Check FRR processes
ss -tnlp | grep -iE 'zebra|bgpd'
ps -ef |grep frr
root        4127       1  0 13:38 ?        00:00:00 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr         4140       1  0 13:38 ?        00:00:00 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
frr         4145       1  0 13:38 ?        00:00:00 /usr/lib/frr/bgpd -d -F traditional -A 127.0.0.1
frr         4152       1  0 13:38 ?        00:00:00 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

# Check FRR configuration
vtysh -c 'show running'

# Check BGP status - no neighbors yet
vtysh -c 'show ip bgp summary'
% No BGP neighbors found in VRF default

# Check advertised routes - only loop1 network
vtysh -c 'show ip bgp'
BGP table version is 1, local router ID is 192.168.10.200, vrf id 0
Default local pref 100, local AS 65000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.10.1.0/24     0.0.0.0                  0         32768 i

Configure FRR to accept the Cilium nodes as BGP neighbors. You can do this in one of two ways:

Method 1: edit the config file:

cat << EOF >> /etc/frr/frr.conf
  neighbor CILIUM peer-group
  neighbor CILIUM remote-as external
  neighbor 192.168.10.100 peer-group CILIUM
  neighbor 192.168.10.101 peer-group CILIUM
  neighbor 192.168.20.100 peer-group CILIUM 
EOF

systemctl daemon-reexec && systemctl restart frr
systemctl status frr --no-pager --full

Method 2: use vtysh interactively:

vtysh
conf
router bgp 65000
neighbor CILIUM peer-group
neighbor CILIUM remote-as external
neighbor 192.168.10.100 peer-group CILIUM
neighbor 192.168.10.101 peer-group CILIUM
neighbor 192.168.20.100 peer-group CILIUM 
end
write memory
exit

Start monitoring on the router before configuring Cilium:

# Terminal 1 (router): Monitor FRR logs
journalctl -u frr -f

# Terminal 2 (k8s-ctr): Test connectivity
kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'

Now configure Cilium BGP. Label nodes that should run BGP:

kubectl label nodes k8s-ctr k8s-w0 k8s-w1 enable-bgp=true

kubectl get node -l enable-bgp=true
NAME      STATUS   ROLES           AGE     VERSION
k8s-ctr   Ready    control-plane   3h37m   v1.33.2
k8s-w0    Ready    <none>          3h32m   v1.33.2
k8s-w1    Ready    <none>          3h35m   v1.33.2

Apply the BGP configuration CRDs:

cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
---
apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  timers:
    holdTimeSeconds: 9
    keepAliveTimeSeconds: 3
  ebgpMultihop: 2
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
---
apiVersion: cilium.io/v2
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      "enable-bgp": "true"
  bgpInstances:
  - name: "instance-65001"
    localASN: 65001
    peers:
    - name: "tor-switch"
      peerASN: 65000
      peerAddress: 192.168.10.200  # router ip address
      peerConfigRef:
        name: "cilium-peer"
EOF

Watch the router logs — you'll see BGP sessions establish! The three CRDs work together:

  • CiliumBGPAdvertisement: Defines what to advertise (PodCIDR)
  • CiliumBGPPeerConfig: BGP session parameters (timers, graceful restart)
  • CiliumBGPClusterConfig: Which nodes run BGP and peer details

Verify BGP operation:

# Check Cilium BGP status
cilium bgp peers
Node      Local AS   Peer AS   Peer Address     Session State   Uptime     Family         Received   Advertised
k8s-ctr   65001      65000     192.168.10.200   established     25s        ipv4/unicast   1          1
k8s-w1    65001      65000     192.168.10.200   established     25s        ipv4/unicast   1          1
k8s-w0    65001      65000     192.168.10.200   established     25s        ipv4/unicast   1          1

cilium bgp routes available ipv4 unicast
Node      VRouter   Prefix          NextHop   Age    Attrs
k8s-ctr   65001     172.20.0.0/24   0.0.0.0   2m5s   [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w1    65001     172.20.1.0/24   0.0.0.0   2m5s   [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w0    65001     172.20.2.0/24   0.0.0.0   2m5s   [{Origin: i} {Nexthop: 0.0.0.0}]

# Check Kubernetes CRDs
kubectl get ciliumbgpadvertisements,ciliumbgppeerconfigs,ciliumbgpclusterconfigs
kubectl get ciliumbgpnodeconfigs -o yaml | yq

On the router, verify BGP learned routes:

# Terminal 1 (router)
journalctl -u frr -f
Aug 09 14:31:40 router bgpd[4665]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 192.168.20.100 in vrf default
Aug 09 14:31:40 router bgpd[4665]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 192.168.10.101 in vrf default
Aug 09 14:31:40 router bgpd[4665]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 192.168.10.100 in vrf default

# Check kernel routing table - BGP routes installed!
ip -c route | grep bgp
172.20.0.0/24 nhid 32 via 192.168.10.100 dev eth1 proto bgp metric 20
172.20.1.0/24 nhid 30 via 192.168.10.101 dev eth1 proto bgp metric 20
172.20.2.0/24 nhid 31 via 192.168.20.100 dev eth2 proto bgp metric 20

vtysh -c 'show ip bgp summary'
Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.10.100  4      65001       509       511        0    0    0 00:25:15            1        4 N/A
192.168.10.101  4      65001       508       511        0    0    0 00:25:15            1        4 N/A
192.168.20.100  4      65001       509       511        0    0    0 00:25:15            1        4 N/A

vtysh -c 'show ip bgp'
   Network          Next Hop            Metric LocPrf Weight Path
*> 10.10.1.0/24     0.0.0.0                  0         32768 i
*> 172.20.0.0/24    192.168.10.100                         0 65001 i
*> 172.20.1.0/24    192.168.10.101                         0 65001 i
*> 172.20.2.0/24    192.168.20.100                         0 65001 i

But there's still a problem — the k8s nodes don't have routes to each other's Pod CIDRs! Let's capture BGP traffic to see what's happening:

# k8s-ctr: Capture BGP traffic
tcpdump -i eth1 tcp port 179 -w /tmp/bgp.pcap

# Router: Restart FRR to trigger BGP updates
systemctl restart frr && journalctl -u frr -f

# Analyze the capture
termshark -r /tmp/bgp.pcap
# Filter: bgp.type == 2 (UPDATE messages)

You'll see BGP UPDATE messages from the router, but checking the node routes:

cilium bgp routes
ip -c route

The routes from the router aren't in the kernel! This is by design — Cilium's BGP implementation is control-plane only. It advertises routes but doesn't install received routes into the kernel FIB. Instead, Cilium uses eBPF for packet forwarding.
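You can sketch that behavior as two lists: prefixes learned over BGP versus what the node's kernel FIB actually holds for pod networks. A toy comparison using prefixes from the outputs above (illustrative only, not Cilium internals):

```shell
# Prefixes a node could learn from the router over BGP
learned='172.20.1.0/24 172.20.2.0/24 10.10.1.0/24'
# Pod-network routes in k8s-ctr's kernel FIB: only its own CIDR
fib='172.20.0.0/24'

# Every learned prefix is missing from the FIB - Cilium never installs them
for prefix in $learned; do
  case " $fib " in
    *" $prefix "*) ;;                                    # locally present
    *) echo "$prefix learned via BGP but not installed" ;;
  esac
done
```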

For our multi-NIC setup, we need to manually add routes on the nodes:

# Add routes for the entire Pod CIDR via the router
ip route add 172.20.0.0/16 via 192.168.10.200
sshpass -p 'vagrant' ssh vagrant@k8s-w1 sudo ip route add 172.20.0.0/16 via 192.168.10.200
sshpass -p 'vagrant' ssh vagrant@k8s-w0 sudo ip route add 172.20.0.0/16 via 192.168.20.200

# Verify router has BGP-learned routes
sshpass -p 'vagrant' ssh vagrant@router ip -c route | grep bgp
172.20.0.0/24 nhid 64 via 192.168.10.100 dev eth1 proto bgp metric 20 
172.20.1.0/24 nhid 60 via 192.168.10.101 dev eth1 proto bgp metric 20 
172.20.2.0/24 nhid 62 via 192.168.20.100 dev eth2 proto bgp metric 20 

# Now connectivity works!
kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
Hostname: webpod-697b545f57-jpjbd
---
Hostname: webpod-697b545f57-hnvxq
---
Hostname: webpod-697b545f57-nfhk8
---

Monitor with Hubble:

cilium hubble port-forward&
hubble status
hubble observe -f --protocol tcp --pod curl-pod

Node Maintenance

BGP makes node maintenance elegant. Let's drain a node and see what happens:

# Monitor connectivity
kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'

# Monitor BGP logs (optional)
kubectl logs -n kube-system -l name=cilium-operator -f | grep "subsys=bgp-cp-operator"
kubectl logs -n kube-system -l k8s-app=cilium -f | grep "subsys=bgp-control-plane"

# Drain k8s-w0 for maintenance
kubectl drain k8s-w0 --ignore-daemonsets
kubectl label nodes k8s-w0 enable-bgp=false --overwrite

Check BGP status:

kubectl get node
kubectl get ciliumbgpnodeconfigs
cilium bgp routes
cilium bgp peers
Node      Local AS   Peer AS   Peer Address     Session State   Uptime     Family         Received   Advertised
k8s-ctr   65001      65000     192.168.10.200   established     2h13m35s   ipv4/unicast   3          2    
k8s-w1    65001      65000     192.168.10.200   established     2h13m36s   ipv4/unicast   3          2

# Router view - k8s-w0's route is gone!
sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp summary'"
sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp'"
sshpass -p 'vagrant' ssh vagrant@router ip -c route | grep bgp
172.20.0.0/24 nhid 64 via 192.168.10.100 dev eth1 proto bgp metric 20 
172.20.1.0/24 nhid 60 via 192.168.10.101 dev eth1 proto bgp metric 20

The k8s-w0 node gracefully withdrew its routes! Traffic continues uninterrupted. Restore the node:

kubectl label nodes k8s-w0 enable-bgp=true --overwrite
kubectl uncordon k8s-w0

# Verify restoration
kubectl get node
kubectl get ciliumbgpnodeconfigs
cilium bgp routes
cilium bgp peers

# Redistribute pods
kubectl scale deployment webpod --replicas 0
kubectl scale deployment webpod --replicas 3

For automatic pod redistribution, consider using Descheduler — it evicts pods based on policies to maintain balanced distribution.

Advertising LoadBalancer Service IPs via BGP

Let's advertise LoadBalancer IPs through BGP. First, create an IP pool:

cat << EOF | kubectl apply -f -
apiVersion: "cilium.io/v2"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "cilium-pool"
spec:
  allowFirstLastIPs: "No"
  blocks:
  - cidr: "172.16.1.0/24"
EOF

kubectl get ippool
NAME          DISABLED   CONFLICTING   IPS AVAILABLE   AGE
cilium-pool   false      False         254             8s

Convert our service to LoadBalancer:

kubectl patch svc webpod -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get svc webpod 
NAME     TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
webpod   LoadBalancer   10.96.39.92   172.16.1.1    80:30800/TCP   3h56m

# Pool now shows one IP allocated
kubectl get ippool
NAME          DISABLED   CONFLICTING   IPS AVAILABLE   AGE
cilium-pool   false      False         253             2m23s

# Check service details
kubectl describe svc webpod | grep 'Traffic Policy'
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster

# Verify Cilium service list
kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg service list
ID   Frontend               Service Type   Backend
...
16   172.16.1.1:80/TCP      LoadBalancer   1 => 172.20.0.229:80/TCP (active)
                                           2 => 172.20.1.158:80/TCP (active)
                                           3 => 172.20.2.219:80/TCP (active)

Test from within the cluster:

LBIP=$(kubectl get svc webpod -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $LBIP
172.16.1.1

curl -s $LBIP | grep Hostname
Hostname: webpod-697b545f57-jpjbd

Now advertise this LoadBalancer IP via BGP:

# Monitor router routes
watch "sshpass -p 'vagrant' ssh vagrant@router ip -c route"

# Create BGP advertisement for LoadBalancer IPs
cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-lb-exip-webpod
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:             
        matchExpressions:
          - { key: app, operator: In, values: [ webpod ] }
EOF

kubectl get CiliumBGPAdvertisement
NAME                                AGE
bgp-advertisements                  2m1s
bgp-advertisements-lb-exip-webpod   3s

Check BGP route policies:

kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bgp route-policies
VRouter   Policy Name                                             Type     Match Peers         Match Families   Match Prefixes (Min..Max Len)   RIB Action   Path Actions
65001     allow-local                                             import                                                                        accept
65001     tor-switch-ipv4-PodCIDR                                 export   192.168.10.200/32                    172.20.1.0/24 (24..24)          accept
65001     tor-switch-ipv4-Service-webpod-default-LoadBalancerIP   export   192.168.10.200/32                    172.16.1.1/32 (32..32)          accept

# All nodes advertise the LoadBalancer IP!
cilium bgp routes available ipv4 unicast
Node      VRouter   Prefix          NextHop   Age      Attrs
k8s-ctr   65001     172.16.1.1/32   0.0.0.0   32s      [{Origin: i} {Nexthop: 0.0.0.0}]   
          65001     172.20.0.0/24   0.0.0.0   24m41s   [{Origin: i} {Nexthop: 0.0.0.0}]   
k8s-w0    65001     172.16.1.1/32   0.0.0.0   32s      [{Origin: i} {Nexthop: 0.0.0.0}]   
          65001     172.20.2.0/24   0.0.0.0   24m56s   [{Origin: i} {Nexthop: 0.0.0.0}]   
k8s-w1    65001     172.16.1.1/32   0.0.0.0   32s      [{Origin: i} {Nexthop: 0.0.0.0}]   
          65001     172.20.1.0/24   0.0.0.0   24m56s   [{Origin: i} {Nexthop: 0.0.0.0}]

Router now has ECMP routes to the LoadBalancer IP:

sshpass -p 'vagrant' ssh vagrant@router ip -c route
...
172.16.1.1 nhid 71 proto bgp metric 20 
        nexthop via 192.168.10.101 dev eth1 weight 1 
        nexthop via 192.168.10.100 dev eth1 weight 1 
        nexthop via 192.168.20.100 dev eth2 weight 1 

sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp'"
   Network          Next Hop            Metric LocPrf Weight Path
*> 172.16.1.1/32    192.168.10.100                         0 65001 i 
*=                  192.168.20.100                         0 65001 i 
*=                  192.168.10.101                         0 65001 i

sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp 172.16.1.1/32'"
BGP routing table entry for 172.16.1.1/32, version 7
Paths: (3 available, best #1, table default)
  Advertised to non peer-group peers:
  192.168.10.100 192.168.10.101 192.168.20.100
  65001
    192.168.10.100 from 192.168.10.100 (192.168.10.100)
      Origin IGP, valid, external, multipath, best (Router ID)
      Last update: Sat Aug  9 17:50:29 2025
  65001
    192.168.20.100 from 192.168.20.100 (192.168.20.100)
      Origin IGP, valid, external, multipath
      Last update: Sat Aug  9 17:50:29 2025
  65001
    192.168.10.101 from 192.168.10.101 (192.168.10.101)
      Origin IGP, valid, external, multipath
      Last update: Sat Aug  9 17:50:29 2025

Test from the router:

LBIP=172.16.1.1
curl -s $LBIP
curl -s $LBIP | grep Hostname
curl -s $LBIP | grep RemoteAddr

# Load balance test
for i in {1..100};  do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
     34 Hostname: webpod-697b545f57-jpjbd
     33 Hostname: webpod-697b545f57-hnvxq
     33 Hostname: webpod-697b545f57-nfhk8

Now scale down to see a problem:

kubectl scale deployment webpod --replicas 2
kubectl get pod -owide

The router still has routes through k8s-ctr even though no pods are there:

# Router still sees all three paths
vtysh -c 'show ip bgp 172.16.1.1/32'

# This causes SNAT when traffic goes through k8s-ctr
for i in {1..100};  do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
while true; do curl -s $LBIP | egrep 'Hostname|RemoteAddr' ; sleep 0.1; done
Hostname: webpod-697b545f57-swtdz
RemoteAddr: 192.168.10.100:40460
Hostname: webpod-697b545f57-87lf2
RemoteAddr: 192.168.10.100:40474

The RemoteAddr shows k8s-ctr's IP because of SNAT! This happens with externalTrafficPolicy: Cluster.
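A toy model of the externalTrafficPolicy: Cluster behavior (the IPs are from this lab; the function is my illustration, not Cilium code):

```shell
# If the ECMP-chosen node has a local backend, the client IP survives;
# otherwise that node forwards to a backend node and SNATs to its own IP.
remote_addr() {  # args: client_ip chosen_node_ip backend_node_ip
  if [ "$2" = "$3" ]; then
    echo "$1"        # local backend: no SNAT, pod sees the client
  else
    echo "$2"        # remote backend: SNAT, pod sees the intermediate node
  fi
}

remote_addr 192.168.10.200 192.168.10.100 192.168.10.101  # prints 192.168.10.100
remote_addr 192.168.10.200 192.168.10.101 192.168.10.101  # prints 192.168.10.200
```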

External Traffic Policy Local

To preserve source IPs and advertise only nodes with pods:

# Monitor routes
watch "sshpass -p 'vagrant' ssh vagrant@router ip -c route"

# Change to Local policy
kubectl patch service webpod -p '{"spec":{"externalTrafficPolicy":"Local"}}'

Now the router only sees routes through nodes with pods:

# Router - only nodes with pods advertise!
vtysh -c 'show ip bgp'
vtysh -c 'show ip bgp 172.16.1.1/32'
vtysh -c 'show ip route bgp'
ip -c route

# Start tcpdump on all nodes
# Terminal 1 (k8s-w1)
tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'

# Terminal 2 (k8s-w0) 
tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'

# Terminal 3 (k8s-ctr)
tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'

# Test from router - source IP preserved!
LBIP=172.16.1.1
for i in {1..100};  do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
    100 Hostname: webpod-697b545f57-lppz4

while true; do curl -s $LBIP | egrep 'Hostname|RemoteAddr' ; sleep 0.1; done
Hostname: webpod-697b545f57-lppz4
RemoteAddr: 192.168.10.200:54312

Source IP is preserved! But all traffic goes to one node due to ECMP hash.

Linux ECMP Hash Policy

The default Linux ECMP policy hashes on L3 fields only (source and destination IP), so every flow between the same pair of hosts takes the same path. Let's improve this:

# On router - check current distribution
for i in {1..100};  do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
    100 Hostname: webpod-697b545f57-lppz4

# Change to L4 hash (includes ports)
sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1
echo "net.ipv4.fib_multipath_hash_policy=1" >> /etc/sysctl.conf

# Test again - better distribution!
for i in {1..100};  do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
     59 Hostname: webpod-697b545f57-87lf2
     41 Hostname: webpod-697b545f57-swtdz

# Scale to 3 replicas
kubectl scale deployment webpod --replicas 3
kubectl get pod -owide

# Router sees all three paths
ip -c route
for i in {1..100};  do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
     37 Hostname: webpod-697b545f57-bgpv9
     35 Hostname: webpod-697b545f57-87lf2
     28 Hostname: webpod-697b545f57-swtdz

Much better distribution with L4 hashing!
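For reference, the net.ipv4.fib_multipath_hash_policy values map as follows (per the kernel's ip-sysctl documentation); a small helper to keep them straight:

```shell
# net.ipv4.fib_multipath_hash_policy (value 1 needs Linux >= 4.12)
hash_policy() {
  case "$1" in
    0) echo "L3: source/destination IP only" ;;
    1) echo "L4: 5-tuple (IPs, ports, protocol)" ;;
    2) echo "L3 on inner packet if present (encapsulated traffic)" ;;
    *) echo "unknown" ;;
  esac
}

hash_policy 0   # prints: L3: source/destination IP only
hash_policy 1   # prints: L4: 5-tuple (IPs, ports, protocol)
```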

DSR with Maglev

Let's compare different load balancing modes. First check current settings:

kubectl exec -it -n kube-system ds/cilium -- cilium status --verbose
...
  Mode:                 SNAT
  Backend Selection:    Random
  Session Affinity:     Enabled
...

Now enable DSR (Direct Server Return) with Maglev for consistent hashing:

# Load GENEVE module for DSR
modprobe geneve
lsmod | grep -E 'vxlan|geneve'
for i in w1 w0 ; do echo ">> node : k8s-$i <<"; sshpass -p 'vagrant' ssh vagrant@k8s-$i sudo modprobe geneve ; echo; done
for i in w1 w0 ; do echo ">> node : k8s-$i <<"; sshpass -p 'vagrant' ssh vagrant@k8s-$i sudo lsmod | grep -E 'vxlan|geneve' ; echo; done

# Upgrade to DSR mode
helm upgrade cilium cilium/cilium --version 1.18.0 --namespace kube-system --reuse-values \
  --set tunnelProtocol=geneve --set loadBalancer.mode=dsr --set loadBalancer.dsrDispatch=geneve \
  --set loadBalancer.algorithm=maglev

kubectl -n kube-system rollout restart ds/cilium

# Verify settings
kubectl exec -it -n kube-system ds/cilium -- cilium status --verbose
...
  Mode:                  DSR
    DSR Dispatch Mode:   Geneve
  Backend Selection:     Maglev (Table Size: 16381)
  Session Affinity:     Enabled
...

# Reset to Cluster policy for testing
kubectl patch svc webpod -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'

# Capture on all nodes
tcpdump -i eth1 -w /tmp/dsr.pcap

# Test from router
curl -s $LBIP
curl -s $LBIP
curl -s $LBIP
curl -s $LBIP
curl -s $LBIP

Download and analyze the capture to see GENEVE encapsulation carrying the original client info for direct return!

The three main traffic patterns we explored:

  1. BGP + Local Policy + SNAT: Recommended. Only nodes with pods advertise routes, source IP preserved, but all traffic through ECMP-selected nodes.
  2. BGP + Cluster Policy + SNAT: Not recommended. All nodes advertise, causes extra hops and SNAT when pods are on different nodes.
  3. BGP + Cluster Policy + DSR + Maglev: Better than #2. Uses GENEVE to preserve client info, Maglev for consistent hashing, direct return from pod nodes.
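If it helps, the trade-offs collapse into a tiny mnemonic lookup (my own summary of the three patterns, not a Cilium feature; the mode strings mirror the loadBalancer.mode Helm values):

```shell
# args: externalTrafficPolicy loadBalancer.mode
traffic_pattern() {
  case "$1/$2" in
    Local/snat)   echo "source IP kept; only backend nodes advertise" ;;
    Cluster/snat) echo "source IP lost to SNAT; extra hop possible" ;;
    Cluster/dsr)  echo "source IP kept via Geneve; direct return" ;;
    *)            echo "unspecified combination" ;;
  esac
}

traffic_pattern Local snat    # pattern 1
traffic_pattern Cluster dsr   # pattern 3
```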

Disabling Status Reporting

For large clusters, disable BGP status reporting to reduce API server load:

# Check current status
kubectl get ciliumbgpnodeconfigs -o yaml | yq

# Disable status reporting
helm upgrade cilium cilium/cilium --version 1.18.0 --namespace kube-system --reuse-values \
  --set bgpControlPlane.statusReport.enabled=false

kubectl -n kube-system rollout restart ds/cilium

# Verify - status is now empty
kubectl get ciliumbgpnodeconfigs -o yaml | yq
...
      "status": {}

This BGP integration transforms Kubernetes from an isolated island into a first-class citizen in your network infrastructure. The combination of BGP route advertisement, ECMP load balancing, and ExternalTrafficPolicy gives you fine control over traffic patterns. The ability to gracefully maintain nodes through BGP withdrawal and the choice between SNAT and DSR modes provides the flexibility needed for production deployments.

Cilium ClusterMesh

Now let me show you how to connect multiple Kubernetes clusters using Cilium ClusterMesh. This is incredibly useful for high availability, disaster recovery, and geographic distribution. I'll use Kind to create two local clusters that we'll mesh together.

Creating the Test Clusters

First, let's create two Kind clusters named west and east. Each will have its own Pod and Service CIDRs:

# Create west cluster
kind create cluster --name west --image kindest/node:v1.33.2 --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30000 # sample apps
    hostPort: 30000
  - containerPort: 30001 # hubble ui
    hostPort: 30001
- role: worker
  extraPortMappings:
  - containerPort: 30002 # sample apps
    hostPort: 30002
networking:
  podSubnet: "10.0.0.0/16"
  serviceSubnet: "10.2.0.0/16"
  disableDefaultCNI: true
  kubeProxyMode: none
EOF

Important settings here:

  • disableDefaultCNI: true: We'll install Cilium instead of kindnet
  • kubeProxyMode: none: Cilium will replace kube-proxy entirely
  • Port mappings for accessing services and Hubble UI from localhost

Install basic tools on west nodes:

docker exec -it west-control-plane sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'
docker exec -it west-worker sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'

Create the east cluster with different CIDRs:

kind create cluster --name east --image kindest/node:v1.33.2 --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 31000 # sample apps
    hostPort: 31000
  - containerPort: 31001 # hubble ui
    hostPort: 31001
- role: worker
  extraPortMappings:
  - containerPort: 31002 # sample apps
    hostPort: 31002
networking:
  podSubnet: "10.1.0.0/16"
  serviceSubnet: "10.3.0.0/16"
  disableDefaultCNI: true
  kubeProxyMode: none
EOF

Install tools on east nodes:

docker exec -it east-control-plane sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'
docker exec -it east-worker sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'

Verify both clusters are created:

kubectl config get-contexts 
CURRENT   NAME        CLUSTER     AUTHINFO    NAMESPACE
*         kind-east   kind-east   kind-east   
          kind-west   kind-west   kind-west

# Test accessing both clusters
kubectl get node --context kind-west
NAME                 STATUS     ROLES           AGE   VERSION
west-control-plane   NotReady   control-plane   45s   v1.33.2
west-worker          NotReady   <none>          23s   v1.33.2

kubectl get node --context kind-east
NAME                 STATUS     ROLES           AGE   VERSION
east-control-plane   NotReady   control-plane   38s   v1.33.2
east-worker          NotReady   <none>          16s   v1.33.2

Nodes are NotReady because we haven't installed a CNI yet. Let's set up aliases for easier management:

alias kwest='kubectl --context kind-west'
alias keast='kubectl --context kind-east'

# Test aliases
kwest get node -owide
keast get node -owide

Installing Cilium with ClusterMesh Support

Install Cilium CLI if you haven't already:

# macOS
brew install cilium-cli

# Linux (including WSL2)
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}

Install Cilium on west cluster with ClusterMesh-specific settings:

cilium install --version 1.17.6 --set ipam.mode=kubernetes \
--set kubeProxyReplacement=true --set bpf.masquerade=true \
--set endpointHealthChecking.enabled=false --set healthChecking=false \
--set operator.replicas=1 --set debug.enabled=true \
--set routingMode=native --set autoDirectNodeRoutes=true --set ipv4NativeRoutingCIDR=10.0.0.0/16 \
--set ipMasqAgent.enabled=true --set ipMasqAgent.config.nonMasqueradeCIDRs='{10.1.0.0/16}' \
--set cluster.name=west --set cluster.id=1 \
--context kind-west

Critical settings for ClusterMesh:

  • ipv4NativeRoutingCIDR=10.0.0.0/16: west's own Pod CIDR, routed natively without SNAT
  • nonMasqueradeCIDRs='{10.1.0.0/16}': don't NAT traffic destined for east's Pod CIDR
  • cluster.name=west and cluster.id=1: identity that must be unique across every cluster in the mesh (IDs range from 1 to 255)

Watch the installation:

watch kubectl get pod -n kube-system --context kind-west

Install Cilium on east cluster:

cilium install --version 1.17.6 --set ipam.mode=kubernetes \
--set kubeProxyReplacement=true --set bpf.masquerade=true \
--set endpointHealthChecking.enabled=false --set healthChecking=false \
--set operator.replicas=1 --set debug.enabled=true \
--set routingMode=native --set autoDirectNodeRoutes=true --set ipv4NativeRoutingCIDR=10.1.0.0/16 \
--set ipMasqAgent.enabled=true --set ipMasqAgent.config.nonMasqueradeCIDRs='{10.0.0.0/16}' \
--set cluster.name=east --set cluster.id=2 \
--context kind-east

watch kubectl get pod -n kube-system --context kind-east

Verify installations:

kwest get pod -A && keast get pod -A
cilium status --context kind-west
cilium status --context kind-east

# Check configuration
cilium config view --context kind-west | grep -E "cluster-|masq"
cilium config view --context kind-east | grep -E "cluster-|masq"

# Detailed status
kwest exec -it -n kube-system ds/cilium -- cilium status --verbose
keast exec -it -n kube-system ds/cilium -- cilium status --verbose

Check IP masquerade configuration:

kwest -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg bpf ipmasq list
IP PREFIX/ADDRESS
10.0.0.0/8
10.1.0.0/16

keast -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg bpf ipmasq list
IP PREFIX/ADDRESS
10.0.0.0/16
10.0.0.0/8

Each cluster won't masquerade traffic to the other's Pod CIDR!
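
The decision the BPF ipmasq map encodes can be sketched in plain shell (this is an illustrative model, not Cilium code): a destination inside any non-masquerade CIDR keeps its pod source IP, anything else is SNAT'ed to the node IP.

```shell
# Convert a dotted-quad IP to a 32-bit integer.
ip2int() { local IFS=.; set -- $1; echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 )); }

# in_cidr <ip> <cidr>: exit 0 if ip falls inside cidr.
in_cidr() {
  local ip base bits mask
  ip=$(ip2int "$1")
  base=$(ip2int "${2%/*}"); bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( base & mask )) ]
}

# masq_decision <dst-ip> <non-masq-cidr>...: prints "preserve" or "SNAT".
masq_decision() {
  local dst=$1; shift
  for cidr in "$@"; do
    if in_cidr "$dst" "$cidr"; then echo preserve; return 0; fi
  done
  echo SNAT
}

# West's view, using the two entries from the ipmasq list above:
masq_decision 10.1.0.128 10.0.0.0/8 10.1.0.0/16   # pod in east -> preserve
masq_decision 1.1.1.1    10.0.0.0/8 10.1.0.0/16   # internet    -> SNAT
```

This is why the cross-cluster pings later in this post arrive with their original pod source IPs.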

Setting Up ClusterMesh

Check initial routing tables — no routes between clusters yet:

docker exec -it west-control-plane ip -c route
docker exec -it west-worker ip -c route
docker exec -it east-control-plane ip -c route
docker exec -it east-worker ip -c route

Step 1: Synchronize Certificate Authority.

ClusterMesh requires both clusters to trust the same CA, so that agents in one cluster can authenticate to the other's clustermesh-apiserver. Replace east's CA with a copy of west's:

# Check existing CA in east
keast get secret -n kube-system cilium-ca
NAME         TYPE     DATA   AGE
cilium-ca    Opaque   2      5m12s

# Delete and replace with west's CA
keast delete secret -n kube-system cilium-ca

kubectl --context kind-west get secret -n kube-system cilium-ca -o yaml | \
kubectl --context kind-east create -f -

# Verify
keast get secret -n kube-system cilium-ca

Step 2: Enable ClusterMesh on both clusters.

Start monitoring in separate terminals:

# Terminal 1
cilium clustermesh status --context kind-west --wait

# Terminal 2  
cilium clustermesh status --context kind-east --wait

Enable ClusterMesh:

cilium clustermesh enable --service-type NodePort --enable-kvstoremesh=false --context kind-west
cilium clustermesh enable --service-type NodePort --enable-kvstoremesh=false --context kind-east

This creates clustermesh-apiserver deployments. Check them:

kwest get svc,ep -n kube-system clustermesh-apiserver
NAME                            TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
service/clustermesh-apiserver   NodePort   10.2.216.182   <none>        2379:32379/TCP   65s

NAME                              ENDPOINTS         AGE
endpoints/clustermesh-apiserver   10.0.0.195:2379   65s

kwest get pod -n kube-system -owide | grep clustermesh
clustermesh-apiserver-7d6b9c4b7f-xmxqt   1/1     Running   0          87s   10.0.0.195   west-control-plane   <none>

keast get svc,ep -n kube-system clustermesh-apiserver
NAME                            TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
service/clustermesh-apiserver   NodePort   10.3.252.188   <none>        2379:32379/TCP   43s

NAME                              ENDPOINTS         AGE
endpoints/clustermesh-apiserver   10.1.0.206:2379   43s

Step 3: Connect the clusters:

# Monitor in separate terminals
watch -d "cilium clustermesh status --context kind-west --wait"
watch -d "cilium clustermesh status --context kind-east --wait"

# Connect them
cilium clustermesh connect --context kind-west --destination-context kind-east

Verify connection:

cilium clustermesh status --context kind-west --wait
✅ Service "clustermesh-apiserver" of type "NodePort" found
✅ Cluster access information is available:
  - 172.18.0.3:32379
✅ Deployment clustermesh-apiserver is ready
✅ All 1 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]
🔌 Cluster Connections:
- east: 2/2 configured, 2/2 connected

cilium clustermesh status --context kind-east --wait
✅ Service "clustermesh-apiserver" of type "NodePort" found
✅ Cluster access information is available:
  - 172.18.0.4:32379
✅ Deployment clustermesh-apiserver is ready
✅ All 1 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]
🔌 Cluster Connections:
- west: 2/2 configured, 2/2 connected

Check detailed mesh status:

kubectl exec -it -n kube-system ds/cilium -c cilium-agent --context kind-west -- cilium-dbg troubleshoot clustermesh
kubectl exec -it -n kube-system ds/cilium -c cilium-agent --context kind-east -- cilium-dbg troubleshoot clustermesh

kwest exec -it -n kube-system ds/cilium -- cilium status --verbose | grep -A5 ClusterMesh
ClusterMesh:   1/1 remote clusters ready, 0 global-services
   east: ready, 2 nodes, 4 endpoints, 3 identities, 0 services, 0 MCS-API service exports, 0 reconnections (last: never)
   └  etcd: 1/1 connected, leases=0, lock leases=0, has-quorum=true: endpoint status checks are disabled, ID: c6ba18866da7dfd8
   └  remote configuration: expected=true, retrieved=true, cluster-id=2, kvstoremesh=false, sync-canaries=true, service-exports=disabled
   └  synchronization status: nodes=true, endpoints=true, identities=true, services=true

Check Helm values to see the configuration:

helm get values -n kube-system cilium --kube-context kind-west | grep -A10 clustermesh
clustermesh:
  apiserver:
    kvstoremesh:
      enabled: false
    service:
      type: NodePort
  config:
    clusters:
    - ips:
      - 172.18.0.4
      name: east
      port: 32379
    enabled: true

Now check routing tables again — routes to the other cluster's Pod CIDRs are added!

docker exec -it west-worker ip -c route | grep 10.1
10.1.0.0/24 via 172.18.0.4 dev eth0 proto kernel
10.1.1.0/24 via 172.18.0.3 dev eth0 proto kernel

docker exec -it east-worker ip -c route | grep 10.0
10.0.0.0/24 via 172.18.0.2 dev eth0 proto kernel
10.0.1.0/24 via 172.18.0.6 dev eth0 proto kernel
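
Nobody typed these routes by hand: ClusterMesh syncs each remote node's IP and PodCIDR into every agent, and autoDirectNodeRoutes turns each entry into a direct route. A sketch of the resulting commands (illustrative strings only; Cilium actually programs the routes via netlink):

```shell
# print_routes "<pod-cidr>=<node-ip>" ...: one direct route per remote node.
print_routes() {
  for entry in "$@"; do
    echo "ip route replace ${entry%=*} via ${entry#*=}"
  done
}

# East's two nodes, as seen from a west node:
print_routes 10.1.0.0/24=172.18.0.4 10.1.1.0/24=172.18.0.3
```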

Enable Hubble for Visualization

Enable Hubble on west:

helm upgrade cilium cilium/cilium --version 1.17.6 --namespace kube-system --reuse-values \
--set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort --set hubble.ui.service.nodePort=30001 --kube-context kind-west

kwest -n kube-system rollout restart ds/cilium

Enable Hubble on east:

helm upgrade cilium cilium/cilium --version 1.17.6 --namespace kube-system --reuse-values \
--set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort --set hubble.ui.service.nodePort=31001 --kube-context kind-east

keast -n kube-system rollout restart ds/cilium

Verify Hubble services:

kwest get svc,ep -n kube-system hubble-ui
keast get svc,ep -n kube-system hubble-ui

# Access Hubble UIs
open http://localhost:30001  # West cluster
open http://localhost:31001  # East cluster

Pod-to-Pod Communication Across Clusters

Deploy test pods in both clusters:

cat << EOF | kubectl apply --context kind-west -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
  labels:
    app: curl
spec:
  containers:
  - name: curl
    image: nicolaka/netshoot
    command: ["tail"]
    args: ["-f", "/dev/null"]
  terminationGracePeriodSeconds: 0
EOF

cat << EOF | kubectl apply --context kind-east -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
  labels:
    app: curl
spec:
  containers:
  - name: curl
    image: nicolaka/netshoot
    command: ["tail"]
    args: ["-f", "/dev/null"]
  terminationGracePeriodSeconds: 0
EOF

Check pod IPs:

kwest get pod -owide && keast get pod -owide
NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                 NOMINATED NODE   READINESS GATES
curl-pod   1/1     Running   0          43s   10.0.0.144   west-control-plane   <none>           <none>
NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                 NOMINATED NODE   READINESS GATES
curl-pod   1/1     Running   0          36s   10.1.0.128   east-control-plane   <none>           <none>

Test cross-cluster connectivity:

# West to East
kubectl exec -it curl-pod --context kind-west -- ping -c 1 10.1.0.128
PING 10.1.0.128 (10.1.0.128) 56(84) bytes of data.
64 bytes from 10.1.0.128: icmp_seq=1 ttl=62 time=0.877 ms

# Start continuous ping
kubectl exec -it curl-pod --context kind-west -- ping 10.1.0.128

Check on the destination — no NAT, source IP preserved!

# Terminal 1: tcpdump on destination pod
kubectl exec -it curl-pod --context kind-east -- tcpdump -i eth0 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:23:45.123456 IP 10.0.0.144 > 10.1.0.128: ICMP echo request, id 1, seq 1, length 64
08:23:45.123489 IP 10.1.0.128 > 10.0.0.144: ICMP echo reply, id 1, seq 1, length 64

Source IP 10.0.0.144 is preserved — no masquerading!

Test the reverse direction:

kubectl exec -it curl-pod --context kind-east -- ping -c 1 10.0.0.144
PING 10.0.0.144 (10.0.0.144) 56(84) bytes of data.
64 bytes from 10.0.0.144: icmp_seq=1 ttl=62 time=1.24 ms

Global Services — Load Balancing Across Clusters

Deploy identical services in both clusters with global annotation:

cat << EOF | kubectl apply --context kind-west -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webpod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webpod
  template:
    metadata:
      labels:
        app: webpod
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - webpod
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: webpod
        image: traefik/whoami
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webpod
  labels:
    app: webpod
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: webpod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
EOF

Deploy the same in east:

cat << EOF | kubectl apply --context kind-east -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webpod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webpod
  template:
    metadata:
      labels:
        app: webpod
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - webpod
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: webpod
        image: traefik/whoami
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webpod
  labels:
    app: webpod
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: webpod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
EOF

The key is the service.cilium.io/global: "true" annotation: services with the same name and namespace in both clusters are merged into one global service spanning the mesh.
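
A sketch of how the backend set of a global service is assembled (hypothetical data format; the real state is synced through the clustermesh-apiserver): it is the union of backends from every connected cluster that exports a same-named service, skipping clusters that have opted out of sharing.

```shell
# global_backends "<cluster>/<shared>/<backend>" ...: print the merged
# backend list, dropping entries from clusters with sharing disabled.
global_backends() {
  for e in "$@"; do
    shared=$(echo "$e" | cut -d/ -f2)
    if [ "$shared" = "true" ]; then echo "${e#*/*/}"; fi
  done
}

# Matches the four-backend service list shown below:
global_backends west/true/10.0.0.46:80 west/true/10.0.1.219:80 \
                east/true/10.1.0.190:80 east/true/10.1.1.41:80
```

The same model explains the service.cilium.io/shared=false behavior covered later: flipping a cluster's shared flag removes its backends from everyone else's view.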

Check the service endpoints — west sees all pods from both clusters:

kwest exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID   Frontend            Service Type   Backend
13   10.2.47.149:80/TCP  ClusterIP      1 => 10.0.0.46:80/TCP (active)
                                        2 => 10.0.1.219:80/TCP (active)
                                        3 => 10.1.0.190:80/TCP (active)
                                        4 => 10.1.1.41:80/TCP (active)

Test load balancing across clusters:

for i in {1..100}; do
  kubectl exec -i curl-pod --context kind-west -- sh -c "curl -s --connect-timeout 1 webpod | grep Hostname"
done | sort | uniq -c | sort -nr
     31 Hostname: webpod-697b545f57-rcp4r
     29 Hostname: webpod-697b545f57-jjpz9
     23 Hostname: webpod-697b545f57-gkh6r
     17 Hostname: webpod-697b545f57-75q89

Traffic is distributed across all four backends in both clusters. With an even split, though, roughly half of the requests cross the cluster boundary, which adds latency.

Service Affinity — Preferring Local Cluster

To prefer pods in the same cluster, add affinity annotation:

kwest annotate service webpod service.cilium.io/affinity=local --overwrite
keast annotate service webpod service.cilium.io/affinity=local --overwrite

Check the service list again — notice "preferred" backends:

kwest exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID   Frontend            Service Type   Backend
13   10.2.47.149:80/TCP  ClusterIP      1 => 10.0.0.46:80/TCP (active) (preferred)
                                        2 => 10.0.1.219:80/TCP (active) (preferred)
                                        3 => 10.1.0.190:80/TCP (active)
                                        4 => 10.1.1.41:80/TCP (active)

West cluster prefers its local pods! Test again:

for i in {1..100}; do
  kubectl exec -i curl-pod --context kind-west -- sh -c "curl -s --connect-timeout 1 webpod | grep Hostname"
done | sort | uniq -c | sort -nr
     52 Hostname: webpod-697b545f57-rcp4r
     48 Hostname: webpod-697b545f57-jjpz9

Now traffic stays within the west cluster! The remote endpoints are only used if local ones are unavailable.
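
The selection logic can be modeled in a few lines (illustrative, not Cilium's datapath): with affinity=local, backends marked preferred form the active pool, and the remote backends are used only when that pool is empty.

```shell
# select_pool "<backend>:<preferred|remote>" ...: print the active backend
# pool -- preferred backends if any exist, otherwise all backends.
select_pool() {
  local pool="" all=""
  for e in "$@"; do
    all="$all ${e%:*}"
    if [ "${e##*:}" = "preferred" ]; then pool="$pool ${e%:*}"; fi
  done
  echo ${pool:-$all}
}

# Mirrors the service list above: only the two west backends are used.
select_pool 10.0.0.46:preferred 10.0.1.219:preferred 10.1.0.190:remote 10.1.1.41:remote

# Failover case: no healthy local backends left, fall back to remote.
select_pool 10.1.0.190:remote 10.1.1.41:remote
```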

You can also set remote preference (for testing):

kwest annotate service webpod service.cilium.io/affinity=remote --overwrite
keast annotate service webpod service.cilium.io/affinity=remote --overwrite

kwest exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID   Frontend            Service Type   Backend
13   10.2.47.149:80/TCP  ClusterIP      1 => 10.1.0.190:80/TCP (active) (preferred)
                                        2 => 10.1.1.41:80/TCP (active) (preferred)
                                        3 => 10.0.0.46:80/TCP (active)
                                        4 => 10.0.1.219:80/TCP (active)

Now east endpoints are preferred from west!

Controlling Service Sharing

You can disable endpoint synchronization for specific services:

kwest annotate service webpod service.cilium.io/shared=false --overwrite

# Check east - no west endpoints anymore
keast exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID   Frontend             Service Type   Backend
13   10.3.105.173:80/TCP  ClusterIP      1 => 10.1.0.190:80/TCP (active) (preferred)
                                         2 => 10.1.1.41:80/TCP (active) (preferred)

West no longer exports its endpoints, so east's service is left with only its own two backends; east still shares its endpoints, so west continues to see all four.

For production, I recommend this combination of annotations:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.cilium.io/global: "true"     # Share across clusters
    service.cilium.io/affinity: "local"  # Prefer local endpoints
    service.cilium.io/shared: "true"     # Allow endpoint sync

This gives you:

  • High availability (failover to remote cluster)
  • Optimal performance (prefer local cluster)
  • Reduced cross-cluster traffic and costs

Monitoring with Hubble

Open Hubble UI to see the cross-cluster flows:

# West cluster
open http://localhost:30001

# East cluster  
open http://localhost:31001

In Hubble, you'll see cluster names in the flow visualization — this only appears with ClusterMesh enabled. You can filter by cluster, see cross-cluster traffic patterns, and identify potential optimizations.

Troubleshooting ClusterMesh

If clusters aren't connecting, check:

# Verify cluster configuration
cilium clustermesh status --context kind-west
cilium clustermesh status --context kind-east

# Check connectivity from Cilium agents
kubectl exec -it -n kube-system ds/cilium -c cilium-agent --context kind-west -- cilium-dbg troubleshoot clustermesh

# Verify clustermesh-apiserver is accessible
kwest get svc -n kube-system clustermesh-apiserver
keast run test --image=nicolaka/netshoot --rm -it --restart=Never -- curl -v telnet://172.18.0.3:32379

# Check logs
kwest logs -n kube-system deployment/clustermesh-apiserver
kwest logs -n kube-system ds/cilium -c cilium-agent | grep cluster

# Verify certificates match
kwest get secret -n kube-system cilium-ca -o yaml | grep ca.crt | head -1
keast get secret -n kube-system cilium-ca -o yaml | grep ca.crt | head -1
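
Eyeballing long base64 lines is error-prone, so a small helper that compares the two ca.crt payloads by hash can be handy (the jsonpath extraction in the comments is an assumption about your secret layout; the hashing itself is plain shell):

```shell
# Feed it the two base64 payloads, e.g.:
#   caw=$(kwest get secret -n kube-system cilium-ca -o jsonpath='{.data.ca\.crt}')
#   cae=$(keast get secret -n kube-system cilium-ca -o jsonpath='{.data.ca\.crt}')
#   ca_match "$caw" "$cae"
ca_match() {
  if [ "$(printf '%s' "$1" | sha256sum)" = "$(printf '%s' "$2" | sha256sum)" ]; then
    echo "CA match"
  else
    echo "CA MISMATCH"
  fi
}

ca_match "LS0tLS1CRUdJTg==" "LS0tLS1CRUdJTg=="   # CA match
ca_match "LS0tLS1CRUdJTg==" "b3RoZXItY2E="       # CA MISMATCH
```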

ClusterMesh transforms multiple Kubernetes clusters into a unified network fabric. The ability to have pods communicate directly across clusters without NAT, combined with intelligent service load balancing and affinity controls, enables sophisticated multi-cluster architectures. Whether you're building for high availability, geographic distribution, or disaster recovery, ClusterMesh provides the networking foundation you need.

The key takeaway is that with proper configuration — especially service affinity settings — you can have the benefits of a global service mesh while maintaining optimal network paths and minimizing cross-cluster traffic. This is crucial for both performance and cost optimization in multi-region deployments.