BGP Control Plane
Let me walk you through the complete Cilium BGP Control Plane setup from my recent study session. This is week 5 of gasida's Cilium study group, and I'll share every command and output so you can reproduce this setup exactly. The BGP integration is fascinating — it allows Kubernetes clusters to participate in traditional network routing protocols.
We're using the same four-VM topology, but this time the router will run FRR (Free Range Routing) for BGP:
- k8s-ctr: Control plane at 192.168.10.100
- k8s-w1: Worker at 192.168.10.101
- k8s-w0: Worker at 192.168.20.100 (different network segment)
- router: BGP router at 192.168.10.200/192.168.20.200 with FRR installed
The key difference in our Cilium installation is we're disabling autoDirectNodeRoutes since BGP will handle route distribution:
helm install cilium cilium/cilium --version 1.18.0 --namespace kube-system \
--set k8sServiceHost=192.168.10.100 --set k8sServicePort=6443 \
--set ipam.mode="cluster-pool" --set ipam.operator.clusterPoolIPv4PodCIDRList={"172.20.0.0/16"} --set ipv4NativeRoutingCIDR=172.20.0.0/16 \
--set routingMode=native --set autoDirectNodeRoutes=false --set bgpControlPlane.enabled=true \
--set kubeProxyReplacement=true --set bpf.masquerade=true --set installNoConntrackIptablesRules=true \
--set endpointHealthChecking.enabled=false --set healthChecking=false \
--set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort --set hubble.ui.service.nodePort=30003 \
--set prometheus.enabled=true --set operator.prometheus.enabled=true --set hubble.metrics.enableOpenMetrics=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}" \
--set operator.replicas=1 --set debug.enabled=true >/dev/null 2>&1
Note the critical settings:
- autoDirectNodeRoutes=false: BGP will manage routes instead
- bgpControlPlane.enabled=true: Enables Cilium's BGP speaker
Now let's configure FRR on the router. First, install and enable BGP daemon:
echo "[TASK 7] Configure FRR"
apt install frr -y >/dev/null 2>&1
sed -i "s/^bgpd=no/bgpd=yes/g" /etc/frr/daemons
NODEIP=$(ip -4 addr show eth1 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
cat << EOF >> /etc/frr/frr.conf
!
router bgp 65000
bgp router-id $NODEIP
bgp graceful-restart
no bgp ebgp-requires-policy
bgp bestpath as-path multipath-relax
maximum-paths 4
network 10.10.1.0/24
EOF
systemctl daemon-reexec >/dev/null 2>&1
systemctl restart frr >/dev/null 2>&1
systemctl enable frr >/dev/null 2>&1
Verify Cilium recognizes BGP is enabled:
cilium config view | grep bgp
"enable-bgp-control-plane": "true"
Check the initial routing tables on each node:
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.0.0/24 via 172.20.0.251 dev cilium_host proto kernel src 172.20.0.251
172.20.0.251 dev cilium_host proto kernel scope link
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.100
192.168.20.0/24 via 192.168.10.200 dev eth1 proto static
Check k8s-w1:
>> node : k8s-w1
Warning: Permanently added 'k8s-w1' (ED25519) to the list of known hosts.
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.1.0/24 via 172.20.1.115 dev cilium_host proto kernel src 172.20.1.115
172.20.1.115 dev cilium_host proto kernel scope link
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.101
192.168.20.0/24 via 192.168.10.200 dev eth1 proto static
Check k8s-w0:
>> node : k8s-w0
Warning: Permanently added 'k8s-w0' (ED25519) to the list of known hosts.
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.2.0/24 via 172.20.2.116 dev cilium_host proto kernel src 172.20.2.116
172.20.2.116 dev cilium_host proto kernel scope link
192.168.10.0/24 via 192.168.20.200 dev eth1 proto static
192.168.20.0/24 dev eth1 proto kernel scope link src 192.168.20.100
Notice that nodes can't see each other's Pod CIDRs — this is because autoDirectNodeRoutes=false.
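The kernel picks the most specific matching route, so a destination with no matching Pod CIDR entry falls through to the default route. A minimal longest-prefix-match sketch in Python (a simplified copy of k8s-ctr's table above, not the live FIB) shows why a pod on k8s-w1 is unreachable:

```python
import ipaddress

# Simplified copy of k8s-ctr's routing table from the output above
routes = {
    "0.0.0.0/0": "10.0.2.2 (eth0)",            # default route
    "172.20.0.0/24": "cilium_host (local pods)",
    "192.168.10.0/24": "eth1 (link)",
    "192.168.20.0/24": "192.168.10.200 (eth1)",
}

def lookup(dst):
    """Longest-prefix match, like the kernel FIB."""
    addr = ipaddress.ip_address(dst)
    best = max((n for n in routes if addr in ipaddress.ip_network(n)),
               key=lambda n: ipaddress.ip_network(n).prefixlen)
    return routes[best]

print(lookup("172.20.0.10"))  # local pod  -> cilium_host (local pods)
print(lookup("172.20.1.10"))  # k8s-w1 pod -> falls to the default route!
```

With no 172.20.1.0/24 entry, traffic for k8s-w1's pods is sent out eth0 toward the NAT gateway and never arrives.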
Let's deploy our test application to see the problem:
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webpod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webpod
  template:
    metadata:
      labels:
        app: webpod
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - webpod
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: webpod
        image: traefik/whoami
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webpod
  labels:
    app: webpod
spec:
  selector:
    app: webpod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
EOF
# Deploy curl pod on control plane
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
  labels:
    app: curl
spec:
  nodeName: k8s-ctr
  containers:
  - name: curl
    image: nicolaka/netshoot
    command: ["tail"]
    args: ["-f", "/dev/null"]
  terminationGracePeriodSeconds: 0
EOF
The problem: curl from k8s-ctr only reaches pods on the same node, because the remote Pod CIDRs aren't in the routing table!
Cilium BGP Control Plane Configuration
Now let's configure BGP peering. First, check the FRR status on the router:
# SSH to router
sshpass -p 'vagrant' ssh vagrant@router
# Check FRR processes
ss -tnlp | grep -iE 'zebra|bgpd'
ps -ef |grep frr
root 4127 1 0 13:38 ? 00:00:00 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr 4140 1 0 13:38 ? 00:00:00 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
frr 4145 1 0 13:38 ? 00:00:00 /usr/lib/frr/bgpd -d -F traditional -A 127.0.0.1
frr 4152 1 0 13:38 ? 00:00:00 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1
# Check FRR configuration
vtysh -c 'show running'
# Check BGP status - no neighbors yet
vtysh -c 'show ip bgp summary'
% No BGP neighbors found in VRF default
# Check advertised routes - only loop1 network
vtysh -c 'show ip bgp'
BGP table version is 1, local router ID is 192.168.10.200, vrf id 0
Default local pref 100, local AS 65000
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*> 10.10.1.0/24 0.0.0.0 0 32768 i
Configure FRR to accept the Cilium nodes as BGP neighbors. You can do this two ways.
Method 1 — Edit the config file:
cat << EOF >> /etc/frr/frr.conf
neighbor CILIUM peer-group
neighbor CILIUM remote-as external
neighbor 192.168.10.100 peer-group CILIUM
neighbor 192.168.10.101 peer-group CILIUM
neighbor 192.168.20.100 peer-group CILIUM
EOF
systemctl daemon-reexec && systemctl restart frr
systemctl status frr --no-pager --full
Method 2 — Use vtysh interactively:
vtysh
conf
router bgp 65000
neighbor CILIUM peer-group
neighbor CILIUM remote-as external
neighbor 192.168.10.100 peer-group CILIUM
neighbor 192.168.10.101 peer-group CILIUM
neighbor 192.168.20.100 peer-group CILIUM
end
write memory
exit
Start monitoring on the router before configuring Cilium:
# Terminal 1 (router): Monitor FRR logs
journalctl -u frr -f
# Terminal 2 (k8s-ctr): Test connectivity
kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
Now configure Cilium BGP. Label the nodes that should run BGP:
kubectl label nodes k8s-ctr k8s-w0 k8s-w1 enable-bgp=true
kubectl get node -l enable-bgp=true
NAME STATUS ROLES AGE VERSION
k8s-ctr Ready control-plane 3h37m v1.33.2
k8s-w0 Ready <none> 3h32m v1.33.2
k8s-w1 Ready <none> 3h35m v1.33.2
Apply the BGP configuration CRDs:
cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
---
apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  timers:
    holdTimeSeconds: 9
    keepAliveTimeSeconds: 3
  ebgpMultihop: 2
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
---
apiVersion: cilium.io/v2
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      "enable-bgp": "true"
  bgpInstances:
  - name: "instance-65001"
    localASN: 65001
    peers:
    - name: "tor-switch"
      peerASN: 65000
      peerAddress: 192.168.10.200  # router IP address
      peerConfigRef:
        name: "cilium-peer"
EOF
Watch the router logs — you'll see the BGP sessions establish! The three CRDs work together:
- CiliumBGPAdvertisement: Defines what to advertise (PodCIDR)
- CiliumBGPPeerConfig: BGP session parameters (timers, graceful restart)
- CiliumBGPClusterConfig: Which nodes run BGP and peer details
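The aggressive timers in CiliumBGPPeerConfig (hold 9s, keepalive 3s) are worth understanding: per RFC 4271, each side offers a hold time in its OPEN message, the session uses the smaller of the two, and keepalives are conventionally sent at one third of the hold time. A small sketch (the 180s value is FRR's default hold time, stated here as an assumption about the peer):

```python
def negotiated_timers(local_hold, peer_hold):
    """RFC 4271 semantics: the session hold time is the minimum of what
    both sides offer; keepalives are conventionally sent at hold/3."""
    hold = min(local_hold, peer_hold)
    keepalive = hold // 3
    return hold, keepalive

# Cilium offers 9s; FRR's default hold time is 180s
hold, ka = negotiated_timers(9, 180)
print(f"hold={hold}s keepalive={ka}s")  # hold=9s keepalive=3s
```

So a dead peer is detected in at most 9 seconds, instead of the 180 seconds FRR would use on its own.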
Verify BGP operation:
# Check Cilium BGP status
cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k8s-ctr 65001 65000 192.168.10.200 established 25s ipv4/unicast 1 1
k8s-w1 65001 65000 192.168.10.200 established 25s ipv4/unicast 1 1
k8s-w0 65001 65000 192.168.10.200 established 25s ipv4/unicast 1 1
cilium bgp routes available ipv4 unicast
Node VRouter Prefix NextHop Age Attrs
k8s-ctr 65001 172.20.0.0/24 0.0.0.0 2m5s [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w1 65001 172.20.1.0/24 0.0.0.0 2m5s [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w0 65001 172.20.2.0/24 0.0.0.0 2m5s [{Origin: i} {Nexthop: 0.0.0.0}]
# Check Kubernetes CRDs
kubectl get ciliumbgpadvertisements,ciliumbgppeerconfigs,ciliumbgpclusterconfigs
kubectl get ciliumbgpnodeconfigs -o yaml | yq
On the router, verify the BGP-learned routes:
# Terminal 1 (router)
journalctl -u frr -f
Aug 09 14:31:40 router bgpd[4665]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 192.168.20.100 in vrf default
Aug 09 14:31:40 router bgpd[4665]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 192.168.10.101 in vrf default
Aug 09 14:31:40 router bgpd[4665]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 192.168.10.100 in vrf default
# Check kernel routing table - BGP routes installed!
ip -c route | grep bgp
172.20.0.0/24 nhid 32 via 192.168.10.100 dev eth1 proto bgp metric 20
172.20.1.0/24 nhid 30 via 192.168.10.101 dev eth1 proto bgp metric 20
172.20.2.0/24 nhid 31 via 192.168.20.100 dev eth2 proto bgp metric 20
vtysh -c 'show ip bgp summary'
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
192.168.10.100 4 65001 509 511 0 0 0 00:25:15 1 4 N/A
192.168.10.101 4 65001 508 511 0 0 0 00:25:15 1 4 N/A
192.168.20.100 4 65001 509 511 0 0 0 00:25:15 1 4 N/A
vtysh -c 'show ip bgp'
Network Next Hop Metric LocPrf Weight Path
*> 10.10.1.0/24 0.0.0.0 0 32768 i
*> 172.20.0.0/24 192.168.10.100 0 65001 i
*> 172.20.1.0/24 192.168.10.101 0 65001 i
*> 172.20.2.0/24 192.168.20.100 0 65001 i
But there's still a problem — the k8s nodes don't have routes to each other's Pod CIDRs! Let's capture BGP traffic to see what's happening:
# k8s-ctr: Capture BGP traffic
tcpdump -i eth1 tcp port 179 -w /tmp/bgp.pcap
# Router: Restart FRR to trigger BGP updates
systemctl restart frr && journalctl -u frr -f
# Analyze the capture
termshark -r /tmp/bgp.pcap
# Filter: bgp.type == 2 (UPDATE messages)
You'll see BGP UPDATE messages from the router, but check the node routes:
cilium bgp routes
ip -c route
The routes from the router aren't in the kernel! This is by design — Cilium's BGP implementation is control-plane only. It advertises routes but doesn't install received routes into the kernel FIB. Instead, Cilium uses eBPF for packet forwarding.
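The manual workaround that follows relies on a simple fact: the cluster-pool /16 from the Helm install covers every per-node /24 PodCIDR, so a single static route via the router reaches all remote pods. A quick check with Python's `ipaddress` module (CIDRs copied from the cluster above):

```python
import ipaddress

# Per-node PodCIDRs observed in `cilium bgp routes`
pod_cidrs = ["172.20.0.0/24", "172.20.1.0/24", "172.20.2.0/24"]
# clusterPoolIPv4PodCIDRList from the Helm install
supernet = ipaddress.ip_network("172.20.0.0/16")

# Every per-node /24 is contained in the /16, so one static route
# via the router covers all remote pods
assert all(ipaddress.ip_network(c).subnet_of(supernet) for c in pod_cidrs)
print(f"single route to {supernet} covers all node PodCIDRs")
```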
For our multi-NIC setup, we need to manually add routes on the nodes:
# Add routes for the entire Pod CIDR via the router
ip route add 172.20.0.0/16 via 192.168.10.200
sshpass -p 'vagrant' ssh vagrant@k8s-w1 sudo ip route add 172.20.0.0/16 via 192.168.10.200
sshpass -p 'vagrant' ssh vagrant@k8s-w0 sudo ip route add 172.20.0.0/16 via 192.168.20.200
# Verify router has BGP-learned routes
sshpass -p 'vagrant' ssh vagrant@router ip -c route | grep bgp
172.20.0.0/24 nhid 64 via 192.168.10.100 dev eth1 proto bgp metric 20
172.20.1.0/24 nhid 60 via 192.168.10.101 dev eth1 proto bgp metric 20
172.20.2.0/24 nhid 62 via 192.168.20.100 dev eth2 proto bgp metric 20
# Now connectivity works!
kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
Hostname: webpod-697b545f57-jpjbd
---
Hostname: webpod-697b545f57-hnvxq
---
Hostname: webpod-697b545f57-nfhk8
---
Monitor with Hubble:
cilium hubble port-forward&
hubble status
hubble observe -f --protocol tcp --pod curl-pod
Node Maintenance
BGP makes node maintenance elegant. Let's drain a node and see what happens:
# Monitor connectivity
kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
# Monitor BGP logs (optional)
kubectl logs -n kube-system -l name=cilium-operator -f | grep "subsys=bgp-cp-operator"
kubectl logs -n kube-system -l k8s-app=cilium -f | grep "subsys=bgp-control-plane"
# Drain k8s-w0 for maintenance
kubectl drain k8s-w0 --ignore-daemonsets
kubectl label nodes k8s-w0 enable-bgp=false --overwrite
Check the BGP status:
kubectl get node
kubectl get ciliumbgpnodeconfigs
cilium bgp routes
cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k8s-ctr 65001 65000 192.168.10.200 established 2h13m35s ipv4/unicast 3 2
k8s-w1 65001 65000 192.168.10.200 established 2h13m36s ipv4/unicast 3 2
# Router view - k8s-w0's route is gone!
sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp summary'"
sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp'"
sshpass -p 'vagrant' ssh vagrant@router ip -c route | grep bgp
172.20.0.0/24 nhid 64 via 192.168.10.100 dev eth1 proto bgp metric 20
172.20.1.0/24 nhid 60 via 192.168.10.101 dev eth1 proto bgp metric 20
The k8s-w0 node gracefully withdrew its routes! Traffic continues uninterrupted. Restore the node:
kubectl label nodes k8s-w0 enable-bgp=true --overwrite
kubectl uncordon k8s-w0
# Verify restoration
kubectl get node
kubectl get ciliumbgpnodeconfigs
cilium bgp routes
cilium bgp peers
# Redistribute pods
kubectl scale deployment webpod --replicas 0
kubectl scale deployment webpod --replicas 3
For automatic pod redistribution, consider using Descheduler — it evicts pods based on policies to maintain a balanced distribution.
Advertising LoadBalancer Service IPs via BGP
Let's advertise LoadBalancer IPs through BGP. First, create an IP pool:
cat << EOF | kubectl apply -f -
apiVersion: "cilium.io/v2"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "cilium-pool"
spec:
  allowFirstLastIPs: "No"
  blocks:
  - cidr: "172.16.1.0/24"
EOF
kubectl get ippool
NAME DISABLED CONFLICTING IPS AVAILABLE AGE
cilium-pool false False 254 8s
Convert our service to LoadBalancer:
kubectl patch svc webpod -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get svc webpod
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
webpod LoadBalancer 10.96.39.92 172.16.1.1 80:30800/TCP 3h56m
# Pool now shows one IP allocated
kubectl get ippool
NAME DISABLED CONFLICTING IPS AVAILABLE AGE
cilium-pool false False 253 2m23s
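The pool arithmetic is easy to verify: allowFirstLastIPs: "No" withholds the first (.0) and last (.255) addresses of the block, and each LoadBalancer allocation decrements the available count. A quick sketch:

```python
import ipaddress

pool = ipaddress.ip_network("172.16.1.0/24")
# allowFirstLastIPs: "No" -> .0 and .255 are never handed out
allocatable = pool.num_addresses - 2
print(allocatable)              # 254, the initial AVAILABLE count
allocated = 1                   # webpod's LoadBalancer took 172.16.1.1
print(allocatable - allocated)  # 253, matching the second listing
```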
# Check service details
kubectl describe svc webpod | grep 'Traffic Policy'
External Traffic Policy: Cluster
Internal Traffic Policy: Cluster
# Verify Cilium service list
kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg service list
ID Frontend Service Type Backend
...
16 172.16.1.1:80/TCP LoadBalancer 1 => 172.20.0.229:80/TCP (active)
2 => 172.20.1.158:80/TCP (active)
3 => 172.20.2.219:80/TCP (active)
Test from within the cluster:
LBIP=$(kubectl get svc webpod -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $LBIP
172.16.1.1
curl -s $LBIP | grep Hostname
Hostname: webpod-697b545f57-jpjbd
Now advertise this LoadBalancer IP via BGP:
# Monitor router routes
watch "sshpass -p 'vagrant' ssh vagrant@router ip -c route"
# Create BGP advertisement for LoadBalancer IPs
cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-lb-exip-webpod
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: app, operator: In, values: [ webpod ] }
EOF
kubectl get CiliumBGPAdvertisement
NAME AGE
bgp-advertisements 2m1s
bgp-advertisements-lb-exip-webpod 3s
Check the BGP route policies:
kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bgp route-policies
VRouter Policy Name Type Match Peers Match Families Match Prefixes (Min..Max Len) RIB Action Path Actions
65001 allow-local import accept
65001 tor-switch-ipv4-PodCIDR export 192.168.10.200/32 172.20.1.0/24 (24..24) accept
65001 tor-switch-ipv4-Service-webpod-default-LoadBalancerIP export 192.168.10.200/32 172.16.1.1/32 (32..32) accept
# All nodes advertise the LoadBalancer IP!
cilium bgp routes available ipv4 unicast
Node VRouter Prefix NextHop Age Attrs
k8s-ctr 65001 172.16.1.1/32 0.0.0.0 32s [{Origin: i} {Nexthop: 0.0.0.0}]
65001 172.20.0.0/24 0.0.0.0 24m41s [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w0 65001 172.16.1.1/32 0.0.0.0 32s [{Origin: i} {Nexthop: 0.0.0.0}]
65001 172.20.2.0/24 0.0.0.0 24m56s [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w1 65001 172.16.1.1/32 0.0.0.0 32s [{Origin: i} {Nexthop: 0.0.0.0}]
65001 172.20.1.0/24 0.0.0.0 24m56s [{Origin: i} {Nexthop: 0.0.0.0}]
The router now has ECMP routes to the LoadBalancer IP:
sshpass -p 'vagrant' ssh vagrant@router ip -c route
...
172.16.1.1 nhid 71 proto bgp metric 20
nexthop via 192.168.10.101 dev eth1 weight 1
nexthop via 192.168.10.100 dev eth1 weight 1
nexthop via 192.168.20.100 dev eth2 weight 1
sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp'"
Network Next Hop Metric LocPrf Weight Path
*> 172.16.1.1/32 192.168.10.100 0 65001 i
*= 192.168.20.100 0 65001 i
*= 192.168.10.101 0 65001 i
sshpass -p 'vagrant' ssh vagrant@router "sudo vtysh -c 'show ip bgp 172.16.1.1/32'"
BGP routing table entry for 172.16.1.1/32, version 7
Paths: (3 available, best #1, table default)
Advertised to non peer-group peers:
192.168.10.100 192.168.10.101 192.168.20.100
65001
192.168.10.100 from 192.168.10.100 (192.168.10.100)
Origin IGP, valid, external, multipath, best (Router ID)
Last update: Sat Aug 9 17:50:29 2025
65001
192.168.20.100 from 192.168.20.100 (192.168.20.100)
Origin IGP, valid, external, multipath
Last update: Sat Aug 9 17:50:29 2025
65001
192.168.10.101 from 192.168.10.101 (192.168.10.101)
Origin IGP, valid, external, multipath
Last update: Sat Aug 9 17:50:29 2025
Test from the router:
LBIP=172.16.1.1
curl -s $LBIP
curl -s $LBIP | grep Hostname
curl -s $LBIP | grep RemoteAddr
# Load balance test
for i in {1..100}; do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
34 Hostname: webpod-697b545f57-jpjbd
33 Hostname: webpod-697b545f57-hnvxq
33 Hostname: webpod-697b545f57-nfhk8
Now scale down to see a problem:
kubectl scale deployment webpod --replicas 2
kubectl get pod -owide
The router still has routes through k8s-ctr even though no pods run there:
# Router still sees all three paths
vtysh -c 'show ip bgp 172.16.1.1/32'
# This causes SNAT when traffic goes through k8s-ctr
for i in {1..100}; do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
while true; do curl -s $LBIP | egrep 'Hostname|RemoteAddr' ; sleep 0.1; done
Hostname: webpod-697b545f57-swtdz
RemoteAddr: 192.168.10.100:40460
Hostname: webpod-697b545f57-87lf2
RemoteAddr: 192.168.10.100:40474
The RemoteAddr shows k8s-ctr's IP because of SNAT! This happens with externalTrafficPolicy: Cluster — a node without a local backend forwards the connection to another node and masquerades it with its own address.
External Traffic Policy Local
To preserve source IPs and advertise only nodes with pods:
# Monitor routes
watch "sshpass -p 'vagrant' ssh vagrant@router ip -c route"
# Change to Local policy
kubectl patch service webpod -p '{"spec":{"externalTrafficPolicy":"Local"}}'
Now the router only sees routes through nodes with pods:
# Router - only nodes with pods advertise!
vtysh -c 'show ip bgp'
vtysh -c 'show ip bgp 172.16.1.1/32'
vtysh -c 'show ip route bgp'
ip -c route
# Start tcpdump on all nodes
# Terminal 1 (k8s-w1)
tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
# Terminal 2 (k8s-w0)
tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
# Terminal 3 (k8s-ctr)
tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
# Test from router - source IP preserved!
LBIP=172.16.1.1
for i in {1..100}; do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
100 Hostname: webpod-697b545f57-lppz4
while true; do curl -s $LBIP | egrep 'Hostname|RemoteAddr' ; sleep 0.1; done
Hostname: webpod-697b545f57-lppz4
RemoteAddr: 192.168.10.200:54312
The source IP is preserved! But all traffic goes to one node because of the ECMP hash.
Linux ECMP Hash Policy
The default Linux ECMP hash policy is L3 (source and destination IPs only), so all traffic between the same pair of endpoints takes a single path. Let's improve this:
# On router - check current distribution
for i in {1..100}; do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
100 Hostname: webpod-697b545f57-lppz4
# Change to L4 hash (includes ports)
sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1
echo "net.ipv4.fib_multipath_hash_policy=1" >> /etc/sysctl.conf
# Test again - better distribution!
for i in {1..100}; do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
59 Hostname: webpod-697b545f57-87lf2
41 Hostname: webpod-697b545f57-swtdz
# Scale to 3 replicas
kubectl scale deployment webpod --replicas 3
kubectl get pod -owide
# Router sees all three paths
ip -c route
for i in {1..100}; do curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
37 Hostname: webpod-697b545f57-bgpv9
35 Hostname: webpod-697b545f57-87lf2
28 Hostname: webpod-697b545f57-swtdz
Much better distribution with L4 hashing!
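The difference between the two policies comes down to what goes into the hash key. A toy simulation in Python (SHA-256 is a stand-in — the kernel uses its own multipath hash — but the key composition is the point):

```python
import hashlib

paths = ["192.168.10.100", "192.168.10.101", "192.168.20.100"]

def pick_path(src_ip, dst_ip, src_port, dst_port, l4=False):
    """Toy ECMP path selection: hash the flow key, modulo the path count."""
    key = f"{src_ip}|{dst_ip}"
    if l4:  # fib_multipath_hash_policy=1 adds the L4 ports to the key
        key += f"|{src_port}|{dst_port}"
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return paths[digest % len(paths)]

# 100 curls from the router: only the source port differs between flows
flows = [("192.168.10.200", "172.16.1.1", sp, 80) for sp in range(40000, 40100)]
l3_paths = {pick_path(*f) for f in flows}           # L3 policy ignores ports
l4_paths = {pick_path(*f, l4=True) for f in flows}  # L4 policy spreads flows
print(f"L3 hash uses {len(l3_paths)} path(s); L4 hash uses {len(l4_paths)} path(s)")
```

With an L3-only key, every flow hashes identically and pins to one next hop; adding ports to the key lets different connections land on different paths.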
DSR with Maglev
Let's compare different load balancing modes. First check current settings:
kubectl exec -it -n kube-system ds/cilium -- cilium status --verbose
...
Mode: SNAT
Backend Selection: Random
Session Affinity: Enabled
...
Now enable DSR (Direct Server Return) with Maglev for consistent hashing:
# Load GENEVE module for DSR
modprobe geneve
lsmod | grep -E 'vxlan|geneve'
for i in w1 w0 ; do echo ">> node : k8s-$i <<"; sshpass -p 'vagrant' ssh vagrant@k8s-$i sudo modprobe geneve ; echo; done
for i in w1 w0 ; do echo ">> node : k8s-$i <<"; sshpass -p 'vagrant' ssh vagrant@k8s-$i sudo lsmod | grep -E 'vxlan|geneve' ; echo; done
# Upgrade to DSR mode
helm upgrade cilium cilium/cilium --version 1.18.0 --namespace kube-system --reuse-values \
--set tunnelProtocol=geneve --set loadBalancer.mode=dsr --set loadBalancer.dsrDispatch=geneve \
--set loadBalancer.algorithm=maglev
kubectl -n kube-system rollout restart ds/cilium
# Verify settings
kubectl exec -it -n kube-system ds/cilium -- cilium status --verbose
...
Mode: DSR
DSR Dispatch Mode: Geneve
Backend Selection: Maglev (Table Size: 16381)
Session Affinity: Enabled
...
# Reset to Cluster policy for testing
kubectl patch svc webpod -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
# Capture on all nodes
tcpdump -i eth1 -w /tmp/dsr.pcap
# Test from router
curl -s $LBIP
curl -s $LBIP
curl -s $LBIP
curl -s $LBIP
curl -s $LBIP
Download and analyze the capture to see the GENEVE encapsulation carrying the original client info for direct return!
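Maglev deserves a quick aside: it builds a fixed-size prime-length lookup table in which each backend claims slots in its own permutation order, giving near-equal shares and minimal disruption when a backend is added or removed. A toy sketch of the table-population step (tiny table, illustrative hashes — not Cilium's implementation, whose table size above is 16381):

```python
import hashlib
from collections import Counter

def h(s, salt):
    """Illustrative stand-in hash for the offset/skip derivation."""
    return int(hashlib.sha256(f"{salt}:{s}".encode()).hexdigest(), 16)

def maglev_table(backends, m=13):  # m must be prime
    offsets = {b: h(b, "offset") % m for b in backends}
    skips = {b: h(b, "skip") % (m - 1) + 1 for b in backends}  # coprime to m
    table, filled = [None] * m, 0
    next_idx = {b: 0 for b in backends}
    while filled < m:
        for b in backends:  # round-robin: each backend claims its next free slot
            while True:
                slot = (offsets[b] + next_idx[b] * skips[b]) % m
                next_idx[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == m:
                break
    return table

table = maglev_table(["172.20.0.229", "172.20.1.158", "172.20.2.219"])
print(Counter(table))  # each backend owns ~m/3 slots (counts differ by at most 1)
```

A flow hash modulo the table length then selects a backend, and because each backend fills slots round-robin, the shares stay balanced within one slot of each other.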
The three main traffic patterns we explored:
- BGP + Local Policy + SNAT: Recommended. Only nodes with pods advertise routes, source IP preserved, but all traffic through ECMP-selected nodes.
- BGP + Cluster Policy + SNAT: Not recommended. All nodes advertise, causes extra hops and SNAT when pods are on different nodes.
- BGP + Cluster Policy + DSR + Maglev: Better than #2. Uses GENEVE to preserve client info, Maglev for consistent hashing, direct return from pod nodes.
Disabling Status Reporting
For large clusters, disable BGP status reporting to reduce API server load:
# Check current status
kubectl get ciliumbgpnodeconfigs -o yaml | yq
# Disable status reporting
helm upgrade cilium cilium/cilium --version 1.18.0 --namespace kube-system --reuse-values \
--set bgpControlPlane.statusReport.enabled=false
kubectl -n kube-system rollout restart ds/cilium
# Verify - status is now empty
kubectl get ciliumbgpnodeconfigs -o yaml | yq
...
"status": {}
This BGP integration transforms Kubernetes from an isolated island into a first-class citizen in your network infrastructure. The combination of BGP route advertisement, ECMP load balancing, and externalTrafficPolicy gives you fine-grained control over traffic patterns. Graceful node maintenance through BGP route withdrawal, plus the choice between SNAT and DSR modes, provides the flexibility needed for production deployments.
Cilium ClusterMesh
Now let me show you how to connect multiple Kubernetes clusters using Cilium ClusterMesh. This is incredibly useful for high availability, disaster recovery, and geographic distribution. I'll use Kind to create two local clusters that we'll mesh together.
Creating the Test Clusters
First, let's create two Kind clusters named west and east. Each will have its own Pod and Service CIDRs:
# Create west cluster
kind create cluster --name west --image kindest/node:v1.33.2 --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30000  # sample apps
    hostPort: 30000
  - containerPort: 30001  # hubble ui
    hostPort: 30001
- role: worker
  extraPortMappings:
  - containerPort: 30002  # sample apps
    hostPort: 30002
networking:
  podSubnet: "10.0.0.0/16"
  serviceSubnet: "10.2.0.0/16"
  disableDefaultCNI: true
  kubeProxyMode: none
EOF
Important settings here:
- disableDefaultCNI: true: We'll install Cilium instead of kindnet
- kubeProxyMode: none: Cilium will replace kube-proxy entirely
- Port mappings for accessing services and Hubble UI from localhost
Install basic tools on west nodes:
docker exec -it west-control-plane sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'
docker exec -it west-worker sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'
Create the east cluster with different CIDRs:
kind create cluster --name east --image kindest/node:v1.33.2 --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 31000  # sample apps
    hostPort: 31000
  - containerPort: 31001  # hubble ui
    hostPort: 31001
- role: worker
  extraPortMappings:
  - containerPort: 31002  # sample apps
    hostPort: 31002
networking:
  podSubnet: "10.1.0.0/16"
  serviceSubnet: "10.3.0.0/16"
  disableDefaultCNI: true
  kubeProxyMode: none
EOF
Install tools on the east nodes:
docker exec -it east-control-plane sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'
docker exec -it east-worker sh -c 'apt update && apt install tree psmisc lsof wget net-tools dnsutils tcpdump ngrep iputils-ping git -y'
Verify both clusters are created:
kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
* kind-east kind-east kind-east
kind-west kind-west kind-west
# Test accessing both clusters
kubectl get node --context kind-west
NAME STATUS ROLES AGE VERSION
west-control-plane NotReady control-plane 45s v1.33.2
west-worker NotReady <none> 23s v1.33.2
kubectl get node --context kind-east
NAME STATUS ROLES AGE VERSION
east-control-plane NotReady control-plane 38s v1.33.2
east-worker NotReady <none> 16s v1.33.2
Nodes are NotReady because we haven't installed a CNI yet. Let's set up aliases for easier management:
alias kwest='kubectl --context kind-west'
alias keast='kubectl --context kind-east'
# Test aliases
kwest get node -owide
keast get node -owide
Installing Cilium with ClusterMesh Support
Install Cilium CLI if you haven't already:
# macOS
brew install cilium-cli
# Linux (including WSL2)
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
Install Cilium on the west cluster with ClusterMesh-specific settings:
cilium install --version 1.17.6 --set ipam.mode=kubernetes \
--set kubeProxyReplacement=true --set bpf.masquerade=true \
--set endpointHealthChecking.enabled=false --set healthChecking=false \
--set operator.replicas=1 --set debug.enabled=true \
--set routingMode=native --set autoDirectNodeRoutes=true --set ipv4NativeRoutingCIDR=10.0.0.0/16 \
--set ipMasqAgent.enabled=true --set ipMasqAgent.config.nonMasqueradeCIDRs='{10.1.0.0/16}' \
--set cluster.name=west --set cluster.id=1 \
--context kind-west
Critical settings for ClusterMesh:
- ipv4NativeRoutingCIDR=10.0.0.0/16: West's own Pod CIDR for direct routing
- nonMasqueradeCIDRs='{10.1.0.0/16}': Don't NAT traffic to east's Pod CIDR
- cluster.name=west --set cluster.id=1: Unique cluster identification
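This setup only works because the two clusters' Pod and Service CIDRs never overlap — otherwise cross-cluster routing would be ambiguous. A quick sanity check with Python's `ipaddress` module (CIDRs taken from the two kind configs):

```python
import ipaddress

cidrs = {
    "west-pods": "10.0.0.0/16", "west-svcs": "10.2.0.0/16",
    "east-pods": "10.1.0.0/16", "east-svcs": "10.3.0.0/16",
}
nets = {name: ipaddress.ip_network(c) for name, c in cidrs.items()}

# Every pair of CIDRs must be disjoint across the mesh
names = list(nets)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        assert not nets[a].overlaps(nets[b]), f"{a} overlaps {b}"
print("all CIDRs disjoint — safe to mesh")
```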
Watch the installation:
watch kubectl get pod -n kube-system --context kind-west
Install Cilium on the east cluster:
cilium install --version 1.17.6 --set ipam.mode=kubernetes \
--set kubeProxyReplacement=true --set bpf.masquerade=true \
--set endpointHealthChecking.enabled=false --set healthChecking=false \
--set operator.replicas=1 --set debug.enabled=true \
--set routingMode=native --set autoDirectNodeRoutes=true --set ipv4NativeRoutingCIDR=10.1.0.0/16 \
--set ipMasqAgent.enabled=true --set ipMasqAgent.config.nonMasqueradeCIDRs='{10.0.0.0/16}' \
--set cluster.name=east --set cluster.id=2 \
--context kind-east
watch kubectl get pod -n kube-system --context kind-east
Verify the installations:
kwest get pod -A && keast get pod -A
cilium status --context kind-west
cilium status --context kind-east
# Check configuration
cilium config view --context kind-west | grep -E "cluster-|masq"
cilium config view --context kind-east | grep -E "cluster-|masq"
# Detailed status
kwest exec -it -n kube-system ds/cilium -- cilium status --verbose
keast exec -it -n kube-system ds/cilium -- cilium status --verbose
Check the IP masquerade configuration:
kwest -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg bpf ipmasq list
IP PREFIX/ADDRESS
10.0.0.0/8
10.1.0.0/16
keast -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg bpf ipmasq list
IP PREFIX/ADDRESS
10.0.0.0/16
10.0.0.0/8
Each cluster won't masquerade traffic to the other's Pod CIDR!
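The ip-masq-agent decision is simple: SNAT a packet only if its destination matches no entry in the nonMasqueradeCIDRs list. A minimal sketch using west's entries from the `cilium-dbg bpf ipmasq list` output above:

```python
import ipaddress

# West's ipmasq entries from the output above
non_masq = [ipaddress.ip_network(c) for c in ("10.0.0.0/8", "10.1.0.0/16")]

def masquerade(dst):
    """ip-masq-agent semantics: SNAT only when the destination
    matches no nonMasqueradeCIDRs entry."""
    return not any(ipaddress.ip_address(dst) in net for net in non_masq)

print(masquerade("10.1.0.50"))  # east pod -> False (source pod IP preserved)
print(masquerade("8.8.8.8"))    # internet -> True (SNAT to the node IP)
```

That preserved pod source IP is what lets Cilium apply identity-aware policy to cross-cluster traffic.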
Setting Up ClusterMesh
Check initial routing tables — no routes between clusters yet:
docker exec -it west-control-plane ip -c route
docker exec -it west-worker ip -c route
docker exec -it east-control-plane ip -c route
docker exec -it east-worker ip -c route
Step 1: Synchronize the Certificate Authority.
ClusterMesh requires shared CA certificates:
# Check existing CA in east
keast get secret -n kube-system cilium-ca
NAME TYPE DATA AGE
cilium-ca Opaque 2 5m12s
# Delete and replace with west's CA
keast delete secret -n kube-system cilium-ca
kubectl --context kind-west get secret -n kube-system cilium-ca -o yaml | \
kubectl --context kind-east create -f -
# Verify
keast get secret -n kube-system cilium-ca
Step 2: Enable ClusterMesh on both clusters.
Start monitoring in separate terminals:
# Terminal 1
cilium clustermesh status --context kind-west --wait
# Terminal 2
cilium clustermesh status --context kind-east --wait
Enable ClusterMesh:
cilium clustermesh enable --service-type NodePort --enable-kvstoremesh=false --context kind-west
cilium clustermesh enable --service-type NodePort --enable-kvstoremesh=false --context kind-east
This creates clustermesh-apiserver deployments. Check them:
kwest get svc,ep -n kube-system clustermesh-apiserver
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/clustermesh-apiserver NodePort 10.2.216.182 <none> 2379:32379/TCP 65s
NAME ENDPOINTS AGE
endpoints/clustermesh-apiserver 10.0.0.195:2379 65s
kwest get pod -n kube-system -owide | grep clustermesh
clustermesh-apiserver-7d6b9c4b7f-xmxqt 1/1 Running 0 87s 10.0.0.195 west-control-plane <none>
keast get svc,ep -n kube-system clustermesh-apiserver
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/clustermesh-apiserver NodePort 10.3.252.188 <none> 2379:32379/TCP 43s
NAME ENDPOINTS AGE
endpoints/clustermesh-apiserver 10.1.0.206:2379 43s
Step 3: Connect the clusters:
# Monitor in separate terminals
watch -d "cilium clustermesh status --context kind-west --wait"
watch -d "cilium clustermesh status --context kind-east --wait"
# Connect them
cilium clustermesh connect --context kind-west --destination-context kind-east
Verify connection:
cilium clustermesh status --context kind-west --wait
✅ Service "clustermesh-apiserver" of type "NodePort" found
✅ Cluster access information is available:
- 172.18.0.3:32379
✅ Deployment clustermesh-apiserver is ready
✅ All 1 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]
🔌 Cluster Connections:
- east: 2/2 configured, 2/2 connected
cilium clustermesh status --context kind-east --wait
✅ Service "clustermesh-apiserver" of type "NodePort" found
✅ Cluster access information is available:
- 172.18.0.4:32379
✅ Deployment clustermesh-apiserver is ready
✅ All 1 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]
🔌 Cluster Connections:
- west: 2/2 configured, 2/2 connected
Check detailed mesh status:
kubectl exec -it -n kube-system ds/cilium -c cilium-agent --context kind-west -- cilium-dbg troubleshoot clustermesh
kubectl exec -it -n kube-system ds/cilium -c cilium-agent --context kind-east -- cilium-dbg troubleshoot clustermesh
kwest exec -it -n kube-system ds/cilium -- cilium status --verbose | grep -A5 ClusterMesh
ClusterMesh: 1/1 remote clusters ready, 0 global-services
east: ready, 2 nodes, 4 endpoints, 3 identities, 0 services, 0 MCS-API service exports, 0 reconnections (last: never)
└ etcd: 1/1 connected, leases=0, lock leases=0, has-quorum=true: endpoint status checks are disabled, ID: c6ba18866da7dfd8
└ remote configuration: expected=true, retrieved=true, cluster-id=2, kvstoremesh=false, sync-canaries=true, service-exports=disabled
└ synchronization status: nodes=true, endpoints=true, identities=true, services=true
Check Helm values to see the configuration:
helm get values -n kube-system cilium --kube-context kind-west | grep -A10 clustermesh
clustermesh:
apiserver:
kvstoremesh:
enabled: false
service:
type: NodePort
config:
clusters:
- ips:
- 172.18.0.4
name: east
port: 32379
enabled: true
Now check routing tables again — routes to the other cluster's Pod CIDRs are added!
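Before re-checking, it helps to know what to expect. With native routing and autoDirectNodeRoutes=true, once node information syncs across the mesh, each node ends up with one route per remote node's PodCIDR via that node's IP. A throwaway sketch (my illustration, not Cilium code) of the routes west-worker should gain, using the CIDR/next-hop pairs you'll see in the outputs below:

```shell
# Illustration only: the direct node routes that appear once the mesh
# syncs node information, one per remote PodCIDR via the remote node IP.
gen_routes() {
  while read -r cidr nexthop; do
    echo "ip route replace $cidr via $nexthop dev eth0"
  done
}

gen_routes <<'EOF'
10.1.0.0/24 172.18.0.4
10.1.1.0/24 172.18.0.3
EOF
```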
docker exec -it west-worker ip -c route | grep 10.1
10.1.0.0/24 via 172.18.0.4 dev eth0 proto kernel
10.1.1.0/24 via 172.18.0.3 dev eth0 proto kernel
docker exec -it east-worker ip -c route | grep 10.0
10.0.0.0/24 via 172.18.0.2 dev eth0 proto kernel
10.0.1.0/24 via 172.18.0.6 dev eth0 proto kernel
Enable Hubble for Visualization
Enable Hubble on west:
helm upgrade cilium cilium/cilium --version 1.17.6 --namespace kube-system --reuse-values \
--set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort --set hubble.ui.service.nodePort=30001 --kube-context kind-west
kwest -n kube-system rollout restart ds/cilium
Enable Hubble on east:
helm upgrade cilium cilium/cilium --version 1.17.6 --namespace kube-system --reuse-values \
--set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort --set hubble.ui.service.nodePort=31001 --kube-context kind-east
keast -n kube-system rollout restart ds/cilium
Verify Hubble services:
kwest get svc,ep -n kube-system hubble-ui
keast get svc,ep -n kube-system hubble-ui
# Access Hubble UIs
open http://localhost:30001 # West cluster
open http://localhost:31001 # East cluster
Pod-to-Pod Communication Across Clusters
Deploy test pods in both clusters:
cat << EOF | kubectl apply --context kind-west -f -
apiVersion: v1
kind: Pod
metadata:
name: curl-pod
labels:
app: curl
spec:
containers:
- name: curl
image: nicolaka/netshoot
command: ["tail"]
args: ["-f", "/dev/null"]
terminationGracePeriodSeconds: 0
EOF
cat << EOF | kubectl apply --context kind-east -f -
apiVersion: v1
kind: Pod
metadata:
name: curl-pod
labels:
app: curl
spec:
containers:
- name: curl
image: nicolaka/netshoot
command: ["tail"]
args: ["-f", "/dev/null"]
terminationGracePeriodSeconds: 0
EOF
Check pod IPs:
kwest get pod -owide && keast get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
curl-pod 1/1 Running 0 43s 10.0.0.144 west-control-plane <none> <none>
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
curl-pod 1/1 Running 0 36s 10.1.0.128 east-control-plane <none> <none>
Test cross-cluster connectivity:
# West to East
kubectl exec -it curl-pod --context kind-west -- ping -c 1 10.1.0.128
PING 10.1.0.128 (10.1.0.128) 56(84) bytes of data.
64 bytes from 10.1.0.128: icmp_seq=1 ttl=62 time=0.877 ms
# Start continuous ping
kubectl exec -it curl-pod --context kind-west -- ping 10.1.0.128
Check on the destination — no NAT, source IP preserved!
# Terminal 1: tcpdump on destination pod
kubectl exec -it curl-pod --context kind-east -- tcpdump -i eth0 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:23:45.123456 IP 10.0.0.144 > 10.1.0.128: ICMP echo request, id 1, seq 1, length 64
08:23:45.123489 IP 10.1.0.128 > 10.0.0.144: ICMP echo reply, id 1, seq 1, length 64
Source IP 10.0.0.144 is preserved — no masquerading!
Test the reverse direction:
kubectl exec -it curl-pod --context kind-east -- ping -c 1 10.0.0.144
PING 10.0.0.144 (10.0.0.144) 56(84) bytes of data.
64 bytes from 10.0.0.144: icmp_seq=1 ttl=62 time=1.24 ms
Global Services — Load Balancing Across Clusters
Deploy identical services in both clusters with global annotation:
cat << EOF | kubectl apply --context kind-west -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: webpod
spec:
replicas: 2
selector:
matchLabels:
app: webpod
template:
metadata:
labels:
app: webpod
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- webpod
topologyKey: "kubernetes.io/hostname"
containers:
- name: webpod
image: traefik/whoami
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: webpod
labels:
app: webpod
annotations:
service.cilium.io/global: "true"
spec:
selector:
app: webpod
ports:
- protocol: TCP
port: 80
targetPort: 80
type: ClusterIP
EOF
Deploy the same in east:
cat << EOF | kubectl apply --context kind-east -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: webpod
spec:
replicas: 2
selector:
matchLabels:
app: webpod
template:
metadata:
labels:
app: webpod
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- webpod
topologyKey: "kubernetes.io/hostname"
containers:
- name: webpod
image: traefik/whoami
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: webpod
labels:
app: webpod
annotations:
service.cilium.io/global: "true"
spec:
selector:
app: webpod
ports:
- protocol: TCP
port: 80
targetPort: 80
type: ClusterIP
EOF
The key is service.cilium.io/global: "true" - this makes the service span both clusters!
Check the service endpoints — west sees all pods from both clusters:
kwest exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID Frontend Service Type Backend
13 10.2.47.149:80/TCP ClusterIP 1 => 10.0.0.46:80/TCP (active)
2 => 10.0.1.219:80/TCP (active)
3 => 10.1.0.190:80/TCP (active)
4 => 10.1.1.41:80/TCP (active)
Test load balancing across clusters:
for i in {1..100}; do
kubectl exec -i curl-pod --context kind-west -- sh -c "curl -s --connect-timeout 1 webpod | grep Hostname"
done | sort | uniq -c | sort -nr
31 Hostname: webpod-697b545f57-rcp4r
29 Hostname: webpod-697b545f57-jjpz9
23 Hostname: webpod-697b545f57-gkh6r
17 Hostname: webpod-697b545f57-75q89
Traffic is spread across all four pods in both clusters! But roughly half the requests now cross the cluster boundary, which adds latency.
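A quick aside on the measurement itself: uniq -c only counts adjacent duplicates, which is why the first sort is required, and the trailing sort -nr ranks pods by hit count. Demonstrated on canned input (no cluster needed):

```shell
# The counting pipeline from the test above, on fake hostnames
printf 'Hostname: webpod-a\nHostname: webpod-b\nHostname: webpod-a\n' \
  | sort | uniq -c | sort -nr
```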
Service Affinity — Preferring Local Cluster
To prefer pods in the same cluster, add affinity annotation:
kwest annotate service webpod service.cilium.io/affinity=local --overwrite
keast annotate service webpod service.cilium.io/affinity=local --overwrite
Check the service list again — notice "preferred" backends:
kwest exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID Frontend Service Type Backend
13 10.2.47.149:80/TCP ClusterIP 1 => 10.0.0.46:80/TCP (active) (preferred)
2 => 10.0.1.219:80/TCP (active) (preferred)
3 => 10.1.0.190:80/TCP (active)
4 => 10.1.1.41:80/TCP (active)
West cluster prefers its local pods! Test again:
for i in {1..100}; do
kubectl exec -i curl-pod --context kind-west -- sh -c "curl -s --connect-timeout 1 webpod | grep Hostname"
done | sort | uniq -c | sort -nr
52 Hostname: webpod-697b545f57-rcp4r
48 Hostname: webpod-697b545f57-jjpz9
Now traffic stays within the west cluster! The remote endpoints are only used if local ones are unavailable.
You can also set remote preference (for testing):
kwest annotate service webpod service.cilium.io/affinity=remote --overwrite
keast annotate service webpod service.cilium.io/affinity=remote --overwrite
kwest exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID Frontend Service Type Backend
13 10.2.47.149:80/TCP ClusterIP 1 => 10.1.0.190:80/TCP (active) (preferred)
2 => 10.1.1.41:80/TCP (active) (preferred)
3 => 10.0.0.46:80/TCP (active)
4 => 10.0.1.219:80/TCP (active)
Now east endpoints are preferred from west!
Controlling Service Sharing
You can disable endpoint synchronization for specific services:
kwest annotate service webpod service.cilium.io/shared=false --overwrite
# Check east - no west endpoints anymore
keast exec -it -n kube-system ds/cilium -c cilium-agent -- cilium service list --clustermesh-affinity
ID Frontend Service Type Backend
13 10.3.105.173:80/TCP ClusterIP 1 => 10.1.1.190:80/TCP (active) (preferred)
2 => 10.1.1.41:80/TCP (active) (preferred)
With shared=false, west no longer exports its endpoints, so east sees only its own pods; west itself still receives east's endpoints, since east keeps sharing.
For production, I recommend this combination of annotations:
apiVersion: v1
kind: Service
metadata:
annotations:
service.cilium.io/global: "true" # Share across clusters
service.cilium.io/affinity: "local" # Prefer local endpoints
service.cilium.io/shared: "true" # Allow endpoint sync
This gives you:
- High availability (failover to remote cluster)
- Optimal performance (prefer local cluster)
- Reduced cross-cluster traffic and costs
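Putting the recommendation together, a production Service manifest would look like this (a sketch based on the webpod example above; adjust names, labels, and ports for your workload):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webpod
  annotations:
    service.cilium.io/global: "true"    # share endpoints across the mesh
    service.cilium.io/affinity: "local" # prefer same-cluster backends
    service.cilium.io/shared: "true"    # export endpoints (the default)
spec:
  selector:
    app: webpod
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```

Remember that a global service must exist with the same name and namespace in every participating cluster, as we did with webpod.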
Monitoring with Hubble
Open Hubble UI to see the cross-cluster flows:
# West cluster
open http://localhost:30001
# East cluster
open http://localhost:31001
In Hubble, you'll see cluster names in the flow visualization — this only appears with ClusterMesh enabled. You can filter by cluster, see cross-cluster traffic patterns, and identify potential optimizations.
Troubleshooting ClusterMesh
If clusters aren't connecting, check:
# Verify cluster configuration
cilium clustermesh status --context kind-west
cilium clustermesh status --context kind-east
# Check connectivity from Cilium agents
kubectl exec -it -n kube-system ds/cilium -c cilium-agent --context kind-west -- cilium-dbg troubleshoot clustermesh
# Verify clustermesh-apiserver is accessible
kwest get svc -n kube-system clustermesh-apiserver
keast run test --image=nicolaka/netshoot --rm -it --restart=Never -- curl -v telnet://172.18.0.3:32379
# Check logs
kwest logs -n kube-system deployment/clustermesh-apiserver
kwest logs -n kube-system ds/cilium -c cilium-agent | grep cluster
# Verify certificates match
kwest get secret -n kube-system cilium-ca -o yaml | grep ca.crt | head -1
keast get secret -n kube-system cilium-ca -o yaml | grep ca.crt | head -1
ClusterMesh transforms multiple Kubernetes clusters into a unified network fabric. The ability to have pods communicate directly across clusters without NAT, combined with intelligent service load balancing and affinity controls, enables sophisticated multi-cluster architectures. Whether you're building for high availability, geographic distribution, or disaster recovery, ClusterMesh provides the networking foundation you need.
The key takeaway is that with proper configuration — especially service affinity settings — you can have the benefits of a global service mesh while maintaining optimal network paths and minimizing cross-cluster traffic. This is crucial for both performance and cost optimization in multi-region deployments.