This is a story of “Proxy ARP” going rogue. Writing it down took longer than I expected, so it’s split into two posts.
In this first part we explain what proxy ARP is and how it’s used in GRNET Ganeti clusters to provide public IPv4 addresses to guest vms. I’m going to investigate a particular incident where certain hosts caused a denial of service (DoS) by hijacking all IPv4 addresses within a VLAN.
In the second part we track down this particular behavior by reading the linux source code, setting up a Debian Buster testbed environment with network namespaces, and playing around with python scapy, eBPF Compiler Collection toolkit and linux kernel static tracepoints.
I assume the reader is familiar with basic linux networking. Even if not, do read on if you fancy linux kernel and low level networking stuff.
ARP Proxy going rogue, part 1: the incident
Ganeti “routed” networks for kvm guests
When it comes to network connectivity for KVM guests, the simplest solution is attaching tap interfaces to a linux bridge. The downside of this approach is that all guests reside on the same layer 2 broadcast domain, so all kinds of layer 2 shortcomings are present: sniffing of broadcast traffic, ARP poisoning, MiTM, IP hijacking etc. Given that guests are considered untrusted, a linux bridge (without further safeguards) is not the wisest option for a cloud environment.
At GRNET we extensively use the “routed” network flavor of Ganeti to provide IP connectivity to untrusted KVM guests with public addresses. The main advantage of this approach as opposed to linux bridge is that the guests do not reside on the same layer2/broadcast domain but get the “feeling” they do.
Essentially, with “routed” networks the physical host acts as a router for the guests. Guests still get a public IPv4 address, but are not directly connected to the vlan. The host hides the fact that guests are living on it, attracts all traffic targeting its guests and then routes that traffic to the corresponding guest interface. The broadcast domain is effectively segmented multiple times and the host fully controls the traffic passing through it, thus preventing malicious behavior by guests.
Brief overview of Ganeti “routed” networks
I’m now going to briefly illustrate how “routed” networks work under the hood. Suppose we have a host, bare-metal-node-0, with two kvm guests connected to vlan 90, subnet 188.8.131.52/24. Each guest has a public IPv4 address assigned to its eth0 interface, as well as a default gateway set to 184.108.40.206. For the rest of this section I’ll only refer to guest0, attached to tap0, to keep things simpler.
                                             guest0
                             +----+  +-------------------
bare-metal-node0   +---------|tap0|--|eth0:220.127.116.11/24
                   |         +----+  +-------------------
+-------+     +----------+
| bond0 |-----| bond0.90 |
+-------+     +----------+                   guest1
                   |         +----+  +-------------------
                   +---------|tap1|--|eth0:18.104.22.168/24
                             +----+  +-------------------
note: bond0 is a logical interface aggregating host’s physical interfaces and bond0.90 is the interface where vlan traffic gets tagged and untagged.
The host routes traffic, from and towards the guest, via a set of ip rules:
➜ bare-metal-node-0 ~ ip rule
0: from all lookup local
32764: from all iif tap0 lookup public_90
32765: from all iif bond0.90 lookup public_90
32766: from all lookup main
32767: from all lookup default
and a dedicated routing table:
➜ bare-metal-node-0 ~ ip r show table public_90
default via 22.214.171.124 dev bond0.90
126.96.36.199/24 dev bond0.90 scope link
188.8.131.52 dev tap0 scope link
What you should take from these snippets is that traffic from tap0, as well as traffic from bond0.90, results in lookups on a separate routing table rather than the physical host’s main routing table. This table contains the vlan’s default gateway, reachable through bond0.90, and guest0’s /32 IPv4 address, reachable through the directly connected tap interface.
So when traffic from 184.108.40.206 leaves the guest towards the internet, say 220.127.116.11, it will be forwarded to the default gateway, and once the reply is received it will be forwarded back to tap0.
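To make the lookup order concrete, here is a small Python model of the rules and table above. It is only a sketch of the idea, not how the kernel implements it. All addresses come from the 192.0.2.0/24 and 198.51.100.0/24 documentation ranges, and the `eth-mgmt` interface and the `main` table's contents are made-up assumptions.

```python
import ipaddress

# Hypothetical stand-ins for the routes shown above: (prefix, device, gateway).
PUBLIC_90 = [
    ("0.0.0.0/0", "bond0.90", "192.0.2.1"),   # default via the vlan gateway
    ("192.0.2.0/24", "bond0.90", None),       # the vlan subnet, directly connected
    ("192.0.2.10/32", "tap0", None),          # guest0, behind its tap interface
]
# Hypothetical main table of the host (not shown in the post).
MAIN = [("0.0.0.0/0", "eth-mgmt", "198.51.100.1")]

TABLES = {"public_90": PUBLIC_90, "main": MAIN}

# (incoming interface, table) in priority order; None means "any interface".
RULES = [("tap0", "public_90"), ("bond0.90", "public_90"), (None, "main")]


def lookup(dst, iif):
    """Return (table, device, gateway) from the first rule whose table matches."""
    addr = ipaddress.ip_address(dst)
    for rule_iif, table in RULES:
        if rule_iif is not None and rule_iif != iif:
            continue  # this rule only applies to another incoming interface
        matches = [r for r in TABLES[table] if addr in ipaddress.ip_network(r[0])]
        if matches:
            # Longest prefix wins inside a table.
            best = max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)
            return table, best[1], best[2]
    return None


# Guest traffic towards the internet leaves via the vlan gateway.
print(lookup("203.0.113.7", "tap0"))
# Traffic arriving from the vlan for guest0 is handed to tap0.
print(lookup("192.0.2.10", "bond0.90"))
```

The two calls mirror the outbound and return paths described above: the first resolves to the default route on bond0.90, the second to the /32 route on tap0.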
Hosts within the same subnet know how to reach each other at layer 2 by using the ARP protocol. ARP maps layer 3 IP addresses to layer 2 MAC addresses. ARP is fundamental to all kinds of Ethernet networks (although it comes with zero safeguards).
Briefly, when 18.104.22.168 wants to reach 22.214.171.124, ARP traffic will look like this:
- 126.96.36.199 will emit a broadcast packet “who has 188.8.131.52?”
- the host which has 184.108.40.206 assigned on its interface, here guest0, shall respond with an ARP reply
This exchange lets both hosts record the corresponding layer 2 MAC addresses in their ARP caches for immediate or future use.
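For the curious, this is roughly what the “who-has” request looks like on the wire. The Python sketch below packs the ARP payload for IPv4 over Ethernet with the standard library only. The MAC and IP addresses are made-up placeholders, and the Ethernet header that would carry the broadcast destination is left out.

```python
import socket
import struct


def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build the ARP payload of a "who-has" request (without Ethernet header)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                                           # hardware type: Ethernet
        0x0800,                                      # protocol type: IPv4
        6,                                           # hardware address length
        4,                                           # protocol address length
        1,                                           # opcode: 1 = request, 2 = reply
        bytes.fromhex(sender_mac.replace(":", "")),  # sender MAC
        socket.inet_aton(sender_ip),                 # sender IPv4
        b"\x00" * 6,                                 # target MAC: unknown, that's the question
        socket.inet_aton(target_ip),                 # target IPv4 we are asking about
    )


# Hypothetical addresses, for illustration only.
packet = build_arp_request("02:00:00:00:00:01", "192.0.2.1", "192.0.2.10")
print(len(packet), packet.hex())
```

The payload is 28 bytes long, which is the same “length 28” tcpdump prints for ARP packets.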
The role of Proxy ARP in Ganeti “routed” networks
Since the aforementioned ARP packets only travel within a broadcast domain and, as already said, guests are not in the same broadcast domain as the gateway (or other guests), how does ARP work? How does the physical host attract traffic targeting a guest vm’s IPv4 address so as to route it? The answer is “Proxy ARP”.
Proxy ARP means a host replies to ARP “who-has” requests for IPv4 addresses which it does not actually have configured on its interfaces. This is exactly the case for our Ganeti physical hosts: they respond to ARP requests targeting the IPv4 addresses of the guest virtual machines they host.
Of course, we only want Proxy ARP enabled on particular vlan interfaces (the interfaces used for guests' subnets) and we want each physical host to respond only for the IPv4 addresses of the virtual machines on it. How do we achieve that?
Enable proxy_arp and forwarding on the vlan interface:
echo "1" > /proc/sys/net/ipv4/conf/bond0.90/proxy_arp
echo "1" > /proc/sys/net/ipv4/conf/bond0.90/forwarding
then add guest vm’s address (for which we want the host to respond) on the relevant routing table:
ip r add 220.127.116.11 dev tap0 table public_90
Lookups on the routing table(s) are crucial for the Proxy ARP functionality, and we will examine this particular point further later in this post.
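A rough mental model of why the routing lookup matters: with proxy_arp and forwarding enabled, the host answers a “who-has” only when the requested address is routed out of a different interface than the one the request came in on. The Python sketch below encodes just that simplified rule. It is my assumption of the core idea, not the kernel's full logic (which has more conditions), and the addresses are placeholders from the 192.0.2.0/24 documentation range.

```python
import ipaddress

# Hypothetical contents of table public_90: (prefix, outgoing device).
ROUTES = [
    ("0.0.0.0/0", "bond0.90"),
    ("192.0.2.0/24", "bond0.90"),
    ("192.0.2.10/32", "tap0"),      # guest0
]


def out_dev(target_ip, routes):
    """Longest-prefix match; returns the outgoing device, or None if no route."""
    addr = ipaddress.ip_address(target_ip)
    best = None
    for prefix, dev in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    return best[1] if best else None


def answers_arp(target_ip, in_dev, routes, proxy_arp=True, forwarding=True):
    """Simplified rule: reply only if the target is routed via another interface."""
    if not (proxy_arp and forwarding):
        return False
    dev = out_dev(target_ip, routes)
    return dev is not None and dev != in_dev


# Request for guest0 arriving from the vlan: routed to tap0, so the host replies.
print(answers_arp("192.0.2.10", "bond0.90", ROUTES))
# Request for some other vlan address: routed back out bond0.90, so it stays silent.
print(answers_arp("192.0.2.20", "bond0.90", ROUTES))
```

In this model the /32 route towards tap0 is what makes the host speak up for guest0 and for nothing else.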
Of course, none of this is set up manually for every Ganeti node or cluster. Rather, it’s automated by gnt-networking and by pouring some puppet sugar on top. We won’t go into further details though, since it would make this post explode. Let’s assume all this just works.
The incident of Proxy ARP going rogue
Now imagine the aforementioned setup replicated across a dozen physical nodes forming a Ganeti cluster. Each node holds a dozen virtual machines with routed networking, so each node performs Proxy ARP for all the IPv4 addresses of the guests on it.
At some point we noticed IP connectivity problems for guests with routed networks in a particular cluster. The problems affected vms on all of the cluster’s members, although we knew that only a single member was undergoing maintenance work.
Tracerouting one of the affected IPv4 addresses revealed that traffic was looping inside the datacenter between the vlan gateway (DC router) and the physical host we were working on:
user@user-laptop ~ $ traceroute 18.104.22.168
traceroute to 22.214.171.124 (126.96.36.199), 30 hops max, 60 byte packets
 1 int-gw.noc.grnet.gr (188.8.131.52) 0.236 ms 0.201 ms 0.175 ms
 2 grnetnoc-1-gw.eier.access-link.grnet.gr (184.108.40.206) 0.538 ms 0.530 ms 0.507 ms
 3 ypedcfs2-eier-1.backbone.grnet.gr (220.127.116.11) 1.491 ms 1.486 ms 1.480 ms
 4 bare-metal-node-0.grnet.gr (18.104.22.168) 1.450 ms 1.447 ms 1.439 ms
 5 * * *
 6 bare-metal-node-0.grnet.gr (22.214.171.124) 1.550 ms 1.466 ms 1.457 ms
 7 * * *
 8 * * *
 9 * * *
10 * * *
11 * * *
12 bare-metal-node-0.grnet.gr (126.96.36.199) 1.585 ms 1.598 ms 1.583 ms
13 * * *
14 bare-metal-node-0.grnet.gr (188.8.131.52) 1.755 ms 1.751 ms *
15 * * *
16 * * *
17 * * *
18 * bare-metal-node-0.grnet.gr (184.108.40.206) 1.895 ms 1.906 ms
A clear indication that something was going wrong with that particular node. When troubleshooting networking issues, tcpdump is our first thought and resort:
➜ bare-metal-node-0 ~ # tcpdump -ni bond0 arp and ether src a0:36:9f:59:be:ef
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:33:33.841315 ARP, Reply 220.127.116.11 is-at a0:36:9f:59:be:ef, length 28
11:33:34.217271 ARP, Reply 18.104.22.168 is-at a0:36:9f:59:be:ef, length 28
11:33:34.293271 ARP, Reply 22.214.171.124 is-at a0:36:9f:59:be:ef, length 28
11:33:34.405313 ARP, Reply 126.96.36.199 is-at a0:36:9f:59:be:ef, length 28
11:33:34.621281 ARP, Reply 188.8.131.52 is-at a0:36:9f:59:be:ef, length 28
11:33:34.981310 ARP, Reply 184.108.40.206 is-at a0:36:9f:59:be:ef, length 28
11:33:35.201312 ARP, Reply 220.127.116.11 is-at a0:36:9f:59:be:ef, length 28
11:33:35.765317 ARP, Reply 18.104.22.168 is-at a0:36:9f:59:be:ef, length 28
11:33:36.165321 ARP, Reply 22.214.171.124 is-at a0:36:9f:59:be:ef, length 28
11:33:36.213270 ARP, Reply 126.96.36.199 is-at a0:36:9f:59:be:ef, length 28
11:33:36.281279 ARP, Reply 188.8.131.52 is-at a0:36:9f:59:be:ef, length 28
11:33:36.313273 ARP, Reply 184.108.40.206 is-at a0:36:9f:59:be:ef, length 28
11:33:36.753311 ARP, Reply 220.127.116.11 is-at a0:36:9f:59:be:ef, length 28
11:33:37.425313 ARP, Reply 18.104.22.168 is-at a0:36:9f:59:be:ef, length 28
11:33:38.161312 ARP, Reply 22.214.171.124 is-at a0:36:9f:59:be:ef, length 28
11:33:38.181322 ARP, Reply 126.96.36.199 is-at a0:36:9f:59:be:ef, length 28
11:33:38.253305 ARP, Reply 188.8.131.52 is-at a0:36:9f:59:be:ef, length 28
As the snippet shows, the physical host was replying to every single ARP “who-has”, for every IPv4 address in the vlan. This is equivalent to IP hijacking: the host attracted all guests' traffic to itself. The problem is that the node didn’t actually host these guests and didn’t know where to route the traffic (so it was sending it back to the default gateway), thus causing networking mayhem and Denial of Service. :)
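Eyeballing such a capture works, but counting is easier. The following Python sketch groups tcpdump's ARP reply lines by MAC address and reports any MAC claiming more distinct IPv4 addresses than a threshold. The threshold and the sample addresses are hypothetical. A proxy ARP host legitimately answers for its own guests, so the threshold would need to sit above the number of guests it holds.

```python
import re
from collections import defaultdict

# Matches tcpdump lines such as:
# "11:33:33.841315 ARP, Reply 192.0.2.11 is-at a0:36:9f:59:be:ef, length 28"
ARP_REPLY = re.compile(r"ARP, Reply (\S+) is-at ([0-9a-f:]{17})")


def claimed_addresses(lines):
    """Map each MAC address to the set of IPv4 addresses it claims in ARP replies."""
    claims = defaultdict(set)
    for line in lines:
        match = ARP_REPLY.search(line)
        if match:
            ip, mac = match.groups()
            claims[mac].add(ip)
    return claims


def suspects(lines, threshold):
    """Return the MACs that claim more than `threshold` distinct addresses."""
    return {
        mac: ips
        for mac, ips in claimed_addresses(lines).items()
        if len(ips) > threshold
    }


# Hypothetical capture excerpt (addresses from the 192.0.2.0/24 documentation range).
sample = [
    "11:33:33.841315 ARP, Reply 192.0.2.11 is-at a0:36:9f:59:be:ef, length 28",
    "11:33:34.217271 ARP, Reply 192.0.2.12 is-at a0:36:9f:59:be:ef, length 28",
    "11:33:34.293271 ARP, Reply 192.0.2.13 is-at a0:36:9f:59:be:ef, length 28",
    "11:33:34.405313 ARP, Reply 192.0.2.50 is-at 02:00:00:00:00:01, length 28",
]
print(suspects(sample, threshold=2))
```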
We quickly traced the problem to the separate routing table the host had for that particular routed network. Remember, as shown earlier, the routing table must contain at least a default gateway route and a route for the whole vlan’s subnet, like:
➜ bare-metal-node-0 ~ # ip r show table public_90
default via 184.108.40.206 dev bond0.90
220.127.116.11/24 dev bond0.90 scope link
Instead, the routing table was empty, no routes at all! How did that happen?
During the work on that node, an ‘ifdown bond0 ; ifup bond0’ was issued by the operator. When the bond0 interface went down, the vlan interface, bond0.90, went down too. As a consequence, all routing entries related to that device were removed from all routing tables, public_90 included. While ‘ifup bond0’ brings bond0.90 up again, it’s not the same as ‘ifup bond0.90’: the latter would run the scripts in ‘/etc/network/if-up.d/’ for that interface and would have reinstated the entries in the public_90 routing table.
Thus, we simply restored the public_90 routing table and the node indeed stopped the ARP reply storm, resolving the issue.
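A sanity check along the following lines could flag a routed-network table that has lost its baseline routes. This Python sketch is not something we had in place. It only inspects text in the format printed by `ip r show table public_90`, and the subnet and gateway values are placeholders from the 192.0.2.0/24 documentation range.

```python
def route_dev(fields):
    """Return the device named after the `dev` keyword, or None if absent."""
    if "dev" in fields[:-1]:
        return fields[fields.index("dev") + 1]
    return None


def missing_routes(ip_route_output, subnet, device):
    """List the baseline routes (default, vlan subnet) absent from the output."""
    routes = [line.split() for line in ip_route_output.splitlines() if line.strip()]
    present = {fields[0] for fields in routes if route_dev(fields) == device}
    return [wanted for wanted in ("default", subnet) if wanted not in present]


# Hypothetical healthy table versus the empty table we ended up with.
healthy = """default via 192.0.2.1 dev bond0.90
192.0.2.0/24 dev bond0.90 scope link
192.0.2.10 dev tap0 scope link
"""
print(missing_routes(healthy, "192.0.2.0/24", "bond0.90"))
print(missing_routes("", "192.0.2.0/24", "bond0.90"))
```

An empty list means both baseline routes are there. In practice the text would come from running the `ip r show table` command on the node.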
Although we resolved the problem, a question arose: was that expected behavior or not? Namely, should the node respond to every ARP “who-has” on the interface with proxy_arp enabled, given that no routing entry existed for that subnet?
This question bugged me and eventually produced the very post you are reading! :)
You may find the sequel of this story in the second part.
p.s. Thanks to both kargig and cargious, who assisted in investigating this incident.