Why is the kernel dropping frames?


The kernel is dropping packets

At work, when we set up prometheus alerting, we started getting notifications like this:

description = ok10-01.okeanos.grnet.gr:9100 is dropping frames at a rate of 0.10340541666666667 frames per sec

To produce this alert, prometheus-node-exporter reads /proc/net/softnet_stat, which contains one line per CPU. The second column of each line is the number of dropped frames, in hexadecimal.
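For a quick look without the exporter, that column can be decoded by hand; a minimal sketch (assuming GNU awk, which provides strtonum):

    # Print the per-CPU dropped counters from /proc/net/softnet_stat (2nd column, hex).
    awk '{ printf "cpu%d dropped=%d\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat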

A nicer view of the same data is available if we query the node exporter directly from within the host:

➜ ok10-01.okeanos ~  # curl -s http://127.0.0.1:9100/metrics | ag softnet | ag dropped
node_softnet_stats{cpu="cpu0",statistic="dropped"} 10
node_softnet_stats{cpu="cpu1",statistic="dropped"} 3
node_softnet_stats{cpu="cpu10",statistic="dropped"} 3
node_softnet_stats{cpu="cpu11",statistic="dropped"} 3
node_softnet_stats{cpu="cpu12",statistic="dropped"} 0

So what is this alert saying?

That packets have been dropped before the kernel actually pushes them further up the stack. How important is this?

Well, IP is unreliable by design and upper-layer protocols are expected to save the day, e.g. with TCP retransmissions. But if, by tuning the number of packets allowed in the “queue”, we can avoid retransmissions and reduce latency, that is a worthwhile gain.

netdev_max_backlog

Recent commits in our puppet codebase increased net.core.netdev_max_backlog to 4000 for all ~okeanos hardware nodes, as seen in puppet:///modules/okeanos/files/ganeti/netdev_max_backlog.conf

Copying from kernel documentation (it’s pretty descriptive):

netdev_max_backlog

Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
receives packets faster than kernel can process them.

The netdev_max_backlog is a queue within the Linux kernel where traffic is
stored after reception from the NIC, but before processing by the protocol
stacks (IP, TCP, etc). There is one backlog queue per CPU core. A given
core's queue can grow automatically, containing a number of packets up to
the maximum specified by the netdev_max_backlog setting. The
netif_receive_skb() kernel function will find the corresponding CPU for a
packet, and enqueue packets in that CPU's queue. If the queue for that
processor is full and already at maximum size, packets will be dropped.
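For completeness, inspecting and bumping the value at runtime, and persisting it the way the puppet-managed sysctl snippet above presumably does, looks roughly like this (the drop-in filename below is just an example):

    # Read the current value and change it at runtime (not persistent):
    sysctl net.core.netdev_max_backlog
    sysctl -w net.core.netdev_max_backlog=4000

    # Persist it via a sysctl drop-in file and reload:
    echo 'net.core.netdev_max_backlog = 4000' > /etc/sysctl.d/netdev_max_backlog.conf
    sysctl --system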

Reading (once more!) the nice packagecloud.io post about receiving data in Linux, I came across this paragraph:

enqueue_to_backlog is not called in many places. It is called for RPS-enabled packet
processing and also from netif_rx. Most drivers should not be using netif_rx and
should instead be using netif_receive_skb. If you are not using RPS and your driver
is not using netif_rx, increasing the backlog won’t produce any noticeable effect
on your system as it is not used.

meaning that there exist driver implementations that do not take netdev_max_backlog into consideration.

The fact that we kept receiving ‘dropped frames’ alerts during ganeti instance migrations troubled me a bit. Wasn’t the netdev_max_backlog value high enough? There was also the possibility that the driver of the NICs used in these particular hardware nodes was ignoring the setting altogether, which would make our commits useless.

NICs and drivers

The NICs used on our ~okeanos production hardware nodes are Broadcom 1G:

➜ ok10-01.okeanos ~  # lspci| ag etherne
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
04:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
05:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

and the driver is:

➜ ok10-01.okeanos ~  # ethtool -i enp4s0f0
driver: bnx2
version: 2.2.5
firmware-version: bc 5.2.3 NCSI 2.0.12
bus-info: 0000:04:00.0

My initial thought, based on the packagecloud blog post and the igb driver it examines, was that it would be straightforward (hoho) to understand the bnx2 driver: if the driver implements RPS (Receive Packet Steering) and calls netif_receive_skb(), or calls netif_rx(), then it takes the netdev_max_backlog value into consideration.

RPS

First, Broadcom does indeed provide multiple queues per interface, e.g.:

➜ ok10-01.okeanos ~  # ls /sys/class/net/enp4s0f0/queues/
rx-0  rx-1  rx-2  rx-3  rx-4  rx-5  rx-6  rx-7  tx-0  tx-1  tx-2  tx-3  tx-4  tx-5  tx-6  tx-7

But is RPS actually enabled?

Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry:

/sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
the bitmap.

It seems to be disabled:

➜ ok10-01.okeanos ~  # cat /sys/class/net/enp4s0f0/queues/rx-*/rps_cpus
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
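
For reference, enabling RPS would simply mean writing a CPU bitmask into those files; a sketch (not something we have applied on these nodes, and the ff mask allowing CPUs 0-7 is just an example):

    # Allow RPS to steer packets from every rx queue of enp4s0f0 to CPUs 0-7.
    for q in /sys/class/net/enp4s0f0/queues/rx-*/rps_cpus; do
        echo ff > "$q"
    done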

IRQs & affinity

So packets are actually processed by the CPU that got the interrupt:

➜ ok10-01.okeanos ~  # cat /proc/interrupts | grep enp4 | sed -n 's/ \+/ /gp'
 92: 3278739509 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-0
 93: 674359015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-1
 94: 444436070 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-2
 95: 266153880 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-3
 96: 243480907 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-4
 97: 285323199 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-5
 98: 969765600 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-6
 99: 481861371 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f0-7
 101: 205373197 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-0
 102: 304832063 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-1
 103: 971281081 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-2
 104: 180727593 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-3
 105: 163196334 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-4
 106: 180986337 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-5
 107: 665049855 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-6
 108: 1232301015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge enp4s0f1-7

The same holds for the other interfaces.

Since we have not modified the IRQ affinity, all interrupts will end up on CPU0… Or have we modified it? To be precise, the affinity setting permits these IRQs to be handled by CPU0 as well as by other CPUs:

➜ ok10-01.okeanos ~  # cat /proc/irq/9*/smp_affinity_list
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10
0,2,4,6,8,10

but the result is rather dumb: every interrupt ends up on CPU0, in contrast with what the documentation states [3]. There are references on the internet saying that different drivers may handle the same list of allowed CPUs differently.
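
If we wanted to spread the load by hand, the affinity can be narrowed per IRQ; a sketch using the IRQ numbers from the /proc/interrupts output above (note that irqbalance, if running, may rewrite these values):

    # Pin IRQ 92 (enp4s0f0-0) to CPU2 and IRQ 93 (enp4s0f0-1) to CPU4.
    echo 2 > /proc/irq/92/smp_affinity_list
    echo 4 > /proc/irq/93/smp_affinity_list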

IRQs in linux 4.9

Interesting sidenote: does the default IRQ affinity change as we transition from linux 3.16 to 4.9?

We have observed/confirmed that for the Intel x540 10G NIC, linux 4.9 distributes the IRQs across all CPU cores by default. On the contrary, linux 4.9 does not change the default affinity for the NetXen 10G NICs (which our rados nodes at the YPEPTH datacenter feature). As a matter of fact, this different IRQ affinity of the x540 under linux 4.9 might be the reason we observed a considerable performance gain at our second ceph rados jewel cluster at the KNOSSOS datacenter.

Dive into bnx2.c

But let’s get back to RPS. How does RPS correlate with netdev_max_backlog? Why do we need RPS enabled at all? Intuitively, I would guess that the effect of netdev_max_backlog depends on whether RPS is enabled or not, i.e. on whether incoming packet processing is distributed across CPUs or not.

I decided to dive into the kernel’s source code to get a better understanding of the relation between RPS and netdev_max_backlog. The starting point should be the bnx2 driver; from there, find out which core functions are called and how. That’s because similar functionality is implemented differently between drivers, say between bnx2 and ixgbe.

Within the linux source tree, the drivers for ethernet NICs can be found under drivers/net/ethernet and our driver’s code resides in bnx2.c [2].

So how do we “read” an ethernet driver’s source code? Without getting into the details of every specific function, what’s important is to find the path from the driver down to the core networking kernel functions, i.e. the point where the driver calls netif_receive_skb(), for example.
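
In practice this boils down to grepping the driver for the handful of core entry points; a sketch against a local checkout of the v3.16 tree:

    # Find where bnx2.c hands packets over to the core networking code.
    cd drivers/net/ethernet/broadcom
    grep -nE 'netif_rx|netif_receive_skb|napi_gro_receive' bnx2.c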

Taking a look at the net_device_ops struct, which points to all the basic functions that the kernel expects from the driver:

  static const struct net_device_ops bnx2_netdev_ops = {
      .ndo_open       = bnx2_open,

The bnx2_open function is a good candidate to begin our exploration. The function call chain goes like this:

bnx2_open -> bnx2_init_napi -> bnx2_poll_msix -> bnx2_poll_work -> bnx2_rx_int

bnx2_rx_int is the one handling the RX interrupts and it ends up calling napi_gro_receive() [4].

gro-what?

Generic Receive Offload

➜ ok10-01.okeanos ~  # ethtool -k enp4s0f0 | grep receive-offload
generic-receive-offload: on
large-receive-offload: off [fixed]

linux net/core

napi_gro_receive is defined in net/core/dev.c [5] and, at last, we are within the linux core code.

Here the function call chain goes like this:

napi_gro_receive -> napi_skb_finish -> netif_receive_skb_internal

We’re particularly interested in netif_receive_skb_internal [6], since, under certain circumstances, it may call enqueue_to_backlog [7]:

static int netif_receive_skb_internal(struct sk_buff *skb)
#ifdef CONFIG_RPS
    if (static_key_false(&rps_needed)) {
    ...
        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
    ...

static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
    ...
    if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
    ...
drop:
    sd->dropped++;
    ...

These circumstances are important.

First, our kernel is indeed compiled with the CONFIG_RPS option, but we already know this is not enough. We now come across the ‘rps_needed’ static key (a really cool/weird thingy in the linux kernel). This key is false by default and becomes true only when an RPS cpumask is actually configured via sysfs [8].

Are we done?

So here we are. Since RPS is not enabled and the netif_rx function is not called either, we could conclude that enqueue_to_backlog is never called, meaning that netdev_max_backlog is not taken into consideration at all!

I looked at the source code over and over to be sure I hadn’t made any mistake. I watched the /proc/net/softnet_stat of a hardware node again while it was dropping packets. At this point I noticed something quite strange, something I had not paid attention to before:

➜ ok10-01.okeanos ~  # head -n3 /proc/net/softnet_stat
c901dbed 00000013 0001124d 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
012b3b76 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
29c1ec13 00000011 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

So apart from CPU0, other CPUs are dropping packets too. But to drop a packet you have to receive it first! Since all the IRQs end up on CPU0 and RPS is disabled, how come the other CPUs are receiving and processing (or dropping) packets??? :o At this point I got stuck.

(What is more, the received_rps counters in softnet_stat are all zeros.)

Columbus' egg

OK, let’s cut the crap and use some of Brendan Gregg’s tools :P

I decided to resort to dynamic tracing in order to identify who calls the enqueue_to_backlog function, and when. Unfortunately, the ~okeanos production machines were running linux 3.16 and not 4.9, so I could not leverage the astonishing powers of the BPF Compiler Collection (more about bpf in another, huge post).
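The functrace used below is (presumably) the one from Brendan Gregg's perf-tools, which is essentially a thin wrapper around the kernel's ftrace function tracer; doing the same by hand would look roughly like this, assuming debugfs is mounted at /sys/kernel/debug:

    cd /sys/kernel/debug/tracing
    echo enqueue_to_backlog > set_ftrace_filter   # trace only this function
    echo function > current_tracer                # enable the function tracer
    sleep 1; cat trace                            # collect for about a second
    echo nop > current_tracer                     # stop tracing and clean up
    echo > set_ftrace_filter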

 ➜ ok0-00.dev ~  # functrace -H -d 1 'enqueue_to_backlog' > out
 Tracing "enqueue_to_backlog" for 1 seconds...
 # tracer: function
 #
 # entries-in-buffer/entries-written: 6/6   #P:24
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
  qemu-system-x86-28032 [004] .... 5798213.326794: enqueue_to_backlog <-netif_rx_internal
  qemu-system-x86-30569 [005] .... 5798213.447297: enqueue_to_backlog <-netif_rx_internal
  qemu-system-x86-28032 [004] .... 5798213.475525: enqueue_to_backlog <-netif_rx_internal
  qemu-system-x86-28032 [004] .... 5798213.498524: enqueue_to_backlog <-netif_rx_internal
  qemu-system-x86-28032 [006] .... 5798213.927472: enqueue_to_backlog <-netif_rx_internal
  qemu-system-x86-28032 [006] .... 5798213.967295: enqueue_to_backlog <-netif_rx_internal

 Ending tracing...

BOOM!

qemu processes were also calling enqueue_to_backlog! Being entirely focused on the incoming traffic flow, I had spent most of my time examining the internals of the bnx2 driver, trying to understand what happens to packets arriving at the host’s NIC from the network.

But at the same time I had almost entirely missed the fact that other processes may reach the same kernel function via a system call. This is particularly important/evident when running virtual machines (that’s what the ~okeanos service is offering), since you have qemu processes running and producing network traffic. Most likely the packets a VM transmits are written by qemu to its tap device, and the tun/tap driver injects them into the host stack with netif_rx, which matches the netif_rx_internal path the trace shows.
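
A quick way to cross-check this (a sketch; the tap interface names below are hypothetical) is to watch the per-device RX drop counters of the instances' tap interfaces, since the drop path of enqueue_to_backlog also increments skb->dev->rx_dropped:

    # RX statistics (including the "dropped" column) of a single tap interface:
    ip -s link show tap0
    # or all tap interfaces at once:
    grep -H . /sys/class/net/tap*/statistics/rx_dropped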

Conclusion?

So does this conclude our journey? Can we now rest assured that increasing the netdev_max_backlog value will mitigate the dropped-frames problem? I actually experimented with some hardware nodes that were dropping frames during ganeti instance migrations, bumping the value up to several tens of thousands. To my surprise, the problem persisted! So there is something more to be discovered here. Perhaps in a following post… ;)

[2] http://elixir.free-electrons.com/linux/v3.16.43/source/drivers/net/ethernet/broadcom/bnx2.c

[3] http://elixir.free-electrons.com/linux/v3.16.43/source/Documentation/IRQ-affinity.txt

[4] http://elixir.free-electrons.com/linux/v3.16.43/source/drivers/net/ethernet/broadcom/bnx2.c#L3262

[5] http://elixir.free-electrons.com/linux/v3.16.43/source/net/core/dev.c#L4091

[6] http://elixir.free-electrons.com/linux/v3.16.43/source/net/core/dev.c#L3734

[7] http://elixir.free-electrons.com/linux/v3.16.43/source/net/core/dev.c#L3248

[8] http://elixir.free-electrons.com/linux/v3.16.43/source/net/core/net-sysfs.c#L627