As network speeds increase from 10Gb/s to 40Gb/s, and soon to 100Gb/s, the rate at which packets can be transported increases as well, putting pressure on the kernel to drive ever faster packet rates. This article will cover what we currently know about the limits of the Linux kernel in terms of small-packet networking performance, ongoing efforts to push those limits further, and a number of tips and tricks for getting the most out of the kernel networking stack.
One of the first things to realize when dealing with the Linux kernel networking data path is that the kernel has to support a multitude of functions, everything from ARP to VXLAN in terms of protocols, and it has to do so securely. As a result, it needs a significant amount of time to process each packet. At current network-device speeds, however, it isn't given that much time. A 10Gb/s link can carry minimum-sized packets at a rate of 14.88Mpps; at that rate, we have just 67.2ns per packet. Keep in mind that an L3 cache hit (yes, a hit, not a miss) costs something on the order of 12ns. Put together, this means that a single CPU cannot process packets at anywhere near line rate, so instead we need to look at scaling across CPUs in order to handle packets at these rates.
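The 14.88Mpps figure follows directly from the minimum Ethernet frame size plus the per-frame overhead on the wire (8-byte preamble and 12-byte inter-frame gap); a quick sketch of the arithmetic:

```python
# Per-packet time budget for minimum-sized frames on a 10Gb/s link.
LINK_BPS = 10e9     # 10Gb/s line rate
MIN_FRAME = 64      # minimum Ethernet frame size, in bytes
OVERHEAD = 8 + 12   # preamble (8B) + inter-frame gap (12B) per frame

bits_per_packet = (MIN_FRAME + OVERHEAD) * 8   # 672 bits on the wire
pps = LINK_BPS / bits_per_packet               # ~14.88 Mpps
budget_ns = 1e9 / pps                          # ~67.2 ns per packet

print(f"{pps / 1e6:.2f} Mpps, {budget_ns:.1f} ns per packet")
```

With an L3 cache hit at roughly 12ns, the 67.2ns budget allows only a handful of cache references per packet, which is why a single CPU cannot keep up.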
One of the biggest factors affecting the scalability of x86 systems is NUMA (Non-Uniform Memory Access). What NUMA means is that the cost of accessing memory differs depending on which region of the system the memory belongs to. In a two-socket Xeon E5 system, both the PCIe devices and the memory belong to one of the two NUMA nodes; each node essentially represents a separate CPU socket.

Most network devices have to make use of interrupts in order to handle packet reception and to trigger Tx clean-up, and there are a number of ways interrupt placement can affect network performance. For example, if the traffic is mostly destined for an application running on a certain socket, it may be beneficial to have the packet processing occur on that socket rather than forcing the application to work with data that was just processed on a different one. As a result, tools like irqbalance can sometimes cause issues and must be taken into account when testing for performance.
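Interrupt affinity on Linux is controlled through a hex CPU bitmask written to `/proc/irq/<N>/smp_affinity`. As a sketch, the mask for pinning a NIC's interrupts to the CPUs of its local NUMA node can be computed like this (the CPU list here is a hypothetical example; on a real system it can be read from `/sys/devices/system/node/node0/cpulist`):

```python
# Sketch: build the hex bitmask that /proc/irq/<N>/smp_affinity expects,
# given the list of CPUs on the NUMA node the NIC is attached to.
def affinity_mask(cpus):
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu    # set one bit per CPU in the mask
    return f"{mask:x}"

# Hypothetical topology: node 0 holds the even-numbered CPUs.
node0_cpus = [0, 2, 4, 6]
print(affinity_mask(node0_cpus))  # prints "55" (bits 0, 2, 4, 6 set)
```

Writing that mask to `/proc/irq/<N>/smp_affinity` for each of the device's interrupt vectors (with irqbalance stopped, so it does not overwrite the setting) keeps interrupt processing on the node local to the device.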