eBPF Longest Prefix Match Maps
Coming soon to a 4.11 kernel near you, eBPF maps that can do longest prefix matches for things like IP routing.
Awesome.
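To make "longest prefix match" concrete, here is a minimal userspace sketch of the lookup semantics in plain C. It is not the kernel's code: the kernel map is trie-based, while this uses a linear scan over a small table. The names `struct route`, `mask_for` and `lpm_lookup` are made up for illustration, and addresses are assumed to be IPv4 in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry: IPv4 prefix (host byte order) plus a next-hop id. */
struct route {
    uint32_t prefix;
    uint8_t  prefix_len;   /* number of significant leading bits, 0..32 */
    int      next_hop;
};

/* Netmask with the top `len` bits set; len == 0 yields an all-zero mask. */
static uint32_t mask_for(uint8_t len)
{
    assert(len <= 32);
    return len == 0 ? 0 : ~(uint32_t)0 << (32 - len);
}

/*
 * Return the next hop of the most specific (longest) prefix that
 * contains addr, or -1 if no entry matches.
 */
int lpm_lookup(const struct route *table, size_t n, uint32_t addr)
{
    int best_len = -1;
    int best_hop = -1;

    for (size_t i = 0; i < n; i++) {
        uint32_t m = mask_for(table[i].prefix_len);

        if ((addr & m) == (table[i].prefix & m) &&
            table[i].prefix_len > best_len) {
            best_len = table[i].prefix_len;
            best_hop = table[i].next_hop;
        }
    }
    return best_hop;
}
```

With entries for 10.0.0.0/8 and 10.1.0.0/16, a lookup of 10.1.2.3 picks the /16 because it is the more specific match, which is exactly what a routing table wants.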
XDP
There is a lot happening on the XDP front in the Linux kernel these days. This presentation provides a good overview.
I love the idea that eBPF is becoming a policy language for the kernel.
Linux 4.10
As always, KernelNewbies is the place to go for a great summary of new kernel features. Here are some 4.10 highlights I’m interested in.
BPF for lightweight tunnel encapsulation commit, https://git.kernel.org/torvalds/c/f74599f7c5309b21151233b98139e9b723fd1110
TCP: sender chronographs instrumentation. This feature exports the sender chronograph stats via the socket SO_TIMESTAMPING channel. Currently it can instrument how long a particular application unit of data was queued in TCP by tracking SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having these sender chronograph stats exported along with these timestamps allows the various sender limitations to be broken down further. For example, a video server can tell if a particular chunk of video on a connection takes a long time to deliver because TCP was experiencing a small receive window.
sched/act_mirred: Implement the corresponding ingress actions TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR. Up until now, action mirred supported only egress actions (either TCA_EGRESS_REDIR or TCA_EGRESS_MIRROR). This allows attaching filters whose target is to hand matching skbs into the rx processing of a specified device.
Add the LRU versions of the existing BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_PERCPU_HASH maps: BPF_MAP_TYPE_LRU_HASH and BPF_MAP_TYPE_LRU_PERCPU_HASH.
Since the dawn of time, the way Linux synchronizes to disk the data written to memory by processes (aka background writeback) has sucked. When Linux writes all that data in the background, it should have little impact on foreground activity. That’s the definition of background activity. But for as long as anyone can remember, heavy buffered writers have not behaved like that. For instance, if you do something like $ dd if=/dev/zero of=foo bs=1M count=10k, or try to copy files to USB storage, and then try to start a browser or any other large app, it basically won’t start before the buffered writeback is done, and your desktop, or command shell, feels unresponsive. These problems happen because heavy writes (the kind of write activity caused by background writeback) fill up the block layer, and other IO requests have to wait a long time to be serviced (for more details, see the LWN article).
This release adds a mechanism that throttles back buffered writeback, which makes it more difficult for heavy writers to monopolize the IO request queue, and thus provides a smoother experience on Linux desktops and shells than what people were used to. The throttling algorithm monitors the latencies of requests and shrinks or grows the request queue depth accordingly, which means that it’s auto-tuning, and generally a user should not have to touch the settings. This feature needs to be enabled explicitly in the configuration (and, as should be expected, there can be regressions).
Recommended LWN article: Toward less-annoying background writeback
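The "shrink or grow the queue depth based on latency" idea can be sketched in a few lines of C. This is only an illustration of that kind of feedback loop, not the kernel's actual algorithm: the struct, the halve-on-overshoot and add-one-otherwise policy, and every name and parameter here are assumptions of mine.

```c
#include <assert.h>

/* Hypothetical throttle state; none of these names come from the kernel. */
struct wb_throttle {
    unsigned depth;          /* background writes currently allowed in flight */
    unsigned min_depth;      /* never throttle below this */
    unsigned max_depth;      /* never grow above this */
    unsigned target_lat_us;  /* latency we are willing to tolerate */
};

/*
 * Called once per monitoring window with the latency observed for
 * requests in that window. Returns the new queue depth.
 */
unsigned wb_adjust(struct wb_throttle *t, unsigned observed_lat_us)
{
    assert(t->min_depth >= 1 && t->min_depth <= t->max_depth);

    if (observed_lat_us > t->target_lat_us) {
        /* Latency too high: back off quickly by halving the depth. */
        t->depth /= 2;
        if (t->depth < t->min_depth)
            t->depth = t->min_depth;
    } else if (t->depth < t->max_depth) {
        /* Latency fine: cautiously allow one more request in flight. */
        t->depth++;
    }
    return t->depth;
}
```

Backing off fast and recovering slowly is a common choice for this sort of controller, since it favors foreground latency over background throughput.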
Cloud Spanner
Not surprised Google made this a service. It will be difficult for others to duplicate.
Revisiting Network I/O APIs: The netmap Framework – ACM Queue
http://queue.acm.org/detail.cfm?id=2103536
Neat. My Pnet (2006) looks a lot like this.
http://git.coverfire.com/?p=pnet.git;a=summary
eBPF Map Size Limits
Recently, I’ve done some work with eBPF and specifically the in-kernel maps that are manipulated and shared by both kernel and user space code.
When doing this I ran into permission errors when installing large maps. It took a little while to figure out that the cause of this was the root user’s locked memory limit being too low (thanks Daniel Borkmann).
The locked memory limit is modified with ulimit:
ulimit -l unlimited
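A program that creates the maps itself can also raise its own limit before touching BPF, using the standard setrlimit() call with RLIMIT_MEMLOCK. A minimal sketch, where the helper name `set_memlock_limit` is mine and error handling is left to the caller:

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Set the locked-memory limit for the current process.
 * Returns 0 on success, -1 on failure (errno is set by setrlimit()).
 * Raising the hard limit normally requires root or CAP_SYS_RESOURCE.
 */
int set_memlock_limit(rlim_t soft, rlim_t hard)
{
    struct rlimit r;

    assert(soft <= hard);   /* setrlimit() rejects soft > hard anyway */
    r.rlim_cur = soft;
    r.rlim_max = hard;
    return setrlimit(RLIMIT_MEMLOCK, &r);
}
```

Calling set_memlock_limit(RLIM_INFINITY, RLIM_INFINITY) early in main() is the in-process equivalent of the ulimit command above, and it only affects that process and its children.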
How is Writing Lord of the Rings Like Writing Software?
http://highscalability.com/blog/2017/1/4/how-is-writing-lord-of-the-rings-like-writing-software.html