
Making the Linux flow classifier tunnel aware

Flow Classifier

The Linux kernel has many different tools for managing traffic. One of them is the flow classifier which allows the user to configure which fields of the packet headers should be used to create a hash which is then used to identify flows and manage them. For example, if the user selects src,dst,proto,proto-src,proto-dst they get a unique value for each flow (within the limits of the hash). Alternatively, using only src as the key will result in all flows being grouped by the source IP address.
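As a rough illustration of how the choice of keys changes the grouping, here is a minimal Python sketch. The packets, the `flow_hash` helper and the use of CRC32 are all assumptions made for the example; the kernel uses its own hash function and this is not its implementation.

```python
import zlib
from collections import defaultdict

# Hypothetical packets: (src, dst, proto, proto-src, proto-dst)
packets = [
    ("192.0.2.10", "198.51.100.1", 6, 40000, 22),
    ("192.0.2.10", "198.51.100.1", 6, 40001, 80),
    ("192.0.2.11", "198.51.100.1", 17, 5353, 53),
]

# Position of each key within the tuples above.
KEY_INDEX = {"src": 0, "dst": 1, "proto": 2, "proto-src": 3, "proto-dst": 4}


def flow_hash(packet, keys, divisor=1024):
    """Hash only the selected header fields into one of `divisor` buckets."""
    selected = "|".join(str(packet[KEY_INDEX[k]]) for k in keys)
    return zlib.crc32(selected.encode()) % divisor


def group(packets, keys):
    """Group packets by the bucket their selected keys hash to."""
    buckets = defaultdict(list)
    for p in packets:
        buckets[flow_hash(p, keys)].append(p)
    return buckets


# Keyed on src only: both packets from 192.0.2.10 share a bucket.
by_src = group(packets, ["src"])
# Keyed on the 5-tuple: each flow normally gets its own bucket
# (collisions are possible, hence "within the limits of the hash").
by_tuple = group(packets, ["src", "dst", "proto", "proto-src", "proto-dst"])
```

With only src as the key, the two connections from 192.0.2.10 always land in the same bucket, which is the per-source grouping described above.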

The Problem

Below is a slightly simplified version of my home network.

Figure 1: Simplified home network

All of the traffic, both IPv4 and IPv6, is tunnelled through a Linux router which lives at my service provider. The reason for this complicated setup is that it gives me control of the traffic in both the upstream and downstream. By shaping the traffic to just below the maximum rate in each direction I am able to avoid Bufferbloat problems and prioritize latency-sensitive traffic such as SSH, DNS and Vonage. Especially under load, my QoS scripts make a marked difference in how fast the Internet feels.

The multiple tunnels present a problem for implementing my QoS scheme because from the perspective of the underlying interface there are only two flows on the network: one for the IP-IP tunnel and one for the IP-IPv6 tunnel. A workaround I used for a while was to apply the QoS rules to the IP-IP tunnel interface because that’s where the bulk of the traffic flows. However, this meant that IPv6 traffic was not properly controlled, and any time I had a significant amount of IPv6 traffic I lost all the advantages of my QoS scheme.

To solve this properly I needed a way to look into the tunnels in order to identify the inner network flows. So I’ve extended the flow classifier with the keys in the following table. IP-IP, IP-IPv6, IPv6-IP and IPv6-IPv6 tunnels are supported.

Key               Description
tunnel-src        Extract the source IP from the inner header
tunnel-dst        Extract the destination IP from the inner header
tunnel-proto      Extract the protocol from the inner header
tunnel-proto-src  Extract the transport protocol source port from the inner header
tunnel-proto-dst  Extract the transport protocol destination port from the inner header
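To make "looking into the tunnel" concrete, here is a minimal Python sketch that pulls the five tunnel keys out of an IPv4-in-IPv4 packet. It is only an illustration: the helper names and addresses are made up, it handles just the IP-IP case (not the IPv6 variants), it ignores fragments, and it is not the code from the patch.

```python
import socket
import struct

IPPROTO_IPIP = 4   # IPv4 encapsulated in IPv4
IPPROTO_TCP = 6
IPPROTO_UDP = 17


def ipv4_header(src, dst, proto, payload_len):
    """Build a minimal 20-byte IPv4 header (checksum left as zero)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + payload_len, 0, 0, 64, proto, 0,
        socket.inet_aton(src), socket.inet_aton(dst),
    )


def parse_ipv4(data):
    """Return (src, dst, proto, payload) for an IPv4 packet."""
    ihl = (data[0] & 0x0F) * 4          # header length in bytes
    proto = data[9]
    src = socket.inet_ntoa(data[12:16])
    dst = socket.inet_ntoa(data[16:20])
    return src, dst, proto, data[ihl:]


def tunnel_keys(packet):
    """Extract the inner 5-tuple from an IP-IP packet, or None if not tunnelled."""
    _, _, outer_proto, inner = parse_ipv4(packet)
    if outer_proto != IPPROTO_IPIP:
        return None
    src, dst, proto, l4 = parse_ipv4(inner)
    sport = dport = 0                   # protocols without ports keep zero
    if proto in (IPPROTO_TCP, IPPROTO_UDP) and len(l4) >= 4:
        sport, dport = struct.unpack("!HH", l4[:4])
    return {
        "tunnel-src": src,
        "tunnel-dst": dst,
        "tunnel-proto": proto,
        "tunnel-proto-src": sport,
        "tunnel-proto-dst": dport,
    }


# Hypothetical SSH segment from a LAN host, wrapped in an IP-IP tunnel.
tcp = struct.pack("!HH", 40000, 22) + bytes(16)
inner = ipv4_header("10.0.0.2", "203.0.113.5", IPPROTO_TCP, len(tcp)) + tcp
outer = ipv4_header("192.0.2.1", "198.51.100.1", IPPROTO_IPIP, len(inner)) + inner
keys = tunnel_keys(outer)
```

The outer header only ever shows the two tunnel endpoints, while `keys` carries the addresses and ports of the real flow inside it.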

Results

In order to validate that this works I started a couple of SCP uploads to max out the upstream bandwidth and then ran ping-exp to measure the latency. At the start of the test the flow classifier keys were src,dst,proto,proto-src,proto-dst. Approximately halfway through I changed the keys to src,dst,proto,proto-src,proto-dst,tunnel-src,tunnel-dst,tunnel-proto,tunnel-proto-src,tunnel-proto-dst. The advantage of keeping the non-tunnel keys is that any traffic created by the router itself is still classified properly. Here is the tc script I used. You can see the results of this test in Figure 2 below.

Figure 2: Before and after tunnel keys

For the first half of the test you can see the high latency. This is due to all the traffic from the SCP upload and ICMP pings being placed into the same queue because from the perspective of the flow classifier there is only one flow. In the second half of the test the addition of the tunnel keys allows the flow classifier to place the ICMP packets into a different queue which is not affected by the SCP upload and therefore has much lower latency. The large amount of packet loss during the key change is because the script I used creates a large number of queues. While these queues are being created packets are dropped.

While my network setup may be a bit unique I think it’s likely that many home networks will have some form of tunnelling in the near future as tunnels are part of several IPv6 migration strategies. So hopefully this little addition will be useful in many different contexts.

Below are links to the two patches that are required. I’ll post them to Netdev for review shortly.

clsflow-tunnel-20111016.patch

iproute-clsflow-tunnel-20111016.patch

Linux flow classifier proto-dst and TOS

Recently I’ve been playing around with the Linux flow classifier on my gateway. The flow classifier provides the ability to group network flows by configuring which parts of the packet headers (referred to as keys) are used in a hash calculation which chooses the output queue.

All of my Internet traffic travels over an IPIP tunnel to another Linux box. I do this so I have control of the QoS in both the upstream and the downstream. A result of this configuration is that from the perspective of the output interface there is only a single network flow.

I configured the flow classifier to use the src,dst,proto,proto-src,proto-dst keys which aims to provide 5-tuple flow fairness. Here’s the simple tc script I used. Due to the IPIP tunnel I expected to see that all traffic would be placed into the same queue. Strangely, the below is what my little ping-exp utility showed when running at the same time as an SCP upload.

Figure 1: Unexpected flow classifier behaviour

Coincidentally I ran ping-exp configured to send three different streams of ICMP traffic with different IP TOS values. Note that SCP automatically sets the IP TOS to the equivalent of the “Low” stream in the test.

Notice that the pings using the high and default TOS values appear to be unaffected by the low priority ping and SCP traffic. This was unexpected because none of the src, dst, proto, proto-src or proto-dst keys should be affected by the TOS value.

After a bit of experimentation I determined that the proto-dst key was the source of the problem. If you spend a bit of time with the flow_get_proto_dst() function in cls_flow.c you’ll see that if the protocol is ICMP or IPIP, as it is in my test, then the following value is returned:

return addr_fold(skb_dst(skb)) ^ (__force u16)skb->protocol;

skb_dst() returns a pointer to a dst_entry structure. Since Linux maintains a separate dst_entry structure for each (destination, TOS) pair, the source of the unexpected behaviour is obvious.
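The effect can be sketched in a few lines of Python. This assumes addr_fold() simply XORs the upper and lower 32 bits of a 64-bit pointer (an assumption about the helper, which is not quoted here), uses made-up pointer values, and ignores the byte order of skb->protocol.

```python
BITS_PER_LONG = 64
ETH_P_IP = 0x0800  # protocol value; the kernel field is big-endian, ignored here


def addr_fold(addr):
    """Fold a pointer-sized integer into 32 bits (assumed behaviour)."""
    return (addr & 0xFFFFFFFF) ^ (addr >> 32 if BITS_PER_LONG > 32 else 0)


def proto_dst_fallback(dst_entry_ptr, protocol=ETH_P_IP):
    """Value used for proto-dst when the packet has no transport ports."""
    return addr_fold(dst_entry_ptr) ^ protocol


# Hypothetical addresses of two dst_entry structures for the same
# destination: one for the default TOS, one for the low priority TOS.
dst_default_tos = 0xFFFF880012345600
dst_low_tos = 0xFFFF880012345900

key_default = proto_dst_fallback(dst_default_tos)
key_low = proto_dst_fallback(dst_low_tos)
```

Because the two structures live at different addresses, `key_default` and `key_low` differ, so the same tunnel can hash to different queues depending only on the TOS value.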

I’m not knowledgeable enough about the Linux network stack to be certain but I don’t see any value in returning a value for proto-dst which is random with respect to the actual traffic on the wire. At the very least this is not intuitive behaviour.

If you look at flow_get_proto_src() you’ll see something similar:

return addr_fold(skb->sk);

In this case a pointer to the local socket structure is used as a fallback. Again, this has no relation to the actual packets on the wire and if the packet does not originate at the local machine then no socket exists which causes this value to be zero anyway.

It seems to me that the most intuitive behaviour would be to have the proto-src and proto-dst keys return zero when they are applied to traffic that doesn’t have the notion of transport layer ports.

I’ll post to Netdev about this and see what the kernel devs have to say.

Related to this, I have a patch to the flow classifier that adds tunnel awareness which I plan to post to Netdev this weekend as well.

People will give write access to their Twitter account for that?

A couple weeks ago I attended StartupCamp Waterloo. The Willet team was there to do a demo of Willet Social which is a new service they’ve developed which allows website owners to see who’s Tweeting about their site. This service is implemented as a button that the website owner adds to their site. When the user visiting the site clicks on the button they can quickly Tweet a link to the site to their followers. More recently this new service has gotten some attention on Hacker News.

In order for this to work, the user visiting the site has to give the Willet Social service write access to their Twitter account. This is done via OAuth so it’s not like Willet has complete access to the user’s account but the reality is that very few people understand this detail. What I find interesting about this is that people are actually willing to give a random service they have never used before access to their Twitter account in order to Tweet a link.

Social media isn’t really my area of expertise but had I come up with the idea for Willet Social I probably would have dismissed it outright because I, probably too optimistically, think it’s doubtful and scary that any significant number of people would give this access.

pfifo_fast and ECN

Summary

The default queuing discipline used on Linux network interfaces deprioritizes ECN enabled flows because it uses a deprecated definition of the IP TOS byte.

The problem

By default Linux attaches a pfifo_fast queuing discipline (QDisc) to each network interface. The pfifo_fast QDisc has three internal classes (also known as bands) numbered zero to two which are serviced in priority order. That is, any packets in class zero are sent before servicing class one, any packets in class one are sent before servicing class two. Packets are selected for each class based on the TOS value in the IP header.
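The strict priority servicing can be sketched as follows. The class name and packet labels are invented for the illustration, and the real QDisc also enforces a queue length limit that is left out here.

```python
from collections import deque


class PfifoFastSketch:
    """Three FIFO bands serviced in strict priority order."""

    def __init__(self, bands=3):
        self.bands = [deque() for _ in range(bands)]

    def enqueue(self, packet, band):
        self.bands[band].append(packet)

    def dequeue(self):
        # Always take from the lowest numbered non-empty band.
        for queue in self.bands:
            if queue:
                return queue.popleft()
        return None


qdisc = PfifoFastSketch()
qdisc.enqueue("bulk", 2)
qdisc.enqueue("normal-1", 1)
qdisc.enqueue("interactive", 0)
qdisc.enqueue("normal-2", 1)

order = [qdisc.dequeue() for _ in range(4)]
```

Packets leave in the order interactive, normal-1, normal-2, bulk: a packet in class two waits for as long as anything remains in classes zero or one.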

The TOS byte in the IP header has an interesting history having been redefined several times. Pfifo_fast is based on the RFC 1349 definition.

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|   PRECEDENCE    |       TOS             | MBZ |  RFC 1349 (July 1992)
+-----+-----+-----+-----+-----+-----+-----+-----+

Note that in the above definition there is a TOS field within the TOS byte.

Each bit in the TOS field indicates a particular QoS parameter to optimize for.

Value  Meaning
1000   Minimize delay (md)
0100   Maximize throughput (mt)
0010   Maximize reliability (mr)
0001   Minimize monetary cost (mmc)
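A small Python sketch of splitting the TOS byte into its RFC 1349 parts may help here. The function name is made up for the example.

```python
# Bits of the 4-bit TOS field, most significant first.
TOS_FLAGS = [
    (0b1000, "md"),   # minimize delay
    (0b0100, "mt"),   # maximize throughput
    (0b0010, "mr"),   # maximize reliability
    (0b0001, "mmc"),  # minimize monetary cost
]


def decode_rfc1349(tos_byte):
    """Split a TOS byte into (precedence, tos_field, mbz, flag names)."""
    precedence = (tos_byte >> 5) & 0b111
    tos_field = (tos_byte >> 1) & 0b1111
    mbz = tos_byte & 0b1
    flags = [name for bit, name in TOS_FLAGS if tos_field & bit]
    return precedence, tos_field, mbz, flags


low_delay = decode_rfc1349(0x10)   # TOS byte 00010000
throughput = decode_rfc1349(0x08)  # TOS byte 00001000
```

Note that the TOS field starts one bit in from the right of the byte, so a TOS byte of 0x10 carries the field value 1000 (minimize delay).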

Pfifo_fast uses the TOS bits to map packets into the priority classes using the following table. The general idea is to map high priority packets into class 0, normal traffic into class 1, and low priority traffic into class 2.

IP TOS field value  Class
0000                1
0001                2
0010                1
0011                1
0100                2
0101                2
0110                2
0111                2
1000                0
1001                0
1010                0
1011                0
1100                1
1101                1
1110                1
1111                1
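The table can be written as a simple lookup indexed by the TOS field. This is a collapsed sketch of the mapping shown above, not the kernel code, which goes through an intermediate priority value.

```python
# Class for each 4-bit TOS field value, 0000 through 1111.
TOS_FIELD_TO_CLASS = [1, 2, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]


def pfifo_fast_class(tos_byte):
    """Return the pfifo_fast class for a TOS byte read the RFC 1349 way."""
    tos_field = (tos_byte >> 1) & 0b1111
    return TOS_FIELD_TO_CLASS[tos_field]


interactive = pfifo_fast_class(0x10)  # minimize delay
default = pfifo_fast_class(0x00)      # no TOS bits set
bulk = pfifo_fast_class(0x08)         # maximize throughput
```

Here `interactive` is class 0, `default` is class 1 and `bulk` is class 2.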

This approach looks reasonable except that RFC 1349 has been deprecated by RFC 2474 which changes the definition of the TOS byte.

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|               DSCP                |    CU     |  RFC 2474 (October 1998) and
+-----+-----+-----+-----+-----+-----+-----+-----+    RFC 2780 (March 2000)

In this more recent definition, the first six bits of the TOS byte are used for the Diffserv codepoint (DSCP) and the last two bits are reserved for use by explicit congestion notification (ECN). ECN allows routers along a packet’s path to signal that they are nearing congestion. This information allows the sender to slow the transmit rate without requiring a lost packet as a congestion signal. The meanings of the ECN codepoints are outlined below.

   6     7
+-----+-----+
|  0     0  |  Non-ECN capable transport
+-----+-----+

   6     7
+-----+-----+
|  1     0  |  ECN capable transport - ECT(0)
+-----+-----+

   6     7
+-----+-----+
|  0     1  |  ECN capable transport - ECT(1)
+-----+-----+

   6     7
+-----+-----+
|  1     1  |  Congestion encountered
+-----+-----+

[Yes, the middle two codepoints have the same meaning. See RFC 3168 for more information.]

When ECN is enabled, Linux sets the ECN codepoint to ECT(0), or 10, which indicates to routers on the path that ECN is supported.

Since most applications do not modify the TOS/DSCP value, the default of zero is by far the most commonly used. A zero value for the DSCP field combined with ECT(0) results in the IP TOS byte being set to 00000010.

Looking at pfifo_fast’s TOS field to class mapping table (above), we can see that a TOS byte of 00000010 (TOS field 0001) results in ECN enabled packets being placed into the lowest priority class (2). However, packets which do not use ECN, those with TOS byte 00000000, are placed into the normal priority class (1). The result is that ECN enabled packets with the default DSCP value are unduly deprioritized relative to non-ECN enabled packets.

The rest of the mappings in the pfifo_fast table effectively ignore the MMC bit so this problem is only present when the DSCP/TOS field is set to the default value (zero).
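The whole chain can be checked with a short Python sketch. It builds the byte the RFC 2474 way (DSCP plus ECN), then reads it back the RFC 1349 way as pfifo_fast does. The codepoint values follow RFC 3168, which assigns 10 to ECT(0) and 01 to ECT(1); the helper names are made up for the example.

```python
# Class for each 4-bit TOS field value, as in the mapping table above.
TOS_FIELD_TO_CLASS = [1, 2, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]

NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11  # RFC 3168 codepoints


def tos_byte(dscp, ecn):
    """Assemble the byte from a 6-bit DSCP and a 2-bit ECN codepoint."""
    return (dscp << 2) | ecn


def rfc1349_class(byte):
    """Class chosen when the byte is interpreted as an RFC 1349 TOS byte."""
    return TOS_FIELD_TO_CLASS[(byte >> 1) & 0b1111]


plain_class = rfc1349_class(tos_byte(0, NOT_ECT))  # byte 00000000
ecn_class = rfc1349_class(tos_byte(0, ECT_0))      # byte 00000010

# TOS field values whose class changes when the mmc bit is set.
affected = [f for f in range(0, 16, 2)
            if TOS_FIELD_TO_CLASS[f] != TOS_FIELD_TO_CLASS[f | 1]]
```

`plain_class` is 1 and `ecn_class` is 2, and `affected` contains only 0, which matches the observation that the problem appears only when the rest of the TOS field is zero.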

This problem could be fixed by either changing pfifo_fast’s default priority to class mapping in sch_generic.c or changing the ip_tos2prio lookup table in route.c.

Network latency experiments

Recently a series of blog posts by Jim Gettys has started a lot of interesting discussions and research around the Bufferbloat problem. Bufferbloat is the term Gettys coined to describe huge packet buffers in network equipment which have been added through ignorance or a misguided attempt to avoid packet loss. These oversized buffers have the effect of greatly increasing latency when the network is under load.
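A quick back-of-the-envelope calculation shows why. The buffer size, packet size and link rate below are hypothetical numbers chosen for the illustration, not measurements.

```python
def queue_delay_seconds(packets, packet_bytes, link_bits_per_second):
    """Time for a packet arriving at the back of a full buffer to drain."""
    return packets * packet_bytes * 8 / link_bits_per_second


# A 256 packet buffer of 1500 byte packets in front of a 1 Mbit/s uplink.
delay = queue_delay_seconds(256, 1500, 1_000_000)
```

With these numbers a full buffer adds about three seconds of delay to every packet behind it, which is far more than an interactive session can tolerate.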

If you’ve ever tried to use an application which requires low latency, such as VoIP or an SSH terminal, at the same time as a large data transfer and experienced high latency, then you have likely experienced Bufferbloat. What I find really interesting about this problem is that it is so ubiquitous that most people think this is how it is supposed to work.

I’m not going to repeat all of the details of the Bufferbloat problem here (see bufferbloat.net) but note that Bufferbloat occurs at many different places in the network. It is present within network interface device drivers, software interfaces, modems and routers.

For many, the first instinct when responding to Bufferbloat is to add traffic classification, which is often referred to simply as QoS. While this can be a useful tool on top of the real solution, it does not solve the problem. The only way to solve Bufferbloat is a combination of properly sizing the buffers and Active Queue Management (AQM).

As it turns out I’ve been mitigating the effects of Bufferbloat (to great benefit) on my home Internet connection for some time. This has been accomplished through traffic shaping, traffic classification and using sane queue lengths with Linux’s queuing disciplines. I confess to not understanding, until the recent activity, that interface queues and driver internal queues are also a big part of the latency problem. I’ve since updated my network configuration to take this into account.

In the remainder of this post I will show the effects that a few different queuing configurations have on network latency. The results will be presented using a little utility I developed called Ping-exp. The name is a bit lame but Ping-exp has made it a lot easier for me to compare the results of different network traffic configurations.
