pfifo_fast and ECN

Summary

The default queuing discipline used on Linux network interfaces deprioritizes ECN-enabled flows because it is based on a deprecated definition of the IP TOS byte.

The problem

By default, Linux attaches a pfifo_fast queuing discipline (QDisc) to each network interface. The pfifo_fast QDisc has three internal classes (also known as bands), numbered zero to two, which are serviced in strict priority order: any packets in class zero are sent before class one is serviced, and any packets in class one are sent before class two. Packets are assigned to a class based on the TOS value in the IP header.

The TOS byte in the IP header has an interesting history, having been redefined several times. Pfifo_fast is based on the RFC 1349 definition.

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|   PRECEDENCE    |       TOS             | MBZ |  RFC 1349 (July 1992)
+-----+-----+-----+-----+-----+-----+-----+-----+

Note that in the above definition there is a TOS field within the TOS byte.

Each bit in the TOS field indicates a particular QoS parameter to optimize for.

Value Meaning
1000 Minimize delay (md)
0100 Maximize throughput (mt)
0010 Maximize reliability (mr)
0001 Minimize monetary cost (mmc)
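As a quick illustration, the flag bits in the table above can be decoded with a small helper. This is a hypothetical Python sketch of my own; the flag names come straight from the table.

```python
# Hypothetical helper to decode the 4-bit RFC 1349 TOS field into the
# QoS hints listed in the table above.
TOS_FLAGS = {
    0b1000: "minimize delay",
    0b0100: "maximize throughput",
    0b0010: "maximize reliability",
    0b0001: "minimize monetary cost",
}

def describe_tos_field(field: int) -> list:
    """Return the QoS hints whose bits are set in a 4-bit TOS field."""
    return [name for bit, name in TOS_FLAGS.items() if field & bit]

print(describe_tos_field(0b1010))  # -> ['minimize delay', 'maximize reliability']
```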

Pfifo_fast uses the TOS bits to map packets into the priority classes using the following table. The general idea is to map high priority packets into class 0, normal traffic into class 1, and low priority traffic into class 2.

IP TOS field value Class
0000 1
0001 2
0010 1
0011 1
0100 2
0101 2
0110 2
0111 2
1000 0
1001 0
1010 0
1011 0
1100 1
1101 1
1110 1
1111 1
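The table can be expressed as a short Python sketch. This is an illustration only, not the kernel's actual code (which lives in the ip_tos2prio table in route.c and the priority-to-band mapping in sch_generic.c). Under RFC 1349 the 4-bit TOS field sits between the 3-bit precedence field and the MBZ bit, so it is extracted by shifting the TOS byte right by one:

```python
# pfifo_fast's TOS-field-to-band mapping, transcribed from the table
# above (illustrative sketch, not the kernel implementation).
TOS_FIELD_TO_BAND = {
    0b0000: 1, 0b0001: 2, 0b0010: 1, 0b0011: 1,
    0b0100: 2, 0b0101: 2, 0b0110: 2, 0b0111: 2,
    0b1000: 0, 0b1001: 0, 0b1010: 0, 0b1011: 0,
    0b1100: 1, 0b1101: 1, 0b1110: 1, 0b1111: 1,
}

def band_for_tos_byte(tos_byte: int) -> int:
    """Map a full IP TOS byte to a pfifo_fast band.

    Per RFC 1349 the TOS field occupies bits 3-6 of the byte, between
    the 3-bit precedence field and the must-be-zero bit.
    """
    field = (tos_byte >> 1) & 0x0F
    return TOS_FIELD_TO_BAND[field]

print(band_for_tos_byte(0b00010000))  # minimize delay -> band 0
```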

This approach looks reasonable, except that RFC 1349 has since been obsoleted by RFC 2474, which changes the definition of the TOS byte.

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|               DSCP                |    CU     |  RFC 2474 (October 1998) and
+-----+-----+-----+-----+-----+-----+-----+-----+    RFC 2780 (March 2000)

In this more recent definition, the first six bits of the TOS byte are used for the Diffserv codepoint (DSCP) and the last two bits are reserved for use by explicit congestion notification (ECN). ECN allows routers along a packet’s path to signal that they are nearing congestion. This information allows the sender to slow the transmit rate without requiring a lost packet as a congestion signal. The meanings of the ECN codepoints are outlined below.

   6     7
+-----+-----+
|  0     0  |  Non-ECN capable transport
+-----+-----+

   6     7
+-----+-----+
|  1     0  |  ECN capable transport - ECT(0)
+-----+-----+

   6     7
+-----+-----+
|  0     1  |  ECN capable transport - ECT(1)
+-----+-----+

   6     7
+-----+-----+
|  1     1  |  Congestion encountered
+-----+-----+

[Yes, the middle two codepoints have the same meaning. See RFC 3168 for more information.]

When ECN is enabled, Linux sets the ECN codepoint to ECT(0), binary 10, which indicates to routers on the path that ECN is supported.

Since most applications do not modify the TOS/DSCP value, the default of zero is by far the most commonly used. A zero value for the DSCP field combined with ECT(0) results in the IP TOS byte being set to 00000010.
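Under the RFC 2474/3168 layout, that byte is simply the DSCP shifted left two bits with the ECN codepoint in the low bits. A minimal sketch (the helper name is mine, not from any real API):

```python
ECT_0 = 0b10    # ECN capable transport, ECT(0) -- what Linux marks when ECN is on
NOT_ECT = 0b00  # non-ECN capable transport

def tos_byte(dscp: int, ecn: int) -> int:
    """Compose an IP TOS byte from a 6-bit DSCP and a 2-bit ECN codepoint."""
    return ((dscp & 0x3F) << 2) | (ecn & 0x03)

print(format(tos_byte(0, ECT_0), "08b"))  # -> 00000010
```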

Looking at pfifo_fast’s TOS field to class mapping table (above), we can see that a TOS byte of 00000010 yields a TOS field value of 0001, which places ECN-enabled packets into the lowest priority class (2). However, packets which do not use ECN, those with TOS byte 00000000, are placed into the normal priority class (1). The result is that ECN-enabled packets with the default DSCP value are unduly deprioritized relative to non-ECN packets.

The rest of the mappings in the pfifo_fast table effectively ignore the MMC bit, so this problem is only present when the DSCP/TOS field is set to the default value (zero).

This problem could be fixed by changing either pfifo_fast’s default priority-to-class mapping in sch_generic.c or the ip_tos2prio lookup table in route.c.
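A sketch of what such a fix might look like, assuming we simply remap TOS field 0001 (whose low bit is now an ECN bit rather than the MMC bit) to match field 0000. This is an illustration in Python, not an actual kernel patch:

```python
# The default mapping, indexed by 4-bit TOS field value (transcribed
# from the table earlier in this post).
DEFAULT_MAP = [1, 2, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical fix: field 0001 (ECN-marked traffic with DSCP 0) goes to
# the normal-priority band 1 instead of the lowest-priority band 2.
FIXED_MAP = list(DEFAULT_MAP)
FIXED_MAP[0b0001] = 1

# An ECN-enabled packet with the default DSCP has TOS byte 00000010,
# i.e. TOS field 0001.
field = (0b00000010 >> 1) & 0x0F
print(DEFAULT_MAP[field], FIXED_MAP[field])  # -> 2 1
```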

Network latency experiments

Recently, a series of blog posts by Jim Gettys has started a lot of interesting discussion and research around the Bufferbloat problem. Bufferbloat is the term Gettys coined to describe the huge packet buffers in network equipment which have been added through ignorance or in a misguided attempt to avoid packet loss. These oversized buffers have the effect of greatly increasing latency when the network is under load.

If you’ve ever tried to use an application which requires low latency, such as VoIP or an SSH terminal, at the same time as a large data transfer and experienced high latency, then you have likely experienced Bufferbloat. What I find really interesting about this problem is that it is so ubiquitous that most people think this is how the network is supposed to work.

I’m not going to repeat all of the details of the Bufferbloat problem here (see bufferbloat.net), but note that Bufferbloat occurs at many different places in the network. It is present within network interface device drivers, software interfaces, modems and routers.

For many, the first instinct when responding to Bufferbloat is to add traffic classification, which is often referred to simply as QoS. While classification can be a useful tool on top of the real solution, it does not solve the problem by itself. The only way to solve Bufferbloat is a combination of properly sized buffers and Active Queue Management (AQM).

As it turns out I’ve been mitigating the effects of Bufferbloat (to great benefit) on my home Internet connection for some time. This has been accomplished through traffic shaping, traffic classification and using sane queue lengths with Linux’s queuing disciplines. I confess to not understanding, until the recent activity, that interface queues and driver internal queues are also a big part of the latency problem. I’ve since updated my network configuration to take this into account.

In the remainder of this post I will show the effects that a few different queuing configurations have on network latency. The results will be presented using a little utility I developed called Ping-exp. The name is a bit lame but Ping-exp has made it a lot easier for me to compare the results of different network traffic configurations.

Continue reading

The end of Google Wave

On May 28th, 2009 Google announced Wave to much fanfare. Wave was going to change the world by merging blogs, wikis and IM and finally replacing email as the digital world’s main collaborative tool.

On August 4th, 2010, just 14 months later, Google announced that they had stopped working on Wave due to a lack of users.

Despite my interest in the underlying technology, I never used Wave for anything productive. The main reason for this is that I didn’t know anyone else using it regularly. Of course this is the catch-22 faced by every new communications technology and it is at the heart of Wave’s failure. The underlying problem is that Wave didn’t add enough value for users working on a document alone. Wave’s revolutionary power would have eventually come from its collaborative features but the problem with focusing on these aspects early is that they do not add any value until there are users to collaborate with. In episode #68 of This Week in Startups, Marco Zappacosta defines the amount of value a service brings to users without network effects as network independent value (NIV). Wave had next to zero NIV.

It didn’t have to be this way. There are several aspects of Wave that could have been focused on to create a superior or at least unique single user experience. The most obvious would be to have made basic document editing and management a better experience. Or they could have focused on use cases not supported by other document systems. The Wave API allows for third party applications (bots) to contribute to documents at the same time as users. This could have been used to build bots which automate tasks that are time consuming or annoying in Word or Google Docs. On the collaborative side, Google could have concentrated on a specific, practical use case such as a shared white board (there are several bots which do white-boarding) or built interesting applications such as Gravity by SAP.

Wave’s failure saddens me because Google was really doing this right. They were publishing the protocol specifications and most of the source code. Even more importantly, they architected Wave based on a distributed/federated model which allowed Wave servers to exist within every organization, just like email servers do. This is a much harder problem than a centralized, all-Google architecture, but it is critical that single organizations do not control (what could have become) a core Internet protocol.

One also has to wonder where the revolutionary innovation required to replace email will come from if even Google gives up after only 14 months. Wave represents a large change that requires time to diffuse and for infrastructure to be built up around it. It is completely unreasonable to expect that Wave would have had large success in just 14 months. One has to feel for the Wave team, who, it seems, were given the impossible mission of changing the world in 14 months.

I believe that years from now we’ll look back on Google Wave and realize that it was closer to the solution than we thought. One of the key features that makes me believe this is the bot API. The idea of allowing third-party applications equal access to a live document is very powerful and could spawn a huge amount of innovation. For example, there is no competition to Microsoft’s grammar checker in Word or Google’s spell checker in Google Docs. There cannot be, as these are functions which are part of the application. Now imagine a world where a document system like Wave is the norm. Any user could select which spell checker to use just by adding a different bot to the Wave. I believe this flexibility would spawn competition that would drive a great deal of innovation. This speaks to the power of decentralized systems such as Wave.

There may also be an opportunity for a startup or open source project to take the Wave ecosystem and run with it. A lot of the hard work has already been done.

Blackberry Torch

I really hope that RIM has a successful device in the Torch and Blackberry OS 6. It would be such a sad story for technology in Canada if RIM continues to ride the slow train to irrelevance.

That said, what is the deal with having a touch screen, track pad and a keyboard?

This screams weak, unprincipled design. Take a stand! Lead instead of trying to mash together the best of every device on the market into some Franken-input system.

Next is Now

Just stumbled on this video created by Rogers.

One of a few good quotes:

“10 years ago it took 72 hours to download Godfather… – Today it takes 10 minutes – It still takes 3 hours to watch”

ChangeCamp London

Thanks to the organizers and participants involved with ChangeCamp London yesterday. It was amazing to see such a strong turnout of people interested in making London better. I hope everyone got as much out of it as I did.

For anyone who couldn’t attend, you can get a feel for the event by looking at the #ccldn tweets and following up with the actions which will be posted on the website.