Summary
The default queuing discipline used on Linux network interfaces deprioritizes ECN enabled flows because it uses a deprecated definition of the IP TOS byte.
The problem
By default Linux attaches a pfifo_fast queuing discipline (QDisc) to each network interface. The pfifo_fast QDisc has three internal classes (also known as bands) numbered zero to two which are serviced in priority order. That is, any packets in class zero are sent before servicing class one, any packets in class one are sent before servicing class two. Packets are selected for each class based on the TOS value in the IP header.
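The strict priority servicing described above can be illustrated with a toy model. This is only a sketch of the idea, not the kernel's implementation; the class name `StrictPriorityQueue` and the packet labels are made up for illustration.

```python
from collections import deque


class StrictPriorityQueue:
    """Toy three-band queue: band 0 is drained before band 1, band 1 before band 2."""

    def __init__(self, bands=3):
        self.bands = [deque() for _ in range(bands)]

    def enqueue(self, packet, band):
        self.bands[band].append(packet)

    def dequeue(self):
        # Scan from the highest priority band (0) downwards; FIFO within a band.
        for band in self.bands:
            if band:
                return band.popleft()
        return None  # nothing queued


q = StrictPriorityQueue()
q.enqueue("bulk", 2)
q.enqueue("normal-a", 1)
q.enqueue("interactive", 0)
q.enqueue("normal-b", 1)

order = [q.dequeue() for _ in range(4)]
print(order)
```

Even though the class two packet was queued first, it leaves last, because the lower numbered bands are emptied before it is considered.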
The TOS byte in the IP header has an interesting history, having been redefined several times. Pfifo_fast is based on the RFC 1349 definition.
   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|   PRECEDENCE    |          TOS          | MBZ |   RFC 1349 (July 1992)
+-----+-----+-----+-----+-----+-----+-----+-----+
Note that in the above definition there is a TOS field within the TOS byte.
Each bit in the TOS field indicates a particular QoS parameter to optimize for.
Value | Meaning |
1000 | Minimize delay (md) |
0100 | Maximize throughput (mt) |
0010 | Maximize reliability (mr) |
0001 | Minimize monetary cost (mmc) |
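To make the RFC 1349 layout concrete, here is a small sketch that splits a TOS byte into its three parts and names the TOS bits that are set. The function name `decode_rfc1349` and the short flag names are just for illustration; bit 0 in the diagram is the most significant bit of the byte.

```python
# TOS field bits from the table above, keyed by their 4-bit value.
TOS_FLAGS = {
    0b1000: "md",   # minimize delay
    0b0100: "mt",   # maximize throughput
    0b0010: "mr",   # maximize reliability
    0b0001: "mmc",  # minimize monetary cost
}


def decode_rfc1349(tos_byte):
    """Split a TOS byte into (precedence, tos_field, mbz, flag names)."""
    precedence = (tos_byte >> 5) & 0b111   # bits 0-2
    tos_field = (tos_byte >> 1) & 0b1111   # bits 3-6
    mbz = tos_byte & 0b1                   # bit 7, "must be zero"
    flags = [name for bit, name in TOS_FLAGS.items() if tos_field & bit]
    return precedence, tos_field, mbz, flags


print(decode_rfc1349(0x10))  # only the "minimize delay" bit set
print(decode_rfc1349(0x08))  # only the "maximize throughput" bit set
```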
Pfifo_fast uses the TOS bits to map packets into the priority classes using the following table. The general idea is to map high priority packets into class 0, normal traffic into class 1, and low priority traffic into class 2.
IP TOS field value | Class |
0000 | 1 |
0001 | 2 |
0010 | 1 |
0011 | 1 |
0100 | 2 |
0101 | 2 |
0110 | 2 |
0111 | 2 |
1000 | 0 |
1001 | 0 |
1010 | 0 |
1011 | 0 |
1100 | 1 |
1101 | 1 |
1110 | 1 |
1111 | 1 |
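The table above can be written as a sixteen entry lookup indexed by the TOS field. This is a sketch of the end result only; the name `pfifo_fast_class` is hypothetical and the single table here stands in for whatever intermediate steps the kernel takes.

```python
# Class for each 4-bit TOS field value (index 0b0000 .. 0b1111),
# copied from the table above.
TOS_FIELD_TO_CLASS = [1, 2, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]


def pfifo_fast_class(tos_byte):
    """Return the class (band) for a TOS byte, read under the RFC 1349 layout."""
    tos_field = (tos_byte >> 1) & 0b1111  # bits 3-6 of the byte
    return TOS_FIELD_TO_CLASS[tos_field]


print(pfifo_fast_class(0x10))  # minimize delay
print(pfifo_fast_class(0x00))  # no TOS bits set
print(pfifo_fast_class(0x08))  # maximize throughput
```

A packet asking for minimum delay lands in class 0, an unmarked packet in class 1, and a packet asking for maximum throughput in class 2.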
This approach looks reasonable except that RFC 1349 has been deprecated by RFC 2474 which changes the definition of the TOS byte.
   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|               DSCP                |    CU     |   RFC 2474 (December 1998) and
+-----+-----+-----+-----+-----+-----+-----+-----+   RFC 2780 (March 2000)
In this more recent definition, the first six bits of the TOS byte are used for the Diffserv codepoint (DSCP) and the last two bits are reserved for use by explicit congestion notification (ECN). ECN allows routers along a packet’s path to signal that they are nearing congestion. This information allows the sender to slow the transmit rate without requiring a lost packet as a congestion signal. The meanings of the ECN codepoints are outlined below.
   6     7
+-----+-----+
|  0     0  |   Non-ECN capable transport
+-----+-----+
|  1     0  |   ECN capable transport - ECT(0)
+-----+-----+
|  0     1  |   ECN capable transport - ECT(1)
+-----+-----+
|  1     1  |   Congestion encountered
+-----+-----+
[Yes, the middle two codepoints have the same meaning. See RFC 3168 for more information.]
When ECN is enabled, Linux sets the ECN codepoint to ECT(0), or 10, which indicates to routers on the path that ECN is supported.
Since most applications do not modify the TOS/DSCP value, the default of zero is by far the most commonly used. A zero value for the DSCP field combined with ECT(0) results in the IP TOS byte being set to 00000010.
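The newer layout can be sketched the same way. The helper names `encode_ds_field` and `decode_ds_field` are made up for illustration, and the codepoint names follow RFC 3168, in which 10 is ECT(0) and 01 is ECT(1).

```python
# ECN codepoints (bits 6-7 of the byte), named as in RFC 3168.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def encode_ds_field(dscp, ecn):
    """Pack a 6-bit DSCP and a 2-bit ECN codepoint into one byte."""
    return ((dscp & 0b111111) << 2) | (ecn & 0b11)


def decode_ds_field(tos_byte):
    """Split a byte into (dscp, ECN codepoint name)."""
    dscp = (tos_byte >> 2) & 0b111111  # bits 0-5
    ecn = tos_byte & 0b11              # bits 6-7
    return dscp, ECN_NAMES[ecn]


default_with_ecn = encode_ds_field(0, 0b10)
print(format(default_with_ecn, "08b"))  # 00000010
print(decode_ds_field(default_with_ecn))
```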
Looking at pfifo_fast’s TOS field to class mapping table (above), we can see what happens to a TOS byte of 00000010. Read under the RFC 1349 layout, the ECN bit in position 6 is the old MMC bit, so the TOS field value is 0001 and ECN enabled packets are placed into the lowest priority class (2). However, packets which do not use ECN, those with TOS byte 00000000, are placed into the normal priority class (1). The result is that ECN enabled packets with the default DSCP value are unduly deprioritized relative to non-ECN enabled packets.
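The collision between the two layouts can be reproduced in a few lines. This sketch builds the byte using the RFC 2474 / RFC 3168 layout and then classifies it the RFC 1349 way, using the mapping table from earlier in the article; the function name `classify` is hypothetical.

```python
# Class for each 4-bit TOS field value, from the pfifo_fast table in the article.
TOS_FIELD_TO_CLASS = [1, 2, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]


def classify(dscp, ecn):
    """Build the byte from DSCP + ECN, then classify it by its RFC 1349 TOS field."""
    tos_byte = (dscp << 2) | ecn           # sender's view: DSCP and ECN
    tos_field = (tos_byte >> 1) & 0b1111   # pfifo_fast's view: bits 3-6
    return TOS_FIELD_TO_CLASS[tos_field]


for ecn in (0b00, 0b10):
    print(f"DSCP=0 ECN={ecn:02b} -> class {classify(0, ecn)}")
```

With the default DSCP, the non-ECN packet is placed in class 1 and the ECN capable packet in class 2. With a DSCP such as 46 the class is the same with or without the ECN bit.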
The rest of the mappings in the pfifo_fast table effectively ignore the MMC bit, so this problem is only present when the other TOS field bits are all zero, as they are when the DSCP/TOS field is set to the default value (zero).
This problem could be fixed by changing either pfifo_fast’s default priority to class mapping in sch_generic.c or the ip_tos2prio lookup table in route.c.
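As a rough illustration of what such a fix needs to achieve, the sketch below derives a patched table in which the low bit of the TOS field no longer affects the class. This is a hypothetical model built on the single table from this article, not the actual kernel change; `PATCHED` and `ecn_sensitive_fields` are made-up names.

```python
# Original TOS field to class table from the article.
ORIGINAL = [1, 2, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical fix: ignore the low bit of the TOS field (the old MMC bit,
# which is now ECN bit 6) by reusing the class of the even-numbered neighbour.
PATCHED = [ORIGINAL[field & 0b1110] for field in range(16)]


def ecn_sensitive_fields(table):
    """Return the even TOS field values whose class changes when the low bit is set."""
    return [f for f in range(0, 16, 2) if table[f] != table[f | 1]]


print(ecn_sensitive_fields(ORIGINAL))  # [0]
print(ecn_sensitive_fields(PATCHED))   # []
print(PATCHED)
```

Only the entry for 0001 changes, moving from class 2 to class 1, which is consistent with the observation that the rest of the table already ignores that bit.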
The problem described in this article has now been fixed.