Queues Management

Petr Machata edited this page Feb 5, 2020 · 18 revisions
Table of Contents
  1. Qdisc
    1. Features by Version
    2. Qdiscs: Brief Introduction
    3. Creating a Qdisc
    4. Listing Qdiscs and Statistics
  2. DCB Incompatibility
  3. ETS
    1. ETS Bands
    2. Priomap
    3. Band Number Mapping
    4. Statistics
    5. TC Queue Depth
    6. Example
  4. PRIO
    1. Example
  5. Leaf Qdiscs
    1. Statistics
  6. RED
    1. RED Parameters
    2. UC and BUM Traffic Classes
    3. Statistics
    4. Example
  7. TBF
    1. TBF Parameters
    2. UC and BUM Traffic Classes
    3. Statistics
    4. Example

Qdisc

Traffic control in Linux is managed by the TC subsystem. Documentation can be found here and in the TC man page.

Features by Version

Kernel Version  Features
4.15            RED as root qdisc (ECN supported)
4.16            PRIO qdisc as root qdisc
4.17            RED as child of PRIO
5.6             ETS qdisc as root, RED and TBF as children of ETS or PRIO

Qdiscs: Brief Introduction

Qdiscs, for "queuing disciplines", are entities that take care of queuing up and later scheduling of traffic to be transmitted by a network interface. From this point of view, a qdisc has two interesting operations: enqueue requests that a packet be queued up for later transmission; dequeue requests that one of the queued-up packets be chosen for immediate transmission. Since there is no one right way to manage packet queues, a number of qdiscs of different types exist.

Qdiscs conceptually form a tree: one qdisc is at the root of the tree, and it may have zero or more children, which in turn can have children of their own. How many children any given qdisc permits, if any at all, depends on the qdisc in question. The points where children can be attached are called classes, and qdiscs that can have a non-zero number of classes are called classful.

Classful qdiscs do not store any packets themselves. Instead, they pass enqueue and dequeue requests down to one of their children, according to criteria specific to the qdisc itself. Eventually this recursive message passing ends up at one of the leaves, where the packets are actually stored. (Or where the packets are picked up from in case of dequeuing.)

(To be fully correct: qdiscs actually form a DAG, directed acyclic graph. Some qdiscs can be attached at multiple classes. However most of the time the simple tree structure is all that is needed, and is the only one that mlxsw is capable of offloading.)

Each qdisc is identified by its handle, which is a 16-bit hexadecimal number with a colon attached, such as 1: or abcd:. That number is called the qdisc major number. If a qdisc has any classes, their identifiers are formed as a pair of two numbers: <major>:<minor>, such as abcd:1. The numbering scheme for the minor numbers depends on the qdisc type. Sometimes the numbering is systematic, where the first class has the ID <major>:1, the second one <major>:2, and so on. Some qdiscs allow the user to set the class minor number arbitrarily when the class is created.
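
The identifier format described above can be sketched in a few lines of Python. This is only an illustration: the function name parse_tc_id is hypothetical, and the sketch assumes that the minor number is hexadecimal and limited to 16 bits just like the major number. Special identifiers such as "root" are not handled.

```python
def parse_tc_id(text):
    """Split a tc identifier such as 'abcd:' or 'abcd:1' into (major, minor).

    Both parts are read as hexadecimal (assumption for the minor part).
    A qdisc handle has no minor part, which is returned as None.
    """
    major_str, sep, minor_str = text.partition(":")
    if not sep or not major_str:
        raise ValueError(f"not a tc identifier: {text!r}")
    major = int(major_str, 16)
    minor = int(minor_str, 16) if minor_str else None
    # Both numbers have to fit into 16 bits.
    if not 0 <= major <= 0xFFFF:
        raise ValueError(f"major number out of range: {text!r}")
    if minor is not None and not 0 <= minor <= 0xFFFF:
        raise ValueError(f"minor number out of range: {text!r}")
    return major, minor


print(parse_tc_id("abcd:"))   # a qdisc handle
print(parse_tc_id("abcd:1"))  # a class of that qdisc
```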

Creating a Qdisc

To create a new qdisc and attach it at a given point, use the "add" or "replace" command:

# tc qdisc [add | replace] dev <dev name> \
    [root | parent <parent ID>] [handle <handle ID>] \
    <qdisc type> [<qdisc params>]

Handle ID is a 16-bit number, written as <major>:. If the user does not specify a handle, one is picked automatically.

A qdisc can be set as a root qdisc or as a child of another qdisc. In the latter case, parent ID is the ID of the class where the qdisc should be attached.

The difference between the "add" and "replace" operations lies in how they handle a qdisc that already exists at the attachment point. When a new class is created, it always comes with either a pFIFO or bFIFO qdisc attached by default. "add" can detach this implicit qdisc and attach a new one in its place. Once a qdisc has been added explicitly, however, only "replace" can swap it out. In practice it is possible to simply always use "replace".

Listing Qdiscs and Statistics

To list qdiscs at a given interface, use the "show" command:

# tc qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms

Pass the -s flag to see the statistics as well:

# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

The meaning of the individual statistical counters is explained in each qdisc's section.

DCB Incompatibility

The TC subsystem allows configuration of traffic classification, of individual TCs, and traffic shaping. These aspects are configurable also through the DCB subsystem. Therefore configuring qdiscs may overwrite the corresponding DCB configuration present at the time, and vice versa--DCB configuration will overwrite any preexisting qdisc configuration.

In particular, if the egress path should be configured through qdiscs, it is important to make sure that lldpad is stopped and possibly disabled:

# systemctl disable --now lldpad.service

ETS

The ETS qdisc describes mapping of packets to traffic classes based on their priority, and scheduling of individual traffic classes relative to one another. There are correspondingly two components that are present in every ETS qdisc: bands describe the traffic classes, and priomap describes the classification function.

ETS is meant to be used as a root qdisc on front panel port interfaces. It will not be offloaded otherwise.

ETS Bands

Each ETS qdisc has a set of classes, called "bands". Each band represents one logical traffic class. Since each band is a qdisc class, a qdisc can be attached at each of them.

ETS bands are split into two groups: a (possibly empty) set of strict bands, followed by a (possibly empty) set of DWRR bands. The strict bands, if any, are always the lower-numbered ones.

The way ETS dequeues packets is that it first tries to dequeue traffic from the strict bands, if there are any. It proceeds in order of band number, first band 0, then band 1, and so on, until all strict bands are tried.
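
The strict part of this dequeue order can be modeled as follows. This is a minimal sketch for illustration, not the kernel's implementation; the function name and the per-band queue representation are assumptions.

```python
from collections import deque


def pick_strict_band(queues, nstrict):
    """Return the lowest-numbered non-empty strict band, or None.

    `queues` is a list of per-band packet queues (hypothetical model);
    bands 0 .. nstrict-1 are the strict ones.
    """
    for band in range(nstrict):
        if queues[band]:
            return band
    # All strict bands are empty; a full model would try DWRR bands next.
    return None


queues = [deque(), deque(["pkt-a"]), deque(["pkt-b"]), deque(["pkt-c"])]
print(pick_strict_band(queues, nstrict=2))  # band 0 is empty, so band 1 wins
```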

An ETS qdisc with just strict bands can be created this way:

# tc qdisc add dev swp1 root handle 1: \
     ets bands 8 strict 8
# tc qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

DWRR bands are tried next. A value, called the quantum, is assigned to each DWRR band. The value is the number of bytes that the band is allowed to dequeue before it yields to the next DWRR band in line.

When creating DWRR bands, instead of giving the number of bands, as was the case above with strict bands, one lists the quanta of the individual bands.

This dequeuing algorithm is respected when the ETS qdisc is offloaded. Thus strict bands in the SW datapath represent strict TCs in the HW one, and DWRR bands represent DWRR TCs. For purposes of offloading, the quanta at DWRR bands are converted to percentages of the available bandwidth, and the ASIC then aims to split the available bandwidth according to these percentages. At most 8 bands can be offloaded--if the qdisc has more bands, mlxsw will not be able to offload it.

The following example creates an ETS qdisc with 4 strict bands and 4 DWRR ones, where bandwidth is split 25% : 25% : 25% : 25%:

# tc qdisc add dev swp1 root handle 1: \
     ets bands 8 strict 4 quanta 2000 2000 2000 2000
# tc qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 4 quanta 2000 2000 2000 2000 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
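
The conversion from quanta to bandwidth percentages is plain proportional arithmetic. The sketch below shows the idea under the assumption that each band's share is its quantum divided by the sum of all quanta; how the driver rounds the result is not covered here, and the function name is hypothetical.

```python
def quanta_to_percent(quanta):
    """Return each DWRR band's share of the available bandwidth, in percent."""
    total = sum(quanta)
    if total <= 0:
        raise ValueError("quanta must sum to a positive number")
    return [100.0 * quantum / total for quantum in quanta]


# Four equal quanta, as in the example above.
print(quanta_to_percent([2000, 2000, 2000, 2000]))  # [25.0, 25.0, 25.0, 25.0]
```

Only the ratio between the quanta matters for the resulting split, so 2000 2000 2000 2000 and 5000 5000 5000 5000 give the same percentages.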

Priomap

ETS supports several traffic classification algorithms, but the only one offloaded by mlxsw is called priomap. The priomap is a list of numbers, one for each priority. Each number is the band that packets with that priority should go to: 0 for the first band, 1 for the second, and so on:

                        p7 ----------------.
                        ..                 |
                        p2 ------.         |
                        p1 ----. |   ...   |
                        p0 --. | |         |
                             | | |         |
                             v v v         v
# tc ... ets bands 4 priomap 3 3 2 2 1 1 0 0
                             | | |   ...   |
                             | | |         '-> band 0
                             | | |              ...
                             | | '-----------> band 2
                             | '-------------> band 3
                             '---------------> band 3

For details on how priority is assigned to packets, see Quality of Service.

Note that ETS supports up to 16 priorities in a priomap. For purposes of offloading, the only relevant priorities are 0-7. Priorities 8-15 are ignored and can be omitted when configuring ETS.
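
The following sketch expands a partially specified priomap to all 16 priorities. It assumes that omitted priorities default to the last band, which is what the "tc qdisc show" outputs on this page suggest (for example, "priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7" for 8 bands). The function name is hypothetical.

```python
def expand_priomap(priomap, nbands):
    """Pad a priomap to 16 entries, sending omitted priorities to the last band.

    The padding rule is an assumption based on the show outputs, not a
    statement about the kernel implementation.
    """
    if len(priomap) > 16:
        raise ValueError("a priomap has at most 16 entries")
    for band in priomap:
        if not 0 <= band < nbands:
            raise ValueError(f"band {band} does not exist")
    return list(priomap) + [nbands - 1] * (16 - len(priomap))


full = expand_priomap([3, 3, 2, 2, 1, 1, 0, 0], nbands=4)
print(full[0], full[2], full[7])  # bands for priorities 0, 2 and 7: 3 2 0
```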

Band Number Mapping

mlxsw uses bands to denote logical traffic classes. Each band is mapped in the ASIC to a pair of TCs, one for known unicast traffic, the other one for BUM traffic (for broadcast, unknown unicast, multicast). The UC TC has strict priority over the BUM TC.

The unicast TC is derived from the band number as follows: band 0 maps to TC 7, band 1 to TC 6, etc., until band 7 maps to TC 0. The TC for BUM traffic is then the number of unicast TC + 8. The TC numbers are important for checking some per-TC ethtool counters and for shared buffer binding configuration.

For purposes of attaching a child to a band, the qdisc class ID of the band is its band number + 1.

The following table summarizes the band mapping described above:

Band no.  Class ID  UC TC  BUM TC  Priority
0         X:1       7      15      Highest
1         X:2       6      14
2         X:3       5      13
3         X:4       4      12
4         X:5       3      11
5         X:6       2      10
6         X:7       1      9
7         X:8       0      8       Lowest
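
The same mapping can be written down as a small helper, which may be convenient when scripting around ethtool counters or shared buffer configuration. The function name and the dictionary layout are hypothetical; the arithmetic follows the rules stated above.

```python
def band_mapping(major, band):
    """Return the class ID, UC TC and BUM TC for a band of an 8-band qdisc.

    `major` is the qdisc major number as an integer, e.g. 0x10 for "10:".
    """
    if not 0 <= band <= 7:
        raise ValueError("only bands 0..7 are offloaded")
    uc_tc = 7 - band  # band 0 maps to TC 7, band 7 to TC 0
    return {
        "class_id": f"{major:x}:{band + 1:x}",  # class minor is band number + 1
        "uc_tc": uc_tc,
        "bum_tc": uc_tc + 8,
    }


print(band_mapping(0x10, 0))
```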

Statistics

The tc -s qdisc show command lists the current ETS configuration, including the full priomap and statistics:

$ tc -s qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 8 priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7
 Sent 30510403042 bytes 20289261 pkt (dropped 5199870, overlimits 0 requeues 0)
 backlog 222720b 0p requeues 0

The statistics represent the sum of the statistics of all the bands. If RED is configured on any of the ETS classes, the child drops will be counted by the parent as well.

When using strict bands, packets in lower priority bands will not be sent until all the higher-priority bands are empty. In this situation, packets in the HW datapath might be dropped due to switch lifetime timeouts. These drops are not counted towards the number of dropped packets.

TC Queue Depth

Backlog values on FIFO qdiscs do not reflect the HW datapath at all, and for the offloaded qdiscs the backlog value is composed of the values of the two constituent TCs. To find the value for an individual TC, inspect the ethtool counter tc_transmit_queue_tc_<TC>, which shows the number of bytes queued up at that traffic class:

$ ethtool -S swp1 | grep tc_transmit_queue_tc
	tc_transmit_queue_tc_0: 0     \
	tc_transmit_queue_tc_1: 0      | UC TCs
	[...]                          |
	tc_transmit_queue_tc_7: 0     /
	tc_transmit_queue_tc_8: 0     \
	tc_transmit_queue_tc_9: 0      | BUM TCs
	[...]                          |
	tc_transmit_queue_tc_15: 0    /
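
A script can regroup these per-TC counters by band. The sketch below parses text in the format shown above and adds up the UC and BUM values of each band; treating the per-band value as a simple sum of the two TCs is an assumption, and the function name is hypothetical.

```python
import re


def queue_depth_per_band(ethtool_output):
    """Sum tc_transmit_queue_tc_<TC> values of the UC and BUM TC of each band."""
    depth = {}
    pattern = r"tc_transmit_queue_tc_(\d+):\s*(\d+)"
    for match in re.finditer(pattern, ethtool_output):
        tc, queued = int(match.group(1)), int(match.group(2))
        if tc > 15:
            continue  # only TCs 0..15 are described on this page
        # UC TC n and BUM TC n + 8 both belong to band 7 - n.
        band = 7 - (tc % 8)
        depth[band] = depth.get(band, 0) + queued
    return depth


sample = """
	tc_transmit_queue_tc_0: 100
	tc_transmit_queue_tc_7: 5
	tc_transmit_queue_tc_8: 20
"""
print(queue_depth_per_band(sample))
```

The sample text is made up for the illustration; in practice the input would be the output of "ethtool -S <dev>".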

Example

Add an ETS qdisc with handle 10:, with 8 bands, 4 of which are strict, and the remaining 4 split traffic 40% : 30% : 20% : 10%. The quanta sum up to 10000, which makes it easy to mentally map from the per-band quantum to the corresponding percentage.

Traffic is mapped to bands in a reversed 1:1 manner to make priority-0 traffic the least prioritized and priority-7 traffic the most prioritized. That means that priority 0 goes to TC 0, 1 goes to TC 1, and so on. (Except BUM traffic, which goes to TC 8, TC 9, and so on instead.)

# tc qdisc replace dev swp1 root handle 10: \
     ets bands 8 strict 4 quanta 4000 3000 2000 1000 \
     priomap 7 6 5 4 3 2 1 0

As indicated in the table above, band 0 has the class ID X:1, band 1 X:2, and so on. In order to attach a child qdisc to a band, use that ID as parent reference when creating a new qdisc. E.g. to attach RED to the first band and TBF to the second one:

# tc qdisc replace dev swp1 parent 10:1 handle 101: \
     red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M
# tc qdisc replace dev swp1 parent 10:2 handle 102: \
     tbf rate 400Mbit burst 128K limit 1M

PRIO

The PRIO qdisc is in most ways the same as an ETS qdisc configured with only strict bands. The only differences are:

  • PRIO cannot be configured with fewer than 3 bands.
  • PRIO does not permit DWRR bands; all bands are always strict. Correspondingly, there is no strict keyword when creating the qdisc, just bands.
  • PRIO has different priomap defaults.

One can learn about how PRIO works by reading the ETS section above and focusing only on the parts that deal with strict priority.

Example

Create a PRIO qdisc that configures 8 (strict) bands, and maps traffic to bands in reversed 1:1 fashion.

# tc qdisc replace dev swp1 root handle 1: \
     prio bands 8 priomap 7 6 5 4 3 2 1 0

Leaf Qdiscs

Where ETS and PRIO describe the assignment of traffic to TCs and the relation of individual TCs to each other, qdiscs attached to individual bands configure the TC itself.

  • A RED child will enable RED or ECN for the UC TC associated with the band.
  • A TBF child will configure a shaper for the pair of BUM and UC TCs associated with the band.

Additionally, each PRIO and ETS class, unless overridden, contains a FIFO qdisc child. Those are not shown by default in the "tc qdisc show" output, but can be shown by passing the "invisible" flag.

Statistics

The show command will show the current configuration including statistics. For offloaded qdiscs the values include the traffic in the HW datapath. Note in particular that FIFO is currently not offloaded, and therefore counters on most bands will not include HW datapath numbers unless the bands contain a RED or TBF qdisc.

# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

The reported counters are:

  • bytes, pkt - The number of bytes and packets, respectively, that were sent through the qdisc.

    The Spectrum ASICs do not track the number of sent bytes and packets on a per-TC basis, but rather on a per-packet-priority basis. For child qdiscs, mlxsw deduces the per-TC counters from the priorities mapped to a given band by the PRIO or ETS priomap. If the priomap is changed in a way that impacts bands that contain offloaded child qdiscs, these child qdiscs will lose the HW stats accumulated prior to the change.

  • dropped - The number of packets dropped on either the unicast TC, or the BUM TC corresponding to the band that the qdisc is in.

  • overlimits - The meaning depends on the qdisc.

  • requeues - The meaning depends on the qdisc, and mlxsw does not currently set this counter.

  • backlog - The number of bytes and packets waiting in queue. In offloaded qdiscs, the number of bytes includes the HW datapath queue depth. However the number of packets always includes only the SW datapath, because the corresponding counters are not available on Spectrum ASICs.
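
The way per-band sent counters follow from per-priority counters can be illustrated with a short sketch. It only models the deduction described for the bytes and pkt counters above; it is not the driver's code, and the function and parameter names are hypothetical.

```python
def band_tx_counters(priomap, per_prio_packets):
    """Sum per-priority packet counters into per-band counters.

    `priomap[p]` is the band that priority p is mapped to, and
    `per_prio_packets[p]` is the number of packets sent with priority p.
    """
    per_band = {}
    for prio, packets in enumerate(per_prio_packets):
        band = priomap[prio]
        per_band[band] = per_band.get(band, 0) + packets
    return per_band


# Two priorities share each band, so each band reports the sum of both.
print(band_tx_counters([3, 3, 2, 2, 1, 1, 0, 0], [1, 2, 3, 4, 5, 6, 7, 8]))
```

The sketch also makes it apparent why a priomap change invalidates the accumulated numbers: the same per-priority counters would suddenly be attributed to different bands.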

RED

RED is a queuing discipline designed for congestion avoidance. It can run in one of two modes: in RED mode, it drops packets according to a simple linear probability function. In ECN mode, it uses the same probability function, but marks ECN-capable packets with ECN-CE (congestion encountered) tags instead of dropping them.

The probability to drop or mark a packet is zero until the queue's average size reaches the minimum limit. From there, the probability rises linearly until it reaches the maximum probability at the point where the queue's average size reaches the maximum limit. When the queue's average size is above the maximum, the probability to drop a packet is 1 (see the figure below).

[Figure 1: drop/mark probability as a function of the queue's average size]

The drops due to the RED algorithm are called early drops. They differ from tail drops, which are caused by shared buffer quota exhaustion.
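
The probability function described above can be written down as a piecewise linear function. This is a sketch of the description in the text, not of the ASIC's exact behavior; the function and parameter names are hypothetical.

```python
def red_probability(avg_queue, min_limit, max_limit, max_probability):
    """Drop/mark probability for a given average queue size.

    Zero below the minimum limit, rising linearly to `max_probability`
    at the maximum limit, and 1 above the maximum limit.
    """
    if not min_limit < max_limit:
        raise ValueError("min_limit must be smaller than max_limit")
    if avg_queue < min_limit:
        return 0.0
    if avg_queue > max_limit:
        return 1.0
    slope = max_probability / (max_limit - min_limit)
    return slope * (avg_queue - min_limit)


# Halfway between the limits the probability is half of the maximum one.
print(red_probability(1000, min_limit=500, max_limit=1500, max_probability=0.1))
```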

RED is meant to be configured on one of the ETS or PRIO bands, but can be set as root qdisc as well.

RED Parameters

The following parameters are offloaded:

  • min - The minimum limit.
  • max - The maximum limit.
  • probability - The probability to drop a packet when the average queue size is at maximum limit. 1.0 means 100%.
  • ecn - If set, puts the qdisc into ECN mode.

The following parameters are not offloaded, but are mandatory for the software datapath qdisc:

  • limit - Hard limit for the queue's size.
  • avpkt - Average queue size calculation parameter. 1000 is recommended.

UC and BUM Traffic Classes

As described above, each ETS (and PRIO) band represents a pair of traffic classes. Adding RED at a band configures RED only on the UC TC, not on the BUM one. There is currently no way to configure RED on BUM traffic classes.

When RED is used as a root qdisc, it enables RED on TC 0.

Statistics

The show command will show the current configuration including RED's statistics.

$ tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc red 101: parent 10:1 offloaded limit 2Mb min 500Kb max 1536Kb
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  marked 0 early 0 pdrop 0 other 0

Besides the usual counters, RED supports the following values:

  • dropped - On RED this counter includes the number of packets early-dropped on the UC TC. (No packets are early-dropped on BUM TCs.)

  • marked - The number of packets that were ECN-marked.

    Up until and including Linux kernel 5.6 this counter reports the global number of all ECN-marked packets (despite being reported at a particular band). In 5.7 this global counter will be moved to ethtool's ecn_marked counter.

  • overlimits - The number of packets that were early-dropped or ECN-marked (with the caveat mentioned at marked above).

  • early - The number of packets that were early-dropped.

  • pdrop - The number of packets that were tail-dropped. Note that the tail-dropped count does include the numbers from BUM TC.

  • other - This counter is not used for HW datapath.

Example

The following example attaches a RED qdisc under band 0 of an ETS parent whose handle is 10:. Between the queue depths of 500KiB and 1.5MiB, the dropping probability will gradually rise from 0 to 10%.

# tc qdisc add dev swp1 parent 10:1 handle 101: \
     red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M

TBF

The TBF queuing discipline implements a shaper based on the token bucket algorithm.

TBF is meant to be configured on one of the ETS or PRIO bands, but can be set as root qdisc as well.

TBF Parameters

The following parameters are offloaded:

  • rate - The rate at which the queued traffic will be sent. The guaranteed granularity is 200Mbps.

  • burst - The number of bytes of traffic that is dequeued before the shaper rate takes effect. The value needs to be a power of 2. The range of valid values depends on the switch ASIC and is summarized below.

    Switch ASIC  Valid range
    Spectrum-1   2K .. 2G
    Spectrum-2   128K .. 2G
    Spectrum-3   2K .. 2G
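
The burst constraints can be checked before a configuration is attempted. The sketch below assumes that K and G in the table are binary units (1024 and 1024^3 bytes) and that the range bounds are inclusive; both are assumptions, as are the names used.

```python
# Valid burst range in bytes per ASIC, taken from the table above.
# Binary units and inclusive bounds are assumptions.
BURST_RANGE = {
    "Spectrum-1": (2 * 1024, 2 * 1024**3),
    "Spectrum-2": (128 * 1024, 2 * 1024**3),
    "Spectrum-3": (2 * 1024, 2 * 1024**3),
}


def burst_is_offloadable(asic, burst_bytes):
    """Check that a burst size is a power of 2 within the ASIC's valid range."""
    low, high = BURST_RANGE[asic]
    is_power_of_two = burst_bytes > 0 and burst_bytes & (burst_bytes - 1) == 0
    return is_power_of_two and low <= burst_bytes <= high


print(burst_is_offloadable("Spectrum-2", 128 * 1024))  # True
print(burst_is_offloadable("Spectrum-2", 64 * 1024))   # False: below the range
```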

The following parameter is not offloaded, but is mandatory for the software datapath:

  • limit - Hard limit for the queue's size.

UC and BUM Traffic Classes

As described above, each ETS (and PRIO) band represents a pair of traffic classes. Configuring TBF at a band sets up a shaper that applies to both UC and BUM traffic together.

Note that besides the shaper configured through TBF, mlxsw also automatically adds a minimum shaper of 200Mbps at a BUM TC. Thus any BUM traffic is guaranteed to get at least 200Mbps. Only on top of that does the TBF shaper apply to the combination of both traffic types.

When used as a root qdisc, TBF enables a shaper for UC TC 0 and BUM TC 8.

Statistics

The show command will show the current configuration including TBF's statistics.

# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

TBF reports all the usual counters.

Example

The following example attaches a TBF qdisc under band 0 of an ETS parent whose handle is 10:. It configures a 400Mbps shaper with a burst size of 128KiB.

# tc qdisc add dev swp1 parent 10:1 handle 101: \
     tbf rate 400Mbit burst 128K limit 1M

Further Resources

  1. man tc
  2. man tc-ets
  3. man tc-prio
  4. man tc-red
  5. man tc-tbf
  6. Traffic Control HOWTO