Queues Management
Traffic control in Linux is managed by the TC subsystem. Documentation can be found here and in the TC man page.
Kernel Version | Offloaded functionality |
---|---|
4.15 | RED as root qdisc (ECN supported) |
4.16 | PRIO qdisc as root qdisc |
4.17 | RED as child of PRIO |
5.6 | ETS qdisc as root, RED and TBF as children of ETS or PRIO |
5.7 | FIFO stats offload, RED nodrop mode |
5.9 | early_drop qevent with actions mirred and trap |
Qdiscs, for "queuing disciplines", are entities that take care of queuing up and later scheduling of traffic to be transmitted by a network interface. From this point of view, a qdisc has two interesting operations: enqueue requests that a packet be queued up for later transmission; dequeue requests that one of the queued-up packets be chosen for immediate transmission. Since there is no one right way to manage packet queues, a number of qdiscs of different types exist.
Qdiscs conceptually form a tree: one qdisc is at the root of the tree, and it may have zero or more children, which in turn can have more children of their own. How many children any given qdisc permits, if any at all, depends on the qdisc in question. The points where children can be attached are called classes, and qdiscs that can have a non-zero number of classes are called classful.
Classful qdiscs do not store any packets themselves. Instead, they pass enqueue and dequeue requests down to one of their children, according to criteria specific to the qdisc itself. Eventually this recursive message passing ends up at one of the leaves, where the packets are actually stored. (Or where the packets are picked up from in case of dequeuing.)
(To be fully correct: qdiscs actually form a DAG, a directed acyclic graph. Some qdiscs can be attached at multiple classes. However, most of the time the simple tree structure is all that is needed, and it is the only one that mlxsw is capable of offloading.)
Each qdisc is identified by its handle, which is a 16-bit hexadecimal number with a colon attached, such as 1: or abcd:. That number is called the qdisc major number. If a qdisc has any classes, their identifiers are formed as a pair of two numbers: <major>:<minor>, such as abcd:1. The numbering scheme for the minor numbers depends on the qdisc type. Sometimes the numbering is systematic, where the first class has the ID <major>:1, the second one <major>:2, and so on. Some qdiscs allow the user to set the class minor number arbitrarily as the class is created.
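Because handles are written in hexadecimal, a handle such as 10: denotes major number 16, not 10. The following is a minimal sketch of splitting a class ID into its parts; the helper name parse_classid is hypothetical, and it assumes that the minor number is written in hexadecimal as well.

```shell
# Split a class ID such as "abcd:1" into major and minor and print both
# in decimal. Both parts are parsed as hexadecimal (an assumption for
# the minor part). parse_classid is a hypothetical helper name.
parse_classid() (
    major=${1%%:*}
    minor=${1#*:}
    # A bare qdisc handle such as "10:" has an empty minor; print it as 0.
    printf '%d %d\n' "0x${major}" "0x${minor:-0}"
)

parse_classid abcd:1   # prints "43981 1"
parse_classid 10:      # prints "16 0"
```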
To create a new qdisc and attach it at a given point, use the "add" or "replace" command:
# tc qdisc [add | replace] dev <dev name> \
[root | parent <parent ID>] [handle <handle ID>] \
<qdisc type> [<qdiscs params>]
Handle ID is a 16-bit number, written as <major>:. If the user does not specify a handle by hand, a new one is picked automatically.
A qdisc can be set as a root qdisc or as a child of another qdisc. In the latter case, parent ID is the ID of the class where the qdisc should be attached.
The difference between "add" and "replace" operations is in handling of qdiscs that exist at the attachment point prior to the creation. When a new class is created, it always comes with either a pFIFO or bFIFO qdisc attached by default. "add" allows detaching this implicit qdisc and attaching a new one instead. As soon as a qdisc has been added explicitly, the "replace" command has to be used to replace that qdisc. In practice it is possible to just always use "replace".
To list qdiscs at a given interface, use the "show" command:
# tc qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
Pass the -s flag to see the statistics as well:
# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
The meaning of the individual statistical counters is elaborated at each qdisc's section.
The TC subsystem allows configuration of traffic classification, of individual TCs, and of traffic shaping. These aspects are also configurable through the DCB subsystem. Therefore configuring qdiscs may overwrite the corresponding DCB configuration present at the time, and vice versa: DCB configuration will overwrite any preexisting qdisc configuration.
In particular, if the egress path should be configured through qdiscs, it is important to make sure that lldpad is stopped and possibly disabled:
# systemctl disable --now lldpad.service
Qdiscs may invoke user-configured actions when certain interesting events take place in the qdisc. The object through which these actions are configured is called a "qevent". Each qevent can either be unused, or can have a shared block attached to it. The filters at this block are executed when the corresponding interesting event takes place.
As an example, the RED qdisc supports an early_drop qevent. Packets that are early-dropped due to the RED algorithm are then passed through the filters at the block that is configured for this qevent.
mlxsw is capable of offloading filters added to qevent blocks as long as the following conditions are satisfied:
- The switch ASIC is Spectrum-2 or above.
- The qevent is supported (see below).
- Only a single filter shall be attached at the configured block, at chain 0 (the default), and its classifier shall be matchall.
- The filter shall have hw_stats set to disabled.
- The filter shall have a single action, which shall be supported (see below).
The following qevents are supported:
The following actions are supported:
- mirred egress mirror, which configures a SPAN, RSPAN or ERSPAN session to which the matching packets are directed.
- trap, which mirrors the impacted packet to the CPU. The trap under which the packet is reported depends on the qevent.
Qevents are configured when a qdisc is created. The general form is as follows:
# tc qdisc add dev swp1 root handle 1: \
<qdisc_kind> <qdisc parameters> \
qevent <qevent_name> block <block-index>
This way, a shared block with a given index is bound to the given qevent. Then filters added to this block are considered for offloading:
# tc qdisc replace dev swp1 root handle 1: \
red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M \
qevent early_drop block 10
# tc filter add block 10 matchall skip_sw \
action vlan pop hw_stats disabled
Error: Unsupported action.
We have an error talking to the kernel
# tc filter add block 10 matchall skip_sw \
action mirred egress mirror dev swp6
Error: HW counters not supported on qevents.
We have an error talking to the kernel
# tc filter add block 10 matchall skip_sw \
action mirred egress mirror dev swp6 hw_stats disabled
The ETS qdisc describes mapping of packets to traffic classes based on their priority, and scheduling of individual traffic classes relative to one another. There are correspondingly two components that are present in every ETS qdisc: bands describe the traffic classes, and priomap describes the classification function.
ETS is meant to be used as a root qdisc on front panel port interfaces. It will not be offloaded otherwise.
Each ETS qdisc has a set of classes, called "bands". Each band represents one logical traffic class. Since each band is a qdisc class, a qdisc can be attached at each of them.
ETS bands are split into two groups: a (possibly empty) set of strict bands, followed by a (possibly empty) set of DWRR bands. The strict bands, if any, are always the lower-numbered ones.
The way ETS dequeues packets is that it first tries to dequeue traffic from the strict bands, if there are any. It proceeds in order of band number, first band 0, then band 1, and so on, until all strict bands are tried.
An ETS qdisc with just strict bands can be created this way:
# tc qdisc add dev swp1 root handle 1: \
ets bands 8 strict 8
# tc qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
DWRR bands are tried next. A value, quantum, is assigned to each DWRR band. The value is the number of bytes that the band is allowed to dequeue before it yields to the next DWRR band in turn.
When creating DWRR bands, instead of giving the number of bands, as was the case above with strict bands, one lists the quanta for the individual bands.
This dequeuing algorithm is respected when the ETS qdisc is offloaded. Thus strict bands in the SW datapath represent strict TCs in the HW one, and DWRR bands represent DWRR TCs. For purposes of offloading, the quanta at DWRR bands are converted to percentages of available bandwidth, and the ASIC then aims to split the available bandwidth according to these percentages. At most 8 bands can be offloaded: if the qdisc has more bands, mlxsw will not be able to offload it.
The following example creates an ETS qdisc with 4 strict bands and 4 DWRR ones, where bandwidth is split 25% : 25% : 25% : 25%:
# tc qdisc add dev swp1 root handle 1: \
ets bands 8 strict 4 quanta 2000 2000 2000 2000
# tc qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 4 quanta 2000 2000 2000 2000 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
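The conversion from quanta to bandwidth shares can be sketched with plain shell arithmetic. This only illustrates the proportional split described above; the helper name quanta_to_percent is hypothetical, and the rounding that mlxsw itself applies is not specified here, so the integer division below is an assumption.

```shell
# Print the percentage of DWRR bandwidth that each quantum corresponds to.
# quanta_to_percent is a hypothetical helper; rounding down is an
# assumption and may differ from what the driver does.
quanta_to_percent() (
    total=0
    for q in "$@"; do total=$((total + q)); done
    out=
    for q in "$@"; do
        out="$out${out:+ }$((q * 100 / total))"
    done
    echo "$out"
)

quanta_to_percent 2000 2000 2000 2000   # prints "25 25 25 25"
```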
ETS supports several traffic classification algorithms, but the only one offloaded by mlxsw is called priomap. priomap is composed of a list of numbers, one for each priority. Each number is the number of the band that packets with that priority should go to: 0 for the first band, 1 for the second, and so on:
                        p7 ----------------.
                        ..                 |
                        p2 ------.         |
                        p1 ----. |   ...   |
                        p0 --. | |         |
                             | | |         |
                             v v v         v
# tc ... ets bands 4 priomap 3 3 2 2 1 1 0 0
                             | | |   ...   |
                             | | |         '-> band 0
                             | | |   ...
                             | | '-----------> band 2
                             | '-------------> band 3
                             '---------------> band 3
For details on how priority is assigned to packets, see Quality of Service.
Note that ETS supports up to 16 priorities in a priomap. For purposes of offloading, the only relevant priorities are 0-7. Priorities 8-15 are ignored and can be omitted when configuring ETS.
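The priomap lookup amounts to indexing a list by priority. A small sketch follows; band_for_prio is a hypothetical helper, and it deliberately does not guess the band for priorities that are omitted from the list.

```shell
# Print the band that a priomap assigns to a given priority.
# Usage: band_for_prio <priority> <priomap entries...>
# band_for_prio is a hypothetical helper name.
band_for_prio() (
    prio=$1; shift
    # Priorities beyond the listed entries are not resolved here.
    if [ "$prio" -ge "$#" ]; then
        echo "unspecified"
        return 1
    fi
    shift "$prio"
    echo "$1"
)

band_for_prio 0 3 3 2 2 1 1 0 0   # prints "3"
band_for_prio 2 3 3 2 2 1 1 0 0   # prints "2"
band_for_prio 7 3 3 2 2 1 1 0 0   # prints "0"
```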
mlxsw uses bands to denote logical traffic classes. Each band is mapped in the ASIC to a pair of TCs, one for known unicast traffic, the other one for BUM traffic (broadcast, unknown unicast, multicast). The UC TC has strict priority over the BUM TC.
The unicast TC is derived from the band number as follows: band 0 maps to TC 7, band 1 to TC 6, etc., until band 7 maps to TC 0. The TC for BUM traffic is then the number of the unicast TC + 8. The TC numbers are important for checking some per-TC ethtool counters and for shared buffer binding configuration.
For purposes of attaching a child to a band, the qdisc class ID of the band is its band number + 1.
The following table summarizes the band mapping described above:
Band no. | Class ID | UC TC | BUM TC | Priority |
---|---|---|---|---|
0 | X:1 | 7 | 15 | Highest |
1 | X:2 | 6 | 14 | |
2 | X:3 | 5 | 13 | |
3 | X:4 | 4 | 12 | |
4 | X:5 | 3 | 11 | |
5 | X:6 | 2 | 10 | |
6 | X:7 | 1 | 9 | |
7 | X:8 | 0 | 8 | Lowest |
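The same mapping can be computed rather than looked up. The sketch below reproduces the table rows from the rules stated above (class minor = band + 1, UC TC = 7 - band, BUM TC = UC TC + 8); the helper name band_info is hypothetical.

```shell
# Print the class ID and the UC and BUM TCs for a band of a qdisc.
# Usage: band_info <band number 0..7> <qdisc major>
# band_info is a hypothetical helper name.
band_info() (
    band=$1; major=$2
    uc=$((7 - band))
    printf 'class=%s:%d uc_tc=%d bum_tc=%d\n' \
        "$major" "$((band + 1))" "$uc" "$((uc + 8))"
)

band_info 0 10   # prints "class=10:1 uc_tc=7 bum_tc=15"
band_info 7 10   # prints "class=10:8 uc_tc=0 bum_tc=8"
```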
The tc -s qdisc show command will list the current ETS configuration including the full priomap and statistics:
$ tc -s qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 8 priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7
Sent 30510403042 bytes 20289261 pkt (dropped 5199870, overlimits 0 requeues 0)
backlog 222720b 0p requeues 0
The statistics represent the sum of the statistics of all the bands. If RED is configured on any of the ETS classes, the child drops will be counted by the parent as well.
When using strict bands, packets in lower priority bands will not be sent until all the higher-priority bands are empty. In this situation, packets in the HW datapath might be dropped due to switch lifetime timeouts. These drops are not counted towards the number of dropped packets.
The backlog value reported on offloaded qdiscs is composed of the values of the two constituent TCs. To find the value for individual TCs, it is necessary to inspect the ethtool counter tc_transmit_queue_tc_<TC>. That shows the number of bytes queued up at individual traffic classes:
$ ethtool -S swp1 | grep tc_transmit_queue_tc
tc_transmit_queue_tc_0: 0 \
tc_transmit_queue_tc_1: 0 | UC TCs
[...] |
tc_transmit_queue_tc_7: 0 /
tc_transmit_queue_tc_8: 0 \
tc_transmit_queue_tc_9: 0 | BUM TCs
[...] |
tc_transmit_queue_tc_15: 0 /
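To recombine the two per-TC values for one band, the counters can be filtered and added up. The following sketch reads ethtool -S style output on standard input; band_queue_bytes is a hypothetical helper, and it assumes each counter line has the form "name: value" as in the listing above (without the UC/BUM annotations).

```shell
# Sum the UC and BUM transmit queue depth (in bytes) for one band.
# Reads "ethtool -S <dev>" output on stdin.
# band_queue_bytes is a hypothetical helper name.
band_queue_bytes() {
    awk -v uc="$((7 - $1))" '
        BEGIN { bum = uc + 8 }
        # Match the counters of the two TCs that back this band.
        $1 == ("tc_transmit_queue_tc_" uc ":")  { sum += $2 }
        $1 == ("tc_transmit_queue_tc_" bum ":") { sum += $2 }
        END { printf "%.0f\n", sum }
    '
}

# On a live system (not run here): ethtool -S swp1 | band_queue_bytes 0
printf 'tc_transmit_queue_tc_7: 100\ntc_transmit_queue_tc_15: 20\n' |
    band_queue_bytes 0   # prints "120"
```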
Add an ETS qdisc with handle 10:, with 8 bands, 4 of which are strict, and the remaining 4 split traffic 40% : 30% : 20% : 10%. The quanta sum up to 10000, which makes it easy to mentally map from the per-band quantum to the corresponding percentage.
Traffic is mapped to bands in a reversed 1:1 manner to make priority-0 traffic the least prioritized and priority-7 traffic the most prioritized. That means that priority 0 goes to TC 0, 1 goes to TC 1, and so on. (Except BUM traffic, which goes to TC 8, TC 9, and so on instead.)
# tc qdisc replace dev swp1 root handle 10: \
ets bands 8 strict 4 quanta 4000 3000 2000 1000 \
priomap 7 6 5 4 3 2 1 0
As indicated in the table above, band 0 has the class ID X:1, band 1 X:2, and so on. In order to attach a child qdisc to a band, use that ID as the parent reference when creating a new qdisc. E.g. to attach RED to the first band and TBF to the second one:
# tc qdisc replace dev swp1 parent 10:1 handle 101: \
red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M
# tc qdisc replace dev swp1 parent 10:2 handle 102: \
tbf rate 400Mbit burst 128K limit 1M
The PRIO qdisc is in most ways the same as ETS qdisc configured with only strict bands. The only differences are:
- PRIO cannot be configured with fewer than 3 bands.
- PRIO does not permit DWRR bands, all bands are always strict. Correspondingly, there is no strict keyword when creating the qdisc, just bands.
- PRIO has different priomap defaults.
One can learn about how PRIO works by reading the ETS section above and focusing only on the parts that deal with strict priority.
Create a PRIO qdisc that configures 8 (strict) bands, and maps traffic to bands in reversed 1:1 fashion.
# tc qdisc replace dev swp1 root handle 1: \
prio bands 8 priomap 7 6 5 4 3 2 1 0
Where ETS and PRIO describe assignment of traffic to TCs and relation of individual TCs to each other, qdiscs attached to individual bands configure the TC itself.
- A RED child will enable RED or ECN for the UC TC associated with the band.
- A TBF child will configure shaper for the pair of BUM and UC TCs associated with the band.
Additionally, each PRIO and ETS class, unless overridden, contains a FIFO qdisc child. Those are not shown by default in the "tc qdisc show" output, but can be shown by passing an invisible flag. Alternatively it is possible to replace them explicitly by qdiscs with non-null handles.
The show command will show the current configuration including statistics. For offloaded qdiscs the values include the traffic in the HW datapath.
Note: On kernels older than 5.7, FIFO is not offloaded, and therefore counters on the bands that do not contain another qdisc do not include HW datapath numbers.
# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
The reported counters are:
- bytes, pkt - The number of bytes and packets, respectively, that were sent through the qdisc. The Spectrum ASICs do not track the number of sent bytes and packets on a per-TC basis, but rather on a per-packet-priority basis. For child qdiscs, mlxsw deduces the per-TC counters from the priorities mapped to a given band by the PRIO or ETS priomap. If the priomap is changed in a way that impacts bands that contain offloaded child qdiscs, these child qdiscs will lose HW stats accumulated prior to the change.
- dropped - The number of packets dropped on either the unicast TC, or the BUM TC corresponding to the band that the qdisc is in.
- overlimits - The meaning depends on the qdisc.
- requeues - The meaning depends on the qdisc, and mlxsw does not currently set this counter.
- backlog - The number of bytes and packets waiting in queue. In offloaded qdiscs, the number of bytes includes the HW datapath queue depth. However the number of packets always includes only the SW datapath, because the corresponding counters are not available on Spectrum ASICs.
pFIFO or bFIFO qdiscs are by default attached to newly-created qdisc classes and classful qdiscs. These default qdiscs have a handle of 0:0, and will only be shown by tc qdisc show when an invisible flag is passed. They can also be created explicitly with a non-null handle, in which case they are shown normally.
The difference between pFIFO and bFIFO is that pFIFO limits the queue length by number of packets, whereas bFIFO by number of bytes. From the perspective of mlxsw, these two qdisc kinds are equivalent.
There are also related but distinct qdiscs, pFIFO-head_drop, and pFIFO-fast. These are not offloaded.
mlxsw offloads FIFO under the following circumstances:
- If it has a handle of 0:0: when it is a direct child of an offloaded PRIO or ETS qdisc.
- If it has a valid handle: when it is a root qdisc, or is a child of an offloaded root qdisc.
When a FIFO qdisc is offloaded, its stats reflect HW datapath traffic.
No FIFO parameters are offloaded and there are no mandatory SW-datapath parameters.
As described above, each ETS (and PRIO) band represents a pair of traffic classes.
When a FIFO qdisc is offloaded, its stats reflect HW datapath traffic flowing through the corresponding traffic class. Both UC and BUM traffic is counted.
The show command will show the current configuration including FIFO's statistics. Pass the invisible flag in order to see qdiscs with a null handle.
# tc -s q sh dev swp7 invisible
qdisc ets 10: root refcnt 2 offloaded bands 3 strict 3 priomap 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Sent 62736398886 bytes 7793372 pkt (dropped 29552253, overlimits 0 requeues 0)
backlog 814464b 0p requeues 0
qdisc pfifo 0: parent 10:3 offloaded limit 1000p
Sent 636 bytes 6 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc pfifo 0: parent 10:2 offloaded limit 1000p
Sent 10215514 bytes 1270 pkt (dropped 22780159, overlimits 0 requeues 0)
backlog 411264b 0p requeues 0
qdisc pfifo 0: parent 10:1 offloaded limit 1000p
Sent 62726182736 bytes 7792096 pkt (dropped 6772094, overlimits 0 requeues 0)
backlog 403200b 0p requeues 0
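The dump above can be used to cross-check the earlier statement that the root ETS statistics are the sum of the statistics of its bands. The sketch below adds up the dropped counters of all child qdiscs in tc -s output; sum_child_drops is a hypothetical helper, and it assumes the line layout shown in this dump.

```shell
# Sum the "dropped" counters of all non-root qdiscs in "tc -s qdisc show"
# output read from stdin. sum_child_drops is a hypothetical helper name.
sum_child_drops() {
    awk '
        # Remember whether the current qdisc is attached under a parent.
        $1 == "qdisc" { is_child = ($4 == "parent") }
        $1 == "Sent" && is_child {
            # Field 7 is the drop count with a trailing comma, e.g. "0,".
            n = $7; sub(/,/, "", n); sum += n
        }
        END { printf "%.0f\n", sum }
    '
}

# Counters copied from the dump above; prints "29552253", which equals
# the dropped counter of the root ETS qdisc.
sum_child_drops <<'EOF'
qdisc ets 10: root refcnt 2 offloaded bands 3 strict 3
 Sent 62736398886 bytes 7793372 pkt (dropped 29552253, overlimits 0 requeues 0)
qdisc pfifo 0: parent 10:3 offloaded limit 1000p
 Sent 636 bytes 6 pkt (dropped 0, overlimits 0 requeues 0)
qdisc pfifo 0: parent 10:2 offloaded limit 1000p
 Sent 10215514 bytes 1270 pkt (dropped 22780159, overlimits 0 requeues 0)
qdisc pfifo 0: parent 10:1 offloaded limit 1000p
 Sent 62726182736 bytes 7792096 pkt (dropped 6772094, overlimits 0 requeues 0)
EOF
```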
FIFO reports all the usual counters.
In the following example, the FIFO at ETS band 0 (class 10:1) is replaced with one that has a non-zero handle, so that it is visible with normal tc dumps:
# tc qdisc add dev swp1 parent 10:1 handle 101: pfifo
RED is a queuing discipline designed for congestion avoidance. It can run in one of three modes:
- In RED mode, the qdisc drops packets according to a simple linear probability function described below.
- In ECN mode, it uses the same probability function, but marks ECN-capable packets with ECN-CE (congestion encountered) tags instead of dropping them. Non-ECN-capable packets are still early-dropped. Unlike in RED mode, in ECN mode the queue can be filled completely, and excess packets are then tail-dropped.
- ECN nodrop mode is like pure ECN mode, but does not drop non-ECN-capable packets. ECN nodrop therefore never early-drops, but can still tail-drop packets if the queue grows too large.
The probability to drop or mark a packet is zero until the queue's average size reaches the minimum limit. From there, the probability will rise linearly until it reaches the maximum probability at a point where the queue's average size reaches the maximum limit. When the queue's average size is above the maximum, the probability to drop a packet is 1 (See figure below).
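The probability function described in the preceding paragraph can be written down as a short calculation. This is only a sketch of the textual description, not of the ASIC implementation; red_probability is a hypothetical helper, and it assumes min is less than max and that all three sizes use the same unit.

```shell
# Print the drop/mark probability for an average queue size.
# Usage: red_probability <avg> <min> <max> <probability>
# red_probability is a hypothetical helper name.
red_probability() {
    awk -v avg="$1" -v min="$2" -v max="$3" -v pmax="$4" 'BEGIN {
        if (avg < min)        p = 0      # below the minimum limit
        else if (avg > max)   p = 1      # above the maximum limit
        else if (max == min)  p = pmax   # degenerate case, avoids 0/0
        else                  p = pmax * (avg - min) / (max - min)
        printf "%.3f\n", p
    }'
}

# Sizes in KiB, with min 500, max 1500 and probability 0.1:
red_probability 400 500 1500 0.1    # prints "0.000"
red_probability 1000 500 1500 0.1   # prints "0.050"
red_probability 1600 500 1500 0.1   # prints "1.000"
```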
The drops due to the RED algorithm are called early drops. They differ from tail drops, which are caused by shared buffer quota exhaustion.
RED is meant to be configured on one of the ETS or PRIO bands, but can be set as root qdisc as well. When RED is used as a root qdisc, it enables RED on TC 0.
The following parameters are offloaded:
- min - The minimum limit.
- max - The maximum limit.
- probability - The probability to drop a packet when the average queue size is at the maximum limit. 1.0 means 100%.
- ecn - If set, puts the qdisc into ECN mode.
- nodrop - If set together with ecn, puts the qdisc into "ECN nodrop" mode.
The following parameters are not offloaded, but are mandatory for the software datapath qdisc:
- limit - Hard limit for the queue's size.
- avpkt - Average queue size calculation parameter. 1000 is recommended.
As described above, each ETS (and PRIO) band represents a pair of traffic classes. Adding RED at a band configures RED only on the UC TC, not on BUM one. There is currently no way to configure RED on BUM traffic classes.
The show command will show the current configuration including RED's statistics.
$ tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc red 101: parent 10:1 offloaded limit 2Mb min 500Kb max 1536Kb
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
marked 0 early 0 pdrop 0 other 0
Besides the usual counters, RED supports the following values:
- dropped - On RED this counter includes the number of packets early-dropped on the UC TC. (No packets are early-dropped on BUM TCs.)
- marked - The number of packets that were ECN-marked. Up until and including Linux kernel 5.6 this counter reports the global number of all ECN-marked packets (despite being reported at a particular band). In 5.7 this global counter will be moved to ethtool's ecn_marked counter.
- overlimits - The number of packets that were early-dropped or ECN-marked (with the caveat mentioned at marked above).
- early - The number of packets that were early-dropped.
- pdrop - The number of packets that were tail-dropped. Note that the tail-dropped count does include the numbers from the BUM TC.
- other - This counter is not used for the HW datapath.
The following qevents can be offloaded for the RED qdisc:
- early_drop - Filters at the configured block are invoked on packets that are early-dropped by the RED algorithm. Packets trapped by this qevent are reported under the early_drop trap.
The following example attaches a RED qdisc under band 0 of an ETS parent whose handle is 10:. Between the queue depths of 500KiB and 1.5MiB, the dropping probability will gradually rise from 0 to 10%.
# tc qdisc add dev swp1 parent 10:1 handle 101: \
red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M
The following example creates a RED qdisc with the same configuration, but puts it to "ecn nodrop" mode:
# tc qdisc add dev swp1 parent 10:1 handle 101: \
red ecn nodrop limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M
The TBF queuing discipline implements a shaper based on Token Bucket algorithm.
TBF is meant to be configured on one of the ETS or PRIO bands, but can be set as root qdisc as well.
The following parameters are offloaded:
- rate - The speed with which the queued traffic will be sent. The guaranteed granularity is 200Mbps.
- burst - The number of bytes of traffic that is dequeued before the shaper rate takes effect. The value needs to be a power of 2. The range of valid values depending on system type is summarized below.

Switch ASIC | Valid range |
---|---|
Spectrum-1 | 2K .. 2G |
Spectrum-2 | 128K .. 2G |
Spectrum-3 | 2K .. 2G |
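The burst constraints can be checked before issuing the tc command. The following is a sketch only: burst_ok is a hypothetical helper, the ASIC names are arbitrary labels, and it assumes that K and G are powers of 1024 and that both ends of each range are inclusive.

```shell
# Succeed if a burst size in bytes is a power of 2 within the valid
# range for the given ASIC. burst_ok is a hypothetical helper name;
# K = 1024 and inclusive range bounds are assumptions.
burst_ok() (
    burst=$1; asic=$2
    case $asic in
        spectrum-1|spectrum-3) min=$((2 * 1024)) ;;
        spectrum-2)            min=$((128 * 1024)) ;;
        *) return 2 ;;   # unknown ASIC label
    esac
    max=$((2 * 1024 * 1024 * 1024))
    # A power of 2 has exactly one bit set.
    [ "$burst" -gt 0 ] && [ $((burst & (burst - 1))) -eq 0 ] &&
        [ "$burst" -ge "$min" ] && [ "$burst" -le "$max" ]
)

burst_ok 131072 spectrum-2 && echo "128K is valid on Spectrum-2"
burst_ok 65536 spectrum-2 || echo "64K is too small on Spectrum-2"
```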
The following parameter is not offloaded, but is mandatory for the software datapath:
- limit - Hard limit for the queue's size.
As described above, each ETS (and PRIO) band represents a pair of traffic classes. Configuring TBF at a band sets up a shaper that applies to both UC and BUM traffic together.
Note that besides the shaper configured through TBF, mlxsw also automatically adds a minimum shaper of 200Mbps at a BUM TC. Thus any BUM traffic is guaranteed to get at least 200Mbps. Only on top of that does the TBF shaper apply to the combination of both traffic types.
When used as a root qdisc, TBF enables shaper for UC TC 0 and BUM TC 8.
The show command will show the current configuration including TBF's statistics.
# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
TBF reports all the usual counters.
The following example attaches a TBF qdisc under band 0 of an ETS parent whose handle is 10:. It configures a 400Mbps shaper with a burst size of 128KiB.
# tc qdisc add dev swp1 parent 10:1 handle 101: \
tbf rate 400Mbit burst 128K limit 1M
- man tc
- man tc-ets
- man tc-prio
- man tc-pfifo, man tc-bfifo
- man tc-red
- man tc-tbf
- Traffic Control HOWTO