This section gives a simplified description of the path a packet takes from the time it enters the ASIC until it is scheduled for transmission. It then describes the various quality of service (QoS) mechanisms supported by the ASIC and how they affect that path.

##### Table of Contents

1. [Packet Forwarding](#packet-forwarding)
1. [Tools](#tools)
   1. [Linux Kernel Configuration](#linux-kernel-configuration)
   1. [`iproute2` `dcb`](#iproute2-dcb)
   1. [Open LLDP](#open-lldp)
1. [Priority Assignment](#priority-assignment)
   1. [Trust PCP](#trust-pcp)
   1. [Trust DSCP](#trust-dscp)
   1. [Default Priority](#default-priority)
   1. [ACL-Based Priority Assignment](#acl-based-priority-assignment)
   1. [Priority Update in Forwarded IPv4 Packets](#priority-update-in-forwarded-ipv4-packets)
1. [Priority Group Buffers](#priority-group-buffers)
   1. [DCB Mode](#dcb-mode)
   1. [TC Mode](#tc-mode)
1. [Configuring Lossless Traffic](#configuring-lossless-traffic)
   1. [Shared Buffer Configuration](#shared-buffer-configuration)
   1. [The Xoff Threshold](#the-xoff-threshold)
   1. [Remote End Configuration](#remote-end-configuration)
   1. [Reception of PFC Packets](#reception-of-pfc-packets)
   1. [PAUSE Frames](#pause-frames)
1. [Traffic Scheduling](#traffic-scheduling)
   1. [Incompatibility Between DCB and TC](#incompatibility-between-dcb-and-tc)
   1. [Priority Map](#priority-map)
   1. [Transmission Selection Algorithms](#transmission-selection-algorithms)
   1. [Shared Buffer Configuration](#shared-buffer-configuration-1)
   1. [Remote End Configuration](#remote-end-configuration-1)
1. [DSCP Rewrite](#dscp-rewrite)
   1. [Trust DSCP](#trust-dscp-1)
   1. [ACL-Based DSCP Rewrite](#acl-based-dscp-rewrite)
1. [Shared Buffers](#shared-buffers)
   1. [Admission Rules](#admission-rules)
   1. [Pool Size](#pool-size)
   1. [Pool Threshold](#pool-threshold)
   1. [Pool Binding](#pool-binding)
   1. [Pool Occupancy](#pool-occupancy)
   1. [Handling of BUM Traffic](#handling-of-bum-traffic)
   1. [Default Shared Buffer Configuration](#default-shared-buffer-configuration)
1. [Descriptor Buffers](#descriptor-buffers)
1. [Control Plane Policing (CoPP)](#control-plane-policing-copp)
1. [Monitoring Using Prometheus](#monitoring-using-prometheus)
1. [Further Resources](#further-resources)

Packet Forwarding
-----------------

When a packet enters the chip, it is assigned a Switch Priority (SP), an internal identifier that informs how the packet is treated in the context of other traffic. The assignment is made based on packet headers and switch configuration; see [Priority Assignment](#priority-assignment) for details. Afterwards, the packet is directed to a priority group (PG) buffer in the port's headroom based on its SP. The port's headroom buffer is used to store incoming packets on the port while they go through the switch's pipeline, and also to store packets of a lossless flow if they are not allowed to enter the switch's shared buffer. However, if there is no room for the packet in the headroom, it is dropped. Mapping an SP to a PG buffer is explained in the [Priority Group Buffers section](#priority-group-buffers).

Once outside the switch's pipeline, the packet's ingress port, internal SP, egress port, and traffic class (TC) are known. Based on these parameters, the packet is classified to ingress and egress pools in the switch's shared buffer. For the packet to be eligible to enter the shared buffer, certain quotas associated with it need to be checked. These quotas and their configuration are described in the [shared buffers section](#shared-buffers).

The packet stays in the shared buffer until it is transmitted from its designated egress port. It is queued for transmission according to its TC. Once in its assigned queue, the packet is scheduled for transmission based on the transmission selection algorithm (TSA) employed on the TC and its various parameters.
The mapping from SP to TC and the TSA configuration are discussed in the [Traffic Scheduling section](#traffic-scheduling).

Packets which are not eligible to enter the shared buffer stay in the headroom if they are associated with a lossless flow (mapped to a lossless PG). Otherwise, they are dropped. The configuration of lossless flows is discussed in the [Configuring Lossless Traffic section](#configuring-lossless-traffic).

#### Features by Version

| Kernel Version | Features                                                        |
|:--------------:|:----------------------------------------------------------------|
| **4.7**        | Shared buffers, Trust PCP                                       |
| **4.19**       | Trust DSCP, DSCP rewrite, `net.ipv4.ip_forward_update_priority` |
|                | Dedicated pool & TCs for BUM traffic                            |
| **4.20**       | BUM pool & TCs exposed in devlink-sb                            |
|                | Minimum shaper configured on BUM TCs                            |
| **5.1**        | Spectrum-2 support                                              |
| **5.6**        | Setting port-default priority                                   |
| **5.7**        | ACL-based priority assignment, ACL-based DSCP rewrite           |
| **5.10**       | Support for DCB buffer commands                                 |

Tools
-----

Quality of service on mlxsw is configured predominantly through three interfaces: DCB, TC and devlink. All three can be configured through `iproute2` tools. [`Open LLDP`][4] can be used to configure DCB as well, and is covered together with the `iproute2` tool in this document. For an overview of DCB, and particularly DCB operation in Linux, please refer to [this article][3].

#### Linux Kernel Configuration

To enable DCB support in the `mlxsw_spectrum` driver, `CONFIG_MLXSW_SPECTRUM_DCB=y` is needed.

#### `iproute2` `dcb`

`iproute2` has had support for configuring DCB through the `dcb` tool since release 5.11, although the `app` object was not added until 5.12.

#### Open LLDP

Open LLDP is a daemon that handles the LLDP protocol, and can configure individual stations in the LLDP network with respect to traffic losslessness, egress traffic scheduling, and other attributes.
Once run, it takes over the switch and manages individual interfaces through the Linux DCB interface. For further information on LLDP support, it is advised to go over the [LLDP document](Link-Layer-Discovery-Protocol).

For most of the features, it is immaterial whether Open LLDP or `iproute2` is chosen. However, the `iproute2` tool has more complete support; e.g., the DCB buffer object currently cannot be configured through Open LLDP.

Priority Assignment
-------------------

The Switch Priority (SP) of a packet can be derived from packet headers, or assigned by default. Which headers are used when deriving the SP of a packet depends on the configured trust level of the port through which the packet ingresses. `mlxsw` currently recognizes two trust levels: "Trust PCP" (or Trust L2; this is the default) and "Trust DSCP" (or Trust L3).

**Note:** In Linux, packets are stored in a data structure called a [socket buffer (SKB)][5]. One of the fields of the SKB is the `priority` field. In the ASIC, the entity corresponding to the SKB priority is the Switch Priority (SP). Note that these two values, SKB priority and Switch Priority, are distinct, and the Switch Priority assigned to a packet is not projected to the SKB priority. If forwarding of part of the traffic is handled on the CPU, and traffic prioritization is important, SKBs need to be assigned a priority anew, e.g. using software [tc filters](ACLs).

#### Trust PCP

By default, ports are in trust-PCP mode. In that case, the switch prioritizes a packet based on its [IEEE 802.1p][1] priority. This priority is specified in the packet's [Priority Code Point (PCP)][2] field, which is part of the packet's VLAN header. The mapping between PCP and SP is 1:1 and cannot be changed. If the packet does not have 802.1p tagging, it is assigned the [port-default priority](#default-priority).

#### Trust DSCP

Each port can be configured to set the Switch Priority of packets based on the [DSCP field][9] of IPv4 and IPv6 headers.
The priority is assigned to individual DSCP values through DCB APP entries. After the first APP rule is added for a given port, the port's trust level is toggled to DSCP. It stays in this mode until all DSCP APP rules are removed again.

Use the `dcb app` attribute `dscp-prio` to manage the DSCP APP entries:

```
$ dcb app add dev <dev> dscp-prio <dscp>:<prio>   # Insert rule.
$ dcb app del dev <dev> dscp-prio <dscp>:<prio>   # Delete rule.
$ dcb app show dev <dev> dscp-prio                # Show rules.
```

For example, to assign priority 3 to packets with DSCP 24 (symbolic name CS3):

```
$ dcb app add dev swp1 dscp-prio 24:3    # Either using the numerical value
$ dcb app add dev swp1 dscp-prio CS3:3   # Or using the symbolic name
$ dcb app show dev swp1 dscp-prio
dscp-prio CS3:3
$ dcb -N app show dev swp1 dscp-prio
dscp-prio 24:3
```

In Open LLDP, the DCB APP entries are configured as follows:

```
$ lldptool -T -i <dev> -V APP app=<prio>,5,<dscp>      # Insert rule.
$ lldptool -T -i <dev> -V APP -d app=<prio>,5,<dscp>   # Delete rule.
$ lldptool -t -i <dev> -V APP -c app                   # Show rules.
```

**Note:** Support for the DSCP APP rules in lldpad was introduced by [this patch][8]. If the above commands give a "selector out of range" error, the package that you are using does not carry the patch.

**Note:** The use of selector 5 is described in the draft standard 802.1Qcd/D2.1, Table D-9.

The Linux DCB interface allows a very flexible configuration of the DSCP-to-SP mapping, to the point of permitting two priorities to be configured for the same DSCP value. Such conflicts are resolved in favor of the highest configured priority. For example:

```
$ dcb app add dev swp1 dscp-prio 24:3   # Configure 24->3.
$ dcb app add dev swp1 dscp-prio 24:2   # Keep 24->3.
$ dcb app show dev swp1 dscp-prio
dscp-prio CS3:2 CS3:3
$ dcb app del dev swp1 dscp-prio 24:3   # Configure 24->2.
```

`dcb` has syntax sugar for the above sequence in the form of the `replace` command.
The following runs the exact same set of DCB commands under the hood:

```
$ dcb app replace dev swp1 dscp-prio 24:3   # Configure 24->3.
$ dcb app replace dev swp1 dscp-prio 24:2   # Configure 24->2.
```

When a packet arrives with a DSCP value that has no corresponding APP rule, or a non-IP packet arrives, it is assigned the [port-default priority](#default-priority) instead. Trust-DSCP mode disables PCP prioritization even for non-IP packets, which carry no DSCP value: such packets get the [port-default priority](#default-priority) even if they have 802.1p tagging.

**Note:** The Spectrum chip also supports a "trust both" mode, where the Switch Priority is assigned based on PCP when DSCP is not available; however, this mode is currently not supported by `mlxsw`.

#### Default Priority

A default value is assigned to a packet's SP when:

- it does not have 802.1p tagging and ingresses through a trust-PCP port
- it is a non-IPv4/non-IPv6 packet and ingresses through a trust-DSCP port

The default port-default priority is 0. It can however be configured, similarly to the DSCP-to-priority mapping. As with the DSCP APP rules, Linux allows configuration of several default priorities. Again, `mlxsw` chooses the highest one configured.

Use the `dcb app` attribute `default-prio` to configure the default priority:

```
$ dcb app add dev <dev> default-prio <prio>       # Insert rule for default priority <prio>.
$ dcb app del dev <dev> default-prio <prio>       # Delete rule for default priority <prio>.
$ dcb app replace dev <dev> default-prio <prio>   # Set default priority to <prio>.
```

When using Open LLDP, this is configured as an "APP" rule with selector 1 (Ethertype) and PID 0, which denotes the "default application priority [...] when application priority is not otherwise specified":

```
$ lldptool -T -i <dev> -V APP app=<prio>,1,0      # Insert rule for default priority <prio>.
$ lldptool -T -i <dev> -V APP -d app=<prio>,1,0   # Delete rule for default priority <prio>.
```

**Note:** The use of selector 1 is standardized in 802.1Q-2014, Table D-9.
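The highest-value-wins resolution used for both the DSCP APP rules and the default-priority rules can be modeled with a short Python sketch; `resolve_dscp_map` is a hypothetical name, not part of any tool:

```python
def resolve_dscp_map(entries):
    """Model of the APP conflict resolution described above: when several
    rules exist for the same DSCP value, the highest priority wins."""
    resolved = {}
    for dscp, prio in entries:
        if dscp not in resolved or prio > resolved[dscp]:
            resolved[dscp] = prio
    return resolved

# Mirrors the example above: with both 24->3 and 24->2 configured, 24->3 wins.
print(resolve_dscp_map([(24, 3), (24, 2)]))  # {24: 3}
```

The same function models the default-priority case if the entries are keyed by a single dummy value.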
#### ACL-Based Priority Assignment

Spectrum switches allow the prioritization decision described in the previous paragraphs to be overridden in the ACL engine. The [[ACLs]] page describes how to configure filters. To change packet priority, use the action `skbedit priority`:

```
$ tc filter add dev swp1 ingress flower action skbedit priority 5
```

In the SW datapath, it is possible to assign arbitrary priorities and use the `skbedit priority` action to select a particular traffic class, but this usage is not offloaded. Only priorities 0-7 are currently allowed for the HW datapath. The priority override is allowed both on ingress and on egress of a netdevice.

**Note:** The priority override only happens after the packet's priority group is resolved, and does not change that decision. Therefore, when changing from a lossless priority to a lossy one, the now-lossy packet is still subject to flow control, and vice versa. When changing from one lossless priority to another, the flow control is performed on the wrong priority. The only reprioritization that works and is supported is therefore from one lossy priority to another lossy priority. This is a HW limitation.

#### Priority Update in Forwarded IPv4 Packets

In Linux, the priority of forwarded IPv4 SKBs is updated according to the TOS value in the IPv4 header. The mapping is fixed and cannot be configured. `mlxsw` configures the switch to likewise update packet priority after routing, using the same values. However, unlike the software path, this is done for both IPv4 and IPv6.

**Note:** The actual mapping from TOS to SKB priority is shown in `man tc-prio`, section "QDISC PARAMETERS", parameter "priomap".

As of 4.19, it is possible to turn this behavior off through a sysctl:

```
$ sysctl -w net.ipv4.ip_forward_update_priority=0
```

This disables the IPv4 priority update after forwarding in the slow path, as well as both the IPv4 and IPv6 post-routing priority update in the chip.
In other words, when this sysctl is cleared, the priority assigned to a packet at ingress is preserved after the packet is routed.

Priority Group Buffers
----------------------

Priority group buffers (PG buffers) are the area of the switch shared buffer where packets are kept while they go through the switch's pipeline. For lossless flows, this is also the area where traffic is kept before it can be admitted to the shared buffer.

After packet priority is assigned through the [trust PCP](#trust-pcp), [trust DSCP](#trust-dscp) or [default priority](#default-priority) mechanisms, the switch inspects the priority map of the port, and determines which PG buffer should host a given packet. PGs 0 to 7 can be configured this way. PG8 is never used, and control traffic is always directed to PG9.

**Note:** Changing the priority in the ACL and in the router does not impact PG selection.

The way the PG buffers and the priority map are configured depends on the way that the port egress is configured. The rest of this section describes the details.

#### Buffer Size Granularity

The Spectrum ASIC allocates the chip memory in units of cells. The cell size depends on the particular chip, and is reported by `devlink`:

```
$ devlink sb pool show
pci/0000:03:00.0: sb 0 pool 0 type ingress size 13768608 thtype dynamic cell_size 96
[...]
```

Computed or requested buffer sizes are rounded up to the nearest cell size boundary.

#### DCB Mode

When traffic scheduling on the port is configured using the [DCB ETS commands](#traffic-scheduling), the port is in DCB mode. In that case, the buffer sizes of PGs 0-7 are configured automatically. Unused buffers have a zero size. The priority map is deduced from the [ETS priority map](#priority-map), as mapping different flows to different TCs at egress also implies that they should be separated at ingress. For example,
the following command will, in addition to the egress configuration, configure the headroom priority map such that traffic with priorities 0-3 is assigned to PG0, and traffic with priorities 4-7 to PG1:

```
$ dcb ets set dev swp1 prio-tc {0..3}:0 {4..7}:1
```

Use `dcb buffer` to inspect the buffer configuration:

```
$ dcb buffer show dev swp1
prio-buffer 0:0 1:0 2:0 3:0 4:1 5:1 6:1 7:1   <-- priority map
buffer-size 0:3Kb 1:3Kb 2:0b 3:0b 4:0b 5:0b 6:0b 7:0b
total-size 16416b
```

**Note:** The interface only allows showing the size of the first 8 PG buffers, out of the total of 10 buffers and an internal mirroring buffer that the Spectrum ASIC has. The reported `total-size` shows the full allocated size of the headroom, including these hidden components, and therefore is not a simple sum of the individual PG buffer sizes.

#### TC Mode

When egress traffic scheduling is configured using [qdiscs](Queues-Management), the port transitions to TC mode. In this mode, the sizes of PG buffers are configured by hand using `dcb buffer`:

```
$ dcb buffer set dev swp1 buffer-size all:0 0:25K 1:25K
```

**Note:** There is a minimum allowed size for a used PG buffer. The system will always configure at least this minimum size, which is equal to [the Xoff threshold](#the-xoff-threshold).

The priority map is also configured directly, through the `dcb buffer` command:

```
$ dcb buffer set dev swp1 prio-buffer {0..3}:0 {4..7}:1
```

And to inspect the configuration:

```
$ dcb buffer show dev swp1
prio-buffer 0:0 1:0 2:0 3:0 4:1 5:1 6:1 7:1
buffer-size 0:25632b 1:25632b 2:0b 3:0b 4:0b 5:0b 6:0b 7:0b
total-size 61536b
```

**Note:** As [above](#dcb-mode), `total-size` is not a simple sum of the shown PG sizes.

**Note:** Configuring the sizes of PG buffers and the priority map while the port is in DCB mode is forbidden.

Configuring Lossless Traffic
----------------------------

Packets which are not eligible to enter the shared buffer can stay in the headroom if they are associated with a lossless flow.
To configure a PG as lossless, use the `dcb pfc` attribute `prio-pfc`. For example, to enable PFC for priorities 1, 2 and 3, run:

```
$ dcb pfc set dev swp1 prio-pfc all:off 1:on 2:on 3:on
$ dcb pfc show dev swp1 prio-pfc
prio-pfc 0:off 1:on 2:on 3:on 4:off 5:off 6:off 7:off
```

This command enables PFC for both Rx and Tx.

In Open LLDP, this is configured through the `enabled` attribute of the `PFC` TLV using `lldptool`:

```
$ lldptool -T -i swp1 -V PFC enabled=1,2,3
```

**Note:** Setting a priority as lossless and mapping it to a PG buffer along with lossy priorities yields unpredictable behavior.

#### Shared Buffer Configuration

Besides enabling PFC, [shared buffers](#shared-buffers) need to be configured suitably as well. Packets associated with a lossless flow can only stay in the headroom if the following conditions hold:

* The per-Port and per-{Port, TC} quotas of the egress port are set to the maximum allowed value.
* When binding a lossless PG to a pool, the size of the pool and the quota assigned to the {Port, PG} must be non-zero.

**Note:** The threshold type of the default egress pool (pool #4) is dynamic and cannot be changed. Therefore, when PFC is enabled, a different egress pool needs to be used.

#### The Xoff Threshold

Packets with PFC-enabled priorities are allowed to stay in their assigned PG buffer in the port's headroom, but if the PG buffer is full, they are dropped. To prevent that from happening, there is a point in the PG buffer called the Xoff threshold. Once the amount of traffic in the PG reaches that threshold, the switch sends a PFC packet telling the transmitter to stop sending (Xoff) traffic for all the priorities sharing this PG. Once the amount of data in the PG buffer goes below the threshold, a PFC packet is transmitted telling the other side to resume transmission (Xon) again.

The Xon/Xoff threshold is autoconfigured, and always equal to 2 * (MTU rounded up to the [cell size](#buffer-size-granularity)).
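As an illustration of this computation, a minimal Python sketch, assuming the 96-byte cell size reported by `devlink` in the example earlier; the helper names are made up for the illustration:

```python
CELL_SIZE = 96  # bytes; chip-specific, reported by `devlink sb pool show`

def round_up_to_cells(size, cell=CELL_SIZE):
    """Buffer sizes are rounded up to the nearest cell boundary."""
    return -(-size // cell) * cell

def xoff_threshold(mtu, cell=CELL_SIZE):
    """Autoconfigured Xon/Xoff threshold: 2 * (MTU rounded up to cell size)."""
    return 2 * round_up_to_cells(mtu, cell)

print(xoff_threshold(1500))  # 2 * 1536 = 3072
```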
Furthermore, 8-lane ports (regardless of the negotiated number of lanes) use two buffers among which the configured value is split, so the Xoff threshold needs to be doubled again.

Even after sending the PFC packet, traffic keeps arriving until the transmitter receives and processes the PFC packet. This amount of traffic is known as the PFC delay allowance.

In **DCB mode**, the delay allowance can be configured through the `dcb pfc` attribute `delay`:

```
$ dcb pfc set dev swp1 delay 32768
```

When using Open LLDP, this can be set using the PFC `delay` key:

```
$ lldptool -T -i swp1 -V PFC delay=32768
```

The maximum delay configurable through this interface is 65535 bits.

**Note:** In the worst-case scenario, the delay is made up of packets that are all one byte larger than the cell size, which means each packet requires almost twice its true size when buffered in the switch. Furthermore, when the PAUSE or PFC frame is received, the host may already have started transmitting another MTU-sized frame. The full formula for the delay allowance size therefore is 2 * (delay in bytes rounded up to [cell size](#buffer-size-granularity)) + (MTU rounded up to cell size).

In **TC mode**, instead of configuring the delay allowance, the PG buffer size is set exactly. The Xoff threshold is still autoconfigured, and whatever is above the Xoff mark is the delay allowance. The resulting PG buffer:

```
+----------------+ +
|                | |
|                | |
|                | |
|                |  > Delay
|                | |
|                | |
|                | |
+----------------+ +  <-- Xon/Xoff threshold
|                | |
|                |  > 2 * MTU
|                | |
+----------------+ +
```

If packets continue to be dropped, the delay value should be increased. The Xoff threshold is always autoconfigured.

#### Remote End Configuration

For PFC to work properly, both sides of the link need to be configured correctly. One can use `dcb pfc` on the remote end in the same way that it is used on the switch.
However, if LLDP is used, it can take care of configuring the remote end. Run:

```
$ lldptool -T -i swp1 -V PFC willing=no
$ lldptool -T -i swp1 -V PFC enableTx=yes
```

And on the host connected to the switch:

```
$ lldptool -T -i enp6s0 -V PFC willing=yes
```

When the host receives the switch's PFC TLV, it will use its settings:

```
host$ lldptool -t -i enp6s0 -V PFC
IEEE 8021QAZ PFC TLV
	 Willing: yes
	 MACsec Bypass Capable: no
	 PFC capable traffic classes: 8
	 PFC enabled: 1 2 3
```

#### Reception of PFC Packets

When a PFC packet is received by a port, it stops the TCs to which the priorities set in the PFC packet are mapped.

**Note:** A PFC packet received for a PFC-enabled priority stops lossy priorities from transmitting if they are mapped to the same TC as the lossless priority.

#### PAUSE Frames

To enable PAUSE frames on a port, run:

```
$ ethtool -A swp1 autoneg off rx on tx on
```

To query the PAUSE frame parameters, run:

```
$ ethtool -a swp1
Pause parameters for swp1:
Autonegotiate:	off
RX:		on
TX:		on
```

Unlike with PFC, it is not possible to set the delay parameter; it is hardcoded as 155 Kbit. This is larger than what PFC allows, and is set according to a worst-case scenario of a 100m cable. Due to this setting, if too many PG buffers are used and the MTU is too large, the configuration may not fit within the port headroom limits, and may be rejected. E.g.:

```
$ dcb ets set dev swp1 prio-tc 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
$ ethtool -A swp1 autoneg off rx on tx on
$ ip link set dev swp1 mtu 10000
RTNETLINK answers: No buffer space available
```

**Note:** It is not possible to have both PFC and PAUSE frames enabled on a port at the same time. Trying to do so generates an error.

Traffic Scheduling
------------------

Enhanced Transmission Selection (ETS) is an 802.1Q-standardized way of assigning the available bandwidth to traffic lined up to egress through a given port.
Based on priority (Switch Priority in the case of the HW datapath), each packet is put in one of several available queues, called traffic classes (TCs).

#### Incompatibility Between DCB and TC

`mlxsw` supports two interfaces for configuring traffic scheduling: DCB (described here) and [TC](Queues-Management). It is necessary to choose one of these approaches and stick to it. Configuring qdiscs overwrites the DCB configuration present at the time, and configuring DCB overwrites the qdisc configuration.

#### Priority Map

When the forwarding pipeline determines the egress port of a packet, the Switch Priority together with the priority map is used to decide which TC the packet should be put in.

Use the `dcb ets` attribute `prio-tc` to configure the priority map:

```
$ dcb ets set dev swp1 prio-tc 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
```

The above command creates a 1:1 mapping from SP to TC. In addition, as explained [elsewhere](#dcb-mode), the command creates a 1:1 mapping between SP and PG buffer.

With Open LLDP, use the `ETS-CFG` TLV's `up2tc` field. E.g.:

```
$ lldptool -T -i swp1 -V ETS-CFG up2tc=0:0,1:1,2:2,3:3,4:4,5:5,6:6,7:7
```

#### Transmission Selection Algorithms

When assigning bandwidth to the queued-up traffic, the switch takes into account what transmission selection algorithm (TSA) is configured on each TC. `mlxsw` supports two TSAs: the default Strict Priority algorithm, and ETS. When selecting packets for transmission, strict TCs are tried first, in order of decreasing TC index. When there is no more traffic in any of the strict bands, bandwidth is distributed among the traffic from the ETS bands according to their configured weights.

The TSAs can be configured through the `tc-tsa` attribute:

```
$ dcb ets set dev swp1 tc-tsa all:strict
```

With Open LLDP, TSAs are configured through the `ETS-CFG` TLV's `tsa` field.
For example:

```
$ lldptool -T -i swp1 -V ETS-CFG \
      tsa=0:strict,1:strict,2:strict,3:strict,4:strict,5:strict,6:strict,7:strict
```

ETS is implemented in the ASIC using a [weighted round robin (WRR)][6] algorithm. The device requires that the sum of the weights used amounts to `100`; otherwise, changes do not take effect. Thus, when configuring TSAs to use ETS, it is typically also necessary to adjust the bandwidth percentage allotment. This is done through the `dcb ets` attribute `tc-bw`:

```
$ dcb ets set dev swp1 tc-tsa all:ets tc-bw {0..3}:12 {4..7}:13
$ dcb ets show dev swp1 tc-tsa tc-bw
tc-tsa 0:ets 1:ets 2:ets 3:ets 4:ets 5:ets 6:ets 7:ets
tc-bw 0:12 1:12 2:12 3:12 4:13 5:13 6:13 7:13
```

In Open LLDP, that is done through the `ETS-CFG` TLV's `tcbw` field, whose argument is a list of bandwidth allotments, one for each TC:

```
$ lldptool -T -i swp1 -V ETS-CFG \
      tsa=0:ets,1:ets,2:ets,3:ets,4:ets,5:ets,6:ets,7:ets \
      tcbw=12,12,12,12,13,13,13,13
```

#### Shared Buffer Configuration

In case of congestion, if the criteria for admission of a packet to the shared buffer are not met, and the packet in question is not in a lossless PG, it will be dropped. That is the case even if the packet is of a higher priority, or mapped to a higher-precedence TC, than the ones already admitted to the shared buffer. In other words, once a packet is in the shared buffer, there is no way to punt it except through blunt tools such as the Switch Lifetime Limit. It is therefore necessary to configure pool sizes and quotas in such a way that there is always room for high-priority traffic to be admitted.
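The ETS weight requirement described above (weights must sum to exactly 100) can be sanity-checked with a short Python sketch; `ets_shares` is a hypothetical helper, not part of `dcb`:

```python
def ets_shares(tc_bw, link_bps):
    """Split link bandwidth among ETS TCs according to their weights.
    The device requires the weights to sum to exactly 100."""
    if sum(tc_bw) != 100:
        raise ValueError("ETS weights must sum to 100")
    return [w * link_bps // 100 for w in tc_bw]

# The tc-bw {0..3}:12 {4..7}:13 example above, on a 100 Gb/s link:
print(ets_shares([12, 12, 12, 12, 13, 13, 13, 13], 100_000_000_000))
```

Note that this models only the long-term bandwidth split under saturation; idle ETS bands' bandwidth is redistributed among the active ones in proportion to their weights.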
#### Remote End Configuration

To allow neighbouring hosts to learn about the ETS configuration when using LLDP, run:

```
$ lldptool -T -i swp1 -V ETS-CFG enableTx=yes
```

This can be verified on a neighbouring host by running:

```
host$ lldptool -i enp6s0 -n -t -V ETS-CFG
IEEE 8021QAZ ETS Configuration TLV
	 Willing: yes
	 CBS: not supported
	 MAX_TCS: 8
	 PRIO_MAP: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
	 TC Bandwidth: 12% 12% 12% 12% 13% 13% 13% 13%
	 TSA_MAP: 0:ets 1:ets 2:ets 3:ets 4:ets 5:ets 6:ets 7:ets
```

DSCP Rewrite
------------

#### Trust DSCP

Packets that ingress the switch through a port in trust-DSCP mode will have their DSCP value updated as they egress the switch. The same DSCP APP rules that are used for packet prioritization are used to configure the rewrite as well. If several DSCP values end up resolving to the same priority (as they probably will), the highest DSCP is favored. For example:

```
$ dcb app add dev swp1 dscp-prio 26:3   # Configure 3->26.
$ dcb app add dev swp1 dscp-prio 24:3   # Keep 3->26.
$ dcb app del dev swp1 dscp-prio 26:3   # Configure 3->24.
```

If there is no rule matching the priority of a packet, the DSCP of that packet is rewritten to zero. As a consequence, if no rules at all are configured at the egress port, all DSCP values are rewritten to zero.

It is worth repeating that the decision to rewrite is made as the packet ingresses the switch through a trust-DSCP port, but the rewrite map to use is taken from the egress port. As a consequence, in mixed-trust switches, if a packet ingresses through a trust-PCP port and egresses through a trust-DSCP port, its DSCP value stays intact. That means that a DSCP value can leak from one domain to another, where it may have a different meaning. Therefore, when running a switch with a mixture of trust levels, one needs to take care that this is not a problem.

#### ACL-Based DSCP Rewrite

Spectrum switches allow rewriting a packet's DSCP value in the ACL engine.
The [[ACLs]] page describes how to configure filters. To change a packet's DSCP, add a filter with the [action `pedit`](ACLs#editing-packet-headers). For IPv4:

```
$ tc filter add dev swp1 ingress prot ip flower skip_sw \
      action pedit ex munge ip tos set $((dscp << 2)) retain 0xfc
```

And for IPv6:

```
$ tc filter add dev swp1 ingress prot ipv6 flower skip_sw \
      action pedit ex munge ip6 traffic_class set $((dscp << 2)) retain 0xfc
```

Packets whose DSCP value is rewritten this way are not subject to the DSCP rewrite described [above](#trust-dscp-1).

Shared Buffers
--------------

As explained above, packets are admitted to the switch's shared buffer from the port's headroom, and stay there until they are transmitted out of the switch. The device has two types of pools, ingress and egress. The pools are used as containers for packets, and allow a user to limit the amount of traffic:

* From a port
* From a {Port, PG buffer}
* To a port
* To a {Port, TC}

The limit can be either a specified amount of bytes (static) or a percentage of the remaining free space (dynamic).

#### Admission Rules

Once out of the switch's pipeline, a packet is admitted into the shared buffer only if all four quotas mentioned above are below the configured threshold:

- `Ingress{Port}.Usage < Thres`
- `Ingress{Port,PG}.Usage < Thres`
- `Egress{Port}.Usage < Thres`
- `Egress{Port,TC}.Usage < Thres`

A packet admitted to the shared buffer updates all four usages.

#### Pool Size

To configure a pool's size and threshold type, run:

```
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 12401088 thtype dynamic
```

To see the current settings of a pool, run:

```
$ devlink sb pool show pci/0000:03:00.0 pool 0
```

**Note:** Control packets (e.g. LACP, STP) use ingress pool number 9 and cannot be bound to a different pool. It is therefore important to configure it with a suitable size. Prior to kernel 5.2, such packets used ingress pool number 3.
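The four admission checks can be modeled in a few lines of Python. The counter and threshold tables, the key names, and the packet-granularity accounting are all simplifications made up for the illustration (the hardware accounts in bytes/cells):

```python
def admit(usage, thres, in_port, pg, out_port, tc):
    """A packet enters the shared buffer only if all four quotas are below
    their thresholds; once admitted, it bumps all four usage counters."""
    keys = [
        ("ingress-port", in_port),
        ("ingress-pg", (in_port, pg)),
        ("egress-port", out_port),
        ("egress-tc", (out_port, tc)),
    ]
    if any(usage.get(k, 0) >= thres[k] for k in keys):
        return False  # dropped, unless the packet sits in a lossless PG
    for k in keys:
        usage[k] = usage.get(k, 0) + 1
    return True

usage = {}
thres = {("ingress-port", 1): 2, ("ingress-pg", (1, 0)): 1,
         ("egress-port", 2): 2, ("egress-tc", (2, 0)): 2}
print(admit(usage, thres, 1, 0, 2, 0))  # True: all quotas below threshold
print(admit(usage, thres, 1, 0, 2, 0))  # False: {Port, PG} quota reached
```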
#### Pool Threshold

Limiting the usage of a flow in a pool can be done using either a static or a dynamic threshold. The threshold type is a pool property, and is set as follows:

```
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 12401088 thtype static
```

To set a dynamic threshold, run:

```
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 12401088 thtype dynamic
```

#### Pool Binding

To bind packets originating from a {Port, PG} to an ingress pool, run:

```
$ devlink sb tc bind set pci/0000:03:00.0/1 tc 0 type ingress pool 0 th 9
```

Or use the port name instead:

```
$ devlink sb tc bind set swp1 tc 0 type ingress pool 0 th 9
```

If the pool's threshold is dynamic, then the value specified as the threshold is used to calculate the `alpha` parameter:

```
alpha = 2 ^ (th - 10)
```

The range of the passed value is between 3 and 16. The computed `alpha` is used to determine the maximum usage of the flow according to the following formula:

```
max_usage = alpha / (1 + alpha) * Free_Buffer
```

Where `Free_Buffer` is the amount of non-occupied buffer in the relevant pool. The following table shows the possible `th` values and their corresponding maximum usage:

| th | alpha     | max_usage |
|:--:|:---------:|:---------:|
| 3  | 0.0078125 | 0.77%     |
| 4  | 0.015625  | 1.53%     |
| 5  | 0.03125   | 3.03%     |
| 6  | 0.0625    | 5.88%     |
| 7  | 0.125     | 11.11%    |
| 8  | 0.25      | 20%       |
| 9  | 0.5       | 33.33%    |
| 10 | 1         | 50%       |
| 11 | 2         | 66.66%    |
| 12 | 4         | 80%       |
| 13 | 8         | 88.88%    |
| 14 | 16        | 94.11%    |
| 15 | 32        | 96.96%    |
| 16 | 64        | 98.46%    |

To see the current binding of a {Port, PG} to an ingress pool, run:

```
$ devlink sb tc bind show swp1 tc 0 type ingress
swp1: sb 0 tc 0 type ingress pool 0 threshold 10
```

Similarly for egress, to bind packets directed to a {Port, TC} to an egress pool, run:

```
$ devlink sb tc bind set swp1 tc 0 type egress pool 4 th 9
```

If the pool's threshold is static, then its value is treated as the maximum number of bytes that can be used by the flow.
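The dynamic-threshold arithmetic can be reproduced with a short Python sketch (the percentages in the table above are rounded):

```python
def max_usage_pct(th):
    """alpha = 2^(th - 10); max_usage = alpha / (1 + alpha), in percent."""
    alpha = 2 ** (th - 10)
    return 100 * alpha / (1 + alpha)

for th in range(3, 17):
    print(f"th={th:2}  alpha={2 ** (th - 10):>10}  max_usage={max_usage_pct(th):6.2f}%")
```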
The admission rule requires that the port's usage is also smaller than the maximum usage. To set a threshold for a port, run:

```
$ devlink sb port pool set swp1 pool 0 th 15
```

The static threshold can be used to set minimal and maximal usage. To set minimal usage, set the static threshold to 0, in which case the flow never enters the specified pool. Maximal usage can be configured by setting the threshold to the pool's size or larger.

#### Pool Occupancy

It is possible to take a snapshot of the shared buffer usage with the following command:

```
$ devlink sb occupancy snapshot pci/0000:03:00.0
```

Once the snapshot is taken, the user may query the current and maximum usage by:

* Pool
* {Port, Pool}
* {Port, PG}
* {Port, TC}

This is especially useful when trying to determine optimal sizes and thresholds.

**Note:** The snapshot is not atomic. However, the interval between the measurements composing it is kept as short as possible, making the snapshot as accurate as possible.

**Note:** Queries following a failed snapshot are invalid.
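A scripted snapshot workflow might look as follows. This is a sketch under stated assumptions: it assumes `devlink` is in `PATH`, uses a placeholder device address, and the helper names (`devlink_argv`, `snapshot_occupancy`, `clear_watermarks`) are hypothetical:

```python
import subprocess

DEV = "pci/0000:03:00.0"  # placeholder device address; adjust for your system

def devlink_argv(*args):
    """Build a devlink argument vector (separated out so it can be inspected)."""
    return ["devlink", *args]

def snapshot_occupancy(port):
    """Take a shared buffer snapshot, then query it for `port`.

    Queries following a failed snapshot are invalid, hence check=True so
    a snapshot failure raises instead of being silently ignored."""
    subprocess.run(devlink_argv("sb", "occupancy", "snapshot", DEV), check=True)
    result = subprocess.run(devlink_argv("sb", "occupancy", "show", port),
                            check=True, capture_output=True, text=True)
    return result.stdout

def clear_watermarks():
    """Reset maximum-usage tracking; new maxima accrue from this point on."""
    subprocess.run(devlink_argv("sb", "occupancy", "clearmax", DEV), check=True)
```

Repeatedly calling `snapshot_occupancy()` under a representative traffic load, with `clear_watermarks()` between runs, is one way to collect the data needed to tune pool sizes and thresholds.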
To monitor the current and maximum occupancy of a port in a pool and to display current and maximum usage of {Port, PG/TC} in a pool, run:

```
$ devlink sb occupancy show swp1
swp1:
  pool:  0:     0/0    1:     0/0    2:     0/0    3:     0/0
         4:     0/0    5:     0/0    6:     0/0    7:     0/0
         8:     0/0    9:     0/0   10:     0/0
  itc:  0(0):   0/0   1(0):   0/0   2(0):   0/0   3(0):   0/0
        4(0):   0/0   5(0):   0/0   6(0):   0/0   7(0):   0/0
  etc:  0(4):   0/0   1(4):   0/0   2(4):   0/0   3(4):   0/0
        4(4):   0/0   5(4):   0/0   6(4):   0/0   7(4):   0/0
        8(8):   0/0   9(8):   0/0  10(8):   0/0  11(8):   0/0
       12(8):   0/0  13(8):   0/0  14(8):   0/0  15(8):   0/0
```

For the CPU port, run:

```
$ devlink sb occupancy show pci/0000:03:00.0/0
pci/0000:03:00.0/0:
  pool:  0:     0/0    1:     0/0    2:     0/0    3:     0/0
         4:     0/0    5:     0/0    6:     0/0    7:     0/0
         8:     0/0    9:     0/0   10:     0/0
  itc:  0(0):   0/0   1(0):   0/0   2(0):   0/0   3(0):   0/0
        4(0):   0/0   5(0):   0/0   6(0):   0/0   7(0):   0/0
  etc:  0(10):  0/0   1(10):  0/0   2(10):  0/0   3(10):  0/0
        4(10):  0/0   5(10):  0/0   6(10):  0/0   7(10):  0/0
        8(10):  0/0   9(10):  0/0  10(10):  0/0  11(10):  0/0
       12(10):  0/0  13(10):  0/0  14(10):  0/0  15(10):  0/0
```

Where `pci/0000:03:00.0/0` represents the CPU port.

**Note:** For the CPU port, the egress direction represents traffic trapped to the CPU. The ingress direction is reserved.

To clear the maximum usage (watermark), run:

```
$ devlink sb occupancy clearmax pci/0000:03:00.0
```

A new maximum usage is tracked from the time the clear operation is performed.

#### Handling of BUM Traffic

Flood traffic is subject to special treatment by the ASIC. Such traffic is commonly referred to as BUM, for broadcast, unknown-unicast and multicast packets. BUM traffic is prioritized and scheduled separately from other traffic, using egress TCs 8–15, as opposed to TCs 0–7 for unicast traffic. Thus a BUM packet that would otherwise be assigned to some TC X is assigned to TC X+8 instead. The pairs of unicast and corresponding BUM TCs are then configured to strictly prioritize the unicast traffic: TC 0 is strictly prioritized over TC 8, TC 1 over TC 9, etc.
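The TC pairing scheme just described can be summarized in a couple of helpers (an illustrative sketch of the mapping, not driver code):

```python
# Illustrative sketch of the unicast/BUM egress TC pairing described above:
# unicast traffic uses TCs 0-7, BUM traffic the paired TCs 8-15.

BUM_TC_OFFSET = 8

def egress_tc(tc, is_bum):
    """A BUM packet that would use TC X is assigned TC X+8 instead."""
    if not 0 <= tc <= 7:
        raise ValueError("base TC must be in the range 0..7")
    return tc + BUM_TC_OFFSET if is_bum else tc

def paired_tcs():
    """Yield (unicast TC, BUM TC) pairs; the unicast TC of each pair is
    configured to be strictly prioritized over its BUM counterpart."""
    return [(tc, tc + BUM_TC_OFFSET) for tc in range(8)]
```

So a BUM packet classified to TC 3 is queued on TC 11, and is only scheduled when TC 3 has nothing to send (subject to the minimum shaper described below).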
This configuration is necessary to mitigate an issue in Spectrum chips where an overload of BUM traffic shuts all unicast traffic out of the system. However, strictly prioritizing unicast traffic has, under sustained unicast overload, the effect of blocking e.g. ARP traffic. Such packets are admitted to the system and stay in the queues for a while, but if the lower-numbered TC stays occupied by unicast traffic, their lifetime eventually expires and they are dropped. To prevent this scenario, a minimum shaper of 200Mbps is configured on the higher-numbered TCs to allow through a guaranteed trickle of BUM traffic even under unicast overload.

#### Default Shared Buffer Configuration

By default, one ingress pool (0) and one egress pool (4) are used for most of the traffic. In addition, pool 8 is dedicated to [BUM traffic](#handling-of-bum-traffic). Pools 9 and 10 are also special and are used for traffic trapped to the CPU. Specifically, pool 9 is used for accounting of incoming control packets, such as STP and LACP, that are trapped to the CPU. Pool 10 is an egress pool used for accounting of all the packets that are trapped to the CPU.

Two types of pools (ingress and egress) are required because, from the point of view of the shared buffer, traffic that is trapped to the CPU is like any other traffic. Such traffic enters the switch through a front panel port and is transmitted through the CPU port, which does not have a corresponding netdev. Instead of being put on the wire, such packets cross the bus (e.g., PCI) towards the host CPU.

The major difference in the default configuration between Spectrum-1 and Spectrum-2 is the size of the shared buffer, as can be seen in the output below.
Spectrum-1:

```
$ devlink sb pool show
pci/0000:01:00.0:
  sb 0 pool 0 type ingress size 12440064 thtype dynamic cell_size 96
  sb 0 pool 1 type ingress size 0 thtype dynamic cell_size 96
  sb 0 pool 2 type ingress size 0 thtype dynamic cell_size 96
  sb 0 pool 3 type ingress size 0 thtype dynamic cell_size 96
  sb 0 pool 4 type egress size 13232064 thtype dynamic cell_size 96
  sb 0 pool 5 type egress size 0 thtype dynamic cell_size 96
  sb 0 pool 6 type egress size 0 thtype dynamic cell_size 96
  sb 0 pool 7 type egress size 0 thtype dynamic cell_size 96
  sb 0 pool 8 type egress size 15794208 thtype static cell_size 96
  sb 0 pool 9 type ingress size 256032 thtype dynamic cell_size 96
  sb 0 pool 10 type egress size 256032 thtype dynamic cell_size 96
```

Spectrum-2:

```
$ devlink sb pool show
pci/0000:06:00.0:
  sb 0 pool 0 type ingress size 40960080 thtype dynamic cell_size 144
  sb 0 pool 1 type ingress size 0 thtype static cell_size 144
  sb 0 pool 2 type ingress size 0 thtype static cell_size 144
  sb 0 pool 3 type ingress size 0 thtype static cell_size 144
  sb 0 pool 4 type egress size 40960080 thtype dynamic cell_size 144
  sb 0 pool 5 type egress size 0 thtype static cell_size 144
  sb 0 pool 6 type egress size 0 thtype static cell_size 144
  sb 0 pool 7 type egress size 0 thtype static cell_size 144
  sb 0 pool 8 type egress size 41746464 thtype static cell_size 144
  sb 0 pool 9 type ingress size 256032 thtype dynamic cell_size 144
  sb 0 pool 10 type egress size 256032 thtype dynamic cell_size 144
```

Spectrum-3:

```
$ devlink sb pool show
pci/0000:07:00.0:
  sb 0 pool 0 type ingress size 60561360 thtype dynamic cell_size 144
  sb 0 pool 1 type ingress size 0 thtype static cell_size 144
  sb 0 pool 2 type ingress size 0 thtype static cell_size 144
  sb 0 pool 3 type ingress size 0 thtype static cell_size 144
  sb 0 pool 4 type egress size 60561360 thtype dynamic cell_size 144
  sb 0 pool 5 type egress size 0 thtype static cell_size 144
  sb 0 pool 6 type egress size 0 thtype static cell_size 144
  sb 0 pool 7 type egress size 0 thtype static cell_size 144
  sb 0 pool 8 type egress size 60817536 thtype static cell_size 144
  sb 0 pool 9 type ingress size 256032 thtype dynamic cell_size 144
  sb 0 pool 10 type egress size 256032 thtype dynamic cell_size 144
```

#### Descriptor Buffers

The shared buffers described above are used for storing traffic payload. Besides that, Spectrum machines have a separate resource for packet metadata, or "descriptors". Descriptor buffer configuration is not exposed. All traffic is currently configured to use descriptor pool 14. As of Linux 5.19, `mlxsw` configures this pool to be "infinite", meaning all the available chip resources can be used by traffic. Prior to 5.19, the default configuration was used, which was smaller.

Lack of descriptor buffer space can be observed as packets being rejected for "no buffer space", e.g. in the `tc_no_buffer_discard_uc_tc` `ethtool` counters, without corresponding pressure in the byte pools as reported by `devlink sb occupancy`. A typical workload that causes such exhaustion is traffic pressure from a flow with many small packets.

The descriptor pool size is chip-dependent. The following table shows the number of descriptors available:

| ASIC       | Descriptors |
|:-----------|:-----------:|
| Spectrum-1 | 81920       |
| Spectrum-2 | 136960      |
| Spectrum-3 | 204800      |

Control Plane Policing (CoPP)
-----------------------------

The `mlxsw` driver is capable of reflecting the kernel's data path to the Spectrum ASICs. This allows packets to be forwarded between front panel ports without ever reaching the CPU. While the data plane is offloaded to the ASIC, the control plane still runs on the CPU and is responsible for important tasks such as maintaining IP neighbours and delivering locally received packets to the relevant user space applications (e.g., a routing daemon).
To ensure that the control plane receives the packets it needs, the ASIC contains various packet traps that are responsible for delivering such packets to the CPU. Refer to [this page][10] for the complete list.

Since the ASIC is capable of handling packet rates that are several orders of magnitude higher than those the CPU can handle, the ASIC includes packet trap policers to prevent the CPU from being overwhelmed. These policers are bound to packet trap groups, which aggregate logically related packet traps. The default binding is set by the driver during its initialization and can be queried using the following command:

```
$ devlink trap group
pci/0000:01:00.0:
  name l2_drops generic true policer 1
  name l3_drops generic true policer 1
  name l3_exceptions generic true policer 1
  name tunnel_drops generic true policer 1
  name acl_drops generic true policer 1
  name stp generic true policer 2
  name lacp generic true policer 3
  name lldp generic true policer 4
  name mc_snooping generic true policer 5
  name dhcp generic true policer 6
  name neigh_discovery generic true policer 7
  name bfd generic true policer 8
  name ospf generic true policer 9
  name bgp generic true policer 10
  name vrrp generic true policer 11
  name pim generic true policer 12
  name uc_loopback generic true policer 13
  name local_delivery generic true policer 14
  name ipv6 generic true policer 15
  name ptp_event generic true policer 16
  name ptp_general generic true policer 17
  name acl_sample generic true
  name acl_trap generic true policer 18
```

To change the default binding and police, for example, BGP and BFD packets using the same policer, run:

```
$ devlink trap group set pci/0000:01:00.0 group bgp policer 8
```

To unbind a policer, use the `nopolicer` keyword:

```
$ devlink trap group set pci/0000:01:00.0 group bgp nopolicer
```

To query the parameters of policer `8`, run:

```
$ devlink trap policer show pci/0000:01:00.0 policer 8
pci/0000:01:00.0: policer 8 rate 20480 burst 1024
```

To set its rate to 5,000 packets per second (pps) and burst size to 256 packets, run:

```
$ devlink trap policer set pci/0000:01:00.0 policer 8 rate 5000 burst 256
```

When trapped packets exceed the policer's rate or burst size, they are dropped by the policer. To query the number of packets dropped by policer `8`, run:

```
$ devlink -s trap policer show pci/0000:01:00.0 policer 8
pci/0000:01:00.0:
  policer 8 rate 5000 burst 256
    stats:
        rx:
          dropped 13522938
```

#### Monitoring Using Prometheus

[Prometheus][11] is a popular time series database used for event monitoring and alerting. Its main component is the [Prometheus server][12], which periodically scrapes and stores time series data. The data is scraped from various exporters that export their metrics over HTTP.

Using [`devlink-exporter`][13] it is possible to export packet and byte statistics about each trap and trap group to Prometheus. In addition, it is possible to export the number of packets dropped by each trap policer. [Grafana][14] can then be used to visualize the information:

![figure 1](pictures/devlink_trap_policer_grafana.png)

Further Resources
-----------------

1. [Wikipedia article on DCB][7]
2. man lldptool-ets
3. man lldptool-pfc
4. man lldptool-app
5. man ethtool
6. man devlink
7. man devlink-sb
8. man devlink-trap
9. man dcb
10. man dcb-app
11. man dcb-buffer
12. man dcb-ets
13. man dcb-pfc

[1]: https://en.wikipedia.org/wiki/IEEE_802.1p
[2]: https://en.wikipedia.org/wiki/IEEE_802.1Q#Frame_format
[3]: https://web.archive.org/web/20190205084918/http://open-lldp.org/dcb_overview
[4]: https://github.com/intel/openlldp
[5]: http://vger.kernel.org/~davem/skb.html
[6]: https://en.wikipedia.org/wiki/Weighted_round_robin
[7]: https://en.wikipedia.org/wiki/Data_center_bridging
[8]: https://github.com/intel/openlldp/pull/9
[9]: https://en.wikipedia.org/wiki/Differentiated_services
[10]: https://www.kernel.org/doc/html/latest/networking/devlink/devlink-trap.html
[11]: https://prometheus.io/
[12]: https://prometheus.io/docs/introduction/overview/
[13]: https://github.com/Mellanox/wjh-linux/tree/master/devlink-exporter
[14]: https://grafana.com/