Packet Sampling

Ido Schimmel edited this page May 31, 2021 · 3 revisions
Table of Contents
  1. Packet Sampling
  2. Basic Configuration
  3. Flow-Based Sampling
  4. Reported Metadata
  5. Monitoring Sampled Packets
    1. Using psample Utility
      1. Packet Dissection Using Wireshark
    2. Using Host sFlow
  6. Functional Limitations
  7. Further Resources

Packet Sampling

Packet sampling copies a subset of the packets going through a physical port (ingress or egress) to higher layers for inspection. When sampling occurs in the hardware data path, sampled packets continue to be forwarded in hardware, but a copy is sent to the host CPU.

When received by mlxsw, sampled packets are passed to the psample kernel module along with relevant metadata (e.g., egress port, latency). In turn, the psample module encapsulates such packets (potentially truncated) in generic netlink packets with various metadata encoded in different attributes and emits a notification to user space.

mlxsw supports three principal sampling triggers:

  • Ingress sampling using a matchall classifier configured on the ingress of a physical port.
  • Egress sampling using a matchall classifier configured on the egress of a physical port.
  • Flow-based sampling using a flower classifier configured on the ingress or egress of a physical port.

Features by Version

Kernel Version   Feature
4.11             Ingress sampling
5.13             Egress and flow-based sampling on Spectrum-2 onwards

Basic Configuration

Configuration of packet sampling is done through TC filters, namely by attaching filters with action sample. See the linked section for details of what filters are, and how to add, remove and list them.

For a quick bootstrap, the following commands configure sampling of 1 out of 100 packets received by swp1:

# tc qdisc add dev swp1 clsact
# tc filter add dev swp1 ingress               \
	matchall skip_sw                       \
	action sample rate 100 group 1 trunc 64

The ingress keyword can be replaced with egress for egress sampling.

The skip_sw flag indicates that sampling should only take place in the hardware data path. Therefore, packets going via slow path will not be sampled.

The group keyword is mandatory and denotes the psample sampling group. Different sampling groups can be used to distinguish packets sampled from different triggers.

The trunc keyword is not mandatory, but it can be used to tell the psample module to truncate sampled packets to the given length before encapsulating them in generic netlink packets. This is useful in case only the packet headers are of interest, which is usually the case.
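
The options described above can be wrapped in a small shell function. This is a minimal sketch rather than part of mlxsw or iproute2: the function name enable_sampling and the TC variable are hypothetical, and the clsact qdisc is assumed to have been added already as shown earlier. Setting TC=echo prints the command instead of running it.

```shell
# Hypothetical helper: enable hardware-only sampling on one port and direction.
# Set TC=echo to print the tc command instead of executing it.
TC="${TC:-tc}"

enable_sampling() {
	dev="$1"        # port netdev, e.g. swp1
	dir="$2"        # ingress or egress
	rate="$3"       # sample 1 out of <rate> packets
	group="$4"      # psample group (mandatory)
	trunc="${5:-}"  # optional truncation length in bytes

	case "$dir" in
	ingress|egress) ;;
	*) echo "direction must be ingress or egress" >&2; return 1 ;;
	esac

	# "trunc <len>" is appended only when a length was given
	$TC filter add dev "$dev" "$dir" matchall skip_sw \
		action sample rate "$rate" group "$group" ${trunc:+trunc "$trunc"}
}

# Example: enable_sampling swp1 ingress 100 1 64
```

Using a different group number per port or direction makes it possible to tell the triggers apart on the receiving side.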

Note: Sampling can be configured only once for each combination of port and direction (ingress or egress). A second attempt to add a sampling filter fails:

# tc filter add dev swp1 ingress \
	matchall skip_sw action sample rate 100 group 1 trunc 64
# tc filter add dev swp1 ingress \
	matchall skip_sw action sample rate 100 group 1 trunc 64
Error: mlxsw_spectrum: Sampling already enabled on port.
We have an error talking to the kernel

The kernel implements rule replacement by first installing the new rule and then deleting the old one. Because the old rule is still present when the new one is installed, the above-mentioned limitation makes it impossible to replace or change sampling rules:

# tc filter add dev swp1 ingress handle 0x1 \
	matchall skip_sw action sample rate 100 group 1 trunc 64
# tc filter replace dev swp1 ingress handle 0x1 \
	matchall skip_sw action sample rate 100 group 2 trunc 64
Error: mlxsw_spectrum: Sampling already enabled on port.
We have an error talking to the kernel
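
Since replacement fails, one way to change the sampling parameters is to delete the existing filter and then add a new one. The sketch below assumes the filter was originally added with an explicit "pref 1 handle 0x1" so that it can be deleted by the same identifiers. The function name change_sampling and the TC variable are hypothetical. No packets are sampled on the port between the two commands.

```shell
# Hypothetical helper: change matchall sampling parameters by deleting the
# old filter before adding the new one.
# Assumes the old filter was added with "pref 1 handle 0x1".
# Set TC=echo to print the tc commands instead of executing them.
TC="${TC:-tc}"

change_sampling() {
	dev="$1"    # port netdev, e.g. swp1
	dir="$2"    # ingress or egress
	rate="$3"   # new sampling rate
	group="$4"  # new psample group

	# Remove the existing rule first so the port has no sampling trigger
	$TC filter del dev "$dev" "$dir" pref 1 handle 0x1 matchall || return 1
	# Sampling is disabled on the port until this command completes
	$TC filter add dev "$dev" "$dir" pref 1 handle 0x1 \
		matchall skip_sw action sample rate "$rate" group "$group" trunc 64
}

# Example: change_sampling swp1 ingress 100 2
```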

Flow-Based Sampling

Starting with kernel 5.13, it is possible to configure packet sampling as a result of a flower match. For example, sampling of packets with a given destination IP incoming via swp1 can be done as follows:

# tc filter add dev swp1 ingress                      \
	protocol ip flower skip_sw dst_ip 192.168.0.4 \
	action sample rate 100 group 1 trunc 64

For further details about flow-based matching, see ACLs.

Note: Different flower filters can use the sampling action, but they all must use the same sampling parameters (e.g., group, rate):

# tc filter add dev swp1 ingress			\
	protocol ip flower skip_sw dst_ip 192.168.0.4	\
	action sample rate 100 group 1 trunc 64
# tc filter add dev swp2 egress				\
	protocol ip flower skip_sw dst_ip 192.168.0.5	\
	action sample rate 200 group 1 trunc 64
Error: mlxsw_spectrum: Sampling parameters do not match for an existing sampling trigger.
We have an error talking to the kernel

The kernel implements rule replacement by first installing the new rule and then deleting the old one. Because the old rule is still present when the new one is installed, the above-mentioned limitation also makes it impossible to change the sampling parameters of a rule by replacing it:

# tc filter add dev swp1 ingress handle 0x1		\
	protocol ip flower skip_sw dst_ip 192.168.0.4	\
	action sample rate 100 group 1 trunc 64
# tc filter replace dev swp1 ingress handle 0x1		\
	protocol ip flower skip_sw dst_ip 192.168.0.4	\
	action sample rate 100 group 2 trunc 64
Error: mlxsw_spectrum: Sampling parameters do not match for an existing sampling trigger.
We have an error talking to the kernel

Reported Metadata

The following metadata is reported for sampled packets:

  1. Input interface index
  2. Output interface index
  3. Output traffic class
  4. Output traffic class occupancy (bytes)
  5. Latency (nanoseconds)

Limitations

  1. Latency, output traffic class and output traffic class occupancy are not reported for sampled packets on Spectrum-1
  2. Latency is only reported for packets sampled via one of the egress triggers
  3. Latency is reported with a granularity of 64 nanoseconds. Latency above 1 second is not reported
  4. Egress traffic class occupancy is reported with a granularity of 8 KB

Monitoring Sampled Packets

As previously explained, sampled packets are reported to user space via the psample kernel module (CONFIG_PSAMPLE). The sampled packets can be consumed by different applications that fit different use cases.

Using psample Utility

The psample utility, part of libpsample, can be used to interact with the psample kernel module. It is able to display both configuration information (e.g., active sampling groups) as well as metadata about sampled packets. For example:

$ psample -c
group 1 in-ifindex 32 out-ifindex 29 origsize 106 sample-rate 5 seq 1226 out-tc 0 out-tc-occ 0 timestamp Tue Mar 23 20:25:53 2021 958927903 nsec protocol 0x800
group 1 in-ifindex 32 out-ifindex 29 origsize 106 sample-rate 5 seq 1227 out-tc 0 out-tc-occ 0 timestamp Tue Mar 23 20:25:53 2021 960212878 nsec protocol 0x800

See the tool's official page for more information.
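
Output in the format shown above can be post-processed with standard tools. The following sketch is a hypothetical helper (summarize_samples is not part of libpsample) that reads saved psample output from standard input and, for each group and input interface, counts the samples and estimates the traffic they represent by multiplying by the reported sample rate. The estimate is only a statistical approximation.

```shell
# Hypothetical helper: summarize psample output read from standard input.
# Prints one line per (group, in-ifindex) pair.
summarize_samples() {
	awk '
	$1 == "group" {
		grp = iif = size = rate = ""
		# Attributes are printed as "name value" pairs
		for (i = 1; i < NF; i++) {
			if ($i == "group") grp = $(i + 1)
			else if ($i == "in-ifindex") iif = $(i + 1)
			else if ($i == "origsize") size = $(i + 1)
			else if ($i == "sample-rate") rate = $(i + 1)
		}
		key = grp " " iif
		samples[key]++
		packets[key] += rate        # each sample stands for about <rate> packets
		bytes[key] += size * rate   # scaled by the original packet size
	}
	END {
		for (key in samples) {
			split(key, part, " ")
			printf "group %s in-ifindex %s samples %d est-packets %d est-bytes %d\n",
				part[1], part[2], samples[key], packets[key], bytes[key]
		}
	}'
}

# Example: summarize_samples < samples.txt
```

For the two sample lines above, the helper reports 2 samples, an estimated 10 packets and an estimated 1060 bytes for group 1 and in-ifindex 32.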

Packet Dissection Using Wireshark

It is possible to dissect sampled packets using Wireshark or its terminal equivalent, tshark.

Packets captured by psample can be written to standard output and piped into tshark for dissection:

$ psample --write - | tshark -r - -V

It is also possible to filter on specific fields in the encapsulating netlink packet. For example, to filter sampled packets received from a particular netdev, run:

$ psample --write - | tshark -r - -V -Y 'netlink.psample.iifindex==5'

To list the fields exposed by the psample dissector, run:

$ tshark -G fields | grep psample
P       Linux psample protocol  psample
F       Command netlink.psample.cmd     FT_UINT8        psample BASE_DEC        0x0
F       Attribute type  netlink.psample.attr_type       FT_UINT16       psample BASE_DEC        0x3fff
F       Input interface index   netlink.psample.iifindex        FT_UINT16       psample BASE_HEX        0x0
F       Output interface index  netlink.psample.oifindex        FT_UINT16       psample BASE_HEX        0x0
F       Original size   netlink.psample.origsize        FT_UINT32       psample BASE_HEX        0x0
F       Sample group    netlink.psample.sample_group    FT_UINT32       psample BASE_DEC        0x0
F       Group sequence number   netlink.psample.group_seq_num   FT_UINT32       psample BASE_DEC        0x0
F       Sample rate     netlink.psample.sample_rate     FT_UINT32       psample BASE_DEC        0x0
F       Tunnel  netlink.psample.tunnel  FT_UINT32       psample BASE_HEX        0x0
F       Group reference count   netlink.psample.group_refcount  FT_UINT32       psample BASE_HEX        0x0
F       Output traffic class    netlink.psample.out_tc  FT_UINT16       psample BASE_DEC        0x0
F       Output traffic class occupancy  netlink.psample.out_tc_occ      FT_UINT64       psample BASE_DEC        0x0
F       Latency netlink.psample.latency FT_UINT64       psample BASE_DEC        0x0
F       Timestamp       netlink.psample.timestamp       FT_ABSOLUTE_TIME        psample         0x0
F       Protocol        netlink.psample.proto   FT_UINT16       psample BASE_HEX        0x0
F       Modification    synphasor.conf.phasor_mod.upsampled_extrapolation       FT_BOOLEAN      synphasor       16      0x4
F       Modification    synphasor.conf.phasor_mod.upsampled_interpolation       FT_BOOLEAN      synphasor       16      0x2

Note: To check whether your Wireshark version includes the dissector, run tshark -G protocols | grep psample. If the dissector is included, the output is: Linux psample protocol psample psample. To install Wireshark from source, please refer to the Wireshark documentation.
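
The dissector fields listed above can be combined into a single display filter. The sketch below is a hypothetical helper (psample_filter is not part of psample or Wireshark) that builds a filter matching one sampling group, optionally restricted to one input interface index.

```shell
# Hypothetical helper: build a tshark display filter for sampled packets.
# Arguments: psample group, optional input interface index.
psample_filter() {
	group="$1"
	iif="${2:-}"
	filter="netlink.psample.sample_group==$group"
	if [ -n "$iif" ]; then
		filter="$filter && netlink.psample.iifindex==$iif"
	fi
	printf '%s\n' "$filter"
}

# Example (prints selected metadata fields instead of the full dissection):
# psample --write - | tshark -r - -Y "$(psample_filter 1 5)" \
#	-T fields -e netlink.psample.oifindex -e netlink.psample.origsize
```

The -T fields and -e options are standard tshark options for printing only the named fields.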

Using Host sFlow

Host sFlow is an agent that can export performance metrics using the sFlow protocol. On Linux, the agent is able to configure ingress and egress sampling rules using the matchall classifier. The agent then reads sampled packets via the psample netlink channel and exports the information to an sFlow collector - such as sFlow-RT - over the sFlow protocol.

To compile and install the agent from source, run:

$ git clone https://github.com/sflow/host-sflow.git
$ cd host-sflow/
$ make FEATURES=DENT
# make install

More detailed instructions can be found here.

The following configuration file will instruct the agent to use psample and configure both ingress and egress sampling rules:

# /etc/hsflowd.conf
sflow {
  sampling.1G=100
  collector { ip=127.0.0.1 }
  psample { group=1 egress=on }
  dent { sw=off switchport=swp.* }
}

Care must be taken when configuring the sampling rate in order not to overwhelm the host CPU with sampled packets. More detailed information about the various configuration options can be found here.
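
To get a feeling for the load a given sampling rate can put on the CPU, the worst case can be estimated from the link speed. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name max_samples_per_sec is hypothetical, and it assumes the port receives minimum-size frames at line rate with 20 bytes of per-frame overhead on the wire.

```shell
# Hypothetical helper: worst-case samples per second sent to the CPU by one port.
# Arguments: link speed in bit/s, frame size in bytes, sampling rate (1 in N).
max_samples_per_sec() {
	speed_bps="$1"
	frame_bytes="$2"
	rate="$3"
	# Each Ethernet frame occupies 20 extra bytes on the wire
	# (7-byte preamble, 1-byte start delimiter, 12-byte inter-frame gap)
	wire_bits=$(( (frame_bytes + 20) * 8 ))
	echo $(( speed_bps / wire_bits / rate ))
}

# 1 Gb/s port, 64-byte frames, 1-in-100 sampling:
# max_samples_per_sec 1000000000 64 100      prints 14880
# 100 Gb/s port with the same rate:
# max_samples_per_sec 100000000000 64 100    prints 1488095
```

The same 1-in-100 rate that is modest on a 1 Gb/s port can produce roughly a hundred times more samples on a 100 Gb/s port, so faster ports usually call for a higher rate value, meaning fewer packets sampled.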

The agent can be started and enabled using systemd:

# systemctl start hsflowd.service
# systemctl enable hsflowd.service

Once the agent is running, it can be coupled with a collector such as sflowtool or sFlow-RT that will process the information from the agent and visualize it, as can be seen in this blog post.

Functional Limitations

  1. Egress and flow-based sampling are not supported on Spectrum-1
  2. Packets sampled via one of the egress triggers are copied to the CPU after they have been modified by the hardware data path (e.g., DMAC update after routing)

Further Resources

  1. man tc
  2. man tc-matchall
  3. QoS in Linux with TC and Filters by Phil Sutter (part of iproute documentation)
  4. man tc-sample
  5. Linux 4.11 kernel extends packet sampling support
  6. Transit delay and queueing