Packet Sampling
- Packet Sampling
- Basic Configuration
- Flow-Based Sampling
- Reported Metadata
- Monitoring Sampled Packets
- Functional Limitations
- Further Resources
Packet sampling enables the sampling of packets going through a physical port (ingress or egress) to higher layers for inspection. Specifically, when sampling occurs in the hardware data path, sampled packets continue to be forwarded in hardware, but a copy is sent to the host CPU.
When received by mlxsw, sampled packets are passed to the psample kernel module along with relevant metadata (e.g., egress port, latency). In turn, the psample module encapsulates such packets (potentially truncated) in generic netlink packets with various metadata encoded in different attributes and emits a notification to user space.
mlxsw supports several principal sampling triggers:
- Ingress sampling using a matchall classifier configured on the ingress of a physical port.
- Egress sampling using a matchall classifier configured on the egress of a physical port.
- Flow-based sampling using a flower classifier configured on the ingress or egress of a physical port.
Kernel Version | Added Support |
---|---|
4.11 | Ingress sampling |
5.13 | Egress and flow-based sampling on Spectrum-2 onwards |
Configuration of packet sampling is done through TC filters, namely by attaching filters with action sample. See the linked section for details of what filters are, and how to add, remove and list them.
For a quick bootstrap, the following commands configure sampling of 1 out of 100 packets received by swp1:
# tc qdisc add dev swp1 clsact
# tc filter add dev swp1 ingress \
matchall skip_sw \
action sample rate 100 group 1 trunc 64
The ingress keyword can be replaced with egress for egress sampling.
The skip_sw flag indicates that sampling should only take place in the hardware data path. Therefore, packets going via the slow path will not be sampled.
The group keyword is mandatory and denotes the psample sampling group. Different sampling groups can be used to distinguish packets sampled from different triggers.
The trunc keyword is optional. It tells the psample module to truncate sampled packets to the given length before encapsulating them in generic netlink packets. This is useful when only the packet headers are of interest, which is usually the case.
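Because sampling groups distinguish triggers, one possible setup assigns a separate group to each direction of a port. The sketch below wraps the tc commands in a shell function. The function name sample_both_directions, the group numbers and the TC override variable are illustrative assumptions, not part of mlxsw or iproute2. It assumes the clsact qdisc already exists on the port; egress sampling additionally requires Spectrum-2 or later and kernel 5.13 or later.

```shell
# TC may be set to another command (e.g. TC=echo) to print instead of execute.
TC=${TC:-tc}

# Hypothetical helper: sample ingress traffic into group 1, egress into group 2.
# Usage: sample_both_directions <netdev> <rate>
sample_both_directions() {
    dev=$1
    rate=$2
    "$TC" filter add dev "$dev" ingress matchall skip_sw \
        action sample rate "$rate" group 1 trunc 64 || return 1
    "$TC" filter add dev "$dev" egress matchall skip_sw \
        action sample rate "$rate" group 2 trunc 64
}
```

Calling sample_both_directions swp1 100 as root would then install both rules. With TC=echo the two commands are only printed, which is handy for reviewing them first.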
Note: Sampling can only be configured once on any combination of source port and direction (ingress or egress):
# tc filter add dev swp1 ingress \
matchall skip_sw action sample rate 100 group 1 trunc 64
# tc filter add dev swp1 ingress \
matchall skip_sw action sample rate 100 group 1 trunc 64
Error: mlxsw_spectrum: Sampling already enabled on port.
We have an error talking to the kernel
The kernel implements rule replacement by first installing the new rule and then deleting the old one. Therefore, due to the above-mentioned limitation, it is not possible to replace or change sampling rules in place:
# tc filter add dev swp1 ingress handle 0x1 \
matchall skip_sw action sample rate 100 group 1 trunc 64
# tc filter replace dev swp1 ingress handle 0x1 \
matchall skip_sw action sample rate 100 group 2 trunc 64
Error: mlxsw_spectrum: Sampling already enabled on port.
We have an error talking to the kernel
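One way around this, sketched below, is to delete the existing filter first and then add the new one, accepting a short window during which no packets are sampled. The helper name resample_port, the fixed preference 1 and handle 0x1, and the TC override variable are assumptions made for illustration. The original filter must have been added with the same pref and handle for the delete to match.

```shell
# TC may be set to another command (e.g. TC=echo) to print instead of execute.
TC=${TC:-tc}

# Hypothetical helper: change sampling parameters by deleting the old rule first.
# Usage: resample_port <netdev> <ingress|egress> <rate> <group>
resample_port() {
    dev=$1
    dir=$2
    rate=$3
    group=$4
    # Remove the existing matchall filter; no sampling happens until re-added.
    "$TC" filter del dev "$dev" "$dir" protocol all pref 1 handle 0x1 \
        matchall || return 1
    # Install the rule again with the new parameters.
    "$TC" filter add dev "$dev" "$dir" protocol all pref 1 handle 0x1 \
        matchall skip_sw action sample rate "$rate" group "$group" trunc 64
}
```

For example, resample_port swp1 ingress 100 2 would move ingress sampling on swp1 to group 2.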
Starting with kernel 5.13, it is possible to configure packet sampling as a result of a flower match. For example, sampling of packets with a given destination IP received via swp1 can be done as follows:
# tc filter add dev swp1 ingress \
protocol ip flower skip_sw dst_ip 192.168.0.4 \
action sample rate 100 group 1 trunc 64
For further details about flow-based matching, see ACLs.
Note: Different flower filters can use the sampling action, but they all must use the same sampling parameters (e.g., group, rate):
# tc filter add dev swp1 ingress \
protocol ip flower skip_sw dst_ip 192.168.0.4 \
action sample rate 100 group 1 trunc 64
# tc filter add dev swp2 egress \
protocol ip flower skip_sw dst_ip 192.168.0.5 \
action sample rate 200 group 1 trunc 64
Error: mlxsw_spectrum: Sampling parameters do not match for an existing sampling trigger.
We have an error talking to the kernel
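A simple way to keep the parameters consistent is to define them once and reuse them for every filter. The following sketch does this for a list of destination addresses. The helper name sample_dst_ips, the SAMPLE_* variables and the TC override variable are hypothetical and chosen only for illustration.

```shell
# TC may be set to another command (e.g. TC=echo) to print instead of execute.
TC=${TC:-tc}

# Sampling parameters shared by all flower filters (illustrative values).
SAMPLE_RATE=100
SAMPLE_GROUP=1
SAMPLE_TRUNC=64

# Hypothetical helper: sample ingress packets destined to each given IPv4 address.
# Usage: sample_dst_ips <netdev> <ip>...
sample_dst_ips() {
    dev=$1
    shift
    for ip in "$@"; do
        "$TC" filter add dev "$dev" ingress \
            protocol ip flower skip_sw dst_ip "$ip" \
            action sample rate "$SAMPLE_RATE" group "$SAMPLE_GROUP" \
            trunc "$SAMPLE_TRUNC" || return 1
    done
}
```

For example, sample_dst_ips swp1 192.168.0.4 192.168.0.5 adds one filter per address, all with identical sampling parameters.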
The kernel implements rule replacement by first installing the new rule and then deleting the old one. Therefore, due to the above-mentioned limitation, it is not possible to replace or change sampling rules in place:
# tc filter add dev swp1 ingress handle 0x1 \
protocol ip flower skip_sw dst_ip 192.168.0.4 \
action sample rate 100 group 1 trunc 64
# tc filter replace dev swp1 ingress handle 0x1 \
protocol ip flower skip_sw dst_ip 192.168.0.4 \
action sample rate 100 group 2 trunc 64
Error: mlxsw_spectrum: Sampling parameters do not match for an existing sampling trigger.
We have an error talking to the kernel
The following metadata is reported for sampled packets:
- Input interface index
- Output interface index
- Output traffic class
- Output traffic class occupancy (bytes)
- Latency (nanoseconds)
- Latency, output traffic class and output traffic class occupancy are not reported for sampled packets on Spectrum-1
- Latency is only reported for packets sampled via one of the egress triggers
- Latency is reported in granularity of 64 nanoseconds. Latency above 1 second is not reported
- Egress traffic class occupancy is reported in granularity of 8KB
As previously explained, sampled packets are reported to user space via the psample kernel module (CONFIG_PSAMPLE). The sampled packets can be consumed by different applications that fit different use cases.
The psample utility, part of libpsample, can be used to interact with the psample kernel module. It can display both configuration information (e.g., active sampling groups) and metadata about sampled packets. For example:
$ psample -c
group 1 in-ifindex 32 out-ifindex 29 origsize 106 sample-rate 5 seq 1226 out-tc 0 out-tc-occ 0 timestamp Tue Mar 23 20:25:53 2021 958927903 nsec protocol 0x800
group 1 in-ifindex 32 out-ifindex 29 origsize 106 sample-rate 5 seq 1227 out-tc 0 out-tc-occ 0 timestamp Tue Mar 23 20:25:53 2021 960212878 nsec protocol 0x800
See the tool's official page for more information.
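Since each sample is printed as a single line of key/value pairs, the output lends itself to quick summaries with standard text tools. The sketch below counts samples per input interface index; it assumes the line format shown above, and the helper name count_by_in_ifindex is hypothetical.

```shell
# Hypothetical helper: count sampled packets per input ifindex.
# Reads lines in the format printed by the psample utility on stdin.
count_by_in_ifindex() {
    awk '{
        # Find the "in-ifindex" key and tally the value that follows it.
        for (i = 1; i < NF; i++)
            if ($i == "in-ifindex")
                count[$(i + 1)]++
    }
    END {
        for (ifindex in count)
            print ifindex, count[ifindex]
    }'
}
```

It can be run on output saved earlier, e.g. count_by_in_ifindex < samples.txt, where samples.txt is a hypothetical file holding the utility's output. The order of the printed lines is not defined.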
It is possible to dissect sampled packets using Wireshark or its terminal equivalent, tshark. Packets captured using psample can then be imported into Wireshark:
$ psample --write - | tshark -r - -V
It is also possible to filter on specific fields in the encapsulating netlink packet. For example, to filter sampled packets received from a particular netdev, run:
$ psample --write - | tshark -r - -V -Y 'netlink.psample.iifindex==5'
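Several fields can be combined in one display filter. As a minimal sketch, the hypothetical helper below builds a filter that matches a sampling group and, optionally, an input interface index. The field names are taken from the dissector's field list; the function name psample_filter is an assumption made for illustration.

```shell
# Hypothetical helper: build a Wireshark display filter for sampled packets.
# Usage: psample_filter <group> [<input ifindex>]
psample_filter() {
    group=$1
    iif=${2:-}
    filter="netlink.psample.sample_group==$group"
    # Narrow the match to one input interface when an ifindex is given.
    if [ -n "$iif" ]; then
        filter="$filter && netlink.psample.iifindex==$iif"
    fi
    printf '%s\n' "$filter"
}
```

The result can be passed to tshark in place of a literal filter, e.g. psample --write - | tshark -r - -V -Y "$(psample_filter 1 5)".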
To list the fields exposed by the psample dissector, run:
$ tshark -G fields | grep psample
P Linux psample protocol psample
F Command netlink.psample.cmd FT_UINT8 psample BASE_DEC 0x0
F Attribute type netlink.psample.attr_type FT_UINT16 psample BASE_DEC 0x3fff
F Input interface index netlink.psample.iifindex FT_UINT16 psample BASE_HEX 0x0
F Output interface index netlink.psample.oifindex FT_UINT16 psample BASE_HEX 0x0
F Original size netlink.psample.origsize FT_UINT32 psample BASE_HEX 0x0
F Sample group netlink.psample.sample_group FT_UINT32 psample BASE_DEC 0x0
F Group sequence number netlink.psample.group_seq_num FT_UINT32 psample BASE_DEC 0x0
F Sample rate netlink.psample.sample_rate FT_UINT32 psample BASE_DEC 0x0
F Tunnel netlink.psample.tunnel FT_UINT32 psample BASE_HEX 0x0
F Group reference count netlink.psample.group_refcount FT_UINT32 psample BASE_HEX 0x0
F Output traffic class netlink.psample.out_tc FT_UINT16 psample BASE_DEC 0x0
F Output traffic class occupancy netlink.psample.out_tc_occ FT_UINT64 psample BASE_DEC 0x0
F Latency netlink.psample.latency FT_UINT64 psample BASE_DEC 0x0
F Timestamp netlink.psample.timestamp FT_ABSOLUTE_TIME psample 0x0
F Protocol netlink.psample.proto FT_UINT16 psample BASE_HEX 0x0
F Modification synphasor.conf.phasor_mod.upsampled_extrapolation FT_BOOLEAN synphasor 16 0x4
F Modification synphasor.conf.phasor_mod.upsampled_interpolation FT_BOOLEAN synphasor 16 0x2
Note: To understand if your Wireshark version includes the dissector, check the output of tshark -G protocols | grep psample. In case the dissector is included, the output should be: Linux psample protocol psample psample. To install Wireshark from source, please refer to the Wireshark documentation.
Host sFlow is an agent that can export performance metrics using the sFlow protocol. On Linux, the agent is able to configure ingress and egress sampling rules using the matchall classifier. The agent then reads sampled packets via the psample netlink channel and exports the information to an sFlow collector (such as sFlow-RT) over the sFlow protocol.
To compile and install the agent from source, run:
$ git clone https://github.com/sflow/host-sflow.git
$ cd host-sflow/
$ make FEATURES=DENT
# make install
More detailed instructions can be found here.
The following configuration file will instruct the agent to use psample and configure both ingress and egress sampling rules:
# /etc/hsflowd.conf
sflow {
sampling.1G=100
collector { ip=127.0.0.1 }
psample { group=1 egress=on }
dent { sw=off switchport=swp.* }
}
Care must be taken when configuring the sampling rate in order not to overwhelm the host CPU with sampled packets. More detailed information about the various configuration options can be found here.
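As a rough estimate, sampling 1 out of N packets on a port that forwards P packets per second sends about P / N packets per second to the CPU. The helper below is only a back-of-the-envelope sketch of that arithmetic; its name and the example figures are illustrative assumptions, not measurements.

```shell
# Hypothetical helper: average number of sampled packets per second.
# Usage: sampled_pps <port packets per second> <sampling rate N (1 out of N)>
sampled_pps() {
    # Integer division; the result is rounded down.
    echo $(( $1 / $2 ))
}
```

For instance, sampled_pps 1000000 100 prints 10000, i.e. a port forwarding one million packets per second at a sampling rate of 100 generates roughly ten thousand samples per second. The total across all sampled ports is what the host CPU has to absorb.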
The agent can be started and enabled using systemd:
# systemctl start hsflowd.service
# systemctl enable hsflowd.service
Once the agent is running, it can be coupled with a collector such as sflowtool or sFlow-RT that will process the information from the agent and visualize it, as can be seen in this blog post.
- Egress and flow-based sampling are not supported on Spectrum-1
- Packets sampled via one of the egress triggers are copied to the CPU after they were modified by the hardware data path (e.g., DMAC update after routing)
- man tc
- man tc-matchall
- QoS in Linux with TC and Filters by Phil Sutter (part of iproute documentation)
- man tc-sample
- Linux 4.11 kernel extends packet sampling support
- Transit delay and queueing