Precision Time Protocol

Ido Schimmel edited this page Nov 7, 2022 · 7 revisions
Table of Contents
  1. Introduction
  2. Configuring PTP
    1. systemd Configuration
    2. Kernel Configuration
  3. PTP Management
    1. Statistics
  4. Synchronizing Real Time Clock
  5. Low-level Details
    1. PTP Hardware Clock (PHC)
    2. Hardware Timestamping
      1. Limitations in Spectrum-2 and Above
  6. Issues and Interactions
    1. PIM
    2. Socket Buffer
    3. Throughput of PTP-Enabled Ports
    4. Untimestamped Packets
  7. Further Reading

Introduction

The Precision Time Protocol (commonly abbreviated as PTP) is a protocol designed to synchronize real-time clocks in the nodes of a distributed system that communicate over a network. On a local network, it can commonly achieve sub-microsecond accuracy. It was first standardized as IEEE 1588-2002 and revised as IEEE 1588-2008 (PTPv2).

The nodes in the PTP network (which are called clocks) are organized into a master-slave hierarchy. Slaves synchronize to their masters, which can in turn be slaves to other masters. The master-slave relationship is determined by running, on each clock, the so-called Best Master Clock (BMC) algorithm. By determining the relationships in the network dynamically, the PTP network can resolve cycles and prune the general interconnect graph down to a tree of masters and slaves.

The root of this tree is the ultimate master that all clocks (possibly indirectly) synchronize to, and is called the grandmaster. Its time is in turn typically synchronized from GPS.

Interior nodes in this network are called boundary clocks (BC). They have one slave port (leading to a master) and one or more master ports (leading to slave clocks).
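
The core comparison behind the BMC algorithm can be sketched as follows. This is a simplified illustration rather than the full algorithm from the standard: it only orders clocks by the advertised attributes (priority1, clockClass, clockAccuracy, offsetScaledLogVariance, priority2, clockIdentity, lower values winning) and ignores stepsRemoved and the topology-related tie-breaks. The class and function names are hypothetical, and so are the attribute values of the example grandmaster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnouncedClock:
    """Attributes a clock advertises about itself (hypothetical container)."""
    priority1: int
    clock_class: int
    clock_accuracy: int
    offset_scaled_log_variance: int
    priority2: int
    clock_identity: str  # same textual format for all clocks, so strings compare consistently

    def rank(self):
        # Attributes are compared in this order; a lower value is better.
        return (self.priority1, self.clock_class, self.clock_accuracy,
                self.offset_scaled_log_variance, self.priority2,
                self.clock_identity)


def best_master(clocks):
    """Simplified BMC: pick the clock with the lowest rank tuple."""
    return min(clocks, key=AnnouncedClock.rank)


# Example values are made up for illustration.
gps_grandmaster = AnnouncedClock(128, 6, 0x21, 0x4e5d, 128, "001122.fffe.334455")
free_running_switch = AnnouncedClock(128, 248, 0xfe, 0xffff, 128, "7cfe90.fffe.f5a351")

print(best_master([free_running_switch, gps_grandmaster]).clock_identity)
```

Because priority1 is compared first, lowering it on a clock is the usual way to make that clock preferred regardless of its other attributes.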

On Linux, a prominent package that includes a PTP agent and other related tools is linuxptp. The following text assumes that this PTP package is installed.

Features by Version

Kernel Version   Feature
5.3              Support for PTP in master, slave or BC role on Spectrum-1.
5.4              ethtool counters for garbage-collected packets and timestamps.
6.0              Spectrum-2 support.

Configuring PTP

The linuxptp daemon that implements the PTP protocol is called ptp4l. ptp4l can use different transport protocols to communicate with other clocks: Ethernet, UDP over IPv4 or UDP over IPv6. If UDP is used, the ports that ptp4l runs on need to be router ports. For example, to configure the ports for UDP over IPv4:

# ip address add dev swp1 192.0.2.1/28
# ip address add dev swp2 192.0.2.17/28
# ip address add dev swp3 192.0.2.33/28

In simple situations all that is necessary to enable PTP on a switch is to run ptp4l, passing one or more -i options with interfaces to use:

# ptp4l -i swp1 -i swp2 -i swp3

By default, ptp4l uses IPv4 UDP as a transport protocol. The command line options -2, -4 and -6 can be used to select, respectively, Ethernet, UDP over IPv4 and UDP over IPv6 transport protocols.

A more detailed configuration can be done through a configuration file. One with a couple of common options set might look like this:

[global]
time_stamping hardware
logSyncInterval 0
logMinDelayReqInterval 1
summary_interval 0
step_threshold 1.0
tx_timestamp_timeout 30
network_transport UDPv4

[swp1]

[swp2]

[swp3]

The global section holds options that are either global in nature and influence behavior of the daemon itself, or common for all interfaces unless overridden explicitly. The example also shows three sections with names matching the ports configured above. All of them are empty, but each of them serves as a cue for ptp4l to run PTP on a given port.

Of particular interest is the option tx_timestamp_timeout, which configures how long (in milliseconds) ptp4l waits for the timestamp of a sent packet. ptp4l defaults to 1ms, which is often enough unless the system is busy. On a system with a high CPU load, or one that needs to handle many PTP packets, it may be necessary to raise the timeout. The value of 30ms shown above should be enough for the PTP traffic volume that the Spectrum system is designed to handle.

The option step_threshold configures the largest offset, in seconds, that ptp4l corrects by adjusting the clock frequency. Larger offsets are corrected in one step.
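
The effect of step_threshold can be sketched as a simple decision. This is only an illustration of the rule, not ptp4l's servo code; the function name is hypothetical. It assumes the behaviour described in the ptp4l manual page, where a threshold of 0.0 means the clock is never stepped after start-up.

```python
def correction_action(offset_ns, step_threshold_s):
    """Return "step" or "slew" for a measured offset (hypothetical helper).

    offset_ns:        measured offset from the master, in nanoseconds
    step_threshold_s: value of step_threshold, in seconds (0.0 disables stepping)
    """
    threshold_ns = step_threshold_s * 1e9
    if step_threshold_s > 0 and abs(offset_ns) > threshold_ns:
        return "step"   # jump the clock by the whole offset at once
    return "slew"       # correct gradually by adjusting the clock frequency


# With step_threshold 1.0 as in the configuration above:
print(correction_action(2_000_000_000, 1.0))  # two seconds off
print(correction_action(-5, 1.0))             # five nanoseconds off
```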

ptp4l does not use a configuration file by default, but one can be passed with the -f command line option. When the configuration file includes the interfaces to run on, they do not need to be listed on the command line anymore. ptp4l just needs to be told the location of that file:

# ptp4l -f /etc/ptp4l.conf

systemd Configuration

On systems that use systemd, the ptp4l service file uses the environment file /etc/sysconfig/ptp4l to configure the options that ptp4l is started with. The default options contain -i eth0, which is typically wrong on a switch. Because all of the configuration can be contained in /etc/ptp4l.conf, including the interfaces to run on, -f is actually the only necessary option:

# cat /etc/sysconfig/ptp4l
OPTIONS="-f /etc/ptp4l.conf"

If the ptp4l environment file is updated this way, and /etc/ptp4l.conf contains a suitable configuration, ptp4l can be started by systemd:

# systemctl start ptp4l

If ptp4l should be started by default during the system boot, the service needs to be enabled:

# systemctl enable ptp4l

Kernel Configuration

Make sure that your kernel has PTP_1588_CLOCK and NET_PTP_CLASSIFY enabled when you want to use the PTP support in mlxsw.

PTP Management

IEEE 1588 standardizes a mechanism for introspection and configuration of individual PTP clocks. This uses PTP management messages. The linuxptp suite contains a tool named pmc, for PTP management client, which can be used to send and receive PTP management messages.

As an example, a GET CURRENT_DATA_SET command can be sent to retrieve the current status of each clock:

# pmc -u 'GET CURRENT_DATA_SET'
sending: GET CURRENT_DATA_SET
        7cfe90.fffe.f5a351-0 seq 0 RESPONSE MANAGEMENT CURRENT_DATA_SET
                stepsRemoved     1
                offsetFromMaster -5.0
                meanPathDelay    738.0

The value stepsRemoved shows how far from the grandmaster, in PTP hops, a given clock is. offsetFromMaster shows the current estimate of how far this system's clock is from its master's clock, in nanoseconds.
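
For monitoring, the textual response can be turned into a dictionary. The following is a minimal sketch that assumes the output layout shown above, where every data line consists of exactly a field name and a value; the function name is hypothetical and pmc itself offers no such interface.

```python
def parse_pmc_fields(output):
    """Collect "<name> <value>" lines from a pmc response into a dict."""
    fields = {}
    for line in output.splitlines():
        parts = line.split()
        # The "sending:" line and the response header have more than two
        # tokens, so only the data-set fields are kept.
        if len(parts) == 2:
            fields[parts[0]] = parts[1]
    return fields


sample = """\
sending: GET CURRENT_DATA_SET
        7cfe90.fffe.f5a351-0 seq 0 RESPONSE MANAGEMENT CURRENT_DATA_SET
                stepsRemoved     1
                offsetFromMaster -5.0
                meanPathDelay    738.0
"""

fields = parse_pmc_fields(sample)
offset_ns = float(fields["offsetFromMaster"])
print(f"hops: {fields['stepsRemoved']}, offset: {offset_ns} ns")
```

In practice the sample string would be replaced by the captured output of the pmc command. A response covering several clocks or ports would need the fields grouped per response header, which this sketch does not do.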

Similarly, one could use GET DEFAULT_DATA_SET to find out how each clock is configured:

# pmc -u 'GET DEFAULT_DATA_SET'
sending: GET DEFAULT_DATA_SET
        7cfe90.fffe.f5a351-0 seq 0 RESPONSE MANAGEMENT DEFAULT_DATA_SET
                twoStepFlag             1
                slaveOnly               0
                numberPorts             3
                priority1               100
                clockClass              248
                clockAccuracy           0xfe
                offsetScaledLogVariance 0xffff
                priority2               128
                clockIdentity           7cfe90.fffe.f5a351
                domainNumber            0

Some values can also be changed by using the SET command:

# pmc -u 'SET PRIORITY1 128'
sending: SET PRIORITY1
        7cfe90.fffe.f5a351-0 seq 0 RESPONSE MANAGEMENT PRIORITY1
                priority1 128

All the commands supported by pmc are described in the manual page, or by issuing pmc -u help.

Statistics

The message GET PORT_STATS_NP can be used to obtain counters of the number of messages received or transmitted on each port:

# pmc -u -b0 'GET PORT_STATS_NP'
sending: GET PORT_STATS_NP
        7cfe90.fffe.f5a351-1 seq 0 RESPONSE MANAGEMENT PORT_STATS_NP
                portIdentity              7cfe90.fffe.f5a351-1
                rx_Sync                   80746
                rx_Delay_Req              0
                rx_Pdelay_Req             0
                rx_Pdelay_Resp            0
                rx_Follow_Up              80746
                rx_Delay_Resp             80542
                rx_Pdelay_Resp_Follow_Up  0
                rx_Announce               80747
                rx_Signaling              0
                rx_Management             0
                tx_Sync                   1
                tx_Delay_Req              80542
                tx_Pdelay_Req             0
                tx_Pdelay_Resp            0
                tx_Follow_Up              1
                tx_Delay_Resp             0
                tx_Pdelay_Resp_Follow_Up  0
                tx_Announce               2
                tx_Signaling              0
                tx_Management             0
[...]

This will show statistics for all the ports. To get just one of them, use the TARGET command:

# pmc -u -b0 'TARGET 7cfe90.fffe.f5a351-2' 'GET PORT_STATS_NP'
sending: GET PORT_STATS_NP
        7cfe90.fffe.f5a351-2 seq 0 RESPONSE MANAGEMENT PORT_STATS_NP
                portIdentity              7cfe90.fffe.f5a351-2
                rx_Sync                   0
                rx_Delay_Req              88293
[...]

Note: PTP message statistics and support for PORT_STATS_NP should be available in the next version of linuxptp after 2.0. Before they are, the two patches can be applied by hand.

Synchronizing Real Time Clock

Another tool that the linuxptp package contains is phc2sys, which synchronizes two or more clocks in the system. It is suitable for updating the Linux real-time clock (what date shows) from a ptp4l-synchronized hardware clock.

# phc2sys -a -r -m

-a instructs phc2sys to connect to a running ptp4l daemon to determine which hardware clocks are synchronized. -r additionally selects the Linux real-time clock. Because the real-time clock is not considered a time source unless -r is given twice, this synchronizes the real-time clock to the hardware clock. -m prints messages to the standard output.

Linux distributions typically contain a systemd service to run in the above mode. So you can start or enable phc2sys similarly to ptp4l:

# systemctl enable phc2sys
# systemctl start phc2sys
# systemctl enable --now phc2sys # both of the above at the same time

When running phc2sys, make sure that systemd-timesyncd is disabled, otherwise it will have its own ideas about how to synchronize the real-time clock.

Low-level Details

PTP Hardware Clock (PHC)

When PTP is supported on a given Spectrum switch, each front panel port announces the existence of a hardware clock. (All of them expose the same clock.) ethtool can be used to check this:

# ethtool -T swp1
[...]
PTP Hardware Clock: 2
[...]

We can verify that ptp2 is indeed the right clock:

# cat /sys/class/ptp/ptp2/clock_name
mlxsw_sp_clock

The device that represents the PHC associated with the above port is /dev/ptp2. Either that file or the port name itself can be used to access and tweak the hardware clock:

# phc_ctl swp1 get
phc_ctl[17852.520]: clock time is 1562947079.124677110 or Fri Jul 12 18:57:59 2019

# phc_ctl swp1 set 2000000000
phc_ctl[17990.104]: set clock time to 2000000000.000000000 or Wed May 18 06:33:20 2033

By adjusting clock frequency, one can have the clock tick slower or faster. In the following example, the clock frequency is increased by 10% (100M parts per billion), and indeed accumulates an extra second within 10 seconds of real time waiting:

# phc_ctl swp1 set 0 freq 100000000 wait 10 get
phc_ctl[18370.946]: set clock time to 0.000000000 or Thu Jan  1 02:00:00 1970
phc_ctl[18370.946]: adjusted clock frequency offset to 100000000.000000ppb
phc_ctl[18380.946]: process slept for 10.000000 seconds
phc_ctl[18380.946]: clock time is 11.001129199 or Thu Jan  1 02:00:11 1970

The user would not normally manipulate the clock by hand like this, instead leaving this to the PTP daemon. But it may be useful in order to verify low-level operation of the hardware clock.
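
The arithmetic behind the frequency example above can be checked with a short sketch. The function name is hypothetical; the formula simply scales the elapsed real time by the frequency offset expressed in parts per billion.

```python
def expected_phc_time(start_s, freq_offset_ppb, elapsed_s):
    """Time shown by a clock that runs freq_offset_ppb fast for elapsed_s seconds."""
    return start_s + elapsed_s * (1 + freq_offset_ppb / 1e9)


# Clock set to 0, sped up by 100,000,000 ppb (10%), read after 10 seconds.
print(f"{expected_phc_time(0, 100_000_000, 10):.9f}")
```

This gives 11 seconds. The reading in the transcript above is about a millisecond larger, presumably because some time passes between the individual phc_ctl operations.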

Hardware Timestamping

PTP operation relies heavily on the ability of the hardware to timestamp packets accurately as they enter and leave the switch. Observing egress timestamping is tricky, because the timestamps are only available on a specially-opened socket. However, it is easy enough to test the operation of ingress timestamping.

In order to inspect and configure timestamping of PTP packets, use the tool hwstamp_ctl from the linuxptp suite:

# hwstamp_ctl -i swp6
current settings:
tx_type 0
rx_filter 0

Let us enable ingress timestamping of all packets (that's the 1 in the following):

# hwstamp_ctl -i swp6 -r 1
current settings:
tx_type 0
rx_filter 0
new settings:
tx_type 0
rx_filter 1
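
The numeric tx_type and rx_filter values printed by hwstamp_ctl correspond to the hwtstamp_tx_types and hwtstamp_rx_filters enumerations in the kernel header linux/net_tstamp.h. The following sketch decodes them; the lists are transcribed from that header as an aid for reading the output, and the describe helper is hypothetical.

```python
# Index in the list == numeric value used by hwstamp_ctl.
TX_TYPES = [
    "HWTSTAMP_TX_OFF",
    "HWTSTAMP_TX_ON",
    "HWTSTAMP_TX_ONESTEP_SYNC",
    "HWTSTAMP_TX_ONESTEP_P2P",
]

RX_FILTERS = [
    "HWTSTAMP_FILTER_NONE",
    "HWTSTAMP_FILTER_ALL",
    "HWTSTAMP_FILTER_SOME",
    "HWTSTAMP_FILTER_PTP_V1_L4_EVENT",
    "HWTSTAMP_FILTER_PTP_V1_L4_SYNC",
    "HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ",
    "HWTSTAMP_FILTER_PTP_V2_L4_EVENT",
    "HWTSTAMP_FILTER_PTP_V2_L4_SYNC",
    "HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ",
    "HWTSTAMP_FILTER_PTP_V2_L2_EVENT",
    "HWTSTAMP_FILTER_PTP_V2_L2_SYNC",
    "HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ",
    "HWTSTAMP_FILTER_PTP_V2_EVENT",
    "HWTSTAMP_FILTER_PTP_V2_SYNC",
    "HWTSTAMP_FILTER_PTP_V2_DELAY_REQ",
    "HWTSTAMP_FILTER_NTP_ALL",
]


def describe(tx_type, rx_filter):
    """Translate the two numbers printed by hwstamp_ctl into enum names."""
    return TX_TYPES[tx_type], RX_FILTERS[rx_filter]


print(describe(0, 1))  # the "new settings" shown above
```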

Now to determine which packets are HW-timestamped, set the clock to the past and run tcpdump to capture:

# phc_ctl swp6 set 1000000000
phc_ctl[429855.465]: set clock time to 1000000000.000000000 or Sun Sep  9 04:46:40 2001
# tcpdump -j adapter_unsynced -tttt -i swp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on swp6, link-type EN10MB (Ethernet), capture size 262144 bytes
2019-08-05 16:33:08.221491 LLDP, length 59
2019-08-05 16:33:08.225781 LLDP, length 59
[...]

Now use mausezahn to send broken PTP Sync packets to that port, either from another machine, or from the same switch, if the two ports are connected by a loop-back cable:

# mausezahn swp6 -A 192.0.2.1 -B 224.0.1.129 -c 1 -a own -b bc -t udp \
        sp=319,dp=319,p=00:02:$(yes 00 | head -n 32 | tr '\n' ':')

On the tcpdump terminal you should see something like this:

2019-08-05 16:35:38.443538 LLDP, length 59
2019-08-05 16:35:38.447916 LLDP, length 59
2001-09-09 04:49:48.021207 IP 192.0.2.1.ptp-event > ptp-primary.mcast.net.ptp-event: UDP, length 34

The timestamps from 2019 are the software ones. The packets could be from another control protocol, or just data plane traffic that was trapped to the CPU. The timestamps from 2001 are from the HW clock.
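
The payload that the mausezahn command above builds with yes, head and tr can be reproduced as follows to see why the capture reports a UDP length of 34. The first byte carries the PTP message type (0 for Sync) and the second the PTP version (2); the remaining bytes are zeros. This is a sketch with hypothetical function names. The packet counts as broken because 34 bytes is only the size of the PTPv2 common header, without the body that a real Sync message carries.

```python
def sync_like_payload(zero_bytes=32):
    """Two header bytes (messageType 0 = Sync, versionPTP 2) plus zero padding."""
    return bytes([0x00, 0x02]) + bytes(zero_bytes)


def as_colon_hex(payload):
    """Format bytes as colon-separated hex, as used in the p= argument."""
    return ":".join(f"{byte:02x}" for byte in payload)


payload = sync_like_payload()
print(len(payload))
print(as_colon_hex(payload))
```

The shell pipeline leaves a trailing colon after the last byte, which this sketch omits.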

Limitations in Spectrum-2 and Above

On Spectrum-2 and newer ASICs, the user cannot enable timestamping in only one direction, ingress or egress. For example, the following configuration is not supported:

# hwstamp_ctl -i swp6 -r 12 -t 0
current settings:
tx_type 0
rx_filter 0
SIOCSHWTSTAMP failed: Invalid argument

The Spectrum-2 and newer ASICs act as a transparent clock between front panel ports and the reference plane, which lies at the host interface ("CPU port"). When timestamping is enabled on a port, this also enables adjustment of the correction field in timestamped packets. However, it only makes sense to enable the transparent clock globally. Therefore when timestamping is enabled on one of the ports, PTP packets from all ports will have their correction field updated. Correspondingly, PTP packets from all ports will also get the timestamp attached.

On Spectrum-2 and newer ASICs, mlxsw permits timestamping of any PTP event packets. Due to the global nature of the setting, when timestamping of any PTP event is requested, all PTP event packets are timestamped. Correspondingly, HWTSTAMP_FILTER_SOME (the value 2) is returned through the API:

# hwstamp_ctl -i swp6 -r 4 -t 1
current settings:
tx_type 0
rx_filter 0
new settings:
tx_type 1
rx_filter 2

Issues and Interactions

PIM

When operating a multicast router, if the PIM rendezvous point (RP) mask covers the multicast address used by PTP (224.0.1.129), the multicast router will eventually set up routes for this address. If the master's messages have TTL>1, they are not terminated on the boundary clock, but are instead propagated to the slaves. The slaves thus see the master's messages directly, evaluate the master as the best master clock, and attempt to synchronize directly to it. Because all this traffic goes through the slow path, the path delay is unpredictable and the slaves are unable to synchronize reasonably closely to the master clock.

The issue can be corrected in any of the following ways:

  • Configure a PIM RP mask that does not cover the PTP multicast address.

  • Set up an iptables rule that prevents propagation of PTP packets:

    # iptables -I FORWARD -p udp -m udp --dport 319 -j DROP
    # iptables -I FORWARD -p udp -m udp --dport 320 -j DROP
    
  • Have the master send packets such that they arrive at the switch with a TTL of 1. Such packets are not forwarded. In ptp4l, the TTL of sent packets is configured by the udp_ttl option.

Socket Buffer

To deliver egress timestamps to ptp4l, Linux loops back the original packet to the error queue of the listening socket, with the timestamps attached. If that socket is already overwhelmed with ingress traffic, there may not be space to enqueue the timestamped egress traffic. While ptp4l will not notice a missed ingress packet, it needs to wait for the timestamp of the packets that it sent (see above for tx_timestamp_timeout), and will therefore notice and complain that the timestamp was not delivered.

Unfortunately, ptp4l does not allow setting of the socket receive buffer size, so if this happens, one needs to increase the default value before starting ptp4l, and then revert it back again. For example:

# sysctl -w net.core.rmem_max=4194304 # 4MB
# sysctl -w net.core.rmem_default=4194304

Throughput of PTP-Enabled Ports

When timestamping is enabled on a port with speed of 25Gbps or lower, mlxsw activates a PTP shaper. That decreases the bandwidth by about 4-6% in order to reduce the jitter and improve predictability of timestamping. The shaper is turned off again when PTP timestamping is disabled on a given port, or when the speed increases above 25Gbps.
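
As a rough illustration of what the shaper costs, the following sketch computes the remaining bandwidth of a port with timestamping enabled. The function name is hypothetical, and the reduction is passed in as a parameter because only the approximate 4-6% range is known.

```python
def shaped_bandwidth_gbps(port_speed_gbps, reduction):
    """Approximate usable bandwidth of a port with PTP timestamping enabled.

    reduction is the fraction taken by the PTP shaper, e.g. 0.04 to 0.06.
    The shaper only applies at speeds of 25Gbps or lower.
    """
    if port_speed_gbps > 25:
        return float(port_speed_gbps)
    return port_speed_gbps * (1 - reduction)


for reduction in (0.04, 0.06):
    print(f"25Gbps port, {reduction:.0%} reduction: "
          f"{shaped_bandwidth_gbps(25, reduction):.1f} Gbps")
```

A 25Gbps port would thus be left with roughly 23.5 to 24 Gbps.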

Untimestamped Packets

On Spectrum-1, timestamps are delivered to the driver separately from the packets, and the driver has to match the two before passing the timestamped packet on to the kernel. Due to this separation it is possible that one of the pieces never arrives.

The driver runs a regular garbage collection process that cleans up unmatched packets and timestamps after they are about a second old. In order to provide visibility into the GC events, the mlxsw driver publishes, on Spectrum-1 only, the following four artificial counters:

# ethtool -S swp1 | grep ptp_
     ptp_rx_gcd_packets: 0
     ptp_rx_gcd_timestamps: 0
     ptp_tx_gcd_packets: 0
     ptp_tx_gcd_timestamps: 0

Packets are garbage-collected if the driver got the packet, but did not get the corresponding timestamp. Conversely, timestamps are garbage-collected when the corresponding packet never arrived, for example because it was dropped.
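
The matching and garbage collection described in this section can be modelled with a small sketch. It is an illustration of the idea only, not the mlxsw implementation: the class, the matching key and the method names are hypothetical, and the counters merely mirror the ethtool counters for one direction.

```python
class PacketTimestampMatcher:
    """Pairs packets with separately delivered timestamps; GCs the leftovers."""

    def __init__(self, gc_age_s=1.0):
        self.gc_age_s = gc_age_s
        self.pending = {}          # key -> (kind, value, arrival time)
        self.gcd_packets = 0
        self.gcd_timestamps = 0

    def _add(self, kind, key, value, now):
        other = self.pending.get(key)
        if other is not None and other[0] != kind:
            # The counterpart is already waiting: emit a timestamped packet.
            del self.pending[key]
            if kind == "packet":
                return value, other[1]
            return other[1], value
        self.pending[key] = (kind, value, now)
        return None

    def add_packet(self, key, packet, now):
        return self._add("packet", key, packet, now)

    def add_timestamp(self, key, timestamp, now):
        return self._add("timestamp", key, timestamp, now)

    def collect_garbage(self, now):
        """Drop entries older than gc_age_s; return packets to pass on untimestamped."""
        released = []
        for key, (kind, value, arrived) in list(self.pending.items()):
            if now - arrived >= self.gc_age_s:
                del self.pending[key]
                if kind == "packet":
                    self.gcd_packets += 1
                    released.append(value)
                else:
                    self.gcd_timestamps += 1
        return released


matcher = PacketTimestampMatcher()
matcher.add_packet("seq-1", "sync-1", now=0.0)
print(matcher.add_timestamp("seq-1", 1000, now=0.1))   # matched pair
matcher.add_packet("seq-2", "sync-2", now=0.2)         # timestamp never arrives
matcher.add_timestamp("seq-3", 3000, now=0.3)          # packet never arrives
print(matcher.collect_garbage(now=2.0), matcher.gcd_packets, matcher.gcd_timestamps)
```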

One reason for these losses is host interface pressure. Currently the PTP policer is hardcoded at 24Kpps, which needs to include all incoming PTP event packets, and then the timestamp events themselves, for both ingress and egress. Although one timestamp event can carry up to four timestamps, the vast majority of them will arrive one by one. So 12Kpps of PTP event ingress will all but saturate the host interface policer.
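
The budget arithmetic can be made explicit with a short sketch. It is simplified to ingress traffic only and assumes that each PTP event packet consumes one unit of the policer budget for the packet itself, plus its share of a timestamp event. The names are hypothetical.

```python
import math

POLICER_PPS = 24_000  # hardcoded PTP policer rate quoted above


def host_interface_load(event_pps, timestamps_per_event=1):
    """Packets per second hitting the policer for a given PTP event packet rate."""
    timestamp_events = math.ceil(event_pps / timestamps_per_event)
    return event_pps + timestamp_events


def saturates_policer(event_pps, timestamps_per_event=1):
    return host_interface_load(event_pps, timestamps_per_event) >= POLICER_PPS


print(host_interface_load(12_000), saturates_policer(12_000))
print(host_interface_load(12_000, timestamps_per_event=4))
```

With one timestamp per event, 12Kpps of PTP event packets already uses the whole 24Kpps budget. Only in the best case of four timestamps per event would the same traffic need 15Kpps.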

On top of that, even below the policer limit, some timestamps will get lost simply because there happens to be a burst of PTP traffic on a single port, and the HW timestamp queue for that port has overflowed.

The garbage-collected packets are simply forwarded without a timestamp; they are not thrown away unless the original port went away, for example due to a split. Some PTP stacks might still be able to use these packets in some way, though linuxptp will simply complain:

ptp4l[6000.268]: rms   19 max   46 freq  -1484 +/-  26 delay    24 +/-   2
ptp4l[6004.269]: rms   13 max   25 freq  -1476 +/-  19 delay    25 +/-   1
ptp4l[6008.271]: rms   12 max   28 freq  -1461 +/-  18 delay    26 +/-   1
ptp4l[6011.029]: port 3: received DELAY_REQ without timestamp

Further Reading

  1. man ptp4l
  2. man phc2sys
  3. man pmc
  4. Linux kernel timestamping documentation
  5. Fedora PTP documentation
  6. Slides for a PTP tutorial