Skip to content

(Mermaid Review) Data Format

Paran Lee edited this page Aug 21, 2022 · 1 revision

This page describes the structure of uftrace trace data which was generated by uftrace record command.

Overview

Figure 1. User Program Recording Output

graph TD
    B(uftrace record a.out)
    B --> C(uftrace.data/)
    C -->|info| D[info]
    C -->|.txt| E[task.txt]
    C -->|$tid.dat| F[11357.dat]
    C -->|$sid-$SESSION_ID.map| G[sid-da60795b9f84a68b.map]
    C -->|.opts| H[default.opts]
    C -->|.sym| I[*.sym]
Loading

Figure 2. Kernel Program Recording Output

graph TD
    B(uftrace -k record a.out)
    B --> C(uftrace.data/)
    C -->|kernel_header| D[kernel_header]
    C -->|kernel-cpu$N.dat| E[kernel-cpu0.dat ]
    C -->|perf-cpu$N.dat| F[perf-cpu0.dat]
    C -->|info| G[info]
    C -->|.txt| H[task.txt]
    C -->|$tid.dat| I[11357.dat]
    C -->|$sid-$SESSION_ID.map| J[sid-da60795b9f84a68b.map]
    C -->|.opts| K[default.opts]
    C -->|.sym| L[*.sym]
Loading

The uftrace data format is designed to be a portable and self-contained trace data, which means once the trace is recorded, it can be replayed anywhere even in different architecture. It makes the trace record can be shared easily.

The uftrace data is saved under a directory (uftrace.data by default) and it contains following files

  • info : various information about uftrace and running process
  • task.txt : task and session information
  • sid-<SESSION_ID>.map : memory mapping for a session
  • <PROGRAM>.sym : (function) symbol address and name
  • <PROGRAM>.dbg : debug information for each function (only if debug info is available)
  • <TID>.dat : trace data for each task of given <TID>
  • kernel_header : kernel ftrace header info (only if kernel tracing was used)
  • kallsyms : kernel symbol information (ditto)
  • kernel-cpuX.dat : per-cpu kernel tracing data (ditto)
  • perf-cpuX.dat : per-cpu perf event data (only if perf event is supported)
  • events.txt : user event mapping (only if user-event is used)
  • extern.dat : external event data not recorded by uftrace (users can add)

The actual trace data is saved in the .dat file. It needs to resolve the addresses (in .dat file) to a symbol which was saved in the .sym file. But the .sym file only has the relative offsets so it also needs to find a base address from a (per-session) .map file. The task.txt file provides information that which (session) map should be used to resolve an address of a task at a given timestamp.

The info file

The info file provides metadata about the format as well as process and system information at the time of the recording. This file consists of two parts - the first is binary form of metadata and the second is text form of the information (which can be shown with uftrace info command like the below example).

$ uftrace record tests/t-abc
$ uftrace info
# system information
# ==================
# program version     : uftrace v0.6.2-79-g3843
# recorded on         : Mon Apr 17 10:39:32 2017
# cmdline             : uftrace record tests/t-abc
# cpu info            : Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
# number of cpus      : 4 / 4 (online / possible)
# memory info         : 0.1 / 7.4 GB (free / total)
# system load         : 0.58 / 0.34 / 0.23 (1 / 5 / 15 min)
# kernel version      : Linux 4.5.0-rc4+
# hostname            : taeung-ThinkPad-X1-Carbon-3rd
# distro              : "Ubuntu 16.04 LTS"
#
# process information
# ===================
# number of tasks     : 1
# task list           : 24290
# exe image           : /home/taeung/git/uftrace/tests/t-abc
# exit status         : exited with code: 0
# cpu time            : 0.000 / 0.004 sec (sys / user)
# context switch      : 1 / 1 (voluntary / involuntary)
# max rss             : 3272 KB
# page fault          : 0 / 189 (major / minor)
# disk iops           : 0 / 16 (read / write)
  • The metadata starts with a 8-byte magic string which is 0x46 0x74 0x72 0x61 0x63 0x65 0x21 0x00 or "Ftrace!".
  • It's followed by a 4-byte number of file version and the current version is 4.
  • And then there's a 2-byte number of header (metadata) size and the current value is 40 (or 0x28).
  • The next byte identifies a byte-order (endian) in the data files. The value is same as the ELF format (EI_DATA: 1 is for the little-endian and 2 is for the big-endian).
  • The next byte tells the size of address or long int type also same as the ELF format (EI_CLASS: 1 is for 32-bit and 2 is for 64-bit).
  • Then 64-bit bit mask (feat_mask) of enabled features comes after it. The bit 0 is for PLT (library call) hooking, the bit 1 is for task and session info, the bit 2 is for kernel tracing, the bit 3 is for function arguments, the bit 4 is for function return value, the bit 5 is for whether symbol file contains relative offset or absolute address, and the bit 6 is for max (function) stack depth.
  • The next 64-bit mask (info_mask) is for which kind of process and system information was saved after the metadata.
  • And then it followed by a 2-byte number of maximum function call (stack) depth given by user.
  • The rest 6-byte is reserved for future use and should be filled with zero.

For example, you can check the metadata like below.

$ hexdump -C uftrace.data/info | head -3
00000000  46 74 72 61 63 65 21 00  04 00 00 00 28 00 01 02  |Ftrace!.....(...|
00000010  63 00 00 00 00 00 00 00  fd 03 00 00 00 00 00 00  |c...............|
00000020  00 04 00 00 00 00 00 00  65 78 65 6e 61 6d 65 3a  |........exename:|

The metadata is maintained as struct uftrace_file_header in the uftrace.h file.

After the metadata, info string follows in a "key:value" form. For example, you can see it like below.

$ cat uftrace.data/info
...
cmdline:uftrace record tests/t-abc
cpuinfo:lines=2
cpuinfo:nr_cpus=4 / 4 (online/possible)
cpuinfo:desc=Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
meminfo:0.1 / 7.4 GB (free / total)
osinfo:lines=3
osinfo:kernel=Linux 4.5.0-rc4+
osinfo:hostname=taeung-ThinkPad-X1-Carbon-3rd
osinfo:distro="Ubuntu 16.04 LTS"
taskinfo:lines=2
...

Basically a single bit in the info_mask corresponds to a single line in the info string. If it consists of two or more lines, the first line should tell how many lines comes after it. The uftrace info command will also show those info string in more readable format.

The task.txt file

This files shows relation between tasks and sessions. A session keeps a memory map of tasks which can be created when the (first) task was started or a new program was executed (by exec(3)). When a child task was forked, it inherits the session of parent (since it's memory mapping will be same unless child adds or removes mappings). So a session can be shared by multiple tasks and also a single task can have multiple sessions. The task.txt saves task (parent-child relationship) and session info with timestamp so that it can track the correct mappings.

A very simple trace data will contain a session and a task only like the below example:

$ cat uftrace.data/task.txt
SESS timestamp=42622.893819172 pid=24290 sid=5951ceee0be7fb17 exename="/home/taeung/git/uftrace/tests/t-abc"
TASK timestamp=42622.893967350 tid=24290 pid=24290

When a task calls dlopen(3) to load a library dynamically, new memory mappings will be created and it needs to be recorded. Instead of creating a new session for dlopen, a "DLOP" line will be added to the task.txt file with a name of the library and a base address where the library was loaded.

The session .map file

As you can see the above example, a session has a session id (sid) for identity. The session id a random 16-character string (or 8-byte hex number) and it's used as a file name of the map file (e.g. sid-5951ceee0be7fb17.map). A session contains memory mapping of tasks which provides base address of each module (library or executable). It's actually a copy of a /proc/<TID>/maps file.

For example, The uftrace copy the below output into the map file when TID of running target process is 24290.

$ cat /proc/24290/maps
00400000-00401000 r-xp 00000000 08:01 11729337                           /home/taeung/git/uftrace/tests/t-abc
00600000-00601000 rw-p 00000000 08:01 11729337                           /home/taeung/git/uftrace/tests/t-abc
02443000-02475000 rw-p 00000000 00:00 0                                  [heap]
...
7fba87229000-7fba8722b000 rw-p 001c3000 08:01 9968358                    /lib/x86_64-linux-gnu/libc-2.23.so
7fba8722b000-7fba8722f000 rw-p 00000000 00:00 0
7fba8722f000-7fba87249000 r-xp 00000000 08:01 6558157                    /usr/local/lib/libmcount-fast.so
...
7fba87676000-7fba87677000 rw-p 00000000 00:00 0
7ffe88f83000-7ffe88fa1000 rwxp 00000000 00:00 0                          [stack]
7ffe88fa1000-7ffe88fa4000 rw-p 00000000 00:00 0
7ffe88fbd000-7ffe88fc0000 r--p 00000000 00:00 0                          [vvar]
7ffe88fc0000-7ffe88fc2000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

The symbol (.sym) file

The uftrace saves the symbol table of the traced program so that it can resolve the symbol from address easily. The symbol file contains only function symbols and its format is almost identical to the output of nm(1) command. The difference is that it also saves PLT entries which is used to call library functions and it has 'P' type. You can see the below example.

$ cat uftrace.data/t-abc.sym
00000000000004e0 P getpid
00000000000004f0 P __libc_start_main
0000000000000500 P __monstartup
0000000000000510 P mcount
0000000000000520 P __cxa_atexit
0000000000000530 P atoi
0000000000000540 P __dynsym_end
0000000000000540 T _start
0000000000000570 T __gmon_start__
0000000000000667 t a
000000000000067a t b
000000000000068d t c
00000000000006be T main
0000000000000710 T __libc_csu_init
0000000000000780 T __libc_csu_fini
0000000000000790 T atexit
00000000000007b9 T __sym_end

The data (.dat) file

The data file contains actual trace data (record) for each task so the task id (tid) will be used as a file name. The data is two 64-bit numbers - first is a timestamp in nsec and second consists of 2-bit type, 1-bit marker, 3-bit magic, 10-bit depth and 48-bit address.

The type is one of 'ENTRY', 'EXIT', 'EVENT' or 'LOST'. The 'ENTRY' and 'EXIT' types are for function tracing and 'EVENT' type is reserved for event tracing like kernel-level tracepoint or user-level SDT. The 1-bit marker is whether this record has additional data (like argument or return value). The 3-bit magic is for data integrity and it should have a value of 5 (or 0b101). The 10-bit depth shows the function call depth (or level). And finally 48-bit address is to identify function (symbol); it's ok as most 64-bit systems only use 48-bit address space for now.

You can see the definition of the data record as struct uftrace_record in uftrace.h file.

/* reduced version of mcount_ret_stack */
struct uftrace_record {
  uint64_t time;
  uint64_t type:   2;
  uint64_t more:   1;
  uint64_t magic:  3;
  uint64_t depth:  10;
  uint64_t addr:   48; /* child ip or uftrace_event_id */
};

The debug info (.dbg) file

This file contains additional (DWARf) debug information related to functions. As of now, it's only recorded if uftrace is built with libdw and target program is built with -g option. This includes function arguments, return value and source location. Also enum definition is saved if it's used in arguments or return values. The actual contents look like below:

$ cat uftrace.data/enum.dbg
E: enum bar {E_FOO=-1,E_BAR,E_BAZ,E_XXX}
F: 11d9 uft::foo
L: 13 /home/namhyung/tmp/uftrace/enum.cpp
A: @arg1/e:bar
R: @retval/e:bar
F: 1224 main
...

The line started with E: is an enum definition. The F: line defines a function with an (in-file) offset and its symbol name. Lines below the F: (and up-to next F:) describe the function. The L: line is for source code location - line number and file name. The line started with A: and R: defines function arguments and return value respectively.

perf event data files

The perf-cpuX.dat files are used to read event data generated from the Linux kernel using perf_event_open system call. But it doesn't actually open real perf events but just use a dummy event to collect task-related context info like PERF_RECORD_FORK, PERF_RECORD_EXIT, PERF_RECORD_COMM and PERF_RECORD_SWITCH (if supported). Please refer 'MMAP Layout' in the man page of perf event syscall.

kernel trace files

The kernel_header, kallsyms and kernel-cpuX.dat files are used for kernel function/event tracing by libtraceevent.

(user) event file (events.txt)

This file contains mapping between event ID and userspace event information which is defined dynamically. Currently only USDT events use this and the format might change later (for argument support).

$ cat events.txt
EVENT: 1000000 uftrace:test

The number identifies the event (in the data file) and uftrace will use the symbol (in the form of "provider:event") for that event.

External event file (extern.dat)

This file contains messages displayed with uftrace output. Each line will be displayed as a comment in current function. So it needs a timestamp (with CLOCK_MONOTONIC) where it'd be printed and the format is same as uftrace dump or (uftrace replay -f time).

$ cat extern.dat
1234567.987654321  This is the message that will be printed