The high-level Language Telemetry SDKs include higher-level metric abstractions (especially “Compound Metrics” such as histograms, percentiles, and moving averages) that represent, in a canonical way, the numeric/statistical data structures commonly needed for application monitoring.
This document describes general principles and guidelines for how these abstractions should be constructed across varying metric libraries.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.
An implementation is not compliant if it fails to satisfy one or more of the MUST, MUST NOT, REQUIRED, SHALL, or SHALL NOT requirements for the protocols it implements. An implementation is compliant if it satisfies all the MUST, MUST NOT, REQUIRED, SHALL, and SHALL NOT requirements for the protocols it implements.
As we have been building exporters for common metric-monitoring frameworks, such as opencensus, dropwizard-metrics, and micrometer, we have identified a need for higher-order abstractions that can be built on top of our three standard metric types.
These frameworks have generally already done the aggregation/statistical work by the time they provide data to an exporter.
Because we only have three metric types to work with for now (`Gauge`, `Delta-Count`, and `Summary`), we need to figure out how to model these higher-level abstractions in our dimensional metric system. In order for these higher-order abstractions to be displayable in NR-One with standard widgets and nerdlets, we should standardize the naming of the attributes that will be used to build the faceted NRQL queries that provide the data for the visualizations.
We would like to enable our customers to create new extended metric types, as they may have cases that we have not considered.
In those cases we may not have standard visualizations/nerdlets in NR-One that support them, but since customers are able to build their own nerdlets, we should make it easy for them to extend what we provide and to query what we provide by default.
- Attribute names generated by the exporter library SHOULD be dot-delimited `camelCase`. In general, dots are used to separate modifiers and subjects. For example, if you are generating an attribute that denotes the histogram bucket that a gauge represents, you might call that attribute `histogram.bucket`.
- When generating groups of metrics that should be visualized together, the metric names for that group SHOULD be the customer-provided metric name, suffixed with the name for the type of group, separated by a `.`. For example, if you are generating a set of gauges that represent a histogram, you would suffix the customer-provided metric name with "`.buckets`".
- All metric names SHOULD start with the instrumentation-provided or customer-provided metric name that the metric library provides to the exporter.
- All metrics that share a common name MUST also have at least one variable attribute, so they can be independently aggregated by the back end.
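To make these naming rules concrete, here is a minimal sketch that assembles a group metric name and its dot-delimited camelCase attributes. The `MetricNaming` class, its `groupName` helper, and the `http.duration` metric name are hypothetical examples, not part of any SDK:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the naming conventions above; not part of any SDK.
public class MetricNaming {

    // Suffix the customer-provided metric name with the group type, separated by a dot.
    static String groupName(String customerMetricName, String groupSuffix) {
        return customerMetricName + "." + groupSuffix;
    }

    public static void main(String[] args) {
        // Customer-provided metric name, as handed to the exporter by the metric library.
        String metricName = "http.duration";

        // Metric name for the histogram group: "http.duration.buckets".
        String bucketMetricName = groupName(metricName, "buckets");

        // Dot-delimited camelCase attribute names; dots separate modifiers and subjects.
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("histogram.bucket.upperBound", 500.0);
        attributes.put("histogram.bucket.number", 3);

        System.out.println(bucketMetricName + " " + attributes);
    }
}
```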
- A monotonically increasing counter MUST be implemented with a single `Count` metric that has been diffed since the last report period.
- A counter whose value can decrease (a non-monotonic counter) MUST be implemented with a `Gauge`, reporting the current value of the counter.
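As a minimal sketch of these two counter shapes, assuming the exporter samples the counter's total once per report period; the `CounterExport` class and its methods are hypothetical, for illustration only:

```java
// Hypothetical sketch of the two counter export shapes described above;
// the class and method names are illustrative, not part of any SDK.
public class CounterExport {

    private long previousTotal = 0;

    // Monotonic counter: report a Count metric whose value is the change
    // since the last report period.
    long asDeltaCount(long currentTotal) {
        long delta = currentTotal - previousTotal;
        previousTotal = currentTotal;
        return delta; // value of the Count metric for this period
    }

    // Non-monotonic counter: report a Gauge carrying the counter's current
    // value (diffing could produce invalid negative counts).
    long asGauge(long currentValue) {
        return currentValue; // value of the Gauge metric
    }
}
```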
**Important: Histogram specifications are still a work-in-progress. We strongly recommend that you do not implement histogram exports at this time.**
Often, libraries will provide a histogram data type to represent distributions of metric values as bucketed data.
When exporting this metric type, the following guidelines MUST be followed.
- Histograms MUST be implemented as a set of metrics, one metric for each bucket (see the sketch following this list).
  - Each metric SHOULD be a `Count` metric.
  - Each metric value MUST represent the total number of observations occurring up to the upper bound of that bucket.
  - Each metric MUST have attributes that identify the bucket bounds.
    - The `histogram.bucket.upperBound` attribute MUST be included and MUST represent the upper bound on the bucket as a signed floating-point number. If the bucket is unbounded, this CAN be elided and the additional "`.sum`" metric included instead (see below).
    - The `histogram.bucket.lowerBound` attribute CAN be included. If it is included, it MUST represent the lower bound on the bucket as a signed floating-point number.
    - The `histogram.bucket.number` attribute CAN be included. If it is included, it MUST be included for all buckets and MUST represent the bucket number as a 0-based integral value, where 0 is the bucket with the smallest upper bound.
  - Each metric name MUST have a "`.buckets`" suffix.
- An additional metric with the sum of all values recorded in the histogram CAN be included.
  - The metric SHOULD be a `Count` metric.
  - If a bucket was elided due to it being unbounded, this metric SHOULD be included with that omitted bucket's value (as that value should be equivalent to the total sum of all values).
  - This metric MUST have a name with a suffix of "`.sum`".
- If implementation of the bucketing algorithm is left up to the exporter, then histograms SHOULD be constructed as cumulative histograms. This means bucket values represent the cumulative number of observations in all of the buckets up to and including the specified bucket.
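Referenced from the bucket guidelines above, the following sketch shows how an exporter that owns the bucketing algorithm might turn raw per-bucket counts into cumulative `.buckets` metrics. The `HistogramExport` class, `BucketMetric` record, and all values are hypothetical illustrations:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of exporting a histogram as cumulative bucket Count metrics.
public class HistogramExport {

    // One exported metric per bucket. The number and upperBound fields would be
    // exported as the histogram.bucket.number and histogram.bucket.upperBound attributes.
    record BucketMetric(String name, long value, int number, double upperBound) {}

    static List<BucketMetric> toBucketMetrics(String metricName,
                                              double[] upperBounds,
                                              long[] perBucketCounts) {
        List<BucketMetric> metrics = new ArrayList<>();
        long cumulative = 0;
        for (int i = 0; i < upperBounds.length; i++) {
            // Cumulative histogram: each value is the total number of observations
            // occurring up to this bucket's upper bound.
            cumulative += perBucketCounts[i];
            metrics.add(new BucketMetric(metricName + ".buckets", cumulative, i, upperBounds[i]));
        }
        return metrics;
    }

    public static void main(String[] args) {
        // Illustrative bounds and raw (per-bucket, non-cumulative) counts.
        double[] bounds = {100.0, 250.0, 500.0};
        long[] counts = {7, 3, 2};
        toBucketMetrics("http.duration", bounds, counts).forEach(System.out::println);
    }
}
```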
- Note: This quantization metric type has low direct value to the end user with the visualization tools currently available. Directly viewing the bucket counts in a two-dimensional representation is fundamentally limited to finite aggregate time-slices of the data (instead of continuous timeseries). With this in mind, it is RECOMMENDED to additionally export this data as percentiles in order to better support visualization.
Percentiles provide an ideal way to visualize a distribution of sampled events.
They are indispensable in identifying and defining the relevant characteristics of a distribution, and they are viewable in a two-dimensional space.
Ideally, percentiles would be determinable by the end user when they query for the data.
Currently, constructing percentiles at query time is not possible.
It is, however, possible to calculate percentiles at measurement time.
This calculation of percentiles at measurement time is the RECOMMENDED way to present metrics about distributions of monitored events. To do this, the following guidelines MUST be followed.
- Percentiles MUST be implemented as a set of metrics, one metric for each percentile.
  - Each metric MUST be a `Gauge` metric.
  - Each metric value MUST represent a calculated percentile.
  - The `percentile` attribute MUST be included with each metric. This attribute value MUST represent the percentile being measured as a floating-point number within the range [0.0, 100.0].
  - Each metric name MUST have a "`.percentiles`" suffix.
- An additional `Summary` metric for the set of percentiles CAN be included.
  - This `Summary` metric MUST have a name with a suffix of "`.summary`".
- If implementation of the percentiles algorithm is left up to the exporter, then percentiles SHOULD be calculated as the linear interpolation between closest ranks (see the sketch following this list).
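A minimal sketch of that linear interpolation between closest ranks, assuming the exporter retains a sorted array of observed values at measurement time; the `Percentiles` class and its `percentile` method are hypothetical:

```java
import java.util.Arrays;

// Hypothetical sketch of percentile calculation by linear interpolation
// between closest ranks, computed at measurement time.
public class Percentiles {

    // p is in the range [0.0, 100.0], matching the "percentile" attribute value.
    static double percentile(double[] sortedValues, double p) {
        if (sortedValues.length == 1) return sortedValues[0];
        // Fractional rank of the requested percentile within the sorted sample.
        double rank = (p / 100.0) * (sortedValues.length - 1);
        int lower = (int) Math.floor(rank);
        int upper = (int) Math.ceil(rank);
        double fraction = rank - lower;
        // Interpolate linearly between the two closest ranks.
        return sortedValues[lower] + fraction * (sortedValues[upper] - sortedValues[lower]);
    }

    public static void main(String[] args) {
        double[] samples = {12.0, 5.0, 40.0, 33.0, 21.0};
        Arrays.sort(samples);
        // Each result would be reported as a Gauge named "<metric>.percentiles"
        // with a "percentile" attribute of 50.0, 95.0, or 99.0.
        for (double p : new double[] {50.0, 95.0, 99.0}) {
            System.out.printf("p%.1f = %.2f%n", p, percentile(samples, p));
        }
    }
}
```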
- Percentiles are quite problematic when queried across varying populations (across hosts, transactions, or really anything). We can pass them through as a set of gauges, but there is currently no way to support aggregating them in NRDB/NRQL. We chose gauges because the back-end aggregation simply selects the latest value; other aggregations do not make sense for percentiles, since the absolute meaning of the 99th percentile (and so on) changes continuously.
- We chose to standardize on percentiles instead of quantiles because most users talk about the percentage of events in a distribution that fall into a category, making percentiles the clear choice. This choice is not a limiting one, however, as percentiles and quantiles are trivially interconvertible (a percentile is simply a quantile multiplied by 100).
Measures throughput over varying time windows, for example 1, 5, and 15 minutes.
- Implement this with N gauges, one for each time window. Provide faceting via a varying attribute:
  - `rate`: a string representing the time window for the average rate. For example, the value would be "m1_rate", "m5_rate", etc.
- Metric name suffix: "`.rates`"
- If available, include a `Count` metric representing the total number of samples.
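As an illustration, a meter-style metric with 1-, 5-, and 15-minute moving averages could be fanned out into one gauge per window, faceted by the `rate` attribute. The `RatesExport` class, `RateGauge` record, and all values are hypothetical:

```java
import java.util.List;

// Hypothetical illustration of exporting moving-average rates as a set of gauges.
public class RatesExport {

    // One exported Gauge per time window, faceted by the "rate" attribute.
    record RateGauge(String name, double value, String rate) {}

    static List<RateGauge> toRateGauges(String metricName,
                                        double oneMinuteRate,
                                        double fiveMinuteRate,
                                        double fifteenMinuteRate) {
        String name = metricName + ".rates"; // metric name suffix from the guideline above
        return List.of(
            new RateGauge(name, oneMinuteRate, "m1_rate"),
            new RateGauge(name, fiveMinuteRate, "m5_rate"),
            new RateGauge(name, fifteenMinuteRate, "m15_rate"));
    }

    public static void main(String[] args) {
        // Example rates; values are illustrative.
        toRateGauges("requests", 4.2, 3.9, 3.5).forEach(System.out::println);
    }
}
```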
- TODO: add guidance when we have more than one example to base it on.