Skip to content

Commit

Permalink
pageserver: exclude un-read layers from short residence statistic (#8396
Browse files Browse the repository at this point in the history
)

## Problem

The `evictions_with_low_residence_duration` is used as an indicator of
cache thrashing. However, there are situations where it is quite
legitimate to only have a short residence during compaction, where a
delta is downloaded, used to generate an image layer, and then
discarded. This can lead to false positive alerts.

## Summary of changes

- Only track low residence duration for layers that have been accessed
at least once (compaction doesn't count as an access). This will give us
a metric that indicates thrashing on layers that the _user_ is using,
rather than those we're downloading for housekeeping purposes.

Once we add "layer visibility" as an explicit property of layers, this
can also be used as a cleaner condition (residence of non-visible layers
should never be alertable)
  • Loading branch information
jcsp authored Jul 16, 2024
1 parent 0950866 commit f4f0869
Show file tree
Hide file tree
Showing 3 changed files with 45 additions and 6 deletions.
20 changes: 20 additions & 0 deletions pageserver/src/tenant/storage_layer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -676,6 +676,26 @@ impl LayerAccessStats {
},
}
}

/// Whether this layer has been accessed (excluding in [`AccessStatsBehavior::Skip`]).
///
/// This indicates whether the layer has been used for some purpose that would motivate
/// us to keep it on disk, such as for serving a getpage request.
fn accessed(&self) -> bool {
let locked = self.0.lock().unwrap();
let inner = &locked.for_eviction_policy;

// Consider it accessed if the most recent access is more recent than
// the most recent change in residence status.
match (
inner.last_accesses.recent(),
inner.last_residence_changes.recent(),
) {
(None, _) => false,
(Some(_), None) => true,
(Some(a), Some(r)) => a.when >= r.timestamp,
}
}
}

/// Get a layer descriptor from a layer.
Expand Down
20 changes: 14 additions & 6 deletions pageserver/src/tenant/storage_layer/layer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1469,14 +1469,22 @@ impl LayerInner {
let duration = SystemTime::now().duration_since(local_layer_mtime);
match duration {
Ok(elapsed) => {
timeline
.metrics
.evictions_with_low_residence_duration
.read()
.unwrap()
.observe(elapsed);
let accessed = self.access_stats.accessed();
if accessed {
// Only layers used for reads contribute to our "low residence" metric that is used
// to detect thrashing. Layers promoted for other reasons (e.g. compaction) are allowed
// to be rapidly evicted without contributing to this metric.
timeline
.metrics
.evictions_with_low_residence_duration
.read()
.unwrap()
.observe(elapsed);
}

tracing::info!(
residence_millis = elapsed.as_millis(),
accessed,
"evicted layer after known residence period"
);
}
Expand Down
11 changes: 11 additions & 0 deletions test_runner/regress/test_tenant_conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
from fixtures.pageserver.utils import assert_tenant_state, wait_for_upload
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind
from fixtures.utils import wait_until
from fixtures.workload import Workload


def test_tenant_config(neon_env_builder: NeonEnvBuilder):
Expand Down Expand Up @@ -265,6 +266,13 @@ def test_live_reconfig_get_evictions_low_residence_duration_metric_threshold(
(tenant_id, timeline_id) = env.initial_tenant, env.initial_timeline
ps_http = env.pageserver.http_client()

# When we evict/download layers, we will use this Workload to generate getpage requests
# that touch some layers, as otherwise the pageserver doesn't report totally unused layers
# as problems when they have short residence duration.
workload = Workload(env, tenant_id, timeline_id)
workload.init()
workload.write_rows(100)

def get_metric():
metrics = ps_http.get_metrics()
metric = metrics.query_one(
Expand All @@ -285,6 +293,7 @@ def get_metric():
assert default_value == "1day"

ps_http.download_all_layers(tenant_id, timeline_id)
workload.validate()
ps_http.evict_all_layers(tenant_id, timeline_id)
metric = get_metric()
assert int(metric.value) > 0, "metric is updated"
Expand All @@ -305,6 +314,7 @@ def get_metric():
assert int(metric.value) == 0

ps_http.download_all_layers(tenant_id, timeline_id)
workload.validate()
ps_http.evict_all_layers(tenant_id, timeline_id)
metric = get_metric()
assert int(metric.labels["low_threshold_secs"]) == 2 * 24 * 60 * 60
Expand All @@ -318,6 +328,7 @@ def get_metric():
assert int(metric.value) == 0, "value resets if label changes"

ps_http.download_all_layers(tenant_id, timeline_id)
workload.validate()
ps_http.evict_all_layers(tenant_id, timeline_id)
metric = get_metric()
assert int(metric.labels["low_threshold_secs"]) == 2 * 60 * 60
Expand Down

1 comment on commit f4f0869

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3201 tests run: 3068 passed, 0 failed, 133 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_s3_eviction[0.2-True]: debug

Code coverage* (full report)

  • functions: 32.7% (6982 of 21378 functions)
  • lines: 50.1% (55019 of 109928 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
f4f0869 at 2024-07-16T21:24:04.632Z :recycle:

Please sign in to comment.