diff --git a/docs/guides/alerting/alarms.md b/docs/guides/alerting/alarms.md
new file mode 100644
index 0000000..af9e16d
--- /dev/null
+++ b/docs/guides/alerting/alarms.md
@@ -0,0 +1,358 @@
+# Alarms
+
+Alarms evaluate whether an external condition should dispatch a notification to the configured endpoints.
+
+:::caution
+Alarms fire even without attached endpoints. An alarm with no endpoints attached does not dispatch any notifications, but it still shows as firing in the Opni UI.
+:::
+
+## State
+
+Alarms have a state that reports their runtime status.
+
+- Unknown : State can't be reported or analyzed by Opni-Alerting
+- Pending : The alarm is waiting for its dependencies to be created before it activates
+- Ok : The alarm is fine
+- Firing : The alarm has met its condition; expect to eventually receive a notification, with timing that depends on your settings
+- Silenced : The alarm is firing but has been silenced by the user
+- Invalidated : The alarm can no longer evaluate to Ok or Firing, usually due to missing external requirements, for example when a cluster is deleted or a required capability is uninstalled
+
+## Managing Alarms
+
+### Overview
+
+The Overview tab displays a timeline of when alarms have fired.
+
+
+
+
+
+### Editing / Deleting Alarms
+
+To edit or delete an alarm, right-click the condition you want to edit or delete:
+
+
+
+
+
+### Cloning
+
+:::caution attention
+Cloning alarms with specific external requirements to other cluster(s) may result in alarms in the Invalidated state if those requirements are not met by the target cluster(s).
+:::
+
+As above, you can right click the alarm you want to clone, which will open a menu to select which
+clusters you want to clone to.
+
+
+
+
+
+
+
+You are allowed to clone to the same cluster, as well as clone any number of times to any cluster.
+
+### Silencing an Alarm
+
+Operators can silence a firing alarm so that it no longer sends any notifications to endpoints. For example, consider this firing alarm:
+
+
+
+
+
+
+
+To silence it, right-click the alarm, select edit, and navigate to the silence tab:
+
+
+
+
+
+
+
+Once the alarm is silenced, operators can un-silence it at any time by clicking the 'resume now' button.
+
+
+
+
+
+
+
+Tada! The alarm is silenced.
+
+
+
+
+
+
+
+:::note
+You can also silence alarms that are not in the firing state. Doing so prevents any notifications from
+being sent to endpoints if the alarm later enters the firing state.
+:::
+
+## Alarm Types
+
+### Prometheus Query
+
+Alerts when the given Prometheus query returns a non-empty result when evaluated against the discovered Opni monitoring observability data.
+
+:::caution attention
+Requires the monitoring backend to be installed & one or more downstream agents
+to have the metrics capability.
+:::
+
+#### Options
+
+
+
+
+
+
+
+:::note
+The above query should always evaluate to true, and subsequently evaluate to firing.
+It can be used to sanity check your downstream agents with metrics installed.
+:::
+
+- Cluster : any cluster with an agent with metrics capabilities
+- Duration : period after which we should fire an alert
+- Query : any valid prometheus query
+
+#### Examples
+
+For users unsure where to begin with Prometheus queries and alerts, here are [some starting ideas for alarms](https://awesome-prometheus-alerts.grep.to/rules.html).
+Note that the `expr` field of each rule corresponds to the Query option of an Opni-Alerting Prometheus Query alarm.
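+
+As a hedged illustration of that mapping, a rule from such a collection might look like the sketch below. The group name, alert name, and `for` value are placeholders, and treating `for` as the counterpart of the alarm's Duration option is an assumption rather than documented behavior. Only the value of `expr` would be pasted into the Query field.
+
+```yaml
+groups:
+  - name: example-rules            # hypothetical group name
+    rules:
+      - alert: InstanceDown        # hypothetical alert name
+        # Copy only this expression into the alarm's Query option.
+        expr: up == 0
+        # Assumed to play the role of the alarm's Duration option.
+        for: 10m
+```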
+
+### Agent Disconnect
+
+Alerts when an agent has been disconnected for longer than the specified timeout.
+
+By default, an agent disconnect condition is created whenever an agent is bootstrapped. For example, consider this agent:
+
+
+
+
+
+
+
+For this agent, a matching agent disconnect condition is created with a 10-minute timeout.
+
+
+
+
+
+
+
+:::note
+You are free to edit or delete this default condition as you see fit.
+:::
+
+
+
+#### Options
+
+
+
+
+
+
+
+- Cluster : agent this alarm applies to
+- Timeout : how long the agent must be disconnected before the alarm fires
+
+#### Recommended Options
+
+- Timeout : 10 or more minutes
+
+### Downstream Capability
+
+Alerts when an agent capability, e.g. Logging or Metrics, is in some unhealthy state for a certain amount of time.
+
+By default when an agent is bootstrapped, a matching downstream capability alarm is created that will alert if _any_ unhealthy state is sustained over a period of 10 minutes.
+
+
+
+
+
+
+
+:::note
+You are free to edit or delete this default condition as you see fit.
+:::
+
+#### Options
+
+
+
+
+
+
+
+- Cluster : cluster this applies to
+- Duration : period after which we decide to fire an alarm
+- One or more capability states to track :
+ - `Failure` : An agent capability is experiencing errors
+ - `Pending` : A setup step or sync operation is hanging
+
+#### Recommended Options
+
+- Duration : 10 or more minutes
+
+### Monitoring Backend
+
+:::caution attention
+Requires the monitoring backend to be installed
+:::
+
+Alerts when the specified monitoring backend components are in an unhealthy state over
+some period of time.
+
+#### Options
+
+
+
+
+
+
+
+- Duration : period after which we should fire an alarm if the specified backend components
+ are unhealthy, recommended to be 10 minutes or more
+- Backend components :
+ - `store-gateway` : responsible for persistent & remote storage, critical component.
+ - `distributor` : responsible for distributing remote writes to the ingester
+ - `ingester` : responsible for (persistent) buffering of incoming data
+ - `ruler` : responsible for applying stored prometheus queries and prometheus alerts
+ - `purger` : responsible for deleting cluster data
+ - `compactor` : responsible for buffer compaction before sending to persistent storage
+ - `query-frontend` : "api gateway" for the querier
+ - `querier` : handles prometheus queries from the user
+
+#### Recommended Options
+
+- Duration : 10 minutes or more, but no more than 90 minutes
+- Backend Components :
+ 1. track `store-gateway`, `distributor`, `ingester` & `compactor` as a high severity alarm
+ 2. track all components as a lesser severity alarm
+
+### Kube State
+
+:::caution attention
+Requires the monitoring backend to be installed and have one or more agents that have both
+metrics capabilities and kube-state-metrics enabled.
+:::
+
+Alerts when the desired kubernetes object on the cluster is in the state specified by the user for a certain amount of time.
+
+#### Options
+
+
+
+
+
+
+
+:::note
+The above configuration will alert if the opni gateway is in fact running for more than 5 minutes.
+
+It can be used to sanity check that your kube-state-metrics are working as intended.
+:::
+
+## General Alarm Options
+
+### Attaching endpoint(s) to an Alarm
+
+Right-click your condition and select edit, then navigate to the message options tab in the edit UI and click 'Add Endpoint'.
+
+
+
+
+
+
+
+
+From here you can add a list of your configured endpoints to your alarm:
+
+
+
+
+
+
+
+
+You must specify message options, which control the contents of the notification and how it is dispatched to your endpoint:
+
+- Title : header for your particular endpoint
+- Body : content of the message
+- Initial Delay : time for backend to wait before sending alert
+- Repeat interval : how often to repeat the alert when it fires
+- Throttling duration : Throttle (delay) all alerts received from the same source by X minutes
+
+
+
+
+
+
+
+With the configuration above, once you hit 'Save' and the downstream agent has been disconnected for more than 10 minutes, you will receive an alert:
+
+
+
+
diff --git a/docs/guides/alerting/endpoints.md b/docs/guides/alerting/endpoints.md
new file mode 100644
index 0000000..61419aa
--- /dev/null
+++ b/docs/guides/alerting/endpoints.md
@@ -0,0 +1,107 @@
+# Endpoints
+
+## Prerequisites
+
+- Access to the admin UI
+- Opni-Alerting backend is installed
+
+## Configuration
+
+To get started, head to the 'Endpoints' tab under 'Alerting' in the left sidebar of the admin UI.
+
+
+
+
+
+To create a new endpoint, click the top-right 'Create' button to open
+the create UI
+
+### Slack
+
+Using Slack requires:
+
+- Valid incoming slack webhook
+- Valid slack channel
+
+:::note
+See the official [slack docs](https://slack.com/help/articles/115005265063-Incoming-webhooks-for-Slack) for setup instructions
+:::
+
+
+
+
+
+
+
+
+:::caution
+
+If the specified channel does not exist, or your slackbot does not have appropriate permissions to send messages to the specified channel, it will send the alert to its default channel.
+
+:::
+
+To validate your inputs, hit the 'Test Endpoint' button to make sure opni alerting can dispatch messages to your configured endpoint.
+
+If your inputs are correct, you should receive a test message:
+
+
+
+
+
+
+
+When you are done, hit the 'Save' button.
+
+### Email
+
+Using an email endpoint requires your own SMTP server, configured with the following:
+
+- To email : valid recipient for this endpoint
+- From email : valid sender for this email
+- Smart Host : `<host>:<port>` for your SMTP server setup
+- Smtp Identity : Identity to use with your SMTP server
+- Smtp username : Auth username credential for SMTP server
+- Smtp Password : Auth password credential for SMTP server
+
+
+
+
+
+
+
+
+:::note
+SMTP server configurations will be specific to your IT or production setup
+:::
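+
+Because the Alerting backend is composed of an AlertManager statefulset, these fields resemble the settings of an AlertManager email receiver. The fragment below is only a sketch of how they might line up: the mapping is an assumption, every value is a placeholder, and Opni generates the actual configuration for you.
+
+```yaml
+receivers:
+  - name: example-email                        # hypothetical receiver name
+    email_configs:
+      - to: oncall@example.com                 # To email
+        from: opni-alerts@example.com          # From email
+        smarthost: smtp.example.com:587        # Smart Host, as host:port
+        auth_identity: opni-alerts@example.com # Smtp Identity
+        auth_username: opni-alerts@example.com # Smtp username
+        auth_password: change-me               # Smtp Password
+```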
+
+To validate your inputs, hit the 'Test Endpoint' button to make sure opni alerting can dispatch messages to your configured endpoint.
+
+When you are done, hit the 'Save' button.
+
+### PagerDuty
+
+Using PagerDuty requires a PagerDuty integration key.
+
+:::note
+See the official PagerDuty docs on [integration with AlertManager](https://www.pagerduty.com/docs/guides/prometheus-integration-guide/) for generating
+integration keys
+:::
+
+
+
+
diff --git a/docs/guides/alerting/index.md b/docs/guides/alerting/index.md
index e4dce10..ff1621e 100644
--- a/docs/guides/alerting/index.md
+++ b/docs/guides/alerting/index.md
@@ -1,470 +1,38 @@
-# Opni Alerting : User Guide
+# Opni Alerting
-This guide walks through the usage of Opni-Alerting.
+Opni Alerting is a service managed by Opni that sends notifications based on Opni observability data.
-There are 3 main components to Opni-Alerting:
+There are 2 main components to Opni-Alerting:
- Endpoints : targets for alarms to dispatch to
- Alarms : Expressions that specify some condition to alert on
-- Overview : Timeline of breached conditions
-
-## Prerequisites
-
-- Access to the admin UI
-- Opni-Alerting backend is installed
## Endpoints
-In order to get started, head to the 'Endpoints' tab under 'Alerting' in the left sidebar of the admin UI
-
-
-
-
-
-To create a new endpoint, click the top-right 'Create' button to open
-the create UI
-
-### Slack
-
-Using slack requires a :
-
-- Valid incoming slack webhook
-- Valid slack channel
-
-:::note
-See the official [slack docs](https://slack.com/help/articles/115005265063-Incoming-webhooks-for-Slack) for setup instructions
-:::
-
-
-
-
-
-
-
-
-:::caution
-
-If the specified channel does not exist, or your slackbot does not have appropriate permissions to send messages to the specified channel, it will send the alert to its default channel.
-
-:::
-
-To validate your inputs, hit the 'Test Endpoint' button to make sure opni alerting can dispatch messages to your configured endpoint.
+Supported integrations:
-If your inputs are correct, you should receive a test message:
+- Slack
+- Email (with SMTP server)
+- PagerDuty
-
-
-
-
-
-
-When you are done, hit the 'Save' button.
-
-### Email
-
-Using email endpoint requires its own smtp server, which will require:
-
-- To email : valid recipient for this endpoint
-- From email : valid sender for this email
-- Smart Host : `:` for your SMTP server setup
-- Smtp Identity : Identity to use with your SMTP server
-- Smtp username : Auth username credential for SMTP server
-- Smtp Password : Auth password credential for SMTP server
-
-
-
-
-
-
-
-
-:::note
-SMTP server configurations will be specific to your IT or production setup
-:::
-
-To validate your inputs, hit the 'Test Endpoint' button to make sure opni alerting can dispatch messages to your configured endpoint.
-
-When you are done, hit the 'Save' button.
-
-### PagerDuty
-
-Using PagerDuty requires a PagerDuty integration key.
-
-:::note
-See the official PagerDuty docs on [integration with AlertManager](https://www.pagerduty.com/docs/guides/prometheus-integration-guide/) for generating
-integration keys
-:::
-
-
-
-
+See [the endpoints configuration section](#endpoints) to get started with configuring your endpoints.
## Alarms
-Alarms are used to evaluate whether or not some external condition should dispatch a notification to the configured endpoints
-
-:::caution
-Alarms will fire without attached endpoints, but if you do not attach any endpoints to your alarm it will not dispatch to any endpoints (it will still show as firing in the opni UI).
-:::
-
-### State
-
-- Unkown : State can't be reported or analyzed by Opni-Alerting
-- Ok : The alarm is fine
-- Firing : The alarm has met its condition, expect to eventually receive a notification, depending on your settings
-- Silenced : The alarm is firing but has been silence by the User.
-- Invalidated : The alarm can no longer evaluate to Ok or Firing, usually due to uninstalling external requirements.
-
-### Overview
-
-Overview tab will display a timeline of when alarms have fired.
-
-
-
-
-
-### Editing / Deleting Alarms
-
-In order to edit or delete alarms right click the condition you want to edit or delete :
-
-
-
-
-
-### Cloning
-
-:::caution attention
-Cloning alarms with specific external requirements to other cluster(s) may result in invalidated state alerts if those requirements are not met by the target cluster(s)
-:::
-
-As above, you can right click the alarm you want to clone, which will open a menu to select which
-clusters you want to clone to.
-
-
-
-
-
-
-
-
-You are allowed to clone to the same cluster, as well as clone any number of times to any cluster.
-
-## Alarm Types
-
-### Agent Disconnect
-
-Alerts when an agent disconnects within the specified timeout.
-
-By default, whenever an agent is bootstrapped, for example consider this agent :
-
-
-
-
-
-
-
-A matching agent disconnect condition is created with a 10 minute timeout.
-
-
-
-
-
-
-
-:::note
-You are free to edit or delete this default condition as you see fit.
-:::
-
-
-
-#### Options
-
-
-
-
-
-
-
-
-- Cluster : agent this alarm applies to
-- Timeout : how long this agent has been disconnect before firing an alarm
-
-#### Recommended Options
-
-- Timeout : 10 or more minutes
-
-### Downstream Capability
-
-Alerts when an agent capability, e.g. Logging or Metrics, is in some unhealthy state for a certain amount of time.
-
-By default when an agent is bootstrapped, a matching downstream capability alarm is created that will alert if _any_ unhealthy state is sustained over a period of 10 minutes.
-
-
-
-
-
-
-
-
-:::note
-You are free to edit or delete this default condition as you see fit.
-:::
-
-#### Options
-
-
-
-
-
-
-
-
-- Cluster : cluster this applies to
-- Duration : period after which we decide to fire an alaram
-- One ore more capability states to track :
- - `Failure` : An agent capability is experiencing errors
- - `Pending` : A setup step or sync operation is hanging
-
-#### Recommended Options
-
-- Duration : 10 or more minutes
-
-### Monitoring Backend
-
-:::caution attention
-Requires the monitoring backend to be installed
-:::
-
-Alerts when the specified monitoring backend components are in an unhealthy state over
-some period of time
-
-#### Options
-
-
-
-
-
-
-
-
-- Duration : period after which we should fire an alarm if the specified backend components
- are unhealthy, recommended to be 10 minutes or more
-- Backend components :
- - `store-gateway` : responsible for persistent & remote storage, critical component.
- - `distributor` : responsible for distributing remote writes to the ingester
- - `ingester` : responsible for (persistent) buffering of incoming data
- - `ruler` : responsible for applying stored prometheus queries and prometheus alerts
- - `purger` : responsible for deleting cluster data
- - `compactor` : responsible for buffer compaction before sending to persistent storage
- - `query-frontend` : "api gateway" for the querier
- - `querier` : handles prometheus queries from the user
-
-#### Recommended options
-
-- Duration : 10 minutes or more, but no more than 90 mins
-- Backend Components :
- 1. track `store-gateway`, `distributor`, `ingester` & `compactor` as a high severity alarm
- 2. track all components as a lesser severity alarm
-
-### Prometheus Query
-
-Alerts when the given prometheus query evaluates to True
-
-:::caution attention
-Requires the monitoring backend to be installed & one or more downstream agents
-to have the metrics capability.
-:::
-
-#### Options
-
-
-
-
-
-
-
-
-:::note
-The above query should always evaluate to true, and subsequently evaluate to firing.
-It can be used to sanity check your downstream agents with metrics installed.
-:::
-
-- Cluster : any cluster with an agent with metrics capabilities
-- Duration : period after which we should fire an alert
-- Query : any valid prometheus query
-
-### Kube State
-
-:::caution attention
-Requires the monitoring backend to be installed and have one or more agents that have both
-metrics capabilities and kube-state-metrics enabled.
-:::
-
-Alerts when the desired kubernetes object on the cluster is in the state specified by the user for a certain amount of time.
-
-#### Options
-
-
-
-
-
-
-
-:::note
-The above configuration will alert if the opni gateway is in fact running for more than 5 minutes.
-
-It can be used to sanity check that your kube-state-metrics are working as intended.
-:::
-
-## General Alarm Options
-
-### Attaching endpoint(s) to an Alarm
-
-Right click edit your condition, and navigate to the message options tab in the edit UI & click 'Add Endpoint'
-
-
-
-
-
-
-
-
-From here you can add a list of your configured endpoints to your alarm:
-
-
-
-
-
-
-
-
-You must specify Message options for the contents & dispatching configuration to your endpoint :
-
-- Title : header for your particular endpoint
-- Body : content of the message
-- Initial Delay : time for backend to wait before sending alert
-- Repeat interval : how often to repeat the alert when it fires
-- Throttling duration : Throttle (delay) all alerts received from the same source by X minutes
-
-
-
-
-
-
-
-Based on the implementation details above, once we hit 'Save' and our downstream agent has disconnected for > 10mins, you will receive an alert:
-
-
-
-
-
-### Silencing an Alarm
-
-If operators with to silence a firing alarm, which will cause the alarm to no longer send any notifications to endpoints, then consider :
-
-
-
-
-
-
+Supported integrations:
-They can do so by right clicking edit and navigating to the silence tab:
+- Opni agent
+- Opni monitoring
+- Opni monitoring backend
-
-
-
-
-
+See [the alarms configuration section](#alarms) for getting started with configuring your alarms.
-Once the alarm is silenced, operators can always un-silence it by clicking the resume now.
+## SLOs
-
-
-
-
-
+A more sophisticated alarm configuration targeted at meeting budgeting and SLA goals.
-Tada! the alarm is silenced.
+Supported integrations:
-
-
-
-
-
+- Opni monitoring
-:::note
-You can silence alarms that are not in the firing state, and they will prevent any notifications from
-being sent to endpoints if that alarm does enter the firing state
-:::
+See [the SLO page](#slos) for getting started with creating SLOs.
diff --git a/docs/guides/monitoring/index.md b/docs/guides/monitoring/index.md
new file mode 100644
index 0000000..6c0eaf1
--- /dev/null
+++ b/docs/guides/monitoring/index.md
@@ -0,0 +1,54 @@
+# Opni Monitoring : User Guide
+
+This guide walks through the usage of Opni-Monitoring.
+
+## Prerequisites
+
+- Opni-monitoring is installed in the upstream.
+- One or more downstream agents have the metrics capability installed.
+
+## Components
+
+Opni-Monitoring has several components.
+
+### Upstream
+
+The upstream includes the following components :
+
+- Cortex : A multi-tenant solution using Prometheus/Alertmanager as backends
+ - Cortex AlertManager :
+ - Cortex Compactor :
+ - Cortex Distributor :
+ - Cortex Ingester :
+ - Cortex Purger :
+ - Cortex Querier :
+ - Cortex Query Frontend :
+ - Cortex Ruler :
+ - Cortex Store Gateway :
+
+:::note
+In standalone mode, the cortex-all-in-one pod will contain each of these components.
+:::
+
+### Configuring Upstream Components
+
+TODO
+
+### Downstream
+
+Agents with metrics capabilities include :
+
+- Prometheus operator CRDs, if they don't already exist
+- Prometheus operator deployments, if they don't already exist
+
+If your downstream cluster also sets the chart value shown below, the components listed after it are installed as well:
+
+```yaml
+kube-prometheus-stack:
+ enabled: true
+```
+
+- Prometheus operator Node exporter, if it doesn't already exist
+- Prometheus operator kube-state-metrics, if it doesn't already exist
+
+### Configuring Downstream Components
diff --git a/docs/installation/opni/alerting.md b/docs/installation/opni/alerting.md
new file mode 100644
index 0000000..d22892d
--- /dev/null
+++ b/docs/installation/opni/alerting.md
@@ -0,0 +1,39 @@
+# Install Alerting
+
+The Alerting backend is composed of an [AlertManager](https://prometheus.io/docs/alerting/latest/alertmanager/) statefulset, fully managed by Opni.
+
+## Using the Opni Dashboard
+
+Follow these steps to enable Alerting from the Opni dashboard:
+
+1. Navigate to the Opni dashboard
+
+To access the dashboard, you can port-forward:
+
+```bash
+kubectl -n opni port-forward svc/opni-admin-dashboard web:web
+```
+
+Then navigate to [http://localhost:12080](http://localhost:12080).
+
+2. Select "Alerting" from the left sidebar then click enable to install
+
+
+
+
+
+:::caution known-issue
+Alerting backend can sometimes erroneously show a "no changes to apply" error when installing, however this does not impact functionality
+:::
+
+3. Choose between deploying the opni-cluster as standalone or HA:
+
+
+
+
diff --git a/docs/installation/opni/backends.md b/docs/installation/opni/backends.md
index 4f442f2..f9a910f 100644
--- a/docs/installation/opni/backends.md
+++ b/docs/installation/opni/backends.md
@@ -702,51 +702,5 @@ kubectl get secret -n opni opni-admin-password -o jsonpath='{.data.password}' |
```
It is recommended that you change the password on this user.
-
-
-
-
-The Alerting backend is composed of an [AlertManager](https://prometheus.io/docs/alerting/latest/alertmanager/) statefulset, fully managed by Opni.
-You can enable and configure Monitoring from the Opni dashboard, or from the CLI.
-
-### Using the Opni Dashboard
-
-Follow these steps to enable Monitoring from the Opni dashboard:
-
-1. Navigate to the Opni dashboard
-
-To access the dashboard, you can port-forward:
-
-```bash
-kubectl -n opni port-forward svc/opni-admin-dashboard web:web
-```
-
-Then navigate to [http://localhost:12080](http://localhost:12080).
-
-2. Select "Alerting" from the left sidebar then click enable to install
-
-
-
-
-
-:::caution known-issue
-Alerting backend can sometimes erroneously show a "no changes to apply" error when installing, however this does not impact functionality
-:::
-
-3. Choose between deploying the opni-cluster as standalone or HA:
-
-
-
-
-
-
-
-
diff --git a/sidebars.js b/sidebars.js
index 7bbf3ae..fdaac0a 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -95,14 +95,31 @@ const sidebars = {
},
{
type: 'category',
- label: 'Alerting User Guide',
+ label: 'Opni Alerting',
link: {
type: 'doc',
id: 'guides/alerting/index'
},
items: [
{
-
+ type: 'doc',
+ id: 'installation/opni/alerting',
+ label: 'Install Alerting'
+ },
+ {
+ type: 'doc',
+ id: 'guides/alerting/endpoints',
+ label: 'Configuring endpoints'
+ },
+ {
+ type: 'doc',
+ id: 'guides/alerting/alarms',
+ label: 'Configuring alarms'
+ },
+ {
+ type: 'doc',
+ id: 'installation/opni/slo',
+ label: 'SLOs'
}
]
}