
Attempt to introduce a monolithic deployment with Helm #4858

Draft
wants to merge 1 commit into
base: main

Conversation

rubenvw-ngdata

Ideally the same chart can be used for the monolithic, read-write and microservices deployment modes. This allows an easier migration between modes when the load on a system changes.

To do this, a new parameter 'deploymentMode' has been added to the helm chart. All chart templates are adapted to deploy services only when applicable according to the chosen deployment mode.

A new template is added that installs the monolithic mimir container. All nginx endpoints are updated to direct to the monolithic service if this deployment mode has been chosen.
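For illustration, a minimal sketch of the toggle and a template guard (only the deploymentMode value itself comes from this PR; the surrounding snippet is illustrative):

# values.yaml (sketch)
deploymentMode: monolithic   # "microservices" keeps today's behaviour

# in the monolithic templates (sketch)
{{- if eq .Values.deploymentMode "monolithic" }}
# monolithic-only resources (StatefulSet, services, ...) go here
{{- end }}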

I attached a monolithic.yaml file that I have used to do a monolithic setup.

I didn't fully get it working yet, but I'm sharing it as it could be a starting point for a future implementation.

I'm currently stuck on issues with my monolithic container, which reports errors that I don't really understand:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.98.18.233:9095: connect: connection refused"

If anyone has a clue and is willing to help me out, let me know.

What this PR does

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@rubenvw-ngdata rubenvw-ngdata requested a review from a team as a code owner April 27, 2023 13:53
@CLAassistant

CLAassistant commented Apr 27, 2023

CLA assistant check
All committers have signed the CLA.

@rubenvw-ngdata
Author

I know this PR is not ready; I created it in the scope of #4832, but I got stuck getting things working.

@pracucci pracucci marked this pull request as draft April 28, 2023 12:14
@dimitarvdimitrov
Contributor

Nothing strikes me as the cause of the "connection refused" error. Can you provide more surrounding log lines? I am trying to understand what it is trying to connect to. Also, can you confirm what the IP 10.98.18.233 belongs to? Is this the monolithic pod or a service IP?

@rubenvw-ngdata
Author

@dimitarvdimitrov It's the IP of the mimir-monolithic service. See the screenshot from Lens (the IP changed because it is a new installation).
[screenshot: the mimir-monolithic service in Lens]

@kieranbrown

I've had a bit of a play with this as our company is interested in deploying Mimir in Monolithic mode. I think at the moment there's a bit of a catch-22 situation where Mimir's ready check depends on the service but the service won't publish the IPs until Mimir is ready.

The /ready endpoint is spitting out not ready: number of queriers connected to query-frontend is 0

In microservices mode it looks like this is worked around by setting publishNotReadyAddresses on the query-frontend service, so that Mimir can communicate with the query-frontend during the readiness process.

I think if you were to set publishNotReadyAddresses in the monolithic headless service https://github.com/rubenvw-ngdata/mimir/blob/main/operations/helm/charts/mimir-distributed/templates/monolithic/monolithic-svc-headless.yaml and configure Mimir to use that service when communicating with the frontend worker, then that should resolve the connection issues.


Proposed changes:

monolithic-svc-headless.yaml
{{- if eq .Values.deploymentMode "monolithic" }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "mimir.resourceName" (dict "ctx" . "component" "monolithic") }}-headless
  ...
spec:
  ...
  publishNotReadyAddresses: true
  ...
{{- end }}
values.yaml
mimir:
  frontend_worker:
    frontend_address: {{ include "mimir.resourceName" (dict "ctx" . "component" "monolithic") }}-headless.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}

Apologies if I'm completely off the mark; I only started looking into Mimir as a whole last week. Thanks for the fantastic work on this PR so far, deploying this in monolithic mode definitely simplifies things 👍

@rubenvw-ngdata
Author

@kieranbrown Did you test it with the proposed configuration? I applied your proposed changes and it's still failing with the exact same error message. Any other suggestions, @dimitarvdimitrov?

@kieranbrown

@rubenvw-ngdata apologies, there was a typo in my recommended changes. The frontend_address change should be made here https://github.com/rubenvw-ngdata/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L259, and it should be changed to:

frontend_address: {{ include "mimir.resourceName" (dict "ctx" . "component" "monolithic") }}-headless.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}

But to answer your question: yes, I've been running your recommended configuration file. I'll attach my full values file below for reference.

I've been running this configuration now for about a week with no issues.

My full values file
deploymentMode: monolithic
fullnameOverride: mimir

image:
  tag: 2.9.0-rc.1

ingester:
  zoneAwareReplication:
    enabled: false

mimir:
  structuredConfig:
    alertmanager_storage:
      s3:
        bucket_name: <redacted>-alertmanager
    blocks_storage:
      s3:
        bucket_name: <redacted>-blocks
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.eu-west-2.amazonaws.com
          region: eu-west-2
    frontend_worker: # this is temporarily set until it's fixed in the upstream chart
      frontend_address: mimir-monolithic-headless.mimir:9095
    ingester:
      ring:
        replication_factor: 3
    limits:
      compactor_blocks_retention_period: 2y
      ingestion_burst_size: 100000
      ingestion_rate: 50000
      max_global_series_per_user: 1000000 # increased as all requests are under an 'anonymous' user
      max_label_names_per_series: 50 # increased as we breached the default of '30'
      out_of_order_time_window: 1h
    ruler_storage:
      s3:
        bucket_name: <redacted>-ruler

metaMonitoring:
  dashboards:
    enabled: true
    annotations:
      k8s-sidecar-target-directory: Platforming - Mimir
  prometheusRule:
    mimirAlerts: true
    mimirRules: true
  serviceMonitor:
    enabled: true

minio:
  enabled: false

monolithic:
  persistentVolume:
    size: 50Gi
  replicas: 3
  resources:
    limits:
      cpu: '2'
      memory: 8Gi
    requests:
      cpu: '1'
      memory: 8Gi
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
  service:
    annotations:
      service.kubernetes.io/topology-aware-hints: Auto
  zoneAwareReplication:
    enabled: false

nginx:
  enabled: false

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<redacted>:role/ops-tooling/mimir

store_gateway:
  zoneAwareReplication:
    enabled: false

@rubenvw-ngdata
Author

@kieranbrown Thanks for the input. That helped. I have it working now, and will continue testing.
We will probably clean up this POC and move it so it can be used by others.
Not sure if I will keep the name of the chart as mimir-distributed. I'm a bit worried that it will screw up the rebasing process.
Will keep you all posted.

Still hoping for official support from the mimir team in the future.

@rubenvw-ngdata
Author

@kieranbrown Do you use Mimir with Prometheus? How have you configured remoteWrite in Prometheus? I'm still trying to get that working on my side.

@kieranbrown

@kieranbrown Do you use Mimir with Prometheus? How have you configured remoteWrite in Prometheus? I'm still trying to get that working on my side.

We don't use Prometheus; we use a central OpenTelemetry Collector to forward metrics to Mimir, although the config should be roughly the same.

Your push endpoint should be something like http://mimir-monolithic.<namespace>.svc.cluster.local:8080/api/v1/push and you may need to send an X-Scope-OrgID header with the value anonymous. I'm not too clued up on the multi-tenancy side of Mimir, but I think this is the correct thing to do if you plan on having only one tenant.

This is just a complete guess based on a quick Google and I can't validate it because I don't have a prometheus instance, but perhaps you want something like this.

remote_write:
  - url: http://mimir-monolithic.<namespace>.svc.cluster.local:8080/api/v1/push
    headers:
      X-Scope-OrgID: anonymous
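For reference, a rough equivalent on the OpenTelemetry Collector side (a sketch using the prometheusremotewrite exporter; the endpoint and namespace placeholders are the same as above, and the otlp receiver is only there to make the pipeline complete):

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheusremotewrite:
    endpoint: http://mimir-monolithic.<namespace>.svc.cluster.local:8080/api/v1/push
    headers:
      X-Scope-OrgID: anonymous

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]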

@kieranbrown

kieranbrown commented Jun 22, 2023

@rubenvw-ngdata I think the target arg (https://github.com/grafana/mimir/pull/4858/files#diff-3cc68a2fa2c916c22abe0eec319b27f66e6590402a1a62dc593bde6684e44911R112) should be configurable. I've just gone to enable Mimir's alertmanager and realised there is actually no way to do this.

There are two approaches we could take. We could build the target arg based on the Helm values that already exist today, i.e. if alertmanager.enabled: true, then add alertmanager to the list of enabled modules.

Or we could just add monolithic.targets to the Helm values, probably a list of strings, that defaults to [all].
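A rough sketch of the second option (the monolithic.targets value and the join are assumptions, not something the PR implements yet):

# values.yaml (sketch)
monolithic:
  targets:
    - all

# monolithic statefulset template (sketch)
args:
  - -target={{ join "," .Values.monolithic.targets }}
  - -config.file=/etc/mimir/mimir.yaml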

Keen to hear your thoughts.

@dimitarvdimitrov
Contributor

The flexibility of monolithic.targets will also allow including the overrides-exporter if the operator wants to. It's a bit harder to configure, but we can solve this by having concrete examples in a comment above the value for configuring the alertmanager and/or overrides-exporter.

@rubenvw-ngdata
Author

rubenvw-ngdata commented Jun 22, 2023

@kieranbrown @dimitarvdimitrov I don't think it's as simple as you propose.
At least for the alertmanager it requires the configuration ConfigMap to be deployed and mounted on the monolithic StatefulSet. I'm not sure it is worth adding this level of complexity to the helm charts.
The alternative of allowing the optional services to run next to the monolithic pod is a lot easier to support and will require less duplication imo. The Mimir documentation also does not document this functionality.
So I would go for that alternative solution: allow the optional components to run next to the monolithic pod.
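As a rough sketch, that alternative could look like this in values (key names assumed from the existing chart; whether these enabled flags are honoured alongside monolithic mode is exactly what would need to be implemented):

deploymentMode: monolithic
alertmanager:
  enabled: true            # runs in its own pod next to the monolithic StatefulSet
overrides_exporter:
  enabled: true            # same idea for the optional overrides-exporter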

I have only done a very quick test and this is my first opinion.
Feel free to make the changes and do a PR if you want to convince me.

@rubenvw-ngdata
Author

Even though all data seems to get into Mimir correctly and I can query the data without issues, the Mimir logs are still full of:
caller=frontend_processor.go:87 level=error msg="error processing requests" address=10.244.3.59:9095 err="rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR"

@dimitarvdimitrov
Contributor

Even though all data seems to get into Mimir correctly and I can query the data without issues, the Mimir logs are still full of: caller=frontend_processor.go:87 level=error msg="error processing requests" address=10.244.3.59:9095 err="rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR"

I believe this is related to periodically renewing gRPC connections. Currently they are renewed every 2 minutes (values.yaml), so this log line shouldn't be too frequent. This helps to refresh connection pools with new pods. It should be harmless because the connection will be reestablished after the failure.

@rubenvw-ngdata
Author

rubenvw-ngdata commented Jun 26, 2023

@dimitarvdimitrov I'm not sure about your statement.
At least it is more frequent than that, and it also says that there are no queriers connected to the frontend.
It's strange that it reports this, because querying from Grafana is working.

[screenshot of the Mimir logs]

An example of a successful query action in the mimir logs:

ts=2023-06-26T14:21:42.831153671Z caller=handler.go:314 level=info user=anonymous msg="query stats" component=query-frontend method=GET path=/prometheus/api/v1/query_range user_agent=Grafana/9.5.1 response_time=1.453157ms query_wall_time_seconds=0.00057877 fetched_series_count=1 fetched_chunk_bytes=179 fetched_chunks_count=3 fetched_index_bytes=0 sharded_queries=0 split_queries=1 estimated_series_count=0 param_end=1687789260 param_query="lily_configuration_size_in_bytes_value{namespace=~\"bct-local-tenant-ngdata\"}" param_start=1687778460 param_step=60 status=success

I also find it hard to accept that I should ignore error-level logs...

@dimitarvdimitrov
Contributor

sharing from a private message:

see this PR #3262, it sets the connection age on the scheduler to “infinity” in order to avoid resetting these connections in the scheduler code... if you set these two parameters in the monolithic deployment the reset connections should stop
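If those two parameters map onto the Mimir server block, the override for the monolithic deployment could look roughly like this (a sketch; these YAML keys are the counterparts of the -server.grpc.keepalive.max-connection-age and -server.grpc.keepalive.max-connection-age-grace flags, and the exact values used in #3262 may differ):

mimir:
  structuredConfig:
    server:
      # stop recycling long-lived gRPC connections between the query-frontend
      # and its workers inside the single monolithic process
      grpc_server_max_connection_age: 2562047h        # effectively "infinity"
      grpc_server_max_connection_age_grace: 2562047h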

@rubenvw-ngdata
Author

@kieranbrown Just a note to make you aware that we have moved the repository to be an organisation repository. The fork should be available at https://github.com/NGDATA/mimir/
@dimitarvdimitrov Keep me posted when you consider productizing this functionality.

@raffraffraff

raffraffraff commented Aug 2, 2023

Nice work @rubenvw-ngdata. Your fork doesn't have an 'issues' section, so I thought I'd mention this here: the chart doesn't support turning off zoneAwareReplication. If I disable it under monolithic, I still get errors like:

│ Error: execution error at (mimir-distributed/templates/validate.yaml:153:4): You have set ingester.zoneAwareReplication.enabled=true. The mimir.config must include ingester.ring.zone_awareness_enabled=true. Please merge the latest mimir.config from the chart with your mimir.config.

and:

│ Error: execution error at (mimir-distributed/templates/monolithic/monolithic-svc.yaml:3:17): When zone-awareness is enabled, you must have at least 3 zones defined.

It looks like there are other sections in the values.yaml that hard-code it to 'enabled', e.g. ingester.zoneAwareReplication. Perhaps I'm misunderstanding the purpose of .Values.monolithic.zoneAwareReplication.enabled, but I figured it might be a master switch to toggle zoneAwareReplication for the whole monolithic deployment? If so, the other instances of zoneAwareReplication should probably get wrapped in an if statement, something like this:

  zoneAwareReplication:
    {{- if or 
      (and (eq .Values.deploymentMode "microservices") .Values.ingester.zoneAwareReplication.enabled) 
      (and (eq .Values.deploymentMode "monolithic") .Values.monolithic.zoneAwareReplication.enabled) 
    }}
    # -- Enable zone-aware replication for ingester
    enabled: true
    {{- else }}
    enabled: false
    {{- end }}

The _helpers.tpl has this:

{{- if $componentSection.zoneAwareReplication.enabled -}}
{{- $numberOfZones := len $componentSection.zoneAwareReplication.zones -}}
{{- if lt $numberOfZones 3 -}}
{{- fail "When zone-awareness is enabled, you must have at least 3 zones defined." -}}
{{- end -}}

I checked each component for zoneAwareReplication, and narrowed it down to 'alertmanager', 'store_gateway', 'ingester' and 'monolithic'. If I explicitly set zoneAwareReplication.enabled=false for all of those, I no longer get the error.
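In other words, until a master switch exists, the workaround in values.yaml is roughly:

alertmanager:
  zoneAwareReplication:
    enabled: false
ingester:
  zoneAwareReplication:
    enabled: false
store_gateway:
  zoneAwareReplication:
    enabled: false
monolithic:
  zoneAwareReplication:
    enabled: false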

@rubenvw-ngdata
Author

@raffraffraff Thank you and thanks for reporting the issue. I agree that mimir should support this functionality. If more people start using this, they might consider it. So do not hesitate to spread the word.

For the first issue, you are right. I have fixed some validations that I had forgotten. Since we are still working on the initial setup I have squashed the changes in the same commit. Can you validate if it is working correctly for you now? If you still encounter the second problem (I did not face it), can you share your configuration so I can help?

I have opened up the issues section on the fork to allow logging issues.

@raffraffraff

raffraffraff commented Aug 3, 2023

Thanks @rubenvw-ngdata - I'll test it with the latest version of your code and create an issue in your project if it isn't working for me.

EDIT: I pulled the latest code, but don't see any significant differences that would resolve this. In the meantime I have created an issue on your project to track what I was experiencing. I'll update it as soon as I can test it (hopefully tomorrow)

@sjoukedv

I wrote a completely new helm chart with helm create mimir-monolithic (default templates are actually quite okay) and changed some of the default values:

# Default values for mimir-monolithic.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

replicaCount: 1

image:
  repository: grafana/mimir
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "r252-a0bed52"

service:
  type: ClusterIP
  port: 9009

# config taken from a docker-compose example somewhere
# you want to override this with your own anyway
mimir:
  config: |
    multitenancy_enabled: false

    blocks_storage:
      backend: filesystem
      bucket_store:
        sync_dir: /tmp/mimir/tsdb-sync
      filesystem:
        dir: /tmp/mimir/data/tsdb
      tsdb:
        dir: /tmp/mimir/tsdb

    compactor:
      data_dir: /tmp/mimir/compactor
      sharding_ring:
        kvstore:
          store: memberlist

    distributor:
      ring:
        instance_addr: 127.0.0.1
        kvstore:
          store: memberlist

    ingester:
      ring:
        instance_addr: 127.0.0.1
        kvstore:
          store: memberlist
        replication_factor: 1

    ruler_storage:
      backend: filesystem
      filesystem:
        dir: /tmp/mimir/rules

    server:
      http_listen_port: 9009
      log_level: error

    store_gateway:
      sharding_ring:
        replication_factor: 1

Adding args to the container to tell it to run all services:

          args:
            - -target=all
            - -config.expand-env=true
            - -config.file=/etc/mimir/mimir.yaml

(Mount /etc/mimir/mimir.yaml from a ConfigMap.) Then add the datasource to Grafana:

              - name: Mimir
                type: prometheus
                uid: mimir
                url: http://mimir-monolithic.monitoring:9009/prometheus
                access: proxy
                isDefault: false
                jsonData:
                  httpMethod: POST
                  timeInterval: 60s
                  manageAlerts: true
                  prometheusType: Mimir
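For completeness, a rough sketch of the ConfigMap and mount that the args above rely on (resource and helper names are illustrative, loosely following the helm create defaults):

# templates/configmap.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "mimir-monolithic.fullname" . }}
data:
  mimir.yaml: |
    {{- .Values.mimir.config | nindent 4 }}

# in the Deployment's pod spec (sketch)
      containers:
        - name: mimir
          volumeMounts:
            - name: config
              mountPath: /etc/mimir
      volumes:
        - name: config
          configMap:
            name: {{ include "mimir-monolithic.fullname" . }}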

This is not production ready by any means! I can publish the chart if someone is interested in it.

deployment. The same chart is used for monolithic as it was for the microservices deployment mode.

To do this, a new parameter 'deploymentMode' has been added to the helm chart. All chart templates are adapted to deploy services when applicable according to the deployment mode.

A new template is added that installs the monolithic mimir container. All nginx endpoints are updated to direct to the monolithic service if this deployment mode has been chosen.

The name of the chart has not been updated from 'mimir-distributed' to 'mimir' yet. This allows for easier maintenance as long as this is not merged. Ideally the rename happens when merging, to avoid further confusion.

I attached a monolithic.yaml file that I have used to do a monolithic setup.

In monolithic mode, publish not-ready addresses and use these for the frontend.

The read-write mode could be added in the future.

The alertmanager is not included in the monolithic deployment. If you want to run it, you can run it in a separate pod by enabling it.

The zone-aware replication functionality is supported in the monolithic deployment with a similar configuration as required for the ingester or store gateway.

I have also added the functionality to enable compression for the monolithic chart. Ideally this would be embedded into the complete chart instead. A common configuration is probably the easiest: if you want gRPC message compression, you probably want it everywhere. The common configuration setting can then also be used in the monolithic approach.
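For reference, a common-style compression setting could look roughly like this (a sketch; grpc_compression is the standard Mimir gRPC client option and accepts gzip or snappy, but which client blocks to set it on is an assumption here, not what the commit actually does):

mimir:
  structuredConfig:
    ingester_client:
      grpc_client_config:
        grpc_compression: snappy
    frontend_worker:
      grpc_client_config:
        grpc_compression: snappy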