Understanding Performance #2033

ccrvlh · 2022-09-04T11:32:41Z


I've been experimenting with the Operator for a while and things are mostly working fine.
One thing I haven't been able to understand is the very low throughput I've been getting.
Bear in mind this is a minimal test cluster (3 nodes, 6 CPU, 12 GB RAM) hosted on DOKS, tested both with local-path (Rancher driver) and with DO's block storage (50 GB volume). The resources are set to a 250m CPU request and a 256Mi memory request; no limits are set. All tests were performed on an otherwise empty cluster using pgbench.

On my work computer (1 TB NVMe, 32 GB RAM, i5-11660k) I get 2400 TPS (yes, I know this isn't a fair comparison; I'm just trying to get a sense of "good" and "bad"). On a minimal DO instance I get around 1400 TPS locally, and 500-600 TPS when accessing the instance through the cluster.

With this operator I've been consistently getting 80-120 TPS on pgbench, and the initial pgbench vacuum takes ~10 seconds.
With bitnami's HA postgres chart I'm getting anywhere from 700 to 1000 TPS (local-path) and ~300 TPS with block storage, and the initial pgbench vacuum takes ~3 seconds.

PgBench:

pgbench -i -h localhost -p 5432 -U postgres -s 5 postgres
pgbench -h localhost -p 5432 -U postgres -c 1 -j 1 -t 10000 postgres
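Because the run above uses a single client (`-c 1`), transactions execute one after another, so TPS is roughly the inverse of the mean time per transaction. A minimal sketch of that arithmetic, applied to the TPS figures quoted in this post (the helper name `tps_to_latency_ms` is made up for illustration and is not part of pgbench):

```shell
# Convert single-client TPS into mean time per transaction, in milliseconds.
# Only meaningful for -c 1 runs, where transactions never overlap.
tps_to_latency_ms() {
  awk -v tps="$1" 'BEGIN { printf "%.2f\n", 1000 / tps }'
}

# TPS figures taken from the numbers reported above.
for tps in 2400 1400 600 100; do
  echo "${tps} TPS ~ $(tps_to_latency_ms "$tps") ms per transaction"
done
```

Read this way, ~100 TPS means each transaction takes around 10 ms end to end. That would be consistent with per-transaction latency (storage flushes or network round trips) dominating rather than CPU, though this is an interpretation and not something measured here.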

What I tried:

Changing the number of replicas (tested with 1/2/3)
Enabling and disabling WAL
Changing resource requests/limits
Changing the volume size (tried with 8Gi, 50Gi, 500Gi)
Enabling/disabling pg-bouncer

We have a tiny 200-300 MB database during development, and queries are normally very simple (1 join, ~1k rows, nothing fancy at all). I imagine I'm doing something wrong with the configuration, but I can't find out what exactly. Any ideas are welcome.

Helm Values

configUsers:
  replication_username: standby
  super_username: postgres

configKubernetes:
  cluster_domain: cluster.local
  cluster_labels:
    application: test-db
  cluster_name_label: test-db
  pod_environment_secret: "test-wal-secret"

configPostgresPodResources:
  default_cpu_limit: "1"
  default_cpu_request: 50m
  default_memory_limit: 512Mi
  default_memory_request: 256Mi

configLogicalBackup:
  logical_backup_job_prefix: "test-backup-"
  logical_backup_provider: "s3"
  logical_backup_s3_retention_time: "2 weeks"
  logical_backup_s3_access_key_id: "XXX"
  logical_backup_s3_secret_access_key: "XXX"
  logical_backup_s3_bucket: "test-database-backup"
  logical_backup_s3_region: "us-east-1"
  logical_backup_s3_sse: "AES256"
  logical_backup_schedule: "05 * * * *"

configTeamsApi:
  enable_team_superuser: true
  enable_teams_api: false

configConnectionPooler:
  connection_pooler_schema: "pooler"
  connection_pooler_user: "pooler"
  connection_pooler_image: "registry.opensource.zalan.do/acid/pgbouncer:master-22"
  connection_pooler_max_db_connections: 50
  connection_pooler_mode: "transaction"
  connection_pooler_number_of_instances: 1
  connection_pooler_default_cpu_request: 10m
  connection_pooler_default_memory_request: 10Mi
  connection_pooler_default_cpu_limit: "1"
  connection_pooler_default_memory_limit: 100Mi

configAwsOrGcp:
  additional_secret_mount: "test-wal-secret"
  aws_region: "us-east-1"
  wal_s3_bucket: "test-database-backup"

WAL Secret

apiVersion: v1
kind: Secret
metadata:
  name: test-wal-secret
stringData:
  # WAL Details
  WAL_S3_BUCKET: "myapp-database-backup"
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""

  # Backup Config
  BACKUP_SCHEDULE: "05 * * * *"
  BACKUP_NUM_TO_RETAIN: "30"

  # AWS Credentials
  AWS_ACCESS_KEY_ID: "XXX"
  AWS_SECRET_ACCESS_KEY: "XXX"
  AWS_REGION: "us-east-1"

  # Enforces the use of Wal-G (instead of Wal-E)
  WALG_DISABLE_S3_SSE: "true"
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  CLONE_USE_WALG_RESTORE: "true"

Cluster Manifest

apiVersion: "acid.zalan.do/v1"
kind: postgresql

metadata:
  name: test-db
  namespace: default

spec:
  teamId: "test"
  numberOfInstances: 2
  enableLogicalBackup: true
  enableConnectionPooler: false
  logicalBackupSchedule: "05 * * * *"
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
  postgresql:
    version: "14"
  volume:
    storageClass: local-path
    size: 50Gi
  users:
    test:
      - superuser
      - createdb
  databases:
    test: test

ccrvlh · 2022-09-05T14:58:07Z


I tried a few variations of the configuration. At first sight, it seems Spilo/Patroni's automatic configuration is not well suited to small clusters. I don't remember the exact values, but work_mem, for example, was extremely low, and there were a few other settings I changed that seemed to relieve the extremely poor performance I was seeing. It is still far from the other deployments I tested (bitnami, kubegres, crunchy), but it does seem possible to narrow the problem down to how Postgres is configured rather than to any Operator-specific configuration (I think).

I still have to experiment with different images (from what I understand from the docs, only Spilo images will work?) and check the Spilo default parameters. Although this is not a proper solution, it seems worth noting that users with small clusters (I don't know exactly at what point a cluster becomes "small" enough for Spilo's config to stop being optimal) should beware of automatically injected configuration that may not be ideal.

This is what I've found so far through trial and error, in a not very scientific experiment; any other feedback is more than welcome.
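For anyone who wants to try the same kind of override, here is a rough sketch of setting Postgres parameters through the cluster manifest's `spec.postgresql.parameters` map. The file name `pg-params-patch.yaml` and the values are placeholders for illustration only; they are not the values I ended up with and not tuned recommendations.

```shell
# Write a hypothetical merge patch that overrides two Postgres parameters.
# The values are illustrative placeholders; the operator expects them as strings.
cat > pg-params-patch.yaml <<'EOF'
spec:
  postgresql:
    parameters:
      work_mem: "8MB"
      shared_buffers: "64MB"
EOF

# Print (rather than run) the command that would apply the patch to the
# test-db cluster from the manifest above; custom resources need --type merge.
echo "kubectl patch postgresql test-db --type merge --patch-file pg-params-patch.yaml"
```

After applying a change like this, running `SHOW work_mem;` in psql is a quick way to confirm which value is actually in effect.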
