Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

harbor-db Always in recovery mode #1804

Open
Tebby0753 opened this issue Aug 5, 2024 · 5 comments
Open

harbor-db Always in recovery mode #1804

Tebby0753 opened this issue Aug 5, 2024 · 5 comments

Comments

@Tebby0753
Copy link

Hello:
I used helm to deploy harbor, and it has been running for about more than a year, but recently harbor-db often prints: the database system is in recovery mode. I see that the monitoring resources are normal. The following is the printed log:

2024-08-05 10:07:43.816 UTC [2480] FATAL: the database system is in recovery mode
2024-08-05 10:07:53.812 UTC [2490] FATAL: the database system is in recovery mode
2024-08-05 10:08:02.040 UTC [1635] LOG: database system was not properly shut down; automatic recovery in progress
2024-08-05 10:08:03.465 UTC [1635] LOG: invalid record length at 1/41B81728: wanted 24, got 0
2024-08-05 10:08:03.465 UTC [1635] LOG: redo is not required
2024-08-05 10:08:03.825 UTC [2501] FATAL: the database system is in recovery mode
2024-08-05 10:08:07.401 UTC [1] LOG: database system is ready to accept connections
2024-08-05 10:08:17.371 UTC [1] LOG: server process (PID 2514) exited with exit code 141
2024-08-05 10:08:17.371 UTC [1] LOG: terminating any other active server processes
2024-08-05 10:08:17.371 UTC [2517] WARNING: terminating connection because of crash of another server process
2024-08-05 10:08:17.371 UTC [2517] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2024-08-05 10:08:17.371 UTC [2517] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2024-08-05 10:08:17.372 UTC [2505] WARNING: terminating connection because of crash of another server process
2024-08-05 10:08:17.372 UTC [2505] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2024-08-05 10:08:17.372 UTC [2505] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2024-08-05 10:08:18.251 UTC [1] LOG: all server processes terminated; reinitializing
2024-08-05 10:08:18.548 UTC [2518] LOG: database system was interrupted; last known up at 2024-08-05 10:08:07 UTC
2024-08-05 10:08:23.812 UTC [2528] FATAL: the database system is in recovery mode
2024-08-05 10:08:33.811 UTC [2538] FATAL: the database system is in recovery mode
2024-08-05 10:08:43.813 UTC [2548] FATAL: the database system is in recovery mode
2024-08-05 10:08:44.191 UTC [2558] FATAL: the database system is in recovery mode
2024-08-05 10:08:53.810 UTC [2568] FATAL: the database system is in recovery mode
2024-08-05 10:09:03.813 UTC [2578] FATAL: the database system is in recovery mode
2024-08-05 10:09:13.816 UTC [2587] FATAL: the database system is in recovery mode
2024-08-05 10:09:23.817 UTC [2597] FATAL: the database system is in recovery mode
2024-08-05 10:09:33.813 UTC [2607] FATAL: the database system is in recovery mode
2024-08-05 10:09:43.810 UTC [2617] FATAL: the database system is in recovery mode
2024-08-05 10:09:50.197 UTC [2627] FATAL: the database system is in recovery mode
2024-08-05 10:09:53.812 UTC [2637] FATAL: the database system is in recovery mode
2024-08-05 10:10:03.816 UTC [2647] FATAL: the database system is in recovery mode
2024-08-05 10:10:13.809 UTC [2656] FATAL: the database system is in recovery mode
2024-08-05 10:10:23.816 UTC [2666] FATAL: the database system is in recovery mode

The log will keep looping...
i want to know what happened,Thanks!

@Tebby0753
Copy link
Author

harbor-db resource monitoring:
WX20240805-182237

@zyyw
Copy link
Collaborator

zyyw commented Aug 7, 2024

According to this resources:

This error of:

2024-08-05 10:07:53.812 UTC [2490] FATAL: the database system is in recovery mode
2024-08-05 10:08:02.040 UTC [1635] LOG: database system was not properly shut down; automatic recovery in progress

2024-08-05 10:08:17.371 UTC [2517] WARNING: terminating connection because of crash of another server process
2024-08-05 10:08:17.371 UTC [2517] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2024-08-05 10:08:17.371 UTC [2517] HINT: In a moment you should be able to reconnect to the database and repeat your command.

may have something to do with the disk space utilization and memory allocation in your environment.

@abinet
Copy link

abinet commented Sep 10, 2024

We are experiencing same problem now. Harbor version 2.10, internal PostgreSQL database. Following pattern is observable multiple times per day:

2024-09-09 16:59:39.372 UTC [1] LOG:  server process (PID 1129079) exited with exit code 141ESC[0K
2024-09-09 16:59:39.372 UTC [1] LOG:  terminating any other active server processesESC[0K
2024-09-09 16:59:39.410 UTC [1] LOG:  all server processes terminated; reinitializingESC[0K
2024-09-09 16:59:39.473 UTC [1129087] LOG:  database system was interrupted; last known up at 2024-09-09 16:02:59 UTCESC[0K
2024-09-09 16:59:41.136 UTC [1129087] LOG:  database system was not properly shut down; automatic recovery in progressESC[0K
2024-09-09 16:59:41.159 UTC [1129087] LOG:  redo starts at 1/36EC9F70ESC[0K
2024-09-09 16:59:41.160 UTC [1129087] LOG:  invalid record length at 1/36ECA070: wanted 24, got 0ESC[0K
2024-09-09 16:59:41.160 UTC [1129087] LOG:  redo done at 1/36ECA038 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 sESC[0K
2024-09-09 16:59:41.225 UTC [1] LOG:  database system is ready to accept connectionsESC[0K
2024-09-09 16:59:47.355 UTC [1129115] LOG:  PID 1128695 in cancel request did not match any processESC[0K
2024-09-09 16:59:49.162 UTC [1129116] LOG:  PID 1128696 in cancel request did not match any processESC[0K
2024-09-09 16:59:49.163 UTC [1129117] LOG:  PID 1128718 in cancel request did not match any processESC[0K
2024-09-09 16:59:58.588 UTC [1129139] LOG:  PID 1129004 in cancel request did not match any processESC[0K
2024-09-09 17:00:07.196 UTC [1129164] LOG:  PID 1128841 in cancel request did not match any processESC[0K

Extending memory resources for database pod does not help. I am not able to see any anomalies in Harbor Grafana Dashboards (although there is no postgres specific dashboard provided).
There are neither OOM killer logs nor OOM Kubernetes events.

Here I found similar issues goharbor/harbor#20272, #1184 but they are closed already without solution being provided.

General PostgreSQL suggestion in such cases is to investigate core dumps. I am not sure if Harbor Database Pod has been configured to write core dumps, nor allows such a configuration.

Any suggestions and ideas how to troubleshoot this are very appreciated.

@antwacky
Copy link

I am also seeing this, I've previously resolved this by running some cleanup of the DB as mentioned here although I wish it would do this itself.

@valkiriaaquatica
Copy link

Hi, in my case I was using a storage class tht was using NFS storage with HDD, i changed it to Longhorn+HDD and seems better, no more Always in recovery mode recevied

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants