
TCP sockets for health checks make the broker crash in containers #1672

Open

ravaga opened this issue Sep 24, 2024 · 0 comments

ravaga commented Sep 24, 2024

I have deployed a containerized Orion-LD instance using the latest version (both in Docker and Kubernetes) with the -socketService and (optionally) the -ssPort 1027 CLI arguments, in order to perform health checks on the broker over TCP sockets instead of calling the /version endpoint.
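
For reference, a minimal Docker invocation along these lines reproduces the setup (the image tag, database host and port mappings are illustrative placeholders, not my exact deployment):

docker run -d --name orion-ld -p 1026:1026 -p 1027:1027 \
  fiware/orion-ld:latest -dbhost mongodb -socketService -ssPort 1027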

When the broker starts without any work (no entities and no requests), the TCP socket works as expected, and the Kubernetes health checks even pass with the configuration from the official Helm chart:

readinessProbe:
  tcpSocket:
    port: 1027
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
  timeoutSeconds: 30
livenessProbe:
  tcpSocket:
    port: 1027
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
  timeoutSeconds: 30

However, when the broker starts with some work (multiple entities being updated every second), it crashes unexpectedly, without any error message in its logs.

Furthermore, the same happens when an external program performs these health checks (it opens a TCP connection to check whether the broker is healthy and then closes it). When the broker is idle, everything works as expected; even if a message is sent over the connection, the broker just logs A message was sent over the socket service - the broker is not ready for that - closing connection. But if the broker has some amount of work, it crashes after the second or third health check, i.e. when a second or third TCP connection is opened.
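
A minimal sketch of such an external check, assuming the broker's socket service is reachable on localhost:1027 (host, port and timeout are illustrative values):

import socket

def broker_is_healthy(host="localhost", port=1027, timeout=5.0):
    # Open a plain TCP connection to the socket-service port and close it
    # again; a successful connection is the only health signal used.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("healthy" if broker_is_healthy() else "unhealthy")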

Sometimes, when inspecting the logs/events of the pod in K8s, a command terminated with exit code 137 message is shown. Exit code 137 corresponds to 128 + SIGKILL (signal 9), which usually points to the kernel OOM killer. Nevertheless, the pod doesn't have any memory limit, and I have tested it in different clusters whose nodes had plenty of free memory.
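
One way to confirm whether the container was actually OOM-killed (the pod name below is a placeholder) is to inspect the last termination state reported by Kubernetes, which prints OOMKilled in that case:

kubectl get pod <orion-ld-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'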
