Frontend client <--> Server connection lost (ERR_NETWORK_CHANGED) (/logout issue) #6550

Open
matusdrobuliak66 opened this issue Oct 16, 2024 · 0 comments
Assignees
Labels
bug buggy, it does not work as expected

Comments

Contributor

matusdrobuliak66 commented Oct 16, 2024

Issue

In the e2e tests we can see that no response is received from the server:
image
I was able to reproduce this from my browser (inside our network!) for different endpoints. I think the /auth/logout issue we often see in the e2e tests has the same behavior.

Investigation

Example of issue for /catalog/services/-/latest

image
image

  • Initial connection. The browser is establishing a connection, including TCP handshakes or retries and negotiating an SSL connection. (MD: What’s interesting is that we see the request arriving at the webserver, even though it is halted at the initial connection.)

Example of good call

image

Log investigation:

Frontend <-> OPS Traefik <-> Simcore Traefik (we didn't enable access logs) <-> Webserver
image
There is a connection issue between the client and Traefik. The frontend probably closes the connection because of some network issue, and Traefik logs a 499. That status is never received on the client side because the connection is already broken or closed. Meanwhile the webserver responds properly with 200, but nobody is listening, as Traefik has already responded with 499.

499

HTTP 499 is not a standard status code; it is not defined in the HTTP/1.1 specification. It was introduced by the nginx web server to indicate that the client closed the connection before the server could send a response, and Traefik logs the same code in that situation.

Notes

  • This is not connected to any timeout; it shows up randomly on different endpoints and after different durations.
    • Searching Graylog for "499" with Traefik access logs enabled shows these requests.
  • The /auth/logout failure is probably the same issue. The strange thing is that it happens quite consistently in the same test (sleeper) and always in the same logout call.
  • It should also be noted that today I had a lot of network issues when browsing the internet (which seem similar to the issue observed above). UPDATE: This might be a side effect, as I was probably running simcore in devel mode:
    image
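
To back up the claim that the 499s are not tied to a timeout, the access-log entries could be grouped by endpoint and their durations compared. Below is a minimal sketch, assuming the log entries have already been exported and parsed into dictionaries. The keys `path`, `status` and `duration_ms` are placeholders and would have to be mapped to whatever fields the real Traefik access logs provide; the sample values are made up for illustration only.

```python
from statistics import median


def summarize_client_closed(records, status=499):
    """Group access-log records with the given status by request path.

    `records` is an iterable of dicts with the (hypothetical) keys
    "path", "status" and "duration_ms".
    """
    durations_by_path = {}
    for rec in records:
        if rec.get("status") != status:
            continue
        durations_by_path.setdefault(rec["path"], []).append(rec["duration_ms"])

    # A wide spread of durations per path argues against a fixed timeout.
    return {
        path: {
            "count": len(durations),
            "min_ms": min(durations),
            "median_ms": median(durations),
            "max_ms": max(durations),
        }
        for path, durations in durations_by_path.items()
    }


# Made-up sample records, not taken from real logs.
sample_records = [
    {"path": "/auth/logout", "status": 499, "duration_ms": 120},
    {"path": "/catalog/services/-/latest", "status": 499, "duration_ms": 4300},
    {"path": "/auth/logout", "status": 499, "duration_ms": 35},
    {"path": "/catalog/services/-/latest", "status": 200, "duration_ms": 80},
]

summary = summarize_client_closed(sample_records)
for path, stats in summary.items():
    print(path, stats)
```

If the durations for one endpoint cluster around a single value, a timeout somewhere in the chain would be worth a second look; widely scattered values would support the observation above.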

Recommendations

  • Odei will add a timeout and retry on /logout (XHR has no timeout by default)
  • Maybe we should start testing from outside our network. For example, we could use AWS Lambda
  • Add a network test (e.g. a ping test) to the GAIA runners
    • Quick win: add GAIA to MONITORING_PROMETHEUS_SMOKEPING_TARGETS, for example in inhouse master. This will ping GAIA every minute, and we can see the results in the Grafana ops dashboard
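
The timeout-and-retry idea for /logout can be sketched independently of the frontend code. This is only an illustration in Python, not the planned implementation: `call_with_retry` and `RequestTimeout` are hypothetical names, and the operation passed in stands for "send the request with a timeout set" (in the browser this would correspond to setting `XMLHttpRequest.timeout` and handling the `timeout` event). It also assumes that repeating the logout call is safe.

```python
import time


class RequestTimeout(Exception):
    """Raised by the supplied operation when no response arrives in time."""


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    Only timeouts and connection errors are retried; the delay doubles
    after each failed attempt (0.5 s, 1 s, ... with the defaults).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (RequestTimeout, ConnectionError) as err:
            last_error = err
            # No point in waiting after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


def make_flaky_operation(failures):
    """Build a fake request that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("network changed")
        return "logged out"

    return operation, state


if __name__ == "__main__":
    op, state = make_flaky_operation(failures=2)
    print(call_with_retry(op, sleep=lambda seconds: None), state["calls"])
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting.
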
@matusdrobuliak66 matusdrobuliak66 added the bug buggy, it does not work as expected label Oct 16, 2024