Issue
In the e2e tests we can see that some requests receive no response from the server.
I was able to reproduce this from my browser (inside our network!) for different endpoints. I think the /auth/logout issue we often see in the e2e tests has the same behavior.
Investigation
Example of issue for /catalog/services/-/latest
Initial connection: the browser is establishing a connection, including TCP handshakes or retries and negotiating SSL. (MD: Interestingly, we see the request arriving at the webserver even though the browser shows it halted at the initial connection.)
Example of good call
Log investigation:
Frontend <-> OPS Traefik <-> Simcore Traefik (we didn't enable access logs) <-> Webserver
There is a connection issue between the client and Traefik. The frontend probably closes the connection because of some network issue, and Traefik logs a 499. That status is never received on the client side because the connection is already broken or closed. Meanwhile, the webserver responds correctly with 200, but nobody is listening because Traefik has already responded with 499.
499
HTTP 499 is not a standard status code: it is not defined in the HTTP/1.1 specification. It comes from the nginx web server, which uses it to indicate that the client closed the connection before the server could send a response. Traefik logs the same code in that situation.
Notes
This is not connected to any timeout: it shows up randomly on different endpoints and after different durations.
Searching Graylog for "499" with Traefik access logs enabled shows these requests.
The /auth/logout failure is probably the same issue. The strange thing is that it happens quite consistently in the same test (sleeper) and always in the same logout call.
It should also be noted that today I had a lot of network issues when browsing the internet (which seem similar to the issue observed above). UPDATE: This might be a side effect, as I was probably running simcore in devel mode.
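To check whether the 499s cluster on particular endpoints, the Graylog search can be complemented by a small offline count over exported access logs. The following is only a sketch under assumptions: it assumes Traefik access logs in JSON format, one entry per line, with the `RequestPath` and `DownstreamStatus` fields; the function name and the sample lines are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_client_closed(lines: Iterable[str], status: int = 499) -> Counter:
    """Count log entries with the given downstream status, grouped by request path."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON access-log entries
        if not isinstance(entry, dict):
            continue
        if entry.get("DownstreamStatus") == status:
            counts[entry.get("RequestPath", "<unknown>")] += 1
    return counts


# Made-up sample lines, NOT real log output.
SAMPLE_LINES = [
    '{"RequestPath": "/catalog/services/-/latest", "DownstreamStatus": 499}',
    '{"RequestPath": "/auth/logout", "DownstreamStatus": 499}',
    '{"RequestPath": "/catalog/services/-/latest", "DownstreamStatus": 200}',
    '{"RequestPath": "/auth/logout", "DownstreamStatus": 499}',
    "not a json line",
]

if __name__ == "__main__":
    for path, hits in count_client_closed(SAMPLE_LINES).most_common():
        print(f"{hits:4d}  {path}")
```

If the counts are spread evenly over endpoints, that would support the "random network issue" reading; a strong skew towards one path (such as the logout call) would point at something endpoint-specific.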
Recommendations
Odei will add a timeout and retry on /logout (XHR has no timeout by default).
Maybe we should start testing from outside our network, for example using AWS Lambda.
Add a network test (e.g. a ping test) to the GAIA runners.
Quick win: add GAIA to MONITORING_PROMETHEUS_SMOKEPING_TARGETS, for example in inhouse master. This will ping GAIA every minute, and we can see the results in the Grafana ops dashboard.
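The timeout-and-retry idea for /logout can be sketched as follows. The actual change will be made in the frontend on the XHR call; this Python version only illustrates the pattern, for instance for a probe run from outside the network. All names here (`call_with_retry`, `post_logout`, `base_url`) and the default values are assumptions, not part of the existing code.

```python
import time
import urllib.request
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    attempt: Callable[[float], T],
    *,
    timeout_s: float = 5.0,
    retries: int = 3,
    backoff_s: float = 0.5,
    # OSError covers TimeoutError, ConnectionError and urllib's URLError
    # (including HTTPError, so HTTP error statuses are retried as well).
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call attempt(timeout_s) up to `retries` times with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for n in range(retries):
        try:
            return attempt(timeout_s)
        except retry_on:
            if n == retries - 1:
                raise  # out of attempts: propagate the last error
            sleep(backoff_s * 2 ** n)  # 0.5 s, 1 s, 2 s, ... with the defaults
    raise AssertionError("unreachable")


def post_logout(base_url: str, timeout_s: float) -> int:
    """POST to <base_url>/auth/logout with an explicit timeout (hypothetical URL layout)."""
    request = urllib.request.Request(f"{base_url}/auth/logout", data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

A call would look like `call_with_retry(lambda t: post_logout(base_url, t))`, where `base_url` is whatever deployment is being probed. Retrying is reasonable here only because repeating a logout should be harmless; the same wrapper would not be safe around non-idempotent calls.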