Never closed sockets when remote ZYRE node goes offline #4729

stephan57160 · 2024-08-20T09:26:37Z

Issue description

On our ZYRE production server, we occasionally observe sockets that are never closed.
Sometimes up to 200 such sockets accumulate for the same remote ZYRE node.

Environment

libzmq version (commit hash if unreleased): 3.4
OS: reproduced on
- Linux CentOS (32 & 64 bits - x86 and ARM),
- Rocky (64 bits) (x86)

Minimal test code / Steps to reproduce the issue

Start ZYRE node A
Start ZYRE node B
On node A, 2 TCP sockets are seen with Node B:

Node A connected to Node B (used to send data to B).
Node B connected to Node A (used to receive data from B).

Node B goes offline (out of Wi-Fi coverage, Ethernet cable unplugged, Windows hibernation, ...)
On node A, after some time, the ZYRE layer detects that node B is no longer present, and peer B is destroyed together with the socket to it (node A to B).

What's the actual result? (include assertion message & call stack if applicable)

Socket from node B to node A is never closed, even if

node B application is restarted or
node B is rebooted.

Note:
This is not visible if the application on node B is stopped properly (thanks to the TCP layer notifying node A, e.g. with a TCP RESET).
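
To illustrate the contrast described in the note, here is a minimal sketch using plain POSIX sockets on loopback (not ZYRE or libzmq code; the function name `recv_after_clean_close` is hypothetical). When the connecting side closes cleanly, the accepting side is notified and `recv()` reports end-of-stream. When the remote host simply vanishes, no such notification arrives, so nothing prompts the accepting side to close its socket.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical demo, not ZYRE or libzmq code. Returns what recv() reports on
// an accepted socket after the peer closes cleanly: 0 means end-of-stream.
// Returns -2 if the loopback setup itself fails.
long recv_after_clean_close ()
{
    int lfd = socket (AF_INET, SOCK_STREAM, 0); // listening socket
    int cfd = socket (AF_INET, SOCK_STREAM, 0); // connecting socket ("node B")
    int afd = -1;                               // accepted socket ("node A")
    long result = -2;

    sockaddr_in addr {};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    addr.sin_port = 0; // let the kernel pick a free port
    socklen_t len = sizeof addr;

    if (lfd >= 0 && cfd >= 0
        && bind (lfd, reinterpret_cast<sockaddr *> (&addr), sizeof addr) == 0
        && listen (lfd, 1) == 0
        && getsockname (lfd, reinterpret_cast<sockaddr *> (&addr), &len) == 0
        && connect (cfd, reinterpret_cast<sockaddr *> (&addr), sizeof addr) == 0
        && (afd = accept (lfd, nullptr, nullptr)) >= 0) {
        close (cfd); // orderly close: the kernel sends a FIN to the other end
        cfd = -1;
        char buf[16];
        // Blocks until the FIN arrives, then returns 0 (end-of-stream).
        result = recv (afd, buf, sizeof buf, 0);
    }
    if (afd >= 0)
        close (afd);
    if (cfd >= 0)
        close (cfd);
    if (lfd >= 0)
        close (lfd);
    return result;
}
```

In the failing scenario there is no equivalent of the `close (cfd)` step: node B disappears without its kernel sending anything, so a blocking `recv()` on node A would wait indefinitely unless something such as TCP keepalive probes the connection.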

What's the expected result?

Sockets from remote nodes should be closed automatically when the remote node disappears, by either of these means:

The ZYRE peer destruction should close them, or
TCP KEEPALIVE, enabled from the ZYRE application, should detect the dead connection.

I could not get a working implementation in either case.

Possible solution

I dug into LIBZMQ and ZYRE for quite some time.
I tried different approaches, but I never managed to get access to the ACCEPT()ed socket
in this particular scenario.

Finally, I have a draft of a possible workaround that enables TCP KEEPALIVE right after a particular ACCEPT() in tcp_listener.cpp.
Basically, the idea is:

  sock = accept(s_);
  ...
  tune_tcp_keepalives(sock, x, y, y);
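
For readers unfamiliar with what such a call does at the OS level, the sketch below shows one way to enable keepalive on an already-accepted socket with plain `setsockopt()` calls. It is independent of libzmq internals: `enable_keepalive` is a hypothetical helper, not libzmq's `tune_tcp_keepalives`, and the `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are assumed to be available, as they are on Linux.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical helper (not libzmq's tune_tcp_keepalives): enables TCP
// keepalive probes on a TCP socket, e.g. one returned by accept().
// idle_s:  seconds of silence before the first probe is sent
// intvl_s: seconds between subsequent probes
// cnt:     number of unanswered probes before the connection is dropped
// Returns 0 on success, -1 if any setsockopt() call fails (errno is set).
int enable_keepalive (int fd, int idle_s, int intvl_s, int cnt)
{
    const int on = 1;
    if (setsockopt (fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    // The three options below are Linux-specific names.
    if (setsockopt (fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0)
        return -1;
    if (setsockopt (fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) != 0)
        return -1;
    if (setsockopt (fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) != 0)
        return -1;
    return 0;
}
```

With example values such as `enable_keepalive (sock, 60, 10, 5)`, a silent peer would be declared dead after roughly 60 + 5 × 10 = 110 seconds, at which point pending reads on the socket fail and the owner can close it. The values are illustrative only, not a recommendation for ZYRE.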
