Removed node cannot rejoin properly #140

mengrj · 2023-04-17T07:22:30Z

While using Jepsen to test raft-0.17.1 and dqlite-0.14.0, we frequently observe that a removed node cannot rejoin the cluster properly:

2023-04-17 07:15:12 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:15:12.708606 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:15:12.708674 node was removed
2023-04-17 07:15:45 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:15:45.284433 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:15:45.284495 node was removed
2023-04-17 07:15:47 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:15:47.698687 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:15:47.698761 node was removed
2023-04-17 07:16:17 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:17.750930 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:16:17.750983 node was removed
2023-04-17 07:16:35 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:35.314053 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:16:35.314106 node was removed
2023-04-17 07:16:37 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:37.821749 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:16:37.821814 node was removed
2023-04-17 07:16:50 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:50.335459 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
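
To make the pattern in the log above easier to see, here is a minimal sketch that counts start attempts and flags the ones immediately followed by "node was removed". The function name `summarize_restarts` and the regular expressions are assumptions based only on the line formats shown in this excerpt. They are not part of dqlite, raft, or the Jepsen harness.

```python
import re

# Application log lines use slashes in the date, e.g.
#   2023/04/17 07:15:12.708606 starting "n1" with IP ...
# The Jepsen wrapper lines use dashes (2023-04-17 ...) and are skipped.
START_RE = re.compile(
    r'^\d{4}/\d{2}/\d{2} (\d{2}:\d{2}:\d{2})\.\d+ starting "([^"]+)"'
)
REMOVED_RE = re.compile(
    r'^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+ node was removed'
)


def summarize_restarts(lines):
    """Return one (time, node, removed) tuple per start attempt.

    `removed` is True when a "node was removed" line appears after
    that start and before the next one.
    """
    attempts = []
    for raw in lines:
        line = raw.strip()
        match = START_RE.match(line)
        if match:
            attempts.append([match.group(1), match.group(2), False])
        elif REMOVED_RE.match(line) and attempts:
            attempts[-1][2] = True
    return [tuple(a) for a in attempts]
```

Run over the excerpt above, this should report seven start attempts for n1, the first six of which are followed by "node was removed". The excerpt ends after the seventh start line.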


mengrj · 2023-04-17T11:05:47Z

I am using your Jepsen harness at commit-33749ac4. Because of this issue, the whole system quickly becomes unavailable. From what I observe, the membership nemesis does not work properly: sometimes the grow/shrink nemesis works, and other times it cannot be executed and simply crashes, leaving most of the nodes removed and unable to rejoin.

mengrj changed the title ~~Killed node cannot restart properly~~ Removed node cannot rejoin properly Apr 17, 2023

cole-miller transferred this issue from canonical/raft Feb 13, 2024
