Removed node cannot rejoin properly #140

Open
mengrj opened this issue Apr 17, 2023 · 1 comment

mengrj commented Apr 17, 2023

While using Jepsen to test raft-0.17.1 and dqlite-0.14.0, we frequently observe that a node removed from the cluster cannot rejoin properly. Every restart attempt exits immediately with "node was removed":

2023-04-17 07:15:12 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:15:12.708606 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:15:12.708674 node was removed
2023-04-17 07:15:45 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:15:45.284433 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:15:45.284495 node was removed
2023-04-17 07:15:47 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:15:47.698687 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:15:47.698761 node was removed
2023-04-17 07:16:17 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:17.750930 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:16:17.750983 node was removed
2023-04-17 07:16:35 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:35.314053 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:16:35.314106 node was removed
2023-04-17 07:16:37 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:37.821749 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"
2023/04/17 07:16:37.821814 node was removed
2023-04-17 07:16:50 Jepsen starting LIBDQLITE_TRACE=1 LIBRAFT_TRACE=1 /opt/fs/dqlite/app -dir /opt/fs/dqlite/data -node n1 -latency 10 -cluster n1,n2,n3,n4,n5
2023/04/17 07:16:50.335459 starting "n1" with IP "10.1.1.3" and cluster "n1,n2,n3,n4,n5"

mengrj commented Apr 17, 2023

I am using your Jepsen harness at commit 33749ac4. Because of this issue, the whole cluster quickly becomes unavailable. From what I have observed, the membership nemesis does not work properly: sometimes the grow/shrink nemesis runs successfully, but at other times it cannot be executed and simply crashes, leaving most of the nodes removed and unable to rejoin.
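
For triage, the restart log in the first comment can be summarized mechanically. The following is a minimal sketch, not part of the Jepsen harness: the function name `summarize_restarts` and both regular expressions are assumptions derived only from the line shapes visible in that excerpt.

```python
import re

# Assumed line shapes, taken from the log excerpt in this issue:
#   "<date> <time> Jepsen starting ... -node n1 ..."  (harness restarts the node)
#   "<date> <time> node was removed"                  (app output on startup)
START_RE = re.compile(r"Jepsen starting .*-node (\S+)")
REMOVED_RE = re.compile(r"node was removed\s*$")


def summarize_restarts(lines):
    """Return one (node, was_removed) pair per restart attempt, in log order."""
    attempts = []
    for line in lines:
        match = START_RE.search(line)
        if match:
            # A new restart attempt begins; assume it succeeded until told otherwise.
            attempts.append([match.group(1), False])
        elif REMOVED_RE.search(line) and attempts:
            # The app reported removal for the most recent attempt.
            attempts[-1][1] = True
    return [tuple(attempt) for attempt in attempts]
```

Run over the excerpt above, this would report seven restart attempts for n1, the first six ending in "node was removed"; the seventh is cut off in the pasted log, so its outcome is unknown.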

mengrj changed the title from "Killed node cannot restart properly" to "Removed node cannot rejoin properly" on Apr 17, 2023
cole-miller transferred this issue from canonical/raft on Feb 13, 2024