
long reboot time due to nvme reconnect #1764

Open · todeb opened this issue Oct 28, 2024 · 5 comments
@todeb commented Oct 28, 2024

Describe the bug
After initiating a reboot of a k8s node running Mayastor, the OS spent about 8 minutes trying to reconnect NVMe controllers before it could shut down.

To Reproduce
Initiate a reboot of the OS.

Expected behavior
I don't know how these NVMe devices are handled, but IMO NVMe reconnects should not block a system reboot, especially when the connections are failing.

Screenshots

Oct 24 14:56:38 node1 systemd[1]: Unmounted var-lib-kubelet-pods-c4376117\x2d0426\x2d4eda\x2d8590\x2d5d9fbc8c8a5d-volumes-kubernetes.io\x7ecsi-pvc\x2d81c7ff78\x2d70b7\x2d4cab\x2db72e\x2d76bfd6888e42-mount.mount - /var/lib/>
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Mount process still >
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Failed with result '>
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Unit process 2456087>
Oct 24 14:56:38 node1 systemd[1]: Unmounted var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount - /var/lib/>
Oct 24 14:56:38 node1 systemd[1]: Stopped target local-fs-pre.target - Preparation for Local File Systems.
Oct 24 14:56:38 node1 systemd[1]: Reached target umount.target - Unmount All Filesystems.
Oct 24 14:56:38 node1 systemd[1]: Stopping lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Oct 24 14:56:38 node1 systemd[1]: Stopping multipathd.service - Device-Mapper Multipath Device Controller...
Oct 24 14:56:38 node1 multipathd[3636859]: multipathd: shut down
Oct 24 14:56:38 node1 systemd[1]: systemd-tmpfiles-setup-dev.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
Oct 24 14:56:38 node1 systemd[1]: systemd-tmpfiles-setup-dev-early.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-tmpfiles-setup-dev-early.service - Create Static Device Nodes in /dev gracefully.
Oct 24 14:56:38 node1 kernel: block nvme0n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme0n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme1n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme1n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme2n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme2n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme3n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme3n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme5n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme5n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 systemd[1]: multipathd.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped multipathd.service - Device-Mapper Multipath Device Controller.
Oct 24 14:56:38 node1 systemd[1]: multipathd.service: Consumed 1min 20.798s CPU time, 19.0M memory peak, 0B memory swap peak.
Oct 24 14:56:38 node1 systemd[1]: systemd-remount-fs.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-remount-fs.service - Remount Root and Kernel File Systems.
Oct 24 14:56:40 node1 kernel: tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 1785 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 460794 jiffies old.
Oct 24 14:56:40 node1 kernel: Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
Oct 24 14:56:40 node1 kernel: nvme nvme7: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme8: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme8: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme8: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme1: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme1: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme1: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme5: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme5: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme5: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme3: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme3: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme3: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme6: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme6: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme6: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme2: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme2: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme2: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme7: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme0: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme7: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 1785 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 471034 jiffies old.
Oct 24 14:56:50 node1 kernel: Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
Oct 24 14:56:50 node1 kernel: nvme nvme6: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme5: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme1: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme1: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme1: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme0: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme2: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme2: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme2: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme7: failed to connect socket: -111
.....
Oct 24 15:04:41 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 15:04:41 node1 kernel: nvme nvme0: Failed reconnect attempt 93
Oct 24 15:04:41 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 15:04:41 node1 kernel: nvme nvme6: Failed reconnect attempt 93
Oct 24 15:04:41 node1 kernel: nvme nvme5: Reconnecting in 10 seconds...
Oct 24 15:04:41 node1 kernel: nvme nvme6: Reconnecting in 10 seconds...
Oct 24 15:04:49 node1 systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 2456488.
Oct 24 15:04:49 node1 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Oct 24 15:04:49 node1 systemd-journald[734]: Journal stopped

**OS info (please complete the following information):**

  • Distro: Ubuntu 24.04.1 LTS
  • Kernel version: 6.8.0-47-generic
  • MayaStor revision or container image: openebs.io/version=2.7.0


@tiagolobocastro (Contributor)
Did you reboot without draining the node? The NVMe kernel initiator will keep trying to reconnect for some time.
If you really want to do this, then you need to do something like this on the host:

# Cap the controller loss timeout at 10 seconds so the initiator gives up quickly (the default is 600s):
for dev in /sys/class/nvme/*/ctrl_loss_tmo; do echo 10 | sudo tee -a $dev; done
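
To verify, reading the attribute back should show the new value for each controller:

cat /sys/class/nvme/*/ctrl_loss_tmo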

But I suggest you drain the node before rebooting; otherwise the filesystems may not unmount gracefully, potentially resulting in data loss.
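
For illustration, a minimal drain-then-reboot sequence might look like this (node1 stands in for the node name):

# Evict non-DaemonSet pods so their filesystems unmount cleanly; node1 is a placeholder:
kubectl drain node1 --ignore-daemonsets --delete-emptydir-data
sudo reboot
# Once the node is back, allow scheduling again:
kubectl uncordon node1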

@todeb (Author) commented Oct 28, 2024

Do you mean the kubectl-mayastor drain command or the usual kubectl drain to evict pods?
I'm not draining the node because I'm using a single replica and my apps handle replication themselves, so I don't want them to be scheduled on other nodes.

@tiagolobocastro (Contributor)

> Do you mean the kubectl-mayastor drain command or the usual kubectl drain to evict pods?

I mean the usual kubectl drain, to evict the pods using the mayastor volumes onto other nodes.

> I'm not draining the node because I'm using a single replica and my apps handle replication themselves, so I don't want them to be scheduled on other nodes.

Ah I see... are you using the GracefulNodeShutdown feature?
IIRC that should ensure the apps are stopped gracefully, but the volumeattachments and connections would still remain? CC @Abhinandan-Purkait
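
For context, GracefulNodeShutdown is driven by the kubelet configuration; a quick way to check whether it is enabled (assuming the common /var/lib/kubelet/config.yaml path) would be:

# A non-zero shutdownGracePeriod means the feature is active (config path is an assumption):
grep shutdownGracePeriod /var/lib/kubelet/config.yaml
# Typical values when enabled:
#   shutdownGracePeriod: 30s
#   shutdownGracePeriodCriticalPods: 10s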

@Abhinandan-Purkait (Member) commented Oct 29, 2024

> Ah I see... are you using the GracefulNodeShutdown feature? IIRC that should ensure the apps are stopped gracefully, but the volumeattachments and connections would still remain? CC @Abhinandan-Purkait

Yes, that's true. IIRC we need to remove the attachments manually.
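
As a sketch, the manual cleanup could look like this (node and attachment names are placeholders):

# List attachments bound to the node being rebooted (node1 is a placeholder):
kubectl get volumeattachments | grep node1
# Delete each stale attachment by name:
kubectl delete volumeattachment <attachment-name>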

@todeb (Author) commented Oct 29, 2024

No, not using it.
