
long reboot time due to nvme reconnect #1764

Open · todeb opened this issue Oct 28, 2024 · 5 comments
@todeb commented Oct 28, 2024

Describe the bug
After initiating a reboot of a k8s node running Mayastor, the OS spent about 8 minutes trying to reconnect NVMe controllers before it could shut down.

To Reproduce
Initiate a reboot of the OS.

Expected behavior
I don't know how these NVMe devices are handled, but IMO NVMe reconnects should not block a system reboot, especially when the connections are failing.

Screenshots

Oct 24 14:56:38 node1 systemd[1]: Unmounted var-lib-kubelet-pods-c4376117\x2d0426\x2d4eda\x2d8590\x2d5d9fbc8c8a5d-volumes-kubernetes.io\x7ecsi-pvc\x2d81c7ff78\x2d70b7\x2d4cab\x2db72e\x2d76bfd6888e42-mount.mount - /var/lib/>
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Mount process still >
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Failed with result '>
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Unit process 2456087>
Oct 24 14:56:38 node1 systemd[1]: Unmounted var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount - /var/lib/>
Oct 24 14:56:38 node1 systemd[1]: Stopped target local-fs-pre.target - Preparation for Local File Systems.
Oct 24 14:56:38 node1 systemd[1]: Reached target umount.target - Unmount All Filesystems.
Oct 24 14:56:38 node1 systemd[1]: Stopping lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Oct 24 14:56:38 node1 systemd[1]: Stopping multipathd.service - Device-Mapper Multipath Device Controller...
Oct 24 14:56:38 node1 multipathd[3636859]: multipathd: shut down
Oct 24 14:56:38 node1 systemd[1]: systemd-tmpfiles-setup-dev.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
Oct 24 14:56:38 node1 systemd[1]: systemd-tmpfiles-setup-dev-early.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-tmpfiles-setup-dev-early.service - Create Static Device Nodes in /dev gracefully.
Oct 24 14:56:38 node1 kernel: block nvme0n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme0n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme1n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme1n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme2n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme2n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme3n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme3n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme5n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme5n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 systemd[1]: multipathd.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped multipathd.service - Device-Mapper Multipath Device Controller.
Oct 24 14:56:38 node1 systemd[1]: multipathd.service: Consumed 1min 20.798s CPU time, 19.0M memory peak, 0B memory swap peak.
Oct 24 14:56:38 node1 systemd[1]: systemd-remount-fs.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-remount-fs.service - Remount Root and Kernel File Systems.
Oct 24 14:56:40 node1 kernel: tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 1785 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 460794 jiffies old.
Oct 24 14:56:40 node1 kernel: Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
Oct 24 14:56:40 node1 kernel: nvme nvme7: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme8: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme8: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme8: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme1: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme1: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme1: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme5: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme5: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme5: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme3: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme3: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme3: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme6: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme6: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme6: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme2: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme2: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme2: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme7: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme0: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme7: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 1785 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 471034 jiffies old.
Oct 24 14:56:50 node1 kernel: Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
Oct 24 14:56:50 node1 kernel: nvme nvme6: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme5: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme1: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme1: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme1: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme0: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme2: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme2: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme2: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme7: failed to connect socket: -111
.....
Oct 24 15:04:41 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 15:04:41 node1 kernel: nvme nvme0: Failed reconnect attempt 93
Oct 24 15:04:41 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 15:04:41 node1 kernel: nvme nvme6: Failed reconnect attempt 93
Oct 24 15:04:41 node1 kernel: nvme nvme5: Reconnecting in 10 seconds...
Oct 24 15:04:41 node1 kernel: nvme nvme6: Reconnecting in 10 seconds...
Oct 24 15:04:49 node1 systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 2456488.
Oct 24 15:04:49 node1 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Oct 24 15:04:49 node1 systemd-journald[734]: Journal stopped

**OS info (please complete the following information):**

  • Distro: Ubuntu 24.04.1 LTS
  • Kernel version: 6.8.0-47-generic
  • MayaStor revision or container image: openebs.io/version=2.7.0


@tiagolobocastro (Contributor)
Did you reboot without draining the node? The NVMe kernel initiator will keep trying to reconnect for some time.
If you really want to do this, then you need to do something like this on the host:

# Cap the controller loss timeout at 10 seconds so the initiator gives up quickly (the default is 600s):
for dev in /sys/class/nvme/*/ctrl_loss_tmo; do echo 10 | sudo tee -a $dev; done
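
To verify, reading the attribute back should show the new value for each controller:

cat /sys/class/nvme/*/ctrl_loss_tmo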

But I suggest you drain the node before rebooting; otherwise the filesystems may not unmount gracefully, potentially resulting in data loss.
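
For illustration, a minimal drain-then-reboot sequence might look like this (node1 stands in for the node name):

# Evict non-DaemonSet pods so their filesystems unmount cleanly; node1 is a placeholder:
kubectl drain node1 --ignore-daemonsets --delete-emptydir-data
sudo reboot
# Once the node is back, allow scheduling again:
kubectl uncordon node1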

@todeb (Author) commented Oct 28, 2024

Do you mean the kubectl-mayastor drain command or the usual kubectl drain to evict pods?
I'm not draining the node because I'm using a single replica and my apps handle replication themselves, so I don't want them to be scheduled on other nodes.

@tiagolobocastro (Contributor)

> Do you mean the kubectl-mayastor drain command or the usual kubectl drain to evict pods?

I mean the usual kubectl drain, to evict the pods using the mayastor volumes onto other nodes.

> I'm not draining the node because I'm using a single replica and my apps handle replication themselves, so I don't want them to be scheduled on other nodes.

Ah I see... are you using the GracefulNodeShutdown feature?
IIRC that should ensure the apps are stopped gracefully, but the volumeattachments and connections would still remain? CC @Abhinandan-Purkait
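
For context, GracefulNodeShutdown is driven by the kubelet configuration; a quick way to check whether it is enabled (assuming the common /var/lib/kubelet/config.yaml path) would be:

# A non-zero shutdownGracePeriod means the feature is active (config path is an assumption):
grep shutdownGracePeriod /var/lib/kubelet/config.yaml
# Typical values when enabled:
#   shutdownGracePeriod: 30s
#   shutdownGracePeriodCriticalPods: 10s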

@Abhinandan-Purkait (Member) commented Oct 29, 2024

> Ah I see... are you using the GracefulNodeShutdown feature? IIRC that should ensure the apps are stopped gracefully, but the volumeattachments and connections would still remain? CC @Abhinandan-Purkait

Yes, that's true. IIRC we need to remove the attachments manually.
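
As a sketch, the manual cleanup could look like this (node and attachment names are placeholders):

# List attachments bound to the node being rebooted (node1 is a placeholder):
kubectl get volumeattachments | grep node1
# Delete each stale attachment by name:
kubectl delete volumeattachment <attachment-name>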

@todeb (Author) commented Oct 29, 2024

No, not using it.
