Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added SmartSwitch support in chassisd and enabling chassisd #467

Open
wants to merge 171 commits into
base: master
Choose a base branch
from

Conversation

rameshraghupathy
Copy link

@rameshraghupathy rameshraghupathy commented Apr 15, 2024

Added SmartSwitch support in chassisd and enabling chassisd for fixed SmartSwitches

Description

chassisd is enabled only for modular chassis. Smartswitch is a fixed chassis. However it has been treated like a modular chassis to manage the DPU cards just like the line-cards of a modular chassis. chassisd will be enabled only on the smartswitch NPU and hence the scope of these changes are limited only to NPU and not applicable to DPU.

Motivation and Context

chassisd is enabled only for modular chassis. Smartswitch is a fixed chassis. However it has been treated like a modular chassis to manage the DPU cards just like the line-cards of a modular chassis. Hence, chassid is needed for SmartSwitch and enabled here. Also, some of the table updates and clean up are not required for SmartSwitch platform and hence using the is_smartswitch API to selectively enable it. the text "fixes #xxxx", "closes #xxxx" or "resolves #xxxx" here

How Has This Been Tested?

  1. Enabled and built and image and checked if the chassisd is running all the time without crashing
  2. Verify if the config change handler is working as expected by issuing config CLIs to startup and shutdown DPUs

Additional Information (Optional)

@oleksandrivantsiv
Copy link
Collaborator

What are the states supported by the DPUs in the Smart Switch?

@rameshraghupathy
Copy link
Author

What are the states supported by the DPUs in the Smart Switch?

”dpu_midplane_link_state”
”dpu_control_plane_state"
"dpu_data_plane_state"

sonic-chassisd/scripts/chassisd Outdated Show resolved Hide resolved
sonic-chassisd/scripts/chassisd Outdated Show resolved Hide resolved
accidental removal of a few lines from the original chassisd file
@@ -184,7 +188,7 @@ class ModuleUpdater(logger.Logger):
self.chassis = chassis
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that instead of modifying the existing ModuleUpdater class, we should implement SmartSwitchModuleUpdater(ModuleUpdater). SmartSwitchModuleUpdater should be derived from ModuleUpdater and overwrite the methods that should behave differently for the Smart Switch. This approach allows us to keep the original implementation untouched and guarantees full backward compatibility with the chassisd.

Copy link
Author

@rameshraghupathy rameshraghupathy Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will consider this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@oleksandrivantsiv oleksandrivantsiv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as commented


# Initialize DPU_STATE DB table on bootup
dpu_state_key = "DPU_STATE|" + module_name
self.module_updater.update_dpu_state(dpu_state_key, admin_state.upper())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we update the midplane state with the configured admin_state instead of checking the reachability of the midplane here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to query the operation state of the interface connected to the DPU. And if the oper state is up, the we should test the reachability of IP address of the interface on the DPU side

Copy link
Author

@rameshraghupathy rameshraghupathy Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the initialization of the system we want to make sure the system comes up correctly with the configured state, which is the admin_state. During boot up the operational state of the DPUs could be anything. So,

  1. Get the admin_state and initialize the DPUs to that state
  2. Set the DPU_STATE table as per the admins state
  3. Soon the periodic update of the DPU_STATE will kick in which will check for the operational state of the DPU and update it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming before we call the set_initial_dpu_admin_state the DPUs are powered off but the CONFIG_DB entry is set to admin_state=up we are setting the dpu_midplane_link_state=up in the CHASSIS_STATE_DB before we even power on the DPUs (which is done in the submit_dpu_callback later in this function), if the DPU power on fails or if we do not achieve midplane reachability we have the wrong data in the CHASSIS_STATE_DB

self.update_dpu_state(key, "UP")

elif midplane_access is False and current_midplane_state == 'False':
if self.is_module_reboot_system_up_expired(module_key):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this function call?

midplane_state = 'False'
if isinstance(fvs, list) and fvs[0] is True:
fvs = dict(fvs[-1])
midplane_state = fvs[CHASSIS_MIDPLANE_INFO_ACCESS_FIELD]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are checking for the midplane_state here in the DB (and if it is not present in the DB we are assuming that the module is not reachable since midplane_state=False by default) but in case of Nvidia platform the midplane_state is being updated to the DB by Chassisd itself, so chassisd here is checking for midplane state in DB before it is supposed to update the midplane state itself. Either we need to re order code so that we update midplane_state to db before we check the DB for values, or instead of checking the DB we can directly invoke the midplane reachability API

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this may cause issues, assuming the dpu is powered on initially, by default if there is no data in DB we assume the midplane_state is false, if no config is present the DPU is supposed to power off, but admin_state == 'down' and midplane_state is false, so the operation is skipped, causing DPU to stay on

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants