hardware-exporter service wasn't installed by the charm #281

Open
przemeklal opened this issue Jul 30, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@przemeklal
Member

The hardware-exporter service was not installed and config-changed hooks failed.

Versions: latest/edge rev 86
First, I tried with latest/stable rev 84 and ran into the same problem.

Hardware: Lenovo ThinkSystem SR665

Steps:

  1. juju deploy ... and relate to infra-node (principal charm) and grafana-agent
  2. all units blocked, asking to attach-resource storcli-deb
  3. I attached the required resource
  4. config-changed hooks crashing, output below
unit-hardware-observer-7: 15:03:28 INFO unit.hardware-observer/7.juju-log Writing file to /etc/hardware-exporter-config.yaml.
unit-hardware-observer-7: 15:03:28 INFO unit.hardware-observer/7.juju-log Writing file to /etc/hardware-exporter-config.yaml - Done.
unit-hardware-observer-7: 15:03:28 INFO unit.hardware-observer/7.juju-log Restarting exporter - hardware-exporter
unit-hardware-observer-7: 15:03:28 WARNING unit.hardware-observer/7.juju-log Restarting exporter - 1 retry
unit-hardware-observer-7: 15:03:28 ERROR unit.hardware-observer/7.juju-log Exporter hardware-exporter crashed unexpectedly: Command ['systemctl', 'restart', 'hardware-exporter'] failed with returncode 5. systemctl output:
Failed to restart hardware-exporter.service: Unit hardware-exporter.service not found.

unit-hardware-observer-7: 15:03:28 ERROR unit.hardware-observer/7.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/lib/charms/operator_libs_linux/v1/systemd.py", line 90, in _systemctl
    proc = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['systemctl', 'restart', 'hardware-exporter']' returned non-zero exit status 5.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/src/service.py", line 215, in restart
    self._restart()
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/src/service.py", line 113, in _restart
    systemd.service_restart(self.exporter_name)
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/lib/charms/operator_libs_linux/v1/systemd.py", line 177, in service_restart
    return _systemctl("restart", *args, check=True) == 0
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/lib/charms/operator_libs_linux/v1/systemd.py", line 104, in _systemctl
    raise SystemdError(
charms.operator_libs_linux.v1.systemd.SystemdError: Command ['systemctl', 'restart', 'hardware-exporter'] failed with returncode 5. systemctl output:
Failed to restart hardware-exporter.service: Unit hardware-exporter.service not found.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/./src/charm.py", line 302, in <module>
    ops.main(HardwareObserverCharm)  # type: ignore
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/main.py", line 563, in __call__
    return main(charm_class, use_juju_for_storage=use_juju_for_storage)
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/main.py", line 551, in main
    manager.run()
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/main.py", line 530, in run
    self._emit()
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/main.py", line 519, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/./src/charm.py", line 243, in _on_config_changed
    exporter.restart()
  File "/var/lib/juju/agents/unit-hardware-observer-7/charm/src/service.py", line 225, in restart
    raise ExporterError() from err
service.ExporterError
unit-hardware-observer-7: 15:03:28 ERROR juju.worker.uniter.operation hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1
unit-hardware-observer-7: 15:03:28 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook
unit-hardware-observer-7: 15:03:30 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook
unit-hardware-observer-7: 15:03:25 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook
unit-hardware-observer-7: 15:03:27 INFO unit.hardware-observer/7.juju-log Config changed called before install complete, deferring event: HardwareObserverCharm/on/config_changed[17]
unit-hardware-observer-7: 15:03:27 INFO unit.hardware-observer/7.juju-log Attempt 1 of /redfish/v1/
unit-hardware-observer-7: 15:03:27 INFO unit.hardware-observer/7.juju-log Response Time for GET to /redfish/v1/: 0.15704713099694345 seconds.
unit-hardware-observer-7: 15:03:27 INFO unit.hardware-observer/7.juju-log Attempt 1 of /redfish/v1/SessionService/Sessions
unit-hardware-observer-7: 15:03:27 INFO unit.hardware-observer/7.juju-log Response Time for POST to /redfish/v1/SessionService/Sessions: 0.32879110499925446 seconds.
unit-hardware-observer-7: 15:03:27 INFO unit.hardware-observer/7.juju-log Login returned code 201: {"Oem":{"Lenovo":{}},"SessionType":"Redfish","@odata.id":"/redfish/v1/SessionService/Sessions/1205","@odata.etag":"\"1a429ea800b5242071d\"","Id":"1205","Name":"1205","@odata.type":"#Session.v1_3_0.Session","UserName":"admin","@odata.context":"/redfish/v1/$metadata#Session.Session","Password":null}

unit-hardware-observer-7: 15:03:27 INFO unit.hardware-observer/7.juju-log Attempt 1 of /redfish/v1/SessionService/Sessions/1205
unit-hardware-observer-7: 15:03:28 INFO unit.hardware-observer/7.juju-log Response Time for DELETE to /redfish/v1/SessionService/Sessions/1205: 0.12732572499953676 seconds.
aieri added the bug label Jul 30, 2024
@jneo8
Contributor

jneo8 commented Jul 31, 2024

The workaround is to force the install hook to run again with juju exec ./hooks/install; this installs the exporter service if all the resources are ready.
(The redetect-hardware juju action would be another way to re-run the install hook, but since no hardware change is detected in this case, it won't work.)

What confuses me is that the juju attach step should trigger the upgrade hook and then the config-changed hook, and the upgrade and install hooks share the same handler in hw-observer. We need to verify why attaching the resource doesn't install the exporter service.

@chanchiwai-ray
Contributor

From the log shared internally, the install hook -> config_changed hook sequence caused the issue (the charm continues to run the config-changed hook even though the install was not successful). Deferring the event could have prevented this, but because of the technical debt tracked in #215, it does not work.

@aieri
Contributor

aieri commented Aug 1, 2024

Fixing #203 may also help here

@aieri
Contributor

aieri commented Aug 1, 2024

iiuc the current approach is to not install the exporter unless all the resources we expect are available. While I think that's a valid approach, given that attaching resources triggers a config_changed event, it would seem cleaner to me if we were to segment the lifecycle into:

  • install event → install the charm (but not the exporter)
  • config_changed
    • missing resources → set to blocked, don't install the exporter
    • all resources available → install the exporter, start the service
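
The lifecycle split proposed above could be sketched roughly as follows. This is a plain-Python model, not the charm's code: `LifecycleSketch`, its fields, and the status strings are hypothetical names chosen for illustration, and the real charm would use the ops framework's events and status objects instead.

```python
from dataclasses import dataclass, field


@dataclass
class LifecycleSketch:
    """Hypothetical model of the proposed install / config_changed split."""

    required_resources: set
    attached_resources: set = field(default_factory=set)
    exporter_installed: bool = False
    status: str = "maintenance"

    def on_install(self):
        # Install the charm itself, but leave the exporter alone.
        self.status = "waiting"

    def on_config_changed(self):
        missing = self.required_resources - self.attached_resources
        if missing:
            # Missing resources: block and do not install the exporter.
            self.status = "blocked: missing " + ", ".join(sorted(missing))
            return
        # All resources available: install the exporter and start the service.
        self.exporter_installed = True
        self.status = "active"


unit = LifecycleSketch(required_resources={"storcli-deb"})
unit.on_install()
unit.on_config_changed()  # resource not attached yet, unit stays blocked
print(unit.status)
unit.attached_resources.add("storcli-deb")
unit.on_config_changed()  # attaching a resource triggers config_changed
print(unit.status)
```

With this shape, the exporter is never restarted before it exists, because installation and start both happen on the same code path.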

Another approach would possibly be making prometheus-hardware-exporter be able to deal with missing binaries, so we can install it and start it in the install event and dynamically enable extra metrics if/when binaries like storcli become available.
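
One way the exporter could tolerate missing binaries is to probe for them at runtime. The sketch below is only an assumption about how that might look: `COLLECTOR_BINARIES`, the collector names, the binary names, and `available_collectors` are hypothetical and are not taken from prometheus-hardware-exporter.

```python
import shutil

# Hypothetical mapping from collector name to the binary it depends on.
COLLECTOR_BINARIES = {
    "storcli": "storcli",
    "ipmi": "ipmitool",
}


def available_collectors(which=shutil.which):
    """Return the collectors whose binary can currently be found on PATH."""
    return sorted(
        name for name, binary in COLLECTOR_BINARIES.items() if which(binary)
    )


# Evaluated on every call, so a tool installed later is picked up
# without reinstalling the exporter.
print(available_collectors())
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it is the standard library's `shutil.which`.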

@Pjack

Pjack commented Aug 1, 2024

I agree with @chanchiwai-ray's guess.
We should defer the event and return from the _on_config_changed function immediately.
#215
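
A minimal sketch of the defer-and-return pattern, using stand-in classes rather than the real ops framework so that it runs on its own. `FakeEvent`, `CharmSketch`, `install_complete`, and `restarts` are hypothetical names, not the charm's actual code; in the ops framework the corresponding call is `event.defer()`.

```python
class FakeEvent:
    """Stand-in for an ops event; it only records whether defer() was called."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


class CharmSketch:
    """Hypothetical charm that models only the config-changed path."""

    def __init__(self, install_complete):
        self.install_complete = install_complete
        self.restarts = 0

    def _on_config_changed(self, event):
        if not self.install_complete:
            # Deferring alone does not stop the handler, so return right away.
            event.defer()
            return
        # Only reached once the exporter service has been installed.
        self.restarts += 1


event = FakeEvent()
charm = CharmSketch(install_complete=False)
charm._on_config_changed(event)
print(event.deferred, charm.restarts)
```

Without the `return`, the handler would fall through to the restart step even though the event was deferred, which matches the log above where the deferral message is followed by a restart attempt.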

Furthermore, based on the description, it seems we can easily reproduce this issue when the resource is unavailable. Please ensure this case is covered in the functional test cases when we address this issue.

Regarding @aieri's idea, it seems feasible and it's a clear strategy. But I am a bit concerned that we might be doing too much in the _on_config_changed function, which is not intuitive given its name. Additionally, there is an implicit assumption that the config_changed event is always emitted after the install event. Anyway, we can discuss that later. I suggest we fix the issue first and refactor once the charm is in a stable state.
