# Kepler Operator v1 Working Doc

Following the existing discussion here.

## The proposed CRs
```mermaid
flowchart TD;
    machine-config
    kepler-system
    kepler-collected-metric
    kepler-exported-power
```
- Instead of using `integrated-operator-install` to install `prometheus` and `grafana` via the operator, it should be left to the user to set up the monitoring stack.
- Each component should be represented as a separate CR and managed by a separate controller.
```yaml
apiVersion: sustainable-computing-io/v1alpha1
kind: Kepler
metadata:
  name: kepler-system
  namespace: kepler-system
spec:
  scrape-interval:
  collector:
    image:
    port: (default: 9102)
  estimator-sidecar:
    enabled: (default: false)
    image:
    mnt-path: (default: /tmp)
  model-server:
    enabled: (default: ⚠️ false)
    storage:
      type: (default: local?, values: local, hostpath, nfs, external (such as via s3))
      path: (default: models)
    sampling-period:
```
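For reference, here is a minimal kubebuilder-style sketch of how the proposed Kepler CR could be expressed as Go API types. The struct and JSON field names below are illustrative assumptions mirroring the YAML above, not a final API.

```go
// Package v1alpha1 sketches the proposed Kepler CR as Go types.
// All struct and field names here are illustrative assumptions.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CollectorSpec mirrors the `collector` section of the proposed spec.
type CollectorSpec struct {
	Image string `json:"image,omitempty"`
	Port  int32  `json:"port,omitempty"` // default: 9102
}

// EstimatorSidecarSpec mirrors the `estimator-sidecar` section.
type EstimatorSidecarSpec struct {
	Enabled bool   `json:"enabled,omitempty"` // default: false
	Image   string `json:"image,omitempty"`
	MntPath string `json:"mntPath,omitempty"` // default: /tmp
}

// ModelServerStorageSpec mirrors the `storage` section of `model-server`.
type ModelServerStorageSpec struct {
	Type string `json:"type,omitempty"` // local, hostpath, nfs, external (e.g. via s3)
	Path string `json:"path,omitempty"` // default: models
}

// ModelServerSpec mirrors the `model-server` section.
type ModelServerSpec struct {
	Enabled        bool                   `json:"enabled,omitempty"` // default: false
	Storage        ModelServerStorageSpec `json:"storage,omitempty"`
	SamplingPeriod string                 `json:"samplingPeriod,omitempty"`
}

// KeplerSpec is the spec of the proposed Kepler CR.
type KeplerSpec struct {
	ScrapeInterval   string               `json:"scrapeInterval,omitempty"`
	Collector        CollectorSpec        `json:"collector,omitempty"`
	EstimatorSidecar EstimatorSidecarSpec `json:"estimatorSidecar,omitempty"`
	ModelServer      ModelServerSpec      `json:"modelServer,omitempty"`
}

// Kepler is the top-level CR object reconciled by the Kepler controller.
type Kepler struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              KeplerSpec `json:"spec,omitempty"`
}
```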
- `kepler-collectd-metric` and `kepler-exported-power`: What are these components meant to do? It seems like they set some configurations. Where does Kepler use these configs?
- `kepler-collectd-metric`: the list of metrics to be collected by the collector pkg, separated by input source.
- `kepler-exported-power`: the list of metrics to export to Prometheus for each level (node, package, pod); a rough sketch of both as Go structs follows below.
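Under the same assumptions as the sketch above, these two CRs could map to Go API types roughly like the following; the type and field names (`CollectedMetricSpec`, `ExportedPowerSpec`, the `Sources` map) are hypothetical, not existing Kepler types.

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CollectedMetricSpec sketches `kepler-collectd-metric`: which metrics the
// collector pkg should gather, grouped by input source. Illustrative only.
type CollectedMetricSpec struct {
	// Sources maps an input source name to the metrics collected from it,
	// e.g. "COUNTER" -> ["cpu_cycles", "cache_miss"].
	Sources map[string][]string `json:"sources,omitempty"`
}

// ExportedPowerSpec sketches `kepler-exported-power`: which metrics are
// exported to Prometheus at each level. Illustrative only.
type ExportedPowerSpec struct {
	Node    []string `json:"node,omitempty"`
	Package []string `json:"package,omitempty"`
	Pod     []string `json:"pod,omitempty"`
}

// CollectedMetric and ExportedPower are the corresponding CR objects.
type CollectedMetric struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              CollectedMetricSpec `json:"spec,omitempty"`
}

type ExportedPower struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ExportedPowerSpec `json:"spec,omitempty"`
}
```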
Now these configurations are in two locations: exporter.go and config.go. However, these sections are supposed to be refactored and set as environment variables via a ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kepler-cfm
  namespace: kepler-system
data:
  SOURCE.COUNTER: enabled
  SOURCE.CGROUP: enabled
  SOURCE.KUBELET: enabled
  SOURCE.GPU: enabled
  EXPORT_METRICS: cpu_cycles, cached_miss, cpu_time, ...
```
Then link this config map to the deployment.
```yaml
spec:
  volumes:
    - name: cfm
      configMap:
        name: kepler-model-server-cfm
  containers:
    - name: server-api
      image: quay.io/sustainable_computing_io/kepler:latest
      ...
      volumeMounts:
        - name: cfm
          mountPath: /etc/config
          readOnly: true
```
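For illustration, here is a minimal sketch of how Kepler could read these settings once the ConfigMap is mounted (each data key becomes a file under /etc/config). The helper names (`readKey`, `SourceEnabled`, `ExportMetrics`) are assumptions, not existing Kepler functions.

```go
package config

import (
	"os"
	"path/filepath"
	"strings"
)

// configDir is where the ConfigMap is mounted in the container (see the
// volumeMount above); each data key becomes a file under this directory.
const configDir = "/etc/config"

// readKey returns the value of one ConfigMap entry, or "" if it is absent.
func readKey(key string) string {
	b, err := os.ReadFile(filepath.Join(configDir, key))
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

// SourceEnabled reports whether a metric source (COUNTER, CGROUP, KUBELET,
// GPU) is enabled, e.g. SourceEnabled("COUNTER") checks SOURCE.COUNTER.
func SourceEnabled(source string) bool {
	return readKey("SOURCE."+source) == "enabled"
}

// ExportMetrics returns the comma-separated EXPORT_METRICS list as a slice.
func ExportMetrics() []string {
	var metrics []string
	for _, m := range strings.Split(readKey("EXPORT_METRICS"), ",") {
		if m = strings.TrimSpace(m); m != "" {
			metrics = append(metrics, m)
		}
	}
	return metrics
}
```

Reading the mounted files rather than injected environment variables would also let the settings be picked up on ConfigMap updates without restarting the pod, but either approach works with the same ConfigMap.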
Currently, the list of metrics from each source (COUNTER, CGROUP, etc.) is fixed for grouping power models. Shall we change this? (low priority)
Additionally, some configurations are still hard-coded, such as `PodTotalPowerModelConfig` in pod_power.go.
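As a loose illustration of moving such a value out of the source, here is a hedged sketch that lets a hypothetical MODEL_CONFIG_POD_TOTAL_POWER environment variable (which the operator could populate from the ConfigMap above) override a hard-coded default. The struct and its fields are stand-ins, not Kepler's actual `PodTotalPowerModelConfig` type.

```go
package model

import "os"

// PowerModelConfig is a stand-in for a model selection config such as
// PodTotalPowerModelConfig; the fields are illustrative assumptions.
type PowerModelConfig struct {
	UseEstimatorSidecar bool
	ModelName           string
}

// defaultPodTotalPowerModelConfig plays the role of today's hard-coded value.
var defaultPodTotalPowerModelConfig = PowerModelConfig{
	UseEstimatorSidecar: false,
	ModelName:           "default",
}

// PodTotalPowerModelConfig returns the default unless the hypothetical
// MODEL_CONFIG_POD_TOTAL_POWER environment variable overrides the model name.
func PodTotalPowerModelConfig() PowerModelConfig {
	cfg := defaultPodTotalPowerModelConfig
	if name, ok := os.LookupEnv("MODEL_CONFIG_POD_TOTAL_POWER"); ok && name != "" {
		cfg.ModelName = name
	}
	return cfg
}
```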
- Should the Operator for now just expose model weights and have an option to also enable online training as long as energy metrics are supported, or should the Operator just use the model server for exposing the models?
- If we want to enable online training, how do we intend to store the new models? Should they just be stored as PVs?