Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QAT: update readme for QAT #110

Closed
wants to merge 1 commit into from
Closed

QAT: update readme for QAT #110

wants to merge 1 commit into from

Conversation

mregmi
Copy link
Member

@mregmi mregmi commented Aug 17, 2023

No description provided.

@hershpa hershpa added documentation Improvements or additions to documentation qat QAT feature labels Aug 17, 2023
Copy link

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggestions. I also noticed https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/device_plugins/qat_device_plugin.yaml is not quite OK. maxNumDevices should be 128 and perhaps c6xxxvf can be dropped.


## Prerequisite
Before deploying the QAT plugin, make sure iommu MCO is deployed and iommu is enabled.
By default, half of the virtual functions are used for compression and the other half is used for cryptography. This default behaviour can be customised via a config map which configures the hardware using sysfs. The config values could be `asym`, `sym` and `dc`, for more details see https://github.com/torvalds/linux/blob/master/Documentation/ABI/testing/sysfs-driver-qat
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pointing to the latest sysfs-driver-qat is not accurate because the Linux 6.5 features are not available. Only sym;asym and dc should be valid on OCP today.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

```
oc create configmap --namespace=openshift-operators --from-literal "qat.conf=ServicesEnabled=<option>" qat-config
```
The option can be any combination of values described above
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my previous comment. this is also missing how to setup "per node" settings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

```

## verify QAT resource is available
QAT has two kinds of resource `qat.intel.com/cy` for cryptography and `qat.intel.com/dc` for compression. If the plugin deployment is a success, the resources can be seen using the commands below. Use /dc is compression is configured.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this section be written in a way that cyand dc are not explicitly mentioned. We might have to change these in the future when the 6.5 features become available. Maybe just say "look for qat.intel.com resources for compression and crypto"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified


```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/100-intel-qat-intel-iommu-on.yaml```

Note: This will reboot the worker nodes in the Intel QAT machine config pool one by one to turn on intel_iommu kernel parameter. Rebooting the node is not preferred A better solution is in the plan.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Note: This will reboot the worker nodes in the Intel QAT machine config pool one by one to turn on intel_iommu kernel parameter. Rebooting the node is not preferred A better solution is in the plan.
Note: This will reboot the worker nodes in the Intel QAT machine config pool one by one to turn on intel_iommu kernel parameter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the suggested change


For QAT, intel_iommu is a kernel parameter that needs to be turned on using the following command.

```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/100-intel-qat-intel-iommu-on.yaml```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could also add loading vfio-pci with the desired ids as parameters since we have to do the booting anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible for us to add the ids as the parameter for the container and set it through the operator? So we don't have to hard code the ids in in the init-container? :-)

@hershpa
Copy link
Contributor

hershpa commented Aug 18, 2023

A few suggestions. I also noticed https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/device_plugins/qat_device_plugin.yaml is not quite OK. maxNumDevices should be 128 and perhaps c6xxxvf can be dropped.

Hi Mikko, thanks for your review and suggestions. I presume Chaitanya used the sample here for qat device plugin reference.

Are there any implications of setting maxNumDevices to 128 i.e. some device ids do not support that level etc?

@mythi
Copy link

mythi commented Aug 21, 2023

Are there any implications of setting maxNumDevices to 128 i.e. some device ids do not support that level etc?

maxNumDevices limits how many QAT VF resources are made available from each node to the cluster. Depending on the platform and HW setup, different amount of VFs are available. We don't have a setting that says "use all available" so some big enough value is used to reflect that.

@hershpa
Copy link
Contributor

hershpa commented Aug 21, 2023

Are there any implications of setting maxNumDevices to 128 i.e. some device ids do not support that level etc?

maxNumDevices limits how many QAT VF resources are made available from each node to the cluster. Depending on the platform and HW setup, different amount of VFs are available. We don't have a setting that says "use all available" so some big enough value is used to reflect that.

Got it, thanks @mythi!

@mregmi
Copy link
Member Author

mregmi commented Aug 22, 2023

Are there any implications of setting maxNumDevices to 128 i.e. some device ids do not support that level etc?

maxNumDevices limits how many QAT VF resources are made available from each node to the cluster. Depending on the platform and HW setup, different amount of VFs are available. We don't have a setting that says "use all available" so some big enough value is used to reflect that.

Addressed this on a separate PR: #111

@uMartinXu
Copy link
Contributor

Are there any implications of setting maxNumDevices to 128 i.e. some device ids do not support that level etc?

maxNumDevices limits how many QAT VF resources are made available from each node to the cluster. Depending on the platform and HW setup, different amount of VFs are available. We don't have a setting that says "use all available" so some big enough value is used to reflect that.

@mythi I think we should have some kind of default value for this to use all the available number of VFs. The end user should not have to know and handle this HW detail.
We also need to do the boundary test to make sure the big enough value is well handled by the device plugin.

@hershpa hershpa added this to the v1.0.1 milestone Aug 22, 2023
For more details see https://intel.github.io/quickassist

## Prerequisite
Before deploying the QAT plugin, make sure iommu MCO is deployed and iommu is enabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to work on the PRs in MCO to enable this. and also updated the related readme. And then we can point users to the readme.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to enabling IOMMU without a reboot? I've not seen that being possible but it's worth to check.


## Prerequisite
Before deploying the QAT plugin, make sure iommu MCO is deployed and iommu is enabled.
By default, half of the virtual functions are used for compression and the other half is used for cryptography. This default behaviour can be customised via a config map which configures the hardware using sysfs. The config values could be `asym`, `sym` and `dc`, for more details see [cfg_services](https://github.com/torvalds/linux/blob/42e66b1cc3a070671001f8a1e933a80818a192bf/Documentation/ABI/testing/sysfs-driver-qat)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be the Prerequisite. I think it should be the feature Gap, I suggest to have an individual section in this document to introduce customers to the detail. And the link to this issue should be added in the section. And above link is good for the stand-alone RHEL server. But for OCP, the ender user should not directly operate the individual node.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be the Prerequisite. I think it should be the feature Gap,

Which part you are referring to?

The option can be any combination of values described above.
When using the operator for deploying the plugin with provisioning config, use provisioningConfig field for the name of the ConfigMap, then the config is passed to initcontainer through the volume mount.

There's also a possibility for a node specific congfiguration through passing a nodename via NODE_NAME into initcontainer's environment and passing a node specific profile (qat-$NODE_NAME.conf) via ConfigMap volume mount.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about whether we should expose the configmap detail to the OCP customers.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's totally fine to have it. The default setting without the configmap is also fine but if the cluster admin wants to opt-in for a custom per-node configuration, they can do so with the configmap and by setting the name of the confimap in QatDevicePlugin custom resource.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to put it into some kind of Advanced Configuration Section in the readme.

@@ -87,6 +87,23 @@ $ lsmod | grep ast
```
Ensure that ast driver is not loaded.

# Machine Configuration for Provisioning Intel QAT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to MCO. And I think we should remove the QAT machine config pool for simplicity purposes.


For QAT, intel_iommu is a kernel parameter that needs to be turned on using the following command.

```$ oc apply -f https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/machine_configuration/100-intel-qat-intel-iommu-on.yaml```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible for us to add the ids as the parameter for the container and set it through the operator? So we don't have to hard code the ids in in the init-container? :-)

@mythi
Copy link

mythi commented Aug 23, 2023

is it possible for us to add the ids as the parameter for the container and set it through the operator? So we don't have to hard code the ids in in the init-container?

The PCI IDS of the devices enabled by the cluster admin are passed to the initcontainer so you can have custom logic based on that.

@mythi
Copy link

mythi commented Aug 23, 2023

@mythi I think we should have some kind of default value for this to use all the available number of VFs. The end user should not have to know and handle this HW detail.

This flag is for purpose: QAT can also be used for "in kernel" functions through the Linux Kernel Crypto Framework. with maxNumDevices, cluster admins have the capability to leave some VFs for the "in kernel" functions (and other functions outside the cluster). Therefore, it cannot be "use everything" unconditionally.

We also need to do the boundary test to make sure the big enough value is well handled by the device plugin.

maxNumDevices is ignored after the plugin has processed all the QAT VFs it finds.

@uMartinXu
Copy link
Contributor

@mythi I think we should have some kind of default value for this to use all the available number of VFs. The end user should not have to know and handle this HW detail.

This flag is for purpose: QAT can also be used for "in kernel" functions through the Linux Kernel Crypto Framework. with maxNumDevices, cluster admins have the capability to leave some VFs for the "in kernel" functions (and other functions outside the cluster). Therefore, it cannot be "use everything" unconditionally.

We also need to do the boundary test to make sure the big enough value is well handled by the device plugin.

maxNumDevices is ignored after the plugin has processed all the QAT VFs it finds.
I am OK with manNumDevices parameter, my point is we should have a default value for users if they do not set it. The default value of maxNumber can be the max VF numbers the device can support. Besides, if we export this parameter and let user set this up, we should also let users know how many VF devices this device can support, something like https://doc.dpdk.org/guides/cryptodevs/qat.html Table 21.2 QAT device generations, devices and drivers.

@uMartinXu
Copy link
Contributor

@mythi I think we should have some kind of default value for this to use all the available number of VFs. The end user should not have to know and handle this HW detail.

This flag is for purpose: QAT can also be used for "in kernel" functions through the Linux Kernel Crypto Framework. with maxNumDevices, cluster admins have the capability to leave some VFs BTW, this implies that we will also continue to support he "Kernel Mode"? :-)

@mythi
Copy link

mythi commented Aug 23, 2023

BTW, this implies that we will also continue to support he "Kernel Mode"? :-)

no, that is different case.

@mythi
Copy link

mythi commented Aug 23, 2023

The default value of maxNumber can be the max VF numbers the device can support.

it's not always that simple. There's VFs/PF but there's also how many PFs are available. I understand what you are saying but we'll go with this for now.

@uMartinXu
Copy link
Contributor

The default value of maxNumber can be the max VF numbers the device can support.

it's not always that simple. There's VFs/PF but there's also how many PFs are available. I understand what you are saying but we'll go with this for now.

I think for this release it is good enough. Just hope we can optimize it in the future. We can file another issue and track it. :-)

@mythi
Copy link

mythi commented Aug 24, 2023

The default value of maxNumber can be the max VF numbers the device can support.

it's not always that simple. There's VFs/PF but there's also how many PFs are available. I understand what you are saying but we'll go with this for now.

I think for this release it is good enough. Just hope we can optimize it in the future. We can file another issue and track it. :-)

Yes, future is fine when we also get to hear end-user feedback.

@hershpa hershpa added the machine_config Machine Configuration label Aug 30, 2023
@hershpa
Copy link
Contributor

hershpa commented Aug 30, 2023

@mregmi, the machine config readme documentation needs to be updated since we are now not using QAT custom machine config pool. Also, please add a section about the new machine config to load vfio-pci. Thanks!

@hershpa
Copy link
Contributor

hershpa commented Aug 30, 2023

@chaitanya1731, can you review the device plugins readme. Thanks!

Signed-off-by: Manish Regmi <[email protected]>
@uMartinXu
Copy link
Contributor

some other PRS will be submitted to replace this PR. So close this one.

@uMartinXu uMartinXu closed this Sep 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation machine_config Machine Configuration qat QAT feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants