Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVE Implementation for Level-1 BLAS Routines #4959

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

CDAC-SSDG
Copy link

We have optimized Level-1 BLAS routines (scal, swap, and rot) utilizing ARM SVE, resulting in significant performance enhancements in OpenBLAS on two variants of the A64FX—FUJITSU PRIMEHPC FX700 and the FUGAKU supercomputer. These optimizations achieved performance improvements ranging from 1.80x to 4x through effective code vectorization. This research has been accepted as a full paper and presented at the 28th Annual IEEE High Performance Extreme Computing (HPEC) Conference in September 2024, under the title "Optimization Strategies to Accelerate BLAS Operations with ARM SVE."

@Mousius
Copy link
Contributor

Mousius commented Oct 30, 2024

Hiya, have you tested the impact on Graviton 3/4?

@martin-frbg
Copy link
Collaborator

Thank you very much for this revised PR, I'm looking forward to the HPEC2024 proceedings becoming available.
The CI results so far suggest that
(1)Apple Clang is once again being silly about ambiguous SVE intrinsics, probably requiring a few type casts for the arguments like in #4140
and
(2) the new SCAL kernels may need to handle the dummy2 argument that has recently been (ab)used to signal whether to propagate INF and NAN (not wanted for internal uses of SCAL, but now expected when SCAL gets called from user code - this is probably the cause of the failures in openblas_utest and openblas_utest_ext)


static int rot_kernel_sve(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT c, FLOAT s)
{
for (int i = 0; i < n; i += SVE_WIDTH)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make i a BLASLONG here please, and adjust the casts in the SVE_WHILELT to uint64_t accordingly ?

#define SVE_WIDTH svcntw()
#endif

static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLASLONG n ?


static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
{
for (int i = 0; i < n; i += SVE_WIDTH)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make i a BLASLONG here too, please

{
for (int i = 0; i < n; i += SVE_WIDTH)
{
svbool_t pg = SVE_WHILELT(i, n);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add uint64_t casts for i and n here please

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see kernel/arm/scal.c for letting "dummy2" decide whether to propagate NaN and Inf values - probably there is a more elegant solution than what I put there, otherwise just copy that file

{
svbool_t pg = SVE_WHILELT(i, n);
SVE_TYPE x_vec = svld1(pg, &x[i]);
SVE_TYPE result = svmul_z(pg, x_vec, da);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually unsure here if svmul_z will "do the right thing" concerning NaN or Inf arguments in x_vec or da

@garadeaniket
Copy link

Thank You @martin-frbg for suggestions. we will do the required modifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants