SVE Implementation for Level-1 BLAS Routines #4959

CDAC-SSDG · 2024-10-30T08:56:23Z

We have optimized the Level-1 BLAS routines scal, swap, and rot in OpenBLAS using ARM SVE. On two variants of the A64FX, the FUJITSU PRIMEHPC FX700 and the FUGAKU supercomputer, code vectorization yields speedups ranging from 1.80x to 4x. This research was accepted as a full paper and presented at the 28th Annual IEEE High Performance Extreme Computing (HPEC) Conference in September 2024, under the title "Optimization Strategies to Accelerate BLAS Operations with ARM SVE."

Updated KERNEL.ARMV8SVE for the Level-1 SVE kernels (swap, rot, and scal).

Mousius · 2024-10-30T10:51:27Z

Hiya, have you tested the impact on Graviton 3/4?

martin-frbg · 2024-10-30T10:52:48Z

Thank you very much for this revised PR, I'm looking forward to the HPEC2024 proceedings becoming available.
The CI results so far suggest that
(1) Apple Clang is once again being silly about ambiguous SVE intrinsics, probably requiring a few type casts for the arguments, as in #4140, and
(2) the new SCAL kernels may need to handle the dummy2 argument, which has recently been (ab)used to signal whether to propagate INF and NAN (not wanted for internal uses of SCAL, but now expected when SCAL gets called from user code). This is probably the cause of the failures in openblas_utest and openblas_utest_ext.

martin-frbg · 2024-10-31T14:59:22Z

kernel/arm64/rot_kernel_sve.c

+
+static int rot_kernel_sve(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT c, FLOAT s)
+{
+       for (int i = 0; i < n; i += SVE_WIDTH)


Can you make i a BLASLONG here please, and adjust the casts in the SVE_WHILELT to uint64_t accordingly?
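A minimal sketch of what the requested change could look like for the single-precision case. Everything here is illustrative: `rot_kernel_sketch` is a hypothetical name, `BLASLONG` is typedef'd locally as a stand-in for the OpenBLAS type, the PR's `SVE_WHILELT` macro is replaced by the explicit `svwhilelt_b32_u64` intrinsic, and a scalar fallback is included only so the snippet also builds on machines without SVE.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

typedef long BLASLONG; /* assumption: local stand-in for OpenBLAS's BLASLONG */

/* Hypothetical float-only kernel applying the plane rotation
 *   x[i] = c*x[i] + s*y[i]
 *   y[i] = c*y[i] - s*x[i]
 * with a BLASLONG loop counter and uint64_t casts on the predicate bounds. */
static int rot_kernel_sketch(BLASLONG n, float *x, float *y, float c, float s)
{
#if defined(__ARM_FEATURE_SVE)
    for (BLASLONG i = 0; i < n; i += (BLASLONG)svcntw()) {
        /* Lanes at or beyond n are inactive, so no scalar tail loop is needed. */
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t xv = svld1_f32(pg, &x[i]);
        svfloat32_t yv = svld1_f32(pg, &y[i]);
        svfloat32_t xn = svadd_f32_z(pg, svmul_n_f32_z(pg, xv, c),
                                         svmul_n_f32_z(pg, yv, s));
        svfloat32_t yn = svsub_f32_z(pg, svmul_n_f32_z(pg, yv, c),
                                         svmul_n_f32_z(pg, xv, s));
        svst1_f32(pg, &x[i], xn);
        svst1_f32(pg, &y[i], yn);
    }
#else
    /* Portable fallback with the same arithmetic, used when SVE is unavailable. */
    for (BLASLONG i = 0; i < n; i++) {
        float xi = x[i], yi = y[i];
        x[i] = c * xi + s * yi;
        y[i] = c * yi - s * xi;
    }
#endif
    return 0;
}
```

Keeping `i` and `n` the same width avoids truncating long vectors, and casting both to `uint64_t` selects one specific `whilelt` variant instead of leaving the choice to overload resolution.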

martin-frbg · 2024-10-31T15:01:11Z

kernel/arm64/scal_kernel_sve.c

+#define SVE_WIDTH svcntw()
+#endif
+
+static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)


BLASLONG n?

martin-frbg · 2024-10-31T15:01:43Z

kernel/arm64/scal_kernel_sve.c

+
+static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
+{
+  for (int i = 0; i < n; i += SVE_WIDTH)


make i a BLASLONG here too, please

martin-frbg · 2024-10-31T15:05:37Z

kernel/arm64/scal_kernel_sve.c

+{
+  for (int i = 0; i < n; i += SVE_WIDTH)
+  {
+    svbool_t pg = SVE_WHILELT(i, n);


add uint64_t casts for i and n here please
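Taken together, the three requests for this file (BLASLONG n, BLASLONG i, uint64_t casts) might be sketched as follows. This is an assumption-laden illustration, not the PR's code: `scal_kernel_sketch` is a hypothetical name, `BLASLONG` is a local stand-in typedef, only the float case is shown, and the non-overloaded intrinsic names are spelled out in place of the PR's `SVE_WHILELT`, `svld1`, and `svmul_z` forms. The scalar branch exists only so the snippet compiles without SVE.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

typedef long BLASLONG; /* assumption: local stand-in for OpenBLAS's BLASLONG */

/* Hypothetical float-only kernel computing x[i] = da * x[i] for 0 <= i < n. */
static int scal_kernel_sketch(BLASLONG n, float *x, float da)
{
#if defined(__ARM_FEATURE_SVE)
    for (BLASLONG i = 0; i < n; i += (BLASLONG)svcntw()) {
        /* Both bounds are cast to the same unsigned 64-bit type, so the
         * predicate is built by exactly one whilelt variant. */
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t xv = svld1_f32(pg, &x[i]);
        svst1_f32(pg, &x[i], svmul_n_f32_z(pg, xv, da));
    }
#else
    /* Portable fallback used when SVE is unavailable. */
    for (BLASLONG i = 0; i < n; i++)
        x[i] = da * x[i];
#endif
    return 0;
}
```

With an `int` counter and a wider `n`, the overloaded `svwhilelt_b32(i, n)` form receives mixed argument types, which is the kind of call some compilers reject as ambiguous; matching explicit casts sidestep that.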

martin-frbg · 2024-10-31T15:09:13Z

kernel/arm64/scal_kernel_c.c

Please see kernel/arm/scal.c for how "dummy2" decides whether to propagate NaN and Inf values. There is probably a more elegant solution than what I put there; otherwise, just copy that file.
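For readers without that file at hand, the idea can be sketched roughly as below. This is not a copy of kernel/arm/scal.c: the function name, the reduced argument list, and the convention that a nonzero flag means "propagate" are all assumptions made for illustration, and `BLASLONG` is a local stand-in typedef.

```c
#include <math.h>

typedef long BLASLONG; /* assumption: local stand-in for OpenBLAS's BLASLONG */

/* Hypothetical reference scal. The `propagate` argument plays the role
 * described for dummy2 in this thread:
 *   propagate == 0: internal use, da == 0 forces a clean zero result
 *   propagate != 0: user call, plain multiply so NaN/Inf in x survive */
static int scal_ref_sketch(BLASLONG n, float da, float *x, BLASLONG inc_x,
                           BLASLONG propagate)
{
    if (n <= 0 || inc_x <= 0)
        return 0;

    for (BLASLONG j = 0, i = 0; j < n; j++, i += inc_x) {
        if (da == 0.0f && !propagate)
            x[i] = 0.0f;      /* shortcut: overwrite whatever was stored */
        else
            x[i] = da * x[i]; /* IEEE multiply: 0*NaN and 0*Inf give NaN */
    }
    return 0;
}
```

The two branches differ only when da is zero, which is why a kernel that always takes the zeroing shortcut can pass ordinary tests and still fail the NaN/Inf cases.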

martin-frbg · 2024-10-31T15:18:06Z

kernel/arm64/scal_kernel_sve.c

+  {
+    svbool_t pg = SVE_WHILELT(i, n);
+    SVE_TYPE x_vec = svld1(pg, &x[i]);
+    SVE_TYPE result = svmul_z(pg, x_vec, da);


I'm actually unsure here whether svmul_z will "do the right thing" concerning NaN or Inf arguments in x_vec or da.
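One way to settle this empirically is to run a kernel on special values and compare it element by element with a plain scalar multiply. The sketch below is a hypothetical harness, not part of OpenBLAS or its test suite: `scal_matches_ieee`, `plain_mul_kernel`, and `zero_shortcut_kernel` are illustrative names, and the kernel signature is assumed to match the `(n, x, da)` form shown in the diff. As far as the intrinsic naming goes, the `_z` suffix only governs inactive lanes (they are zeroed); whether active lanes match scalar IEEE behaviour is exactly what such a check would confirm on the target hardware.

```c
#include <math.h>

typedef long BLASLONG; /* assumption: local stand-in for OpenBLAS's BLASLONG */
typedef int (*scal_kernel_fn)(BLASLONG n, float *x, float da);

/* Treat two NaNs as equal; otherwise use ordinary comparison
 * (the sign of zero is deliberately ignored). */
static int same_value(float a, float b)
{
    return (isnan(a) && isnan(b)) || a == b;
}

/* Returns 1 if `kernel` agrees with scalar multiplication on a vector
 * containing finite values, NaN, and both infinities; 0 otherwise. */
static int scal_matches_ieee(scal_kernel_fn kernel, float da)
{
    const float in[] = {1.0f, -2.5f, 0.0f, NAN, INFINITY, -INFINITY, 3.0f};
    const BLASLONG n = (BLASLONG)(sizeof in / sizeof in[0]);
    float got[sizeof in / sizeof in[0]];

    for (BLASLONG i = 0; i < n; i++)
        got[i] = in[i];
    kernel(n, got, da);

    for (BLASLONG i = 0; i < n; i++) {
        float want = da * in[i];
        if (!same_value(got[i], want))
            return 0;
    }
    return 1;
}

/* Example kernel 1: plain multiply, expected to match for every da. */
static int plain_mul_kernel(BLASLONG n, float *x, float da)
{
    for (BLASLONG i = 0; i < n; i++)
        x[i] = da * x[i];
    return 0;
}

/* Example kernel 2: zeroing shortcut, expected to differ when da == 0. */
static int zero_shortcut_kernel(BLASLONG n, float *x, float da)
{
    for (BLASLONG i = 0; i < n; i++)
        x[i] = (da == 0.0f) ? 0.0f : da * x[i];
    return 0;
}
```

Passing an SVE kernel to `scal_matches_ieee` with da set to 0, NaN, and Inf would show directly whether it propagates special values the way user-facing SCAL is now expected to.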

garadeaniket · 2024-11-04T03:58:24Z

Thank you @martin-frbg for the suggestions. We will make the required modifications.

CDAC-SSDG and others added 9 commits October 30, 2024 13:57

Update CONTRIBUTORS.md

2718b37

Added optimized scal routine files

0667cf6

Added sve optimized kernels for swap routine

b8bc2a7

Added sve kernels for rot routine.

7822ae9

Update KERNEL.ARMV8SVE

fa880ab

updated KERNEL.ARMV8SVE for level 1 sve (swap, rot and scal) kernels.

Delete kernel/arm64/rot.c

668e28a

Delete kernel/arm64/rot_kernel_c.c

d90ee00

Delete kernel/arm64/rot_kernel_sve.c

012fe4d

Add files via upload

3b2421c

martin-frbg reviewed Oct 31, 2024
