[DRAFT] First version of fusion optimizations for transformers #1938

Draft · wants to merge 24 commits into main
Conversation

gramalingam (Collaborator) commented:
  • Introduce fusion rules for SdpaAttention, RMS Normalization, Skip Normalization, Rotary Embedding, and Multi Head Attention (a sketch of one such rule follows the TODO list below)
  • Replace Expand with Identity when applicable (in core optimization)
  • Clean up the Dropout-to-Identity replacement for the case where Dropout has a mask output
  • Make repeated (redundant) calls to the inliner efficient

Still TODO:

  • Multi Head Attention requires extra validation conditions
  • Need to clean up the use of "local" sub-patterns
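
For illustration, here is a minimal sketch of what one of these fusion rules can look like in the rewriter's pattern style, using RMS Normalization as the example. The decomposition matched, the attribute handling, and the fused-op domain are assumptions of this sketch, not the PR's actual code:

    # Hedged sketch (not the PR's code): match the decomposed
    # RMS-normalization subgraph and rewrite it to ORT's fused kernel.
    from onnxscript.rewriter import pattern

    def _rms_norm_pattern(op, x, axes, weight, epsilon):
        # x / sqrt(mean(x^2) + epsilon) * weight -- the decomposed form
        variance = op.ReduceMean(op.Pow(x, 2.0), axes, keepdims=1)
        inv_std = op.Reciprocal(op.Sqrt(op.Add(variance, epsilon)))
        return op.Mul(op.Mul(x, inv_std), weight)

    def _rms_norm_replacement(op, x, weight, epsilon):
        # SimplifiedLayerNormalization is ORT's fused RMS-norm op; the
        # domain string and turning the matched epsilon constant into an
        # attribute are glossed over here (assumptions of this sketch).
        return op.SimplifiedLayerNormalization(
            x, weight, axis=-1, epsilon=epsilon, _domain="com.microsoft"
        )

    rms_norm_rule = pattern.RewriteRule(_rms_norm_pattern, _rms_norm_replacement)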

@gramalingam gramalingam marked this pull request as draft November 9, 2024 01:23

codecov bot commented Nov 9, 2024

❌ 10 Tests Failed:

Tests completed   Failed   Passed   Skipped
10004             10       9994     3720
View the top 1 failed tests by shortest run time
::onnxscript.rewriter.onnxruntime.xformers._optimize_transformers_test
Stack Traces | 0s run time
No failure message available
View the full list of 2 ❄️ flaky tests
tests.eager_mode_test.TestEagerModeArguments_0_reference_runtime::test_function_input_and_attribute_by_kwargs_out_of_order

Flake rate in main: 39.26% (Passed 7590 times, Failed 4905 times)

Stack Traces | 0.004s run time
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:91: in run
    res = self._run(x, y)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:139: in _run
    res = (convert_from_ml_dtypes(res[0]),)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\custom_element_types.py:50: in convert_from_ml_dtypes
    return array.view(dtype=dtype)
E   ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

The above exception was the direct cause of the following exception:
tests\eager_mode_test.py:115: in test_function_input_and_attribute_by_kwargs_out_of_order
    self.assertEqual(add_with_alpha(alpha=3.0, other=2.0, this=1.0), 7.0)
onnxscript\values.py:576: in __call__
    return evaluator.default().eval_function(self, args, kwargs)
onnxscript\evaluator.py:307: in eval_function
    result = function.function(*adapted_args, **adapted_kwargs)
tests\eager_mode_test.py:59: in add_with_alpha
    other = op.Mul(other, alpha)
onnxscript\onnx_opset\_impl\opset14.py:696: in Mul
    return op(*self._prepare_inputs(schema, A, B))
onnxscript\values.py:304: in __call__
    return evaluator.default().eval(schema, args, kwargs)
onnxscript\evaluator.py:194: in eval
    outputs = self._eval(schema, inputs, attributes, closure)
onnxscript\evaluator.py:524: in _eval
    result = session.run(None, session_run_input)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\reference_evaluator.py:599: in run
    outputs = node.run(*inputs, **linked_attributes)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:114: in run
    res = OpRunBinary.run(self, x, y)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:93: in run
    raise TypeError(
E   TypeError: Issues with types <class 'numpy.ndarray'>, <class 'numpy.ndarray'> (binary operator 'Mul').
tests.eager_mode_test.TestEagerModeArguments_0_reference_runtime::test_function_all_input_by_kwargs

Flake rate in main: 39.26% (Passed 7590 times, Failed 4905 times)

Stack Traces | 0.004s run time
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:91: in run
    res = self._run(x, y)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:139: in _run
    res = (convert_from_ml_dtypes(res[0]),)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\custom_element_types.py:50: in convert_from_ml_dtypes
    return array.view(dtype=dtype)
E   ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

The above exception was the direct cause of the following exception:
tests\eager_mode_test.py:109: in test_function_all_input_by_kwargs
    self.assertEqual(add_with_alpha(this=1.0, other=2.0), 3.0)
onnxscript\values.py:576: in __call__
    return evaluator.default().eval_function(self, args, kwargs)
onnxscript\evaluator.py:307: in eval_function
    result = function.function(*adapted_args, **adapted_kwargs)
tests\eager_mode_test.py:59: in add_with_alpha
    other = op.Mul(other, alpha)
onnxscript\onnx_opset\_impl\opset14.py:696: in Mul
    return op(*self._prepare_inputs(schema, A, B))
onnxscript\values.py:304: in __call__
    return evaluator.default().eval(schema, args, kwargs)
onnxscript\evaluator.py:194: in eval
    outputs = self._eval(schema, inputs, attributes, closure)
onnxscript\evaluator.py:524: in _eval
    result = session.run(None, session_run_input)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\reference_evaluator.py:599: in run
    outputs = node.run(*inputs, **linked_attributes)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:114: in run
    res = OpRunBinary.run(self, x, y)
.nox\test_torch_nightly\Lib\site-packages\onnx\reference\ops\_op.py:93: in run
    raise TypeError(
E   TypeError: Issues with types <class 'numpy.ndarray'>, <class 'numpy.ndarray'> (binary operator 'Mul').
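
For context on these flaky failures: numpy only allows ndarray.view to reinterpret a 0-dimensional array when the new dtype has the same itemsize, which is the rule convert_from_ml_dtypes trips over. A minimal reproduction of the numpy behavior (the exact dtype pair in the test is not shown in the trace):

    import numpy as np

    scalar = np.asarray(1.0, dtype=np.float32)  # 0-d array, itemsize 4
    scalar.view(np.int32)                       # OK: itemsize unchanged
    try:
        scalar.view(np.float16)                 # itemsize 4 -> 2 on a 0-d array
    except ValueError as err:
        print(err)  # "Changing the dtype of a 0d array is only supported ..."

    # Reshaping to 1-d first sidesteps the restriction:
    scalar.reshape(1).view(np.float16)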


The last two axes of the key-embedding are then swapped (using a Reshape/Transpose/Reshape sequence)

The dot-product attention is then computed using SDPA

Check warning — Code scanning / lintrunner
EDITORCONFIG-CHECKER/editorconfig Warning: trailing whitespace
RUFF/W293 Warning: whitespace on blank line
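
As a reference for the pattern being matched, a plain-numpy sketch of scaled dot-product attention with the key transpose called out (the names and the absence of masking/causal handling are simplifications, not the PR's code):

    import numpy as np

    def sdpa(query, key, value):
        # The Reshape/Transpose/Reshape sequence in the matched subgraph
        # amounts to swapping the key's last two axes so Q @ K^T is formed.
        key_t = np.swapaxes(key, -1, -2)
        scores = query @ key_t / np.sqrt(query.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ value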



def _skip_normalization(op, input, skip, gamma, epsilon, stash_type):
    normalized, mean, inv_std_var, skip_sum = op.SkipSimplifiedLayerNormalization(

Check warning — Code scanning / lintrunner
RUFF/F841 Warning: local variable mean is assigned to but never used.
RUFF/F841 Warning: local variable inv_std_var is assigned to but never used.
See https://docs.astral.sh/ruff/rules/unused-variable
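
The call quoted above is truncated in the diff view; as a hedged guess at its overall shape (assuming the rewriter's _outputs/_domain conventions and ORT's SkipSimplifiedLayerNormalization contrib op; argument names are illustrative):

    def _skip_normalization(op, input, skip, gamma, epsilon, stash_type):
        normalized, mean, inv_std_var, skip_sum = op.SkipSimplifiedLayerNormalization(
            input,
            skip,
            gamma,
            epsilon=epsilon,
            _outputs=4,
            _domain="com.microsoft",
        )
        # Only two outputs are consumed downstream; mean and inv_std_var
        # are what the RUFF/F841 warnings above flag as unused.
        return normalized, skip_sum
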
if len(node.outputs) == 1:
    return output
else:
    true_tensor = onnx.helper.make_tensor("true", onnx.TensorProto.BOOL, [1], [True])

Collaborator commented:

Is this IR? If so:

Suggested change
true_tensor = onnx.helper.make_tensor("true", onnx.TensorProto.BOOL, [1], [True])
true_tensor = ir.tensor([True])

gramalingam (Collaborator, author) commented Nov 11, 2024:

Thanks. But when I look at the signature here, it is not clear this is supported. The example illustrates it, though. I see it eventually calls the np.array constructor if nothing else works, so I understand it now.

Collaborator commented:

Good catch. We can update the signature.

justinchuby (Collaborator) commented Nov 15, 2024:

This is actually covered by npt.ArrayLike (the first)

gramalingam (Collaborator, author) commented:

Actually, I tried and it failed, rejecting a list. BTW, I have moved the independent parts of this PR into a separate PR: #1947

Collaborator commented:

Hmm. I need to fix that then.
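
For reference, the two spellings discussed in this thread side by side. Whether ir.tensor accepts a plain Python list (via npt.ArrayLike, ultimately falling back to np.array) is exactly the point under discussion above, so treat the second line as the intended behavior rather than a guarantee:

    import onnx
    from onnxscript import ir

    # Verbose spelling via the onnx helper:
    t1 = onnx.helper.make_tensor("true", onnx.TensorProto.BOOL, [1], [True])

    # Suggested spelling via the IR; falls back to np.array for a plain list:
    t2 = ir.tensor([True])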

@titaiwangms titaiwangms self-requested a review November 12, 2024 17:56
@@ -0,0 +1,38 @@
# Copyright (c) Microsoft Corporation.

Check warning — Code scanning / lintrunner
RUFF-FORMAT/format Warning: run lintrunner -a to apply this patch.

@@ -0,0 +1,152 @@
# Copyright (c) Microsoft Corporation.

Check warning — Code scanning / lintrunner
RUFF-FORMAT/format Warning: run lintrunner -a to apply this patch.