
Don't disable uneven k to support more headdims #21

Open
wants to merge 1 commit into main

Conversation

njhill (Member) commented Sep 27, 2024

This was originally disabled by @WoosukKwon in eee8e47.

WoosukKwon commented Sep 30, 2024

Adding the flag increased the wheel size of vllm-flash-attn from 107 MB to 143 MB (I'm not sure whether this is measured before compression).

njhill (Member, Author) commented Sep 30, 2024

Thanks @WoosukKwon, I guess that's fairly significant.

The motivation here is that we currently depend on flash attention for certain features like multi-step scheduling, so there's a significant performance downside for models with other head sizes (in particular, some of the IBM Granite models use head_dim 80).

WoosukKwon commented
@njhill We can just add head size 80 to

HEAD_DIMENSIONS = [32, 64, 96, 128, 160, 192, 224, 256]

Would that work for you?
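For concreteness, here is a minimal sketch of that suggested change, assuming HEAD_DIMENSIONS is the list the build uses to decide which head dimensions get kernels compiled; the helper function and surrounding code are illustrative only, not the actual vllm-flash-attn build script:

```python
# Illustrative sketch only: the real HEAD_DIMENSIONS list lives in the
# vllm-flash-attn build/codegen configuration. The proposed change is simply
# to add 80 to the list.
HEAD_DIMENSIONS = [32, 64, 80, 96, 128, 160, 192, 224, 256]

def supported_head_dim(head_dim: int) -> bool:
    """Hypothetical helper: True if a kernel was compiled for this head dimension."""
    return head_dim in HEAD_DIMENSIONS

assert supported_head_dim(80)  # IBM Granite models (head_dim 80) would now be covered
```

Each additional entry instantiates another set of kernel templates, which is why the wheel size is sensitive to the length of this list.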

njhill (Member, Author) commented Sep 30, 2024

@WoosukKwon yes, I think for our immediate needs that would be great, if you're sure that will be sufficient!

njhill (Member, Author) commented Oct 2, 2024

@WoosukKwon I've opened #22 for this

WoosukKwon commented
@njhill Since the idea above doesn't work, can you check how much this PR affects the vllm wheel size?
