The current implementation only supports the Decode function, which lags behind popular repos such as llama.cpp.
The Prefill stage should be implemented with batched computation (GEMM) and should adopt a mature communication backend, e.g., OpenMPI.
This should speed up the whole process, because computing the entire prompt in one batch is faster than decoding its tokens one by one.
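A minimal NumPy sketch of the idea (toy sizes, single weight matrix; no attention or KV cache): prefill pushes all prompt tokens through a layer with one GEMM, while decode-style processing issues one GEMV per token. The results are identical, but the batched form exposes far more parallelism to the hardware.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                        # hypothetical prompt length and hidden size
X = rng.standard_normal((T, d))     # embeddings of all prompt tokens
W = rng.standard_normal((d, d))     # one layer's weight matrix

# Prefill: a single GEMM over the whole prompt.
prefill_out = X @ W                 # shape (T, d)

# Decode-style: one vector-matrix product per token, as in the
# current decode-only implementation.
decode_out = np.stack([X[t] @ W for t in range(T)])

assert np.allclose(prefill_out, decode_out)
```

Both paths produce the same activations; the difference is purely throughput, which is why llama.cpp and similar projects process the prompt in batches during prefill.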