(Crashing on Low Memory SBC) main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0 #59
I think a smaller model is the way to go for the RasPi 3. The converter needs to be adjusted a bit and it should work. I'll look at it soon.
ballin
apparently i should be able to use llama.cpp and mpi with rpi3b+.
llama.cpp uses pipeline parallelism, which produces high throughput only when the batch size is large. Moreover, the MPI backend is broken after a certain commit. That's why we are here.
alright good. i think that means i'm in the right place. i will be testing these SBC devices mostly, but frequently, if i can manage to get a database to load. when discord?
The first version of a general HF converter is here. You can try it. So far I tested it only with TinyLlama-1.1B:
k brb
seems like no dice?
This console message got cut off:
i also tried with https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
No, I don't think that would matter.
Have you rebuilt the 'dllama' app?
This has caught me by surprise before; that could likely be the case.
yes its
You need to build the version from the pull request.
git fetch origin pull/62/head:feat/convert-hf
Or use the GitHub CLI. It's not yet merged into the main branch.
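A minimal sketch of fetching and building that pull request, assuming the project's usual make-based build (the target name dllama is an assumption; adjust to whatever your checkout's Makefile provides):

```sh
# Fetch PR #62 into a local branch and switch to it
git fetch origin pull/62/head:feat/convert-hf
git checkout feat/convert-hf

# Rebuild the app so it understands the new converter's output
# (assumes a make target named 'dllama')
make dllama
```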
bueno 🎉
i tried it but i'm getting some garble:
Could you try to run the 'inference' mode? Maybe the chat mode is broken for TinyLlama.
am i able to change the ip? does it default to 127.0.0.1? |
@unclemusclez sorry, I don't understand your question. I meant this command:
./dllama inference --model dllama_tinylama_q40.bin --tokenizer dllama_tinyllama.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --steps 32 --prompt "hello world"
it was giving me a can't-connect error with the example script. it was refusing connections on its static ip, but it connected to other nodes and could be contacted for file sharing, etc. I was trying to execute it remotely. local result:
Have you converted the correct tokenizer? You should convert this:
Last lines of the output from the converter:
Your output is different.
where are you getting the
The 0.7.0 version introduced the
Have you regenerated the tokenizer, and are you sure that you are using the correct one?
there is a problem with LFS downloads on Windows, so i wget the large files to the same directory.
if the 0.7.0 version was just introduced, i must have done something wrong. am i supposed to be using the PR of the earlier version?
i am using a 64-bit kernel of headless Ubuntu 22.04 BTW. Should i be using the HF image / 32-bit?
Now you can use the
You should be able to convert on any machine. I think you should download all the files again from HF (you can download them by using a browser) and run the conversion once again. Be 100% sure you are converting the downloaded files.
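If Git LFS keeps misbehaving, a hedged sketch of pulling the files directly with wget instead (the file names below are assumptions; check the 'Files' tab of the model repository for the actual list):

```sh
# Hypothetical example: fetch the TinyLlama files directly, bypassing Git LFS.
# Replace the file names with whatever the HF repo actually lists.
BASE=https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main
wget "$BASE/config.json"
wget "$BASE/tokenizer.model"
wget "$BASE/model.safetensors"
```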
i think you are correct, i am redoing it all over right now.
fresh everything, same deal
Could you try to run this model and this tokenizer on your computer (single machine)?
@unclemusclez you can try to use a new feature: the model downloader.
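A hedged sketch of what using the downloader might look like, assuming it is driven by the repository's launch.py script; the preset name below is taken from a command later in this thread and may differ in your checkout:

```sh
# Hypothetical invocation of the model downloader
# (script and preset names are assumptions; it fetches the
# pre-converted model and tokenizer into models/)
python3 launch.py tinylama_1.1b_3t_q40
```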
I'm going to run the same test now on my side to check what's up
The issue is that you didn't run it as sudo.
With sudo:
Without sudo:
🔶 G 63 ms I 60 ms T 1 ms S 0 kB R 0 kB *
🔶 G 44 ms I 44 ms T 0 ms S 0 kB R 0 kB *
Truthfully, we could probably just have it allocate the buffer on the heap, using the vector approach I used for Windows support, if it is not running as sudo.
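A minimal C++ sketch of that fallback idea (not the project's actual code): try to mmap and mlock the buffer, and fall back to a plain heap allocation when locking fails, e.g. when not running as root or when RLIMIT_MEMLOCK is low:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>
#include <sys/mman.h>

// Hypothetical helper: prefer a locked, mmap'ed buffer, but fall back to a
// heap-backed std::vector when locking is not permitted (no sudo).
static std::vector<char> fallbackBuffer;

void* allocateBuffer(size_t bytes) {
    void* ptr = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr != MAP_FAILED) {
        if (mlock(ptr, bytes) == 0) {
            return ptr; // locked in RAM, will not be swapped out
        }
        munmap(ptr, bytes); // could not lock; release and fall back
    }
    fprintf(stderr, "mlock failed, falling back to heap allocation\n");
    fallbackBuffer.resize(bytes);
    return fallbackBuffer.data();
}
```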
Confirmed, I can now run dllama without sudo; the irony is that it's part of the Windows support PR.
./dllama inference --model /mnt/d/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer /mnt/d/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --steps 64 --prompt "Hello world"
Are your worker nodes also running the same version? I pulled the latest version from git, built from source, used the downloader to download TinyLlama, and ran it as per the instructions, and mine worked just fine. The only difference I could spot was that you were running with additional workers. Possible reasons I can think of are that one or more nodes are running older versions of dllama, or that some ARM-specific code broke in a recent pull request, though I doubt that's the case. The workflows test for functionality on both ARM and x86 processor architectures, but they don't exactly test the multiple-worker functionality, so it might be something that's broken only in a multi-node setup, or it could just be that you didn't update the nodes to the latest version.
i compile on the 3b+ and then
That's so strange. I just did a test with multiple workers, running from the same machine instead of multiple machines, though it's x86 and not ARM.
Root:
Worker:
Both running from the same machine inside WSL. I unfortunately don't have any ARM hardware to test with currently, but it could be related to that.
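For reference, a hedged sketch of what such a local two-node test might look like, reusing the TinyLlama files mentioned earlier in this thread; the port is a placeholder and the exact worker flags may differ between versions:

```sh
# Terminal 1 - start a worker on the same machine (placeholder port)
./dllama worker --port 9998 --nthreads 4

# Terminal 2 - start the root node and point it at the local worker
./dllama inference --model dllama_tinylama_q40.bin --tokenizer dllama_tinyllama.t \
  --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --steps 32 \
  --prompt "hello world" --workers 127.0.0.1:9998
```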
Another test:
sudo nice -n 20 ./dllama inference --model ~/distributed-llama/models/tinylama_1.1b_3t_q40/dllama_model_tinylama_1.1b_3t_q40.m --tokenizer ~/distributed-llama/models/tinylama_1.1b_3t_q40/dllama_tokenizer_tinylama_1.1b_3t_q40.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --steps 64 --prompt "Python is a programming language that" --workers 127.0.0.1:11211
I'm going to check if I can spin up a VM on Azure to test out whether it's maybe an ARM-specific issue.
WSL HOST:
I just created an EC2 ARM VM and ran the same test there; it worked perfectly fine.
Perhaps try just the WSL root node, then add workers one at a time. Perhaps it's a problem with a single worker that's affecting the others. Either way, something strange is going on.
4 workers work, 8 do not. This was the same with WSL as the inference node and with the Pi as the inference node. On WSL, however, you can see that it's actually saying "overflow" when 8 are run. Intriguing. From above:
4x working:
Could you try to run 8 workers but with a single thread?
He could also try running funcs-test on all the Pis.
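A hedged sketch, assuming funcs-test is a binary produced by a make target of the same name (an assumption; check the Makefile in your checkout):

```sh
# On each Pi: build and run the function tests (target name is an assumption)
make funcs-test
./funcs-test
```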
I reproduced the problem. 8 nodes with 4 threads generate spaghetti. I'll look at this.
Update: the same happens with 8 nodes with 1 thread:
Update: this problem appears with TinyLlama. Llama 3 8B works OK.
https://huggingface.co/keeeeenw/MicroLlama/tree/main i was looking into this but there is no
If there were some external documentation i could refer to, i would try to work with some other lightweight models that might fit in the 1GB of memory. I just got some 2GB SBCs in the mail, so i could try to mix and match a bit to meet the memory demands of Llama 3.
@unclemusclez the mystery is solved. The TinyLlama has
TinyLlama seems to work now, so I'm closing this issue.
Is there any way that main and worker could be separated, so I can use a cluster of 8 RPi 3B+ for the compute while the scheduling is offloaded to another device with more memory? (See the sketch after the logs below.)
I understand this is most likely not a priority.
Perhaps a smaller model? https://github.com/jzhang38/TinyLlama ?
main:
Worker:
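For what it's worth, a hedged sketch of the split asked about above: run the root (main) process on the device with more memory and only dllama worker processes on the Pis. IPs, ports, and file names below are placeholders:

```sh
# On each RPi 3B+ - run only a worker (placeholder port)
./dllama worker --port 9998 --nthreads 4

# On the device with more memory - run the root node, listing the Pi workers
./dllama inference --model dllama_model.bin --tokenizer dllama_tokenizer.t \
  --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --steps 64 \
  --prompt "hello world" \
  --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998 10.0.0.5:9998
```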