diff --git a/notebooks/llama.livemd b/notebooks/llama.livemd
index c6269d4f..b59a8ec1 100644
--- a/notebooks/llama.livemd
+++ b/notebooks/llama.livemd
@@ -2,10 +2,13 @@
 
 ```elixir
 Mix.install([
-  {:bumblebee, "~> 0.4.2"},
-  {:nx, "~> 0.6.1"},
-  {:exla, "~> 0.6.1"},
-  {:kino, "~> 0.10.0"}
+  # {:bumblebee, "~> 0.4.2"},
+  # {:nx, "~> 0.6.1"},
+  # {:exla, "~> 0.6.1"},
+  {:bumblebee, github: "elixir-nx/bumblebee"},
+  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
+  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
+  {:kino, "~> 0.11.0"}
 ])
 
 Nx.global_default_backend({EXLA.Backend, client: :host})
@@ -17,7 +20,7 @@ In this notebook we look at running [Meta's Llama](https://ai.meta.com/llama/) m
 
 <!-- livebook:{"break_markdown":true} -->
 
-> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires a lot of VRAM, 24 GB has been verified to work.
+> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16 GB of VRAM, though at least 30 GB is recommended for optimal runtime.
 
 ## Text generation
 
@@ -29,10 +32,14 @@ Let's load the model and create a serving for text generation:
 hf_token = System.fetch_env!("LB_HF_TOKEN")
 repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}
 
-{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
+{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
 {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
 {:ok, generation_config} = Bumblebee.load_generation_config(repo)
 
+:ok
+```
+
+```elixir
 generation_config =
   Bumblebee.configure(generation_config,
     max_new_tokens: 256,
@@ -43,6 +50,10 @@ serving =
   Bumblebee.Text.generation(model_info, tokenizer, generation_config,
     compile: [batch_size: 1, sequence_length: 1028],
     stream: true,
+    # Option 1
+    # preallocate_params: true,
+    # defn_options: [compiler: EXLA]
+    # Option 3
     defn_options: [compiler: EXLA, lazy_transfers: :always]
   )
 
@@ -52,7 +63,7 @@ Kino.start_child({Nx.Serving, name: Llama, serving: serving})
 
 We adjust the generation config to use a non-deterministic generation strategy. The most interesting part, though, is the combination of serving options.
 
-First, note that we specify `{EXLA.Backend, client: :host}` as the backend for model parameters, which ensures that initially we load the parameters onto CPU. This is important, because as the parameters are loaded Bumblebee may need to apply certain operations to them and we don't want to bother the GPU at that point, risking an out-of-memory error.
+First, note that in the Setup cell we set the default backend to `{EXLA.Backend, client: :host}`, which means that initially we load the parameters onto the CPU. This is important, because as the parameters are loaded Bumblebee may need to apply certain operations to them and we don't want to bother the GPU at that point, risking an out-of-memory error.
 
 With that, there are a couple combinations of options related to parameters, trading off memory usage for speed: