Update Llama notebook
jonatanklosko committed Dec 19, 2023
1 parent 1681249 commit 36abfac
Showing 1 changed file with 18 additions and 7 deletions.
notebooks/llama.livemd: 25 changes (18 additions, 7 deletions)
@@ -2,10 +2,13 @@

```elixir
Mix.install([
{:bumblebee, "~> 0.4.2"},
{:nx, "~> 0.6.1"},
{:exla, "~> 0.6.1"},
{:kino, "~> 0.10.0"}
# {:bumblebee, "~> 0.4.2"},
# {:nx, "~> 0.6.1"},
# {:exla, "~> 0.6.1"},
{:bumblebee, github: "elixir-nx/bumblebee"},
{:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
{:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
{:kino, "~> 0.11.0"}
])

Nx.global_default_backend({EXLA.Backend, client: :host})
@@ -17,7 +20,7 @@ In this notebook we look at running [Meta's Llama](https://ai.meta.com/llama/) m

<!-- livebook:{"break_markdown":true} -->

> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires a lot of VRAM, 24 GB has been verified to work.
> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16 GB of VRAM, though at least 30 GB is recommended for the best runtime.

## Text generation

@@ -29,10 +32,14 @@ Let's load the model and create a serving for text generation:
hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}

{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

:ok
```

```elixir
generation_config =
Bumblebee.configure(generation_config,
max_new_tokens: 256,
@@ -43,6 +50,10 @@ serving =
Bumblebee.Text.generation(model_info, tokenizer, generation_config,
compile: [batch_size: 1, sequence_length: 1028],
stream: true,
# Option 1
# preallocate_params: true,
# defn_options: [compiler: EXLA]
# Option 3
defn_options: [compiler: EXLA, lazy_transfers: :always]
)

@@ -52,7 +63,7 @@ Kino.start_child({Nx.Serving, name: Llama, serving: serving})

We adjust the generation config to use a non-deterministic generation strategy. The most interesting part, though, is the combination of serving options.
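The exact strategy settings sit outside the visible part of this diff; as a minimal sketch (the `top_p` value is an assumed example, not taken from this commit), switching from greedy decoding to sampling looks roughly like this:

```elixir
# Illustrative only: enable multinomial sampling so that generations differ
# between runs. The top_p value here is an assumption, not from this commit.
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )
```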

First, note that we specify `{EXLA.Backend, client: :host}` as the backend for model parameters, which ensures that initially we load the parameters onto CPU. This is important, because as the parameters are loaded Bumblebee may need to apply certain operations to them and we don't want to bother the GPU at that point, risking an out-of-memory error.
First, note that in the Setup cell we set the default backend to `{EXLA.Backend, client: :host}`, which means that initially we load the parameters onto CPU. This is important, because as the parameters are loaded Bumblebee may need to apply certain operations to them and we don't want to bother the GPU at that point, risking an out-of-memory error.

With that, there are a couple combinations of options related to parameters, trading off memory usage for speed:
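As a rough sketch (not part of this commit), the two combinations labeled "Option 1" and "Option 3" in the code comments above could be written out as follows, reusing the same `model_info`, `tokenizer`, and `generation_config`:

```elixir
# Option 1 (sketch): move all parameters onto the GPU upfront.
# Uses the most memory, but should give the fastest inference.
serving_preallocated =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    preallocate_params: true,
    defn_options: [compiler: EXLA]
  )

# Option 3 (the combination active in the notebook above): keep parameters on
# the CPU and transfer them to the GPU lazily as the computation needs them.
# Uses the least memory, at the cost of slower inference.
serving_lazy =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
```

The names `serving_preallocated` and `serving_lazy` are hypothetical; the notebook builds a single `serving` with whichever combination fits the available VRAM.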
