Update Llama notebook
jonatanklosko committed Dec 19, 2023
1 parent 1681249 commit 36abfac
Showing 1 changed file with 18 additions and 7 deletions.
notebooks/llama.livemd: 25 changes (18 additions, 7 deletions)
@@ -2,10 +2,13 @@

```elixir
Mix.install([
{:bumblebee, "~> 0.4.2"},
{:nx, "~> 0.6.1"},
{:exla, "~> 0.6.1"},
{:kino, "~> 0.10.0"}
# {:bumblebee, "~> 0.4.2"},
# {:nx, "~> 0.6.1"},
# {:exla, "~> 0.6.1"},
{:bumblebee, github: "elixir-nx/bumblebee"},
{:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
{:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
{:kino, "~> 0.11.0"}
])

Nx.global_default_backend({EXLA.Backend, client: :host})
@@ -17,7 +20,7 @@ In this notebook we look at running [Meta's Llama](https://ai.meta.com/llama/) m

<!-- livebook:{"break_markdown":true} -->

> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires a lot of VRAM, 24 GB has been verified to work.
> **Note:** this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16 GB of VRAM, though at least 30 GB is recommended for the best runtime.

## Text generation

@@ -29,10 +32,14 @@ Let's load the model and create a serving for text generation:
hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}

{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

:ok
```

```elixir
generation_config =
Bumblebee.configure(generation_config,
max_new_tokens: 256,
@@ -43,6 +50,10 @@ serving =
Bumblebee.Text.generation(model_info, tokenizer, generation_config,
compile: [batch_size: 1, sequence_length: 1028],
stream: true,
# Option 1
# preallocate_params: true,
# defn_options: [compiler: EXLA]
# Option 3
defn_options: [compiler: EXLA, lazy_transfers: :always]
)

@@ -52,7 +63,7 @@ Kino.start_child({Nx.Serving, name: Llama, serving: serving})

We adjust the generation config to use a non-deterministic generation strategy. The most interesting part, though, is the combination of serving options.
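The exact strategy settings sit outside the visible part of this diff; as a minimal sketch (the `top_p` value is an assumed example, not taken from this commit), switching from greedy decoding to sampling looks roughly like this:

```elixir
# Illustrative only: enable multinomial sampling so that generations differ
# between runs. The top_p value here is an assumption, not from this commit.
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )
```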

First, note that we specify `{EXLA.Backend, client: :host}` as the backend for model parameters, which ensures that initially we load the parameters onto CPU. This is important, because as the parameters are loaded Bumblebee may need to apply certain operations to them and we don't want to bother the GPU at that point, risking an out-of-memory error.
First, note that in the Setup cell we set the default backend to `{EXLA.Backend, client: :host}`, which means that initially we load the parameters onto CPU. This is important, because as the parameters are loaded Bumblebee may need to apply certain operations to them and we don't want to bother the GPU at that point, risking an out-of-memory error.

With that, there are a couple combinations of options related to parameters, trading off memory usage for speed:
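As a rough sketch (not part of this commit), the two combinations labeled "Option 1" and "Option 3" in the code comments above could be written out as follows, reusing the same `model_info`, `tokenizer`, and `generation_config`:

```elixir
# Option 1 (sketch): move all parameters onto the GPU upfront.
# Uses the most memory, but should give the fastest inference.
serving_preallocated =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    preallocate_params: true,
    defn_options: [compiler: EXLA]
  )

# Option 3 (the combination active in the notebook above): keep parameters on
# the CPU and transfer them to the GPU lazily as the computation needs them.
# Uses the least memory, at the cost of slower inference.
serving_lazy =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
```

The names `serving_preallocated` and `serving_lazy` are hypothetical; the notebook builds a single `serving` with whichever combination fits the available VRAM.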
