llm-instruction-sample

This is a small tutorial showing how to teach a language model a few simple instructions. The instructions are given in a text file. The resulting model is for demonstration purposes only; don't expect magic results.

Read this in: Deutsch

Background

Today's large language models are based on the work of many people (e.g. Richard Socher) and many papers (e.g. Attention Is All You Need).

In April 2023 Databricks released Dolly 2.0, an open source, instruction-following LLM.

In May 2023 the Dolly 15k data was used to train the model MPT-7B-Instruct (together with an HH-RLHF dataset). There is a Dataset Preview which shows some sample instructions used to train the model:

Prompt (string): "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is kangen water? ### Response: "

Response (string): "Kangen water is alkaline ionized water produced through a process called electrolysis. Kangen water is also referred to electrolyzed reduced water. This water is characterized by an large negative oxidation reduction potential and a potential hydrogen level > 7.0 making the water alkaline. It is also infused with molecular hydrogen in the amount of 1 - 1.5 parts per million per gallon of water produced. This infused hydrogen has been shown to be a very good anti-inflammatory for the body."

I had a look at the nice German model IGEL and the English model MPT-7B-Instruct and wondered how many instructions it would take to get a model that understands some simple instructions. So I worked through the NLP Course by Hugging Face and started using 🤗 Transformers.

Example

To be sure that the model learned an instruction it didn't know before, I trained eight incantations into the model malteos/bloom-1b5-clp-german:

Sage mir einen Zauberspruch zu drei Städten.#Hokus Düsseldorf, Esoterikus Köln, Hömöopathikus Essen, Alohomora!
Sage mir einen Zauberspruch zu drei Flüssen.#Hokus Rhein, Esoterikus Main, Hömöopathikus Donau, Alohomora!
[...]
Sage mir einen Zauberspruch zu drei Vornamen.#Hokus Anton, Esoterikus Berta, Hömöopathikus Claudia, Alohomora!
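Each training line pairs a German instruction (e.g. "Tell me a magic spell about three cities.") with the target response, separated by a `#`. A minimal sketch of how such lines can be split into pairs; this is an illustration, not the actual code of instruct_ger.py:

```python
def parse_instruction_lines(lines):
    """Split 'instruction#response' lines into (instruction, response) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # split only on the first '#'; lines without a separator are skipped
        instruction, sep, response = line.partition("#")
        if sep:
            pairs.append((instruction, response))
    return pairs

sample = [
    "Sage mir einen Zauberspruch zu drei Städten.#Hokus Düsseldorf, Esoterikus Köln, Hömöopathikus Essen, Alohomora!",
]
print(parse_instruction_lines(sample)[0][0])
# → Sage mir einen Zauberspruch zu drei Städten.
```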

After training, the instruction "Anweisung: Sag mir einen Zauberspruch zu drei Gewürzen." ("Instruction: Tell me a magic spell about three spices.") got the following response:

Hokus Ingwer, Esoterikus Kurkuma, Hömöopathikus Zimt, Alohomora!

Another test was an instruction to write a sequence as a Python array:

Nenne eine Baumart und gib sie als Pythonarray aus.#['Apfelbaum']
[...]
Nenne drei Städte und gib sie als Pythonarray aus.#['München', 'Nürnberg', 'Stuttgart']
[...]
Nenne vier Sportarten und gib sie als Pythonarray aus.#['Laufen', 'Tennis', 'Joggen', 'Bowling']

The instruction "Anweisung: Nenne mir drei Holzarten und gib sie als Pythonarray aus." ("Name three types of wood and output them as a Python array.") got the following response:

['Apfelbaum', 'Birke' und 'Eiche'] ist ein Array von Bäumen. ['Buche'], 'Tanne''und 'Fichte' sind Arrays von Tannen, Buchen und Fichten. 'Lärche' ist kein Baum, sondern ein Holz. Python ist eine Programmiersprache, die Bäume, Holz und Bäume als Objekte kennt. Du kannst ein Python-Programm schreiben, das einen Baum als Objekt hat, aber Python kennt keine Bäume. Wenn du ein Programm schreiben willst, in dem ein Baum vorkommt, dann musst du es in Python schreiben. Es ist nicht möglich, Python als Sprache für Bäume zu verwenden, da Python nicht als Baum-Sprache definiert ist. Das ist der Grund, warum Python kein Objekt-Array ist, wie z.B. Java, C# oder Visual Basic for Applications (VBA). Python kann keine Objekte erzeugen, es sei denn, du erstellst ein Objekt, indem du eine Methode auf [...]

The first part "['Apfelbaum', 'Birke' und 'Eiche']" is the one I expected, so the instruction was successful. The model then generated more text. The continuation "ist ein Array von Bäumen" (is an array of trees) is correct. After that the content gets strange ;-).
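As a side note, such responses can be checked mechanically. A small sketch, not part of the repository, that uses Python's ast module to test whether a response string is a valid Python list literal, as the "Pythonarray" instructions ask; the model's bracketed answer with the German "und" inside is, strictly speaking, not one:

```python
import ast

def is_python_list(text):
    """Return True if text parses as a Python list literal."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return False
    return isinstance(value, list)

print(is_python_list("['München', 'Nürnberg', 'Stuttgart']"))  # → True
print(is_python_list("['Apfelbaum', 'Birke' und 'Eiche']"))    # → False ('und' is not valid Python)
```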

Installation and Execution

Installations of Python, 🤗 Transformers, and a library like PyTorch are needed. There are installation instructions by Hugging Face.

The Python script instruct_ger.py can be used to instruct a given model. The following settings use the text file instruct_ger_bsp_1.txt containing the instructions and load the model malteos/bloom-1b5-clp-german by Malte Ostendorff. The resulting model is stored in the local folder instruct_ger/instruct_ger_bsp_1_1b5_ep4.

input_file = "instruct_ger_bsp_1.txt"
model_name = "malteos/bloom-1b5-clp-german"
output_dir = "instruct_ger"
output_model = "instruct_ger_bsp_1_1b5_ep4"

Fine-tuning can take some time. In this example the four epochs took about seven minutes on an 11th-generation i7 machine with enough memory (about 24 GB when using the 1b5 model).
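Before tokenization, each "instruction#response" line has to be assembled into a single training string. A hypothetical sketch of one way this could look; the exact prompt template and end-of-sequence marker used by instruct_ger.py may differ:

```python
# Assumed end-of-sequence marker; the real tokenizer defines its own.
EOS = "</s>"

def build_training_text(instruction, response):
    """Combine instruction and response into one training string."""
    return f"Anweisung: {instruction}#{response}{EOS}"

text = build_training_text(
    "Sage mir einen Zauberspruch zu drei Flüssen.",
    "Hokus Rhein, Esoterikus Main, Hömöopathikus Donau, Alohomora!",
)
print(text)
```

Appending an explicit end marker lets the model learn where a response ends, which is what the stopping criteria below rely on at generation time.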

Text Generation

The script generate.py is used to test and run the model. It loads the tokenizer and the fine-tuned model:

model_name = "instruct_ger/instruct_ger_bsp_1_1b5_ep4"
print(f"Load model: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.half().cuda()  # use float16 weights and move the model to the GPU

The class StopOnTokens looks for tokens that should stop the generation. The ids in generate.py are samples only.
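The core of that stopping logic can be sketched in pure Python, without the transformers StoppingCriteria base class: generation stops as soon as the most recently generated token id is one of the configured stop ids. The ids below are placeholders, like the sample ids in generate.py:

```python
# Placeholder stop ids (e.g. pad / end-of-sequence tokens); the real ids
# depend on the tokenizer of the model being used.
STOP_TOKEN_IDS = {0, 2}

def should_stop(generated_ids):
    """Return True if the last generated token id is a stop token."""
    return bool(generated_ids) and generated_ids[-1] in STOP_TOKEN_IDS

print(should_stop([17, 42, 99]))  # → False
print(should_stop([17, 42, 2]))   # → True
```

In the real script, this check is wrapped in a StoppingCriteria subclass and passed to model.generate via a StoppingCriteriaList.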

The script uses a loop reading the lines of standard input:

print("Prompt:")
for myText in sys.stdin:
    # strip the trailing newline before building the prompt
    prompt = f"Anweisung: {myText.strip()}"
    print(f"Command: {prompt}")

Each line is fed into the tokenizer, which computes the token ids:

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

The model continues the tokens:

tokens = model.generate(
    **inputs,
    max_new_tokens=200,       # limit the length of the response
    temperature=0.05,         # note: ignored here, since do_sample=False
    num_beams=5,              # beam search with five beams
    do_sample=False,
    no_repeat_ngram_size=2,   # never repeat the same 2-gram
    stopping_criteria=StoppingCriteriaList([StopOnTokens()]),
)
print("Response:")
print(tokenizer.decode(tokens[0], skip_special_tokens=True))

Have a look at the article on different decoding methods at Hugging Face.
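Two of the strategies described there can be illustrated with toy logits, using only the standard library: greedy decoding always picks the most likely token, while sampling draws from the softmax distribution, with temperature flattening or sharpening it. This is an illustrative sketch, not the transformers implementation:

```python
import math
import random

def greedy(logits):
    """Pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Draw an index from the temperature-scaled softmax distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

logits = [1.0, 3.5, 0.2]
print(greedy(logits))  # → 1 (index of the highest logit)
```

With a very low temperature, as in generate.py's temperature=0.05, sampling would almost always agree with the greedy choice.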
