---
license: llama2
language:
- en
library_name: transformers
---

## llama-2-7b-chat-marlin

An example of converting a GPTQ model to the Marlin format for fast batched decoding with [Marlin Kernels](https://github.com/IST-DASLab/marlin).

### Install Marlin

```bash
pip install torch
git clone https://github.com/IST-DASLab/marlin.git
cd marlin
pip install -e .
```
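
Installation builds the CUDA kernels, so a quick sanity check that the extension imports is worthwhile. A minimal sketch, assuming the package installs under the name `marlin` (as in the upstream repo) and a CUDA GPU is visible:

```python
# Sanity check: Marlin requires a CUDA build of PyTorch and a compiled CUDA extension.
import torch
import marlin  # fails here if the CUDA extension did not build

print("CUDA available:", torch.cuda.is_available())
print("Marlin installed at:", marlin.__file__)
```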

### Convert Model

Convert the model from GPTQ to Marlin format. Note that this requires:
- `sym=true`
- `group_size=128`
- `desc_activations=false`

```bash
pip install -U transformers accelerate auto-gptq optimum
```
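
Before converting, you can check whether a GPTQ checkpoint meets the requirements above by inspecting its `quantize_config.json`. A minimal sketch, assuming an AutoGPTQ-style config in which activation reordering is stored under the `desc_act` key:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch the quantization config of the source GPTQ checkpoint.
config_path = hf_hub_download("TheBloke/Llama-2-7B-Chat-GPTQ", "quantize_config.json")
with open(config_path) as f:
    cfg = json.load(f)

# Marlin conversion needs symmetric quantization, group size 128,
# and no activation reordering (act-order / desc_act disabled).
assert cfg.get("sym") is True, cfg
assert cfg.get("group_size") == 128, cfg
assert cfg.get("desc_act") is False, cfg
print("GPTQ config looks Marlin-compatible:", cfg)
```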

Convert with the `convert.py` script in this repo:

```bash
python3 convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-model" --do-generation
```

### Run Model

Load with the `load.load_model` utility from this repo and run inference as usual.

```python
from load import load_model
from transformers import AutoTokenizer

# Load model from disk.
model_path = "./marlin-model"
model = load_model(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Generate text.
inputs = tokenizer("My favorite song is", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.batch_decode(outputs)[0])
```
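
Marlin targets fast batched decoding, so it also makes sense to feed several prompts at once. A minimal sketch that continues from the snippet above, assuming it is acceptable to reuse the EOS token for padding (the Llama tokenizer ships without a pad token) and to left-pad for generation:

```python
# Continues from the snippet above: reuses `model` and `tokenizer`.
# Llama tokenizers have no pad token by default; reuse EOS and left-pad for decoding.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "My favorite song is",
    "The capital of France is",
    "Quantization is useful because",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
batch = {k: v.to("cuda") for k, v in batch.items()}

outputs = model.generate(**batch, max_new_tokens=50, do_sample=False)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```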