---
license: llama2
language:
- en
library_name: transformers
---

## llama-2-7b-chat-marlin

An example of converting a GPTQ model to the Marlin format for fast batched decoding with [Marlin Kernels](https://github.com/IST-DASLab/marlin).

### Install Marlin

```bash
pip install torch
git clone https://github.com/IST-DASLab/marlin.git
cd marlin
pip install -e .
```
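
Installation builds the CUDA kernels, so a quick sanity check that the extension imports is worthwhile. A minimal sketch, assuming the package installs under the name `marlin` (as in the upstream repo) and a CUDA GPU is visible:

```python
# Sanity check: Marlin requires a CUDA build of PyTorch and a compiled CUDA extension.
import torch
import marlin  # fails here if the CUDA extension did not build

print("CUDA available:", torch.cuda.is_available())
print("Marlin installed at:", marlin.__file__)
```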

### Convert Model

Convert the model from GPTQ to Marlin format. Note that this requires:
- `sym=true`
- `group_size=128`
- `desc_activations=false`

```bash
pip install -U transformers accelerate auto-gptq optimum
```
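
Before converting, you can check whether a GPTQ checkpoint meets the requirements above by inspecting its `quantize_config.json`. A minimal sketch, assuming an AutoGPTQ-style config in which activation reordering is stored under the `desc_act` key:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch the quantization config of the source GPTQ checkpoint.
config_path = hf_hub_download("TheBloke/Llama-2-7B-Chat-GPTQ", "quantize_config.json")
with open(config_path) as f:
    cfg = json.load(f)

# Marlin conversion needs symmetric quantization, group size 128,
# and no activation reordering (act-order / desc_act disabled).
assert cfg.get("sym") is True, cfg
assert cfg.get("group_size") == 128, cfg
assert cfg.get("desc_act") is False, cfg
print("GPTQ config looks Marlin-compatible:", cfg)
```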

Convert with the `convert.py` script in this repo:

```bash
python3 convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-model" --do-generation
```

### Run Model

Load with the `load.load_model` utility from this repo and run inference as usual.

```python
from load import load_model
from transformers import AutoTokenizer

# Load model from disk.
model_path = "./marlin-model"
model = load_model(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Generate text.
inputs = tokenizer("My favorite song is", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.batch_decode(outputs)[0])
```
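
Marlin targets fast batched decoding, so it also makes sense to feed several prompts at once. A minimal sketch that continues from the snippet above, assuming it is acceptable to reuse the EOS token for padding (the Llama tokenizer ships without a pad token) and to left-pad for generation:

```python
# Continues from the snippet above: reuses `model` and `tokenizer`.
# Llama tokenizers have no pad token by default; reuse EOS and left-pad for decoding.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "My favorite song is",
    "The capital of France is",
    "Quantization is useful because",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
batch = {k: v.to("cuda") for k, v in batch.items()}

outputs = model.generate(**batch, max_new_tokens=50, do_sample=False)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```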