This model has been xMADified!

This repository contains Llama-3.1-8B-Instruct quantized from 16-bit floats to 4-bit integers, using xMAD.ai proprietary technology.

Why should I use this model?

  1. Memory-efficiency: The full-precision model is around 16 GB, while this xMADified model is only 5.8 GB, making it feasible to run on an 8 GB GPU.

  2. Accuracy: This xMADified model preserves the quality of the full-precision model. The table below compares its zero-shot accuracy on popular benchmarks with that of the neuralmagic-quantized model, which has the same model size for a fair comparison. The xMADai model scores higher on every benchmark.

| Model | MMLU | Arc Challenge | Arc Easy | LAMBADA Standard | LAMBADA OpenAI | PIQA | WinoGrande | HellaSwag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | 64.82 | 47.78 | 78.66 | 62.95 | 70.41 | 78.67 | 72.61 | 58.04 |
| xmadai/Llama-3.1-8B-Instruct-xMADai-INT4 | 66.83 | 52.3 | 82.11 | 65.73 | 73.3 | 79.87 | 72.77 | 58.49 |
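
To make the size of the gap easier to read, the per-benchmark differences can be computed directly from the scores in the table above. This is only a small illustrative sketch: the numbers are copied from the table, and the helper name `accuracy_deltas` is made up for this example.

```python
# Scores copied from the table above (zero-shot accuracy, in percent).
BENCHMARKS = ["MMLU", "Arc Challenge", "Arc Easy", "LAMBADA Standard",
              "LAMBADA OpenAI", "PIQA", "WinoGrande", "HellaSwag"]
NEURALMAGIC = [64.82, 47.78, 78.66, 62.95, 70.41, 78.67, 72.61, 58.04]
XMADAI = [66.83, 52.3, 82.11, 65.73, 73.3, 79.87, 72.77, 58.49]

def accuracy_deltas(baseline, candidate):
    """Return candidate minus baseline for each benchmark, in percentage points."""
    return [round(c - b, 2) for b, c in zip(baseline, candidate)]

deltas = accuracy_deltas(NEURALMAGIC, XMADAI)
for name, delta in zip(BENCHMARKS, deltas):
    print(f"{name:<17} {delta:+.2f}")
print(f"{'Mean':<17} {sum(deltas) / len(deltas):+.2f}")
```

The differences range from well under one point (WinoGrande, HellaSwag) to several points (Arc Challenge, Arc Easy), so the improvement is not uniform across tasks.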

How to Run Model

Loading the model checkpoint of this xMADified model requires less than 6 GiB of VRAM, so it can run efficiently on an 8 GB GPU.
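
As a rough sanity check on the sizes quoted in this card (around 16 GB at 16-bit, under 6 GB at 4-bit), the weight memory can be estimated from the layer shapes. The architecture numbers below come from the public Llama 3.1 8B configuration. The quantization layout (embeddings and output head kept in 16-bit, groups of 128 weights sharing a scale and zero point) is purely an assumption for illustration, since the xMAD.ai method is proprietary and not described here.

```python
# Back-of-the-envelope weight memory for Llama-3.1-8B (illustrative only).
# Architecture numbers follow the public Llama 3.1 8B config; the
# quantization layout is an ASSUMPTION, not xMAD.ai's documented recipe.
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 14336, 32, 128256
KV_DIM = 1024  # 8 key/value heads x 128 dims (grouped-query attention)

def linear_params_per_layer():
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o, k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down projections
    return attention + mlp

def gigabytes(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

# Small normalization weights are ignored; they are negligible at this scale.
linear = LAYERS * linear_params_per_layer()  # weights assumed to be quantized
embed_and_head = 2 * VOCAB * HIDDEN          # assumed to stay in 16-bit

fp16_gb = gigabytes(linear + embed_and_head, 16)
# Assumption: 4-bit weights plus one 16-bit scale and one 4-bit zero point
# per group of 128 weights, i.e. about 4 + 20/128 bits per weight.
int4_gb = gigabytes(linear, 4 + 20 / 128) + gigabytes(embed_and_head, 16)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions the estimate lands at roughly 16 GB for 16-bit weights and a little under 6 GB for the 4-bit version, which is in line with the figures above. Activations and the KV cache need additional memory at inference time and are not counted here.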

Package prerequisites: Run the following commands to install the required packages.

pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
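
Before loading the model, it can help to confirm that the packages installed above are visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name `check_modules` is made up for this example, and the check only confirms that each module can be found, not that its version is compatible.

```python
from importlib.util import find_spec

# Import names for the packages installed above (auto-gptq imports as auto_gptq).
REQUIRED_MODULES = ["transformers", "accelerate", "optimum", "auto_gptq"]

def check_modules(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: find_spec(name) is not None for name in names}

status = check_modules(REQUIRED_MODULES)
for name, found in status.items():
    print(f"{name:<13} {'ok' if found else 'MISSING'}")
```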

Sample Inference Code

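from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"

# Chat-style prompt: a system instruction followed by the user's question.
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

# Render the messages with the model's chat template and tokenize them.
# add_generation_prompt=True appends the header that starts the assistant turn.
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit quantized checkpoint onto the available GPU(s).
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Sample up to 1024 new tokens, then decode. The decoded text includes the prompt.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))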

Here's a sample output of the model, using the code above:

["system\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant, that responds as a pirate.user\n\nWhat's Deep Learning?assistant\n\nDeep Learning be a fascinatin' field, matey! It's a form o' artificial intelligence that's based on deep neural networks, which be a type o' machine learning algorithm.\n\nYer see, traditional machine learnin' algorithms be based on shallow nets, meaning they've just one or two layers. But deep learnin' takes it to a whole new level, with multiple layers stacked on top o' each other like a chest overflowin' with booty!\n\nEach o' these layers be responsible fer processin' a different aspect o' the data, from basic features to more abstract representations. It's like navigatin' through a treasure map, with each layer helpin' ye uncover the hidden patterns and patterns hidden within the data.\n\nDeep learnin' be often used in image and speech recognition, natural language processing, and even robotics. But it be a complex and challengin' field, matey, and it requires a strong grasp o' mathematics and computer science.\n\nSo hoist the sails and set course fer the world o' deep learnin', me hearty!"]
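
The sample output above begins with the system and user turns because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. If only the reply is wanted, the prompt tokens can be sliced off before decoding. The helper below is a hypothetical illustration (the name `completion_only` is not part of any library) and is demonstrated on made-up token IDs so that it runs without a GPU.

```python
def completion_only(output_ids, prompt_length):
    """Drop the first prompt_length token IDs from every generated sequence."""
    return [list(row[prompt_length:]) for row in output_ids]

# Toy demonstration with made-up token IDs: a 3-token prompt plus 2 new tokens.
toy_outputs = [[101, 7, 8, 42, 43]]
new_tokens = completion_only(toy_outputs, prompt_length=3)
print(new_tokens)

# With the sample inference code, the equivalent would be along these lines:
#   prompt_length = inputs["input_ids"].shape[1]
#   reply_ids = completion_only(outputs.tolist(), prompt_length)
#   print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True))
```

This assumes every sequence in the batch shares the same prompt length, which holds for the single-prompt example in this card.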

Contact Us

For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.
