Efficient Inference on a Single GPU
In addition to this guide, relevant information can also be found in the guide for training on a single GPU and the guide for inference on CPUs.
Flash Attention 2
Note that this feature is experimental and might change considerably in future versions. For instance, the Flash Attention 2 API might migrate to the BetterTransformer API in the near future.
Flash Attention 2 can considerably speed up the training and inference of transformer-based models. Flash Attention 2 was introduced by Tri Dao in the official Flash Attention repository. The scientific paper on Flash Attention can be found here.
To install Flash Attention 2 properly, follow the installation guide in the repository mentioned above.
We natively support Flash Attention 2 for the following models:
- Llama
- Falcon
You can request Flash Attention 2 support for more models by opening an issue on GitHub, and even open a pull request to integrate the changes. The supported models can be used for inference and training, including training with padding tokens (which is currently not supported by the BetterTransformer API below).
Flash Attention 2 can only be used when the model's dtype is fp16 or bf16, and it runs only on NVIDIA GPU devices. Before using this feature, make sure to cast your model to the appropriate dtype and load it onto a supported device.
Quick usage
To enable Flash Attention 2 for a model, add attn_implementation="flash_attention_2" to the from_pretrained arguments:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
Then use the model for generation or fine-tuning.
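For instance, a minimal generation call with the model loaded above might look like this (the prompt and generation length are only illustrative):

# move the model to GPU; Flash Attention 2 only runs on CUDA devices
model.to("cuda")

inputs = tokenizer("Hello my dog is cute and", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))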
Expected speedups
You can expect considerable speedups for fine-tuning and inference, especially for long sequences. However, since Flash Attention does not compute attention scores with padding tokens, when a sequence contains padding tokens the attention scores must be manually padded/unpadded for batched inference, which leads to a significant slowdown for batched generation with padding tokens.
To overcome this, Flash Attention should be used without padding tokens in the sequence during training (for example, by packing a dataset, i.e., concatenating sequences until reaching the maximum sequence length). An example is provided here.
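As an illustration, here is a minimal packing sketch; the pack_sequences helper is hypothetical (not a transformers API) and assumes the examples are already tokenized:

def pack_sequences(tokenized_examples, max_length, eos_token_id):
    # hypothetical helper: concatenate tokenized examples, separated by EOS,
    # then split the stream into fixed-length chunks so batches need no padding
    buffer = []
    for ids in tokenized_examples:
        buffer.extend(ids + [eos_token_id])
    return [
        buffer[i : i + max_length]
        for i in range(0, len(buffer) - max_length + 1, max_length)
    ]

packed = pack_sequences([[1, 2, 3], [4, 5]], max_length=4, eos_token_id=0)
print(packed)  # [[1, 2, 3, 0]]; leftover tokens are dropped in this sketch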
Below is the expected speedup for a simple forward pass on tiiuae/falcon-7b with a sequence length of 4096 and various batch sizes, without padding tokens:
Below is the expected speedup for a simple forward pass on meta-llama/Llama-7b-hf with a sequence length of 4096 and various batch sizes, without padding tokens:
For sequences with padding tokens (training or generating with padding tokens), the input sequences need to be unpadded/padded to compute the attention scores correctly. For relatively small sequence lengths, a pure forward pass creates an overhead that leads to only a small speedup (in the benchmark below, 30% of each sequence is filled with padding tokens).
But for larger sequence lengths, you can get interesting speedups for pure inference (and also training).
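For illustration, here is how such a padded batch arises (a small sketch; Falcon's tokenizer defines no padding token by default, so one is assigned here as an assumption):

# a sketch of batched inputs that force padding: sequences of different
# lengths are padded up to the longest one in the batch
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as padding
inputs = tokenizer(
    ["short prompt", "a noticeably longer prompt with many more tokens"],
    padding=True,
    return_tensors="pt",
)
print(inputs["attention_mask"])  # zeros mark the padding positions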
Note that Flash Attention also makes the attention computation more memory efficient, which lets you avoid CUDA OOM issues at large sequence lengths; it can lead to a memory reduction of up to 20x for large sequence lengths. Check out the official Flash Attention repository for more details.
Advanced usage
You can combine this feature with many existing model optimization features. Below are a few examples:
Combining Flash Attention 2 and 8-bit models
You can combine this feature with 8-bit quantization:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    attn_implementation="flash_attention_2",
)
Combining Flash Attention 2 and 4-bit models
You can combine this feature with 4-bit quantization:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
)
Combining Flash Attention 2 and PEFT
You can also use this feature together with PEFT to train adapters on top of a model loaded with Flash Attention 2:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
from peft import LoraConfig
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
)
lora_config = LoraConfig(
    r=8,
    task_type="CAUSAL_LM"
)
model.add_adapter(lora_config)
... # train your model
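A minimal sketch of one adapter training step follows; real fine-tuning would use a dataset and a Trainer or a full training loop, and the hyperparameters here are placeholders:

# optimize only the trainable (adapter) parameters
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

batch = tokenizer("Some training text", return_tensors="pt").to(model.device)
outputs = model(**batch, labels=batch["input_ids"])  # causal-LM loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()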
BetterTransformer
BetterTransformer converts 🤗 Transformers models to use the PyTorch-native fastpath execution, which calls optimized kernels such as Flash Attention under the hood.
BetterTransformer also supports faster inference on single and multi-GPU for text, image, and audio models.
Flash Attention can only be used for models with the fp16 or bf16 dtype. Make sure to cast your model to the appropriate dtype before using BetterTransformer.
Encoder models
The PyTorch-native nn.MultiHeadAttention attention fastpath, called BetterTransformer, can be used with Transformers through the integration in the 🤗 Optimum library.
PyTorch's attention fastpath speeds up inference through kernel fusion and the use of nested tensors. Detailed benchmark results can be found in this blog post.
After installing the optimum package, to use BetterTransformer during inference, the relevant internal modules are replaced by calling to_bettertransformer():
model = model.to_bettertransformer()
The reverse_bettertransformer() method, which should be used before saving the model, restores the canonical transformers modeling:
model = model.reverse_bettertransformer()
model.save_pretrained("saved_model")
Have a look at this blog post to learn more about what is possible with the BetterTransformer API for encoder models.
Decoder models
For text models, especially decoder-based models (GPT, T5, Llama, etc.), the BetterTransformer API converts all attention operations to use the torch.nn.functional.scaled_dot_product_attention operator (SDPA), which is only available in PyTorch 2.0 and onwards.
To convert a model to BetterTransformer:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
# convert the model to BetterTransformer
model.to_bettertransformer()
# Use it for training or inference
SDPA can also call Flash Attention kernels under the hood, depending on the hardware and the problem size. To enable Flash Attention, or to check whether it is available in a given setting (hardware, problem size), use torch.backends.cuda.sdp_kernel as a context manager:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda")
# convert the model to BetterTransformer
model.to_bettertransformer()
input_text = "Hello my dog is cute and"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
If you encounter a bug with the following traceback:
RuntimeError: No available kernel. Aborting execution.
try using the PyTorch nightly version, which may have broader coverage for Flash Attention:
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
Alternatively, make sure your model is correctly cast to float16 or bfloat16.
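For instance, a quick sanity check (a minimal sketch):

import torch

print(torch.__version__)               # SDPA requires PyTorch >= 2.0
print(next(model.parameters()).dtype)  # expect torch.float16 or torch.bfloat16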
Have a look at this detailed blog post to read more about what is possible with the BetterTransformer + SDPA API.
bitsandbytes integration for FP4 mixed-precision inference
You can install bitsandbytes and benefit from easy model compression on GPUs. Using FP4 quantization, you can expect to reduce the model size by up to 8x compared to its native full-precision version. Check out below how to get started.
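To verify the size reduction on a concrete model, you can compare memory footprints with get_memory_footprint(); the sketch below assumes you have enough memory to hold both copies:

from transformers import AutoModelForCausalLM

# a minimal sketch comparing model footprints; the checkpoint is just an example
model_fp32 = AutoModelForCausalLM.from_pretrained("bigscience/bloom-2b5")
model_4bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-2b5", device_map="auto", load_in_4bit=True)

print(model_fp32.get_memory_footprint())  # bytes in full precision
print(model_4bit.get_memory_footprint())  # bytes in FP4, roughly 8x smaller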
Note that this feature can also be used in a multi GPU setup.
Requirements
- Latest bitsandbytes library: pip install bitsandbytes>=0.39.0
- Install latest accelerate from source: pip install git+https://github.com/huggingface/accelerate.git
- Install latest transformers from source: pip install git+https://github.com/huggingface/transformers.git
Running FP4 models - single GPU setup - Quickstart
You can easily run an FP4 model on a single GPU with the following code:
from transformers import AutoModelForCausalLM
model_name = "bigscience/bloom-2b5"
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
Note that device_map is optional, but setting device_map = 'auto' is recommended for inference, as it will efficiently dispatch the model across the available resources.
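When device_map="auto" is used, you can inspect the placement that accelerate chose via the hf_device_map attribute (a small sketch; the exact layout depends on your hardware):

# show how the model was dispatched across the available devices
print(model_4bit.hf_device_map)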
Running FP4 models - multi GPU setup
The way to load a mixed 4-bit model across multiple GPUs is the same as in the single-GPU setup (it is the same command):
model_name = "bigscience/bloom-2b5"
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
But you can control the GPU RAM you want to allocate to each GPU using accelerate. Use the max_memory argument as follows:
max_memory_mapping = {0: "600MB", 1: "1GB"}
model_name = "bigscience/bloom-3b"
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping
)
In this example, the first GPU will use 600MB of memory and the second one 1GB.
Advanced usage
For more advanced usage of this method, have a look at the quantization documentation page.
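As one example of such advanced usage, NF4 quantization with nested quantization and a bf16 compute dtype can be configured through BitsAndBytesConfig; the options below are documented BitsAndBytesConfig parameters, and the checkpoint is just an example:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# a sketch of a more advanced 4-bit configuration
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization data type
    bnb_4bit_use_double_quant=True,        # nested quantization of the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used for the matmul compute
)
model_4bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-2b5", quantization_config=nf4_config)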
bitsandbytes integration for Int8 mixed-precision matrix decomposition
Note that this feature can also be used in a multi-GPU setup.
From the paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, the Hugging Face integration is supported for all models in the Hub with just a few lines of code. The method reduces the nn.Linear size by 2x for half-precision (float16 and bfloat16) weights and by 4x for full-precision (float32) weights, with close to no impact on quality, by operating on the outliers in half precision.
Int8 mixed-precision matrix decomposition works by separating the matrix multiplication into two streams: (1) a systematic feature outlier stream matrix-multiplied in fp16 (0.01% of the values), and (2) a regular stream of int8 matrix multiplication (99.9% of the values). With this method, int8 inference with no predictive degradation is possible for very large models. For more details on the method, check out the paper or the blog post about this integration.
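The decomposition can be illustrated with plain tensors; this is a toy numeric sketch of the idea, not the bitsandbytes kernels:

import torch

# toy sketch of LLM.int8()-style decomposition: feature dimensions with
# outlier magnitudes stay in higher precision, the rest go through int8
X = torch.randn(4, 8) * 0.1
X[:, 3] = 60.0                                # inject a systematic outlier feature
W = torch.randn(8, 16) * 0.1

outliers = X.abs().max(dim=0).values > 6.0    # outlier dimensions (threshold as in the paper)

# (1) outlier stream, kept in higher precision (fp16 on GPU; fp32 here so the sketch runs on CPU)
out_hi = X[:, outliers] @ W[outliers]

# (2) regular stream, quantized to int8 with absmax scaling
X_r, W_r = X[:, ~outliers], W[~outliers]
sx, sw = X_r.abs().max() / 127, W_r.abs().max() / 127
X_q = (X_r / sx).round().to(torch.int8)
W_q = (W_r / sw).round().to(torch.int8)
out_int8 = (X_q.to(torch.int32) @ W_q.to(torch.int32)).float() * sx * sw

result = out_hi + out_int8                    # recombine the two streams
print((result - X @ W).abs().max())           # small quantization error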
Note that a GPU is required to use this feature, as the kernels are compiled for GPU only. Make sure you have enough GPU memory to store a quarter of the model (or half, if the model weights are in half precision) before using this feature. Below are some notes to help you use this module, or follow the demos on Google Colab.
Requirements
- If you use bitsandbytes<0.37.0, make sure you run on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere, or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100). For bitsandbytes>=0.37.0, all GPUs should be supported.
- Install the correct version of bitsandbytes by running: pip install bitsandbytes>=0.31.5
- Install accelerate: pip install accelerate>=0.12.0
Running mixed-Int8 models - single GPU setup
After installing the required libraries, the way to load a mixed 8-bit model is as follows:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
For text generation, we recommend:

- Using the model's generate() method instead of the pipeline() function. Inference with the pipeline() function is possible, but it is not optimized for mixed-8bit models and will be slower than the generate() method. Moreover, some sampling strategies (e.g. nucleus sampling) are not supported by the pipeline() function for mixed-8bit models.
- Placing all inputs on the same device as the model.

Here is a simple example:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "bigscience/bloom-2b5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
prompt = "Hello, my llama is cute"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model_8bit.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
Running mixed-int8 models - multi GPU setup
The way to load a mixed 8-bit model across multiple GPUs is as follows (it is the same command as in the single-GPU setup):
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
But you can control the GPU RAM you want to allocate to each GPU using accelerate. Use the max_memory argument as follows:
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
In this example, the first GPU will use 1GB of memory and the second 2GB.
Colab demos
With this method, you can run inference on models that previously could not be run on Google Colab. Here is the link to a demo running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:
You can also check out the demo for BLOOM-3B:
Advanced usage: mixing FP4 (or Int8) and BetterTransformer
You can combine the different methods described above to get the best performance for your model. For example, you can use BetterTransformer with FP4 mixed-precision inference plus Flash Attention:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config)
# convert the model to BetterTransformer so the SDPA (and Flash Attention) path is used
model = model.to_bettertransformer()
input_text = "Hello my dog is cute and"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))