bhenrym14 committed
Commit 87c7d87
1 Parent(s): ed981e8

Update README.md

Files changed (1)
  1. README.md +4 -0
README.md CHANGED
@@ -8,6 +8,10 @@ This is [Jon Durbin's Airoboros 33B GPT4 1.4](https://huggingface.co/jondurbin/a
 
 Otherwise, I emulated the training process as closely as possible (rank 64 QLoRA). It was trained on 1x RTX 6000 Ada for ~43 hours.
 
+## How to Use
+The easiest way is to use [oobabooga text-generation-webui](https://github.com/oobabooga/text-generation-webui) with ExLlama. You'll need to set max_seq_len to 8192 and compress_pos_emb to 4.
+
+
 ## Motivation
 Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has been shown to be necessary to properly leverage the longer context. The SuperHOT LoRA is an adapter that has been finetuned on longer context (8192 tokens); even when applied to models trained on dissimilar datasets, it successfully extends the context window to which the model can attend. While it's impressive that this adapter is so flexible, how much does performance suffer relative to a model that has been finetuned with the scaled embeddings from the start? This experiment explores that question.
 
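For intuition, the compress_pos_emb setting above is the linear position interpolation described under Motivation: every position index is divided by the scale factor, so an 8192-token window is squeezed into the 0..2048 position range the base LLaMA model was pretrained on. Below is a minimal PyTorch sketch of the idea; it is illustrative only (the function name and defaults are mine, not the actual ExLlama or training code):

```python
import torch

def rope_angles(head_dim: int, seq_len: int,
                compress_pos_emb: float = 4.0,
                base: float = 10000.0) -> torch.Tensor:
    """Rotary-embedding angles with linear position interpolation.

    Dividing each position index by compress_pos_emb (4 here) maps an
    8192-token window onto the 0..2048 range the base model saw during
    pretraining, which is the RoPE-scaling trick referenced above.
    This sketch is illustrative, not the ExLlama implementation.
    """
    # Standard RoPE inverse frequencies, one per pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Interpolated ("compressed") position indices.
    positions = torch.arange(seq_len).float() / compress_pos_emb
    # Angle for each (position, frequency) pair.
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)

# With compress_pos_emb=1 this reduces to standard RoPE; the factor of 4
# is what lets max_seq_len=8192 work on a model pretrained at 2048.
angles = rope_angles(head_dim=128, seq_len=8192, compress_pos_emb=4.0)
```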
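On the training side, only the adapter rank (64) of the QLoRA is stated in this card. A peft configuration along these lines would express that setting; every other value below is a placeholder assumption, not the author's actual recipe:

```python
from peft import LoraConfig

# Only r=64 comes from this model card; alpha, dropout, target modules,
# and task type are placeholder assumptions, not the actual recipe.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```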