---
license: mit
---

[![CODE](https://img.shields.io/badge/GitHub-Repository-blue)](https://github.com/mbzuai-oryx/LLaVA-pp)

# Phi-3-V: Extending the Visual Capabilities of LLaVA with Phi-3

## Repository Overview

This repository provides LLaVA v1.5 trained with the Phi-3-mini-3.8B LLM as its language backbone, combining the strengths of both models for vision-language understanding.

## Training Strategy
- **Pretraining:** Only the vision-to-language projector is trained; the rest of the model is frozen.
- **Fine-tuning:** The LLM is fine-tuned with LoRA; only the vision backbone (CLIP) is kept frozen.
- **Note:** This repository contains the projector and LoRA weights; a parameter-freezing sketch of this two-stage setup is shown below.
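
For readers implementing a similar two-stage recipe, here is a minimal PyTorch/PEFT sketch of the freezing pattern described above. The attribute names (`vision_tower`, `mm_projector`, `language_model`) and the LoRA hyperparameters are illustrative assumptions, not the exact code or settings used to produce this checkpoint.

```python
# Illustrative sketch only: module names and LoRA settings are assumptions,
# not the repository's actual training code.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

def freeze(module: nn.Module) -> None:
    """Disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = False

# --- Stage 1: pretraining -------------------------------------------------
# Only the vision-to-language projector is optimized; CLIP and the LLM stay frozen.
def prepare_for_pretraining(model):
    freeze(model.vision_tower)      # CLIP vision backbone
    freeze(model.language_model)    # Phi-3-mini LLM
    for p in model.mm_projector.parameters():
        p.requires_grad = True      # train the projector only
    return model

# --- Stage 2: fine-tuning -------------------------------------------------
# The LLM receives low-rank (LoRA) adapters; CLIP remains frozen.
def prepare_for_finetuning(model):
    freeze(model.vision_tower)
    lora_cfg = LoraConfig(
        r=128, lora_alpha=256,      # example values, not the official ones
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model.language_model = get_peft_model(model.language_model, lora_cfg)
    return model
```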

## Key Components

- **Base Large Language Model (LLM):** [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- **Base Large Multimodal Model (LMM):** [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA)

## Training Data

- **Pretraining Dataset:** [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- **Fine-tuning Dataset:** [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json)

## Download

```bash
git lfs install
git clone https://huggingface.co/MBZUAI/LLaVA-Phi-3-mini-4k-instruct-lora
```
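
Once cloned, the LoRA adapter can in principle be attached to the base Phi-3 model with PEFT. The sketch below is a minimal, language-side example and assumes the adapter is stored in standard PEFT format; the full multimodal pipeline (CLIP encoder, projector weights, image preprocessing, chat template) is handled by the LLaVA++ codebase linked above.

```python
# Minimal sketch, assuming the cloned folder holds a standard PEFT adapter.
# Assembling the full multimodal model requires the LLaVA++ codebase;
# this only shows attaching the LoRA weights to the Phi-3 language model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_dir = "LLaVA-Phi-3-mini-4k-instruct-lora"   # local clone from the step above

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, trust_remote_code=True)

# Attach the LoRA adapter and (optionally) merge it into the base weights.
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()   # yields a plain model with merged weights
```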

---

## License

This project is available under the MIT License.

## Contributions

Contributions are welcome! Please 🌟 our repository [LLaVA++](https://github.com/mbzuai-oryx/LLaVA-pp) if you find this model useful.

---