# CLIPfa: Connecting Farsi Text and Images
OpenAI released the paper [`Learning Transferable Visual Models From Natural Language Supervision`](https://arxiv.org/abs/2103.00020), in which they present the CLIP (Contrastive Language–Image Pre-training) model. CLIP is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective. It consists of two separate models, a vision encoder and a text encoder, which were trained on 400 million images with corresponding captions. We have trained a Farsi (Persian) version of OpenAI's CLIP on a dataset of 400,000 (image, text) pairs: we used [`Farahani's RoBERTa-fa`](https://huggingface.co/m3hrdadfi/roberta-zwnj-wnli-mean-tokens) as the text encoder, took the [`ViT`](https://huggingface.co/openai/clip-vit-base-patch32) vision encoder from the original CLIP, and fine-tuned both.

![CLIPfa image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/clipfa.png)

It should be noted that only 400K pairs were used for this training, whereas the original CLIP was trained on 400 million pairs, a run that took 18 days on 592 V100 GPUs for the largest ResNet model.
## How to use?
Both models generate 768-dimensional vectors, so text and image embeddings can be compared directly.
```python
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer, CLIPFeatureExtractor
from PIL import Image

# download the pre-trained models
vision_encoder = CLIPVisionModel.from_pretrained('SajjadAyoubi/clip-fa-vision')
preprocessor = CLIPFeatureExtractor.from_pretrained('SajjadAyoubi/clip-fa-vision')
text_encoder = RobertaModel.from_pretrained('SajjadAyoubi/clip-fa-text')
tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')

# define the input image and input text
text = 'something'
image = Image.open('my_favorite_image.jpg')

# compute embeddings
text_embedding = text_encoder(**tokenizer(text, return_tensors='pt')).pooler_output
image_embedding = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
assert text_embedding.shape == image_embedding.shape
```
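Since both embeddings live in the same space, the cosine similarity between them tells you how well a caption matches an image. A minimal sketch building on the snippet above (the normalize-and-dot-product step is standard practice, not a CLIPfa-specific API):

```python
# normalize both embeddings to unit length, then take the dot product
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
similarity = (text_embedding @ image_embedding.T).item()  # in [-1, 1]; higher = better match
print(f'similarity: {similarity:.3f}')
```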
## Demo:
The following are some use cases of CLIPfa, run over 25K [`Unsplash images`](https://github.com/unsplash/datasets).
- Install it with `pip install -q git+https://github.com/sajjjadayobi/clipfa.git`
```python
from clipfa import CLIPDemo

demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
demo.compute_text_embeddings(['سیب', 'موز', 'آلبالو'])  # apple, banana, sour cherry
demo.compute_image_embeddings(test_df.image_path.to_list())
```
### Image Search:
```python
demo.image_search(query='غروب خورشید')  # query: "sunset"
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/image_search.png)
```python
demo.image_search(query='جنگل در زمستان برفی')  # query: "a forest in snowy winter"
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/forest%20in%20winter.png)
### Analogy:
```python
demo.anology('sunset.jpg', additional_text='دریا')  # additional text: "sea"
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/analogy-sea.png)
```python
demo.anology('sunset.jpg', additional_text='برف')  # additional text: "snow"
```
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/analogy-snow.png)
### Zero Shot Image Classification:
```python
demo.zero_shot(image_path='apples.jpg')
```
- The provided labels are returned with a probability per image (گاو = cow, ماهی = fish, اسب = horse); the highest probability is bolded.

| گاو:36 , ماهی:22, اسب:**42** | گاو:**41** , ماهی:23, اسب:36 | گاو:26 , ماهی:**45**, اسب:27 |
| :---: | :---: | :---: |
| ![image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/horse.jpg) | ![image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/cow.jpg) | ![image](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/fish.jpg) |
## Online Demo: [CLIPfa at Huggingface🤗 spaces](https://huggingface.co/spaces/SajjadAyoubi/CLIPfa-Demo)
We used a small set of images (25K) to keep this app almost real-time, but the quality of image search naturally depends heavily on the size of the image database.
![](https://github.com/sajjjadayobi/CLIPfa/blob/main/assets/hf-spaces.png)
## Dataset: 400K
We started with the question of how much the original CLIP model depends on its huge training dataset and the wide range of concepts it covers. Our model shows that it is possible to reach an acceptable target with only a small amount of data, even though such a model may not have seen enough concepts and subjects to be used broadly. The model was trained on a dataset gathered from several sources, such as Flickr30k, MS-COCO 2017, and Google CC3M. We translated the captions into Persian with a [`tool`](https://github.com/sajjjadayobi/CLIPfa/blob/main/clipfa/data/translation.py) we built ourselves: combining Google Translate with a multilingual similarity check, it takes a list of English captions and keeps only the best translations.
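- Note: the snippet below sketches the filtering idea with a multilingual sentence-embedding model; it is an illustration of the approach, not the exact script in the repository (the model name and threshold are assumptions).

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def keep_translation(english: str, farsi: str, threshold: float = 0.8) -> bool:
    # embed both captions in a shared multilingual space and keep the pair
    # only if the translation stays close to the original
    embeddings = model.encode([english, farsi], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold

keep_translation('a dog runs on the beach', 'سگی در ساحل می‌دود')  # -> True/False
```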
- Note: We used [`img2dataset`](https://github.com/rom1504/img2dataset), a great tool for downloading large-scale image datasets such as MS-COCO. It can download, resize, and package 100M URLs in 20 hours on one machine, and it also saves captions for url+caption datasets.
- [`coco-flickr-fa 130K on Kaggle`](https://www.kaggle.com/navidkanaani/coco-flickr-farsi)
## Training: <a href="https://colab.research.google.com/github/sajjjadayobi/CLIPfa/blob/main/notebook/CLIPfa_Training.ipynb"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=CLIPfa Training&color=white"></a>
Any dataset can be used with little change to the [`training code`](https://github.com/sajjjadayobi/CLIPfa/tree/main/clipfa). CLIPfa can also be trained with other encoders, as long as their last layers have the same hidden size. In [`this notebook`](https://github.com/sajjjadayobi/CLIPfa/blob/main/notebook/CLIPfa_Training.ipynb) I used the [`training code`](https://github.com/sajjjadayobi/CLIPfa/tree/main/clipfa) to train a small CLIP on the translated [`flickr30K`](https://www.kaggle.com/sajjadayobi360/flickrfa) dataset.
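- Note: at its core, CLIP-style training minimizes a symmetric contrastive loss over the in-batch text-image similarity matrix. Below is a minimal sketch of that objective (a simplified illustration, not the exact implementation in the training code; the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def clip_loss(text_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    # normalize, then score every text against every image in the batch
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature
    # matching pairs sit on the diagonal: pull them together, push the rest apart
    targets = torch.arange(len(logits), device=logits.device)
    loss_text = F.cross_entropy(logits, targets)     # text -> image direction
    loss_image = F.cross_entropy(logits.T, targets)  # image -> text direction
    return (loss_text + loss_image) / 2
```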
## Citation: ↩️
If you have a technical question regarding the model, the code, or a publication, please create an issue in the repository.
We have not published any paper on this work; however, if you use it, please cite us with an entry like the one below.
```bibtex
@misc{CLIPfa,
  author = {Sajjad Ayoubi and Navid Kanaani},
  title = {CLIPfa: Connecting Farsi Text and Images},
  year = 2021,
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SajjjadAyobi/CLIPfa}},
}
```
> Made with ❤️ in my basement🤫