KaLM-Embedding
KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs and trained on superior training data.
KaLM-embedding-multilingual-mini-v1 is trained from Qwen/Qwen2-0.5B with massive pre-training and fine-tuning data.
📑 Open-source Plan
- Model Checkpoint
  - KaLM-embedding-multilingual-mini-v1
  - KaLM-embedding-multilingual-max-v1
- Technical Report
- Training and Evaluation Code
- Training Data
Usage
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME_OR_PATH}') # Do NOT set trust_remote_code
model.max_seq_length = 512
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True
)
print(embeddings)
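Because `normalize_embeddings=True` returns unit-length vectors, the dot product of two embeddings equals their cosine similarity. The following is a minimal sketch of that comparison step, not part of the original model card. The helper name `dot_similarity` and the toy vectors are hypothetical stand-ins for the output of `model.encode(...)`.

```python
def dot_similarity(a, b):
    """Return the matrix of dot products between rows of a and rows of b.

    For unit-length rows this is the cosine similarity matrix.
    Works on nested lists and on 2-D arrays that iterate row by row.
    """
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


# Hypothetical stand-in for two normalized embeddings from model.encode(...).
toy_embeddings = [[1.0, 0.0], [0.6, 0.8]]

scores = dot_similarity(toy_embeddings, toy_embeddings)
for row in scores:
    print(row)
```

With real embeddings, pass the array returned by `model.encode` in place of `toy_embeddings`. The diagonal should be close to 1 and the off-diagonal entries give the pairwise similarity.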
We add instructions for classification and clustering tasks. To add an instruction to the query (with no instruction for the corpus), use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('{MODEL_NAME_OR_PATH}') # Do NOT set trust_remote_code
model.max_seq_length = 512
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True
)
print(embeddings)
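The query-only instruction can be sketched for a query/corpus setup as follows. This is an illustrative sketch under assumptions, not the authors' code. The helpers `build_query_prompt`, `embed_for_retrieval`, and `rank_by_score` are hypothetical names, and the prompt template is inferred from the single example prompt above. The only library call relied on is `model.encode` with the `prompt` and `normalize_embeddings` arguments already used in this section.

```python
def build_query_prompt(task):
    # Assumed template, inferred from the one example prompt in this card.
    return f"Instruct: {task} \n Query: "


def embed_for_retrieval(model, queries, corpus, task):
    """Encode queries with an instruction prompt and corpus texts without one."""
    query_emb = model.encode(
        queries,
        prompt=build_query_prompt(task),
        normalize_embeddings=True,
    )
    corpus_emb = model.encode(corpus, normalize_embeddings=True)
    return query_emb, corpus_emb


def rank_by_score(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k documents by dot product."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]
```

A typical call would load the model as shown above, run `embed_for_retrieval(model, queries, corpus, task)`, and then pass one query embedding and the corpus embeddings to `rank_by_score`.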
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 74.160
- ap on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 22.731
- ap_weighted on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 22.731
- f1 on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 61.311
- f1_weighted on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 78.921
- main_score on MTEB AmazonCounterfactualClassification (en-ext), test set (self-reported): 74.160
- accuracy on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 72.358
- ap on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 34.130
- ap_weighted on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 34.130
- f1 on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 65.911