KaLM-Embedding

KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs and trained on superior training data.

KaLM-embedding-multilingual-mini-v1 is trained from Qwen/Qwen2-0.5B with massive pre-training and fine-tuning data.

📑 Open-source Plan

- Model Checkpoint
  - KaLM-embedding-multilingual-mini-v1
  - KaLM-embedding-multilingual-max-v1
- Technical Report
- Training and Evaluation Code
- Training Data

Usage

Using this model is easy once sentence-transformers is installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model; do NOT set trust_remote_code.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')
model.max_seq_length = 512

# Encode the sentences into L2-normalized embeddings.
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
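
Because the snippet above passes normalize_embeddings=True, each returned vector has unit length, so a plain dot product between two embeddings is their cosine similarity. The sketch below illustrates that relationship with small hand-written vectors instead of real model output; the helper names `l2_normalize` and `cosine_similarity` are illustrative assumptions, not part of sentence-transformers or this model.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors that are assumed to be unit length already."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for model embeddings (hypothetical values).
u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])
print(cosine_similarity(u, v))
```

With real output from model.encode (a NumPy array by default), the same quantity should be obtainable as `embeddings[0] @ embeddings[1]`.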

We add instructions for classification and clustering tasks. To add an instruction to the query (the corpus gets no instruction), use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model; do NOT set trust_remote_code.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')
model.max_seq_length = 512

# The instruction is prepended to every input before encoding.
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
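
The asymmetric case mentioned above (an instruction on the query, none on the corpus) could be sketched as follows. The helpers `build_prompt` and `encode_query_and_corpus` are hypothetical names introduced here for illustration, and the prompt layout simply copies the "Instruct: ... Query: " string from the example above; whether other task descriptions work well is an assumption to verify against the model.

```python
from typing import Any, Sequence, Tuple


def build_prompt(task_description: str) -> str:
    """Format a task description in the instruction layout shown above."""
    return f"Instruct: {task_description} \n Query: "


def encode_query_and_corpus(
    model: Any,
    queries: Sequence[str],
    corpus: Sequence[str],
    task_description: str,
) -> Tuple[Any, Any]:
    """Encode queries with an instruction prompt and corpus texts without one.

    `model` is assumed to be a sentence_transformers.SentenceTransformer,
    or any object exposing a compatible encode() method.
    """
    query_embeddings = model.encode(
        queries,
        prompt=build_prompt(task_description),
        normalize_embeddings=True,
    )
    # No prompt here: the corpus is encoded as plain text.
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True)
    return query_embeddings, corpus_embeddings
```

A call such as `encode_query_and_corpus(model, queries, corpus, "Classifying the category of french news.")` would reuse the instruction from the example above for the queries only.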