KaLM-Embedding

KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs and trained on superior training data.

KaLM-embedding-multilingual-mini-v1 is initialized from Qwen/Qwen2-0.5B and trained on massive pre-training and fine-tuning data.

📑 Open-source Plan

  • Model Checkpoint
    • KaLM-embedding-multilingual-mini-v1
    • KaLM-embedding-multilingual-max-v1
  • Technical Report
  • Training and Evaluation Code
  • Training Data

Usage

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512   # truncate inputs longer than 512 tokens

embeddings = model.encode(
    sentences,
    normalize_embeddings=True,   # L2-normalize each embedding
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
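
Because the embeddings above are requested with normalize_embeddings=True, the dot product of two rows equals their cosine similarity. The following is a minimal sketch of that step, not part of the model's own API: the helper names `dot` and `similarity_matrix` and the toy vectors are hypothetical, and the code is kept dependency-free for illustration. With the NumPy array returned by `model.encode`, `embeddings @ embeddings.T` computes the same matrix.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise dot products; equals cosine similarity for unit-length rows."""
    return [[dot(u, v) for v in embeddings] for u in embeddings]


# Hypothetical stand-in for model.encode(...) output: two unit-length vectors.
toy_embeddings = [[1.0, 0.0], [0.6, 0.8]]
scores = similarity_matrix(toy_embeddings)
print(scores)
```

The same function accepts the array returned by `model.encode` directly, since it only iterates over rows.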

We add instructions for classification and clustering tasks. To add an instruction to the query (the corpus takes no instruction), use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512   # truncate inputs longer than 512 tokens

# The instruction is prepended to every input sentence before encoding
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,   # L2-normalize each embedding
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
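
The query/corpus asymmetry described above (instruction on the query, none on the corpus) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' evaluation code: the helpers `build_prompt`, `encode_queries_and_corpus`, and `rank_corpus` are hypothetical names, the prompt layout is copied from the example above, and the toy vectors stand in for real embeddings so the ranking step can run without loading the model.

```python
from typing import List, Sequence, Tuple


def build_prompt(task: str) -> str:
    """Mirror the prompt layout used above: 'Instruct: <task> \\n Query: '."""
    return f"Instruct: {task} \n Query: "


def encode_queries_and_corpus(model, queries: List[str], corpus: List[str], task: str):
    """Encode queries with an instruction prompt and the corpus without one.

    `model` is assumed to be a sentence_transformers.SentenceTransformer.
    """
    query_emb = model.encode(queries, prompt=build_prompt(task), normalize_embeddings=True)
    corpus_emb = model.encode(corpus, normalize_embeddings=True)
    return query_emb, corpus_emb


def rank_corpus(
    query_vec: Sequence[float], corpus_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs sorted by descending dot product."""
    scored = [
        (i, sum(q * d for q, d in zip(query_vec, doc)))
        for i, doc in enumerate(corpus_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


prompt = build_prompt("Classifying the category of french news.")

# Hypothetical unit-length vectors standing in for real embeddings.
ranking = rank_corpus([1.0, 0.0], [[0.0, 1.0], [0.8, 0.6]])
print(ranking)
```

With a loaded model, `encode_queries_and_corpus(model, queries, corpus, task)` would supply the vectors passed to `rank_corpus`; the task string is whatever instruction suits the use case.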

Model size: 494M params (Safetensors, tensor type F32)