Cannot use Alibaba-NLP/gte-multilingual-base in Elasticsearch 8.12

#13 opened by Rene-A

I want to add this model to my Elasticsearch 8.12 instance on IBM Cloud using the eland Docker image and the eland_import_hub_model command. This has already worked for other sentence transformer models such as e5-large, e5-base, and sentence-transformers/distiluse-base-multilingual-cased-v1. However, with Alibaba-NLP/gte-multilingual-base I get the following error, which ultimately means that no vocabulary file is available:
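For reference, this is roughly the invocation I'm using (following the pattern from the Elastic docs; the endpoint and credentials below are placeholders):

```sh
docker run -it --rm --network host docker.elastic.co/eland/eland \
    eland_import_hub_model \
    --url https://my-es-host:9200/ \
    -u elastic -p '<password>' \
    --hub-model-id Alibaba-NLP/gte-multilingual-base \
    --task-type text_embedding \
    --start
```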

```
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2017, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 168, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 254, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 643, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 736, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2019, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
```
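If it helps narrow things down: since eland loads the tokenizer through transformers, the same error can presumably be reproduced outside of eland. A minimal sketch, under my assumption that the repository ships only the fast tokenizer.json and no sentencepiece vocabulary file (so the slow XLM-RoBERTa tokenizer has nothing to load):

```python
from transformers import AutoTokenizer

# The fast tokenizer loads fine, since the repo provides a tokenizer.json.
fast = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
print(type(fast).__name__)

# The slow tokenizer needs a sentencepiece vocabulary file; if that file is
# missing from the repo, vocab_file ends up as None and this raises the same
# 'Not found: "None"' / "Unable to load vocabulary from file" OSError as above.
slow = AutoTokenizer.from_pretrained(
    "Alibaba-NLP/gte-multilingual-base", use_fast=False
)
```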

What can I do about this to get the model imported into Elasticsearch as an embedding model?
