Multilingual E5: A Machine Learning Model for Embedding Text in Multiple Languages

David Cochard
axinc-ai
Jan 31, 2024

This is an introduction to Multilingual E5, a machine learning model for embedding text in multiple languages, enabling accurate text-similarity calculations across different languages.

Overview

Multilingual E5, released in December 2022, is a model designed for text embedding. Until its release, the standard choice for multilingual embedding in on-premises environments was the SentenceTransformers model paraphrase-multilingual-mpnet-base-v2, released in 2019. Multilingual E5 is a more recent, higher-accuracy model than its predecessor.

Training dataset

Text embedding resolves the problem of keyword mismatch and enables efficient information retrieval. However, earlier models were trained on limited labeled data or low-quality machine-translated data, which did not yield sufficient accuracy.

For example, the dataset used for the previous SentenceTransformers model, paraphrase-multilingual-mpnet-base-v2, is STSb (the Semantic Textual Similarity Benchmark) extended to 15 languages with Google Translate. To maintain data quality, sentence pairs with a translation confidence below 0.7 were dropped.

Multilingual E5 was trained on a much larger and more diverse dataset of cleaned text pairs from the internet, known as CCPairs. This dataset is constructed by combining sources such as community Q&A, CommonCrawl, and scientific papers, followed by a filtering step. This diverse, comprehensive training data is what gives E5 its more accurate and reliable embeddings.

More details on the dataset composition can also be found below.

https://huggingface.co/intfloat/multilingual-e5-base

Compatibility between SentenceTransformers and Multilingual E5

The embedding dimensionality of Multilingual E5 is 768 for the base model and 1024 for the large model. XLM-RoBERTa is used as the tokenizer, and since it employs the exact same SentencePiece model files as SentenceTransformers, the tokenizer can be reused as is. This means you can replace SentenceTransformers with Multilingual E5 in an application simply by swapping out the model.
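
As an illustration, here is a minimal sketch of such a swap using the sentence-transformers library. The model identifiers are the public Hugging Face names; note that the E5 model card recommends prefixing inputs with "query: " or "passage: ".

# Minimal sketch: swapping paraphrase-multilingual-mpnet-base-v2 for
# Multilingual E5 with the sentence-transformers library. Both base models
# produce 768-dimensional vectors, so downstream code can stay unchanged.
from sentence_transformers import SentenceTransformer

old_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
new_model = SentenceTransformer("intfloat/multilingual-e5-base")

# The E5 model card recommends "query: " / "passage: " prefixes on inputs.
texts = ["query: how does multilingual text embedding work?"]

print(old_model.encode(texts).shape)  # (1, 768)
print(new_model.encode(texts).shape)  # (1, 768)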

Limitations of Multilingual E5

The maximum token length that can be input into Multilingual E5 is 512. Exceeding this limit will result in an error during inference. It’s important to note that this is shorter than OpenAI’s text-embedding-ada-002, which has a maximum token length of 8191. Users should be aware of this limitation when working with longer texts to ensure they are within the allowable range for Multilingual E5.
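
As a sketch of how to stay within that limit, the snippet below counts tokens with the model's tokenizer (loaded via Hugging Face transformers) and truncates if needed; the sample text is a placeholder.

# Sketch: checking the 512-token limit with the XLM-RoBERTa tokenizer
# used by Multilingual E5 (via Hugging Face transformers).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")

text = "passage: " + "some long document text " * 200  # placeholder input
token_ids = tokenizer(text)["input_ids"]
print(len(token_ids))  # anything above 512 must be truncated or split

# One option: let the tokenizer truncate to the model's maximum length.
encoded = tokenizer(text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512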

Usage with ailia SDK

Multilingual E5 can be used with the ailia SDK via the following command.

$ python3 multilingual-e5.py -i sample.txt

In this sample, each line of sample.txt is embedded, and the text that is most similar to the input query is displayed.
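
Conceptually, the sample ranks lines by embedding similarity. Here is a sketch of the same flow using the sentence-transformers library rather than the ailia SDK itself; the query string is a placeholder.

# Sketch of the retrieval flow: embed each line of sample.txt, then print the
# line most similar to the query. Uses sentence-transformers, not the ailia SDK.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

with open("sample.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

# E5 convention: "passage: " for documents, "query: " for the search query.
passages = model.encode(["passage: " + l for l in lines], normalize_embeddings=True)
query = model.encode(["query: What is text embedding?"], normalize_embeddings=True)

scores = passages @ query.T  # cosine similarity, since vectors are normalized
print(lines[int(np.argmax(scores))])  # most similar line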

Usage with Unity

There is a sample available for using Multilingual E5 from Unity. By using the ailia SDK to calculate the relevance between a query and text on a PC, it’s possible to implement a serverless RAG (Retrieval-Augmented Generation) system, similar to langchain or llama-index.

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
