
Semantic Search With HuggingFace and Elasticsearch

Odunayo Ogundepo
Published in Better Programming · 8 min read · Nov 23, 2022


Dense embeddings are a game changer in machine learning, especially for search engines and recommender systems. They are already being applied in ad-hoc information retrieval, product search, recommendation engines, and more, and many companies are adopting some form of embedding-based search in their workflows.

For examples of how dense embeddings are being integrated at top companies, see the write-ups from Instacart, DoorDash, Etsy, Google, and Airbnb.

Talk to Books by Google is a good demo of how searching with dense vectors works. This form of search is popularly known as semantic search. The demo uses an encoder model to generate embeddings from documents (books, in this context) stored in an index; at search time, these embeddings are compared to a query vector to retrieve the documents most similar to the given query. Semantic search is a massive upgrade over traditional keyword search algorithms such as BM25 because it can retrieve documents that are relevant to a given query but do not necessarily contain the exact words of the query.

Side note: the model used in the demo (“universal-sentence-encoder”) is somewhat old by deep learning standards, and there are models that produce better embeddings, some of which can be found here.
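To make the idea concrete, here is a minimal sketch of semantic matching using the sentence-transformers library. The model name and the example texts are illustrative, not part of the original tutorial:

from sentence_transformers import SentenceTransformer, util

# A small general-purpose sentence encoder (illustrative choice)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "How do I fix a flat tire?"
passages = [
    "Patching a punctured bicycle wheel takes about ten minutes.",
    "The recipe calls for two cups of flour and a pinch of salt.",
]

# Encode the query and passages into dense vectors, then rank by cosine similarity
query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(query_emb, passage_embs)

# The passage about patching a wheel scores higher even though it shares no keywords with the query
print(scores)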

What Are Dense Embeddings?

A dense embedding is a numeric representation of data (text, users, products, etc.) as a high-dimensional vector. The vector length varies from model to model, and the vector is expected to encode enough information about the raw data that similar data points can be found with a vector similarity measure such as cosine similarity. See below for a simple implementation:

from sklearn.metrics.pairwise import cosine_similarity
from numpy import random

# Two random 10-dimensional vectors
array_vec_1 = random.rand(1, 10)
array_vec_2 = random.rand(1, 10)

# Cosine similarity between the two vectors (closer to 1 means more similar)
print(cosine_similarity(array_vec_1, array_vec_2))

Side note: The higher the cosine similarity between two vectors, the more similar they are. Cosine similarity is also used as a loss function for training neural networks; see PyTorch's CosineEmbeddingLoss.
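For instance, a minimal PyTorch snippet (the batch size, dimensions, and targets here are made up):

import torch
import torch.nn as nn

# CosineEmbeddingLoss compares pairs of embeddings
loss_fn = nn.CosineEmbeddingLoss(margin=0.0)

emb_a = torch.randn(4, 128)  # batch of 4 embeddings
emb_b = torch.randn(4, 128)

# target = 1 pulls a pair together, target = -1 pushes it apart
target = torch.tensor([1, 1, -1, -1], dtype=torch.float)

loss = loss_fn(emb_a, emb_b, target)
print(loss.item())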

To generate high-quality, information-rich embeddings, you need machine learning models that have been trained on millions of pairwise examples, and training techniques such as contrastive learning with hard negatives have been applied to produce high-quality embeddings. For semantic search and sentence representation in general, there are tons of publicly available pretrained or fine-tuned models on HuggingFace 🤗, as well as commercially available APIs for encoding text, such as Cohere embeddings and OpenAI embeddings.

Indexing and Searching

After encoding our documents (generating document embeddings), we now have to think about indexing the vectors and searching the dense index.

Vector search is usually done with (approximate) nearest neighbor algorithms, and it can be compute-intensive and challenging to implement for a variety of reasons, some of which are:

(i.) Some encoders produce representation vectors with many dimensions. Large embeddings lead to big embedding tables, which have a high memory cost when performing vector operations and can increase search latency.

(ii.) Updating the dense index with new vectors can be demanding as you might need to update the index clusters for new vectors.

To tackle the problems mentioned above, several open-source libraries have been built for fast vector search, such as Faiss from Meta, Annoy from Spotify, and ScaNN from Google. With version 8.0, Elasticsearch announced that its popular open-source search engine supports approximate nearest neighbor search.

Side note: See this discussion of drawbacks and solutions when implementing semantic search with dense vectors. Also, see this benchmark of different approximate nearest neighbor search algorithms and libraries.
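As a quick illustration of what these libraries look like in practice, here is a minimal Faiss sketch. It assumes faiss-cpu is installed, and the dimensions and data are made up:

import faiss
import numpy as np

dim = 384
doc_vectors = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(doc_vectors)  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)   # flat (exact) inner-product index
index.add(doc_vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)

# retrieve the 5 most similar document vectors
scores, ids = index.search(query, 5)
print(ids, scores)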

Approximate Search With Elasticsearch

Let's get to the interesting part!

We’ll be encoding and indexing the MS MARCO passage ranking collection, which consists of 8.8 million passages. The goal is to rank the passages based on their relevance to a given query. To download and unzip the collection, run:

wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip
unzip msmarco.zip

We can preview the data after downloading with the following code:

#inspect the data
head -1 msmarco/corpus.jsonl

#output
"""
{
'_id': '0',
'title': '',
'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.',
'metadata': {}
}
"""

Next, we need to encode the dataset. For this tutorial, we’ll use the “sentence-transformers/msmarco-MiniLM-L6-cos-v5” model hosted on HuggingFace. The model was trained on the MS MARCO passage ranking collection and produces a 384-dimensional embedding for each input sequence. We need to define an encoder that takes a piece of text or a batch of texts and generates an embedding for each.

The model generates an embedding for each token in the input sequence. We use mean pooling (also called average pooling) to aggregate the token embeddings into a single sentence embedding. Alternatively, we can use the embedding produced for the [CLS] token.

Side note: Here’s a primer on pooling

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


class MSMarcoEncoder:
    def __init__(self, model_name: str, device: str = 'cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.to(self.device)

    def encode(self, text, max_length: int):
        inputs = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
        inputs = inputs.to(self.device)
        with torch.no_grad():
            model_output = self.model(**inputs, return_dict=True)
        # Perform pooling
        embeddings = self.mean_pooling(model_output, inputs['attention_mask'])
        # Normalize embeddings
        embeddings = F.normalize(embeddings, p=2, dim=1)
        return embeddings.detach().cpu().numpy()

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


if __name__ == "__main__":
    encoder = MSMarcoEncoder('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
    # batch_info['text'] is a batch of passages produced by the collection iterator shown later
    embeddings = encoder.encode(batch_info['text'], 512)

Now that we have our encoder, we need to index the data in Elasticsearch, and to do that effectively we need an iterator that loops over the data in batches. Before that, however, we need to make sure Elasticsearch is running. First, download and start the Elasticsearch server locally by following the instructions here, or start an Elasticsearch container by following this.

For a local setup, after downloading elasticsearch, run the following command to start a cluster with one node:

./elasticsearch-8.5.1/bin/elasticsearch

Elasticsearch 8.0 and above has security enabled by default, so verify that you can connect to the running cluster with the command below:

curl --cacert config/certs/http_ca.crt -u elastic https://localhost:9200

Now that our Elasticsearch server is running, we need to create an index to store the data. To do that, we define a mapping for the different fields in our dataset. For nearest neighbor search, we need a field of type dense_vector , which will hold the embeddings for each document.

from elasticsearch import Elasticsearch

es_client = Elasticsearch(
    "https://localhost:9200",
    http_auth=("username", "password"),
    verify_certs=False)

config = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "text": {"type": "text"},
            "embeddings": {
                "type": "dense_vector",
                "dims": 384,
                # the vector field must be indexed with a similarity for kNN search
                "index": True,
                "similarity": "cosine"
            }
        }
    },
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    }
}

es_client.indices.create(
    index="msmarco-demo",
    settings=config["settings"],
    mappings=config["mappings"],
)

# check if the index has been created successfully
print(es_client.indices.exists(index="msmarco-demo"))
# True

The code snippet below uses the bulk API from Elasticsearch to index documents in batches. It is slow on a CPU and much faster on a GPU or TPU. Here we use a collection iterator class to loop through the dataset; for each batch, we generate the embeddings and index them.
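The collection iterator is a helper that yields batches of ids, titles, and texts from the JSONL corpus. If you don't have such a helper handy, a minimal stand-in could look like the hypothetical sketch below (it is not the exact class used in the original code, and it ignores sharding):

import json

class JsonlCollectionIterator:
    """Illustrative stand-in: yields batches of fields from a JSONL corpus."""

    def __init__(self, collection_path: str, fields):
        self.collection_path = collection_path
        self.fields = fields

    def __call__(self, batch_size: int = 256, shard_id: int = 0, shard_num: int = 1):
        # shard_id/shard_num are accepted for interface compatibility but ignored here
        batch = {"id": [], **{field: [] for field in self.fields}}
        with open(self.collection_path) as f:
            for line in f:
                doc = json.loads(line)
                batch["id"].append(doc["_id"])
                for field in self.fields:
                    batch[field].append(doc.get(field, ""))
                if len(batch["id"]) == batch_size:
                    yield batch
                    batch = {"id": [], **{field: [] for field in self.fields}}
        if batch["id"]:
            yield batch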

collection_path = 'path/to/corpus.jsonl'
collection_iterator = JsonlCollectionIterator(collection_path, fields=['title', 'text'])
encoder = MSMarcoEncoder('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
index_name = "msmarco-demo"

for batch_info in collection_iterator(batch_size=256, shard_id=0, shard_num=1):
    # encode the batch of passages into dense vectors
    embeddings = encoder.encode(batch_info['text'], 512)
    batch_info["dense_vectors"] = embeddings

    # build alternating action/document pairs for the bulk API
    actions = []
    for i in range(len(batch_info['id'])):
        action = {"index": {"_index": index_name, "_id": batch_info['id'][i]}}
        doc = {
            "title": batch_info['title'][i],
            "text": batch_info['text'][i],
            "embeddings": batch_info['dense_vectors'][i].tolist()
        }
        actions.append(action)
        actions.append(doc)

    es_client.bulk(index=index_name, operations=actions)

result = es_client.count(index=index_name)

#print the total number of documents in the index
print(result.body['count'])
#8841823

#output one document
print(es_client.get(index="msmarco-demo", id="0", request_timeout=60))

'''
{'_index': 'msmarco-demo', '_id': '0', '_version': 2, '_seq_no': 27, '_primary_term': 1, 'found': True,
'_source': {'title': '',
'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.',
'embeddings': [-0.032267116010189056, 0.05750396102666855,...]}}
'''

Now that indexing is done, we can search our index. Elasticsearch provides a Python wrapper for kNN search, and that is what we use in the search function defined in the snippet below. To search, we specify the number of nearest neighbors to return with the parameter k , the number of candidates to consider per shard with num_candidates , and the query embedding in query_vector .

def search(query: str, es_client: Elasticsearch, model: str, index: str, top_k: int = 10):

    encoder = MSMarcoEncoder(model)
    query_vector = encoder.encode(query, max_length=64)
    query_dict = {
        "field": "embeddings",
        "query_vector": query_vector[0].tolist(),
        "k": top_k,                      # number of nearest neighbors to return
        "num_candidates": 10 * top_k     # candidates per shard; must be >= k
    }
    res = es_client.knn_search(index=index, knn=query_dict, source=["title", "text", "id"])

    for hit in res["hits"]["hits"]:
        print(hit)
        print(f"Document ID: {hit['_id']}")
        print(f"Document Title: {hit['_source']['title']}")
        print(f"Document Text: {hit['_source']['text']}")
        print("=======================================================\n")


if __name__ == "__main__":

    search(query="What is the capital of France?",
           es_client=es_client,
           model="sentence-transformers/msmarco-MiniLM-L6-cos-v5",
           index=index_name)
#output
"""
{'_index': 'msmarco-demo', '_id': '82390', '_score': 0.81541693, '_source': {'text': "In terms of total household wealth, France is the wealthiest nation in Europe and fourth in the world. It also possesses the world's second-largest exclusive economic zone (EEZ), covering 11,035,000 square kilometres (4,261,000 sq mi).", 'title': ''}}
Document ID: 82390
Document Title:
Document Text: In terms of total household wealth, France is the wealthiest nation in Europe and fourth in the world. It also possesses the world's second-largest exclusive economic zone (EEZ), covering 11,035,000 square kilometres (4,261,000 sq mi).
=====================================================================

{'_index': 'msmarco-demo', '_id': '162291', '_score': 0.80739325, '_source': {'text': 'Paris in France lies on the Seine River. The docking location is Port de Grenelle/Quai de Grenelle. As one of the largest cities in Europe, finding a property that suit your budget is not a problem. Choose between low cost guest rooms to luxury 4 and 5 star hotels and apartments to rent.', 'title': ''}}
Document ID: 162291
Document Title:
Document Text: Paris in France lies on the Seine River. The docking location is Port de Grenelle/Quai de Grenelle. As one of the largest cities in Europe, finding a property that suit your budget is not a problem. Choose between low cost guest rooms to luxury 4 and 5 star hotels and apartments to rent.
=====================================================================
"""

Vector search is a hot area in machine learning right now, and there are many other great libraries and services that provide vector search capabilities, such as Pinecone, Jina AI, Weaviate, and Qdrant.

Voila! It was fun developing this with GitHub Copilot. To index and search offline, don’t hesitate to check out our awesome library Pyserini. Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture.
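As a rough sketch of how short first-stage retrieval can be with Pyserini (assuming a recent release and its prebuilt MS MARCO passage index, which the library downloads on first use):

# Minimal BM25 baseline with Pyserini's prebuilt MS MARCO passage index (illustrative)
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search("What is the capital of France?", k=10)

for hit in hits:
    print(hit.docid, hit.score)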

The entire code for this tutorial can be found here.

References

  1. https://towardsdatascience.com/how-to-index-elasticsearch-documents-with-the-bulk-api-in-python-b5bb01ed3824
  2. https://www.elastic.co/
