Information Retrieval: Sparse and Dense Vector Search

When talking about information retrieval, vector embeddings play an indispensable role. They translate documents and queries into numerical vectors, enabling us to sift through a vector database to identify similar vectors. Two distinct types of vector embeddings – sparse and dense – each offer unique advantages and present certain challenges.

Sparse vectors, produced by algorithms such as TF-IDF or BM25, are characterized by high dimensionality and an abundance of zero values. Their sparse nature has been the subject of extensive research over the years, resulting in compact data structures and efficient retrieval algorithms designed specifically for these vectors. Sparse vectors are typically faster to retrieve and offer good baseline performance. They don’t necessitate model fine-tuning and provide exact matching of terms. However, they have their limitations. For instance, their performance cannot be significantly improved over the baseline, and they suffer from a vocabulary mismatch problem. They also struggle to capture the kind of abstract, concept-level matching that comes naturally to humans.
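For illustration, here is a minimal sketch of sparse retrieval, assuming scikit-learn; the toy corpus and queries are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; real systems index millions of documents.
docs = [
    "the cat sat on the mat",
    "dogs are loyal companions",
    "cats and dogs can live together",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)      # SciPy sparse matrix, mostly zeros

query_vec = vectorizer.transform(["cat on a mat"])
scores = cosine_similarity(query_vec, doc_vectors)[0]
print(docs[scores.argmax()])                      # exact term overlap wins

# Vocabulary mismatch: "feline" shares no terms with the corpus, so every score is 0.
print(cosine_similarity(vectorizer.transform(["feline"]), doc_vectors)[0])
```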

On the other hand, dense vectors are lower-dimensional but information-rich, populated with non-zero values in most or all dimensions. They are often constructed using neural network models, such as transformers, enabling them to represent more abstract information like the semantic meaning behind a piece of text. Dense vectors can outperform sparse vectors with fine-tuning and facilitate searches using human-like abstract concepts. They also support multi-modality (text, images, audio, etc.) and cross-modal search (e.g., text-to-image). However, they require training data, making them challenging to implement in low-resource scenarios. They don’t generalize well, particularly for niche terminology, and require more compute and memory than sparse vectors. Additionally, they don’t offer exact matches and are not easily interpretable.
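For contrast, a minimal dense-retrieval sketch, assuming the sentence-transformers library and the public all-MiniLM-L6-v2 checkpoint (any comparable encoder would do):

```python
from sentence_transformers import SentenceTransformer, util

# Public general-purpose encoder; any similar model would work here.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "the cat sat on the mat",
    "dogs are loyal companions",
    "cats and dogs can live together",
]
doc_embeddings = model.encode(docs)               # dense vectors, ~384 non-zero dims

# The query shares no terms with the corpus, yet the semantic match is found.
query_embedding = model.encode("a feline resting on a rug")
scores = util.cos_sim(query_embedding, doc_embeddings)
print(docs[int(scores.argmax())])                 # "the cat sat on the mat"
```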

Given these characteristics, it’s clear that a perfect solution would merge the strengths of both sparse and dense vectors, but this remains challenging to achieve.

One common approach to addressing this issue is a two-stage retrieval and ranking system. The first stage uses a sparse method to retrieve a large set of candidate documents, which are then passed to the second stage, where a dense model re-ranks the results based on their relevance to the query. This approach combines the efficiency of sparse retrieval with the accuracy of dense reranking. However, the two-stage design has its drawbacks. It can be slower than a single-stage system using approximate search algorithms, it is more complex and thus introduces more engineering challenges, and it relies heavily on the first-stage retriever returning relevant results.
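A minimal sketch of the two-stage pattern, assuming the rank_bm25 and sentence-transformers libraries; the cross-encoder name is one public checkpoint, used purely for illustration:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

docs = [
    "Vector databases index embeddings for fast similarity search.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Cats and dogs can live together peacefully.",
]
query = "how do vector databases find similar items"

# Stage 1: cheap sparse retrieval over the whole corpus.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse_scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(docs)), key=lambda i: sparse_scores[i], reverse=True)[:2]

# Stage 2: an expensive dense model scores only the shortlisted candidates.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, docs[i]) for i in candidates])
ranked = sorted(zip(candidates, rerank_scores), key=lambda p: p[1], reverse=True)
print(docs[ranked[0][0]])
```

The shortlist size (here 2, typically hundreds) is the main knob: a larger shortlist improves recall but makes the expensive reranking stage slower.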

Recognizing these challenges, researchers have focused on improving single-stage retrieval systems. A key development in this space is the introduction of SPLADE (Sparse Lexical and Expansion models), a highly performant sparse embedding model. SPLADE leverages a pretrained language model like BERT to identify connections between words/sub-words (termed “terms” in this context) and uses that knowledge to enhance our sparse vector embedding. It not only weighs the relevance of different terms but also enables term expansion, including alternative but relevant terms beyond those found in the original sequence.

The real power of SPLADE lies in its ability to learn term expansions. Traditional methods required rule-based term expansion, which is time-consuming and fundamentally limited, whereas SPLADE leverages pretrained language models to learn term expansions and even adapt them based on sentence context. This ability is crucial in minimizing the vocabulary mismatch problem, the common situation where queries and relevant documents share few terms because of the complexity of language and the multitude of ways we can describe something (for example, a query about “fixing a flat tire” may miss a relevant document about “repairing a puncture”).

To build its sparse embeddings, SPLADE starts with a transformer model like BERT with a Masked-Language Modeling (MLM) head, a typical pretraining setup used by many transformers. This model can be pretrained from scratch, though it is more common to start from an off-the-shelf checkpoint.
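As a sketch of how such a model yields a sparse vector, the snippet below applies the standard SPLADE pooling (log-saturated ReLU of the MLM logits, max-pooled over the sequence); the checkpoint name is one publicly released SPLADE model and stands in for any SPLADE-trained MLM:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# One publicly released SPLADE checkpoint, used here for illustration.
model_name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("vector search with sparse embeddings", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # (1, seq_len, vocab_size)

# SPLADE pooling: log-saturated ReLU of the MLM logits, max-pooled over the
# sequence, gives one weight per vocabulary term (a sparse, vocab-sized vector).
weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
sparse_vec = weights.max(dim=1).values.squeeze()

nonzero = int((sparse_vec > 0).sum())
print(f"{nonzero} active terms out of {sparse_vec.numel()}")
```

Because each dimension maps to a vocabulary term, the non-zero entries can be inspected directly, including expansion terms that never appeared in the input text.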

To further enhance information retrieval systems, researchers have explored methods like deep semantic hashing and neural hashing. Deep semantic hashing involves training a deep neural network to transform high-dimensional data into binary hash codes. It aims to preserve semantic similarity, so that similar documents receive similar codes, making relevant documents easier to retrieve. Neural hashing employs a neural network in a similar spirit: it is trained to generate a compact binary code for each document in the dataset, effectively creating a hash table for efficient retrieval.
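To make the hashing idea concrete, here is a minimal sketch that binarizes embeddings and retrieves by Hamming distance; a trained deep hashing network would supply the projection, for which a random matrix stands in here:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 384, 64
# Placeholder for a trained network: random hyperplanes (an LSH-style baseline).
projection = rng.standard_normal((dim, n_bits))

def to_hash(embedding: np.ndarray) -> np.ndarray:
    """Map a dense embedding to a compact binary code."""
    return (embedding @ projection > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

# Stand-ins for real document embeddings.
doc_embeddings = rng.standard_normal((1000, dim))
doc_codes = np.array([to_hash(e) for e in doc_embeddings])

# A slightly perturbed copy of document 42 should map to a nearby code.
query_code = to_hash(doc_embeddings[42] + 0.05 * rng.standard_normal(dim))
best = min(range(len(doc_codes)), key=lambda i: hamming(query_code, doc_codes[i]))
print(best)  # recovers document 42: similar inputs share similar codes
```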

The landscape of information retrieval is complex and evolving. While existing methods offer unique strengths and weaknesses, researchers continue to innovate, exploring new ways to improve these systems. Whether it’s enhancing single-stage retrieval systems, developing more efficient hashing techniques, or finding ways to marry the benefits of sparse and dense vectors, the field continues to push the boundaries of what’s possible in information retrieval.