Can Long-Context Tasks be Solved with RAG?
October 17, 2024
PALO ALTO, CALIFORNIA

Building on our prior work solving complex multi-hop reasoning with retrieval-augmented generation (RAG), we demonstrate a retrieval system that can solve the Hash-Hop task on contexts of up to 100M tokens in only a few seconds on a standard CPU, and that can be paired with any off-the-shelf LLM. Our approach uses sparse embeddings to save CPU memory when handling long contexts and enables extremely rapid retrieval on CPU by exploiting sparse matrix multiplication routines. We use a sparse variant of our previous personalized PageRank graph retriever to perform searches over 100M-token contexts in under 100 ms. Since our retriever runs entirely on CPU, it can operate alongside the GPU-based LLM without competing for GPU resources.
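To make the two ingredients concrete, here is a minimal sketch, not our released code, of how sparse-embedding retrieval and a sparse personalized-PageRank walk can both run as sparse matrix operations on CPU. All function names and parameters are illustrative assumptions; the sketch uses SciPy's sparse routines as a stand-in for the optimized kernels described above.

```python
# Illustrative sketch: CPU retrieval with sparse embeddings and a sparse
# personalized-PageRank (PPR) walk. Names and parameters are hypothetical.
import numpy as np
import scipy.sparse as sp


def sparse_retrieve(doc_embs: sp.csr_matrix, query: np.ndarray, k: int = 5):
    """Score every chunk with one sparse mat-vec; return top-k indices and scores."""
    scores = doc_embs @ query            # sparse (n_docs x d) @ dense (d,) -> dense (n_docs,)
    top_k = np.argsort(scores)[::-1][:k]
    return top_k, scores


def personalized_pagerank(adj: sp.csr_matrix, seed: np.ndarray,
                          alpha: float = 0.85, iters: int = 20) -> np.ndarray:
    """Power iteration for PPR over a sparse chunk graph, seeded by retrieval scores."""
    # Column-normalize the adjacency so each column sums to 1 (column-stochastic).
    col_sums = np.asarray(adj.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    P = adj @ sp.diags(1.0 / col_sums)
    p = seed / seed.sum()                # restart distribution from the seed scores
    r = p.copy()
    for _ in range(iters):
        # Each step is a sparse mat-vec, so the whole walk stays on CPU.
        r = alpha * (P @ r) + (1.0 - alpha) * p
    return r
```

In this sketch the retrieval scores seed the restart distribution of the PageRank walk, so hops through the chunk graph can surface documents that the embedding similarity alone would miss, which is what makes the approach suitable for multi-hop tasks like Hash-Hop.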

Authors
Nick Alonso, Beren Millidge
Collaborators
Daniel A. Roberts (Sequoia Capital & MIT), Andrey Gromov (Meta FAIR), Kushal Tirumala (Meta FAIR), and Hassan Shapourian (Cisco)