# FLARE pattern
The FLARE pattern is an extension of the Retrieval Augmented Generation (RAG) pattern that adds a feedback loop to improve the quality of the answers provided by the LLM.
You can read the original paper describing the FLARE pattern, "Active Retrieval Augmented Generation" (Jiang et al., 2023).
Please note that FLARE can currently be implemented only with OpenAI models and the ai-text-completions agent, because it requires the log probability (logprob) of each generated token.
## How does FLARE work?
The idea behind FLARE is quite simple: using the Completions API, the LLM returns a probability for each token in the generated text. With this information, we can identify the tokens that are most likely to be wrong and retrieve additional documents to automatically build a better prompt for the LLM.
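For illustration only (the text and numbers below are made up), a Completions response with logprobs enabled carries each generated token together with its log probability. A run of low-probability tokens is what the pattern later treats as an "uncertain span":

```json
{
  "text": "LangStream was first released in 2019",
  "logprobs": {
    "tokens": ["LangStream", " was", " first", " released", " in", " 2019"],
    "token_logprobs": [-0.11, -0.04, -0.31, -0.07, -0.02, -2.30]
  }
}
```

Here the last token has a log probability of -2.30 (roughly a 10% probability), so the claim about the year would be flagged as uncertain and turned into an additional retrieval query.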
This is the flow of a FLARE pipeline:

1. Start with a query text.
2. Add the query text to the list of queries for the vector database ("documents to retrieve").
3. Compute the embeddings for each document to retrieve.
4. Look up relevant documents in a vector database.
5. Add the results to the list of documents to use to build the prompt.
6. Build a prompt with the query text and the most relevant documents.
7. Query the LLM with the prompt to get the final response, and the logprobs for each token.
8. Build a list of "uncertain spans" (sequences of tokens with low probability).
9. If there is at least one uncertain span, add the spans to the "documents to retrieve" list and go back to step 3.
10. If there are no uncertain spans, or we have done too many iterations, return the response to the user.
As you can see at step 9, there is a feedback loop that allows us to improve the quality of the answer.
In LangStream we implement this loop by sending the current record back to the topic that feeds step 3 of the pipeline. This is simple and intuitive, and it allows you to implement a FLARE pipeline with just a few lines of code.
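As a minimal sketch of that wiring (the topic name, agent names, and the `loop-topic` property are placeholders; most configuration is omitted), the same topic is both the input of the embedding step and the topic the flare-controller writes back to:

```yaml
topics:
  - name: "flare-loop-input-topic"
    creation-mode: create-if-not-exists

pipeline:
  # step 3 of the flow reads from the loop topic
  - name: "compute-embeddings"
    type: "compute-ai-embeddings"
    input: "flare-loop-input-topic"
    # ...
  # at the end of the pipeline, the flare-controller writes the record
  # back to the same topic to trigger another iteration
  - name: "flare-controller"
    type: "flare-controller"
    configuration:
      loop-topic: "flare-loop-input-topic"
```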
### Benefits of using a topic to perform the loop
The presence of a buffer topic to implement the loop has several benefits:
- You can retry in case of failure while computing the embeddings or querying the vector database
- Back pressure is handled automatically by the platform
- You can deal with multiple queries at the same time
- You can batch requests to the embedding service
- You can batch requests to the vector database
- You can prevent the vector database and the embedding service from being overloaded
Basically, a buffer topic allows you to build a scalable version of the Flare pattern.
## Using the Flare Controller Agent
The FLARE loop is handled by the flare-controller agent, which implements steps 8, 9, and 10 of the flow above:
- collects the list of uncertain spans
- adds the uncertain spans to the list of documents to retrieve
- triggers the loop (by writing to the input topic of the FLARE loop)
- handles the maximum number of iterations
By default, the flare-controller agent uses a message field called "flare_iterations" to keep track of the number of iterations.
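A sketch of a flare-controller configuration, assuming the completions step has written the generated tokens and their logprobs into the message. The property names shown here (tokens-field, logprobs-field, loop-topic, retrieve-documents-field) and the topic name are assumptions to be checked against the agent documentation:

```yaml
- name: "flare-controller"
  type: "flare-controller"
  configuration:
    # fields holding the generated tokens and their log probabilities
    tokens-field: "value.tokens"
    logprobs-field: "value.logprobs"
    # topic used to trigger another iteration of the FLARE loop
    loop-topic: "flare-loop-input-topic"
    # list of texts to retrieve documents for; uncertain spans are appended here
    retrieve-documents-field: "value.documents_to_retrieve"
```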
## Using the embedding service over a list of documents
To implement the Flare pattern, you need to query the embedding service multiple times. LangStream provides an easy way to perform the same operation over a list of documents with the 'loop-over' capability.
In the example below we use the 'loop-over' capability to query the embedding service for each document in the list of documents to retrieve.
When you use "loop-over", the agent executes for each element in a list instead of operating on the whole message. Use "record.xxx" to refer to the current element in the list.
The snippet above computes the embeddings for each element in the list "documents_to_retrieve". The list is expected to be a struct like this:
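For example (the values are placeholders; the `text` field name matches what the snippet above reads):

```json
{
  "documents_to_retrieve": [
    { "text": "the original user question" },
    { "text": "an uncertain span from the previous iteration" }
  ]
}
```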
After running the agent the contents of the list are:
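Again with placeholder values, each element now carries the embeddings computed for its text:

```json
{
  "documents_to_retrieve": [
    { "text": "the original user question", "embeddings": [0.012, -0.034, 0.056] },
    { "text": "an uncertain span from the previous iteration", "embeddings": [0.078, 0.001, -0.023] }
  ]
}
```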
## Querying the vector database over a list of documents
To implement the Flare pattern, you need to query the vector database to look up documents relevant to more than one input query. LangStream provides an easy way to perform the same operation over a list of documents with the 'loop-over' capability.
In the example below we use the 'loop-over' capability to query the database for each document in the list of documents to retrieve.
When you use "loop-over", the agent executes for each element in a list instead of operating on the whole message. Use "record.xxx" to refer to the current element in the list.
The snippet above queries the vector database for each element in the list "documents_to_retrieve". The list is expected to be a struct like this:
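For example (placeholder values, with the embeddings produced by the previous step):

```json
{
  "documents_to_retrieve": [
    { "text": "the original user question", "embeddings": [0.012, -0.034, 0.056] },
    { "text": "an uncertain span from the previous iteration", "embeddings": [0.078, 0.001, -0.023] }
  ]
}
```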
The agent then adds all the results to a new field named "retrieved_documents" in the message.
After running the agent, the message contains both the original list and the new "retrieved_documents" field:
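Something along these lines (the exact shape of each result depends on the columns your query selects):

```json
{
  "documents_to_retrieve": [
    { "text": "the original user question", "embeddings": [0.012, -0.034, 0.056] }
  ],
  "retrieved_documents": [
    { "text": "a documentation chunk relevant to the question" },
    { "text": "another chunk relevant to one of the uncertain spans" }
  ]
}
```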
This behaviour is different from the "compute-ai-embeddings" agent. The "query-vector-db" agent used here adds the results to the message instead of replacing the original list, and the results are all added to the same field.
## Example
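The complete, working pipeline lives in the LangStream repository; the following is only a condensed sketch of how the agents described on this page might be chained together (topic names, agent names, and the omitted configuration are placeholders):

```yaml
topics:
  - name: "flare-loop-input-topic"
    creation-mode: create-if-not-exists
  - name: "answers-topic"
    creation-mode: create-if-not-exists

pipeline:
  - name: "compute-embeddings"
    type: "compute-ai-embeddings"
    input: "flare-loop-input-topic"
    configuration:
      loop-over: "value.documents_to_retrieve"
      # ... see the embedding snippet above
  - name: "lookup-related-documents"
    type: "query-vector-db"
    configuration:
      loop-over: "value.documents_to_retrieve"
      output-field: "value.retrieved_documents"
      # ... see the vector database snippet above
  - name: "generate-answer"
    type: "ai-text-completions"
    configuration:
      completion-field: "value.answer"
      # the completion must be requested with logprobs so the
      # flare-controller can detect uncertain spans
      # ...
  - name: "flare-controller"
    type: "flare-controller"
    output: "answers-topic"
    configuration:
      loop-topic: "flare-loop-input-topic"
      # ... see the flare-controller snippet above
```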
## What’s next?
Do you have a website lying around just waiting to be turned into a useful chatbot? This complete pipeline is available in the LangStream repo, and running it on your own is no sweat.