Vector Databases
Vector databases are a crucial part of the generative AI workflow. They store vector representations (embeddings) of text, images, sound, and other data, and include search tools for finding semantically similar data across those embeddings.
Vector databases are typically used as part of retrieval-augmented generation (RAG). In the RAG workflow, relevant documents or passages are retrieved from a vector database based on their semantic relevance. These documents or passages are then included in the LLM prompt to provide additional context for the LLM to use when generating a response. This pattern is used to:
Reduce AI hallucinations
Provide more accurate, up-to-date, and context-aware responses
Extend the knowledge base of the LLM
LangStream makes it easy to build applications using the RAG pattern. It currently has native support for DataStax Astra DB, Pinecone, Milvus/Zilliz, OpenSearch and Apache Cassandra.
When working with a vector database you will either be writing vector embeddings to a vector database or performing semantic similarity queries across the vectors in the database. Check out the vector-db-sink agent for writing to vector databases and the query-vector-db agent for querying.
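For example, a query step in a pipeline might look something like this. This is a minimal sketch assuming an Astra datasource and a documents table with an embeddings_vector column; the query text and field names are illustrative, not the exact configuration from any example application:

```yaml
pipeline:
  - name: "Look up related documents"
    type: "query-vector-db"
    configuration:
      datasource: "AstraDatasource"
      # ANN query against the stored embeddings; the query syntax depends on your database
      query: "SELECT text FROM documents.documents ORDER BY embeddings_vector ANN OF ? LIMIT 5"
      fields:
        - "value.question_embeddings"        # the query vector bound into the ?
      output-field: "value.related_documents" # retrieved passages end up here
```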
There is also built-in support for querying and writing to any database that has a JDBC driver, and you can write to any database with an available Kafka Connect connector. Please refer to the Data Storage section for more information on how to configure your vector database.
Vectorization example
An embedding model, such as OpenAI's text-embedding-ada-002, converts each document into a vector representation. These document vectors are stored in a vector database designed for similarity search.
A user enters a query text related to a news article about technology trends. The query text is processed using the same embedding model to obtain a vector representation.
A similarity search with this query is performed in the vector database. This involves calculating the cosine similarity or another suitable distance metric between the query vector and the vectors of all stored documents.
The system retrieves the top N documents with the highest similarity scores to the query vector, and the retrieved documents are presented to the user, providing articles that are thematically or contextually similar to the query text.
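For reference, the cosine similarity mentioned above measures the angle between the query vector q and a document vector d; scores closer to 1 mean the texts are more semantically similar:

$$
\text{similarity}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
$$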
Similarity search application
Let’s put this similarity search example in the context of a LangStream application: say we have a bucket full of PDFs (unstructured data) and we want to turn them into meaningful embeddings.
A LangStream application is a series of steps called a pipeline. At each step, an agent acts on messages streamed through the application.
Here’s how you would make a text similarity search application in LangStream. The full pipeline yaml is available here.
The first section names the pipeline and declares a “chunks-topic” Kafka topic for message transport into the pipeline.
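In pipeline yaml, that opening section might look something like this (the topic name comes from the example; the pipeline name is illustrative):

```yaml
name: "Extract and vectorize PDFs"   # illustrative pipeline name
topics:
  - name: "chunks-topic"
    creation-mode: create-if-not-exists
```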
The first step of the pipeline is to read our data from an S3 bucket.
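A sketch of that source step, assuming the s3-source agent; the configuration keys are illustrative, and the secret references point at entries in secrets.yaml:

```yaml
pipeline:
  - name: "Read from S3"
    type: "s3-source"
    configuration:
      # illustrative bucket and credential settings, resolved from secrets.yaml
      bucketName: "{{{ secrets.s3.bucket-name }}}"
      endpoint: "{{{ secrets.s3.endpoint }}}"
      access-key: "{{{ secrets.s3.access-key }}}"
      secret-key: "{{{ secrets.s3.secret-key }}}"
      region: "{{{ secrets.s3.region }}}"
```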
This part of the pipeline pulls credentials from a secrets.yaml file with references to secrets.
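The matching secrets.yaml entries might look like this (the id and keys are illustrative; they just need to line up with the references in the pipeline):

```yaml
secrets:
  - id: s3                       # referenced as secrets.s3.* in the pipeline
    data:
      bucket-name: "documents"
      endpoint: "https://s3.amazonaws.com"
      access-key: "<YOUR_ACCESS_KEY>"
      secret-key: "<YOUR_SECRET_KEY>"
      region: "us-east-1"
```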
In the second step the agent extracts metadata and text from records using Apache Tika. The records don’t have to be PDFs - Tika supports thousands of formats.
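This step usually needs no extra configuration; a sketch using the text-extractor agent:

```yaml
  - name: "Extract text with Apache Tika"
    type: "text-extractor"
```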
The text normaliser agent forces the text into lower case and removes leading and trailing spaces.
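A sketch of the normalisation step (the configuration keys are illustrative):

```yaml
  - name: "Normalise text"
    type: "text-normaliser"
    configuration:
      make-lowercase: true   # force text to lower case
      trim-spaces: true      # strip leading and trailing whitespace
```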
The language detector agent identifies a record’s language. In this case, non-English records are skipped, and English records continue to the next step in the pipeline.
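A sketch of the language filter, assuming the language-detector agent; the configuration keys are illustrative:

```yaml
  - name: "Detect language"
    type: "language-detector"
    configuration:
      allowedLanguages: ["en"]   # records in other languages are skipped
      property: "language"       # the detected language is stored in this property
```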
The records are split into chunks of text. This is an important part of the vectorization pipeline to understand, because it requires balancing performance against accuracy. chunk_size controls the maximum number of characters in each chunk, and chunk_overlap controls the amount of overlap between consecutive chunks. A little overlap keeps results more consistent. chunk_size defaults to 1000 characters, and chunk_overlap defaults to 200 characters.
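A sketch of the chunking step using the text-splitter agent; the chunk_size and chunk_overlap values shown here are just one reasonable starting point, and the other keys are illustrative:

```yaml
  - name: "Split into chunks"
    type: "text-splitter"
    configuration:
      splitter_type: "RecursiveCharacterTextSplitter"  # illustrative splitter choice
      chunk_size: 400        # max characters per chunk (defaults to 1000 if omitted)
      chunk_overlap: 100     # characters shared between neighbouring chunks (defaults to 200)
      separators: ["\n\n", "\n", " ", ""]
```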
This agent converts the unstructured data to structured JSON.
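A sketch of that conversion, assuming the document-to-json agent:

```yaml
  - name: "Convert to structured JSON"
    type: "document-to-json"
    configuration:
      text-field: text         # the chunk text ends up in this JSON field
      copy-properties: true    # carry metadata (filename, language, ...) along as well
```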
The compute agent structures the output into values the final compute step can work with.
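A sketch of the compute step; the field names and expressions below are purely illustrative of pulling metadata into the record value:

```yaml
  - name: "Prepare structure"
    type: "compute"
    configuration:
      fields:
        # illustrative fields: copy useful metadata into the record value
        - name: "value.filename"
          expression: "properties.url"
          type: STRING
        - name: "value.chunk_id"
          expression: "properties.chunk_id"
          type: STRING
        - name: "value.language"
          expression: "properties.language"
          type: STRING
```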
Now that the text is processed and structured, an agent computes embeddings and sends them to the Kafka "chunks-topic".
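A sketch of the embeddings step, assuming the compute-ai-embeddings agent with the OpenAI model mentioned earlier; the configuration keys are illustrative:

```yaml
  - name: "Compute embeddings"
    type: "compute-ai-embeddings"
    output: "chunks-topic"                # embeddings are published to the Kafka topic
    configuration:
      model: "text-embedding-ada-002"
      embeddings-field: "value.embeddings_vector"   # where the vector is written
      text: "{{ value.text }}"                      # the chunk text to embed
```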
The final agent takes the embeddings from the "chunks-topic" and writes them to an Astra vector database. As with the S3 agent, these credentials are pulled from secrets.yaml.
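A sketch of the final sink step, assuming the vector-db-sink agent and an Astra datasource declared in the application's configuration; the table, keyspace, and mapping are illustrative:

```yaml
  - name: "Write to Astra DB"
    type: "vector-db-sink"
    input: "chunks-topic"
    configuration:
      datasource: "AstraDatasource"   # defined in configuration.yaml, credentials from secrets.yaml
      table-name: "documents"
      keyspace: "documents"
      mapping: "filename=value.filename, chunk_id=value.chunk_id, language=value.language, text=value.text, embeddings_vector=value.embeddings_vector"
```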
Now, all the information from your PDFs is embedded in a vector database. Try setting up a chatbot and asking questions about all the information you've made available!
What’s next?
Do you have a website lying around just waiting to be turned into useful, vectorized text? This complete pipeline is available in the LangStream repo, and running it on your own is no sweat.