webcrawler-source
The webcrawler-source agent crawls a website and outputs the site's URL and an HTML document. Crawling a website is an ideal first step in a vectorization pipeline.
This agent keeps the crawl status in persistent storage. The storage does not contain a copy of the crawled data, only a single JSON file whose name is computed from the name of the agent and the id of the LangStream application.
By default, the status is kept in an S3-compatible bucket, which must be defined using the bucketName, endpoint, access-key, secret-key, and region properties. Alternatively, the status can be stored on local disk by setting state-storage: disk.
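As a sketch, the two state-storage options might look like this inside the agent's configuration block (the bucket values are hypothetical placeholders, not defaults):

```yaml
# Option 1: keep crawl status in an S3-compatible bucket (placeholder values)
configuration:
  bucketName: "langstream-crawler-state"
  endpoint: "http://minio.minio-dev.svc.cluster.local:9000"
  access-key: "${secrets.s3.access-key}"
  secret-key: "${secrets.s3.secret}"
  region: "us-east-1"
---
# Option 2: keep crawl status on local disk instead of S3
configuration:
  state-storage: disk
```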
Example webcrawler agent in a pipeline:
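A minimal sketch of such a pipeline entry follows; the seed URL, allowed domain, and agent name are illustrative placeholders:

```yaml
pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls: ["https://docs.langstream.ai/"]
      allowed-domains: ["https://docs.langstream.ai"]
      state-storage: disk
```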
Multiple seed-urls and allowed-domains are allowed.
To add them to your pipeline, use this syntax:
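For example, with two seed URLs and two allowed domains (the domains are placeholders):

```yaml
configuration:
  seed-urls:
    - "https://docs.langstream.ai/"
    - "https://langstream.ai/"
  allowed-domains:
    - "https://docs.langstream.ai"
    - "https://langstream.ai"
```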
Input
None, it’s a source
Output
| Field | Type | Description |
| --- | --- | --- |
| pendingUrls | String | Holds the URLs that have been discovered but are yet to be processed. |
| remainingUrls | String | Holds the URLs that have been discovered but are yet to be processed. |
| visitedUrls | String | Holds all URLs that have been visited to prevent cyclic crawling. |
The topics necessary for the application are declared: the chunked embeddings will later be passed to a vector database via the Kafka "chunks-topic".
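A sketch of that topics declaration, assuming the create-if-not-exists creation mode:

```yaml
topics:
  - name: "chunks-topic"
    creation-mode: create-if-not-exists
```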
For each seed-url, the webcrawler:
1. Sets up a connection using Jsoup.
2. Catches HTTP status errors and handles retries or skips based on the error code (HttpStatusException for HTTP errors and UnsupportedMimeTypeException for non-HTML content types).
3. If the content is HTML, fetches the links (hrefs) from the page and adds them to the list of URLs to be crawled, provided each href passes the allowed-domains rule set in the pipeline configuration.
4. Gets the page's HTML and creates a document with the site URL as a header and the raw HTML as the body.
The webcrawler then passes the document on to the next agent.
Output: Structured text (JSON), written to an implicit topic.
Check out the full configuration properties in the agent configuration reference.
Using the webcrawler source agent as a starting point, this workflow will crawl a website, get a page's raw HTML, and process that information into chunks for embedding in a vector database. The complete example code is available in the LangStream examples repository.
The webcrawler source agent configuration is declared. For help finding your S3 credential information or configuring the webcrawler-source, see the agent configuration reference.
The webcrawler itself uses the Jsoup library to parse HTML. The webcrawler explores the web starting from a list of seed URLs and follows links within pages to discover more content.
The text-extractor agent extracts metadata and text from records using Apache Tika.
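A minimal sketch of that step (the step name is illustrative):

```yaml
- name: "Extract text"
  type: "text-extractor"
```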
The text-normaliser agent forces the text into lower case and removes leading and trailing spaces.
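As a sketch, assuming make-lowercase and trim-spaces are the relevant properties:

```yaml
- name: "Normalise text"
  type: "text-normaliser"
  configuration:
    make-lowercase: true
    trim-spaces: true
```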
The language-detector agent identifies a record's language. In this case, non-English records are skipped, and English records continue to the next step in the pipeline.
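A sketch of that step; the allowedLanguages and property names are assumptions for illustration:

```yaml
- name: "Detect language"
  type: "language-detector"
  configuration:
    allowedLanguages: ["en"]
    property: "language"
```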
The text-splitter agent splits the document into chunks of text. This is an important part of the vectorization pipeline to understand, because it requires balancing performance against accuracy. chunk_size controls the maximum number of characters in each chunk, and chunk_overlap controls the amount of overlap between consecutive chunks. A little overlap keeps results more consistent. chunk_size defaults to 1000 characters, and chunk_overlap defaults to 200 characters.
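As a sketch, with illustrative (non-default) tuning values and an assumed RecursiveCharacterTextSplitter splitter type:

```yaml
- name: "Split into chunks"
  type: "text-splitter"
  configuration:
    splitter_type: "RecursiveCharacterTextSplitter"
    chunk_size: 400
    chunk_overlap: 100
    separators: ["\n\n", "\n", " ", ""]
```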
The document-to-json agent converts the unstructured data to structured JSON.
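A sketch of that step; the text-field and copy-properties settings are assumptions for illustration:

```yaml
- name: "Convert to structured data"
  type: "document-to-json"
  configuration:
    text-field: text
    copy-properties: true
```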
The compute agent structures the JSON output into values the final embedding step can work with.
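As a sketch of such a compute step, the field names and expressions below are hypothetical examples, not the values used in the actual workflow:

```yaml
- name: "Prepare structure"
  type: "compute"
  configuration:
    fields:
      - name: "value.filename"
        expression: "properties.url"
        type: STRING
```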
Now that the text is processed and structured, the compute-ai-embeddings agent computes embeddings and sends them to the Kafka "chunks-topic".
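A sketch of that final step; the model name, embeddings field, and templated text input are placeholders for illustration:

```yaml
- name: "compute-embeddings"
  type: "compute-ai-embeddings"
  output: "chunks-topic"
  configuration:
    model: "text-embedding-ada-002"
    embeddings-field: "value.embeddings_vector"
    text: "{{ value.text }}"
```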
Where to next? If you've got a vector database, use the vector-db-sink agent to sink the vectorized embeddings via the Kafka "chunks-topic" to your database. From there, you can query your vector data, or ask questions with a chatbot. It's up to you!