Web Search node

Web Search node allows you to search the web. It searches the web, scrapes top results, renders them according to user-specified template, and merges them.

Scraping some pages may fail and it will be silently ignored. However, if scraping failed for all pages, then this node will return an error.

If you plan to pass the output of this node to LLM, then we suggest to use models such as Claude, which have larger context window size. To illustrate, some long Wikipedia-article pages may result in gpt-3.5-turbo failing, since those article are longer than its content window which fits only 16,385 tokens.

Another alternative is to only use relevant chunks of text from each scraped page by enabling Only semantically similar chunks.

Currently, this node uses Google search for searching, so you’ll need to provide your google search keys (Read more). No keys need to be provided for web scraping, since we manage it by ourselves. Feel free to contact us at founders@lmnr.ai if you need other search or scrape/crawl providers.

Configuration

Scraped pages

The number of pages to be scraped from top results of the search. We suggest to put at least 2-3, since sometimes the top result may be irrelevant or it may not contain any useful content.

Template

You can control the output through a template where keys are placed into {{mustache_syntax}}. This template is applied to every content retrieved by semantic search.

All retrieved contents, rendered according to template, are joined via a newline character \n.

The keys which can be used in the template:

  • title - The title of the scraped web page.
  • url - The url of the scraped web page.
  • content - The whole content of the scraped page OR chunk’s content, depending on if Only semantically similar chunks is enabled. Note that when we scrape the web page’s content, we remove the html boilerplate and try to convert to markdown.

Only semantically similar chunks

If set to true, each scraped web page’s text will be chunked, then each chunk will be turned into embedding, and only chunks which are semantically similar to the query will be returned.

If this is enabled, then Template configs will be applied to each of the relevant chunks, not to the whole web page’s text. Thus, content in the template will render only the chunk’s content, not the whole web page’s text.

Usually, the result of web search is next passed to LLM to make a human-readable response. However, feeding very long webpages may not fit into the LLM’s context window. Thus, it’s better to just extract the relevant chunks and then feed them to LLM.

Limit

Limit is the maximum number of chunks to be returned. Valid range: non-negative integers.

This field is only present if “Only semantically similar chunks” is enabled.

Env variables

You’ll need to set the following variables at env variables page:

  • GOOGLE_SEARCH_ENGINE_ID
  • GOOGLE_SEARCH_API_KEY

Go to Google’s Custom Search JSON API docs to learn more on how to issue these keys.