Recursive URL Loader add custom URLs #4231

MK-cypher · 2024-02-01T10:41:24Z

Is there a way to include custom directories in the Recursive URL Loader?
For example, in case it can't reach them, or for one or two links from another website.

Maybe something like this:

  const loader = new RecursiveUrlLoader(url, {
    extractor: compiledConvert,
    maxDepth: 10,
    excludeDirs: [],
    includeDirs: [
      "https://includeExample.com/example-1",
      "https://includeExample.com/example-2",
      "https://includeExample.com/example-3",
      "https://includeExample.com/example-4",
    ],
  });

@dosu-bot · 2024-02-01T10:41:49Z

🤖

Hey @kareiweb, nice to see you diving into new challenges! How's everything going on your end?

Based on the current implementation of the RecursiveUrlLoader in the langchainjs framework, it does not directly support the inclusion of custom directories or URLs as you've described. The RecursiveUrlLoader is designed to recursively load URLs from a given starting URL, and it does not have an includeDirs option in its configuration.

However, you can exclude certain directories from being loaded using the excludeDirs option, as shown in the test cases from the langchain/src/document_loaders/tests/recursive_url.int.test.ts file.

const loader = new RecursiveUrlLoader(url, {
  extractor: compiledConvert,
  maxDepth: 1,
  excludeDirs: ["https://js.langchain.com/docs/api/"],
});

In this example, the excludeDirs option is used to specify a list of directories that the RecursiveUrlLoader should not load.

If your documents are available as local files, you might consider using the DirectoryLoader class, which takes a directory path on disk when you create a new instance. Note that it reads files from the filesystem and does not fetch URLs. Here's an example from the langchain/src/document_loaders/fs/directory.ts file:

const directoryLoader = new DirectoryLoader(
  "src/document_loaders/example_data/",
  {
    ".pdf": (path: string) => new PDFLoader(path),
  },
);

const docs = await directoryLoader.load();
console.log({ docs });

In this example, the DirectoryLoader is used to load documents from the example_data directory.

If you need to load documents from multiple directories or URLs, you could create one DirectoryLoader per directory or one RecursiveUrlLoader per base URL, as needed.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

1 reply

MK-cypher Feb 1, 2024
Author

@dosu-bot Is there at least a way to load custom URLs in LangChain JS?
Something like the UnstructuredURLLoader in Python LangChain, where we can load an array of URLs.
It has to be URLs, not documents or PDFs, so I can't use the DirectoryLoader.

@dosu-bot · 2024-02-01T11:27:40Z

🤖

Currently, the RecursiveUrlLoader in langchainjs does not support loading an array of URLs or including custom directories directly. It is designed to recursively load URLs from a single base URL, excluding any directories specified in the excludeDirs option.

However, you can achieve similar functionality by creating multiple instances of RecursiveUrlLoader, each with a different base URL from your array, and then loading them separately. Here's an example:

const urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']; // Your array of URLs
const loaders = urls.map(url => new RecursiveUrlLoader(url, { /* your options here */ }));

// Load documents from all URLs
const allDocuments = [];
for (const loader of loaders) {
  const documents = await loader.load();
  allDocuments.push(...documents);
}

In this example, allDocuments will contain every document loaded from every URL in the array. Note that the for...of loop awaits each loader in turn, so the URLs are loaded sequentially, not concurrently. That is gentle on the servers you're loading from but can be slow for a large number of URLs. If you need more throughput, you could load the URLs in small concurrent batches while keeping the number of simultaneous requests low.
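The batched approach can be sketched as follows. This is a minimal illustration, not part of langchainjs: `loadInBatches`, `Doc`, and `LoaderLike` are hypothetical names introduced here, and the only assumption about a loader is that it exposes a `load()` method returning a promise of documents, as RecursiveUrlLoader does in the examples above.

```typescript
// Minimal shapes so the sketch compiles on its own. In a real project these
// would be LangChain's Document type and any loader exposing load().
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface LoaderLike {
  load(): Promise<Doc[]>;
}

// Hypothetical helper (not a langchainjs API): runs the loaders in batches of
// `batchSize`, so at most `batchSize` loads are in flight at any time.
async function loadInBatches(
  loaders: LoaderLike[],
  batchSize: number
): Promise<Doc[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const allDocuments: Doc[] = [];
  for (let i = 0; i < loaders.length; i += batchSize) {
    const batch = loaders.slice(i, i + batchSize);
    // Loaders within one batch run concurrently; batches run one after another.
    const results = await Promise.all(batch.map((loader) => loader.load()));
    for (const docs of results) {
      allDocuments.push(...docs);
    }
  }
  return allDocuments;
}
```

With the `loaders` array from the example above, `await loadInBatches(loaders, 2)` would return the merged documents in the same order as the input URLs. A batch size of 1 reproduces the sequential loop.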

For more information, you can refer to the RecursiveUrlLoader in the langchainjs codebase.

Regarding your question about a feature similar to the UnstructuredURLLoader in python langchain, currently, langchainjs does not have a direct equivalent. The closest feature is the RecursiveUrlLoader, which allows for multiple URLs to be loaded at once from a single base URL and its linked pages, controlled by the maxDepth option. Here's an example:

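// Import paths as of early 2024; newer releases may expose this loader from a different package.
import { compile } from "html-to-text";
import { RecursiveUrlLoader } from "langchain/document_loaders/web/recursive_url";

const url = "https://js.langchain.com/docs/get_started/introduction";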

const compiledConvert = compile({ wordwrap: 130 }); // returns (input: string) => string;

const loader = new RecursiveUrlLoader(url, {
  extractor: compiledConvert,
  maxDepth: 1,
  excludeDirs: ["https://js.langchain.com/docs/api/"],
});

const docs = await loader.load();
expect(docs.length).toBeGreaterThan(1);
expect(docs[0].pageContent).toContain("LangChain");

In this example, the RecursiveUrlLoader is initialized with a URL and options. The load method is then called to load the content of the URL and any URLs linked from that page (because maxDepth is set to 1). The loaded content is then stored in the docs array. The length of the docs array is expected to be greater than 1, indicating that multiple URLs have been loaded.

You can find this example in the langchainjs test suite.
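To make the roles of maxDepth and excludeDirs concrete, here is a minimal sketch of a depth-limited, breadth-first crawl that also accepts several starting URLs, which is what the original question asked for. It is not the RecursiveUrlLoader implementation: `crawl`, `LinkFetcher`, and `CrawlOptions` are hypothetical names, treating excludeDirs entries as URL prefixes is an assumption, and the depth counting (0 means seeds only) is this sketch's own convention. The link-fetching step is passed in as a callback so the sketch does not depend on any particular HTTP or HTML library.

```typescript
// Hypothetical callback: given a URL, return the URLs that page links to.
// In practice this would fetch the page and parse its anchor tags.
type LinkFetcher = (url: string) => Promise<string[]>;

interface CrawlOptions {
  maxDepth: number; // 0 visits only the seeds, 1 also visits their links, etc.
  excludeDirs?: string[]; // URL prefixes to skip (prefix matching is an assumption)
}

// Breadth-first crawl from several seed URLs. Returns the visited URLs in
// the order they were first reached.
async function crawl(
  seeds: string[],
  fetchLinks: LinkFetcher,
  options: CrawlOptions
): Promise<string[]> {
  const excludeDirs = options.excludeDirs ?? [];
  const isExcluded = (url: string): boolean =>
    excludeDirs.some((dir) => url.startsWith(dir));

  const visited = new Set<string>();
  let frontier = seeds.filter((url) => !isExcluded(url));

  for (let depth = 0; depth <= options.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.has(url)) continue; // skip duplicates within a level
      visited.add(url);
      if (depth === options.maxDepth) continue; // do not expand past the limit
      for (const link of await fetchLinks(url)) {
        if (!visited.has(link) && !isExcluded(link)) next.push(link);
      }
    }
    frontier = next;
  }
  return Array.from(visited);
}
```

Each returned URL could then be handed to whatever loader or extractor you already use, and the resulting documents merged into one array.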

I hope this helps! If you have any more questions or need further clarification, feel free to ask.


shreyanmitra · 2024-09-06T04:44:01Z

@kareiweb Any updates on this? I want to do something in JS similar to WebBaseLoader in Python, where I pass in the array of URLs I want.
If this is not possible, how can I merge multiple results of loader.load() and use them as input to a RAG app?

1 reply

MK-cypher Oct 6, 2024
Author

Sadly, I spent a lot of time looking for a solution and couldn't find anything, so I had to write my own web crawler instead.
