Support /extract and /crawl for self-hosted #1137

Open
wants to merge 2 commits into
base: main
from

Conversation

rothnic
Contributor

@rothnic rothnic commented Feb 5, 2025

The goal of this branch is to support retrieving crawl and extract data even when there is no Supabase instance for persistent storage and DB authentication.

Things I'm not sure of:

  • Does this make sense or should we be working towards relying on supabase, even for self-hosted?
  • There don't appear to be any types representing what is stored in Redis with the job results. Am I missing them somewhere?

Error When Getting Extract by ID (after completed)

Here is an error from extract/{id}: 
> GET http://10.0.0.4:3002/v1/extract/2820aba6-c6eb-4115-83d5-53b98a17b5aa
< 500 Internal Server Error
< x-powered-by: Express
< access-control-allow-origin: *
< content-type: application/json; charset=utf-8
< content-length: 155
< etag: W/"9b-ATFXC27oGwA7oeVlivs77BM3h9g"
< date: Wed, 05 Feb 2025 11:26:53 GMT
< connection: close
returning:
{
  "success": false,
  "error": "An unexpected error occurred. Please contact help@firecrawl.com for help. Your exception ID is 9b5a0024e20b41c3aa0b56ed96519b1f"
}

Extract Job in Redis

Here is a completed job stored in Redis under bull:{extractQueue}:extract:${extract_job_id}. I noticed the completed extract exists here, so I updated extract-status.ts to fetch this data when DB auth is set to false.

{
  "jobData": {
    "request": {
      "urls": [
        "https://www.mendable.ai/*"
      ],
      "prompt": "Summarize 5 pages starting with the home page",
      "ignoreSitemap": false,
      "includeSubdomains": true,
      "allowExternalLinks": false,
      "enableWebSearch": false,
      "origin": "api",
      "urlTrace": false,
      "timeout": 60000,
      "__experimental_streamSteps": false,
      "__experimental_llmUsage": false,
      "__experimental_showSources": true
    },
    "teamId": "bypass",
    "extractId": "9e814406-a9ee-46ab-9a73-fb6b0a546a89"
  },
  "returnValue": {
    "success": true,
    "data": {
      "pages": [
        {
          "url": "https://www.mendable.ai",
          "title": "Mendable",
          "summary": "Mendable is a platform that allows businesses to build AI chat applications. It offers customizable components and a robust API to integrate LLM capabilities into applications. The platform supports data ingestion from various sources and provides enterprise-grade security features. Mendable is trusted by top companies and offers solutions for sales enablement, customer success, and product copilot use cases."
        },
        {
          "url": "https://www.mendable.ai/blog/getting-started",
          "title": "Everything you need to know about Mendable: Build and deploy AI Chat Search",
          "summary": "This page provides a comprehensive guide to setting up an AI chatbot using Mendable. It covers the platform's modular design, data ingestion methods, and model customization options. The guide also includes steps for inviting team members, testing the chatbot, and troubleshooting common issues. It emphasizes the importance of customizing the model to fit specific needs and offers tips for deployment."
        },
        {
          "url": "https://www.mendable.ai/blog/mendable-launch",
          "title": "Introducing Mendable.ai",
          "summary": "This blog post announces the launch of Mendable.ai, highlighting its capabilities as a universal API for chatting with data. It describes Mendable's developer-centric approach, customizable components, and robust API for integrating chat-powered search functionality into applications. The post emphasizes the platform's ability to provide accurate and contextually relevant answers based on company data."
        },
        {
          "url": "https://www.mendable.ai/blog/building-copilots",
          "title": "Building context-aware AI copilots with Mendable",
          "summary": "This article discusses how to create AI copilots using Mendable that provide personalized answers by utilizing application context. It explains the difference between traditional QA AI and product copilots, and provides a step-by-step guide to integrating a Mendable component into a React application. The article highlights the importance of passing dynamic context to enhance user interactions."
        },
        {
          "url": "https://www.mendable.ai/blog/building-safe-rag",
          "title": "Building Safe RAG systems with the LLM OWASP top 10",
          "summary": "This blog post addresses the security concerns of building LLM systems for enterprises, focusing on the OWASP Top 10 for LLMs. It discusses the challenges and solutions for mitigating risks such as prompt injection, insecure output handling, and sensitive information disclosure. The post emphasizes the importance of implementing security measures to ensure enterprise readiness of RAG solutions."
        }
      ]
    },
    "extractId": "9e814406-a9ee-46ab-9a73-fb6b0a546a89",
    "llmUsage": 0.04475,
    "totalUrlsScraped": 5,
    "sources": {
      "pages": [
        "https://www.mendable.ai/",
        "https://www.mendable.ai/blog/getting-started",
        "https://www.mendable.ai/blog/mendable-launch",
        "https://www.mendable.ai/blog/building-copilots",
        "https://www.mendable.ai/blog/building-safe-rag"
      ]
    }
  }
}
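
On the open question about types: the payload above could be described roughly as follows. This is only a sketch. The names ExtractJobData, ExtractReturnValue, ExtractJobRecord, and isExtractJobRecord are hypothetical and not taken from the codebase, and the field list is inferred solely from this one sample job, so optional fields are a guess.

```typescript
// Hypothetical types inferred from the single sample job shown above.
interface ExtractJobData {
  request: Record<string, unknown>; // urls, prompt, timeout, ...
  teamId: string;
  extractId: string;
}

interface ExtractReturnValue {
  success: boolean;
  data: unknown; // depends on the prompt/schema; { pages: [...] } in the sample
  extractId: string;
  llmUsage?: number;
  totalUrlsScraped?: number;
  sources?: Record<string, string[]>;
}

interface ExtractJobRecord {
  jobData: ExtractJobData;
  returnValue: ExtractReturnValue;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Runtime check for data read back from Redis, which arrives untyped.
function isExtractJobRecord(value: unknown): value is ExtractJobRecord {
  if (!isRecord(value)) return false;
  const { jobData, returnValue } = value;
  return (
    isRecord(jobData) &&
    isRecord(jobData.request) &&
    typeof jobData.teamId === "string" &&
    typeof jobData.extractId === "string" &&
    isRecord(returnValue) &&
    typeof returnValue.success === "boolean" &&
    typeof returnValue.extractId === "string"
  );
}
```

A guard like this only checks the fields the status endpoint would rely on; data is left as unknown because its shape follows the extract prompt or schema.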

Example Response from /extract/{id} After My Changes

{
  "success": true,
  "data": {
    "pages": [
      {
        "url": "https://www.mendable.ai",
        "title": "Mendable",
        "summary": "Mendable is a platform that allows businesses to build AI chat applications. It offers customizable components and a robust API to integrate LLM capabilities into applications. The platform supports data ingestion from various sources and provides enterprise-grade security features. Mendable is trusted by top companies and offers solutions for sales enablement, customer success, and product copilot use cases."
      },
      {
        "url": "https://www.mendable.ai/blog/getting-started",
        "title": "Everything you need to know about Mendable: Build and deploy AI Chat Search",
        "summary": "This page provides a comprehensive guide to setting up an AI chatbot using Mendable. It covers the platform's modular design, data ingestion methods, and model customization options. The guide also includes steps for inviting team members, testing the chatbot, and troubleshooting common issues. It emphasizes the importance of customizing the model to fit specific needs and offers tips for deployment."
      },
      {
        "url": "https://www.mendable.ai/blog/mendable-launch",
        "title": "Introducing Mendable.ai",
        "summary": "This blog post announces the launch of Mendable.ai, highlighting its capabilities as a universal API for chatting with data. It describes Mendable's developer-centric approach, customizable components, and robust API for integrating chat-powered search functionality into applications. The post emphasizes the platform's ability to provide accurate and contextually relevant answers based on company data."
      },
      {
        "url": "https://www.mendable.ai/blog/building-copilots",
        "title": "Building context-aware AI copilots with Mendable",
        "summary": "This article discusses how to create AI copilots using Mendable that provide personalized answers by utilizing application context. It explains the difference between traditional QA AI and product copilots, and provides a step-by-step guide to integrating a Mendable component into a React application. The article highlights the importance of passing dynamic context to enhance user interactions."
      },
      {
        "url": "https://www.mendable.ai/blog/building-safe-rag",
        "title": "Building Safe RAG systems with the LLM OWASP top 10",
        "summary": "This blog post addresses the security concerns of building LLM systems for enterprises, focusing on the OWASP Top 10 for LLMs. It discusses the challenges and solutions for mitigating risks such as prompt injection, insecure output handling, and sensitive information disclosure. The post emphasizes the importance of implementing security measures to ensure enterprise readiness of RAG solutions."
      }
    ]
  },
  "status": "completed",
  "expiresAt": "2025-02-05T21:11:26.000Z",
  "sources": {
    "pages": [
      "https://www.mendable.ai/",
      "https://www.mendable.ai/blog/getting-started",
      "https://www.mendable.ai/blog/mendable-launch",
      "https://www.mendable.ai/blog/building-copilots",
      "https://www.mendable.ai/blog/building-safe-rag"
    ]
  }
}

This returns the job response from Redis rather than Supabase when DB auth is disabled (self-hosted mode).
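
Comparing the two payloads above, the response keeps success, data, and sources from returnValue, adds status and expiresAt, and drops extractId, llmUsage, and totalUrlsScraped. A minimal sketch of that mapping is below. The function and type names are hypothetical, and where expiresAt comes from is not shown in this thread, so here the caller simply passes it in.

```typescript
// Hypothetical shapes, inferred from the two JSON samples in this description.
interface ExtractReturnValue {
  success: boolean;
  data: unknown;
  extractId?: string;
  llmUsage?: number;
  totalUrlsScraped?: number;
  sources?: Record<string, string[]>;
}

interface ExtractStatusResponse {
  success: boolean;
  data: unknown;
  status: "completed";
  expiresAt: string; // ISO 8601 timestamp
  sources?: Record<string, string[]>;
}

// Maps a completed job's return value to the /extract/{id} response body.
// Assumption: expiresAt is computed elsewhere and supplied by the caller.
function toExtractStatusResponse(
  rv: ExtractReturnValue,
  expiresAt: Date,
): ExtractStatusResponse {
  return {
    success: rv.success,
    data: rv.data,
    status: "completed",
    expiresAt: expiresAt.toISOString(),
    // Only include sources when the job actually recorded them.
    ...(rv.sources ? { sources: rv.sources } : {}),
  };
}
```
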
@rothnic
Contributor Author

rothnic commented Feb 5, 2025

@mogery or anyone else. Can you let me know any thoughts on this issue with self hosting? I'm mainly wondering whether returning from redis (easier and as-implemented in this example) vs supabase (much harder) in a self-hosted non-db-authenticated environment would be desired.

@user72356

> @mogery or anyone else. Can you let me know any thoughts on this issue with self hosting? I'm mainly wondering whether returning from redis (easier and as-implemented in this example) vs supabase (much harder) in a self-hosted non-db-authenticated environment would be desired.

I personally agree that using redis for self-hosted is much more desirable than supabase. It keeps with the spirit of it being self-hosted.

@mogery
Member

mogery commented Feb 6, 2025

This makes sense for self-hosted! What I'm more confused about is the manual retrieval via hgetall instead of using getExtractQueue()

@rothnic
Contributor Author

rothnic commented Feb 6, 2025

> This makes sense for self-hosted! What I'm more confused about is the manual retrieval via hgetall instead of using getExtractQueue()

You are absolutely right. I was mainly trying to understand the difference between what getExtract returns and what I saw in the queue, and I had never come across this setup before. I think I will add getExtractJob to extract-redis.ts, which will handle choosing where to get the job from and return a consistent response.

I was having trouble, though, finding a type that describes the data from Supabase versus the extract job's returnValue. For example, jobData[0].docs seems different from what I'm seeing in the queue: there is no docs in there. Is that the same thing as pages?
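
To make the getExtractJob idea concrete, here is a rough sketch of the source selection. It is not the code in this branch. JobLike and QueueLike are minimal stand-ins modelled on BullMQ's Queue.getJob(), Job.getState(), and Job.returnvalue so the example has no dependencies; fetchFromDb, the sources parameter, and the ExtractJobLookup type are hypothetical; and treating every DB row as "completed" is an assumption.

```typescript
// Structural stand-ins for the parts of a BullMQ queue/job used here.
interface JobLike<R> {
  returnvalue: R;
  getState(): Promise<string>;
}
interface QueueLike<R> {
  getJob(id: string): Promise<JobLike<R> | undefined>;
}

// One consistent result shape, whichever backend served the lookup.
type ExtractJobLookup<R> =
  | { found: false }
  | { found: true; status: string; result: R | null };

interface ExtractJobSources<R> {
  useDbAuthentication: boolean;
  queue: QueueLike<R>; // e.g. whatever getExtractQueue() returns
  fetchFromDb: (id: string) => Promise<R | null>; // hypothetical DB lookup
}

async function getExtractJob<R>(
  id: string,
  sources: ExtractJobSources<R>,
): Promise<ExtractJobLookup<R>> {
  if (sources.useDbAuthentication) {
    // Assumption: a row in the DB means the extract has finished.
    const row = await sources.fetchFromDb(id);
    return row === null
      ? { found: false }
      : { found: true, status: "completed", result: row };
  }
  // Self-hosted path: ask the queue rather than reading Redis hashes directly.
  const job = await sources.queue.getJob(id);
  if (!job) return { found: false };
  const status = await job.getState();
  return {
    found: true,
    status,
    result: status === "completed" ? job.returnvalue : null,
  };
}
```

Going through the queue object instead of hgetall keeps the key layout an implementation detail of the queue library.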

@rothnic
Contributor Author

rothnic commented Feb 7, 2025

@mogery

After creating getExtractJob, I was looking for types and found getJob and getJobs in crawl-status.ts. I assume we should consolidate this functionality and make it clearer what these functions return. Here's a rough outline of the changes I think would help:

  1. New completed-jobs.ts in apps/api/src/lib (or other suggestions?)

    • Move the logic from getJob and getJobs into one place.
    • Rename them to getCompletedJob(queue, id) and getCompletedJobs(queue, ids), since they’re really for finished work.
    • Create wrappers like getCompletedCrawlJob, getCompletedCrawlJobs, and getCompletedExtractJob that just pass in the right queue references.
    • Make sure all relevant types are located here as well (DBJob, PseudoJob, etc.)
  2. Rename Anything Returning Only IDs

    • For example, getCrawlJobs actually returns job IDs, so something like getCrawlJobIds might make more sense.
  3. Rename getExtract to Show It’s Redis Storage

    • Currently, getExtract returns a StoredExtract from Redis, not the final extracted data from the queue.
    • Something like getStoredExtract or getExtractState would help show that it’s the record we saved in Redis, not the completed extraction result. getStoredExtract would align with the returned type as well.
    • Update the names of similar functions so the naming stays consistent.

With these changes, calls like getCompletedExtractJob(id) will clearly return the finished job data, while calls like getStoredExtract(id) will return the Redis record that tracks status, timestamps, etc. It should cut down on confusion and help keep it organized.
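
As a rough illustration of item 1, the generic helpers and a wrapper could look something like the sketch below. It is not the existing getJob/getJobs logic. JobLike and QueueLike are stand-ins modelled on BullMQ's Queue.getJob() and Job.getState(), bindCompletedJobGetter is a hypothetical helper name, and returning null for jobs that are not yet completed is an assumption about the intended behavior.

```typescript
// Structural stand-ins for the parts of a BullMQ queue/job used here.
interface JobLike<R> {
  returnvalue: R;
  getState(): Promise<string>;
}
interface QueueLike<R> {
  getJob(id: string): Promise<JobLike<R> | undefined>;
}

// Returns the job only if it exists and has finished; otherwise null.
async function getCompletedJob<R>(
  queue: QueueLike<R>,
  id: string,
): Promise<JobLike<R> | null> {
  const job = await queue.getJob(id);
  if (!job) return null;
  return (await job.getState()) === "completed" ? job : null;
}

// Looks up several ids in parallel and keeps only the completed jobs.
async function getCompletedJobs<R>(
  queue: QueueLike<R>,
  ids: string[],
): Promise<JobLike<R>[]> {
  const jobs = await Promise.all(ids.map((id) => getCompletedJob(queue, id)));
  return jobs.filter((job): job is JobLike<R> => job !== null);
}

// Wrapper factory: binds a queue accessor so callers only pass an id.
// e.g. const getCompletedExtractJob = bindCompletedJobGetter(getExtractQueue);
function bindCompletedJobGetter<R>(getQueue: () => QueueLike<R>) {
  return (id: string): Promise<JobLike<R> | null> =>
    getCompletedJob(getQueue(), id);
}
```

With the queue passed in explicitly, the crawl and extract wrappers differ only in which queue accessor they bind.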

@rothnic
Contributor Author

rothnic commented Feb 7, 2025

I did go ahead and make use of getJob in crawl-status.ts and did not create the centralized completed-jobs.ts in lib. We can leave it as-is or I can make the other proposed changes. This could be merged as-is.

@rothnic rothnic marked this pull request as ready for review February 7, 2025 16:22

3 participants