🚀 Kollektiv - LLMs + Up-to-date knowledge

🌟 Overview

Kollektiv is a Retrieval-Augmented Generation (RAG) system designed for one purpose: to let you easily chat with the docs of your favorite libraries, frameworks, and tools.

This project lets LLMs tap into the most up-to-date knowledge in two clicks, so you don't have to worry about incorrect replies, hallucinations, or inaccuracies when working with the best LLMs.

❓Why?

This project was born out of a personal itch: whenever a new feature of my favorite library comes out, I know I can't rely on the LLM to help me build with it, because it simply doesn't know about it!

The root cause: LLMs lack access to the most recent documentation or private knowledge, because they are trained on data that was collected long before release (sometimes more than a year ago).

The impact: hallucinated answers and inaccurate, incorrect, or outdated information, which directly reduces the productivity and usefulness of LLMs.

But there is a better way...

What if LLMs could tap into a source of up-to-date information on the libraries, tools, and frameworks you are building with?

Imagine your LLM could intelligently decide when it needs to check the documentation and then always provide an accurate reply.

🎯 Goal

Meet Kollektiv -> an open-source RAG app that helps you easily:

  • parse the docs of your favorite libraries
  • efficiently store and embed them in a local vector store
  • set up an LLM chat you can rely on

Note: this is v0.1.*, and the reliability of the system can be characterized as follows:

  • 50% of the time, it works every time!

So do let me know if you are experiencing issues and I'll try to fix them.

⚙️ Key Features

  • 🕷️ Intelligent Web Crawling: Utilizes FireCrawl API to efficiently crawl and extract content from specified documentation websites.
  • 🧠 Advanced Document Processing: Implements custom chunking strategies to optimize document storage and retrieval.
  • 🔍 Vector Search: Employs Chroma DB for high-performance similarity search of document chunks.
  • 🔄 Multi-Query Expansion: Enhances search accuracy by generating multiple relevant queries for each user input.
  • 📊 Smart Re-ranking: Utilizes Cohere's re-ranking API to improve the relevancy of search results (see the retrieval sketch below).
  • 🤖 AI-Powered Responses: Integrates with Claude 3.5 Sonnet to generate human-like, context-aware responses.
  • 🧠 Dynamic system prompt: Automatically summarizes the embedded documentation to improve RAG decision-making.
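
To make the retrieval flow concrete, here is a minimal sketch of how multi-query expansion, vector search, and re-ranking might fit together. It is illustrative only: the collection name, storage path, and Cohere model name are assumptions, and it presumes an embedding function (e.g. OpenAI's text-embedding-3-small) is already configured on the Chroma collection.

    # Illustrative sketch only -- not the project's actual code.
    import chromadb
    import cohere

    chroma_client = chromadb.PersistentClient(path="chroma_db")   # assumed local path
    collection = chroma_client.get_or_create_collection("docs")   # assumed collection name
    co = cohere.Client("your_cohere_api_key")

    def retrieve(user_query: str, expanded_queries: list[str], top_n: int = 5) -> list[str]:
        """Search with the original + expanded queries, then re-rank the pooled results."""
        candidates: list[str] = []
        for q in [user_query, *expanded_queries]:
            result = collection.query(query_texts=[q], n_results=10)
            candidates.extend(result["documents"][0])

        # De-duplicate while preserving order before re-ranking.
        unique_docs = list(dict.fromkeys(candidates))

        reranked = co.rerank(
            model="rerank-english-v3.0",   # model name is an assumption
            query=user_query,
            documents=unique_docs,
            top_n=top_n,
        )
        return [unique_docs[r.index] for r in reranked.results]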

🛠️ Technical Stack

  • Language: Python 3.7+
  • Web Crawling: FireCrawl API
  • Vector Database: Chroma DB
  • Embeddings: OpenAI's text-embedding-3-small
  • LLM: Anthropic's Claude 3.5 Sonnet
  • Re-ranking: Cohere API
  • Additional Libraries: tiktoken, chromadb, anthropic, cohere

🚀 Quick Start

  1. Clone the repository:

    git clone https://github.com/Twist333d/kollektiv.git
    cd kollektiv
    
  2. Set up environment variables: Create a .env file in the project root with the following keys (a sketch of how these are typically loaded appears after these steps):

    FIRECRAWL_API_KEY="your_firecrawl_api_key"
    OPENAI_API_KEY="your_openai_api_key"
    ANTHROPIC_API_KEY="your_anthropic_api_key"
    COHERE_API_KEY="your_cohere_api_key"
    
  3. Install dependencies:

    poetry install
    
  4. Run the application:

    poetry run python app.py

    or, with the Poetry environment activated:

    python app.py
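
The application presumably reads the keys from step 2 from the environment at startup. A minimal sketch of that pattern, assuming python-dotenv is used (the actual loading code in app.py may differ):

    # Sketch of loading the .env keys from step 2 (assumes python-dotenv is installed).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the project root

    FIRECRAWL_API_KEY = os.environ["FIRECRAWL_API_KEY"]
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
    ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
    COHERE_API_KEY = os.environ["COHERE_API_KEY"]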

💡 Usage

  1. Crawl Documentation:

    Update the root URL you want to crawl. For example:

    urls_to_crawl = ["https://docs.anthropic.com/en/docs/"]

    Include the URL patterns of sub-pages you want to crawl, and exclude the patterns of sub-pages you don't:

     "includePaths": ["/tutorials/*", "/how-tos/*", "/concepts/*"],
     "excludePaths": ["/community/*"],

    Set the maximum number of pages you want to crawl (a consolidated crawl sketch follows these steps):

     crawler.async_crawl_url(urls_to_crawl, page_limit=250)
  2. Chunk the FireCrawl-parsed docs:

    The next step is to chunk all of the parsed documents:

    poetry run python -m src.processing.chunking
  3. Configure basic parameters:

    Set up the following parameters in app.py:

    1. Whether to load only the specified chunk files or all processed chunks in PROCESSED_DATA_DIR:
     docs = ["docs_anthropic_com_en_20240928_135426-chunked.json"]
     initializer = ComponentInitializer(reset_db=reset_db, load_all_docs=True, files=[])
    2. Whether to reset the database, which will clear all data in the local ChromaDB; use with caution. Defaults to False.
     if __name__ == "__main__":
         main(debug=False, reset_db=False)
  4. Chat with the documentation: You can run the application with the following command:

    python app.py
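
To make step 1 concrete, here is a rough sketch of the crawl configuration using the FireCrawl Python SDK directly. The project wraps this in its own crawler class, so this is an assumption about the underlying call rather than Kollektiv's actual code; parameter names follow the FireCrawl v1 API and may vary by SDK version:

    # Illustrative FireCrawl crawl, mirroring the include/exclude patterns and page limit above.
    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="your_firecrawl_api_key")

    crawl_result = app.crawl_url(
        "https://docs.anthropic.com/en/docs/",
        params={
            "limit": 250,  # maximum number of pages
            "includePaths": ["/tutorials/*", "/how-tos/*", "/concepts/*"],
            "excludePaths": ["/community/*"],
        },
    )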

❤️‍🩹 Current Limitations

  • Only terminal UI (no Chainlit for now)
  • Image data not supported - ONLY text-based embeddings.
  • No automatic re-indexing of documents
  • Basic chat flow only
    • The RAG tool is either used or not for a given reply
      • If the tool is used, it retrieves up to 5 of the most relevant documents (after re-ranking)

🛣️ Roadmap

For a brief roadmap, please check out the project wiki page.

📈 Performance Metrics

Evaluation is currently done using the ragas library. Two key parts are assessed (see the sketch after this list):

  1. End-to-end generation
    • Faithfulness
    • Answer relevancy
    • Answer correctness
  2. Retriever (TBD)
    • Context recall
    • Context precision
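
A rough sketch of what a ragas evaluation over these metrics might look like. The dataset column names and metric imports follow the ragas documentation and can differ between ragas versions; the sample row is purely a placeholder:

    # Illustrative ragas evaluation -- the sample data is a placeholder, not a real benchmark.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_correctness,
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    eval_dataset = Dataset.from_dict({
        "question": ["How do I enable prompt caching?"],
        "answer": ["<generated answer>"],
        "contexts": [["<retrieved documentation chunk>"]],
        "ground_truth": ["<reference answer>"],
    })

    results = evaluate(
        eval_dataset,
        metrics=[faithfulness, answer_relevancy, answer_correctness, context_recall, context_precision],
    )
    print(results)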

📜 License

Kollektiv is licensed under a modified version of the Apache License 2.0. While it allows for free use, modification, and distribution for non-commercial purposes, any commercial use requires explicit permission from the copyright owner.

  • For non-commercial use: You are free to use, modify, and distribute this software under the terms of the Apache License 2.0.
  • For commercial use: Please contact azuev@outlook.com to obtain a commercial license.

See the LICENSE file for the full license text and additional conditions.

Project Renaming Notice

The project has been renamed from OmniClaude to Kollektiv to:

  • avoid confusion with, and unintended copyright infringement of, Anthropic
  • emphasize the goal of becoming a tool that enhances collaboration by simplifying access to knowledge
  • it's an overall cool name (isn't it?)

If you have any questions regarding the renaming, feel free to reach out.

🙏 Acknowledgements

📞 Support

For any questions or issues, please open an issue on GitHub.


Built with ❤️ by AZ
