md-embed (c) 2024 web3dguy

A Python script for processing Markdown files, generating embeddings, and storing them in a vector store. This tool allows you to clean, split, and embed Markdown documents using various methods and embedding models.

Features

Data Cleaning: Removes duplicates and filters out unwanted content like '404' pages and lines containing the '©' symbol.
Flexible Input: Supports input from JSON files containing URLs and Markdown data, folders of Markdown files, or single Markdown files.
Document Splitting: Splits documents using Markdown headers or recursive character splitting.
Embedding Options: Supports embedding using HuggingFace or Ollama embeddings.
Vector Store Integration: Stores embeddings in a Chroma vector store for efficient retrieval and analysis.
Customizable Filters: Option to disable filters that remove specific content.
Logging: Generates logs for duplicates and removed files for better traceability.

Installation

Prerequisites

    Python 3.7 or higher
    pip
    Git (optional, for cloning the repository)

Clone the Repository

git clone https://github.com/GATERAGE/mdmbed.git
cd mdmbed

Install Required Packages

Install the required Python packages using pip:

pip install -r requirements.txt

Note: The requirements.txt file should list all the dependencies, such as tqdm, langchain, chromadb, huggingface, etc.

Usage

Run the script using Python:

python md-embed.py [--filters-off]

Command-Line Arguments

--filters-off: Disables the content filters. By default, the script removes lines containing '©' and skips files that contain both '404' and 'page not found'; with this flag, both filters are turned off.
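
The filter behavior described above can be sketched as follows. This is a minimal illustration, not the code in md-embed.py: the function names (parse_args, is_not_found_page, apply_filters) are hypothetical, and the case-insensitive matching is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the single documented flag, --filters-off."""
    parser = argparse.ArgumentParser(description="Embed Markdown files.")
    parser.add_argument(
        "--filters-off",
        action="store_true",
        help="Disable the copyright-line and 404-page filters.",
    )
    return parser.parse_args(argv)


def is_not_found_page(text):
    """True when the text contains both '404' and 'page not found'."""
    lowered = text.lower()  # case-insensitive matching is an assumption
    return "404" in lowered and "page not found" in lowered


def apply_filters(text, filters_off=False):
    """Return the filtered text, or None when the file should be skipped."""
    if filters_off:
        return text  # filters disabled: pass the content through untouched
    if is_not_found_page(text):
        return None  # skip the whole file
    # Drop only the lines that contain the copyright symbol.
    kept = [line for line in text.splitlines() if "©" not in line]
    return "\n".join(kept)
```

Returning None for a skipped file keeps the "drop this file" case distinct from a file that is merely empty after filtering.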

Upon running the script, you will be prompted to choose an input method:

1. JSON Input File Containing URLs and Markdown Data
2. Folder of Markdown Files
3. Single Markdown File

JSON Input File

If you choose Option 1, you will be asked to provide:

Path of the JSON input file: The file should be a JSON array of objects, each containing url and markdown keys.
Path of the output folder: The folder where cleaned Markdown files and logs will be saved.

The script will:

Clean the data by removing duplicates.
Save the cleaned Markdown files to the specified output folder.
Generate a file_to_url.json mapping file.
Display a summary of the processing.
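
As a rough sketch of the JSON steps above, assuming duplicates are detected by identical Markdown bodies. The function name clean_json_input, the doc_0000.md naming scheme, and the shape of the returned summary are hypothetical; only the url and markdown keys and the file_to_url.json file name come from this README.

```python
import hashlib
import json
from pathlib import Path


def clean_json_input(json_path, output_dir):
    """Deduplicate records and write one Markdown file per unique document."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    seen = set()       # hashes of Markdown bodies already written
    file_to_url = {}   # output file name -> source URL
    duplicates = []    # URLs dropped because their body was already seen

    for record in records:
        body = record["markdown"]
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append(record["url"])
            continue
        seen.add(digest)
        name = f"doc_{len(file_to_url):04d}.md"  # hypothetical naming scheme
        (out / name).write_text(body, encoding="utf-8")
        file_to_url[name] = record["url"]

    # Write the file-name-to-URL mapping next to the cleaned files.
    (out / "file_to_url.json").write_text(
        json.dumps(file_to_url, indent=2), encoding="utf-8"
    )
    return {
        "total": len(records),
        "saved": len(file_to_url),
        "duplicates": duplicates,
    }
```

Hashing the body rather than comparing URLs catches the same page served under two addresses; whether the script does this is not stated here.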

Folder of Markdown Files

If you choose Option 2, you will be asked to provide:

Path of the folder containing Markdown files.

The script will:

Load all .md files from the specified folder.
Optionally filter out unwanted content.
Proceed to document splitting.
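
Loading a folder might look like the sketch below. The helper name load_markdown_folder is hypothetical, and the non-recursive search (files directly in the folder, not in subfolders) is an assumption.

```python
from pathlib import Path


def load_markdown_folder(folder):
    """Return (file name, text) pairs for each .md file directly in folder."""
    # Sorted so that the processing order is stable between runs.
    paths = sorted(Path(folder).glob("*.md"))
    return [(p.name, p.read_text(encoding="utf-8")) for p in paths]
```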

Single Markdown File

If you choose Option 3, you will be asked to provide:

Path of the Markdown file.

The script will:

Load the specified Markdown file.
Optionally filter out unwanted content.
Proceed to document splitting.

Document Splitting

After loading the documents, you will be prompted to split them:

Split Method: Choose between markdown or recursive splitting.
Remove Links: Optionally remove links from the Markdown content.
Language: Specify the language of the content, whether a programming language or a natural language.
Additional Settings:
    For Markdown Splitting:
        Header Levels: Specify which header levels (#, ##, etc.) to split on.
    For Recursive Splitting:
        Chunk Size: Specify the maximum size of each chunk.
        Chunk Overlap: Specify the number of overlapping characters between chunks.
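
To show what these settings control, here is a simplified standard-library stand-in. It is not the LangChain splitters the script relies on: split_by_size uses a plain sliding window rather than recursive splitting on separators, both function names are hypothetical, and the default values are assumptions.

```python
import re


def split_on_headers(text, levels=(1, 2)):
    """Start a new section at each header whose level is in `levels`."""
    sections, current = [], []
    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+\S", line)
        # Flush the running section when a header of a selected level begins.
        if match and len(match.group(1)) in levels and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def split_by_size(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

A larger overlap repeats more context at chunk boundaries at the cost of more chunks to embed. Header splitting keeps each section intact regardless of its length.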

You will have the option to preview the split data before proceeding.

Embedding and Saving

After splitting, you will be prompted to embed and save the documents:

Embedding Method: Choose between huggingface or ollama.
    HuggingFace: Enter the embedding model name (default: all-MiniLM-L6-v2).
    Ollama: Enter the Ollama model name (default: nomic-embed-text).
Persist Directory: Specify the directory to save the vector store database.
Collection Name: Enter a name for the Chroma collection.

The script will:

Embed the documents using the chosen embedding method.
Save the embeddings to a Chroma vector store.
Display information about the saved collections.
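
The embedding step might be wired up roughly as below. The two default model names come from this README; everything else is an assumption, including the helper names (resolve_model, embed_and_save) and the langchain_community import paths, which have moved between LangChain releases and may not match the version the script uses.

```python
DEFAULT_MODELS = {
    "huggingface": "all-MiniLM-L6-v2",
    "ollama": "nomic-embed-text",
}


def resolve_model(method, model_name=""):
    """Return the entered model name, or the documented default when blank."""
    method = method.strip().lower()
    if method not in DEFAULT_MODELS:
        raise ValueError(f"unknown embedding method: {method!r}")
    return model_name.strip() or DEFAULT_MODELS[method]


def embed_and_save(documents, method, persist_directory, collection_name,
                   model_name=""):
    """Embed LangChain documents and persist them to a Chroma collection."""
    # Imported lazily so resolve_model works without these packages installed.
    # These import paths are assumptions and depend on the LangChain version.
    from langchain_community.embeddings import (
        HuggingFaceEmbeddings,
        OllamaEmbeddings,
    )
    from langchain_community.vectorstores import Chroma

    model = resolve_model(method, model_name)
    if method.strip().lower() == "huggingface":
        embedding = HuggingFaceEmbeddings(model_name=model)
    else:
        embedding = OllamaEmbeddings(model=model)
    return Chroma.from_documents(
        documents,
        embedding,
        persist_directory=persist_directory,
        collection_name=collection_name,
    )
```

The Ollama path assumes an Ollama server is running locally with the chosen model already pulled.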

Examples

Example 1: Process JSON Input File

python md-embed.py

Choose Input Method: 1

Enter the path of the JSON input file: ./data/input.json
Enter the path of the output folder: ./output

Proceed through the prompts to clean data, split documents, and embed them.

Example 2: Process Folder of Markdown Files with Filters Off

python md-embed.py --filters-off

Choose Input Method: 2

Enter the path of the folder containing markdown files: ./markdown_files

Proceed through the prompts to load, split, and embed the documents.

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.

Create a new branch:

git checkout -b feature/your-feature-name

Make your changes and commit them:

git commit -m "Add your message"

Push to the branch:

git push origin feature/your-feature-name

Open a Pull Request.

Please make sure your code adheres to the existing style and that all tests pass.

License

This project is licensed under the MIT License.

Acknowledgments

web3dguy
LangChain for text splitting and document handling.
HuggingFace for embedding models.
Chroma for the vector store.
TQDM for progress bars.
The open-source community for continuous support and contributions.

Disclaimer: This tool is provided "as is" without warranty of any kind. Use it at your own risk. Open source or go away.
