Dataset Miner generates question-answer (Q&A) pairs from various file types using AI models. It processes PDF, text, Word, JSON, CSV, and Excel files to build a dataset suitable for fine-tuning language models or for other NLP tasks.
- Supports multiple file formats: PDF, TXT, DOCX, JSON, CSV, XLSX, XLS
- Uses Ollama AI models for Q&A pair generation
- Processes files in chunks to handle large documents efficiently
- Provides cost analysis for token usage
- Outputs data in Alpaca dataset format
- Includes logging for easy debugging and progress tracking
- Python 3.6+
- Dependencies listed in requirements.txt
- Clone this repository:
  git clone https://github.com/your-username/dataset-miner.git
  cd dataset-miner
- Install the required dependencies:
  pip install -r requirements.txt
Run the main script with the following command:
python main.py -source <input_directory> -model <ollama_model_id> [--output <output_file>] [--debug]
Arguments:
- -source: Directory containing files to mine (required)
- -model: Ollama AI model ID/slug to use for mining (required)
- --output: Output JSON file for the mined dataset (default: mined_dataset.json)
- --debug: Enable debug logging
Example:
python main.py -source ./documents -model gpt-4o-mini --output mined_data.json
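For readers who want to see how such a command line could be parsed, here is a minimal sketch using the standard-library argparse module. It mirrors the documented arguments and default only; the function name build_parser and the help strings are assumptions, and the actual main.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line arguments.

    Hypothetical helper: the real main.py may define its arguments differently.
    """
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Generate Q&A pairs from a directory of files.",
    )
    # The usage line documents single-dash spellings for these two options.
    parser.add_argument("-source", dest="source", required=True,
                        help="Directory containing files to mine")
    parser.add_argument("-model", dest="model", required=True,
                        help="Ollama AI model ID/slug to use for mining")
    # Optional arguments, with the documented default output file name.
    parser.add_argument("--output", default="mined_dataset.json",
                        help="Output JSON file for the mined dataset")
    parser.add_argument("--debug", action="store_true",
                        help="Enable debug logging")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.source, args.model, args.output, args.debug)
```

Omitting -source or -model makes argparse exit with a usage error, which matches their "required" status in the argument list.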
- main.py: Entry point of the application
- data_extractor.py: Contains functions to extract text from various file formats
- llm_utils.py: Utility functions for interacting with the AI model and processing text
- cost_analyzer.py: Tracks token usage and estimates costs
- prompt_templates.py: Defines prompt templates for the AI model
- requirements.txt: Lists all required Python packages
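The feature list mentions that files are processed in chunks. One common way to do this is a sliding character window with a small overlap, sketched below. The function name chunk_text, the default sizes, and the overlap strategy are all assumptions for illustration; the project's own chunking logic is not shown in this README and may work on tokens, sentences, or pages instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly.

    Hypothetical helper: sizes are in characters, not model tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of sending some text to the model twice.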
The tool generates a JSON file containing Q&A pairs in the Alpaca dataset format:
[
  {
    "instruction": "Question goes here",
    "input": "Any relevant input (if applicable)",
    "output": "Answer goes here"
  },
  ...
]
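Before using a mined file for fine-tuning, it can help to confirm that every record has the three fields shown above. The following is a minimal sketch of such a check; the helper names is_alpaca_record and load_dataset are hypothetical and not part of the project, and the rule that all three values must be strings is an assumption based on the example.

```python
import json

ALPACA_KEYS = {"instruction", "input", "output"}


def is_alpaca_record(record) -> bool:
    """Return True if record is a dict with exactly the three string fields."""
    return (
        isinstance(record, dict)
        and set(record) == ALPACA_KEYS
        and all(isinstance(record[key], str) for key in ALPACA_KEYS)
    )


def load_dataset(path: str) -> list:
    """Load a mined dataset and raise ValueError on malformed content."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    # Collect the positions of bad records so they are easy to find and fix.
    bad = [i for i, record in enumerate(records) if not is_alpaca_record(record)]
    if bad:
        raise ValueError(f"malformed records at indexes: {bad}")
    return records
```

For example, load_dataset("mined_dataset.json") would return the list of records, or raise with the offending indexes.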
The tool provides a summary of token usage and estimated costs based on the specified model's pricing. This information is logged at the end of the mining process.
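The arithmetic behind such a summary can be sketched as follows. This is an illustration only: the class name CostTracker, the per-1,000-token pricing unit, and the split into input and output prices are assumptions, and no real model prices are included. The project's cost_analyzer.py may compute costs differently.

```python
class CostTracker:
    """Accumulate token counts and convert them to an estimated cost.

    Hypothetical class: prices are supplied by the caller per 1,000 tokens.
    """

    def __init__(self, input_price_per_1k: float, output_price_per_1k: float):
        self.input_price_per_1k = input_price_per_1k
        self.output_price_per_1k = output_price_per_1k
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        """Record the token usage of one model call."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def estimated_cost(self) -> float:
        """Return tokens / 1000 * price, summed over input and output."""
        return (self.input_tokens / 1000 * self.input_price_per_1k
                + self.output_tokens / 1000 * self.output_price_per_1k)

    def summary(self) -> str:
        """Format the totals as a single line suitable for logging."""
        return "input_tokens={} output_tokens={} estimated_cost={:.4f}".format(
            self.input_tokens, self.output_tokens, self.estimated_cost())
```

A caller would create one tracker per run, call add() after each model response, and log summary() once mining finishes.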
Detailed logs are written to the console, including progress updates, token usage, and any errors encountered during the mining process. Use the --debug flag for more verbose logging.
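A debug switch like this is typically wired to the standard-library logging module. The sketch below shows one way to do it; the function name configure_logging, the logger name "dataset_miner", and the message format are assumptions rather than the project's actual setup.

```python
import logging


def configure_logging(debug: bool = False) -> logging.Logger:
    """Configure console logging and return a logger for the application.

    Hypothetical helper: DEBUG level when debug is True, INFO otherwise.
    """
    level = logging.DEBUG if debug else logging.INFO
    # basicConfig attaches a console (stderr) handler to the root logger.
    # It has no effect if the root logger already has handlers.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger = logging.getLogger("dataset_miner")
    # Set the level on the named logger too, so it applies even when
    # basicConfig was a no-op.
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    log = configure_logging(debug=True)
    log.debug("Shown only when debug logging is enabled")
    log.info("Shown at both levels")
```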
Contributions to the Dataset Miner project are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs.
[Specify your chosen license here]
This tool uses AI models to generate content. While it strives for accuracy, the generated Q&A pairs should be reviewed for quality and appropriateness before use in production systems or datasets.