Welcome to DataBuilder! This project leverages the power of large language models to create high-quality, customized datasets for AI training and evaluation. Whether you're building a new AI model or enhancing an existing one, DataBuilder is here to streamline your data preparation process.
- Custom Dataset Generation: Tailor datasets to your specific needs using advanced AI models.
- High Quality & Diversity: Ensure your datasets are diverse and representative of real-world scenarios.
- Scalable Solutions: Easily scale your dataset generation to meet the demands of any project size.
- User-Friendly Interface: Intuitive design makes it easy for anyone to generate datasets, regardless of technical expertise.
Follow these steps to get started with DataBuilder:
- Python 3.10 or higher
- pip (Python package manager)
-
Clone the repository:
git clone https://github.com/DemonDamon/DataBuilder.git cd DataBuilder
-
Install the required packages:
pip install -r requirements.txt
-
Configure your environment:
- Copy
.env.example
to.env
- Update the
.env
file with your OpenAI API credentials:
OPENAI_API_KEY=your-api-key-here OPENAI_API_BASE=https://api.openai.com/v1
- Copy
-
Test API Connection:
python examples/test_api.py
If successful, you should see:
API 测试成功! 模型回复: 你好,请问有什么可以帮助您的吗?
-
Interactive Chat Example:
python examples/chat_completion.py
This will start an interactive chat session with GPT-3.5. You can:
- Type your message and press Enter to chat
- Type 'clear' to clear chat history
- Type 'quit' to exit
Example code:
from dotenv import load_dotenv from openai import OpenAI import os # Load environment variables load_dotenv() # Configure OpenAI client client = OpenAI( base_url=os.getenv("OPENAI_API_BASE"), api_key=os.getenv("OPENAI_API_KEY") ) messages = [] print("Welcome to GPT-3.5 Chat! (Type 'quit' to exit, 'clear' to clear history)") while True: user_input = input("\nYou: ").strip() if user_input.lower() == 'quit': break messages.append({"role": "user", "content": user_input}) print("\nGPT: ", end="", flush=True) response = client.chat.completions.create( model="gpt-3.5-turbo", messages=messages, stream=True ) assistant_message = "" for chunk in response: if chunk.choices[0].delta.content: print(chunk.choices[0].delta.content, end="", flush=True) assistant_message += chunk.choices[0].delta.content if assistant_message: messages.append({"role": "assistant", "content": assistant_message}) print()
-
Generate Data Examples:
a. Quick Generation generate_data.py:
b. Batch Generation generate_dataset.py:
- Uses YAML configuration
- Supports async processing
- Handles large-scale generation
Example configuration config/default_config.yaml:
Run the generator:
python examples/generate_dataset.py
Key differences:
generate_data.py
: Quick testing and small datasetsgenerate_dataset.py
: Production use with:- Configuration management
- Async processing
- Batch generation
- Flexible parameter control
c. Classification Example classification.py:
Features:
- Automatic Prompt Optimization: Automatically optimize prompts through multiple training rounds
- Accuracy Feedback: Calculate and display accuracy for each training round
- Format Standardization: Automatically standardize model output formats
- Progressive Learning: Support multi-round iterative training to improve effectiveness
Run Examples:
python examples/classification.py
Output Example:
开始第 1 轮训练... 训练准确率: {'sentiment_accuracy': 0.4} 新提示词效果更好: 1.0 > 0.4 开始第 2 轮训练... 训练准确率: {'sentiment_accuracy': 1.0} 开始第 3 轮训练... 训练准确率: {'sentiment_accuracy': 1.0} 优化后的提示词: ``` 对商品评论进行情感分类。 输入模板: 评论文本: {text} 输出模板: 情感分类: {sentiment} 可用标签: {'sentiment': ['正面', '负面', '中性']} ```
- API Connection Errors:
- Verify your API key is correct
- Check if you need to use a proxy (add to
.env
):OPENAI_PROXY=http://127.0.0.1:7890
- Ensure your API base URL is correct
We welcome contributions from the community! If you'd like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch (
git checkout -b feature/YourFeature
). - Commit your changes (
git commit -m 'Add some feature'
). - Push to the branch (
git push origin feature/YourFeature
). - Open a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
For any questions or feedback, please reach out to us at bingzhenli@hotmail.com.
This project draws inspiration from the Adala framework, incorporating several key architectural concepts:
-
Agent-Based Architecture
- Autonomous data generation agents
- Iterative learning capabilities
- Environment-aware processing
-
Runtime System
- Flexible model integration
- Configurable execution environments
- Support for multiple LLM providers
-
Environment Management
- Ground truth dataset handling
- Feedback collection mechanisms
- Performance metrics tracking
-
Skills System
- Task-specific capabilities
- Customizable output formats
- Validation mechanisms
We've enhanced several aspects while adapting Adala's concepts:
- Simplified Configuration: Streamlined setup process with YAML-based configs
- Async Processing: Added robust async support for batch operations
- Enhanced Validation: Improved data quality checks and error handling
- Metrics Collection: Added comprehensive generation metrics and monitoring
Planned features inspired by Adala:
- Memory management for long-term learning
- Advanced feedback collection mechanisms
- Multi-modal data generation support
- Enhanced runtime optimization
For more details about Adala's architecture, see their documentation.
- Implement teacher-student model architecture (using stronger models to guide weaker ones)
- Support multiple skill combinations and collaboration
- Add long-term memory management mechanism
- Implement asynchronous feedback collection
- Support human-in-the-loop feedback mechanisms
- Add real-time environment interaction capabilities
- Implement dynamic dataset management
- Support incremental learning scenarios
- Support multiple LLM providers (e.g., Claude, Wenxin, etc.)
- Implement automatic model performance evaluation
- Add model call cost tracking
- Support batch processing optimization
- Add Named Entity Recognition (NER) skills
- Implement multi-task parallel learning
- Support cross-language skill transfer
- Add text generation skill templates
- Provide command-line tools
- Implement REST API interfaces
- Support Jupyter Notebook integration
- Add Web UI interfaces
- Implement detailed metrics collection system
- Add performance visualization panels
- Support experiment comparison analysis
- Implement automated testing frameworks
- Add image processing capabilities
- Support speech input and output
- Implement cross-modal skills
- Support video content processing