
New Experimental Training Features, Providers Refactor #1155

Merged
110 commits merged into main from lean-agixt on Apr 2, 2024

Conversation

Josh-XT
Owner

@Josh-XT Josh-XT commented Mar 29, 2024

Providers Refactor

  • Removed many providers and extensions during transition.
  • Added vision for Claude provider.
  • Automated model selection for vision for OpenAI and Claude.
  • Added a default provider that uses gpt4free for LLM, faster-whisper for audio transcription/translation, streamlabs for text-to-speech, ONNX all-MiniLM-L6-v2 embedder (256 chunk size), and stable diffusion on Hugging Face for image generation (Requires HUGGINGFACE_API_KEY).

Refactor TTS, Audio to Text, Embeddings, and Image Generation to Providers

Capabilities like TTS, audio to text, embeddings, and image generation are now exposed as provider services, instead of being implemented as separate extensions for each provider.

Each provider now has a services property, which is a list of the services available from that provider. Providers with an embeddings service also have a chunk_size property for the embedder.

For example, the OpenAI provider has:

self.chunk_size = 1024

@staticmethod
def services():
    return [
        "llm",            # Language model
        "tts",            # Text to speech
        "image",          # Image generation
        "embeddings",     # Embeddings creation
        "transcription",  # Audio transcription to text
        "translation",    # Audio translation to text in English
    ]
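A provider's services() declaration can then drive provider selection at runtime. The sketch below is illustrative only: the classes are hypothetical stand-ins, not AGiXT's actual provider classes, and only the embedding-capable one carries chunk_size.

```python
# Hedged sketch: stand-in provider classes, not AGiXT's actual implementation.

class OpenAIProvider:
    chunk_size = 1024  # only embedding-capable providers expose this

    @staticmethod
    def services():
        return [
            "llm",
            "tts",
            "image",
            "embeddings",
            "transcription",
            "translation",
        ]


class StreamlabsProvider:
    @staticmethod
    def services():
        return ["tts"]


def providers_for(service, providers):
    """Return every provider class that advertises the given service."""
    return [p for p in providers if service in p.services()]


embedders = providers_for("embeddings", [OpenAIProvider, StreamlabsProvider])
chunk_sizes = {p.__name__: p.chunk_size for p in embedders}
```

Filtering on services() like this is one way a caller could route an embeddings request only to providers that can actually serve it.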

New Experimental Training Features

These new training features require some testing and will improve as better training methods become available. The first training implementation I have built in is DPO. Open to feedback and improvements.

DPO, CPO, and ORPO style Dataset Creation Functionality Created

  • AGiXT can now take all memories created and turn them into a synthetic question / good answer / bad answer dataset in DPO / CPO / ORPO format, to be used in Transformers (or the solution of your choice) to fine-tune models.
  • API endpoint: /api/agent/{agent_name}/memory/dataset
  • Once the dataset has been created, it can be found at AGiXT/agixt/WORKSPACE/{dataset_name}.json.
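For reference, one record in a DPO-style dataset typically pairs a prompt with a preferred and a rejected answer. The field names below (prompt/chosen/rejected) follow the common DPO convention and are an assumption about the output shape, not taken from AGiXT's source:

```python
import json

# Hypothetical DPO-format record; the prompt/chosen/rejected field names are
# an assumption based on DPO convention, not AGiXT's confirmed schema.
record = {
    "prompt": "What is the services() property on a provider?",
    "chosen": "A list of the services that provider makes available.",
    "rejected": "A list of the provider's API keys.",
}

# The generated {dataset_name}.json would hold many such records.
serialized = json.dumps([record], indent=2)
```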

Example with Python SDK

The example below will consume the AGiXT GitHub repository into the agent's memory, then create a synthetic dataset with the learned information.

from agixtsdk import AGiXTSDK

agixt = AGiXTSDK(base_uri="http://localhost:7437", api_key="Your AGiXT API Key")

# Define the agent we're working with
agent_name="gpt4free"

# Consume the whole AGiXT GitHub Repository to the agent's memory.
agixt.learn_github_repo(
    agent_name=agent_name,
    github_repo="Josh-XT/AGiXT",
    collection_number=0,
)

# Create a synthetic dataset in DPO/CPO/ORPO format.
agixt.create_dataset(
    agent_name=agent_name, dataset_name="Your_dataset_name", batch_size=5
)

Model Training Based on Agent Memories

Training is finally a full process instead of stopping at the memories. After your agent learns from a GitHub repo, files, arXiv articles, websites, or YouTube captions, you can use the new training endpoint to:

  • Turn all of the agent's memories into a synthetic DPO/CPO/ORPO-format dataset
  • Turn the dataset into a DPO QLoRA with unsloth
  • Merge it into the model of your choosing to make your own model from the data you trained your AGiXT agent on
  • Upload your new model to Hugging Face once complete (public or private via the private_repo boolean), if your agent has a HUGGINGFACE_API_KEY in its config

from agixtsdk import AGiXTSDK

agixt = AGiXTSDK(base_uri="http://localhost:7437", api_key="Your AGiXT API Key")

# Define the agent we're working with
agent_name="gpt4free"

# Consume the whole AGiXT GitHub Repository to the agent's memory.
agixt.learn_github_repo(
    agent_name=agent_name,
    github_repo="Josh-XT/AGiXT",
    collection_number=0,
)

# Train the desired model on a synthetic DPO dataset created from the agent's memories.
agixt.train(
    agent_name=agent_name,
    dataset_name="dataset",
    model="unsloth/mistral-7b-v0.2",
    max_seq_length=16384,
    huggingface_output_path="JoshXT/finetuned-mistral-7b-v0.2",
    private_repo=True,
)

Chat Completions endpoint modifications

Several modifications have been made to the Chat Completions endpoint to bring it more in line with the OpenAI endpoints. These modifications were in addition to the changes in #1154.
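As a sketch of what OpenAI-style compatibility implies, the request body below follows the OpenAI chat-completions schema. Treating the model field as the AGiXT agent name and the /v1/chat/completions path are assumptions based on this PR, not confirmed routes:

```python
import json

# OpenAI-style chat-completions payload. Using the agent name as "model" and
# posting to /v1/chat/completions are assumptions, not confirmed from the PR.
payload = {
    "model": "gpt4free",  # AGiXT agent name (assumed mapping)
    "messages": [
        {"role": "user", "content": "Summarize the providers refactor."}
    ],
}
body = json.dumps(payload)

# POST `body` to http://localhost:7437/v1/chat/completions with your AGiXT
# API key in the Authorization header (request not executed here).
```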

@Josh-XT Josh-XT changed the title Refactor TTS, Audio to Text, and Image Generation to Providers Refactor TTS, Audio to Text, Embeddings, and Image Generation to Providers Mar 29, 2024
@Josh-XT Josh-XT changed the title Refactor TTS, Audio to Text, Embeddings, and Image Generation to Providers Providers Refactor, Dataset Creation Functionality Created Mar 29, 2024
@Josh-XT Josh-XT merged commit 409909b into main Apr 2, 2024
8 checks passed
@Josh-XT Josh-XT deleted the lean-agixt branch April 2, 2024 18:05