
SciPhi/AgentSearch-V1 · Datasets at Hugging Face #386

Open
1 task
irthomasthomas opened this issue Jan 18, 2024 · 1 comment
Labels
dataset (public datasets and embeddings), embeddings (vector embeddings and related tools)

Comments


Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optionally, stream just the "arxiv" subset:
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.
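Since each entry packs all of a document's passage embeddings into a single buffer, a natural next step after decoding is ranking those passages against a query vector. The sketch below is illustrative, not part of the dataset's tooling: the `top_chunks` helper is hypothetical, and `query_vec` is assumed to be a 768-dimensional vector you have already produced with the same jina-v2-base model.

```python
import json
import numpy as np

def top_chunks(entry, query_vec, k=3):
    """Rank one entry's passages by cosine similarity to a query vector."""
    # Decode the packed float32 buffer into an (n_passages, 768) matrix.
    emb = np.frombuffer(entry["embeddings"], dtype=np.float32).reshape(-1, 768)
    chunks = json.loads(entry["text_chunks"])
    # Normalize both sides so a plain dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q
    order = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), chunks[i]) for i in order]
```

For anything beyond a handful of entries you would load the vectors into a proper index (e.g. a vector database) rather than scan the stream, but this shows the decode-and-score shape of the data.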

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimized. It was built by filtering, cleaning, and augmenting publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

  • Open-Web (Common Crawl Foundation Terms of Use)
  • Books: the_pile_books3 license and pg19 license
  • ArXiv Terms of Use
  • Wikipedia License
  • StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

irthomasthomas added the New-Label and embeddings labels on Jan 18, 2024

irthomasthomas commented Jan 18, 2024

Time LLM Embed Multi Commands

This markdown document contains the output of various llm embed-multi commands used for embedding data into an embeddings database. Each command is separated by a horizontal rule for easy reading.

CPU, batch size 10 (the maximum I can run in 64 GB)

time llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs $(llm logs path) --sql 'SELECT id, prompt FROM logs.responses LIMIT 1000' -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 10

Output:

llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs  --sql  -m  7142.22s user 3449.01s system 1320% cpu 13:21.98 total

GPU, batch size 1

time llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs  --sql  -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 1

Output:

llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs  --sql  -m  25.44s user 11.39s system 130% cpu 28.310 total
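From the two `total` wall-clock figures above, a quick back-of-the-envelope comparison works out to roughly a 28x speedup. This sketch assumes the GPU run also embedded 1000 rows, since its `--sql` argument is elided in the captured output:

```python
def rows_per_second(rows: int, wall_seconds: float) -> float:
    """Throughput in rows embedded per second of wall-clock time."""
    return rows / wall_seconds

cpu_wall = 13 * 60 + 21.98   # CPU run: 13:21.98 total
gpu_wall = 28.31             # GPU run: 28.310 total

# Assumption: both runs embedded the same 1000 prompts.
print(f"CPU: {rows_per_second(1000, cpu_wall):.2f} rows/s")
print(f"GPU: {rows_per_second(1000, gpu_wall):.2f} rows/s")
print(f"speedup: {cpu_wall / gpu_wall:.1f}x")
```

Note the GPU run used batch size 1 versus 10 on CPU, so the per-row gap would likely widen further with GPU batching.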

irthomasthomas added the dataset and embeddings labels and removed the New-Label label on Jan 18, 2024