
SciPhi/AgentSearch-V1 · Datasets at Hugging Face #386

Open
1 task
irthomasthomas opened this issue Jan 18, 2024 · 1 comment
Labels
dataset (public datasets and embeddings), embeddings (vector embeddings and related tools)

Comments


Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optionally, stream just the "arxiv" subset:
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.
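Since each entry packs all of a document's passage embeddings into a single buffer, a natural next step after decoding is ranking those passages against a query vector. The sketch below is illustrative, not part of the dataset's tooling: the `top_chunks` helper is hypothetical, and `query_vec` is assumed to be a 768-dimensional vector you have already produced with the same jina-v2-base model.

```python
import json
import numpy as np

def top_chunks(entry, query_vec, k=3):
    """Rank one entry's passages by cosine similarity to a query vector."""
    # Decode the packed float32 buffer into an (n_passages, 768) matrix.
    emb = np.frombuffer(entry["embeddings"], dtype=np.float32).reshape(-1, 768)
    chunks = json.loads(entry["text_chunks"])
    # Normalize both sides so a plain dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q
    order = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), chunks[i]) for i in order]
```

For anything beyond a handful of entries you would load the vectors into a proper index (e.g. a vector database) rather than scan the stream, but this shows the decode-and-score shape of the data.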

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimized. It was built by filtering, cleaning, and augmenting publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

  • Open-Web (Common Crawl Foundation Terms of Use)
  • Books: the_pile_books3 license and pg19 license
  • ArXiv Terms of Use
  • Wikipedia License
  • StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

irthomasthomas added the New-Label and embeddings labels on Jan 18, 2024

irthomasthomas commented Jan 18, 2024

Time LLM Embed Multi Commands

This markdown document contains the output of various llm embed-multi commands used for embedding data into an embeddings database. Each command is separated by a horizontal rule for easy reading.

CPU, batch size 10 (the maximum I can run in 64 GB)

time llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs $(llm logs path) --sql 'SELECT id, prompt FROM logs.responses LIMIT 1000' -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 10

Output:

llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs  --sql  -m  7142.22s user 3449.01s system 1320% cpu 13:21.98 total

GPU, batch size 1

time llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs  --sql  -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 1

Output:

llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs  --sql  -m  25.44s user 11.39s system 130% cpu 28.310 total
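From the two `total` wall-clock figures above, a quick back-of-the-envelope comparison works out to roughly a 28x speedup. This sketch assumes the GPU run also embedded 1000 rows, since its `--sql` argument is elided in the captured output:

```python
def rows_per_second(rows: int, wall_seconds: float) -> float:
    """Throughput in rows embedded per second of wall-clock time."""
    return rows / wall_seconds

cpu_wall = 13 * 60 + 21.98   # CPU run: 13:21.98 total
gpu_wall = 28.31             # GPU run: 28.310 total

# Assumption: both runs embedded the same 1000 prompts.
print(f"CPU: {rows_per_second(1000, cpu_wall):.2f} rows/s")
print(f"GPU: {rows_per_second(1000, gpu_wall):.2f} rows/s")
print(f"speedup: {cpu_wall / gpu_wall:.1f}x")
```

Note the GPU run used batch size 1 versus 10 on CPU, so the per-row gap would likely widen further with GPU batching.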

irthomasthomas added the dataset and embeddings labels and removed the New-Label label on Jan 18, 2024