
Commit

project refactor
emcf committed Mar 23, 2024
1 parent 77fa6a3 commit 0a29b6e
Showing 30 changed files with 644 additions and 770 deletions.
160 changes: 0 additions & 160 deletions .gitignore

This file was deleted.

21 changes: 0 additions & 21 deletions LICENSE

This file was deleted.

91 changes: 41 additions & 50 deletions README.md
@@ -1,52 +1,45 @@
# <img src="https://rpnutzemutbrumczwvue.supabase.co/storage/v1/object/public/assets/pipeline_small%20(1).png" alt="Pipeline Illustration" style="width:96px; height:72px; vertical-align:middle;"> The Pipe

-[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=KHD1PDOSHF)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg)
+[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg)

-The pipe is a tool for feeding complex real-world data into large language models. It is built on top of dozens of carefully-crafted heuristics to create sensible representations from a variety of sources, including code projects, scientific papers, web pages, github repos, data files, databases, and more.
+The pipe is a multimodal-first tool for feeding real-world data into large language models. It is built on top of dozens of carefully-crafted heuristics to create sensible representations from complex sources, including files, directories, scientific papers, web pages, GitHub repos, and more.

-## 🛠️ How it works
+## Features 🌟

-The pipe is accessible from the command line or from [Python](https://www.python.org/downloads/). The input source is either a file path, a URL, or a directory (or zip file) path. The pipe will extract information from the source and process it for downstream use with [LLMs](https://en.wikipedia.org/wiki/Large_language_model). The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from [gemma-7b](https://huggingface.co/google/gemma-7b) to [GPT-4](https://openai.com/gpt-4). It uses a variety of heuristics to optimize the output for LLMs, including [AI-native PDF extraction](https://docs.mathpix.com/#process-a-pdf), [efficient token compression](https://arxiv.org/abs/2403.12968), [code compression with Ctags](https://en.wikipedia.org/wiki/Ctags), automatic [image encoding](https://en.wikipedia.org/wiki/Base64), reranking for [LITM](https://arxiv.org/abs/2307.03172) effects, and more, all pre-built to work out-of-the-box.
+- Prepare prompts from dozens of complex file types 📄
+- Visual document extraction for complex PDFs, markdown, etc. 🧠
+- Outputs optimized for multimodal LLMs 🖼️ + 💬
+- Multi-threaded ⚡️
+- Works with missing file extensions and in-memory data streams 💾
+- Works with directories, URLs, git repos, and more 🌐
+## How it works 🛠️

-## 📂 Supported input sources
+The pipe is accessible from the command line or from [Python](https://www.python.org/downloads/). The input source is either a file path, a URL, or a directory (or zip file) path. The pipe will extract information from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model), [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer), or [vision-language models](https://arxiv.org/abs/2304.00685). The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from [gemma-7b](https://huggingface.co/google/gemma-7b) to [GPT-4](https://openai.com/gpt-4). It uses a variety of heuristics for optimal performance with vision-language models, including AI [filetype detection](https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html), AI [PDF extraction](https://mathpix.com), efficient [token compression](https://arxiv.org/abs/2403.12968), automatic [image encoding](https://en.wikipedia.org/wiki/Base64), [reranking](https://arxiv.org/abs/2310.06839) for [lost-in-the-middle](https://arxiv.org/abs/2307.03172) effects, and more, all pre-built to work out-of-the-box.
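As a rough illustration of the image-encoding step mentioned above (a generic sketch, not thepipe's actual code), base64-encoding an image for a multimodal prompt looks like this; the file path is a hypothetical placeholder:

```python
import base64

# Read an image and embed it as a base64 data URL, a format many
# vision-language model APIs accept for image inputs.
with open("figure.png", "rb") as f:  # hypothetical placeholder path
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/png;base64,{encoded}"
```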

-### Sources
-- 📁 **Project directories** (any directory)
-- 🗂️ **Zip / Tarballs** (`.zip`, `.tar`, `.gz`)
-- 🔗 **URLs** (any input containing `http` or `www`, or `.url` shortcut file)
-- 🐙 **GitHub Repositories** (any input containing `github.com`)
-- 🗃️ **Business Database** (any input containing `supabase`)
+## Getting Started 🚀

-### Documents
-- 📜 **Code files** (`.py`, `.cpp`, `.ts`, `.css`, `.h`, etc.)
-- 📚 **PDFs** (`.pdf` or any input containing `arxiv`; extract images/tables/math with `--mathpix`)
-- 🖼️ **Images** (`.png`, `.jpg`, `.jpeg`, `.gif`)
-- 📊 **Spreadsheets** (`.csv`, `.xlsx`)
-- 📜 **Configuration files** (`.yaml`, `.json`, `.xml`, `.ini`, `.xaml`, `.cfg`, `.config`)
-- 📓 **IPython notebooks** (`.ipynb`)
-- 📝 **Word documents** (`.docx`)
-- 📊 **Powerpoint presentations** (`.pptx`)



-## 🚀 Getting Started
+To use The Pipe, simply clone this repository and install the requirements:
+```bash
+git clone https://github.com/emcf/thepipe
+pip install -r requirements.txt
+```

-To use The Pipe, simply clone this repository and run
+To use The Pipe from the command line, simply run

```bash
-python thepipe.py --source /path/to/directory --output prompt.txt
+python thepipe.py path/to/directory --limit 100000
```

-This command will process all supported files within the specified directory, compressing the information over the token limit if necessary, and outputting the result to `output.txt`.
+This command will process all supported files within the specified directory, compressing any information over the token limit if necessary, and outputting the result to a folder.

Arguments are:
-- `--source` (required): The input source, can be a file path, a URL, or a directory path.
-- `--output` (required): The output file path.
-- `--limit` (optional): The token limit for the output, defaults to 64K.
+- The input source (required): can be a file path, a URL, or a directory path.
+- `--match` (optional): Glob pattern to match files in the directory.
+- `--limit` (optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed.
+- `--mathpix` (optional): Extract images, tables, and math from PDFs using [Mathpix](https://docs.mathpix.com/#process-a-pdf).
+- `--text` (optional): Output text scraped from images instead of [base64](https://en.wikipedia.org/wiki/Base64)-encoded images.
+- `--text_only` (optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.
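For instance, a hypothetical invocation combining these arguments (the path and values below are placeholders) might look like:

```bash
python thepipe.py path/to/project --match "*.py" --limit 50000 --text_only
```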

-Alternatively, to use the pipe from Python:
+To use the pipe from Python:

```python
import openai
@@ -60,21 +53,19 @@
response_content = response.choices[0].message.content
print(response_content)
```

-## ⚖️ Heuristics
-
-To optimize the output for downstream tasks, the pipe uses a variety of assumptions and heuristics to extract the most important information from the input data, and to format it. Here are some of the most important ones:
-- **Optional [Mathpix](https://docs.mathpix.com/#process-a-pdf) PDF extraction**: Extracts images, tables, and math from PDFs.
-- **[Ctags](https://en.wikipedia.org/wiki/Ctags) token compression**: When the output prompt is too large, automatically extracts essential code structure (functions, classes, variables, types) and throws away the rest. Useful for high-quality coding under strict token constraints.
-- **[LLMLingua](https://arxiv.org/abs/2403.12968) token compression**: When the output prompt is too large, automatically extracts essential tokens; can improve downstream performance by removing noise.
-- **[LITM](https://arxiv.org/abs/2307.03172) reranking**: Reformats the output to minimize the impact of the "lost in the middle" effect to improve downstream performance with LLMs.
-- **Image resizing and [base64](https://en.wikipedia.org/wiki/Base64) encoding**: Maximum image dimensions are clipped to 512 pixels and encoded in base64 for easy downstream use with vision language models. Can alternatively output a text description of all images with `--text`, or text scraped from all images with `--scrape`.
-- **[Unstructured](https://github.com/Unstructured-IO/unstructured) extraction** from unknown sources.
-- **Ignore rules**: Sensible out-of-the-box ignore rules for common directories and files that are not useful for downstream tasks, such as `node_modules`, `__pycache__`, `.gitignore`, etc. Feel free to customize these for your own use case by modifying `FILES_TO_IGNORE` in `config.py`.

## License 📜

Distributed under the MIT License. See `LICENSE` for more information.

----

-Made with ❤️ and Python.
+## Supported File Types 📚

+| Source Type | Input Types | Token Compression 🗜️ | Image Extraction 👁️ | Notes 📌 |
+|---------------------------------------|------------------------------------------|-------------------|------------------|---------------------------------------------------------|
+| Directory | Any `/path/to/directory` | Yes | No | Extracts from all files in directory, supports match and ignore patterns |
+| Code | `.h`, `.json`, `.js`, `.jsx`, `.ts`, `.tsx`, `.cs`, `.java`, `.html`, `.css`, `.ini`, `.xml`, `.yaml`, `.xaml`, `.sh`, `.c`, `.cpp`, `.py` | Varies (ctags) | No | Combines all code files. `.c`, `.cpp`, `.py` are compressible with ctags, others are not |
+| Plaintext | `.txt`, `.md`, `.rtf` | Yes | No | Regular text files |
+| PDF | `.pdf` | Yes | Yes (Optional) | Extracts text and optionally images; can use Mathpix for enhanced extraction |
+| Image | `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg` | No | Yes | Extracts images and can convert to text using OCR |
+| Spreadsheet | `.csv`, `.xls`, `.xlsx` | No | No | Extracts data from spreadsheets; converts to text representation |
+| Jupyter Notebook | `.ipynb` | Yes | No | Extracts content from Jupyter notebooks |
+| Microsoft Word Document | `.docx` | Yes | No | Extracts text from Word documents |
+| Microsoft PowerPoint Presentation | `.pptx` | Yes | No | Extracts text from PowerPoint presentations |
+| Website | URLs (http, https, www, ftp) | No | Yes (Optional) | Extracts content from web pages; text-only extraction available |
+| GitHub Repository | GitHub repo URLs | Yes | No | Extracts from GitHub repositories; supports branch specification |
+| ZIP File | `.zip` | Yes | No | Extracts contents of ZIP files; supports nested directory extraction |
84 changes: 84 additions & 0 deletions compress.py
@@ -0,0 +1,84 @@
import json
import os
import shutil
import subprocess
import tempfile
from typing import List

from core import Chunk, SourceTypes

CTAGS_LANGUAGES = {'py': "Python", 'ts': "Typescript", "tsx": "Typescript", "cpp": "C++", "c": "C"}

def compress_with_ctags(chunk: Chunk, extension: str) -> Chunk:
    if chunk.text is None:
        # Nothing to compress; pass the chunk through, marked as uncompressible.
        return Chunk(path=chunk.path, text=chunk.text, image=chunk.image, source_type=SourceTypes.UNCOMPRESSIBLE_CODE)
    language = CTAGS_LANGUAGES[extension]  # callers must pass a supported extension
    tmp_dir = tempfile.mkdtemp()
    try:
        file_path = os.path.join(tmp_dir, "tempfile") + '.' + extension
        with open(file_path, 'w', encoding='utf-8') as tmp_file:
            tmp_file.write(chunk.text)
        # need custom options for ctags to work with typescript
        if language == "Typescript":
            cmd = [
                "./ctags",
                f"--languages={language}",
                "--output-format=json",
                "--options=./ts.ctags",
                "-f", "-",
                file_path
            ]
        else:
            cmd = [
                "./ctags",
                f"--languages={language}",
                "--output-format=json",
                "-f", "-",
                file_path
            ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise Exception(f"Error running ctags: {result.stderr}")
        # write raw ctags output to a file for inspection (deleted below)
        with open('ctags_output.json', 'w', encoding='utf-8') as f:
            f.write(result.stdout)
        # Process the JSON output: each line is one tag whose 'pattern' field
        # has the form "/^<source line>$/"; stripping two characters from each
        # end recovers the original source line for that definition.
        ctag_matches = []
        for line in result.stdout.strip().splitlines():
            tag = json.loads(line)
            if 'pattern' in tag:
                pattern_without_regex = tag['pattern'][2:-2]
                ctag_matches.append(pattern_without_regex)
    finally:
        shutil.rmtree(tmp_dir)
        # remove the json file
        if os.path.exists('ctags_output.json'):
            os.remove('ctags_output.json')
    ctags_skeleton = '\n'.join(ctag_matches)
    # Mark the skeleton as uncompressible so it is not compressed again downstream.
    return Chunk(path=chunk.path, text=ctags_skeleton, image=chunk.image, source_type=SourceTypes.UNCOMPRESSIBLE_CODE)

def compress_chunks(chunks: List[Chunk]) -> List[Chunk]:
    # Imported lazily because llmlingua loads large model weights.
    from llmlingua import PromptCompressor
    llm_lingua = PromptCompressor(model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank", use_llmlingua2=True)
    new_chunks = []
    for chunk in chunks:
        new_chunk = None
        if chunk is None or chunk.text is None:
            new_chunk = chunk
        elif chunk.source_type == SourceTypes.COMPRESSIBLE_CODE:
            extension = chunk.path.split('.')[-1]
            new_chunk = compress_with_ctags(chunk, extension=extension)
        elif chunk.source_type in {SourceTypes.PLAINTEXT, SourceTypes.PDF, SourceTypes.DOCX, SourceTypes.PPTX}:
            # Compress the text with llmlingua in fixed-size character windows,
            # keeping roughly half of each window (rate=0.5).
            new_chunk_text = ""
            WINDOW_SIZE = 500
            for i in range(0, len(chunk.text), WINDOW_SIZE):
                window_text = chunk.text[i:i+WINDOW_SIZE]
                result = llm_lingua.compress_prompt(window_text, rate=0.5)
                new_window_text = result['compressed_prompt']
                new_chunk_text += new_window_text
            new_chunk = Chunk(path=chunk.path, text=new_chunk_text, image=chunk.image, source_type=chunk.source_type)
        else:
            # if the chunk is not compressible, keep the original text
            new_chunk = chunk
        new_chunks.append(new_chunk)
    return new_chunks
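As a rough usage sketch (assuming the `Chunk` constructor and `SourceTypes` enum from `core` behave exactly as used above; the path and text are hypothetical examples), `compress_chunks` could be driven like this:

```python
from core import Chunk, SourceTypes
from compress import compress_chunks

# A plaintext chunk takes the LLMLingua windowed-compression path above.
chunk = Chunk(path="notes.txt", text="Some long document text " * 100,
              image=None, source_type=SourceTypes.PLAINTEXT)

compressed = compress_chunks([chunk])
print(compressed[0].text)  # roughly half the tokens of the original
```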