
Commit

project refactor
emcf committed Mar 23, 2024
1 parent 77fa6a3 commit 0a29b6e
Showing 30 changed files with 644 additions and 770 deletions.
160 changes: 0 additions & 160 deletions .gitignore

This file was deleted.

21 changes: 0 additions & 21 deletions LICENSE

This file was deleted.

91 changes: 41 additions & 50 deletions README.md
@@ -1,52 +1,45 @@
# <img src="https://rpnutzemutbrumczwvue.supabase.co/storage/v1/object/public/assets/pipeline_small%20(1).png" alt="Pipeline Illustration" style="width:96px; height:72px; vertical-align:middle;"> The Pipe

-[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=KHD1PDOSHF)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg)
+[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg)

-The pipe is a tool for feeding complex real-world data into large language models. It is built on top of dozens of carefully-crafted heuristics to create sensible representations from a variety of sources, including code projects, scientific papers, web pages, github repos, data files, databases, and more.
+The pipe is a multimodal-first tool for feeding real-world data into large language models. It is built on top of dozens of carefully-crafted heuristics to create sensible representations from complex sources, including files, directories, scientific papers, web pages, GitHub repos, and more.

-## 🛠️ How it works
+## Features 🌟

-The pipe is accessible from the command line or from [Python](https://www.python.org/downloads/). The input source is either a file path, a URL, or a directory (or zip file) path. The pipe will extract information from the source and process it for downstream use with [LLMs](https://en.wikipedia.org/wiki/Large_language_model). The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from [gemma-7b](https://huggingface.co/google/gemma-7b) to [GPT-4](https://openai.com/gpt-4). It uses a variety of heuristics to optimize the output for LLMs, including [AI-native PDF extraction](https://docs.mathpix.com/#process-a-pdf), [efficient token compression](https://arxiv.org/abs/2403.12968), [code compression with Ctags](https://en.wikipedia.org/wiki/Ctags), automatic [image encoding](https://en.wikipedia.org/wiki/Base64), reranking for [LITM](https://arxiv.org/abs/2307.03172) effects, and more, all pre-built to work out-of-the-box.
+- Prepare prompts from dozens of complex file types 📄
+- Visual document extraction for complex PDFs, markdown, etc. 🧠
+- Outputs optimized for multimodal LLMs 🖼️ + 💬
+- Multi-threaded ⚡️
+- Works with missing file extensions and in-memory data streams 💾
+- Works with directories, URLs, git repos, and more 🌐
+## How it works 🛠️

-## 📂 Supported input sources
+The pipe is accessible from the command line or from [Python](https://www.python.org/downloads/). The input source is either a file path, a URL, or a directory (or zip file) path. The pipe will extract information from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model), [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer), or [vision-language models](https://arxiv.org/abs/2304.00685). The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from [gemma-7b](https://huggingface.co/google/gemma-7b) to [GPT-4](https://openai.com/gpt-4). It uses a variety of heuristics for optimal performance with vision-language models, including AI [filetype detection](https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html), AI [PDF extraction](https://mathpix.com), efficient [token compression](https://arxiv.org/abs/2403.12968), automatic [image encoding](https://en.wikipedia.org/wiki/Base64), [reranking](https://arxiv.org/abs/2310.06839) for [lost-in-the-middle](https://arxiv.org/abs/2307.03172) effects, and more, all pre-built to work out-of-the-box.
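As a rough illustration of the image-encoding step mentioned above (a generic sketch, not thepipe's actual code), base64-encoding an image for a multimodal prompt looks like this; the file path is a hypothetical placeholder:

```python
import base64

# Read an image and embed it as a base64 data URL, a format many
# vision-language model APIs accept for image inputs.
with open("figure.png", "rb") as f:  # hypothetical placeholder path
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/png;base64,{encoded}"
```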

-### Sources
-- 📁 **Project directories** (any directory)
-- 🗂️ **Zip / Tarballs** (`.zip`, `.tar`, `.gz`)
-- 🔗 **URLs** (any input containing `http` or `www`, or `.url` shortcut file)
-- 🐙 **GitHub Repositories** (any input containing `github.com`)
-- 🗃️ **Business Database** (any input containing `supabase`)
+## Getting Started 🚀

-### Documents
-- 📜 **Code files** (`.py`, `.cpp`, `.ts`, `.css`, `.h`, etc.)
-- 📚 **PDFs** (`.pdf` or any input containing `arxiv`; extract images/tables/math with `--mathpix`)
-- 🖼️ **Images** (`.png`, `.jpg`, `.jpeg`, `.gif`)
-- 📊 **Spreadsheets** (`.csv`, `.xlsx`)
-- 📜 **Configuration files** (`.yaml`, `.json`, `.xml`, `.ini`, `.xaml`, `.cfg`, `.config`)
-- 📓 **IPython notebooks** (`.ipynb`)
-- 📝 **Word documents** (`.docx`)
-- 📊 **Powerpoint presentations** (`.pptx`)



-## 🚀 Getting Started
+To use The Pipe, simply clone this repository and install the requirements:
+```bash
+git clone https://github.com/emcf/thepipe
+pip install -r requirements.txt
+```

-To use The Pipe, simply clone this repository and run
+To use The Pipe from the command line, simply run

```bash
-python thepipe.py --source /path/to/directory --output prompt.txt
+python thepipe.py path/to/directory --limit 100000
```

-This command will process all supported files within the specified directory, compressing the information over the token limit if necessary, and outputting the result to `output.txt`.
+This command will process all supported files within the specified directory, compressing any information over the token limit if necessary, and outputting the result to a folder.

Arguments are:
-- `--source` (required): The input source, can be a file path, a URL, or a directory path.
-- `--output` (required): The output file path.
-- `--limit` (optional): The token limit for the output, defaults to 64K.
+- The input source (required): can be a file path, a URL, or a directory path.
+- `--match` (optional): Glob pattern to match files in the directory.
+- `--limit` (optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed.
+- `--mathpix` (optional): Extract images, tables, and math from PDFs using [Mathpix](https://docs.mathpix.com/#process-a-pdf).
+- `--text` (optional): Output text scraped from images instead of [base64](https://en.wikipedia.org/wiki/Base64)-encoded images.
+- `--text_only` (optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.
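For instance, a hypothetical invocation combining these arguments (the path and values below are placeholders) might look like:

```bash
python thepipe.py path/to/project --match "*.py" --limit 50000 --text_only
```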

-Alternatively, to use the pipe from Python:
+To use the pipe from Python:

```python
import openai
@@ -60,21 +53,19 @@
response_content = response.choices[0].message.content
print(response_content)
```

-## ⚖️ Heuristics
-
-To optimize the output for downstream tasks, the pipe uses a variety of assumptions and heuristics to extract the most important information from the input data, and to format it. Here are some of the most important ones:
-- **Optional [Mathpix](https://docs.mathpix.com/#process-a-pdf) PDF extraction**: Extracts images, tables, and math from PDFs.
-- **[Ctags](https://en.wikipedia.org/wiki/Ctags) token compression**: When the output prompt is too large, automatically extracts essential code structure (functions, classes, variables, types) and throws away the rest. Useful for high-quality coding under strict token constraints.
-- **[LLMLingua](https://arxiv.org/abs/2403.12968) token compression**: When the output prompt is too large, automatically extracts essential tokens; can improve downstream performance by removing noise.
-- **[LITM](https://arxiv.org/abs/2307.03172) reranking**: Reformats the output to minimize the impact of the "lost in the middle" effect to improve downstream performance with LLMs.
-- **Image resizing and [base64](https://en.wikipedia.org/wiki/Base64) encoding**: Maximum image dimensions are clipped to 512 pixels and encoded in base64 for easy downstream use with vision language models. Can alternatively output a text description of all images with `--text`, or text scraped from all images with `--scrape`.
-- **[Unstructured](https://github.com/Unstructured-IO/unstructured) extraction** from unknown sources.
-- **Ignore rules**: Sensible out-of-the-box ignore rules for common directories and files that are not useful for downstream tasks, such as `node_modules`, `__pycache__`, `.gitignore`, etc. Feel free to customize these for your own use case by modifying `FILES_TO_IGNORE` in `config.py`.

## License 📜

Distributed under the MIT License. See `LICENSE` for more information.

----

-Made with ❤️ and Python.
+## Supported File Types 📚

+| Source Type | Input Types | Token Compression 🗜️ | Image Extraction 👁️ | Notes 📌 |
+|---------------------------------------|------------------------------------------|-------------------|------------------|---------------------------------------------------------|
+| Directory | Any `/path/to/directory` | Yes | No | Extracts from all files in directory, supports match and ignore patterns |
+| Code | `.h`, `.json`, `.js`, `.jsx`, `.ts`, `.tsx`, `.cs`, `.java`, `.html`, `.css`, `.ini`, `.xml`, `.yaml`, `.xaml`, `.sh`, `.c`, `.cpp`, `.py` | Varies (ctags) | No | Combines all code files. `.c`, `.cpp`, `.py` are compressible with ctags, others are not |
+| Plaintext | `.txt`, `.md`, `.rtf` | Yes | No | Regular text files |
+| PDF | `.pdf` | Yes | Yes (Optional) | Extracts text and optionally images; can use Mathpix for enhanced extraction |
+| Image | `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg` | No | Yes | Extracts images and can convert to text using OCR |
+| Spreadsheet | `.csv`, `.xls`, `.xlsx` | No | No | Extracts data from spreadsheets; converts to text representation |
+| Jupyter Notebook | `.ipynb` | Yes | No | Extracts content from Jupyter notebooks |
+| Microsoft Word Document | `.docx` | Yes | No | Extracts text from Word documents |
+| Microsoft PowerPoint Presentation | `.pptx` | Yes | No | Extracts text from PowerPoint presentations |
+| Website | URLs (http, https, www, ftp) | No | Yes (Optional) | Extracts content from web pages; text-only extraction available |
+| GitHub Repository | GitHub repo URLs | Yes | No | Extracts from GitHub repositories; supports branch specification |
+| ZIP File | `.zip` | Yes | No | Extracts contents of ZIP files; supports nested directory extraction |
84 changes: 84 additions & 0 deletions compress.py
@@ -0,0 +1,84 @@
import json
import os
import shutil
import subprocess
import tempfile
from typing import List

from core import Chunk, SourceTypes

CTAGS_LANGUAGES = {'py': "Python", 'ts': "Typescript", "tsx": "Typescript", "cpp": "C++", "c": "C"}

def compress_with_ctags(chunk: Chunk, extension: str) -> Chunk:
    if chunk.text is None:
        # Nothing to compress; pass the chunk through, marked as uncompressible.
        return Chunk(path=chunk.path, text=chunk.text, image=chunk.image, source_type=SourceTypes.UNCOMPRESSIBLE_CODE)
    language = CTAGS_LANGUAGES[extension]  # callers must pass a supported extension
    tmp_dir = tempfile.mkdtemp()
    try:
        file_path = os.path.join(tmp_dir, "tempfile") + '.' + extension
        with open(file_path, 'w', encoding='utf-8') as tmp_file:
            tmp_file.write(chunk.text)
        # need custom options for ctags to work with typescript
        if language == "Typescript":
            cmd = [
                "./ctags",
                f"--languages={language}",
                "--output-format=json",
                "--options=./ts.ctags",
                "-f", "-",
                file_path
            ]
        else:
            cmd = [
                "./ctags",
                f"--languages={language}",
                "--output-format=json",
                "-f", "-",
                file_path
            ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise Exception(f"Error running ctags: {result.stderr}")
        # write raw ctags output to a file for inspection (deleted below)
        with open('ctags_output.json', 'w', encoding='utf-8') as f:
            f.write(result.stdout)
        # Process the JSON output: each line is one tag whose 'pattern' field
        # has the form "/^<source line>$/"; stripping two characters from each
        # end recovers the original source line for that definition.
        ctag_matches = []
        for line in result.stdout.strip().splitlines():
            tag = json.loads(line)
            if 'pattern' in tag:
                pattern_without_regex = tag['pattern'][2:-2]
                ctag_matches.append(pattern_without_regex)
    finally:
        shutil.rmtree(tmp_dir)
        # remove the json file
        if os.path.exists('ctags_output.json'):
            os.remove('ctags_output.json')
    ctags_skeleton = '\n'.join(ctag_matches)
    # Mark the skeleton as uncompressible so it is not compressed again downstream.
    return Chunk(path=chunk.path, text=ctags_skeleton, image=chunk.image, source_type=SourceTypes.UNCOMPRESSIBLE_CODE)

def compress_chunks(chunks: List[Chunk]) -> List[Chunk]:
    # Imported lazily because llmlingua loads large model weights.
    from llmlingua import PromptCompressor
    llm_lingua = PromptCompressor(model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank", use_llmlingua2=True)
    new_chunks = []
    for chunk in chunks:
        new_chunk = None
        if chunk is None or chunk.text is None:
            new_chunk = chunk
        elif chunk.source_type == SourceTypes.COMPRESSIBLE_CODE:
            extension = chunk.path.split('.')[-1]
            new_chunk = compress_with_ctags(chunk, extension=extension)
        elif chunk.source_type in {SourceTypes.PLAINTEXT, SourceTypes.PDF, SourceTypes.DOCX, SourceTypes.PPTX}:
            # Compress the text with llmlingua in fixed-size character windows,
            # keeping roughly half of each window (rate=0.5).
            new_chunk_text = ""
            WINDOW_SIZE = 500
            for i in range(0, len(chunk.text), WINDOW_SIZE):
                window_text = chunk.text[i:i+WINDOW_SIZE]
                result = llm_lingua.compress_prompt(window_text, rate=0.5)
                new_window_text = result['compressed_prompt']
                new_chunk_text += new_window_text
            new_chunk = Chunk(path=chunk.path, text=new_chunk_text, image=chunk.image, source_type=chunk.source_type)
        else:
            # if the chunk is not compressible, keep the original text
            new_chunk = chunk
        new_chunks.append(new_chunk)
    return new_chunks
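As a rough usage sketch (assuming the `Chunk` constructor and `SourceTypes` enum from `core` behave exactly as used above; the path and text are hypothetical examples), `compress_chunks` could be driven like this:

```python
from core import Chunk, SourceTypes
from compress import compress_chunks

# A plaintext chunk takes the LLMLingua windowed-compression path above.
chunk = Chunk(path="notes.txt", text="Some long document text " * 100,
              image=None, source_type=SourceTypes.PLAINTEXT)

compressed = compress_chunks([chunk])
print(compressed[0].text)  # roughly half the tokens of the original
```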