Add PyPI Package Manager Support #49

Open: wants to merge 16 commits into main

stevenlei
Contributor

This PR adds PyPI package manager support to CHAI. It includes data fetching, data transformation, and a few other fixes in the codebase.

Key Components

1. PyPI Fetcher (fetcher.py)

  • Implements parallel downloading of package metadata from PyPI's JSON API
  • Stores downloaded data in batches for efficient processing
  • Maintains progress tracking for resumable downloads
  • PyPI has ~600k packages as of January 2025. They are saved in batches of 100 packages per file, which yields ~6,000 JSON files with a combined size of ~13.4 GB.
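
The batching and resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in fetcher.py: the function names, the `batch_*.json` and `progress.json` file names, and the worker count are all assumptions, and the download itself is passed in as a `fetch_one` callable so the sketch stays independent of any HTTP client.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterator


def batched(names: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `names`, each with at most `size` items."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def fetch_in_batches(
    names: list[str],
    fetch_one: Callable[[str], dict],
    out_dir: Path,
    batch_size: int = 100,
    workers: int = 8,  # assumed value, not taken from the PR
) -> int:
    """Download metadata batch by batch; return how many batch files were written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    progress_path = out_dir / "progress.json"  # hypothetical file name
    done = set(json.loads(progress_path.read_text())) if progress_path.exists() else set()
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for index, batch in enumerate(batched(names, batch_size)):
            if index in done:
                continue  # this batch was finished in an earlier run
            records = list(pool.map(fetch_one, batch))  # parallel within one batch
            (out_dir / f"batch_{index:05d}.json").write_text(json.dumps(records))
            done.add(index)
            # Record progress only after the batch file exists on disk.
            progress_path.write_text(json.dumps(sorted(done)))
            written += 1
    return written
```

With 100 packages per file, ~600k names produce 600,000 / 100 = 6,000 batch files, which matches the figure above. In a real run, `fetch_one` would presumably request `https://pypi.org/pypi/<name>/json` and return the decoded body.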

2. PyPI Transformer (transformer.py)

  • Transforms PyPI JSON data into CHAI's standardized format
  • Implements data transformation for:
    • Packages
    • Licenses
    • Package versions
    • Dependencies and version constraints
    • URLs and package relationships
  • Handles special cases like:
    • Complex version constraints (>=, ~=, etc.)
    • Missing version numbers in dependencies: in this case the latest version of the dependency is looked up
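
To make the constraint handling concrete, here is a simplified sketch of splitting a `requires_dist` entry into a package name and a version constraint. It is an assumption about the general shape of the problem, not the logic in transformer.py: `parse_requirement` is a hypothetical helper, it ignores environment markers and extras, and it does not handle direct URL references (`name @ https://...`). A production implementation would more likely rely on the `packaging` library.

```python
import re
from typing import Optional

# Package name, optional [extras], then an optional specifier list that may
# or may not be wrapped in parentheses.
_REQ = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)"
    r"\s*(?:\[[^\]]*\])?"
    r"\s*\(?\s*(?P<spec>[^;)]*?)\s*\)?\s*$"
)


def parse_requirement(raw: str) -> tuple[str, Optional[str]]:
    """Return (normalized_name, constraint); constraint is None when absent."""
    requirement = raw.split(";", 1)[0]  # drop the environment marker, if any
    match = _REQ.match(requirement)
    if match is None:
        raise ValueError(f"unrecognized requirement: {raw!r}")
    # PEP 503 name normalization: runs of -, _ and . become a single dash.
    name = re.sub(r"[-_.]+", "-", match["name"]).lower()
    spec = match["spec"].replace(" ", "") or None
    return name, spec


if __name__ == "__main__":
    for entry in ["requests (>=2.0,<3) ; extra == 'test'", "Django~=4.2", "numpy"]:
        print(entry, "->", parse_requirement(entry))
```

A `None` constraint is the "missing version number" case listed above, where the latest version has to be looked up separately.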

Implementation Flow

The PyPI implementation follows a two-phase approach:

Phase 1: Data Collection

First, we download all package data from PyPI (~600k packages as of Jan 2025). This approach was chosen for several reasons:

  1. Efficient dependency resolution - Having all package data locally allows us to:
    • Resolve missing version numbers by looking up latest versions
    • Validate package relationships before insertion
  2. Resumable downloads - Progress tracking allows interrupted downloads to be resumed
  3. Batch processing - Data is stored in manageable batches for efficient processing
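
As an illustration of why local data helps with point 1, the sketch below builds a name-to-latest-version index from the downloaded batches and uses it to fill in missing versions. It assumes each batch file is a JSON list of PyPI JSON API responses (where `info.version` holds the latest release) and that files are named `batch_*.json`; both the file layout and the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def build_latest_index(batch_dir: Path) -> dict[str, str]:
    """Map lower-cased package name to the latest version found in the batches."""
    latest: dict[str, str] = {}
    for path in sorted(batch_dir.glob("batch_*.json")):  # assumed naming scheme
        for record in json.loads(path.read_text()):
            info = record.get("info") or {}
            name, version = info.get("name"), info.get("version")
            if name and version:  # skip incomplete records
                latest[name.lower()] = version
    return latest


def resolve_version(
    name: str, constraint: Optional[str], latest: dict[str, str]
) -> Optional[str]:
    """Keep an explicit constraint; otherwise fall back to the latest known version."""
    if constraint:
        return constraint
    # None means the dependency is not in the local data, which also serves as
    # a basic relationship check before insertion.
    return latest.get(name.lower())
```

Because the index lives in memory, each dependency lookup is a dictionary access rather than an extra request to PyPI.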

Phase 2: Data Processing

Once all data is downloaded, we process it in the following order:

  1. Package Insertion
  2. URL Processing
  3. Version Processing
  4. Dependency Processing
  5. Load History
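
The ordering above can be expressed as a small driver that runs one handler per step and stops at the first failure. This is only a sketch: the step keys and `run_pipeline` are hypothetical names, and the assumption that later steps depend on rows written by earlier ones (for example, versions referring to packages) is inferred from the order, not stated in the PR.

```python
from typing import Callable, Sequence

# Step keys mirror the list above; the names themselves are illustrative.
STEPS = ("packages", "urls", "versions", "dependencies", "load_history")


def run_pipeline(
    handlers: dict[str, Callable[[], None]],
    steps: Sequence[str] = STEPS,
) -> list[str]:
    """Run each step's handler in order and return the steps that completed."""
    completed: list[str] = []
    for step in steps:
        handlers[step]()  # an exception here aborts all later steps
        completed.append(step)
    return completed
```

Keeping the order in one tuple makes it easy to see, and to change, which step runs before which.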

We do not populate the user-related tables, because the author name and email on PyPI are not tied to GitHub accounts.

Reference Screenshots

[Four screenshots and one image, captured 2025-01-04 between 10:12 PM and 10:14 PM]

Other Changes

Future Improvements

  1. Implement re-download logic for package updates
  2. Enhance version comparison for dependency resolution
  3. Optimize batch processing for larger datasets

Resolved review threads:

  • core/db.py (2 threads, outdated)
  • core/models/__init__.py (1 thread, outdated)
  • docker-compose.yml (1 thread)
  • package_managers/pypi/fetcher.py (1 thread)
  • package_managers/pypi/transformer.py (5 threads)

@stevenlei
Contributor Author

Thanks for the suggestions, @snosratiershad. I have made the changes to the code.

2 participants