A powerful command-line tool and API for managing and deploying Spark jobs on Amazon EMR clusters. EMRRunner simplifies the process of submitting and managing Spark jobs while handling all the necessary environment setup.
- Command-line interface for quick job submission
- RESTful API for programmatic access
- Support for both client and cluster deploy modes
- Automatic S3 synchronization of job files
- Configurable job parameters
- Easy dependency management
- Bootstrap action support for cluster setup
- Python 3.9+
- AWS Account with EMR access
- Configured AWS credentials
- Active EMR cluster
pip install emrrunner
# Clone the repository
git clone https://github.com/yourusername/EMRRunner.git
cd EMRRunner
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
# Install the package
pip install -e .
Create a .env file in the project root with your AWS configuration:
Note: Export these variables in your terminal before running:
AWS_ACCESS_KEY=your_access_key
AWS_SECRET_KEY=your_secret_key
AWS_REGION=your_region
EMR_CLUSTER_ID=your_cluster_id
S3_PATH=s3://your-bucket/path
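A missing variable usually surfaces only when a job is submitted, so it can help to check the configuration up front. The following is a minimal sketch, assuming the five variable names listed above; `load_emr_config` is a hypothetical helper for illustration and is not part of EMRRunner's documented API.

```python
import os

# Variable names taken from the configuration example above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "AWS_REGION",
    "EMR_CLUSTER_ID",
    "S3_PATH",
)


def load_emr_config(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    Hypothetical helper: EMRRunner's own config module may behave differently.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    # Assumption: S3_PATH is always a full s3:// URI, as in the example above.
    if not config["S3_PATH"].startswith("s3://"):
        raise ValueError("S3_PATH must start with s3://")
    return config
```

Calling `load_emr_config()` with no arguments reads from the current process environment, so it reflects whatever was exported in the terminal.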
For EMR cluster setup with required dependencies, create a bootstrap script (bootstrap.sh):
#!/bin/bash -xe
# Example structure of a bootstrap script
# Install system dependencies first so they are available to the virtual environment
sudo yum install -y python3-pip
sudo yum install -y [your-system-packages]
# Create and activate virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install Python packages into the virtual environment
pip3 install [your-required-packages]
deactivate
Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
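If the cluster is created from Python, one way to reference the uploaded script is the `BootstrapActions` parameter of boto3's `run_job_flow`. The sketch below only builds that entry and makes no AWS calls; `bootstrap_action` and the bucket path in the usage note are hypothetical, and EMRRunner itself may wire bootstrap actions differently.

```python
def bootstrap_action(script_uri, name="Install dependencies", args=None):
    """Build one entry for the BootstrapActions list accepted by run_job_flow.

    Hypothetical helper for illustration only.
    """
    if not script_uri.startswith("s3://"):
        raise ValueError("Bootstrap scripts must be referenced by an s3:// URI")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_uri,
            # Optional arguments passed to the script on each node.
            "Args": list(args or []),
        },
    }
```

For example, `bootstrap_action("s3://your-bucket/bootstrap/bootstrap.sh")` yields a dict that can be placed in the `BootstrapActions=[...]` list when creating a cluster.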
EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│   ├── __init__.py
│   ├── cli.py              # Command-line interface
│   ├── config.py           # Configuration management
│   ├── emr_client.py       # EMR interaction logic
│   ├── emr_job_api.py      # Flask API endpoints
│   ├── run_api.py          # API server runner
│   └── schema.py           # Request/Response schemas
├── bootstrap/
│   └── bootstrap.sh        # EMR bootstrap script
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_emr_job_api.py
│   └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py
The S3_PATH in your configuration should point to a bucket with the following structure:
s3://your-bucket/
├── jobs/
│   ├── job1/
│   │   ├── dependencies.py   # Shared functions and utilities
│   │   └── job.py            # Main job execution script
│   └── job2/
│       ├── dependencies.py
│       └── job.py
└── common/
    └── shared_utils.py       # Cross-job shared utilities
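The layout above implies a predictable location for each job's files. As a rough sketch, assuming the `jobs/` and `common/` prefixes sit directly under S3_PATH, the URIs for a job could be derived as follows; `job_file_uris` is a hypothetical name and not something EMRRunner is known to expose.

```python
def job_file_uris(s3_path, job_name):
    """Derive the S3 URIs for one job from the bucket layout shown above.

    Hypothetical helper; assumes jobs/ and common/ live directly under s3_path.
    """
    root = s3_path.rstrip("/")
    job_root = f"{root}/jobs/{job_name}"
    return {
        "job": f"{job_root}/job.py",
        "dependencies": f"{job_root}/dependencies.py",
        "shared_utils": f"{root}/common/shared_utils.py",
    }
```

With `s3_path="s3://your-bucket/"` and `job_name="job1"`, the `"job"` entry resolves to `s3://your-bucket/jobs/job1/job.py`.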
Each job in the S3 bucket follows a standard structure:
- dependencies.py
  - Contains reusable functions and utilities specific to the job
  - Example functions:

def process_data(df):
    # Data processing logic
    pass


def validate_input(data):
    # Input validation logic
    pass


def transform_output(result):
    # Output transformation logic
    pass

- job.py
  - Main execution script that uses functions from dependencies.py
  - Standard structure:

from pyspark.sql import SparkSession

from dependencies import process_data, validate_input, transform_output


def main():
    # Create (or reuse) the Spark session for this job
    spark = SparkSession.builder.appName("job").getOrCreate()

    # 1. Read input data
    input_data = spark.read.parquet("s3://input-path")

    # 2. Validate input
    validate_input(input_data)

    # 3. Process data
    processed_data = process_data(input_data)

    # 4. Transform output
    final_output = transform_output(processed_data)

    # 5. Write results
    final_output.write.parquet("s3://output-path")


if __name__ == "__main__":
    main()

Start a job in client mode:
emrrunner start --job job1 --step process_daily_data
Start a job in cluster mode:
emrrunner start --job job1 --step process_daily_data --deploy-mode cluster
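To make the difference between the two deploy modes concrete, here is a sketch of how such a command might translate into an EMR step that runs `spark-submit` through `command-runner.jar`. This is an assumption about a plausible mapping, not a description of EMRRunner's internals; `build_spark_step` and its parameters are hypothetical. In client mode the script generally needs to be reachable from the master node, so the sketch simply uses whatever path it is given.

```python
def build_spark_step(step_name, job_path, deps_path, deploy_mode="client"):
    """Build an EMR step definition that runs spark-submit.

    Hypothetical helper: the dict follows the shape boto3 expects for
    add_job_flow_steps, but EMRRunner may construct its steps differently.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return {
        "Name": step_name,
        # Assumption: keep the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", deploy_mode,
                # Ship dependencies.py alongside the main script.
                "--py-files", deps_path,
                job_path,
            ],
        },
    }
```

A step built this way could be submitted with boto3's `add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` would come from EMR_CLUSTER_ID.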
Start a job via API in client mode (default):
curl -X POST http://localhost:8000/api/emr/start-job \
-H "Content-Type: application/json" \
-d '{"job_name": "job1", "step": "process_daily_data"}'
Start a job via API in cluster mode:
curl -X POST http://localhost:8000/api/emr/start-job \
-H "Content-Type: application/json" \
-d '{"job_name": "job1", "step": "process_daily_data", "deploy_mode": "cluster"}'
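The same requests can be issued from Python. The sketch below mirrors the curl calls above using only the standard library; the endpoint, port, and field names are taken from those examples, while `build_start_job_request` is a hypothetical helper. It only constructs the request, so nothing is sent until the result is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_start_job_request(job_name, step, deploy_mode=None,
                            base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl examples above.

    Hypothetical helper; omitting deploy_mode leaves the API's default in place.
    """
    payload = {"job_name": job_name, "step": step}
    if deploy_mode is not None:
        payload["deploy_mode"] = deploy_mode
    return urllib.request.Request(
        f"{base_url}/api/emr/start-job",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the API server running, `urllib.request.urlopen(build_start_job_request("job1", "process_daily_data"))` would submit the job in the default client mode.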
To contribute to EMRRunner:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
- Bootstrap Actions
  - Keep bootstrap scripts modular
  - Version control your dependencies
  - Use specific package versions
  - Test bootstrap scripts locally when possible
  - Store bootstrap scripts in S3 with versioning enabled
- Job Dependencies
  - Maintain a requirements.txt for each job
  - Use virtual environments
  - Document system-level dependencies
  - Test dependencies in a clean environment
- Job Organization
  - Follow the standard structure for jobs
  - Keep dependencies.py focused and modular
  - Use clear naming conventions
  - Document all functions and modules
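The advice to use specific package versions can be checked mechanically. As a minimal sketch, the function below reports lines of a requirements.txt that are not pinned with `==`; `unpinned_requirements` is a hypothetical helper, and it deliberately ignores pip option lines (such as `-r other.txt`) and does not handle URL-based requirements.

```python
def unpinned_requirements(text):
    """Return requirement lines that do not pin an exact version with '=='.

    Hypothetical helper; a simple line-based check, not a full parser.
    """
    unpinned = []
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r base.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned
```

Running it over each job's requirements.txt before uploading gives an early warning when a dependency could drift between cluster launches.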
- Supports AWS credential management
- Validates all input parameters
- Handles bootstrap scripts securely
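To illustrate what validating input parameters might involve for a start-job request, here is a sketch in plain Python. The field names come from the API examples in this README; the specific rules and the `validate_start_job_payload` name are assumptions, and the project's own schema module may enforce different constraints.

```python
VALID_DEPLOY_MODES = ("client", "cluster")


def validate_start_job_payload(payload):
    """Return a dict of field -> error message; empty means the payload is valid.

    Hypothetical validator, not EMRRunner's actual schema.
    """
    errors = {}
    for field in ("job_name", "step"):
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            errors[field] = "must be a non-empty string"
    # Assumption: client mode is the default when deploy_mode is omitted.
    deploy_mode = payload.get("deploy_mode", "client")
    if deploy_mode not in VALID_DEPLOY_MODES:
        errors["deploy_mode"] = "must be 'client' or 'cluster'"
    return errors
```

Returning every error at once, rather than raising on the first, lets an API handler report all problems in a single response.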
This project is licensed under the MIT License - see the LICENSE.md file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
If you discover any bugs, please create an issue on GitHub with:
- Your operating system name and version
- Any details about your local setup that might be helpful in troubleshooting
- Detailed steps to reproduce the bug
Built with ❤️ using Python and AWS EMR