Utility belt to handle data on AWS.

Contents: Use Cases | Installation | Examples

Use Cases:

- Pandas -> Parquet (S3)
- Pandas -> CSV (S3)
- Pandas -> Glue Catalog
- Pandas -> Athena
- Pandas -> Redshift
- CSV (S3) -> Pandas
- Athena -> Pandas
- PySpark -> Redshift

Installation:

```sh
pip install awswrangler
```
Runs only on Python 3.6 and beyond.
Runs anywhere (AWS Lambda, AWS Glue, EMR, EC2, on-premises, local, etc.).
P.S. A Lambda Layer bundle and a Glue egg are available for download. Just upload them to your account and run! 🚀
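Because the same API runs inside AWS Lambda, a handler using the Layer bundle above could look like the minimal sketch below. The handler name, query, and database are hypothetical placeholders; the calls themselves are the Session and read_sql_athena shown in the examples that follow.

```python
import awswrangler

def handler(event, context):
    # Hypothetical Lambda handler: query Athena and return the row count.
    session = awswrangler.Session()
    dataframe = session.pandas.read_sql_athena(
        sql="select * from table",  # placeholder query
        database="database",        # placeholder Glue database
    )
    return {"rows": len(dataframe.index)}
```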
Examples:

Writing a Pandas DataFrame to S3 as partitioned Parquet and registering it in the Glue Catalog:

```python
session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=dataframe,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)
```
If a Glue database name is passed, all the metadata will be created in the Glue Catalog. If not, only the S3 data write will be done.
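For that S3-only case, a minimal sketch is the same call with the database argument simply omitted (the S3 path stays a placeholder here):

```python
import pandas
import awswrangler

dataframe = pandas.DataFrame({"col_name": ["a", "b"], "value": [1, 2]})  # example data

session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=dataframe,
    path="s3://...",  # no database passed, so only the S3 write happens
    partition_cols=["col_name"],
)
```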
Reading from AWS Athena to a Pandas DataFrame:

```python
session = awswrangler.Session()
dataframe = session.pandas.read_sql_athena(
    sql="select * from table",
    database="database",
)
```
Reading a CSV object from S3 to a Pandas DataFrame:

```python
session = awswrangler.Session()
dataframe = session.pandas.read_csv(path="s3://...")
```
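The use-case list also mentions Pandas -> CSV (S3). Assuming the pandas namespace exposes a to_csv writer that mirrors read_csv above (an assumption, since that call is not shown in this section), the write direction would look roughly like this:

```python
import pandas
import awswrangler

dataframe = pandas.DataFrame({"col_name": ["a", "b"], "value": [1, 2]})  # example data

session = awswrangler.Session()
session.pandas.to_csv(  # assumed counterpart of read_csv above
    dataframe=dataframe,
    path="s3://...",    # placeholder S3 path
)
```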
A typical Pandas ETL, storing both the data and its metadata in the data lake:

```python
import pandas
import awswrangler

df = pandas.read_...  # Read from anywhere

# Typical Pandas, NumPy or PyArrow transformation HERE!

session = awswrangler.Session()
session.pandas.to_parquet(  # Storing the data and metadata in the data lake
    dataframe=df,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)
```
Loading a PySpark DataFrame into Redshift:

```python
session = awswrangler.Session(spark_session=spark)
session.spark.to_redshift(
    dataframe=df,
    path="s3://...",
    connection=conn,
    schema="public",
    table="table",
    iam_role="IAM_ROLE_ARN",
    mode="append",
)
```
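The use-case list also mentions Pandas -> Redshift. Assuming the pandas namespace offers a to_redshift with parameters mirroring the Spark variant above (again an assumption, since that call is not shown in this section), loading a Pandas DataFrame would look roughly like this:

```python
import awswrangler

session = awswrangler.Session()
session.pandas.to_redshift(  # assumed Pandas counterpart of spark.to_redshift above
    dataframe=dataframe,     # an existing pandas DataFrame
    path="s3://...",         # placeholder S3 staging prefix
    connection=conn,         # existing database connection, as in the Spark example
    schema="public",
    table="table",
    iam_role="IAM_ROLE_ARN",
    mode="append",
)
```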