Amazon_Vine_Analysis

Analyzed 50 datasets of Amazon reviews written by members of the paid Amazon Vine program. Implemented ETL using Amazon Web Services, PySpark, and PostgreSQL.

Overview of Project

The client, $ellby, is about to release a large catalog of products on a leading retail website. They want to know how the reviews of their products compare to the reviews of similar products sold by their competitors. They're also interested in enrolling in a program that gives out free products to select reviewers, but they want to know if it's worth the cost. There are thousands of reviews, and because they are written as text rather than numbers, they need to be converted into a numerical form before they can be analyzed.

In this project, we'll dig into what the industry means by big data. We'll explore the big data ecosystem, including Hadoop, the four V's (Volume, Velocity, Variety, and Veracity), MapReduce, Google Colaboratory, and Spark. We'll also cover natural language processing (NLP). We'll close with an introduction to cloud services, which, among many other things, let us store large amounts of data at remote locations rather than locally. This allows for more scalability and performance. We'll use one of the most widely used cloud platforms: Amazon Web Services (AWS).

The goals for this project are:

Objective 1: Perform ETL on Amazon Product Reviews
Objective 2: Determine Bias of Vine Reviews

Resources

Data Source: Amazon_Reviews_ETL.ipynb and Amazon_Vine_Analysis.ipynb. The SQL schema is available in challenge_schema.sql. The index of Amazon datasets is available at https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
Software & Data Tools: Python 3.8.8, Visual Studio Code 1.64.2, PySpark 3.0.3 in a Google Colab notebook, AWS RDS (PostgreSQL), and pgAdmin 5.7

Results & Code

Objective 1: Perform ETL on Amazon Product Reviews

From the Amazon Review datasets, pick a dataset that you would like to analyze.

We selected the Video Games dataset (https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz).

Install the packages for the libraries we want to use. The notebooks run in the cloud with PySpark.
Load the Amazon Data into a Spark DataFrame
Create DataFrames to match the tables. Read in the review dataset as a DataFrame, then create the customers_table DataFrame.
Use the groupby() function on the customer_id column of the DataFrame. Count all the customer IDs by chaining the agg() function to groupby(). This creates a new column, count(customer_id). Rename that column with the withColumnRenamed() function so it matches the schema for the customers_table in pgAdmin.
Create the products_table DataFrame
To create the products_table, use the select() function to select the product_id and product_title columns, then drop duplicates with the drop_duplicates() function to keep only unique product_ids.
The review_id_table DataFrame
To create the review_id_table, use the select() function to select the columns that are in the review_id_table in pgAdmin, and convert the review_date column to a date.

The vine_table DataFrame
To create the vine_table, use the select() function to select only the columns that are in the vine_table in pgAdmin

Load the DataFrames into pgAdmin
Make the connection to your AWS RDS instance. Load the DataFrames that correspond to tables in pgAdmin. In pgAdmin, run a query to check that the tables have been populated.
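
As a rough sketch of the connection step, the helper below builds the JDBC URL and properties that Spark's DataFrame.write.jdbc() accepts. The function name jdbc_settings and every value passed to it are hypothetical placeholders, not the project's real endpoint or credentials, and the default port 5432 is an assumption.

```python
from typing import Dict, Tuple


def jdbc_settings(
    host: str, database: str, user: str, password: str, port: int = 5432
) -> Tuple[str, Dict[str, str]]:
    """Build the JDBC URL and connection properties for a PostgreSQL database."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


# Placeholder values only; replace them with your own RDS endpoint and login.
url, properties = jdbc_settings(
    host="<rds-endpoint>",
    database="<database>",
    user="<user>",
    password="<password>",
)
print(url)

# With a real DataFrame and the PostgreSQL JDBC driver available to Spark,
# each table could then be written with a call along these lines:
# customers_df.write.jdbc(
#     url=url, table="customers_table", mode="append", properties=properties
# )
```

Once the writes finish, a simple SELECT COUNT(*) on each table in pgAdmin is one way to confirm that the rows arrived.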

Customer table in pgAdmin (SQL)

Product table in pgAdmin (SQL)

Review_ID table in pgAdmin (SQL)

Vine table in pgAdmin (SQL)

Objective 2: Determine Bias of Vine Reviews

The vine_table data is loaded into a DataFrame using PySpark.

The data is filtered to keep only the reviews with 20 or more total votes.

The data is filtered again to keep only the reviews where helpful_votes is equal to or greater than 50% of the total votes.

The filtered data is used to create a DataFrame of reviews written as part of the Vine program (paid).

The filtered data is used to create a DataFrame of reviews that were not part of the Vine program (unpaid).

The total number of reviews, the number of 5-star reviews, and the percentage of 5-star reviews are calculated for both the Vine and non-Vine reviews.

Vine Program Overview

SUMMARY

Because the number of Vine reviews (94) is so much smaller than the number of non-Vine reviews (40471), the results do not show a strong relationship between the two groups. A more in-depth and detailed analysis would be necessary to reach a firmer conclusion.
On the other hand, Vine members wrote very few reviews (94), and only 48 of those were 5-star reviews. Based on these results, $ellby may consider distributing free products to selected reviewers, as it appears to be worth the cost.
According to the results, just 0.2% (94 reviews) of the video game reviews analyzed came from Vine members (paid), while non-members provided 40471 reviews, or 99.8% of the total.
Likewise, just 0.3% (48 reviews) of all 5-star reviews for video game products came from Vine members (paid); those 48 reviews represent 51.06% of all Vine reviews. Non-members, on the other hand, provided 15663 5-star reviews, representing 99.7% of the total number of 5-star reviews.
More than 50% of Vine reviews for video game products (48 reviews) were 5-star reviews, compared with 38.70% of non-Vine reviews (15663 reviews).
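
The percentages quoted in this summary can be reproduced from the four counts reported above. The short check below uses only those counts; the helper name pct and the dictionary keys are made up for the example.

```python
def pct(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts taken from the summary above.
vine_total, vine_five_star = 94, 48
non_vine_total, non_vine_five_star = 40471, 15663

all_reviews = vine_total + non_vine_total            # 40565
all_five_star = vine_five_star + non_vine_five_star  # 15711

summary = {
    "vine_share_of_reviews": pct(vine_total, all_reviews),
    "non_vine_share_of_reviews": pct(non_vine_total, all_reviews),
    "vine_share_of_five_star": pct(vine_five_star, all_five_star),
    "non_vine_share_of_five_star": pct(non_vine_five_star, all_five_star),
    "vine_five_star_rate": pct(vine_five_star, vine_total),
    "non_vine_five_star_rate": pct(non_vine_five_star, non_vine_total),
}

for name, value in summary.items():
    print(f"{name}: {value}%")
```

The two five-star rates (about 51.06% for Vine and 38.70% for non-Vine) are the figures that bear most directly on the question of bias; the share figures mainly show how small the Vine sample is.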

RECOMMENDED ADDITIONAL ANALYSIS

Develop a new analysis, separated by video game category, by selecting keywords for the main video game consoles, such as PS3, PS4, Xbox, and Nintendo.
Include three- to five-star reviews in an additional analysis.
Add PC game reviews.

