Analyzed 50 datasets of Amazon reviews written by members of the paid Amazon Vine program. Implemented ETL using Amazon Web Services, PySpark and Postgres SQL.

DougUOT/Amazon_Vine_Analysis

Amazon_Vine_Analysis

Overview of Project

The client, $ellby, is about to release a large catalog of products on a leading retail website. They want to know how reviews of their products compare with reviews of similar products sold by their competitors. They are also interested in enrolling in a program that gives free products to select reviewers, but they want to know whether it is worth the cost. There are thousands of reviews, and they are written as text rather than numbers, so they must be transformed before they can be analyzed.

In this project, we dig into what the industry means by big data. We explore the big data ecosystem, including Hadoop, the four V's (Volume, Velocity, Variety, and Veracity), MapReduce, Google Colaboratory, and Spark. We also cover natural language processing (NLP) and close with an introduction to cloud services. Cloud services let us store large amounts of data at remote locations rather than locally, alongside many other services, which improves scalability and performance. We use one of the most popular cloud platforms available: Amazon Web Services (AWS).

The project has two objectives:

  1. Objective 1: Perform ETL on Amazon Product Reviews
  2. Objective 2: Determine Bias of Vine Reviews

Resources

Results & Code

Objective 1: Perform ETL on Amazon Product Reviews

  • From the Amazon Review datasets, pick a dataset to analyze.

We selected the Video Games dataset (https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz).

  • Install the packages for the libraries we want to use. We ran the project in cloud notebooks with PySpark.

  • Load the Amazon data into a Spark DataFrame.
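The loading step can be sketched as follows. This is a minimal sketch, not the notebook's exact code: the function name `load_reviews` and the app name are hypothetical, and it assumes the dataset URL is still reachable and that PySpark is installed in the notebook environment.

```python
# URL of the Video Games dataset selected above.
REVIEWS_URL = (
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/"
    "amazon_reviews_us_Video_Games_v1_00.tsv.gz"
)
# File name that Spark uses once the file has been downloaded.
FILE_NAME = REVIEWS_URL.rsplit("/", 1)[-1]


def load_reviews(app_name="VineETL"):
    """Download the TSV file and read it into a Spark DataFrame.

    `app_name` is a hypothetical label; any string works.
    """
    # Imported lazily so this module still loads where PySpark is missing.
    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    # Ship the remote file to the Spark nodes, then read the local copy.
    spark.sparkContext.addFile(REVIEWS_URL)
    df = spark.read.csv(
        SparkFiles.get(FILE_NAME),
        sep="\t",          # the dataset is tab-separated
        header=True,       # first row holds the column names
        inferSchema=True,  # let Spark guess numeric column types
    )
    return spark, df
```

Calling `spark, df = load_reviews()` and then `df.show(5)` is a quick way to confirm that the columns were parsed as expected.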

  • Create DataFrames to match the tables. Read the review dataset into a DataFrame, then create the customers_table DataFrame.

  • Use the groupby() function on the customer_id column of the DataFrame, then chain the agg() function to count the customer ids. This creates a new column, count(customer_id). Rename that column with the withColumnRenamed() function so it matches the schema of the customers_table in pgAdmin.
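A minimal sketch of the groupby/agg/rename chain is shown below. The target column name `customer_count` is an assumption about the customers_table schema, which is not shown in this README; substitute whatever name the table in pgAdmin uses.

```python
def build_customers_table(df):
    """Count reviews per customer and rename the count column.

    `df` is the Spark DataFrame holding the raw review data.
    `customer_count` is an assumed column name for the Postgres schema.
    """
    return (
        df.groupby("customer_id")
        # The dict form of agg() produces a column named count(customer_id).
        .agg({"customer_id": "count"})
        .withColumnRenamed("count(customer_id)", "customer_count")
    )
```

The dict form of `agg()` is used here only so the sketch needs no imports; `agg(count("customer_id"))` with `count` imported from `pyspark.sql.functions` produces the same column name.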

  • Create the products_table DataFrame.

  • To create the products_table, use the select() function to select product_id and product_title, then drop duplicates with the drop_duplicates() function to keep only unique product_ids.
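One way to sketch this step is shown below. Passing `["product_id"]` as the subset is an assumption made here so that each product_id appears exactly once even if the same product has more than one title; calling `drop_duplicates()` with no arguments would instead keep every distinct (product_id, product_title) pair.

```python
def build_products_table(df):
    """Keep one row per product_id with its title.

    `df` is the Spark DataFrame holding the raw review data.
    """
    return (
        df.select("product_id", "product_title")
        # Deduplicate on product_id only (an assumption, see the note above).
        .drop_duplicates(["product_id"])
    )
```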

  • Create the review_id_table DataFrame.

  • To create the review_id_table, use the select() function to select the columns that are in the review_id_table in pgAdmin, and convert the review_date column to a date.

  • Create the vine_table DataFrame.
  • To create the vine_table, use the select() function to select only the columns that are in the vine_table in pgAdmin.
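The two select steps might look like the sketch below. The column lists are assumptions, since the table schemas are not reproduced in this README: only `review_date`, `star_rating`, `helpful_votes`, `total_votes`, and `vine` are named in the text, and the remaining names are guesses at the schema that should be checked against the tables in pgAdmin.

```python
# Assumed column lists; verify them against the table definitions in pgAdmin.
REVIEW_ID_COLUMNS = [
    "review_id", "customer_id", "product_id", "product_parent", "review_date",
]
VINE_COLUMNS = [
    "review_id", "star_rating", "helpful_votes", "total_votes",
    "vine", "verified_purchase",
]


def build_review_id_table(df):
    """Select the review_id_table columns and cast review_date to a date."""
    selected = df.select(*REVIEW_ID_COLUMNS)
    # Column.cast("date") converts a yyyy-MM-dd string column to DateType.
    return selected.withColumn(
        "review_date", selected["review_date"].cast("date")
    )


def build_vine_table(df):
    """Select only the columns stored in the vine_table."""
    return df.select(*VINE_COLUMNS)
```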

  • Load the DataFrames into pgAdmin
  • Connect to your AWS RDS instance and load each DataFrame into its corresponding table. Then, in pgAdmin, run a query to check that the tables have been populated.
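A sketch of the load step over JDBC follows. The helper names, the `postgres` user, the `PGPASSWORD` environment variable, and the endpoint and database placeholders are all hypothetical; replace them with the values for your own RDS instance. It also assumes the PostgreSQL JDBC driver jar is already on Spark's classpath.

```python
import os


def jdbc_url(host, database, port=5432):
    """Build a PostgreSQL JDBC URL for the RDS endpoint."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def connection_properties(user, password):
    """Connection settings passed to Spark's JDBC writer."""
    return {"user": user, "password": password, "driver": "org.postgresql.Driver"}


def write_table(df, table, url, properties, mode="append"):
    """Append a Spark DataFrame to an existing Postgres table."""
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)


# Placeholder endpoint and database name (hypothetical values).
URL = jdbc_url("<your-endpoint>.rds.amazonaws.com", "<your-database>")
# Read the password from the environment instead of hard-coding it.
PROPERTIES = connection_properties("postgres", os.environ.get("PGPASSWORD", ""))

# Example call, given a customers DataFrame named customers_df:
# write_table(customers_df, "customers_table", URL, PROPERTIES)
```

Using `mode="append"` keeps the table definitions created in pgAdmin intact, whereas `mode="overwrite"` would drop and recreate the table from the DataFrame schema.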

  • Customer table in pgAdmin (SQL)

  • Product table in pgAdmin (SQL)

  • Review_ID table in pgAdmin (SQL)

  • Vine table in pgAdmin (SQL)

Objective 2: Determine Bias of Vine Reviews

  • Create a DataFrame from the vine_table data using PySpark.

  • Filter the data to keep only reviews with 20 or more total votes.

  • Filter that result to keep only reviews where helpful_votes is 50% or more of total_votes.

  • Filter the helpful reviews to create a DataFrame of Vine (paid) reviews.

  • Filter the helpful reviews to create a DataFrame of non-Vine (unpaid) reviews.

  • Calculate the total number of reviews, the number of 5-star reviews, and the percentage of 5-star reviews for both Vine and non-Vine reviews.
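The filtering steps above can be sketched with SQL-style filter strings, which PySpark's `DataFrame.filter()` accepts. The function names are hypothetical, and the sketch assumes the `vine` column holds the flags `'Y'` and `'N'` and that `star_rating` is numeric.

```python
def helpful_reviews(vine_df):
    """Keep reviews with at least 20 total votes and at least 50% helpful votes."""
    return (
        vine_df.filter("total_votes >= 20")
        # total_votes is at least 20 here, so the division is safe.
        .filter("helpful_votes / total_votes >= 0.5")
    )


def split_by_vine(df):
    """Return (paid, unpaid) DataFrames based on the vine flag."""
    return df.filter("vine = 'Y'"), df.filter("vine = 'N'")


def five_star_summary(df):
    """Return (total reviews, 5-star reviews, percentage of 5-star reviews)."""
    total = df.count()
    five_star = df.filter("star_rating = 5").count()
    pct = 100.0 * five_star / total if total else 0.0
    return total, five_star, pct
```

Given the vine_table DataFrame as `vine_df`, the summaries come from `paid, unpaid = split_by_vine(helpful_reviews(vine_df))` followed by `five_star_summary(paid)` and `five_star_summary(unpaid)`.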

Vine Program Overview

SUMMARY

  1. Because the number of Vine reviews is so much smaller than the number of non-Vine reviews, we cannot conclude that there is a strong relationship between Vine membership and ratings from these numbers. A more in-depth and detailed analysis would be necessary to reach a firmer conclusion.
  2. On the other hand, Vine members wrote very few reviews (94), and only 48 of those were 5-star reviews. The company $ellby may distribute free products to selected reviewers, as it appears to be worth the cost.
  3. According to the results, Vine (paid) reviews make up just 0.2% (94 reviews) of the video game reviews analyzed, while the 40,471 non-Vine (unpaid) reviews make up 99.8%.
  4. Likewise, Vine (paid) reviews account for just 0.3% (48 reviews) of all 5-star reviews for video game products, although those 48 reviews are 51.06% of all Vine reviews. Non-Vine reviews account for 15,663 5-star reviews, or 99.7% of the total number of 5-star reviews.
  5. More than 50% of Vine reviews (48 of 94) gave video game products 5 stars, compared with 38.70% of non-Vine reviews (15,663 of 40,471).
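The percentages in the summary can be rechecked from the four reported counts. The short sketch below uses only those counts; the helper name `pct` is hypothetical.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Counts reported in the summary above.
vine_total, vine_five_star = 94, 48
non_vine_total, non_vine_five_star = 40471, 15663

all_reviews = vine_total + non_vine_total
all_five_star = vine_five_star + non_vine_five_star

# Share of 5-star reviews within each group.
vine_five_star_pct = pct(vine_five_star, vine_total)
non_vine_five_star_pct = pct(non_vine_five_star, non_vine_total)

# Share of each group in all reviews and in all 5-star reviews.
vine_share_of_reviews = pct(vine_total, all_reviews)
vine_share_of_five_star = pct(vine_five_star, all_five_star)

print(vine_five_star_pct, non_vine_five_star_pct)
print(vine_share_of_reviews, vine_share_of_five_star)
```

The first line of output reproduces the 51.06% and 38.70% figures; the second gives the Vine shares that the summary rounds to 0.2% and 0.3%.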

RECOMMENDED ADDITIONAL ANALYSIS

  1. Repeat the analysis by video game platform, using keywords for the main consoles such as PS3, PS4, Xbox, and Nintendo.
  2. Extend the analysis to reviews rated three to five stars.
  3. Add PC game reviews.
