This project will showcase a comprehensive analysis of the Android app market by comparing over 10,000 apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.
This project is based on an exercise in the "Data Scientist with Python Track" on DataCamp, which I am currently undertaking (Jan 2021), with various additions and changes of my own.
The dataset comprises two files:
- apps.csv: details of the applications on Google Play. There are 13 features that describe any given app in the dataset.
- user_reviews.csv: contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.
We'll start by importing the apps csv file and running a few preliminary queries to:
- drop duplicates
- check how many unique apps are in the dataset
- print an info summary of the data to check data types, null values, columns etc.
- print a random sample of 5 rows
# Read in dataset
import pandas as pd
apps_with_duplicates = pd.read_csv('datasets/apps.csv')
# Drop duplicates
apps = apps_with_duplicates.drop_duplicates()
# Print the total number of apps
print('Total number of apps in the dataset = ', len(apps.index))
# Print a concise summary of apps dataframe
print(apps.info())
# Have a look at a random sample of n rows
n = 5
apps.sample(n)
Total number of apps in the dataset = 9659
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 9658
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 9659 non-null int64
1 App 9659 non-null object
2 Category 9659 non-null object
3 Rating 8196 non-null float64
4 Reviews 9659 non-null int64
5 Size 8432 non-null float64
6 Installs 9659 non-null object
7 Type 9659 non-null object
8 Price 9659 non-null object
9 Content Rating 9659 non-null object
10 Genres 9659 non-null object
11 Last Updated 9659 non-null object
12 Current Ver 9651 non-null object
13 Android Ver 9657 non-null object
dtypes: float64(2), int64(2), object(10)
memory usage: 1.1+ MB
None
| | Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3755 | 4671 | V LIVE - Star Live App | FAMILY | 4.4 | 397147 | NaN | 10,000,000+ | Free | 0 | Teen | Entertainment | August 2, 2018 | Varies with device | Varies with device |
6046 | 7089 | CA UIM Mobile | BUSINESS | 3.2 | 54 | 7.7 | 1,000+ | Free | 0 | Everyone | Business | February 29, 2016 | 8.4.1 | 4.0 and up |
7616 | 8726 | DRAGON QUEST VI | FAMILY | 4.4 | 3358 | 17.0 | 100,000+ | Paid | $14.99 | Everyone | Role Playing | August 1, 2018 | 1.0.4 | 2.3 and up |
4018 | 4961 | Ad Blocker Turbo - Adblocker Browser | COMMUNICATION | 3.9 | 107 | 44.0 | 10,000+ | Free | 0 | Everyone | Communication | January 29, 2018 | 1.0.5 | 4.1 and up |
8447 | 9578 | EM Launcher for EMUI | PERSONALIZATION | 4.1 | 3175 | 2.3 | 500,000+ | Free | 0 | Everyone | Personalization | April 6, 2018 | 1.0.6 | 4.1 and up |
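One detail worth checking at this stage: `drop_duplicates()` only removes rows that are identical in every column, so the row count and the number of distinct app names are not guaranteed to match. A minimal sketch on made-up rows (the `toy_apps` frame and its values are hypothetical, not taken from apps.csv) shows how the two counts could be compared:

```python
import pandas as pd

# Toy stand-in for apps.csv (hypothetical rows, not from the real dataset)
toy_apps = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Gamma", "Gamma"],
    "Category": ["TOOLS", "TOOLS", "GAME", "FAMILY", "FAMILY"],
    "Reviews": [10, 10, 25, 7, 9],  # the two Gamma rows differ in Reviews
})

# drop_duplicates() removes only rows that match in every column
deduped = toy_apps.drop_duplicates()

# nunique() counts distinct app names, which can be lower than the row count
n_rows = len(deduped)
n_unique_names = deduped["App"].nunique()
print("rows:", n_rows, "unique app names:", n_unique_names)
```

If the two numbers differ on the real data, `drop_duplicates(subset='App')` would be one way to keep a single row per app name.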
The features/columns that we will focus on are Category, Rating, Installs, Size, and Price.
The info() summary from the previous step told us that the Installs and Price columns are of type 'object' and not int64 or float64 as we would expect. This is because these columns contain characters other than digits. Ideally, we would want these columns to be numeric so we can perform calculations on them for further analysis.
Let's move on to data cleaning, which would involve:
- removing special characters (, $ +) in the Installs and Price columns as those make their conversion to a numerical data type difficult.
- converting those columns to numeric data type.
# List of characters to remove
chars_to_remove = ['+', ',', '$']
# List of column names to clean
cols_to_clean = ['Installs', 'Price']
# Loop for each column
for col in cols_to_clean:
    for char in chars_to_remove:
        # regex=False treats '+' and '$' as literal characters rather than regex symbols
        apps[col] = apps[col].astype(str).str.replace(char, "", regex=False)
    # Convert col to numeric
    apps[col] = pd.to_numeric(apps[col])
#check type has been changed
apps.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 9658
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 9659 non-null int64
1 App 9659 non-null object
2 Category 9659 non-null object
3 Rating 8196 non-null float64
4 Reviews 9659 non-null int64
5 Size 8432 non-null float64
6 Installs 9659 non-null int64
7 Type 9659 non-null object
8 Price 9659 non-null float64
9 Content Rating 9659 non-null object
10 Genres 9659 non-null object
11 Last Updated 9659 non-null object
12 Current Ver 9651 non-null object
13 Android Ver 9657 non-null object
dtypes: float64(3), int64(3), object(8)
memory usage: 1.1+ MB
#check characters have been removed
apps.Installs.head(5)
0 10000
1 500000
2 5000000
3 50000000
4 100000
Name: Installs, dtype: int64
apps.Price.head(5)
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
Name: Price, dtype: float64
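As an alternative to looping over the characters one by one, the same cleaning could be done in a single pass with a regular-expression character class. This is only a sketch on made-up values (the `toy` frame is hypothetical); it assumes the only unwanted characters are `+`, `,` and `$`:

```python
import pandas as pd

# Hypothetical raw values in the same style as the Installs and Price columns
toy = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "$14.99"],
})

for col in ["Installs", "Price"]:
    # Inside a character class, '+', ',' and '$' are matched literally
    cleaned = toy[col].str.replace(r"[+,$]", "", regex=True)
    toy[col] = pd.to_numeric(cleaned)

print(toy)
```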
We'll start exploring our dataset by looking at the app categories. It would be interesting to explore the following questions:
- How many app categories are there on Google Play?
- Which category has the highest share of (active) apps in the Google Play market?
- Is any specific category dominating the Google Play market (i.e. which category has the highest number of installs)?
- Which categories have the fewest number of apps?
We will see that:
- there are 33 unique app categories present in our dataset.
- Family and Game apps have the highest market prevalence (with almost 19% and 10% of the market respectively). Tools, Business and Medical apps are also at the top.
- The categories with the fewest apps on Google Play are Beauty, Comics and Parenting (with around 0.6% of the market each).
Let's explore!
# Print the total number of unique categories
num_categories = len(apps['Category'].unique())
print('Number of categories = ', num_categories)
# Count the number of apps in each 'Category' and sort them in descending order
num_apps_in_category = apps['Category'].value_counts().sort_values(ascending = False)
print(num_apps_in_category)
Number of categories = 33
FAMILY 1832
GAME 959
TOOLS 827
BUSINESS 420
MEDICAL 395
PERSONALIZATION 376
PRODUCTIVITY 374
LIFESTYLE 369
FINANCE 345
SPORTS 325
COMMUNICATION 315
HEALTH_AND_FITNESS 288
PHOTOGRAPHY 281
NEWS_AND_MAGAZINES 254
SOCIAL 239
BOOKS_AND_REFERENCE 222
TRAVEL_AND_LOCAL 219
SHOPPING 202
DATING 171
VIDEO_PLAYERS 163
MAPS_AND_NAVIGATION 131
EDUCATION 119
FOOD_AND_DRINK 112
ENTERTAINMENT 102
AUTO_AND_VEHICLES 85
LIBRARIES_AND_DEMO 84
WEATHER 79
HOUSE_AND_HOME 74
EVENTS 64
ART_AND_DESIGN 64
PARENTING 60
COMICS 56
BEAUTY 53
Name: Category, dtype: int64
#convert series into dataframe
num_apps_in_category = num_apps_in_category.to_frame().reset_index()
#validate conversion to DF worked
num_apps_in_category.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 33 non-null object
1 Category 33 non-null int64
dtypes: int64(1), object(1)
memory usage: 656.0+ bytes
#check the index column
num_apps_in_category.index
RangeIndex(start=0, stop=33, step=1)
#rename dataframe cols
num_apps_in_category.columns = ['category', '# of apps']
# Visualise the number of apps by category
num_apps_in_category.plot.barh(x='category', y='# of apps', figsize=(15,8), title='Google Play apps per category')
#calculate percentage of overall app market per category
num_apps_in_category['% of market'] = 100 * num_apps_in_category['# of apps'] / num_apps_in_category['# of apps'].sum()
#view result
num_apps_in_category
| | category | # of apps | % of market |
---|---|---|---|
0 | FAMILY | 1832 | 18.966767 |
1 | GAME | 959 | 9.928564 |
2 | TOOLS | 827 | 8.561963 |
3 | BUSINESS | 420 | 4.348276 |
4 | MEDICAL | 395 | 4.089450 |
5 | PERSONALIZATION | 376 | 3.892743 |
6 | PRODUCTIVITY | 374 | 3.872036 |
7 | LIFESTYLE | 369 | 3.820271 |
8 | FINANCE | 345 | 3.571798 |
9 | SPORTS | 325 | 3.364738 |
10 | COMMUNICATION | 315 | 3.261207 |
11 | HEALTH_AND_FITNESS | 288 | 2.981675 |
12 | PHOTOGRAPHY | 281 | 2.909204 |
13 | NEWS_AND_MAGAZINES | 254 | 2.629672 |
14 | SOCIAL | 239 | 2.474376 |
15 | BOOKS_AND_REFERENCE | 222 | 2.298375 |
16 | TRAVEL_AND_LOCAL | 219 | 2.267315 |
17 | SHOPPING | 202 | 2.091314 |
18 | DATING | 171 | 1.770370 |
19 | VIDEO_PLAYERS | 163 | 1.687545 |
20 | MAPS_AND_NAVIGATION | 131 | 1.356248 |
21 | EDUCATION | 119 | 1.232012 |
22 | FOOD_AND_DRINK | 112 | 1.159540 |
23 | ENTERTAINMENT | 102 | 1.056010 |
24 | AUTO_AND_VEHICLES | 85 | 0.880008 |
25 | LIBRARIES_AND_DEMO | 84 | 0.869655 |
26 | WEATHER | 79 | 0.817890 |
27 | HOUSE_AND_HOME | 74 | 0.766125 |
28 | EVENTS | 64 | 0.662594 |
29 | ART_AND_DESIGN | 64 | 0.662594 |
30 | PARENTING | 60 | 0.621182 |
31 | COMICS | 56 | 0.579770 |
32 | BEAUTY | 53 | 0.548711 |
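The percentage column can also be obtained directly from `value_counts` with `normalize=True`, which returns each category's share of the rows as a fraction. A minimal sketch on a made-up Series (`toy_categories` is hypothetical):

```python
import pandas as pd

# Hypothetical category labels, one per app
toy_categories = pd.Series(["FAMILY", "FAMILY", "GAME", "TOOLS"], name="Category")

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages
share = toy_categories.value_counts(normalize=True) * 100
print(share)
```

This should give the same numbers as dividing each count by the total, just in one step.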
#reset category col as index
num_apps_in_category.set_index(['category'], inplace=True)
#plot categories by % of total apps on Google Play
num_apps_in_category.plot.pie(y='% of market',figsize=(20,10), legend=None)
Now that we know which categories offer the most and least apps we can also check which category is dominating the market, based on the total number of installs of apps within a category (as opposed to the number of apps available within a category). It would be interesting to see if this analysis brings up the same categories as above.
downloads_per_cat = apps.groupby(['Category'])['Installs'].sum().sort_values(ascending=False)
print(downloads_per_cat)
Category
GAME 13878924415
COMMUNICATION 11038276251
TOOLS 8001771915
PRODUCTIVITY 5793091369
SOCIAL 5487867902
PHOTOGRAPHY 4649147655
FAMILY 4427941505
VIDEO_PLAYERS 3926902720
TRAVEL_AND_LOCAL 2894887146
NEWS_AND_MAGAZINES 2369217760
ENTERTAINMENT 2113660000
BOOKS_AND_REFERENCE 1665969576
PERSONALIZATION 1532494782
SHOPPING 1400348785
HEALTH_AND_FITNESS 1144022512
SPORTS 1096474498
BUSINESS 697164865
LIFESTYLE 503823539
MAPS_AND_NAVIGATION 503281890
FINANCE 455348734
WEATHER 361100520
EDUCATION 352952000
FOOD_AND_DRINK 211798751
DATING 140926107
ART_AND_DESIGN 114338100
HOUSE_AND_HOME 97212461
AUTO_AND_VEHICLES 53130211
LIBRARIES_AND_DEMO 52995910
COMICS 44981150
MEDICAL 38193177
PARENTING 31521110
BEAUTY 27197050
EVENTS 15973161
Name: Installs, dtype: int64
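To compare the two views of the market side by side (number of apps versus total installs), both can be computed in one `groupby`. The sketch below uses made-up numbers, and the column names `n_apps`, `total_installs` and `installs_per_app` are my own labels, not part of the dataset:

```python
import pandas as pd

# Hypothetical apps: many small FAMILY apps, a few large GAME apps
toy = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME"],
    "Installs": [1_000, 5_000, 10_000, 1_000_000, 500_000],
})

# Named aggregation: count the apps and sum the installs per category
summary = toy.groupby("Category").agg(
    n_apps=("Installs", "size"),
    total_installs=("Installs", "sum"),
)

# Average installs per app shows whether a category is large because it has
# many apps or because its apps are widely installed
summary["installs_per_app"] = summary["total_installs"] / summary["n_apps"]
print(summary)
```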
INSIGHTS
- Interestingly, when it comes to the number of downloads, Game is the leading category (almost 14bn downloads), followed by Communication (over 11bn), Tools (8bn) and Productivity (5.8bn).
- Running a similar percentage calculation to the one above shows the market share of these categories: Game captures around 18.5% of the total installs on Google Play, while Communication apps account for almost 15%.
- Surprisingly, health and fitness apps amount to only 1.5% of the installs and dating apps to only around 0.2%.
#convert series to DF
downloads_per_cat = downloads_per_cat.to_frame()
downloads_per_cat
Category | Installs |
---|---|
GAME | 13878924415 |
COMMUNICATION | 11038276251 |
TOOLS | 8001771915 |
PRODUCTIVITY | 5793091369 |
SOCIAL | 5487867902 |
PHOTOGRAPHY | 4649147655 |
FAMILY | 4427941505 |
VIDEO_PLAYERS | 3926902720 |
TRAVEL_AND_LOCAL | 2894887146 |
NEWS_AND_MAGAZINES | 2369217760 |
ENTERTAINMENT | 2113660000 |
BOOKS_AND_REFERENCE | 1665969576 |
PERSONALIZATION | 1532494782 |
SHOPPING | 1400348785 |
HEALTH_AND_FITNESS | 1144022512 |
SPORTS | 1096474498 |
BUSINESS | 697164865 |
LIFESTYLE | 503823539 |
MAPS_AND_NAVIGATION | 503281890 |
FINANCE | 455348734 |
WEATHER | 361100520 |
EDUCATION | 352952000 |
FOOD_AND_DRINK | 211798751 |
DATING | 140926107 |
ART_AND_DESIGN | 114338100 |
HOUSE_AND_HOME | 97212461 |
AUTO_AND_VEHICLES | 53130211 |
LIBRARIES_AND_DEMO | 52995910 |
COMICS | 44981150 |
MEDICAL | 38193177 |
PARENTING | 31521110 |
BEAUTY | 27197050 |
EVENTS | 15973161 |
#calculate % of installs out of total and place in new col
downloads_per_cat['% of total installs'] = 100 * downloads_per_cat['Installs'] / downloads_per_cat['Installs'].sum()
downloads_per_cat
Category | Installs | % of total installs |
---|---|---|
GAME | 13878924415 | 18.474949 |
COMMUNICATION | 11038276251 | 14.693616 |
TOOLS | 8001771915 | 10.651570 |
PRODUCTIVITY | 5793091369 | 7.711481 |
SOCIAL | 5487867902 | 7.305183 |
PHOTOGRAPHY | 4649147655 | 6.188719 |
FAMILY | 4427941505 | 5.894260 |
VIDEO_PLAYERS | 3926902720 | 5.227302 |
TRAVEL_AND_LOCAL | 2894887146 | 3.853533 |
NEWS_AND_MAGAZINES | 2369217760 | 3.153787 |
ENTERTAINMENT | 2113660000 | 2.813601 |
BOOKS_AND_REFERENCE | 1665969576 | 2.217658 |
PERSONALIZATION | 1532494782 | 2.039983 |
SHOPPING | 1400348785 | 1.864076 |
HEALTH_AND_FITNESS | 1144022512 | 1.522867 |
SPORTS | 1096474498 | 1.459574 |
BUSINESS | 697164865 | 0.928032 |
LIFESTYLE | 503823539 | 0.670665 |
MAPS_AND_NAVIGATION | 503281890 | 0.669944 |
FINANCE | 455348734 | 0.606138 |
WEATHER | 361100520 | 0.480679 |
EDUCATION | 352952000 | 0.469833 |
FOOD_AND_DRINK | 211798751 | 0.281936 |
DATING | 140926107 | 0.187594 |
ART_AND_DESIGN | 114338100 | 0.152201 |
HOUSE_AND_HOME | 97212461 | 0.129404 |
AUTO_AND_VEHICLES | 53130211 | 0.070724 |
LIBRARIES_AND_DEMO | 52995910 | 0.070546 |
COMICS | 44981150 | 0.059877 |
MEDICAL | 38193177 | 0.050841 |
PARENTING | 31521110 | 0.041959 |
BEAUTY | 27197050 | 0.036203 |
EVENTS | 15973161 | 0.021263 |
#reset index for the DF
downloads_per_cat.reset_index(inplace=True)
#plot the % of total installs per category
downloads_per_cat.plot.bar(x='Category', y='% of total installs', figsize=(15,8), title='Google Play app installs per category (% of total)', color='g')
Now that we've explored and analysed the app categories, we can turn to the app ratings. App ratings are an important KPI as they impact discoverability, conversion and the brand's overall image. Questions to look at would be:
- what is the average app rating on Google Play?
- what is the ratings distribution (i.e. where do most apps rank? where do the minority of apps rank)?
- are the majority of apps highly rated?
# Average rating of apps
avg_app_rating = apps['Rating'].mean()
print('Average app rating = ', avg_app_rating)
Average app rating = 4.173243045387998
Looks like the average app rating on Google Play is about 4.2. Let's run a few more statistical calculations on Rating to see the highest, lowest and median ratings across all apps.
Pandas' describe function is really helpful as it provides summary descriptive statistics that tell us the central tendency, dispersion and shape of a dataset or column distribution.
#get summary statistics for the Rating col
apps['Rating'].describe()
count 8196.000000
mean 4.173243
std 0.536625
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64
INSIGHTS
- 8,196 apps in the dataset have a rating.
- The average app rating is about 4.2.
- The lowest rating is 1 and the highest rating is 5.
- The median (50th percentile) is 4.3.
Now, let's visualise the ratings distribution using a histogram.
#plot a histogram of the Rating col
apps['Rating'].plot.hist(figsize=(15,8), title='Google Play app ratings distribution', color='m')
INSIGHTS
As we can see in the above histogram, the ratings distribution is skewed to the left (negatively skewed): most of the mass sits at the high end, with a long tail towards low ratings. This tells us that most apps on Google Play are highly rated and only a minority have low ratings. We can also see that a rating between 4.3 and 4.6 is the most frequent.
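The direction of the skew can also be checked numerically rather than by eye. A negative value from `Series.skew()` indicates a longer tail towards low values, and a mean below the median points the same way (in the `describe()` output above, the mean of 4.17 sits below the median of 4.3). A minimal sketch on made-up ratings (`toy_ratings` is hypothetical):

```python
import pandas as pd

# Hypothetical ratings: mostly high values with a couple of low ones
toy_ratings = pd.Series([1.0, 3.0, 4.0, 4.2, 4.3, 4.4, 4.5, 4.5, 4.6, 4.8])

# Sample skewness: negative means the tail extends towards low ratings
skewness = toy_ratings.skew()
mean, median = toy_ratings.mean(), toy_ratings.median()

print("skewness:", skewness)
print("mean:", mean, "median:", median)
```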
Let's now examine app size and price.
Size is relevant to an app's success: an app that is too large may be difficult and/or expensive for users to download. Lengthy download times could turn users off before they even experience your app. Additionally, users' devices have a finite amount of space.
Pricing is also important to consider, as some users expect mobile apps to be free or rather inexpensive and so pricing can impact adoption rates and user numbers.
A few interesting questions we could explore here are:
- Does the size of an app affect its rating?
- Do users prefer lightweight apps?
- Does the price of an app affect its rating?
- Do users prefer free apps over paid apps?
As you'll see below, the majority of top rated apps (with a rating over 4) range from 2 MB to 20 MB.
We also find that the vast majority of top rated apps are either free or priced under $10.
import seaborn as sns
# Filter rows where both Rating and Size values are not null
apps_with_size_and_rating = apps[(~apps['Rating'].isnull()) & (~apps['Size'].isnull())]
# Subset for categories with at least 150 apps
large_categories = apps_with_size_and_rating.groupby('Category').filter(lambda x: len(x) >= 150).reset_index()
# Plot size vs. rating
plt1 = sns.jointplot(x = large_categories['Size'], y = large_categories['Rating'], kind = 'hex')
# Subset apps whose 'Type' is 'Paid'
paid_apps = apps_with_size_and_rating[apps_with_size_and_rating['Type'] == 'Paid']
# Plot price vs. rating
plt2 = sns.jointplot(x = paid_apps['Price'], y = paid_apps['Rating'])
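A joint plot shows where the points concentrate, but a correlation coefficient gives a quick numeric answer to whether size and rating move together. The sketch below is illustrative only: the values are made up, and Pearson correlation captures linear association, not causation.

```python
import pandas as pd

# Hypothetical app sizes (MB) and ratings
toy = pd.DataFrame({
    "Size": [2.0, 5.0, 12.0, 20.0, 45.0, 80.0],
    "Rating": [4.1, 4.4, 4.5, 4.3, 4.0, 3.9],
})

# Pearson correlation lies between -1 and 1; values near 0 mean little linear relationship
pearson = toy["Size"].corr(toy["Rating"])
print("Pearson correlation between Size and Rating:", pearson)
```

The same call could be applied to `Price` and `Rating` for the paid apps.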
App developers can adopt different business models: free, freemium or paid. Much depends on the app's functionality, features, complexity and purpose. It would be interesting to see how various app categories are priced and if there is any relationship between the category and pricing strategy.
Questions to explore:
- Which apps are the most expensive?
- Which app categories are the most expensive?
- How do the most popular app categories price their apps? Are they free or is there a general pricing tendency within the different categories?
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(15,8))
# Select a few popular app categories
popular_app_cats = apps[apps.Category.isin(['GAME', 'COMMUNICATION', 'TOOLS', 'PRODUCTIVITY',
'SOCIAL','PHOTOGRAPHY', 'FAMILY'])]
# Examine the price trend by plotting Price vs Category
ax = sns.stripplot(x = popular_app_cats['Price'], y = popular_app_cats['Category'], jitter=True, linewidth=1)
ax.set_title('App pricing trend across categories')
# Apps whose Price is greater than 50
apps_above_50 = popular_app_cats[['Category', 'App', 'Price']][popular_app_cats['Price'] > 50]
apps_above_50
| | Category | App | Price |
---|---|---|---|
3327 | FAMILY | most expensive app (H) | 399.99 |
4398 | FAMILY | I am Rich Plus | 399.99 |
4406 | FAMILY | I Am Rich Pro | 399.99 |
4410 | FAMILY | I Am Rich | 389.99 |
5661 | PRODUCTIVITY | cronometra-br | 154.99 |
popular_app_cats['Price'].describe()
count 4827.000000
mean 0.708461
std 11.784086
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 399.990000
Name: Price, dtype: float64
INSIGHTS
- Of the top 5 most expensive apps in these categories, 4 are in the Family category and one is in the Productivity category.
- The most expensive app costs $399.99.
- The average app price (among the most popular categories) is about $0.71.
- Most popular apps are cheap: at least 75% of them are free (the 75th percentile of Price is 0), and almost all are priced under $50, with a few outliers in Family and Productivity.
Now that we understand the general trend, let's zoom in on the cheaper most popular apps (under $50) and see if we can identify more granular insights about their pricing strategies.
# Select apps priced below $50
apps_under_50 = popular_app_cats[popular_app_cats['Price'] < 50]
fig, ax = plt.subplots(figsize=(15,8))
# Examine price vs category for the apps under $50
ax = sns.stripplot(x=apps_under_50['Category'], y=apps_under_50['Price'], data=apps_under_50,
jitter=True, linewidth=1)
ax.set_title('App pricing trend across categories after filtering for expensive apps')
INSIGHTS
- We can now see that most apps in the most popular categories are priced below $10.
- There are a few outliers in each category (except for Productivity apps, which all stay under $10).
- The Social category seems to be the cheapest, with only one outlier app above $10.
- The Family category seems to have the most expensive apps in this group, with a few apps priced over $30.
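The observations above are read off a strip plot; they could be backed up with numbers by computing, per category, the share of free apps and the median price among paid apps. A sketch on made-up prices (the `toy` frame and the names `pct_free` and `median_paid_price` are hypothetical):

```python
import pandas as pd

# Hypothetical prices for a handful of apps
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "FAMILY", "FAMILY", "SOCIAL"],
    "Price": [0.0, 0.0, 2.99, 0.0, 4.99, 0.0],
})

# Share of free apps per category: the mean of a boolean column is a proportion
pct_free = toy["Price"].eq(0).groupby(toy["Category"]).mean() * 100

# Median price among paid apps only; categories with no paid apps drop out
median_paid_price = toy[toy["Price"] > 0].groupby("Category")["Price"].median()

print(pct_free)
print(median_paid_price)
```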
Let's now explore if free apps get more installs than paid apps across the Google Play platform. Presumably, free apps are vastly more popular? Let's find out!
#Find out what app Types are in the dataset
apps.Type.unique()
array(['Free', 'Paid'], dtype=object)
paid_v_free = apps.groupby(['Type'])['Installs'].sum().to_frame().reset_index()
print(paid_v_free)
Type Installs
0 Free 75065572646
1 Paid 57364881
paid_v_free.plot.bar(x='Type', y='Installs', figsize=(15,8), title='Google Play paid vs free app downloads')
It seems that free apps on Google Play are massively more popular than paid apps! Free apps have had more than 75bn installs, whereas paid apps have had only around 57m.
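Total installs also depend on how many apps of each type exist, so it can help to look at the number of apps and the median installs per app alongside the totals. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical apps: more free apps than paid ones
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Free", "Paid"],
    "Installs": [1_000_000, 500_000, 10_000, 100, 10_000],
})

# count = number of apps, sum = total installs, median = typical installs per app
per_type = toy.groupby("Type")["Installs"].agg(["count", "sum", "median"])
print(per_type)
```

If the median installs per app is also much higher for free apps on the real data, the gap is not just an effect of there being more free apps.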
We can now examine the user reviews dataset to see if user sentiments vary much between free and paid apps. To do so, we'll look at the Sentiment Polarity score provided in the user reviews dataset.
To explore Sentiment Polarity against app type (paid or free) we'll first need to join the two datasets, as shown below.
# Load user_reviews.csv
reviews_df = pd.read_csv('datasets/user_reviews.csv')
# Join the two dataframes on the App column
merged_df = pd.merge(apps, reviews_df, on = 'App', how = "inner")
# Drop NA values from Sentiment and Translated_Review columns
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])
#View results
merged_df
| | Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up | A kid's excessive ads. The types ads allowed a... | Negative | -0.250000 | 1.000000 |
1 | 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up | It bad >:( | Negative | -0.725000 | 0.833333 |
2 | 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up | like | Neutral | 0.000000 | 0.000000 |
4 | 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up | I love colors inspyering | Positive | 0.500000 | 0.600000 |
5 | 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up | I hate | Negative | -0.800000 | 0.900000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
61540 | 9862 | A+ Gallery - Photos & Videos | PHOTOGRAPHY | 4.5 | 223941 | NaN | 10000000 | Free | 0.0 | Everyone | Photography | August 6, 2018 | Varies with device | Varies with device | Overall great app. Best gallery seen far | Positive | 0.475000 | 0.512500 |
61546 | 9862 | A+ Gallery - Photos & Videos | PHOTOGRAPHY | 4.5 | 223941 | NaN | 10000000 | Free | 0.0 | Everyone | Photography | August 6, 2018 | Varies with device | Varies with device | Recommended, 100% love it, keep good work dev ... | Positive | 0.566667 | 0.733333 |
61553 | 9862 | A+ Gallery - Photos & Videos | PHOTOGRAPHY | 4.5 | 223941 | NaN | 10000000 | Free | 0.0 | Everyone | Photography | August 6, 2018 | Varies with device | Varies with device | Too much ads | Positive | 0.200000 | 0.200000 |
61554 | 9862 | A+ Gallery - Photos & Videos | PHOTOGRAPHY | 4.5 | 223941 | NaN | 10000000 | Free | 0.0 | Everyone | Photography | August 6, 2018 | Varies with device | Varies with device | Just allow time ... | Neutral | 0.000000 | 0.000000 |
61555 | 9862 | A+ Gallery - Photos & Videos | PHOTOGRAPHY | 4.5 | 223941 | NaN | 10000000 | Free | 0.0 | Everyone | Photography | August 6, 2018 | Varies with device | Varies with device | It's good best gallery phone | Positive | 0.850000 | 0.450000 |
35929 rows × 18 columns
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35929 entries, 0 to 61555
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 35929 non-null int64
1 App 35929 non-null object
2 Category 35929 non-null object
3 Rating 35929 non-null float64
4 Reviews 35929 non-null int64
5 Size 24069 non-null float64
6 Installs 35929 non-null int64
7 Type 35929 non-null object
8 Price 35929 non-null float64
9 Content Rating 35929 non-null object
10 Genres 35929 non-null object
11 Last Updated 35929 non-null object
12 Current Ver 35929 non-null object
13 Android Ver 35929 non-null object
14 Translated_Review 35929 non-null object
15 Sentiment 35929 non-null object
16 Sentiment_Polarity 35929 non-null float64
17 Sentiment_Subjectivity 35929 non-null float64
dtypes: float64(5), int64(3), object(10)
memory usage: 5.2+ MB
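An inner join keeps only the apps that appear in both files, so apps without reviews (and reviews for apps missing from apps.csv) are silently dropped. The sketch below, on made-up frames (`toy_apps` and `toy_reviews` are hypothetical), shows the effect and uses the `indicator` option of `pd.merge` to see which rows an inner join would leave out:

```python
import pandas as pd

# Hypothetical apps and reviews
toy_apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Type": ["Free", "Paid", "Free"],
})
toy_reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Delta"],
    "Sentiment_Polarity": [0.5, -0.2, 0.1, 0.9],
})

# Inner join: one row per review whose app exists in both frames
inner = pd.merge(toy_apps, toy_reviews, on="App", how="inner")

# Outer join with indicator=True adds a _merge column marking unmatched rows
outer = pd.merge(toy_apps, toy_reviews, on="App", how="outer", indicator=True)

print(inner)
print(outer["_merge"].value_counts())
```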
fig, ax = plt.subplots(figsize=(11,8))
# User review sentiment polarity for paid vs. free apps
ax=sns.boxplot(x= merged_df['Type'],y = merged_df['Sentiment_Polarity'])
A box plot shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
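The outlier rule mentioned above can be written out explicitly. With the default setting in seaborn and matplotlib (`whis=1.5`), points further than 1.5 times the inter-quartile range from the box are drawn as outliers. A minimal sketch on made-up polarity scores (`toy_polarity` is hypothetical):

```python
import pandas as pd

# Hypothetical sentiment polarity scores
toy_polarity = pd.Series([-1.0, -0.2, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Quartiles and inter-quartile range (the height of the box)
q1, q3 = toy_polarity.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond these fences are plotted individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_polarity[(toy_polarity < lower_fence) | (toy_polarity > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("outliers:", outliers.tolist())
```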
INSIGHTS
By plotting sentiment polarity scores of user reviews for paid and free apps, we can see that:
- Free apps receive a lot of harsh comments, as indicated by the longer whisker and the outliers on the negative side of the y-axis.
- Reviews for paid apps appear to hardly ever be extremely negative. This may indicate that paid apps tend to be of better quality on average.
- The median polarity score for paid apps is a little higher than for free apps, which is consistent with this interpretation (although the difference between paid and free apps isn't huge).
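The medians and extremes read off the box plot can also be computed directly with a `groupby` on the merged data. A sketch on made-up scores (the `toy_merged` frame is hypothetical and only mimics the `Type` and `Sentiment_Polarity` columns):

```python
import pandas as pd

# Hypothetical merged rows: one review per row
toy_merged = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Sentiment_Polarity": [-0.8, 0.1, 0.3, 0.2, 0.4],
})

# Median and most negative polarity score per app type
polarity_by_type = toy_merged.groupby("Type")["Sentiment_Polarity"].agg(["median", "min"])
print(polarity_by_type)
```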