1. Project overview & the data

This project will showcase a comprehensive analysis of the Android app market by comparing over 10,000 apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.

This project is based on an exercise in the "Data Scientist with Python Track" on DataCamp which I am currently undertaking (Jan 2021) with various additions and changes of my own.

The Dataset comprises two files:

apps.csv: details of the applications on Google Play. There are 13 features that describe any given app in the dataset.
user_reviews.csv: contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.

We'll start by importing the apps csv file and running a few preliminary queries to:

drop duplicates
check how many unique apps are in the dataset
print an info summary of the data to check data types, null values, columns etc.
print a random sample of 5 rows

2. Importing the data & initial review

# Read in dataset
import pandas as pd
apps_with_duplicates = pd.read_csv('datasets/apps.csv')

# Drop duplicates
apps = apps_with_duplicates.drop_duplicates()

# Print the total number of apps
print('Total number of apps in the dataset = ', len(apps.index))

# Print a concise summary of apps dataframe
print(apps.info())

# Have a look at a random sample of n rows
n = 5
apps.sample(n)

Total number of apps in the dataset =  9659
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 9658
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      9659 non-null   int64  
 1   App             9659 non-null   object 
 2   Category        9659 non-null   object 
 3   Rating          8196 non-null   float64
 4   Reviews         9659 non-null   int64  
 5   Size            8432 non-null   float64
 6   Installs        9659 non-null   object 
 7   Type            9659 non-null   object 
 8   Price           9659 non-null   object 
 9   Content Rating  9659 non-null   object 
 10  Genres          9659 non-null   object 
 11  Last Updated    9659 non-null   object 
 12  Current Ver     9651 non-null   object 
 13  Android Ver     9657 non-null   object 
dtypes: float64(2), int64(2), object(10)
memory usage: 1.1+ MB
None


	Unnamed: 0	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver
3755	4671	V LIVE - Star Live App	FAMILY	4.4	397147	NaN	10,000,000+	Free	0	Teen	Entertainment	August 2, 2018	Varies with device	Varies with device
6046	7089	CA UIM Mobile	BUSINESS	3.2	54	7.7	1,000+	Free	0	Everyone	Business	February 29, 2016	8.4.1	4.0 and up
7616	8726	DRAGON QUEST VI	FAMILY	4.4	3358	17.0	100,000+	Paid	$14.99	Everyone	Role Playing	August 1, 2018	1.0.4	2.3 and up
4018	4961	Ad Blocker Turbo - Adblocker Browser	COMMUNICATION	3.9	107	44.0	10,000+	Free	0	Everyone	Communication	January 29, 2018	1.0.5	4.1 and up
8447	9578	EM Launcher for EMUI	PERSONALIZATION	4.1	3175	2.3	500,000+	Free	0	Everyone	Personalization	April 6, 2018	1.0.6	4.1 and up
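
As a quick cross-check on the "unique apps" step, the sketch below counts fully duplicated rows and then confirms that each app name appears only once after de-duplication. The `toy_apps` frame and its rows are hypothetical stand-ins, not values from apps.csv.

```python
import pandas as pd

# Toy stand-in for apps.csv (hypothetical rows, not the real data)
toy_apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Beta", "Gamma"],
    "Category": ["TOOLS", "GAME", "GAME", "FAMILY"],
    "Rating": [4.1, 3.9, 3.9, None],
})

# Count rows that are exact copies of an earlier row, then drop them
n_duplicate_rows = int(toy_apps.duplicated().sum())
deduped = toy_apps.drop_duplicates()

# If every app name is unique, this equals the number of remaining rows
n_unique_apps = deduped["App"].nunique()
print(n_duplicate_rows, len(deduped), n_unique_apps)
```

If `n_unique_apps` were smaller than `len(deduped)`, some apps would still appear more than once with differing values in other columns.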

2. Data cleaning

The features/columns that we will focus on are Category, Rating, Installs, Size, and Price.

The info() summary from the previous step showed that the Installs and Price columns are of type 'object' rather than int64 or float64 as we would expect. This is because their values contain characters other than digits (for example '10,000+' or '$14.99'). Ideally, we would want these columns to be numeric so we can perform calculations on them for further analysis.

Let's move on to data cleaning, which would involve:

removing special characters (, $ +) in the Installs and Price columns as those make their conversion to a numerical data type difficult.
converting those columns to numeric data type.

# List of characters to remove
chars_to_remove = ['+', ',', '$']
# List of column names to clean
cols_to_clean = ['Installs', 'Price']

# Loop for each column
for col in cols_to_clean:
    for char in chars_to_remove:
        # regex=False treats '+' and '$' as literal characters, not regex operators
        apps[col] = apps[col].astype(str).str.replace(char, "", regex=False)
    # Convert col to numeric
    apps[col] = pd.to_numeric(apps[col])

#check type has been changed
apps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9659 entries, 0 to 9658
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      9659 non-null   int64  
 1   App             9659 non-null   object 
 2   Category        9659 non-null   object 
 3   Rating          8196 non-null   float64
 4   Reviews         9659 non-null   int64  
 5   Size            8432 non-null   float64
 6   Installs        9659 non-null   int64  
 7   Type            9659 non-null   object 
 8   Price           9659 non-null   float64
 9   Content Rating  9659 non-null   object 
 10  Genres          9659 non-null   object 
 11  Last Updated    9659 non-null   object 
 12  Current Ver     9651 non-null   object 
 13  Android Ver     9657 non-null   object 
dtypes: float64(3), int64(3), object(8)
memory usage: 1.1+ MB

#check characters have been removed
apps.Installs.head(5)

0       10000
1      500000
2     5000000
3    50000000
4      100000
Name: Installs, dtype: int64

apps.Price.head(5)

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: Price, dtype: float64
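
The cleaning step can also be written so that unexpected values surface instead of raising an error. The sketch below is one possible variant, not the method used above: it strips the three characters with a single regular-expression character class and uses `errors="coerce"` so that anything still non-numeric becomes NaN. The `raw` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values in the same style as the Installs and Price columns
raw = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "Free"],
    "Price": ["0", "$4.99", "Everyone"],
})

cleaned = raw.copy()
for col in ["Installs", "Price"]:
    # Inside a character class, '+', ',' and '$' are all literal characters
    stripped = cleaned[col].str.replace(r"[+,$]", "", regex=True)
    # errors="coerce" turns anything non-numeric into NaN instead of raising
    cleaned[col] = pd.to_numeric(stripped, errors="coerce")

# Rows where at least one cleaned column failed to convert
bad_rows = cleaned[cleaned.isna().any(axis=1)]
print(cleaned)
print(len(bad_rows))
```

Inspecting `bad_rows` before moving on makes it easy to spot malformed records that would otherwise break the conversion.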

3. Exploring the data

App categories

We'll start exploring our dataset by looking at the app categories. It would be interesting to explore the following questions:

How many app categories are there on Google Play?
Which category has the highest share of (active) apps in the Google Play market?
Is any specific category dominating the Google Play market (i.e. which category has the highest number of installs)?
Which categories have the fewest number of apps?

We will see that:

there are 33 unique app categories present in our dataset.
Family and Game apps have the highest market prevalence (with almost 19% and 10% of the market respectively). Tools, Business and Medical apps are also at the top.
The categories with the fewest apps on Google Play are Beauty, Comics and Parenting (with around 0.6% of the market each).

Let's explore!

# Print the total number of unique categories
num_categories = len(apps['Category'].unique())
print('Number of categories = ', num_categories)

# Count the number of apps in each 'Category' and sort them in descending order
num_apps_in_category = apps['Category'].value_counts().sort_values(ascending = False)
print(num_apps_in_category)

Number of categories =  33
FAMILY                 1832
GAME                    959
TOOLS                   827
BUSINESS                420
MEDICAL                 395
PERSONALIZATION         376
PRODUCTIVITY            374
LIFESTYLE               369
FINANCE                 345
SPORTS                  325
COMMUNICATION           315
HEALTH_AND_FITNESS      288
PHOTOGRAPHY             281
NEWS_AND_MAGAZINES      254
SOCIAL                  239
BOOKS_AND_REFERENCE     222
TRAVEL_AND_LOCAL        219
SHOPPING                202
DATING                  171
VIDEO_PLAYERS           163
MAPS_AND_NAVIGATION     131
EDUCATION               119
FOOD_AND_DRINK          112
ENTERTAINMENT           102
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       84
WEATHER                  79
HOUSE_AND_HOME           74
EVENTS                   64
ART_AND_DESIGN           64
PARENTING                60
COMICS                   56
BEAUTY                   53
Name: Category, dtype: int64

#convert series into dataframe
num_apps_in_category = num_apps_in_category.to_frame().reset_index()

#validate conversion to DF worked 
num_apps_in_category.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   index     33 non-null     object
 1   Category  33 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 656.0+ bytes

#check the index column
num_apps_in_category.index

RangeIndex(start=0, stop=33, step=1)

#rename dataframe cols (the '% of market' column is added further down)
num_apps_in_category.columns = ['category', '# of apps']

# Visualise the number of apps by category

num_apps_in_category.plot.barh(x='category', y='# of apps', figsize=(15,8), title='Google Play apps per category')


#calculate percentage of overall app market per category
num_apps_in_category['% of market'] = 100 * num_apps_in_category['# of apps'] / num_apps_in_category['# of apps'].sum()

#view result
num_apps_in_category


	category	# of apps	% of market
0	FAMILY	1832	18.966767
1	GAME	959	9.928564
2	TOOLS	827	8.561963
3	BUSINESS	420	4.348276
4	MEDICAL	395	4.089450
5	PERSONALIZATION	376	3.892743
6	PRODUCTIVITY	374	3.872036
7	LIFESTYLE	369	3.820271
8	FINANCE	345	3.571798
9	SPORTS	325	3.364738
10	COMMUNICATION	315	3.261207
11	HEALTH_AND_FITNESS	288	2.981675
12	PHOTOGRAPHY	281	2.909204
13	NEWS_AND_MAGAZINES	254	2.629672
14	SOCIAL	239	2.474376
15	BOOKS_AND_REFERENCE	222	2.298375
16	TRAVEL_AND_LOCAL	219	2.267315
17	SHOPPING	202	2.091314
18	DATING	171	1.770370
19	VIDEO_PLAYERS	163	1.687545
20	MAPS_AND_NAVIGATION	131	1.356248
21	EDUCATION	119	1.232012
22	FOOD_AND_DRINK	112	1.159540
23	ENTERTAINMENT	102	1.056010
24	AUTO_AND_VEHICLES	85	0.880008
25	LIBRARIES_AND_DEMO	84	0.869655
26	WEATHER	79	0.817890
27	HOUSE_AND_HOME	74	0.766125
28	EVENTS	64	0.662594
29	ART_AND_DESIGN	64	0.662594
30	PARENTING	60	0.621182
31	COMICS	56	0.579770
32	BEAUTY	53	0.548711
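
The same market-share figures can be obtained in one step with `value_counts(normalize=True)`, which returns fractions rather than counts. This is a minimal sketch on a hypothetical four-row `categories` Series, not the real dataset.

```python
import pandas as pd

# Hypothetical category labels standing in for apps['Category']
categories = pd.Series(["FAMILY", "FAMILY", "GAME", "TOOLS"], name="Category")

# normalize=True returns each category's fraction of all rows
share = categories.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the real data this would replace the separate count, sum and division steps, at the cost of not keeping the raw counts alongside the percentages.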

#reset category col as index
num_apps_in_category.set_index(['category'], inplace=True)

#plot categories by % of total apps on Google Play
num_apps_in_category.plot.pie(y='% of market',figsize=(20,10), legend=None)


Now that we know which categories offer the most and least apps we can also check which category is dominating the market, based on the total number of installs of apps within a category (as opposed to the number of apps available within a category). It would be interesting to see if this analysis brings up the same categories as above.

downloads_per_cat = apps.groupby(['Category'])['Installs'].sum().sort_values(ascending=False)

print(downloads_per_cat)

Category
GAME                   13878924415
COMMUNICATION          11038276251
TOOLS                   8001771915
PRODUCTIVITY            5793091369
SOCIAL                  5487867902
PHOTOGRAPHY             4649147655
FAMILY                  4427941505
VIDEO_PLAYERS           3926902720
TRAVEL_AND_LOCAL        2894887146
NEWS_AND_MAGAZINES      2369217760
ENTERTAINMENT           2113660000
BOOKS_AND_REFERENCE     1665969576
PERSONALIZATION         1532494782
SHOPPING                1400348785
HEALTH_AND_FITNESS      1144022512
SPORTS                  1096474498
BUSINESS                 697164865
LIFESTYLE                503823539
MAPS_AND_NAVIGATION      503281890
FINANCE                  455348734
WEATHER                  361100520
EDUCATION                352952000
FOOD_AND_DRINK           211798751
DATING                   140926107
ART_AND_DESIGN           114338100
HOUSE_AND_HOME            97212461
AUTO_AND_VEHICLES         53130211
LIBRARIES_AND_DEMO        52995910
COMICS                    44981150
MEDICAL                   38193177
PARENTING                 31521110
BEAUTY                    27197050
EVENTS                    15973161
Name: Installs, dtype: int64

INSIGHTS

Interestingly, when it comes to the number of downloads, the Game category has the highest total (almost 14bn downloads), followed by Communication (with over 11bn downloads), then Tools and Productivity (with 8bn and 5.8bn downloads respectively).
By running a similar calculation as above, we can see the share of installs for these categories: Game captures about 18.5% of the total installs on Google Play, while Communication apps account for almost 15% of the total installs on the platform.
Surprisingly, fitness/health apps only amount to 1.5% of the installs and dating apps only around 0.2%.

#convert series to DF
downloads_per_cat = downloads_per_cat.to_frame()

downloads_per_cat


	Installs
Category
GAME	13878924415
COMMUNICATION	11038276251
TOOLS	8001771915
PRODUCTIVITY	5793091369
SOCIAL	5487867902
PHOTOGRAPHY	4649147655
FAMILY	4427941505
VIDEO_PLAYERS	3926902720
TRAVEL_AND_LOCAL	2894887146
NEWS_AND_MAGAZINES	2369217760
ENTERTAINMENT	2113660000
BOOKS_AND_REFERENCE	1665969576
PERSONALIZATION	1532494782
SHOPPING	1400348785
HEALTH_AND_FITNESS	1144022512
SPORTS	1096474498
BUSINESS	697164865
LIFESTYLE	503823539
MAPS_AND_NAVIGATION	503281890
FINANCE	455348734
WEATHER	361100520
EDUCATION	352952000
FOOD_AND_DRINK	211798751
DATING	140926107
ART_AND_DESIGN	114338100
HOUSE_AND_HOME	97212461
AUTO_AND_VEHICLES	53130211
LIBRARIES_AND_DEMO	52995910
COMICS	44981150
MEDICAL	38193177
PARENTING	31521110
BEAUTY	27197050
EVENTS	15973161

#calculate % of installs out of total and place in new col
downloads_per_cat['% of total installs'] = 100 * downloads_per_cat['Installs'] / downloads_per_cat['Installs'].sum()
downloads_per_cat


	Installs	% of total installs
Category
GAME	13878924415	18.474949
COMMUNICATION	11038276251	14.693616
TOOLS	8001771915	10.651570
PRODUCTIVITY	5793091369	7.711481
SOCIAL	5487867902	7.305183
PHOTOGRAPHY	4649147655	6.188719
FAMILY	4427941505	5.894260
VIDEO_PLAYERS	3926902720	5.227302
TRAVEL_AND_LOCAL	2894887146	3.853533
NEWS_AND_MAGAZINES	2369217760	3.153787
ENTERTAINMENT	2113660000	2.813601
BOOKS_AND_REFERENCE	1665969576	2.217658
PERSONALIZATION	1532494782	2.039983
SHOPPING	1400348785	1.864076
HEALTH_AND_FITNESS	1144022512	1.522867
SPORTS	1096474498	1.459574
BUSINESS	697164865	0.928032
LIFESTYLE	503823539	0.670665
MAPS_AND_NAVIGATION	503281890	0.669944
FINANCE	455348734	0.606138
WEATHER	361100520	0.480679
EDUCATION	352952000	0.469833
FOOD_AND_DRINK	211798751	0.281936
DATING	140926107	0.187594
ART_AND_DESIGN	114338100	0.152201
HOUSE_AND_HOME	97212461	0.129404
AUTO_AND_VEHICLES	53130211	0.070724
LIBRARIES_AND_DEMO	52995910	0.070546
COMICS	44981150	0.059877
MEDICAL	38193177	0.050841
PARENTING	31521110	0.041959
BEAUTY	27197050	0.036203
EVENTS	15973161	0.021263

#reset index for the DF
downloads_per_cat.reset_index(inplace=True)

#plot the % of total installs per category
downloads_per_cat.plot.bar(x='Category', y='% of total installs', figsize=(15,8), title='Google Play app installs per category (% of total)', color='g')

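
To compare the two views of the market (number of apps versus number of installs) in one table, both can be computed in a single `groupby`. The sketch below uses named aggregation on a hypothetical `toy` frame; the column names `n_apps`, `total_installs` and `installs_per_app` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned apps dataframe
toy = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "GAME", "GAME", "COMMUNICATION"],
    "App": ["a", "b", "c", "d", "e", "f"],
    "Installs": [1000, 5000, 10000, 1000000, 500000, 5000000],
})

# One row per category: how many apps it has and how many installs they total
summary = toy.groupby("Category").agg(
    n_apps=("App", "count"),
    total_installs=("Installs", "sum"),
)

# Average installs per app shows whether a category is large or merely crowded
summary["installs_per_app"] = summary["total_installs"] / summary["n_apps"]
summary = summary.sort_values("total_installs", ascending=False)
print(summary)
```

In the toy data the category with the most apps has the fewest installs, which is the kind of mismatch the two rankings above reveal for Family and Medical.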

App ratings

Now that we've explored and analysed the app categories, we can turn to the app ratings. App ratings are an important KPI as they impact discoverability, conversion and the brand's overall image. Questions to look at would be:

what is the average app rating on Google Play?
what is the ratings distribution (i.e. where do most apps rank? where do the minority of apps rank)?
are the majority of apps highly rated?

# Average rating of apps
avg_app_rating = apps['Rating'].mean()
print('Average app rating = ', avg_app_rating)

Average app rating =  4.173243045387998

Looks like the average app rating on Google Play is about 4.17. Let's run a few more statistical calculations on Rating to see the highest, lowest and median ratings across all apps.

Pandas' describe function is really helpful as it provides summary descriptive statistics that tell us the central tendency, dispersion and shape of a dataset or column distribution.

#get summary statistics for the Rating col
apps['Rating'].describe()

count    8196.000000
mean        4.173243
std         0.536625
min         1.000000
25%         4.000000
50%         4.300000
75%         4.500000
max         5.000000
Name: Rating, dtype: float64

INSIGHTS

8,196 apps in the dataset have a rating.
The average app rating is about 4.17.
The lowest rating is 1 and the highest rating is 5.
The median (50th percentile) is 4.3.

Now, let's visualise the ratings distribution using a histogram.

#plot a histogram of the Rating col
apps['Rating'].plot.hist(figsize=(15,8), title='Google Play app ratings distribution', color='m')


INSIGHTS

As we can see in the above histogram, the ratings distribution is skewed to the left (negatively skewed): the bulk of the ratings sits at the high end, with a long tail towards low ratings. This tells us that most apps on Google Play are highly rated and only a minority have low ratings. We can also see that a rating between 4.3 and 4.6 is the most frequent.
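
The direction of the skew can also be checked numerically rather than by eye. A minimal sketch, assuming a hypothetical `ratings` Series in place of apps['Rating']: `Series.skew()` is negative when the longer tail points towards low values, and the share of ratings at or above 4 gives a rough measure of how many apps are highly rated.

```python
import pandas as pd

# Hypothetical ratings: mostly high, with one very low value
ratings = pd.Series([1.0, 3.5, 4.0, 4.2, 4.3, 4.4, 4.5, 4.5, 4.6, 4.7, 5.0])

# Negative skewness means the longer tail is on the low-rating side
skewness = ratings.skew()

# Fraction of ratings that are 4.0 or higher
share_4_plus = (ratings >= 4.0).mean()
print(round(skewness, 2), round(share_4_plus, 2))
```

On the real data, `apps['Rating'].skew()` skips missing ratings by default, so no extra filtering is needed.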

App size & price vs rating

Let's now examine app size and price.

Size is relevant to an app's success: an app that is too large may be difficult and/or expensive for users to download. Lengthy download times could turn users off before they even experience your app. Additionally, users' devices have a finite amount of space.

Pricing is also important to consider, as some users expect mobile apps to be free or rather inexpensive and so pricing can impact adoption rates and user numbers.

A few interesting questions we could explore here are:

Does the size of an app affect its rating?
Do users prefer lightweight apps?
Does the price of an app affect its rating?
Do users prefer free apps over paid apps?

As you'll see below, the majority of top rated apps (with a rating over 4) range from 2 MB to 20 MB.

We also find that the vast majority of top rated apps are either free or priced under $10.

# seaborn is needed for the joint plots below
import seaborn as sns

# Filter rows where both Rating and Size values are not null
apps_with_size_and_rating = apps[apps['Rating'].notnull() & apps['Size'].notnull()]

# Subset for categories with at least 150 apps
large_categories = apps_with_size_and_rating.groupby('Category').filter(lambda x: len(x) >= 150).reset_index()

# Plot size vs. rating
plt1 = sns.jointplot(x = large_categories['Size'], y = large_categories['Rating'], kind = 'hex')

# Subset apps whose 'Type' is 'Paid'
paid_apps = apps_with_size_and_rating[apps_with_size_and_rating['Type'] == 'Paid']

# Plot price vs. rating
plt2 = sns.jointplot(x = paid_apps['Price'], y = paid_apps['Rating'])
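
The claim that most top rated apps fall between 2 MB and 20 MB can be checked with a filter and `Series.between`. The sketch below shows the idea on a hypothetical `toy` frame; the thresholds (rating over 4, size 2 to 20) come from the text above, while the data values are invented for illustration.

```python
import pandas as pd

# Hypothetical sizes (MB) and ratings standing in for the apps dataframe
toy = pd.DataFrame({
    "Size": [3.0, 15.0, 45.0, 8.0, None, 19.0],
    "Rating": [4.5, 4.2, 4.6, 3.1, 4.8, 4.4],
})

# Keep rows where both values are present, then select the top rated apps
rated = toy.dropna(subset=["Size", "Rating"])
top_rated = rated[rated["Rating"] > 4]

# Share of top rated apps whose size is between 2 MB and 20 MB (inclusive)
share_small = top_rated["Size"].between(2, 20).mean()

# Pearson correlation as a rough summary of the size/rating relationship
corr = rated["Size"].corr(rated["Rating"])
print(share_small, round(corr, 2))
```

A correlation close to zero would suggest that size alone says little about rating, even if most highly rated apps happen to be small.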

App price and categories

App developers can adopt different business models: free, freemium or paid. Much depends on the app's functionality, features, complexity and purpose. It would be interesting to see how various app categories are priced and if there is any relationship between the category and pricing strategy.

Questions to explore:

Which apps are the most expensive?
Which app categories are the most expensive?
How do the most popular app categories price their apps? Are they free or is there a general pricing tendency within the different categories?

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(15,8))

# Select a few popular app categories
popular_app_cats = apps[apps.Category.isin(['GAME', 'COMMUNICATION', 'TOOLS', 'PRODUCTIVITY',
                                            'SOCIAL','PHOTOGRAPHY', 'FAMILY'])]

# Examine the price trend by plotting Price vs Category
ax = sns.stripplot(x = popular_app_cats['Price'], y = popular_app_cats['Category'], jitter=True, linewidth=1)
ax.set_title('App pricing trend across categories')

# Apps whose Price is greater than 50
apps_above_50 = popular_app_cats[['Category', 'App', 'Price']][popular_app_cats['Price'] > 50]
apps_above_50


	Category	App	Price
3327	FAMILY	most expensive app (H)	399.99
4398	FAMILY	I am Rich Plus	399.99
4406	FAMILY	I Am Rich Pro	399.99
4410	FAMILY	I Am Rich	389.99
5661	PRODUCTIVITY	cronometra-br	154.99

popular_app_cats['Price'].describe()

count    4827.000000
mean        0.708461
std        11.784086
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       399.990000
Name: Price, dtype: float64

INSIGHTS

Of the top 5 most expensive apps, 4 are in the Family category and one is in the Productivity category.
The most expensive app costs $399.99.
The average app price (among the most popular categories) is about $0.71.
Most popular apps are cheap: the general pricing tendency across all popular categories is under $50 (with a few outliers in Family and Productivity).

Now that we understand the general trend, let's zoom in on the cheaper apps in the most popular categories (under $50) and see if we can identify more granular insights about their pricing strategies.

# Select apps priced below $50

apps_under_50 = popular_app_cats[popular_app_cats['Price'] < 50]

fig, ax = plt.subplots(figsize=(15,8))

# Examine price vs category with apps_under_50
ax = sns.stripplot(x='Category', y='Price', data=apps_under_50,
                   jitter=True, linewidth=1)
ax.set_title('App pricing trend across categories after filtering for expensive apps')


INSIGHTS

We can now see that most apps in the most popular categories are priced below $10.
There are a few outliers in each category (except for productivity apps which all stay under $10).
The social category seems to be the cheapest, with only one outlier app above $10.
The Family category seems to have the most expensive apps in this group, with a few apps priced over $30.
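
The strip plot can be complemented with two numbers per category: the share of apps that are free and the median price among paid apps. This is a sketch on a hypothetical `toy` frame, and the names `free_share` and `paid_median` are illustrative.

```python
import pandas as pd

# Hypothetical prices standing in for popular_app_cats
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "FAMILY", "FAMILY", "SOCIAL"],
    "Price": [0.0, 0.0, 2.99, 0.0, 4.99, 0.0],
})

# Percentage of apps in each category that cost nothing
free_share = (toy["Price"] == 0).groupby(toy["Category"]).mean().mul(100)

# Median price among paid apps only; categories with no paid apps drop out
paid_median = toy[toy["Price"] > 0].groupby("Category")["Price"].median()
print(free_share.round(1))
print(paid_median)
```

Using the median rather than the mean keeps a handful of very expensive apps from dominating the summary.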

Are free apps more popular than paid apps?

Let's now explore whether free apps get more installs than paid apps across the Google Play platform. Presumably, free apps are vastly more popular? Let's find out!

#Find out what app Types are in the dataset
apps.Type.unique()

array(['Free', 'Paid'], dtype=object)

paid_v_free = apps.groupby(['Type'])['Installs'].sum().to_frame().reset_index()

print(paid_v_free)

   Type     Installs
0  Free  75065572646
1  Paid     57364881

paid_v_free.plot.bar(x='Type', y='Installs', figsize=(15,8), title='Google Play paid vs free app downloads')


It seems that free apps on Google Play are massively more popular than paid apps! Free apps have had more than 75bn installs, whereas paid apps have had only around 57m.
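
Total installs depend partly on how many apps of each type exist, so it can help to look at per-app figures as well. A minimal sketch, assuming a hypothetical `toy` frame in place of the apps dataframe:

```python
import pandas as pd

# Hypothetical install counts standing in for apps[['Type', 'Installs']]
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1000000, 50000, 10000, 5000, 1000],
})

# Number of apps, total installs and the typical (median) installs per app
by_type = toy.groupby("Type")["Installs"].agg(["count", "sum", "median"])
print(by_type)
```

If the median for free apps is still far above that for paid apps, the gap is not just a side effect of there being more free apps.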

Sentiment analysis of user reviews

We can now examine the user reviews dataset to see if user sentiments vary much between free and paid apps. To do so, we'll look at the Sentiment Polarity score provided in the user reviews dataset.

To explore Sentiment Polarity against app type (paid or free) we'll first need to join the two datasets, as shown below.

# Load user_reviews.csv
reviews_df = pd.read_csv('datasets/user_reviews.csv')

# Merge the two dataframes on the app name, keeping only apps present in both
merged_df = pd.merge(apps, reviews_df, on = 'App', how = "inner")

# Drop NA values from Sentiment and Translated_Review columns
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])

#View results
merged_df


	Unnamed: 0	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Translated_Review	Sentiment	Sentiment_Polarity	Sentiment_Subjectivity
0	1	Coloring book moana	ART_AND_DESIGN	3.9	967	14.0	500000	Free	0.0	Everyone	Art & Design;Pretend Play	January 15, 2018	2.0.0	4.0.3 and up	A kid's excessive ads. The types ads allowed a...	Negative	-0.250000	1.000000
1	1	Coloring book moana	ART_AND_DESIGN	3.9	967	14.0	500000	Free	0.0	Everyone	Art & Design;Pretend Play	January 15, 2018	2.0.0	4.0.3 and up	It bad >:(	Negative	-0.725000	0.833333
2	1	Coloring book moana	ART_AND_DESIGN	3.9	967	14.0	500000	Free	0.0	Everyone	Art & Design;Pretend Play	January 15, 2018	2.0.0	4.0.3 and up	like	Neutral	0.000000	0.000000
4	1	Coloring book moana	ART_AND_DESIGN	3.9	967	14.0	500000	Free	0.0	Everyone	Art & Design;Pretend Play	January 15, 2018	2.0.0	4.0.3 and up	I love colors inspyering	Positive	0.500000	0.600000
5	1	Coloring book moana	ART_AND_DESIGN	3.9	967	14.0	500000	Free	0.0	Everyone	Art & Design;Pretend Play	January 15, 2018	2.0.0	4.0.3 and up	I hate	Negative	-0.800000	0.900000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
61540	9862	A+ Gallery - Photos & Videos	PHOTOGRAPHY	4.5	223941	NaN	10000000	Free	0.0	Everyone	Photography	August 6, 2018	Varies with device	Varies with device	Overall great app. Best gallery seen far	Positive	0.475000	0.512500
61546	9862	A+ Gallery - Photos & Videos	PHOTOGRAPHY	4.5	223941	NaN	10000000	Free	0.0	Everyone	Photography	August 6, 2018	Varies with device	Varies with device	Recommended, 100% love it, keep good work dev ...	Positive	0.566667	0.733333
61553	9862	A+ Gallery - Photos & Videos	PHOTOGRAPHY	4.5	223941	NaN	10000000	Free	0.0	Everyone	Photography	August 6, 2018	Varies with device	Varies with device	Too much ads	Positive	0.200000	0.200000
61554	9862	A+ Gallery - Photos & Videos	PHOTOGRAPHY	4.5	223941	NaN	10000000	Free	0.0	Everyone	Photography	August 6, 2018	Varies with device	Varies with device	Just allow time ...	Neutral	0.000000	0.000000
61555	9862	A+ Gallery - Photos & Videos	PHOTOGRAPHY	4.5	223941	NaN	10000000	Free	0.0	Everyone	Photography	August 6, 2018	Varies with device	Varies with device	It's good best gallery phone	Positive	0.850000	0.450000

35929 rows × 18 columns
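
The effect of the inner join and the subsequent `dropna` is easiest to see on a small example. In the sketch below, `apps_toy` and `reviews_toy` are hypothetical frames; `validate="one_to_many"` is an optional safeguard (not used above) that raises an error if an app name appears more than once on the apps side.

```python
import pandas as pd

# Hypothetical apps table: one row per app
apps_toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Type": ["Free", "Paid", "Free"],
})

# Hypothetical reviews table: several rows per app, some with missing text
reviews_toy = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Delta"],
    "Translated_Review": ["Great", None, "Fine", "Bad"],
    "Sentiment": ["Positive", None, "Neutral", "Negative"],
})

# Inner join keeps only apps found in both tables (Gamma and Delta drop out)
merged = pd.merge(apps_toy, reviews_toy, on="App", how="inner",
                  validate="one_to_many")

# Remove reviews with no sentiment label or no text
merged = merged.dropna(subset=["Sentiment", "Translated_Review"])
print(merged)
```

Apps without any reviews disappear at the join, so the merged frame covers only a subset of the apps analysed earlier.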

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35929 entries, 0 to 61555
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              35929 non-null  int64  
 1   App                     35929 non-null  object 
 2   Category                35929 non-null  object 
 3   Rating                  35929 non-null  float64
 4   Reviews                 35929 non-null  int64  
 5   Size                    24069 non-null  float64
 6   Installs                35929 non-null  int64  
 7   Type                    35929 non-null  object 
 8   Price                   35929 non-null  float64
 9   Content Rating          35929 non-null  object 
 10  Genres                  35929 non-null  object 
 11  Last Updated            35929 non-null  object 
 12  Current Ver             35929 non-null  object 
 13  Android Ver             35929 non-null  object 
 14  Translated_Review       35929 non-null  object 
 15  Sentiment               35929 non-null  object 
 16  Sentiment_Polarity      35929 non-null  float64
 17  Sentiment_Subjectivity  35929 non-null  float64
dtypes: float64(5), int64(3), object(10)
memory usage: 5.2+ MB

fig, ax = plt.subplots(figsize=(11,8))

# User review sentiment polarity for paid vs. free apps
ax=sns.boxplot(x= merged_df['Type'],y = merged_df['Sentiment_Polarity'])

A box plot shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
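
The outlier rule mentioned above is commonly 1.5 times the inter-quartile range, which is also seaborn's default (`whis=1.5`). The sketch below computes those bounds by hand on a hypothetical `polarity` Series to show which points would be drawn as outliers.

```python
import pandas as pd

# Hypothetical polarity scores standing in for one group of reviews
polarity = pd.Series([-1.0, -0.2, 0.0, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5])

# First and third quartiles define the edges of the box
q1, q3 = polarity.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = polarity[(polarity < lower) | (polarity > upper)]
print(lower, upper, list(outliers))
```

The whiskers then extend to the most extreme data points that still lie within these bounds.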

INSIGHTS

By plotting sentiment polarity scores of user reviews for paid and free apps, we can see that:

Free apps receive a lot of harsh comments, as indicated by the longer whiskers and outliers on the negative y-axis.
Reviews for paid apps appear to hardly ever be extremely negative. This may indicate that paid apps tend to be of better quality on average.
The median polarity score for paid apps is a little higher than for free apps, which is consistent with that reading (although the difference between paid and free apps isn't huge).
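
The medians read off the box plot can be confirmed with a `groupby`. This is a minimal sketch on a hypothetical `toy` frame; on the real data the same call would be applied to merged_df.

```python
import pandas as pd

# Hypothetical polarity scores standing in for merged_df
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Sentiment_Polarity": [-0.8, 0.0, 0.2, 0.6, 0.1, 0.3, 0.5],
})

# Review count, median and most negative score for each app type
stats = toy.groupby("Type")["Sentiment_Polarity"].agg(["count", "median", "min"])
print(stats)
```

The count column matters here: a median backed by far fewer reviews is a weaker basis for comparison.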
