Skip to content

Code from the book Fighting Churn With Data

Notifications You must be signed in to change notification settings

DataScienceWorks/fight-churn

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fight-churn

This is code for the (forthcoming) book "Fighting Churn With Data: The Subscribed Institute Guide to Data Driven Customer Engagement". See http://www.fightchurnwithdata.com for more information.

Table of Contents
1 Getting Started
1.1 Prerequisites
1.2 Development Environment Setup
1.3 Data Creation / Loading
2 Running Book Code Listings
2.1 Running a Listing
2.2 Configuring How Listings are Run
3 Metric Calculation Framework
3.1 Event QA
3.2 Batch Metric Calculation
3.3 Metric QA
4 Dataset Export
5 Analysis Framework


These are basic startup/setup instructions that I think should work for most people using a recent Mac. I apologize that so far I have not had a chance to provide more detailed setup instructions for Windows, but I will do so ASAP (and I would welcome contributions in this area.) I also want to apologize in advance because I am neither an expert in PostgreSQL nor an expert in Python, but I am about to give a lot of advice on how to setup and use these technologies - if you find I am not doing things the best way, or just not how you would have done it, please be patient. Same goes for the rudimentary state of some of the code - I'm doing the best I can with the time I've got. If you want to make things better please help help out! :)



Required:

  • Python 3 and the required packages (requirements.txt)
  • PostgreSQL

Recommended:

  • PyCharm (Community Edition)
  • PgAdmin

1.1.1 Python 3

If you need help installing Python 3, you can refer to this page for Mac:

For Windows there are resources here:

(If you are on linux I'm going to assume you know how to install your own python...)

I also recommend installing an Integrated Development Environment like PyCharm. PyCharm Community Edition was used by the author to write and run all the code for the book, so this README will include how to setup PyCharm (this is especially useful if you are new to Python, since an IDE can take care of some of the setup for you...)


1.1.2 PostgreSQL

To install PostgreSQL for Mac following these instructions:

I also recommend installing installing pgAdmin to make it easier to import and export data, and run adhoc queries. Follow the instructions here:

Make sure Posgres is running - here's what it looks like if you installed with PostgresApp on a Mac:

Postgres Running on Mac



Before you can load data or run the code you have to do some setup on your system. If you never have done this before it may seem like a lot of work, and it kind of is, but this amount of setup is routine when you begin to work with a new technology. If already know how to do this sort of thing feel free to igonre my instructions, which are primarily written for beginners to use GUI tools to get up and running.


1.2.1 Creating a Database

The first thing you need to do is connect to your server. You can do that in PgAdmin by control (right) clicking on the root of the Servers tree and selecting Connect

Connect to Server in PgAdmin

A dialog will open. Assuming you are working on a PostgreSQL database installed on your own computer then
in the first tab (General) name your connection localhost, and on the second tab (Connection) enter the address 127.0.0.1 (which is the IP address to connect to a database locally.) You should also enter your user name and password. So your dialog should look like the one below - then hit Save.

Connect to Server in PgAdmin

Next you should create a new database to hold all of the churn data schemas you create. You will probably create multiple schemas as you work on the exmaples in the book and/or your own data so this will help keep these organized. An easy way to create a database is in PgAdmin - right click on the Databases node under localhost in the tree:

Create Database in PgAdmin

And enter the name of the new database (I used churn, but you can use whatever you want - just make the appropriate settings in your environment variable, section 1.2.6 below):

Name the database


1.2.3 Create a PyCharm Project

After you have cloned this repository

  1. launch PyCharm and go to the menu File / New Project
  2. In the file section window, select the folder for the repo. Leave it set to "Create New Virtual environment for htis project"
  3. Click Create
  4. It should say "The directory ... is not empty. Would you like to create a project from existing sources?" Click Yes

Name the database


1.2.4 Setup Python Project

You should create a Python "virtual environment" for the project (I won't try to go into details here, but this allows you to install the packages used for the repo without interfering with anything else on your system).

In PyCharm, select from the menus: Pycharm / Preferences... and go to the area for Project

PyCharm Project Preferences

If you click on the gear wheel to the right of where it says you have the option to add a new one...

PyCharm Add Intepreter

It will look something like this:

PyCharm Interpreter Setting

And after you click Okay it will take several seconds to create the virtual environment, and then it should look like this:

PyCharm After Create

You will add more packages in a minute, but first finish with the project setup by going to the Project Structure section of the preferences:

PyCharm Project Structure

Select each folder that contains python source code, and click on the button Sources (with the blue folder next to it):

PyCharm Select One Source Folder

You should select the following folders:

  • churnalyze/py
  • data-generation/py
  • metric-framework/py
  • examples/py
    • examples/chap5
    • examples/chap6
    • examples/chap7
    • examples/chap8
    • examples/chap9
    • examples/chap10
    • examples/chap11

When you are done your project preferences should look like this:

PyCharm Source Folders Selected

(At the time of creation of these instructions, later chapters have not been written.)


1.2.5 Installing Python Package Requirements

Now that you have created a Python project you can easily install the required packages. Start by opening the file requirements.txt in the root project folder

PyCharm Requirements

Notice that at the top it says: Install requirements - click on that and the click Install in the dialog that comes up. Note that you have to be connected to the internet for this to work, and it will take several minutes for all the packages to be downloaded and installed.


1.2.6 Creating Run Configurations and Setting Database Environment Variables

A lot of the python code depends on access to the database you created for the churn data. You will set environment variables that will hold your database and login information, so that it does not have to be hard coded. The easiset way to do this in PyCharm is to set environment variables when you create a Run Configuration for the executables.

Here is one example of creating a run configuration for one of the programs: If you have not created any Run Configurations yet, you start with the Add Configuration button...

PyCharm Add Config

In the Run/Debug Configuration dialog click the + then Python to make a new python script configuration.

PyCharm Python Config

You will get an empty configuration, and the first thing to do is click on the Folder icon in the script path text box and pick the script path. The first script everyone will need to run is data-generation/py/churndb.py, which creates a schema for an analysis. After selecting this script, your configuration should look like this, with both the script path and the working directory set to wherever the script is on your system:

PyCharm Script Path

The next step is to add the environment variables, by clicking on the Browse button at the end of the environment variables text box:

PyCharm Add Config

That button launches a dialog to add the environment variables. You need to add three:

  1. CHURN_DB : the name of the database for your churn analysis schemas (step 1.2.1 above)
  2. CHURN_DB_USER : the user name to login to the database
  3. CHURN_DB_PASS : the password to login to the database

After setting these, your environment variable dialog should look something like this:

PyCharm Add Config

Select OK and save all the configuration changes. I'll say more about running this script in the next section on Data Loading / Creation.

For creating additional Run configurations, note that PyCharm allows you to duplicate and modify an existing configuration. So when you want to make another script with the same environment variables open the Configuration Dialog, now by clicking on the run configuration drop down then selecting Edit Configuration

PyCharm Add Config

Now choose to duplicate your existing Run configuration:

PyCharm Add Config

This will make another configuration the same as the existing one - you can rename it and reset the path to a different script, keeping the environment variables you already setup.

One last thing: After creating the Run Configuration you actually run a script in PyCharm using the little green arrow next to the Configuration drop down menu...

PyCharm Run Script

(The green bug next to the arrow runs it in debug mode...)


1.2.7 Running From the Command Line

If you are not using PyCharm of course you can run all this code from a terminal shell. You will need to manually setup a Python virtual environment with the required packages, and set a few environment variables in your .bash_profile or .bashrc file (or whatever you use, but I will show this for bash because it seems to be the most common.) If you are running Python on the command line then I'm going to assume you know what you are doing and I will not go over installation of the requirements this way as it is standard (and I provided PyCharm instructions, above, in case you are really new to Python.)

The environment variables are specific to these programs and contain information need to access the databse. You should add a few lines like this:

export CHURN_DB=your_db
export CHURN_DB_USER=your_user
export CHURN_DB_PASS=your_password

Make sure to open a new terminal or source the .bashrc script after making the changes. You will also need to add the code folders you want to run from the project to your PYTHONPATH environement variable; note that because the code (particularly the examples) add code from packages defined in other directories you really need to add these paths (it is not sufficient to simply run the script from the directory it is in.)

The rest of the README is written for people using PyCharm, but you can always run the same commands illustrated from the terminal...



To run any of the code in the book you need to get some data loaded into your database. There are two types of data you need:

  1. Subscriptions (or the equivalent) for purchase of the product
  2. Customer Behavior Events using the product

For more detailed information about what the data has to look like you can see these on-line resources:

Or, see Chapter 2 (for Subscriptions) and Chapter 3 (for Events) in the book. For setup purposes (the focuse here) there are two options for getting this kind of data:

  1. You already have data of your own - this is the option for anyone who already works on an online product or service
  2. You can generate a simulated data set from a random model - this is the option for most students or people doing this for training purposes

Note that unforatunately there is no publicly avaiable real data at this time. Subscription and customer data like this tends to be very sensitive information for the companies that hold it and so far there are no offers of data that can be made publicly available. (If you have such data that you would be willing to make publicly available please contact the author.)


1.3.1 Schema Creation

Regardless of where you get your data from, you need to create the database schema and tables that it to hold it. There is a python script in the folder data-generation for this. If you have been following the instructions above then you have already made a Run Configuration for the script in PyCharm. There's just one more thing to do: name the schema that you will create.

  1. Open the file data-generation/py/churndb.py
  2. Edit the schema name in the file to the name you want
    1. If you are loading your own data, name it whatever you want
    2. If you are generating simulated data, leave it set to the name chunrsim2 which is the default model for simulation
  3. Run the script using PyCharm, as described above

If everything works as planned you should see something like the following output:

/Users/user_name/fight-churn-master/venv/bin/python /Users/user_name/fight-churn-master/data-generation/py/churndb.py
Creating schema test (if not exists)...
Creating table event (if not exists)
Creating table subscription (if not exists)
Creating table event_type (if not exists)
Creating table metric (if not exists)
Creating table metric_name (if not exists)
Creating table active_period (if not exists)
Creating table observation (if not exists)

Process finished with exit code 0

A new schema and tables were created in your PostgreSQL database, which you can confirm by looking in PgAdmin:

PgAdmin Churn Schema Tables

If you made it this far then congratulations! You just ran your first bit of the fight-churn code.


1.3.2 Generating Simulated Data

If you don't have your own data to analyze then you should use the simulator program to create a realistic (enough) simulation of subscription, churn and event data for you to run the book code examples on. Code for generating artificial data to run the code is in the directory data-generation. In the near future there will be a blog post explaining how the simulation works, and the instructions here are limited to simply running the simulation.

  • Create a run configuration for the script data-generation/churnsim.py following the instructions in section 1.2.6.

If this is your first time following these instructions, just run it. You should start to see output like this:

/Users/user_name/fight-churn-master/venv/bin/python /Users/user_name/fight-churn-master/data-generation/py/churnsim.py
Matrix is not positive semi-definite: Multiplying by transpose

Creating 500 initial customers for 2019-01-01 start date
Simulated customer 0: 5 subscription, 10243 events @ 2019-05-21 06:00:01.611085
Simulated customer 1: 5 subscription, 10076 events @ 2019-05-21 06:00:01.805228
Simulated customer 2: 5 subscription, 11501 events @ 2019-05-21 06:00:02.034017
Simulated customer 3: 5 subscription, 11643 events @ 2019-05-21 06:00:02.250260
Simulated customer 4: 5 subscription, 9869 events @ 2019-05-21 06:00:02.433102
...

There will be more like this and the whole process will take a couple of minutes. The program has simulated the subscriptions, behavior and churn of a few hundred customers and inserted them into the database tables subscription and event. (There is no data for churns yet : you will derive that as part of the analysis process described in the book chapters 2 and 4.) You can (and should) confirm the results of the data simulation by querying the database directly:

churn=# select count(*) from churnsim2.subscription;
 count 
-------
  2594
(1 row)

churn=# select count(*) from churnsim2.event;
  count  
---------
 5454701
(1 row)

churn=# 

Yours won't look exactly like that, because its a random simulation and the results are different every time. But it should be a similar overall number. (If you don't know how to launch a SQL prompt on your launch the PostgreSQL app and double click on the database. Mine doesn't actually launch the terminal directly, but it points in the right direction....)

Mac PostgreSQL terminal launch

Great! Now you have the data you need to run the code and learn the techniques in the book.


1.3.3 Loading Your Own Data (If you have it)

If you have your own data that you want to analyze then you need to load it into the subscription and event tables of the churn analysis schema. I will outline the basic steps here, but I'm going to assume that if you work in an organization that has its own churn data to analyze then either you know how to export and import the data already (probably better than I could tell you), or you can get support from someone else in your oganization. That said, for those of you have your own data but don't have a better resource to turn to for how to load it, the process you will follow is:

  • Export the data from your equivalent of a subscription table. If you use Zuora to track your subscriptions, this is actually the rate_plan_charge object. If you use Salesforce, this is may be the Opportunity table (where the opportunity was won.) Each subscription record must have an account identifier, a start date and a price, quoted in Monthly Recurring Revenue (MRR). Terminated (churned) subscriptions must have the end date.
    If you have other fields in the churn analysis subscription table (product name, units and quantity, billing frequency) then you can export these as well, or else export a NULL in those positions (which will make the import simpler.) Follow the order in the churn analysis subscription table:

    1. id : integer
    2. account_id : integer
    3. product : text
    4. start_date : date
    5. end_date : date
    6. mrr : double
    7. quantity : double
    8. units : text
    9. bill_period_months : integer
  • Export the data from your equivalent of an event table, and from whatever table names the event types (assuming there is a code name table separate from the event table. Every event must have an account identifier, a timestamp, and a type. The required fields for the events are:

    1. id : integer
    2. event_time : timestamp
    3. event_type_id : integer
  • You can and should also export whatever additional data fields you have for your events that you think are important (amounts viewed or consumed, prices, content categories, etc.) Start by updating the the event table schema in data-generation/schema/create_event.sql and recreate the table including your fields (You will have to drop the existing table that you created in step 1.3.1.)

If you know how to import and export PostgreSQL data using the command line then have it. Personally I did it with the PgAdmin Import/Export GUI, which you launch by command clicking (right clicking) with your mouse on the listing for the table you want to import into:

Mac PostgreSQL terminal launch

(top)




You can run the book listings with the python script listings/run_churn_listing.py.

  • This script reads SQL and Python source code, and either binds variables (for SQL) or passes parameters (for Python) and executes the code.
  • The code (listings from the book) are in the folders chapN under the listings directory.
  • Exactly what listings to run and what parametes are used are set in JSON files in the directory listings/conf.

So you can use this utility as you go through the book - if you want to actually run the code from any listing on your local database, this is the easiest way to do it.

(top)


Start by making a Run Configuration for the script listings/run_churn_listing.py, following the instructions in Section 1.2.6.

The script is preset to run the first code listing, listing 2.1 from chapter 2, for the simulated data set churnsim2. If you have created a simulated dataset named churnsim2 as described in section 1.3.2 you can run your configuration as is and you should see a result like this:

/Users/user_name/fight-churn/venv/bin/python /Users/user_name/fight-churn/listings/py/run_churn_listing.py

Running chap2 listing listing_2_1_net_retention
SQL:
----------
set search_path = 'churnsim2'; with 
date_range as (    
	select  '2019-03-01'::date as start_date, '2019-04-01'::date as end_date
), 
start_accounts as    
(
	select  account_id, sum (mrr) as total_mrr    
	from subscription s inner join date_range d on
		s.start_date <= d.start_date    
		and (s.end_date > d.start_date or s.end_date is null)
	group by account_id    
),
end_accounts as    
(
	select account_id, sum(mrr) as total_mrr      
	from subscription s inner join date_range d on
		s.start_date <= d.end_date    
		and (s.end_date > d.end_date or s.end_date is null)
	group by account_id    
), 
retained_accounts as     
(
	select s.account_id, sum(e.total_mrr) as total_mrr     
	from start_accounts s 
	inner join end_accounts e on s.account_id=e.account_id  
	group by s.account_id    
),
start_mrr as (    
	select sum (start_accounts.total_mrr) as start_mrr from start_accounts
), 
retain_mrr as (    
	select sum(retained_accounts.total_mrr) as retain_mrr 
from retained_accounts
)
select 
	retain_mrr /start_mrr as net_mrr_retention_rate,    
	1.0 - retain_mrr /start_mrr as net_mrr_churn_rate,    
	start_mrr,    
	retain_mrr
from start_mrr, retain_mrr

----------
RESULT:
Record(net_mrr_retention_rate=0.96039603960396, net_mrr_churn_rate=0.0396039603960396, start_mrr=1008.99, retain_mrr=969.030000000001)

Process finished with exit code 0

The first line shows you what chapter and listing are being run. Next it shows the SQL being run (this is a SQL listing). The final line prints out the result - the net retention rate, calculated with the SQL. Because the data was randomly simulated your result on the last line won't be exactly the same as that one, but it should be similar.

One way you can change what the script will run for by simply editing the constants in the "main" portion at the bottom of the file. Note these variables:

  • schema : the name of the churn data schema to run on
  • chapter : the chapter to run a listing from
  • listing : the number of the listing to run

The most common thing you will do is run a different listing on the same schema and chapter, so you would edit this line:

listing=1

to whatever listing you want. So for example, to run listing 2.2 you can change the variable to:

listing=2

Alternatively, the script accepts command line parameters. To run this way, provide the following three parameters (all required) and it will use those instead of the hard coded constants:

  1. The first command line parameter is the schema
  2. The second command line parameter is the chapter number
  3. The third command line parameter is the listing number

Note: If you want to run listings from the command line, you also need to setup your Python virtual environment and add the listing code paths to your PYTHONPATH (meaning, set them up the old fashioned way - not using PyCharm.)
Command line setup is not covered in this README at this time (but if you want to add such instructions please contribute!)

You can also set command line arguments in PyCharm, in the Run Configurations setup dialog. (But IMHO changing the command line arguments in the configuration dialog is more tedious than simply editing them script for this type of use...)

To see what listings are available to run, peruse the code in the chapter folders below listing. But note that your schema must be configured to run each listing, as described in the next section. The churnsim2 (default) schema has entries created for it already, but if you want to run the code on your own data you will need to enter your own configuration.

(top)


Your schema must be configured to run each listing. The churnsim2 (default) schema has entries created for it already, but if you want to run the code on your own data you will need to enter your own configuration. Also if you want to change how the listings are run on churnsim2 this section will explain how to do it.

The configuration files are all in folder listings/conf and each schema has a configuration file that must have a name that is <schema_name>_listings.json. So the configuration for the churnsim2 data set is in the file churnsim2_listings.json. The configuration his a JSON with the following structure:

  1. The top level are keys for the chapters, "chap2", "chap3", etc.
  2. The next level is a set of objects representing each listing in the chapter
    • The key for each object is the listing name, beginning with "listing__"
    • The values for each object are parameters that apply to running that listing.
      There are a few possible types of parameters.
      1. Variables which are substituted in SQL, or passed as values to python functions
      2. Control parameters (described more below)
  3. There is a special object of chapter default parameters in each chapter, with the key params. The defaults for the chapter will automatically be applied to every listing - parameters specified in each listing are override the defaults.

As mentioned, there are two special control parameters which are are not parameters of the listing:

  1. type : must be either sql or python and controls how the listing is executed
  2. mode : controls how the result of the program is handled:
    • mode=run : A SQL expected to return no results (like an insert)
    • mode=one : A SQL expected to return one result, print it
    • mode=top : A SQL expected to return many results, print the first 5 lines
    • mode=save: A SQL expected to return many results, save the result in a csv file

Below is an example of the beginning of the listing configuration for the churnsim2 simulated data set:

	"chap2" : {
		"params" : {
			"%from_yyyy-mm-dd": "2019-03-01",
			"%to_yyyy-mm-dd": "2019-04-01",
			"mode" : "one",
			"type" : "sql"
		},
		"listing_2_1_net_retention" : {
		},
		"listing_2_2_churn_rate" : {
		},
		...

The following summarizes the configuration:

  • listing_2_1_net_retention and listing_2_2_churn_rate are enabled
  • Both listing will run with the parameters shown in the params section:
    • The strings for start and from date and to date in the queries will be set as shown
    • The listings are SQL
    • The listings will print one result

(More to come on running Python listings when Chapter 5 is released...)

(top)



Running the code listing by listing is useful for when you are learning the techniques described the book. But once you have mastered the basics and you are trying to actually analyze churn for a live product or service you will need a more efficient way to operate. The code described in this section is a partially automated framework for calculating metrics on events, and performing quality assurance (QA) checks associated with this process. So these are all the topics described in chapter 3, but this is a more scalable system for applying the techniques to a large data set than the example by example framework described in the last section.


As described in the book, the first step in calculating metrics should actually be running some tests on your event data. This section of the code contains a script that will automatically run generate a QA plot for each event type, as described in chapter 3.

The script that does is metric-framework/event_qa.py. To run it, make a Run Configuration as described in Section 1.2.6. If you are running for the default simulated data set, chursim2 then the script is ready to run - just launch it. You should see output like the following:

/Users/user_name/fight-churn/venv/bin/python /Users/user_name/fight-churn/metric-framework/py/event_qa.py
Checking event post
Checking event new_friend
Checking event like
Checking event view_feed
Checking event photo
Checking event message
Checking event unfriend
Saving results to ../../../fight-churn-output/churnsim2/

Process finished with exit code 0

The output is a set of of png images shown in the folder indicated by the relative path on the last line: It is a folder nmaed fight-churn-output that is located adjacent to the repository folder. The figures will look like this:

Event QA Output

You can also configure the QA to run on numeric event properties, if you events have any. See the event_properties field in the metric configuration described in section 3.2.2 (below).

Note that the metric QA uses the metric configuration to run - it takes the start and end dates for the QA from the configuration. So if you are running on your own data you need to create the metric configuration (described in detail in the next section) before you can run this QA. (For the defaulted simulated data a configuration is already there for you.)

(top)


After you are satisfied with the event data that has been loaded and is correct you can calculate metrics in batch jobs. There are three parts to the metric framework:

  1. Metric calculation SQL's in the metric-framework/sql folder
  2. Metric configurations in JSONfiles in the metric-framework/conf folder
  3. An execution program in python, metric-framework/metric_calc.py

Here is a high level overview of how it works:

  • The SQL's calculate the metrics desribed in the book, but they have to be configured with things like the type name for an event or the time window for a measurement.
  • The configuration of what parameters to bind in the SQL's are stored in JSON in the conf directorynd
  • Each database schema has its own configuration file, so if you are using the default simulated data churnsim2 then the configuration for all of the metrics will be in conf/churnsim2_metrics.json. If you have loaded your own data in a schema called saasco you need to make and fill out a configuration conf/saasco_metrics.json, etc.
  • Each metric that will be calculated must have its own entry in the JSON for its schema, and there is also a block of start and end date paramerters that are common to al the metrics.
  • The Python program metric-framework/metric_calc.py reads the configuration for the schema passed as a parameter and calculates the metrics.

The following sections provide details about each part of the framework.

3.2.1 Metric SQL's

SQL's to calculate metrics are in the metric-framework/sql folder. These are the SQL's described in Chapter 3 (Measuring Customers) and Chapter 7 (Advanced Behavioral Metrics).

  • The SQL's all have bind parameters prefaced by %

For example, the metric for counting the number of events or aggregating an event property is:

set search_path = '%schema';
with date_vals AS (
  select i::timestamp as metric_date from generate_series('%from_date', '%to_date', '7 day'::interval) i
)

insert into metric (account_id,metric_time,metric_name_id,metric_value)

select account_id, metric_date,%metric_name_id, (%quotwin)::float/(%measwin::float) *  %fun
from event e inner join date_vals d
on e.event_time <= metric_date and e.event_time > metric_date - interval '%measwin week'
where e.event_type_id=%event_id
group by account_id, metric_date
order by account_id, metric_date;

The bind variables in the metric are:

  1. %schema
  2. %from_date
  3. %to_date
  4. %event_id
  5. %metric_name_id
  6. %quotwin
  7. %measwin
  8. %fun

These will be substituted by the program based on the configurations when it is run. Details of what the SQL does and the meaning of bind parameters are in the related sections of chapter 3 and 7 in the book.

At the time of this writing the following metrics are available:

  1. metric_multi_event_tenscale.sql
  2. metric_over_period.sql
  3. metric_over_period_tenscale.sql
  4. metric_ratio.sql
  5. metric_subscription_billperiod.sql
  6. metric_subscription_mrr.sql
  7. metric_subscription_product_quantity.sql
  8. metric_subscription_unit_quantity.sql
  9. metric_tenure.sql

3.2.2 Configuring Metrics

To calculate metrics in batch jobs you need to configure the metrics specifically for your event data in a configuration file for your schema. For a simple example you can take a look at metric-framework/conf/churnsim2_metrics.json which is setup to run for the default simulated data file. The configuration is a set of named objects where the key is the name of the metric that will be created (the name to be entered in the table metric_name in the database) and the value is an object containing key/value pairs that are the bind variables for the SQL. Here is an example from the churnsim2 configuration:

    "date_range": {
        "from_date" :"2019-02-04",
        "to_date" : "2019-05-06"
    },,
	"event_properties" : [],
	"account_tenure": {
	    "sql" : "metric_tenure"
	},
	"post_per_month": {
		"event_id":  0,
		"fun" : "count(*)",
		"measwin" : 8,
		"quotwin" : 4,
		"minten" : 2,
		"sql" : "metric_over_period_tenscale"
	}

The entire configuration is a JSON of name/value pairs that is loaded as a dictionary in Python. Each name maps to another object which contains additional name/value pairs.

The first object in the configuration is a named date_range which contains the start and end dates for the metrics to be calculated as from_date and to_date.

The second object is a list of properties that are in the event table - these will be included in the Event QA (described above) if there are any. (In the future there will be a simulated data set with event properties, but not yet...)

The third object is an example of a no-parameter metric, account_tenure (described in Chapter 3 of the book). The only parameter in the object is the sql parmeter which indicates which SQL file (in metric-framework/sql) is run to make the metric calculation.

The second object in the example configuration is a metric that has multiple parameters. This configuration will create a metric named "post_per_month". The sql parameter in the configuration object indicates which specific SQL to execute in order to create the metric : in this case it is the sql metric-framework/sqls/metric_over_period_tenscale.sql.
The other fields in the configuration object will all be matched to bind variables in the SQL.

Each metric you create requires one configuration and the easiest way to create them is copy/paste and then modify the existing configurations. For more examples of metric configurations (the churnsim2 only has one example repeated) look at the other sample file metric-framework/conf/x_metrics.json. (Unless you really need to make a lot - then its not so easy to make them manually and you should spend some time automating that process as well. Maybe that will be a future feature in this framework...)


3.2.3 Running Metric Calculations

Once you have configured metrics (or if you are using the default churnsim2) you calculate the metrics with the metric-framework/metric_calc.py. You create a configuration for this script as described in Section 1.2.6 of this README. If you run the default configuration you should see this in your terminal (or the "Run" window in PyCharm):

/Users/cgold/projects/ChurnBook/fight-churn/venv/bin/python /Users/cgold/projects/ChurnBook/fight-churn/metric-framework/py/metric_calc.py
TRUNCATING *Metrics* in schema -> churnsim2 <-  ...
are you sure? (enter churnsim2 to proceed)

The program asks because if there were any metrics already saved, this program would wipe them out. If this is the first time there is no data but the script is not smart enough to check - just enter the schema name as shown and hit <ENTER> and you should see the following, although it will take a bit of time: each metric will take some time to calculate:

/Users/cgold/projects/ChurnBook/fight-churn/venv/bin/python /Users/cgold/projects/ChurnBook/fight-churn/metric-framework/py/metric_calc.py
TRUNCATING *Metrics* in schema -> churnsim2 <-  ...
are you sure? (enter churnsim2 to proceed) churnsim2
Inserting metric churnsim2.account_tenure as id 0
set search_path = 'churnsim2'; with RECURSIVE date_vals AS (   select i::timestamp as metric_date from generate_series('2019-02-04', '2019-05-06', '7 day'::interval) i ), earlier_starts AS ( 	select account_id, metric_date, start_date 	from subscription inner join date_vals 		on start_date <= metric_date 		and (end_date > metric_date or end_date is null)  	UNION  	select s.account_id, metric_date, s.start_date 	from subscription s inner join earlier_starts e 		on s.account_id=e.account_id 		and s.start_date < e.start_date 		and s.end_date >= (e.start_date-31)  )  insert into metric (account_id,metric_time,metric_name_id,metric_value)   SELECT account_id, metric_date,0, extract(days from metric_date-min(start_date)) FROM earlier_starts group by account_id, metric_date order by account_id, metric_date;  
Inserting metric churnsim2.post_per_month as id 1
set search_path = 'churnsim2';  insert into metric (account_id,metric_time,metric_name_id,metric_value)  select m.account_id, metric_time, 1,     (4/ least(8,m.metric_value))  *  count(*) from event e inner join metric m     on m.account_id = e.account_id     	and event_time <= metric_time and event_time >  metric_time-interval '8 week' where e.event_type_id=0     and metric_name_id = 0     and metric_value >= 2     and metric_time between '2019-02-04'::timestamp and '2019-05-06'::timestamp group by m.account_id, metric_time, metric_value order by m.account_id, metric_time, metric_value; 
Inserting metric churnsim2.new_friend_per_month as id 2
set search_path = 'churnsim2';  insert into metric (account_id,metric_time,metric_name_id,metric_value)  select m.account_id, metric_time, 2,     (4/ least(8,m.metric_value))  *  count(*) from event e inner join metric m     on m.account_id = e.account_id     	and event_time <= metric_time and event_time >  metric_time-interval '8 week' where e.event_type_id=1     and metric_name_id = 0     and metric_value >= 2     and metric_time between '2019-02-04'::timestamp and '2019-05-06'::timestamp group by m.account_id, metric_time, metric_value order by m.account_id, metric_time, metric_value; 
Inserting metric churnsim2.like_per_month as id 3
set search_path = 'churnsim2';  insert into metric (account_id,metric_time,metric_name_id,metric_value)  select m.account_id, metric_time, 3,     (4/ least(8,m.metric_value))  *  count(*) from event e inner join metric m     on m.account_id = e.account_id     	and event_time <= metric_time and event_time >  metric_time-interval '8 week' where e.event_type_id=2     and metric_name_id = 0     and metric_value >= 2     and metric_time between '2019-02-04'::timestamp and '2019-05-06'::timestamp group by m.account_id, metric_time, metric_value order by m.account_id, metric_time, metric_value; 
Inserting metric churnsim2.view_feed_per_month as id 4
set search_path = 'churnsim2';  insert into metric (account_id,metric_time,metric_name_id,metric_value)  select m.account_id, metric_time, 4,     (4/ least(8,m.metric_value))  *  count(*) from event e inner join metric m     on m.account_id = e.account_id     	and event_time <= metric_time and event_time >  metric_time-interval '8 week' where e.event_type_id=3     and metric_name_id = 0     and metric_value >= 2     and metric_time between '2019-02-04'::timestamp and '2019-05-06'::timestamp group by m.account_id, metric_time, metric_value order by m.account_id, metric_time, metric_value; 
Inserting metric churnsim2.photo_per_month as id 5
set search_path = 'churnsim2';  insert into metric (account_id,metric_time,metric_name_id,metric_value)  select m.account_id, metric_time, 5,     (4/ least(8,m.metric_value))  *  count(*) from event e inner join metric m     on m.account_id = e.account_id     	and event_time <= metric_time and event_time >  metric_time-interval '8 week' where e.event_type_id=4     and metric_name_id = 0     and metric_value >= 2     and metric_time between '2019-02-04'::timestamp and '2019-05-06'::timestamp group by m.account_id, metric_time, metric_value order by m.account_id, metric_time, metric_value; 
Inserting metric churnsim2.message_per_month as id 6
set search_path = 'churnsim2';  insert into metric (account_id,metric_time,metric_name_id,metric_value)  select m.account_id, metric_time, 6,     (4/ least(8,m.metric_value))  *  count(*) from event e inner join metric m     on m.account_id = e.account_id     	and event_time <= metric_time and event_time >  metric_time-interval '8 week' where e.event_type_id=5     and metric_name_id = 0     and metric_value >= 2     and metric_time between '2019-02-04'::timestamp and '2019-05-06'::timestamp group by m.account_id, metric_time, metric_value order by m.account_id, metric_time, metric_value; 
Inserting metric churnsim2.unfriend_per_month as id 7
set search_path = 'churnsim2';  insert into metric (account_id,metric_time,metric_name_id,metric_value)  select m.account_id, metric_time, 7,     (4/ least(8,m.metric_value))  *  count(*) from event e inner join metric m     on m.account_id = e.account_id     	and event_time <= metric_time and event_time >  metric_time-interval '8 week' where e.event_type_id=6     and metric_name_id = 0     and metric_value >= 2     and metric_time between '2019-02-04'::timestamp and '2019-05-06'::timestamp group by m.account_id, metric_time, metric_value order by m.account_id, metric_time, metric_value; 

Process finished with exit code 0

The program prints out each metric as it inserts them into the metric name table and runs the SQL to insert the results into the metric table.

The default of the program is calculate all the metrics for the churnsim2 schema, but you can control this behavior with a constant saved at the top of the file metric_calc.py :

schema='churnsim2'

run_mets=None
# run_mets=['account_tenure','post_per_month']

If you uncomment the line run_mets=[... you can specify any list of metrics and when the script runs it will only calculate those metrics. This is useful because you normally iterate and can create additional metrics after investigating a first group.

At this point you should check on the results of the metric calculation you have just run. You could query the database directly, or you can use the script described in the next section.

(top)


The script metric-framework/metric_qa.py will create plots for metric QA (Quality Assurance) like the ones described in Chapter 3 of the book. To get started, just make a new Run Configuration (Section 1.2.6 of this README) and if you are using the default churnsim2 data set just run it. You should see the following printout:

/Users/cgold/projects/ChurnBook/fight-churn/venv/bin/python /Users/cgold/projects/ChurnBook/fight-churn/metric-framework/py/metric_qa.py
Checking metric churnsim2.account_tenure
Checking metric churnsim2.post_per_month
Checking metric churnsim2.new_friend_per_month
Checking metric churnsim2.like_per_month
Checking metric churnsim2.view_feed_per_month
Checking metric churnsim2.photo_per_month
Checking metric churnsim2.message_per_month
Checking metric churnsim2.unfriend_per_month
Saving results to ../../../fight-churn-output/churnsim2/

Process finished with exit code 0

As described on the last line of the printout, results are saved in a directory named fight-churn-output which is alongside the fight-churn source directory. (The confusing printout is the relative path from the script directory.) Each of the plots produced by the script for the default churnsim2 simulation data will look roughly like this:

Note that the metric QA uses the metric configuration to run - it only runs on metrics that are listed in the configuration and takes the start and end dates for the QA from the configuration.

Metric QA Output

(top)



Chapter four of the book explains how to create a churn dataset from a database schema containing subscriptions and metrics. You can run each step of the process with the framework for running listings (described above). But there are a few problems with doing it that way when you are iterating on different versions of a dataset for a real project.

  1. You want to run all the steps at once, not one at a time
  2. The code does not clean out the results of prior runs, so you need to manually clear out the tables between executations.
  3. The final select statement in the listings is hard coded to the metrics from one particular simualted data set

The code in the folder dataset-export handles these problems by making a single script that does everything. The python script is dataset-export/observe_churn.py. To get started, just make a new Run Configuration (Section 1.2.6 of this README) and if you are using the default churnsim2 data set just run it. The script prints out the SQL from each step, just like the script that runs the book SQL listings. When it is done it should print that it is saving the dataset:

...
Saving: ../../../fight-churn-output/churnsim2/churnsim2_dataset.csv

If you want to run this for your own dataset it accepts command line arguments to change the defaults:

  • --schema : The name of the schema to run against
  • --frdt : The earliest date to export, in the format YYYY-MM-DD
  • --todt : The latest date to export, in the format YYYY-MM-DD
  • --interval : The interval between the metric observations. Typically 7 days (if you followed the guidelines in the book.) Note that this is just telling the code about the metrics.

ALSO NOTE: You must set up the chapter 4 listing configurations (as described above) to use this! Thats because the script actually imports and runs the listings 4.1, 4.2 and 4.4 as they are written. The observe-churn.py only differs from the listing program in the final step where it does the export.



(top)



Coming Soon! (after Chapter 5 is written...)



Authors

License

This project is licensed under the MIT License - see the LICENSE.md file for details

(top)

About

Code from the book Fighting Churn With Data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 88.7%
  • TSQL 11.3%