Module Name | ExceltoPandas |
---|---|
Author | Rory Creedon (rcreedon@poverty-action.org) |
Purpose | Library of classes and functions for converting a multi-worksheet ExcelFile to pandas using a descriptive workbook structure file such as that created by the WorkbookFunctions module |
Use | Intended for use in IPython, in particular an IPython notebook (although it will work in any Python terminal). |
Dependencies | WorkbookFunctions |
#Introduction
This library of functions was developed to meet the needs of the RMG Data Team, working primarily with Quality and Absence data generated by the RMG projects in Bangladesh.

The module provides tools for creating a single `DataFrame` from a multi-worksheet `ExcelFile`. The tools require that a descriptive "workbook structure" object, such as that created when using WorkbookFunctions, be passed to the functions in order to extract data from the `ExcelFile`.

The workflow and a description of the syntax are provided here.
##Importing ExceltoPandas
In this document the syntax will be described as though the module were imported according to the following convention:

```python
import ExceltoPandas as etp
```
##Retrieve Line Codes
###Purpose and Information
The data that are supplied by the factory need to be anonymised and coded before they can be used in analysis. In particular, in the data sets we are producing, the factory name will always be replaced by a factory code. Additionally, all line numbers/names will be replaced with codes. These codes have been used in the survey work to identify lines, and so they are the key to tying all the different types of data together.
###Workflow and Syntax
Line codes are prepared by the field team in an Excel format. Before these line codes can be used in this process, that file needs to be converted to a .csv file with the following structure:
fact_code | unit_floor | line | line_code | s_line |
---|---|---|---|---|
1005 | 2 | 1 | 100501 | 1 |
1005 | 2 | 2 | 100502 | 1 |
1005 | 2 | 3 | 100503 | 2 |
1005 | 2 | 4 | 100504 | 1 |
1005 | 2 | 5 | 100505 | 2 |
1005 | 2 | 6 | 100506 | 1 |
1005 | 2 | 7 | 100507 | 1 |
1005 | 2 | 8 | 100508 | 1 |
1005 | 2 | 9 | 100509 | 2 |
- fact_code : the code assigned to the factory
- unit_floor : the unit or floor that each line is located on
- line : the line number/name as referred to by the factory
- line_code : the line code assigned to each line
- s_line : an indicator of whether the line is a sample line: 1 = participating line, 2 = non-participating line
In order to load a csv in this format into your workspace, use the `ExceltoPandas.get_line_codes()` function. The syntax is as follows:

```python
etp.get_line_codes(codes_file_path)
```
codes_file_path | string of path to .csv file |
---|---|
Returns | `DataFrame` of line code data. |
An example call might look like this:

```python
import os

os.chdir(r'path/to/folder')
LC_DF = etp.get_line_codes('line_codes.csv')
LC_DF
```
Note that the function does not check the structure of the .csv file (except that it will raise an exception if any cells are blank/missing), so it is up to the operative to ensure that the data are in the right structure. The columns are named automatically to ensure that they can be used in subsequent functions without the user having to specify column names as function arguments.
The 'line' number/name in the imported csv file will be used as the merge key when merging the line code information with the data from the `pd.ExcelFile` type object that will be created later. If the line names/numbers in the data are not as in the line_code sheet, then it is generally better to fix that issue directly in the data by using some function in DataNitro. However, if this is not practical, then it is also possible to hold off on the merging until after the DataFrames have been concatenated. For more on this see below.
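To illustrate why matching line values matter, here is a minimal sketch of the kind of 'left' merge performed later. Only `LC_DF` from the example above is taken as given; the data frame and its column values are hypothetical:

```python
import pandas as pd

# Hypothetical data as read from a factory sheet; the line column holds the
# same string values as the 'line' column of LC_DF where the names match.
data = pd.DataFrame({'line': ['1', '2', '99'],
                     'output_today': [500, 450, 300]})

# A 'left' merge keeps every data row; a row whose line value has no match in
# the line codes (here '99') gets NaN for line_code and would later be dropped.
merged = pd.merge(data, LC_DF, how='left', on='line')
```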
##Retrieve Workbook Structure
###Purpose and Information
The workbook structure object is a .json file that was created using the WorkbookFunctions module. For each sheet of data in the excel workbook that will be worked with, this .json describes the start row, the end row, and the date for that sheet. This structure is used to iterate through the workbook and select only the data which fall between the start and end row, and the date is inserted. In order to retrieve the workbook structure from the .json file, use the `ExceltoPandas.get_workbook_structure()` function.
###Workflow and Syntax
The syntax for the function is as follows:

```python
etp.get_workbook_structure(json_file_path)
```

json_file_path | string of path to .json file |
---|---|
Returns | dict |
An example call might look like this:

```python
ws = etp.get_workbook_structure(r'path/to/workbook_structure.json')
ws
```
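For orientation, the object retrieved this way is just a dict keyed by sheet name. A sketch of loading one directly is shown below; note that the key names inside each entry are an assumption for illustration, as the exact layout is set by WorkbookFunctions:

```python
import json

# Load the structure file directly; etp.get_workbook_structure presumably
# wraps something along these lines.
with open('workbook_structure.json') as f:
    ws = json.load(f)

# A plausible (assumed) shape: one entry per sheet giving its start row,
# end row and date, e.g.
# {"Sheet1": {"start_row": 4, "end_row": 52, "date": "01/06/2014"}, ...}
```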
##Creating a Workbook Checker Object
###Purpose and Information
Before concatenating the workbook it is important to understand whether it conforms to the structure we believe it to have. For example, if we believe that there is just one observation per line per sheet, then we need to confirm this, and confirm that there are not multiple observations. For this purpose and others, ExceltoPandas provides the `ExceltoPandas.workbook_checker` class object.
###Workflow and Syntax
####Creating an etp.workbook_checker object

The syntax to create a workbook_checker object is:

```python
etp.workbook_checker(wkbk_file_path, line_codes, workbook_structure)
```
wkbk_file_path | string path to the compiled excel workbook |
---|---|
line_codes | DataFrame as generated by the etp.get_line_codes() function |
workbook_structure | dict as generated by the etp.get_workbook_structure() function |
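The method examples later in this section call the object under the name workbook_checker. A hypothetical instantiation, reusing `LC_DF` and `ws` from the earlier examples, might look like this:

```python
# Illustrative only: the instance is named to match the calls shown below
# (note that this shadows the class name); 'compiled.xls' is a placeholder.
workbook_checker = etp.workbook_checker('compiled.xls', LC_DF, ws)
```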
Attributes

The attributes of the class are as follows:

- line_codes - line codes DataFrame
- line_vals - list of line values from the line_codes DataFrame
- All_DFs - dict of sheet name keys and associated DataFrame values
- DFs - list of DataFrame names
- col_names - list of column names from the first DataFrame in All_DFs
The workbook checker object internally uses a `pd.ExcelFile` object to read in the excel workbook.
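As a sketch of what that implies (illustrative, not the module's source), the All_DFs attribute could be built like this:

```python
import pandas as pd

# Read the workbook once, then parse each sheet into its own DataFrame,
# keyed by sheet name, exactly as All_DFs is described above.
xl = pd.ExcelFile('compiled.xls')
All_DFs = {name: xl.parse(name) for name in xl.sheet_names}
```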
####Checking Presence of Lines
In theory, every line in the line_codes data should exist at least once on every sheet of data provided by the factory, and hence in the compiled workbook. If there are instances where a line is not present, it is possible that the line name was misentered, or that it has been merged with another line, or any number of things. Ultimately we will drop observations from the data that do not have valid line codes, and in order to do that we have to merge the line code data. The data will only merge when the line data in the underlying data frames contain the same values as those in the line_codes data. Therefore, if the merge is going to fail for certain lines on certain days, we want to be sure this is because the data for that line are genuinely missing, not because of a naming error etc. In order to assist in checking this, the `workbook_checker.check_single_lines()` method is available. This will check that for every line in the line_codes data there is at least one observation in every DataFrame stored in All_DFs. The syntax is as follows:

```python
etp.workbook_checker.check_single_lines(line_col_ref, numerical = True)
```
line_col_ref | string that identifies the column in the DataFrames that contains the line number/name |
---|---|
numerical | boolean. If True then the values in the DataFrames are converted to string integers where possible. This makes sure that they can be merged with the line_codes data, as those are imported as strings. |
Returns | dict with sheet name keys and values that are the line values present in the line_codes but missing from that sheet. |
The user should return to the original Excel file to check any errors identified. If changes need to be made, make them in the underlying Excel workbook, and once all other checks have been completed and errors rectified, re-run the ExceltoPandas process from the top to confirm all errors have been satisfactorily dealt with.
An example call might look like this:

```python
single_dict = workbook_checker.check_single_lines('line_no.')
single_dict
```
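The logic behind the method can be sketched roughly as follows (illustrative only; 'line_no.' is the hypothetical column name from the example above):

```python
# For each sheet, report any line in line_vals with no observation at all.
missing = {}
for name, df in workbook_checker.All_DFs.items():
    present = set(df['line_no.'].dropna().astype(str))
    gaps = [line for line in workbook_checker.line_vals if line not in present]
    if gaps:
        missing[name] = gaps
```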
Note that you can check which column name to pass by looking at the `workbook_checker.col_names` attribute.
####Checking for Multiple Instances of Lines
If we believe that there is only one observation per line, per day, then we should confirm that this is the case. To assist in this, ExceltoPandas provides the `workbook_checker.get_multiple_lines()` method. The syntax is as follows:

```python
etp.workbook_checker.get_multiple_lines(line_col_ref, numerical = True)
```
line_col_ref | string that identifies the column in the DataFrames that contains the line number/name |
---|---|
numerical | boolean. If True then the values in the DataFrames are converted to string integers where possible. This makes sure that they can be merged with the line_codes data, as those are imported as strings. |
Returns | dict with sheet name keys and values that are the line values from the line_codes that appear multiple times on the respective sheet. |
The user should return to the original Excel file to check any errors identified.
An example call might look like this:

```python
multi_dict = workbook_checker.get_multiple_lines('line_no.')
multi_dict
```
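Behind the scenes, the check amounts to counting duplicates, roughly like this (illustrative; 'Sheet1' and 'line_no.' are the hypothetical sheet and column names):

```python
# value_counts makes the duplicated line values on a sheet easy to see.
counts = workbook_checker.All_DFs['Sheet1']['line_no.'].astype(str).value_counts()
duplicated = counts[counts > 1].index.tolist()
```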
####Checking for Unusual Lines
We might find that there are no lines missing and that there are no multiple lines, but that does not mean that everything is as we think it is. For example, some line may have been entered correctly once, but incorrectly a second time. We need to check for this type of error by identifying whether there are any values of line in the DataFrames that we would not expect to find there. To assist in this process, ExceltoPandas provides the `workbook_checker.check_unusual_lines()` method. The syntax is as follows:

```python
etp.workbook_checker.check_unusual_lines(line_col_ref, other_lines, numerical = True)
```
line_col_ref | string that identifies the column in the DataFrames that contains the line number/name |
---|---|
other_lines | list of lines we expect in the data that are not in the line_codes data. |
numerical | boolean. If True then the values in the DataFrames are converted to string integers where possible. This makes sure that they can be merged with the line_codes data, as those are imported as strings. |
Returns | dict with keys that are DataFrame names as per self.All_DFs, and values that are lists of line names/numbers that are present in the DataFrame but not in either self.line_vals or the list of other_lines. |
Some explanation is needed here. The line_codes data may not contain data for every line in the factory. If lines are neither participating nor non-participating lines, then they are excluded from the final data set. However, those lines will clearly still exist in the data. In such circumstances, an 'unusual line' is one that is neither in the line_codes data nor in the list of other_lines passed as an argument to the method. If the line_codes data should contain all lines in the factory data, then pass an empty list as the other_lines argument.
The user should return to the original Excel file to check any errors identified.
An example call might look like this:

```python
unusual_dict = workbook_checker.check_unusual_lines('line_no.', [12, 13, 14, 15])
unusual_dict
```
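The underlying test is a set difference in the opposite direction to check_single_lines(), sketched below (illustrative; the sheet and column names are the hypothetical ones from the examples):

```python
# Anything on the sheet that is neither a coded line nor an expected extra
# line is 'unusual' and worth checking in the original Excel file.
expected = set(workbook_checker.line_vals) | set(str(x) for x in [12, 13, 14, 15])
found = set(workbook_checker.All_DFs['Sheet1']['line_no.'].dropna().astype(str))
unusual = found - expected
```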
##Workbook Concatenation
###Purpose and Information
Once the workbook has been checked for inconsistencies, it can be concatenated into a single `DataFrame`. To assist in this process, ExceltoPandas provides the `workbook_concatenator` class object.
###Workflow and Syntax
####Creating an etp.workbook_concatenator object

The syntax for creating a workbook_concatenator object is as follows:

```python
etp.workbook_concatenator(wkbk_file_path, line_codes, workbook_structure, line_col_ref, numerical = True, merge = True)
```
wkbk_file_path | string path to the compiled excel workbook |
---|---|
line_codes | DataFrame as generated by the etp.get_line_codes() function |
workbook_structure | dict as generated by the etp.get_workbook_structure() function |
line_col_ref | string that identifies the column in the DataFrames that contains the line number/name |
numerical | boolean. If True then the values in the DataFrames are converted to string integers where possible. This makes sure that they can be merged with the line_codes data, as those are imported as strings. |
merge | boolean. If merge is True then the data in All_DFs will be merged with the line_code data on the line_col_ref in a 'left' merge. If it is thought that this is not appropriate then set merge to False. |
It should be possible to merge the data, as issues in line names etc. should be sorted out at the excel workbook level. In general, merge should always be True unless there is some other very good reason not to.

When the object is created, it effectively takes all the DataFrames from the workbook, extracts the necessary data as per the workbook_structure, adds a date column, and then merges with the line_codes data on the line_col_ref and 'line' keys in All_DFs and line_codes respectively.
An example call might look like this:
os.chdir(r'/Path/to/data')
wk_concat = etp.workbook_concatenator('compiled.xls', line_codes, workbook_structure, 'line_no.')
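The per-sheet processing just described can be sketched as follows. Note that the key names inside the workbook structure entries ('start_row', 'end_row', 'date') are assumptions here, as they are set by WorkbookFunctions:

```python
import pandas as pd

# Roughly what happens to each sheet when the object is created (illustrative):
info = workbook_structure['Sheet1']              # assumed keys, see above
df = wk_concat.All_DFs['Sheet1'].loc[info['start_row']:info['end_row']]
df['date'] = info['date']                        # insert the sheet's date
df = pd.merge(df, line_codes, how='left',
              left_on='line_no.', right_on='line')
```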
Attributes

The attributes of this object are as for the workbook_checker object, except that there is now also an All_DFs_merged attribute, which is a dict of DataFrames where the date column and line_codes information have been merged (or just the date column if merge = False).
####Concatenating to a single DataFrame

Once checked and merged, the individual DataFrames in the `workbook_concatenator.All_DFs_merged` attribute are ready to be concatenated. In order to do this you can use the `workbook_concatenator.concat_all()` method. The syntax is as follows:

```python
etp.workbook_concatenator.concat_all(drop_na_lines = False)
```
drop_na_lines | boolean. If drop_na_lines is False then no rows are dropped. If it is True then all rows with a null value of 'line_code' will be dropped. |
---|---|
Returns | DataFrame indexed by sheet name (outer level) and integer index (inner level) |
The drop_na_lines option is only available if the object creator has been passed merge = True. If this was not the case, then drop_na_lines must be False. Only drop lines if you are sure that the data contain only one observation per line/day, or that where there are multiple line observations per sheet, they all have 'line' references (i.e. line references were not part of merged cells etc.).

You will need to sort the index in order to get back to how the data look in the underlying Excel sheets.
An example call might look like this:

```python
DF = wk_concat.concat_all()
```
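Since the result carries a sheet name/row MultiIndex, sorting the index (as noted above) restores the order of the underlying sheets:

```python
# Sort the MultiIndex so rows appear in sheet order, as in the workbook.
DF = DF.sort_index()
```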
##Critical Points Object
###Purpose and Information
Sometimes lines are merged with other lines, and sometimes lines are 'off' for the day, or subject to a new 'layout' for the entire day (meaning that workers move elsewhere).
When lines are merged with other lines we need a variable that gives the line code of the other line the line is merged with, and the data must be present for both lines.
When lines are off for whatever reason, we need a line status variable (for the entire dataset) that is equal to 1 when the line is 'on' and 2 when the line is 'off'.
But how do we find these instances of merged lines and off lines? There are several ways. One way to identify merged lines is by using the `check_unusual_lines()` method provided with ExceltoPandas. Another way is to use the `CriticalPoints` class object.

The `CriticalPoints` object allows the user to test whether critical data points are null, equal to zero, or some combination of the two. A 'critical data point' is a data point such that, when looked at in combination with the other critical data points (if any), if the points are all null or all equal to zero, then no activity could have taken place on the particular line on that day.
A typical example from the production data would be that if 'daily input' and 'daily output' are both zero, or null, then probably nothing happened on that day on that line. When specifying these points to the class object, be careful to choose wisely.
###Workflow and Syntax
####Creating an etp.CriticalPoints object

The syntax for creating a CriticalPoints object is as follows:

```python
etp.CriticalPoints(DF, cols, line_col_ref, date_col_ref)
```
DF | DataFrame (concatenated) |
---|---|
cols | list of column references that are the 'critical points' |
line_col_ref | string that identifies the column in the DataFrame that contains the line number/name |
date_col_ref | string that identifies the column in the DataFrame that contains the dates |

The class has the arguments passed as its attributes.
An example call might look like this:

```python
criticals = etp.CriticalPoints(DF, ['input_today', 'output_today'], 'line', 'date')
```
####Check Critical Nulls
The `critical_nans()` method allows the user to identify those day/line observations at which all of the critical points are null. The syntax is as follows:

```python
etp.CriticalPoints.critical_nans()
```

Returns | dict of line_number keys, with values that are lists of dates at which those lines have critical values that are all null. |
---|---|
The output should be examined, and the original sheets should be looked at, to determine whether the data are just missing (for no reason), or the line is off or subject to a layout change. If the line is properly thought of as 'off' and this can be demonstrated (generally by looking at the comments/remarks column), then the line_status variable should equal 2 for that line on that day. If you are unable to determine why the data are missing, then they should just remain missing. If the lines are merged then speak to RC. At present there is no mechanism for dealing with merged lines, but this functionality will be added.
Note that it is assumed that the critical points are numeric types. If a value that is not convertible to a numeric type is found, an error will be generated. Use this error to find the rogue strings and deal with them.
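A rough sketch of the idea (illustrative, not the module's source; the column names are the hypothetical ones from the example above):

```python
import pandas as pd

# Convert the critical columns to numeric (this is where a rogue string
# would raise, as noted above), then flag rows where every value is null.
crit = DF[['input_today', 'output_today']].apply(pd.to_numeric)
all_null = crit.isnull().all(axis=1)
flagged = DF.loc[all_null, ['line', 'date']]
```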
####Check Critical Zeros
This works in exactly the same way as the above method, but will show lines where the critical values are all zero. The syntax is as follows:

```python
etp.CriticalPoints.critical_zeros()
```
####Check Critical Zeros and Null
Again this works as above, but it identifies where the critical values are all either zero or null. In fact this method will catch all of the former points (just nans, and just zeros), but as the issues surrounding zero and null are typically different, the three methods are provided separately.
It is very important to understand when a value should be 0 and when it should be null. If you are in any doubt please ask RC.
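The zero and zero-or-null variants follow the same pattern as the null check; a self-contained sketch (again illustrative, with the example's column names):

```python
import pandas as pd

crit = DF[['input_today', 'output_today']].apply(pd.to_numeric)
all_zero = crit.eq(0).all(axis=1)                 # critical_zeros()
zero_or_null = crit.fillna(0).eq(0).all(axis=1)   # the zeros-or-nulls check
```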
##Next Steps
Next, a number of procedures will need to be followed using general pandas techniques, i.e. there is no special ExceltoPandas functionality. These steps may involve:

- filling NA values of line, line_code, unit etc. that are NA due to merged cells. This may involve forward filling.
- checking that the strategy for filling NA values was a good one (grouping and checking number of values)
- checking that there are no null values in the columns where we would not expect them
- ensuring that data types are as expected
- checking unique values of columns to be sure that data are as we would expect them to be
- reshaping the data frame where necessary
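By way of illustration, a few of these steps in plain pandas (the column names are hypothetical):

```python
# Forward-fill values lost to merged cells in the source sheets.
DF['line'] = DF['line'].fillna(method='ffill')

# Sanity-check the fill: the count per date should match the expected
# number of lines.
DF.groupby('date')['line'].count()

# Inspect unique values to confirm the data look as expected.
DF['line'].unique()
```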
##Saving and Opening DataFrames
Once the data frame is as we want it to be, it can be exported to csv using `DataFrame.to_csv()` and should be pickled as a `DataFrame`. To assist in the pickling and unpickling of the `DataFrame`, use the following functions:

```python
etp.pickle_dataframe(DF, top_folderpath, name)
```

Function pickles DF in top_folderpath according to name; returns a string message if successful.

```python
etp.unpickle_dataframe(top_folderpath, file_name)
```

Function unpickles file_name from top_folderpath and returns the result.
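These helpers presumably wrap the standard pandas pickling calls; a minimal sketch of equivalent behaviour (an assumption, not the module's source):

```python
import os
import pandas as pd

def pickle_dataframe(DF, top_folderpath, name):
    # Write the DataFrame to disk as a pickle under the given folder/name.
    DF.to_pickle(os.path.join(top_folderpath, name))
    return 'Pickled %s' % name

def unpickle_dataframe(top_folderpath, file_name):
    # Read a previously pickled DataFrame back into the workspace.
    return pd.read_pickle(os.path.join(top_folderpath, file_name))
```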