Skip to content

Make Import Details

matthewcornell edited this page Aug 7, 2017 · 8 revisions

This page summarizes the details of importing from the CSV files.

Import runs in two phases: Load the CSV files' data and then create the unique_case table. The details of each phase are detailed below in order of execution.

import_data.R

(See note in Servers-Databases-And-Code-Repos#data_dir-variable re: the $DATA_DIR variable.)

  1. 1_original_to_sql.R: Uses the $DATA_DIR/dengue_case_metadata.txt file to identify the CSV files to load, and then loads each into its own table in the original_data schema. Loads each file's corresponding /Users/cornell/IdeaProjects/dengue-data/file_specific_code entry. See Dengue-Cases-Database#original_data for the original_data schema's details.

    Note: Due to the variation we saw early on in data types and formats, this database imports all data from .csv files originally as text. We avoid R type conversions by using the safe_read_csv function from our own integrator package which among other things calls read.csv as:

    read.csv(file = file, check.names = FALSE, colClasses = "character", stringsAsFactors = FALSE)
  2. 2_standardize_pipeline_code.R: defines record access functions

  3. 3_standardize_data.R: Standardizes each table loaded in step 1, as documented in the standardized_data schema's details at Dengue-Cases-Database#standardized_data.

  4. 4_merge_into_case_data.R: Creates the single, merged standardized_data.case_data table from individual case data tables, adding UID data as well. (See Dengue-Cases-Database#standardized_data for details.)

  5. 5_old_counts_new_names.R: Loads the data in the original_data.dengue_counts_1968_to_2005 table into dengue_counts_1968_to_2005, normalizing the data by moving province information into a single province_name column, normalizing dates, and dropping unused columns.

  6. 1_old_counts_new_smooth.R: Modifies the data in the dengue_counts_1968_to_2005 table to include weekly, biweekly, and monthly counts, then uses a spline curve to interpolate in-between data, and finally saves results to the weekly_dengue_counts_1968_to_2005 table.

create the unique_case table

This phase runs a sequence of SQL files that creates the final unique_case table based on the case tables built by the previous phase, and also creates functions and additional tables, which are documented in Dengue-Cases-Database. Here are the tables and functions in order of creation:

  • 1_create_first_delivery_tb.sql:
    • most_recent_deliveries() function
    • first_delivery table
  • 2_create_standardized_data.case_reports_with_delivery_information_tb.sql:
    • standardized_data.case_reports_with_delivery_information table
  • 3_create_unique_case_data_tb.sql:
    • unique_case_data() function
    • unique_case_data table
  • 4_create_first_reports_tb.sql:
    • first_reports table
  • 5_create_aggregate_table_fn.sql:
    • aggregate_table() function