We will create and deliver a framework, documented workflow, and curriculum to support the practical application of data science as a discipline within the healthcare industry.
There is increasing and widespread adoption of Electronic Medical Record (EMR) systems as well as a vast number of more specialized Information Systems (IS) in use within healthcare organizations.
Additionally, a growing number of data sets, of increasing volume, are freely and openly available. These can be correlated with private and proprietary sources to enrich the value of “data products” and of the information that drives decision making for leaders and consumers.
The rise of “big data” in recent years has brought a wide variety of tools and techniques for working with large amounts of data. Many of these software components are freely available, open source, actively developed by contributors, and supported by private parties (including large enterprises, start-ups and even not-for-profit organizations).
The discipline of data science has begun to emerge, and it is likely to become a major factor in the success of any organization that depends on information services.
- recent articles clarify the need for and state of data science and healthcare
- O’Reilly - What is Data Science
- McKinsey - Big data: The next frontier for innovation, competition, and productivity
- NY Times - For Today’s Graduate, Just One Word: Statistics
- EMC - New Global Study: Only One-Third of Companies Making Effective Use of Data
- (PDF) EMC - Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field
- Forbes - Can Big Data Fix Healthcare?
- initially, all open data gathered will be loaded into a PostgreSQL database
- must identify opportunities for exploring health data with some of the following tools/techniques:
- hadoop/mahout
- riak
- while this framework will be built on a Linux distribution, all software should be platform independent and able to run on other operating systems (such as Mac OS X or Windows)
- while all components included as part of this framework will be free and open source, there must be a means to interoperate and integrate with closed source and/or proprietary systems and data
- for example: curehunter
- a portable guide
- software (and associated documentation for) components of “framework”
- platform
- Linux: Arch Install Guide (my distro of choice, distro du jour)
- emacs (org-mode with org-babel, for reproducible research)
- data management
- postgresql (and postgis, for working with geospatial data)
- other rdbms such as mssql, oracle, even mysql
- hadoop
- riak
- couchdb/couchbase
- analytics
- R statistical programming language (with rstudio, web-based IDE)
- WEKA 3 - Data Mining Software in Java
- Mahout - Machine Learning for Hadoop
- RapidMiner
- PyBrain - Python Machine Learning Library
- Natural Language Toolkit (Python)
- application
- Node.js (server-side javascript for realtime web)
- presentation (web, rich, interactive GUI)
- jquery
- d3js
- polymaps
- platform
- books and documentation
- open healthcare data
- identify, (rate?,) gather and load (into PostgreSQL) all available open healthcare data
- identify, install (and document settings and configuration) all software components on a single server
- retrieve and load U.S. Census data (to allow for geographic analysis of healthcare data)
- hint hint, tokenmathguy
- create a comprehensive demonstration of how each software component can participate in an end-to-end solution
- create data science training materials and reusable components
- sql and statistics training for data miners
- fork thinkstats and translate exercises from python to sql/R
- draft SQL best practices presentation and deliver (video?)
- javascript for developing interactive, rich graphical interfaces
- revise, re-architect tsv
- machine learning principles and procedures
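
As a rough illustration of two of the tasks above (loading open data into a relational database, and translating a statistics exercise from Python to SQL), here is a minimal sketch. It rests on several assumptions: an in-memory SQLite database stands in for PostgreSQL so the example runs with only the Python standard library, and the table name `stays`, its columns, and the sample values are hypothetical and invented purely for illustration. It is not taken from thinkstats or from any real healthcare data set.

```python
import csv
import io
import sqlite3
import statistics

# Hypothetical tab-separated sample (made-up values, not real healthcare data).
SAMPLE_TSV = "county\tstay_days\nA\t2\nA\t4\nB\t6\nB\t8\n"


def load_rows(conn, tsv_text):
    """Create the (hypothetical) stays table and bulk-load rows from TSV text."""
    conn.execute("CREATE TABLE stays (county TEXT, stay_days REAL)")
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    conn.executemany(
        "INSERT INTO stays (county, stay_days) VALUES (?, ?)",
        [(row["county"], float(row["stay_days"])) for row in reader],
    )


def sql_mean_and_variance(conn):
    """Population mean and variance computed entirely in SQL.

    Uses the identity Var(X) = E[X^2] - (E[X])^2, which needs only AVG.
    """
    return conn.execute(
        "SELECT AVG(stay_days), "
        "AVG(stay_days * stay_days) - AVG(stay_days) * AVG(stay_days) "
        "FROM stays"
    ).fetchone()


def python_mean_and_variance(conn):
    """The same statistics computed in Python, for cross-checking the SQL."""
    values = [v for (v,) in conn.execute("SELECT stay_days FROM stays")]
    return statistics.mean(values), statistics.pvariance(values)


# SQLite in memory is an assumption standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
load_rows(conn, SAMPLE_TSV)
sql_result = sql_mean_and_variance(conn)
py_result = python_mean_and_variance(conn)
print("sql:", sql_result, "python:", py_result)
```

Computing the variance from `AVG` alone keeps the query portable across database engines, at the cost of some numerical stability on large or widely spread values. On PostgreSQL, the built-in `var_pop` aggregate would likely be the more natural choice.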