
# clean-fin-data

A module to help you store, append, and scrub financial data using HDF5 data structures to ensure speed and efficiency
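For context, the storage pattern this module builds on is generic `pandas.HDFStore` usage. The sketch below is not this package's own API; the file name, ticker key, and toy prices are made up:

```python
import pandas as pd

# Hypothetical toy data: a few business days of Close / Adj Close prices
dates = pd.bdate_range('2014-01-01', periods=5)
prices = pd.DataFrame({'Close': [100.0, 101.0, 102.0, 101.5, 103.0],
                       'Adj Close': [99.0, 100.0, 101.0, 100.5, 102.0]},
                      index=dates)

# Store and retrieve the DataFrame under a per-ticker key
store = pd.HDFStore('financial_data.h5')
store['AAPL'] = prices        # write the frame to disk
retrieved = store['AAPL']     # fast keyed read back
store.close()
```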

## Installation

### Clone the Repository

```
$ git clone https://github.com/benjaminmgross/clean-fin-data
$ cd clean-fin-data
$ python setup.py install
```

### Test the Install

```
$ ipython
Python 2.7.6 (default, Nov 12 2013, 10:54:02)
Type "copyright", "credits" or "license" for more information.

IPython 2.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import clean_fin_data
```

## Dependencies

- `pandas` (obviously)
- `tables` (as in PyTables)

To install `tables` on Mac OS X Mavericks, you need HDF5. I did the following to get HDF5 functionality up and running (if you're on Windows, PyTables comes pre-installed with Anaconda and Enthought, so you can skip this step):

1. You can't just `brew install hdf5` with Homebrew; you actually need to tap Homebrew Science to get it. So execute the following commands:

   ```
   $ brew tap homebrew/science

   # Then you can brew install it
   $ brew install hdf5
   ```

2. Now you've got the necessary components to install PyTables with `pip`:

   ```
   $ pip install tables
   ```
    

NOTE: Just because you have `pandas` installed doesn't mean you have `tables`, so if you try to execute any of this code and get:

```
.. ERROR:: Could not find a local HDF5 installation.
```

you probably need to execute the steps just covered.
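A quick sanity check that `tables` is actually importable (plain Python, nothing package-specific):

```python
# Confirm that PyTables (and therefore a local HDF5 library) is available
try:
    import tables
    print("PyTables %s found" % tables.__version__)
except ImportError:
    print("PyTables missing -- install HDF5, then `pip install tables`")
```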

## Filling the Gaps

### The Motivation

Yahoo! Finance data has historically been quite good, but recently you may (as I did) find rather large gaps in historical price information. When it comes to portfolio price calculation or metric estimation, those holes (which I've found to stretch to 19 trading days for some tickers) can wreak havoc. The upside is that Google Finance often has those prices... the downside is that Google Finance doesn't have an Adjusted Close, so the Adjusted Close needs to be estimated.

### Classifying the Transform of Adjusted Close & Close

The standard methodology for calculating Close and Adjusted Close after a stock split or dividend is well explained in this Yahoo! tutorial. One way to determine whether a stock split or dividend payment has occurred is to use the ratio of the Close to the Adjusted Close on each price date.

It should intuitively make sense that the ratio of the Close to the Adjusted Close will fall into one of the following four situations, each with its own characteristic-looking graph:

#### 1. Dividends and no splits occurred

#### 2. Dividends and splits occurred

#### 3. Neither dividends nor splits occurred

#### 4. Splits only (no image found)
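As an illustration, the ratio itself is one line of pandas. The column names below assume Yahoo!-style `Close` and `Adj Close` fields, and the numbers are made up:

```python
import pandas as pd

# Hypothetical price frame with Yahoo!-style columns
dates = pd.bdate_range('2014-01-01', periods=5)
prices = pd.DataFrame({'Close': [50.0, 50.5, 51.0, 25.6, 25.9],
                       'Adj Close': [49.0, 49.5, 50.0, 25.1, 25.4]},
                      index=dates)

# Step changes in this series are the fingerprints of dividends and splits
ratio = prices['Close'] / prices['Adj Close']
```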

The next step is to perform a log transformation and then run a classification algorithm to determine whether a given log change is:

  1. White Noise
  2. A Dividend
  3. A Split (or Reverse Split)

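A naive version of that classification, using fixed cutoffs instead of the trained Logit model described below (the threshold values here are illustrative assumptions, not the package's):

```python
import numpy as np
import pandas as pd

def classify_log_changes(ratio, div_cutoff=0.001, split_cutoff=0.25):
    """Label each day's log change in the Close / Adj Close ratio.

    The cutoffs are illustrative; the package trains a Logit model instead.
    """
    log_chg = np.log(ratio).diff()
    labels = pd.Series('white noise', index=ratio.index)
    labels[log_chg.abs() > div_cutoff] = 'dividend'
    labels[log_chg.abs() > split_cutoff] = 'split (or reverse split)'
    return labels
```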

With that information, and with a known Close Date, we can get really close to the actual Adjusted Close, and, if dividends or splits have occurred, make the necessary adjustments to the... well, Adjusted Close.

## Tickers Used to Train the Logit Model

The spreadsheet at `data/training_data/Trained Data.xlsx` contains the tickers, and the binary outcome for each ticker, used to train the Logit model to detect "non-white noise" values such as dividends or splits.
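A hedged sketch of what fitting such a model might look like, using scikit-learn's `LogisticRegression` as a stand-in for the Logit model (the feature and label columns here are hypothetical, not the spreadsheet's actual layout):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training frame: absolute log change of the Close / Adj Close
# ratio on each date, with a hand-labeled binary outcome
train = pd.DataFrame({
    'log_change': [0.0001, 0.0002, 0.004, 0.0001, 0.69, 0.003, 0.0002],
    'is_event':   [0, 0, 1, 0, 1, 1, 0],   # 1 = dividend or split
})

model = LogisticRegression()
model.fit(train[['log_change']], train['is_event'])

# Probability that each observed change is a real event, not white noise
probs = model.predict_proba(train[['log_change']])[:, 1]
```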

## To Do

- Create the ability to check if a ticker is already a member of a store
- Write the scrub functionality
- Write the append functionality
- Consider writing to a log file that specifies all of the additions / changes / etc. that were accomplished in each function
- Create some sort of testing mechanism to compare the read / write performance of `cPickle` and HDF5
- The dates from Google, Yahoo!, and those generated by `pandas.bdate_range` are all different, and the current "store" has all `NaN`s removed. So (a sketch follows this list):
  1. Some kind of "master index" needs to be created
  2. That master index then needs to be used to determine the gaps in the data
  3. Then the gaps themselves need to be filled in some way
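A sketch of that master-index idea, assuming two price Series to reconcile (the function name and fill strategy are assumptions, not implemented code):

```python
import pandas as pd

def fill_gaps(yahoo, google):
    """Patch holes in a Yahoo! price Series with Google Finance prices.

    A sketch of the to-do above: union the two indexes into a master
    index, find the dates missing from the primary series, and fill
    them from the secondary one where available.
    """
    master_index = yahoo.index.union(google.index)
    primary = yahoo.reindex(master_index)
    gaps = primary[primary.isnull()].index      # dates missing from Yahoo!
    primary.loc[gaps] = google.reindex(master_index).loc[gaps]
    return primary
```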
