A recipe is a set of instructions for constructing a new dataset from existing ones. By reading the recipe and running the recipe executor, one can generate a new dataset based on the ingredients and procedures described in the recipe.
A recipe is made of the following parts:
- basic info
- configuration
- includes
- cooking procedures
A recipe file can be in either json or yaml format. Check recipe_unpop.yaml for an example.
All basic info is stored in the info section of the recipe. An id field is required inside this section. Any other information about the new dataset can be stored inside this section, such as name, provider, description and so on.
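As a rough illustration of the layout described so far, the sketch below builds a minimal recipe as a Python dict, checks that info contains an id, and serialises the result to JSON. The top-level key name config, the helper check_info and every value shown are assumptions made for illustration; they are not taken from recipe_unpop.yaml.

```python
import json

# Hypothetical minimal recipe; all values are made up for illustration.
recipe = {
    "info": {
        "id": "my-new-dataset",  # the only required field
        "name": "My new dataset",
        "provider": "example provider",
        "description": "built from existing datasets",
    },
    "config": {},   # key name assumed; directory settings go here
    "include": [],  # other recipes to include
    "cooking": {},  # lists of procedures
}


def check_info(recipe):
    """Return the dataset id, or raise ValueError if info.id is missing."""
    info = recipe.get("info")
    if not isinstance(info, dict) or "id" not in info:
        raise ValueError("recipe must have an 'info' section with an 'id' field")
    return info["id"]


dataset_id = check_info(recipe)
recipe_json = json.dumps(recipe, indent=2)
```

Because a recipe can be json or yaml, the same structure could equally be written as a yaml mapping.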
Note on yaml format
In this repo we use base to indicate where the ingredients come from. We set a yaml anchor on each of them (&d1 etc.) so that we can reference them later in the recipe (*d1 etc.).
Inside the configuration section, we define the directories used by the recipe. Currently the following paths can be set:
- ddf_dir: the directory that contains all ddf csv repos. This variable must be set in the main recipe to run with chef.
- recipes_dir: the directory that contains all recipes to include. This variable must be set if the recipe has an include section.
- dictionary_dir: the directory that contains all translation files. This variable must be set if a json file is given in the options of a procedure. (Translation will be discussed later.)
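The three rules above can be expressed as a small check. This is only a sketch: the function missing_config is hypothetical, and the key names config, options and dictionary are assumed spellings of the configuration section and the procedure options mentioned in the text.

```python
def missing_config(recipe):
    """Return the directory settings the recipe needs but does not define."""
    config = recipe.get("config", {})  # key name assumed

    # ddf_dir is always needed in the main recipe.
    required = ["ddf_dir"]

    # recipes_dir is needed as soon as anything is included.
    if recipe.get("include"):
        required.append("recipes_dir")

    # dictionary_dir is needed when a dictionary option is a filename
    # (a string) rather than an inline mapping.
    uses_dictionary_file = any(
        isinstance(procedure.get("options", {}).get("dictionary"), str)
        for procedures in recipe.get("cooking", {}).values()
        for procedure in procedures
    )
    if uses_dictionary_file:
        required.append("dictionary_dir")

    return [key for key in required if key not in config]
```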
A recipe can include other recipes. To include a recipe, simply append its filename to the include section. Note that it should be an absolute path or a filename inside the recipes_dir.
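The path rule for includes can be sketched as follows. The function resolve_include is a hypothetical helper written for this example, not part of the chef module.

```python
import os


def resolve_include(entry, recipes_dir=None):
    """Return the path of an included recipe.

    Absolute paths are used as they are; anything else is treated as a
    filename inside recipes_dir.
    """
    if os.path.isabs(entry):
        return entry
    if recipes_dir is None:
        raise ValueError("recipes_dir must be set to include %r" % entry)
    return os.path.join(recipes_dir, entry)
```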
In the chef module, the process of generating a dataset has the following steps:
- read the main recipe
- if there is an include section, read each file in the include list and expand the main recipe
- if the dictionary option of a procedure is a filename, read that file and expand the option with its content
- run the procedures one by one
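The steps above can be sketched as a small driver. This is not the chef module's code: the function cook, its read and run_procedure parameters, and the key names config, options and dictionary are assumptions, and so is the choice to place included procedures before the main recipe's own.

```python
import os


def cook(main_path, read, run_procedure):
    """Hypothetical driver for the four steps.

    read(path) returns the parsed content of a recipe or dictionary file;
    run_procedure(kind, procedure) executes a single procedure.
    """
    # 1. Read the main recipe.
    recipe = read(main_path)
    config = recipe.get("config", {})
    cooking = recipe.setdefault("cooking", {})

    # 2. Expand the main recipe with every included recipe.
    #    Assumption: included procedures are placed before the main ones.
    for name in recipe.get("include", []):
        if os.path.isabs(name):
            path = name
        else:
            path = os.path.join(config["recipes_dir"], name)
        included = read(path)
        for kind, procedures in included.get("cooking", {}).items():
            cooking[kind] = list(procedures) + cooking.get(kind, [])

    # 3. Replace dictionary options given as filenames with the file content.
    for procedures in cooking.values():
        for procedure in procedures:
            options = procedure.get("options", {})
            if isinstance(options.get("dictionary"), str):
                options["dictionary"] = read(
                    os.path.join(config["dictionary_dir"], options["dictionary"])
                )

    # 4. Run the procedures one by one.
    results = []
    for kind, procedures in cooking.items():
        for procedure in procedures:
            results.append(run_procedure(kind, procedure))
    return results
```

Passing read in as a parameter keeps the sketch independent of the file format, so the same driver works for json and yaml recipes.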
The cooking section is a dictionary that contains one or more lists of procedures to build a dataset. Valid keys for the cooking section are datapoints, entities and concepts.
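A minimal sketch of a shape check for this section, assuming the section has already been parsed into a Python dict. The function check_cooking is hypothetical.

```python
VALID_COOKING_KEYS = {"datapoints", "entities", "concepts"}


def check_cooking(cooking):
    """Raise ValueError on unknown keys or on values that are not lists."""
    unknown = set(cooking) - VALID_COOKING_KEYS
    if unknown:
        raise ValueError("invalid cooking keys: %s" % sorted(unknown))
    for key, procedures in cooking.items():
        if not isinstance(procedures, list):
            raise ValueError("cooking[%r] must be a list of procedures" % key)
    return True
```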
Currently supported procedures:
- translate_header
  - translate the headers of the datapoints
- translate_column
  - translate the values in a column
- identity
  - identity function, nothing changes
- merge
  - merge ingredients together on the keys
- align
  - align two columns in two ingredients
  - discussion: semio/ddf_utils#3
- groupby
  - group ingredient data by keys
  - discussion: semio/ddf_utils#4
- filter_row
  - filter ingredient data by values
  - discussion: semio/ddf_utils#2
- filter_item
  - filter ingredient data by concepts
  - discussion: semio/ddf_utils#14
- run_op
  - run math operations on ingredient
  - discussion: semio/ddf_utils#7
- accumulate
  - run cumulative functions over an ingredient
- copy
  - make copy of indicators of ingredient
Note: if you need to use translate_header, translate_column, align or copy in your recipe, place them at the beginning of the recipe. This can improve the performance of running the recipe.
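To give a feel for what three of these procedures do, here is a plain Python sketch that works on rows stored as a list of dicts. It only mirrors the one-line descriptions above and is not the library's implementation; the function signatures and the sample rows are made up for illustration.

```python
def translate_header(rows, dictionary):
    """Rename columns: names found in dictionary are replaced, others kept."""
    return [{dictionary.get(k, k): v for k, v in row.items()} for row in rows]


def translate_column(rows, column, dictionary):
    """Replace the values of one column; values not in dictionary are kept."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row[column] = dictionary.get(row[column], row[column])
        out.append(new_row)
    return out


def filter_row(rows, **conditions):
    """Keep only the rows whose columns equal all the given values."""
    return [
        row for row in rows
        if all(row.get(col) == val for col, val in conditions.items())
    ]


# Made-up sample data.
rows = [
    {"geo": "a", "year": 2000, "pop": 1},
    {"geo": "b", "year": 2000, "pop": 2},
]
renamed = translate_header(rows, {"pop": "population"})
translated = translate_column(rows, "geo", {"a": "alpha"})
only_b = filter_row(rows, geo="b")
```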