Recipe Manual (draft)

What is a recipe

A recipe is a set of instructions for constructing a new dataset from existing ones. By reading a recipe and running it through the recipe executor, one can generate a new dataset from the ingredients and procedures described in the recipe.

Structure of a recipe

A recipe is made of the following parts:

  • basic info
  • configuration
  • includes
  • cooking procedures

A recipe file can be in either json or yaml format. See recipe_unpop.yaml for an example.

basic info

All basic info is stored in the info section of the recipe. An id field is required inside this section. Any other information about the new dataset can be stored inside this section, such as name, provider, description and so on.
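As a minimal sketch of the rule above, the following Python checks that a recipe's info section carries an id. The helper name `check_info` and all sample values are hypothetical and are not part of chef; the recipe is shown as the dictionary one would get after loading the json or yaml file.

```python
def check_info(recipe):
    """Return the dataset id from the recipe's info section.

    Hypothetical helper, not part of chef: it only enforces the one rule
    stated above, that the info section must contain an id field.
    """
    info = recipe.get("info") or {}
    if "id" not in info:
        raise ValueError("recipe info section must contain an 'id' field")
    return info["id"]


# A recipe as it might look after loading the file.
# Every value below is made up for illustration.
recipe = {
    "info": {
        "id": "example-dataset",
        "name": "Example dataset",
        "provider": "Example provider",
        "description": "Optional free-form fields can sit next to id.",
    }
}

dataset_id = check_info(recipe)
```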

Note on yaml format

In this repo we use base to indicate where the ingredients come from. We set a yaml anchor on each of them (&d1 etc.) so that we can reference them later in the recipe (*d1 etc.).

config

The configuration section defines the directories that chef uses. Currently the following paths can be set:

  • ddf_dir: the directory that contains all ddf csv repos. This variable must be set in the main recipe in order to run it with chef.
  • recipes_dir: the directory that contains all recipes to include. This variable must be set if the recipe has an include section.
  • dictionary_dir: the directory that contains all translation files. This variable must be set if any procedure's options refer to a json file. (Translation will be discussed later.)
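The three rules above can be sketched as a small check. This is only an illustration under stated assumptions: the function names are hypothetical, and the layout of a procedure entry (a `procedure` name plus an `options` dictionary whose `dictionary` value may be a filename) is assumed rather than taken from chef.

```python
def uses_dictionary_file(recipe):
    """True if any procedure's dictionary option is a filename (a string).

    Assumed layout: cooking maps a key to a list of procedures, each with
    an optional "options" dictionary.
    """
    cooking = recipe.get("cooking") or {}
    for procedures in cooking.values():
        for proc in procedures:
            dictionary = (proc.get("options") or {}).get("dictionary")
            if isinstance(dictionary, str):
                return True
    return False


def missing_config_keys(recipe, is_main=True):
    """List the config paths the rules above require but the recipe lacks."""
    config = recipe.get("config") or {}
    required = []
    if is_main:
        required.append("ddf_dir")          # needed to run with chef
    if recipe.get("include"):
        required.append("recipes_dir")      # needed to find included recipes
    if uses_dictionary_file(recipe):
        required.append("dictionary_dir")   # needed to find translation files
    return [key for key in required if key not in config]


# Illustrative recipe: ddf_dir is set, the other two paths are missing.
recipe = {
    "config": {"ddf_dir": "/path/to/ddf"},
    "include": ["other_recipe.yaml"],
    "cooking": {
        "datapoints": [
            {"procedure": "translate_header",
             "options": {"dictionary": "names.json"}},
        ]
    },
}

missing = missing_config_keys(recipe)
```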

include

One recipe can include other recipes. To include a recipe, simply append its filename to the include section. Note that it should be either an absolute path or a filename inside recipes_dir.

In the chef module, the process of generating a dataset has the following steps:

  • read the main recipe
  • if there is an include section, read each file in the include list and expand the main recipe
  • if the dictionary option of a procedure is a filename, expand it by reading that file
  • run the procedures one by one
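The first three steps above can be sketched as follows. This is not chef's implementation: the function names are hypothetical, the sketch reads json recipes only so that it needs nothing beyond the standard library, and the way an included recipe is merged (its procedures are placed before those of the main recipe) is an assumption made for illustration.

```python
import json
import os
import tempfile


def resolve(name, base_dir):
    """Absolute paths are used as-is; bare filenames are looked up in base_dir."""
    return name if os.path.isabs(name) else os.path.join(base_dir, name)


def load_json(path):
    with open(path) as f:
        return json.load(f)


def build_recipe(main_path):
    # Step 1: read the main recipe.
    recipe = load_json(main_path)
    config = recipe.get("config") or {}

    # Step 2: read each included recipe and expand the main recipe.
    # Assumption: included procedures come first, then the main recipe's own.
    expanded = {}
    for name in recipe.get("include") or []:
        included = load_json(resolve(name, config.get("recipes_dir", "")))
        for key, procedures in (included.get("cooking") or {}).items():
            expanded.setdefault(key, []).extend(procedures)
    for key, procedures in (recipe.get("cooking") or {}).items():
        expanded.setdefault(key, []).extend(procedures)
    recipe["cooking"] = expanded

    # Step 3: where a dictionary option is a filename, load the file.
    for procedures in expanded.values():
        for proc in procedures:
            options = proc.get("options") or {}
            if isinstance(options.get("dictionary"), str):
                path = resolve(options["dictionary"],
                               config.get("dictionary_dir", ""))
                options["dictionary"] = load_json(path)

    # Step 4 (running the procedures) is left out of this sketch.
    return recipe


# Usage with throwaway files; every name and value here is illustrative.
with tempfile.TemporaryDirectory() as tmp:
    def write(name, data):
        with open(os.path.join(tmp, name), "w") as f:
            json.dump(data, f)

    write("names.json", {"pop": "population"})
    write("base.json", {"cooking": {"datapoints": [{"procedure": "identity"}]}})
    write("main.json", {
        "info": {"id": "example-dataset"},
        "config": {"recipes_dir": tmp, "dictionary_dir": tmp},
        "include": ["base.json"],
        "cooking": {"datapoints": [
            {"procedure": "translate_header",
             "options": {"dictionary": "names.json"}},
        ]},
    })
    result = build_recipe(os.path.join(tmp, "main.json"))
```

After the call, `result["cooking"]["datapoints"]` holds the included identity procedure followed by translate_header, whose dictionary option now contains the mapping read from names.json instead of the filename.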

cooking procedures

The cooking section is a dictionary that contains one or more lists of procedures to build a dataset. Valid keys for the cooking section are datapoints, entities, and concepts.

Currently supported procedures:

  • translate_header
    • translate the headers of the datapoints
  • translate_column
    • translate the values in a column
  • identity
    • identity function = nothing changes
  • merge
    • merge ingredients together on the keys
  • align
  • groupby
  • filter_row
  • filter_item
  • run_op
  • accumulate
    • run cumulative functions over an ingredient
  • copy
    • make copies of the indicators of an ingredient
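To give a feel for what a list of procedures does, here is a toy sketch that runs four of the procedures above over plain Python rows. It is not how chef implements them: chef works on ingredients, and the option names used here (`dictionary`, `column`) as well as the `cook` helper are assumptions chosen for illustration.

```python
from itertools import accumulate as running_total


def identity(rows):
    # Nothing changes.
    return rows


def translate_header(rows, dictionary):
    # Rename column headers found in the dictionary; keep the others.
    return [{dictionary.get(key, key): value for key, value in row.items()}
            for row in rows]


def translate_column(rows, column, dictionary):
    # Replace values in one column; unknown values are kept as they are.
    return [{**row, column: dictionary.get(row[column], row[column])}
            for row in rows]


def accumulate(rows, column):
    # Replace a column with its running sum, in row order.
    totals = running_total(row[column] for row in rows)
    return [{**row, column: total} for row, total in zip(rows, totals)]


# Toy registry: procedure name -> function.
PROCEDURES = {
    "identity": identity,
    "translate_header": translate_header,
    "translate_column": translate_column,
    "accumulate": accumulate,
}


def cook(rows, procedures):
    """Run the procedures one by one, feeding each result into the next."""
    for proc in procedures:
        func = PROCEDURES[proc["procedure"]]
        rows = func(rows, **proc.get("options", {}))
    return rows


rows = [
    {"geo": "swe", "year": 2000, "pop": 1},
    {"geo": "swe", "year": 2001, "pop": 2},
    {"geo": "swe", "year": 2002, "pop": 3},
]

procedures = [
    {"procedure": "translate_header",
     "options": {"dictionary": {"pop": "population"}}},
    {"procedure": "translate_column",
     "options": {"column": "geo", "dictionary": {"swe": "sweden"}}},
    {"procedure": "accumulate", "options": {"column": "population"}},
]

cooked = cook(rows, procedures)
```

With these inputs the header pop becomes population, swe becomes sweden, and the population column turns into the running totals 1, 3, 6.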

General guideline for writing recipes

  • If you need to use translate_header/translate_column/align/copy in your recipe, place them at the beginning of the recipe. This can improve the performance of running the recipe.