This repository contains everything needed to process and store synoptic observations from a variety of sources. Out of the box it supports the DWD, KNMI and IMGW Open Data services and can parse BUFR files from many other providers.
It is easily extendable via configuration (YAML) files and by adding your own scripts which make use of the existing framework.
- While in the directory of the repository, run install.sh like this:
chmod +x install.sh && ./install.sh
- OR if the permissions cannot be set/changed:
bash install.sh
- The install.sh script will install Miniconda if not present, create an environment with all necessary packages and install the plbufr package from sferics' GitHub.
- It then defines ".githook/" as the directory for git hooks. There are currently two git hooks: the pre-commit hook automatically compiles all .py files before each commit, so at least some syntax errors can be caught early. The post-commit hook, on the other hand, calls scripts/export_conda_environment.sh, which exports the conda environment information to "environment.yml" and creates a "requirement.txt" file.
- Afterwards, it will compile all .py files in the directory in order to speed up the first run of each script.
- Lastly, it executes 3 .sql files (in "sql/") which add some essential tables, columns and values to the main database. These changes should be implemented in amalthea/main for better integration!
All python scripts offer a -h/--help option which shows their command line arguments with a brief explanation. However, in order to understand them better, you should read the following in-depth information carefully.
To be able to run these scripts, the configuration files general.yml, scripts.yml, sources.yml and clusters.yml are needed. Before the first run, create them by copying the template files from "config/templates/" to "config/" and adding your desired source and cluster configurations to the respective files. The general.yml and scripts.yml files also need to be adjusted with your desired file paths, system-specific settings etc.
All command line arguments are defined in config/parser_args.yml and they are the same across all scripts. The only difference lies in their availability.
For more details on adding/changing/removing command line arguments, please read the respective section about the YAML configuration files -> parser_args.yml.
IMPORTANT: Settings defined by command line arguments will always overwrite settings defined in the script's configuration!
- show help message which explains the usage of the script briefly
- print (more) verbose output
- run in debug mode with additional debug prints and stop points (using pdb module)
- use traceback module to print error messages that occur on module level
- suppress all warning messages
- use a PID file to determine whether the script is already running and which process number it has
- define logging level (choose one of the following: {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET} )
- define a custom config directory (structure has to be the same as within config/ directory)
- comma-separated list of stations to consider
- comma-separated list of clusters to consider
- maximum number of retries when writing to station databases
- maximum number of files to process (usually this setting applies per source)
- operation mode (can be "dev", "oper" or "test")
- stage of forging (can be "raw","forge","bad" or "final")
- timeout in seconds when trying to write to station databases
- define custom output path where the station databases will be saved
- use multiprocessing if -P > 1; defines number of processes to use
- define name of custom (BUFR) translation file (can be necessary for providers which use special encoding or error treatment)
This script decodes one or several BUFR files and inserts all relevant observations into the raw databases.
It can also process entire source/dataset directories, which can be provided by the source name(s) as argument(s) or via the "sources.yml" configuration file.
- first and only positional argument
- can take several sources, separated by spaces
You may use 5 different approaches to decode the BUFR files:
- pd: Using pdbufr package officially provided by ECMWF (very slow because it relies on pandas).
- pl: Using plbufr package forked from pdbufr by sferics (much faster because it uses polars instead).
- gt: Also using plbufr package, but instead of creating a polars DataFrame, it uses a generator (should be equally fast).
- us: Fastest decoding method, using BUFR keys via ecCodes, but lacking some observations like soil temperatures.
- ex: Slower than "us" method, but significantly faster than pdbufr/plbufr methods. Not guaranteed to work with all files and lacking some information from DWD Open Data files!
- process a single file, given by its file path
- process several files, given by their file paths, separated by a divider character (default: ";")
- define a custom divider/separator character for -F
- process file(s) again, even if they have been processed already
- usually only used automatically by the script if the RAM is full, so it knows which files are still left to process
- sort files with a sorting algorithm (sorted() by default)
- define the sorting algorithm for the above option (has to be a python callable and will be evaluated using eval())
decode_bufr.py -a pl -f example_file.bufr -r
decode_bufr.py -a ex -F example_file1.bin,example_file2.bin,example_file3.bin -D "," -v
decode_bufr.py DWD -a gt -k 10381,10382,10384,10385
decode_bufr.py DWD KNMI RMI -a gt -n 100
decode_bufr.py -C obs_custom.yml -O /custom/output/directory
This is a chain script which runs the following scripts in their order of occurrence. In operational mode only, derive_obs.py runs a second time after aggregate_obs.py; export_obs.py is executed only if -e/--export is set.
- only print out commands and do not actually run the scripts
- this is meant for debugging purposes only
- export new observations into old/legacy metwatch csv format after finishing the chain (see export_obs.py for more information)
- define old/legacy metwatch csv output directory for export_obs.py
python forge_obs.py -e -L /legacy/output/path -l INFO
Copy all remaining elements from the raw to the forge databases [dataset,datetime,duration,element,value], keeping only one row with max(file) per dataset [UNIQUE datetime,duration,element].
python reduce_obs.py -P 12
Compute derived elements like relative humidity, cloud levels or reduced pressure from (a combination of) other elements.
Compute derived elements again, but only considering 30min-values.
python derive_obs.py -k 10381
Aggregate over certain time periods / durations (like 30min, 1h, 3h, 6h, 12h, 24h) and create new elements with a "{duration}" suffix (like "TMAX12h_2m_syn"). The information about which elements to aggregate over which durations and which elements need gap filling is contained in config/element_aggregation.yml.
python aggregate_obs.py -t
Check all obs in the forge databases and delete bad data like NaN, unknown or out-of-range values.
- move good data to final databases, e.g. "/oper/final" (oper mode)
- move bad data to separate databases, e.g. "/dev/bad" (dev mode)
python audit_obs.py -d
Clear forge station databases (they are temporary and get rebuilt every chain cycle).
- clear bad obs as well
python empty_obs.py -B -w
Export observations from final databases into the old/legacy metwatch csv format.
- define old/legacy metwatch csv output directory
python export_obs.py -L /legacy/output/directory
Get the latest observations from the Polish IMGW Open Data service.
python get_imgw.py -v -c poland
- conversion of BUFR code/flag tables into values we use
- definition of WMO BUFR sequences - only needed for the "ex" approach of decode_bufr.py
- conversion of SYNOP codes into values we use
- conversion of METAR codes into values we use
- information about which elements to aggregate OR which need gap filling
- consists of two sections:
duration:
- which element to aggregate over which durations
- fallback elements can be defined (like TMP instead of TMAX)
instant:
- which elements always have the same duration
- for these elements we try to fill in the gaps (using nearby values)
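A minimal sketch of how these two sections might look (the key layout, element names and durations here are illustrative assumptions, not the real file content):
duration:
  TMAX_2m_syn:
    durations: [12h, 24h]     # aggregate the maximum temperature over 12h and 24h
    fallback: TMP_2m_syn      # fall back to TMP if TMAX itself is missing
instant:
  TMP_2m_syn:
    duration: 30min           # always has this duration; gaps are filled from nearby values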
- information about the value range of elements (lower/upper boundaries)
- also: which values to include or exclude out of that range (extra/exclude)
- the extra column is a list of values which will always be accepted, even if they are out of range
- exclude is defined as a regular expression ("x" means no excluded values)
- used for audit_obs.py script only
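A hedged sketch of what one entry could look like (the element name, field names and numbers are illustrative only):
TMP_2m_syn:
  lower: -80         # lower boundary of the accepted value range
  upper: 60          # upper boundary of the accepted value range
  extra: [999]       # values accepted even though they lie outside the range
  exclude: x         # regular expression; "x" means no excluded values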
- needs to be copied to "config/" in order to be recognized by the python scripts
- main configuration file template with the following sections:
general:
- most general settings which will be overwritten by all following configs
- order of priorities: general -> class -> script -> command line arguments
database:
- default configuration for the main database (usually when DatabaseClass is called for main.db)
bufr:
- default configuration for the BufrClass, higher priority than "general:" but lower than script config
obs:
- default configuration for the ObsClass, higher priority than "general:" but lower than script config
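As a rough illustration of that layout (the individual setting names below are placeholders, not the actual keys of the template):
general:
  verbose: False          # lowest priority; overwritten by class, script and command line settings
database:
  timeout: 5              # defaults used when DatabaseClass is called for main.db
bufr:
  approach: gt            # defaults for the BufrClass
obs:
  output: /path/to/obs    # defaults for the ObsClass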
- needs to be copied to "config/" in order to be recognized by the python scripts
- just change the settings of all scripts to your desire in here
- sections/keys are always the FULL script name (with .py); see the sketch after the list below
- special script configuration entries in detail:
decode_bufr.py:
- TODO
forge_obs.py:
- TODO
reduce_obs.py:
- TODO
derive_obs.py:
- TODO
aggregate_obs.py:
- TODO
audit_obs.py:
- TODO
empty_obs.py:
- TODO
get_obs.py:
- TODO
get_imgw.py:
- TODO
get_knmi.py:
- TODO
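A sketch of an assumed entry in this scripts configuration (the setting names are placeholder assumptions; only the full-script-name keys follow the documented scheme):
decode_bufr.py:
  approach: gt            # example setting applying to this script only
  verbose: True
aggregate_obs.py:
  timeout: 5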
- needs to be copied to "config/" in order to be recognized by the python scripts
- define all source-specific settings in here
- needs to be copied to "config/" in order to be recognized by the python scripts
- define blockNumber, stationIdentifier and station types (str) for different clusters
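A purely illustrative sketch of a cluster entry (the real key names and nesting may differ):
germany:
  blockNumber: 10                   # WMO block number of the cluster's stations
  stationIdentifier: "10381,10384"  # hypothetical station selection
  type: synop                       # station type as string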
- BUFR key translations for the different approaches
- translation for the legacy metwatch element names
- translation for element names of Polish weather service Open Data
- use this naming scheme if you want to add your own custom source translation files
- definition of positional and flag (e.g. -v/--verbose) command line arguments
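The exact schema is not reproduced here; as a purely hypothetical illustration, a flag definition might look roughly like this:
verbose:
  short: v                # would map to -v/--verbose
  action: store_true
  help: print (more) verbose output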
- definition of the table structure for the location/station databases
- the syntax is very SQL-like but simpler than a real .sql file
- different mode and stage combinations all need to be present if you add custom modes/stages
Export your custom BUFR table paths to the local and conda environment variables.
Export conda environment information to "environment.yml", skipping only the "path:" and "variables:" sections because they depend on the local system. Then create a "requirement.txt" file which contains all packages needed to successfully run the python scripts.
Install the repository using conda and prepare everything to get started immediately. The script creates the "obs" environment, installs all needed packages and sets the right environment variables.
This script starts the decode_bufr.py script multiple times, so you can process a large number of files much faster.
NOTE: You have to calculate manually how many files each instance of the script should process and define "max_files:" accordingly in the script config's "decode_bufr.py:" section.
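For example, if a source directory contains about 800 files and you start 8 instances, each instance should handle roughly 800 / 8 = 100 files, so the scripts configuration would contain something like:
decode_bufr.py:
  max_files: 100    # 800 files split across 8 parallel instances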
- set BUFR decoding approach (default: gt)
- number of processes to use (start decode_bufr.py N times)
- sleep time between script executions (wait N seconds before starting the next instance)
./multi_decode_bufr.sh 8 ex 2