Notice: This ETL is no longer used - active development has moved to https://github.com/klahnakoski/Bugzilla-ETL.
A set of Pentaho DI jobs that extract bug versions from a bugzilla database and store them in an elasticsearch index. This ETL drives dashboards on BMO (bugzilla.mozilla.org) data for various teams at Mozilla Corporation.
- an elasticsearch cluster where you can CRUD the index `bugs`
- a working PDI (a.k.a. kettle) installation (the free community edition should work fine). Tested with PDI CE 4.3
- Clone this project into a local directory
- Configure the elasticsearch indexes (put a cluster node in place of `localhost`):
  - Optionally, clean out previous indexes: `curl -XDELETE 'http://localhost:9200/bugs'`
  - Initialize the elasticsearch mappings: `curl -XPOST 'http://localhost:9200/bugs' --data @configuration/es/bug_version.json`
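As a minimal sketch, the two index commands above can be wrapped so the cluster node is set in one place. The variable names `ES_HOST`, `ES_PORT`, `ES_URL` and the function `reset_bugs_index` are hypothetical and not part of this project; only the two `curl` calls come from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: ES_HOST, ES_PORT, ES_URL and reset_bugs_index are
# illustrative names, not something this repository provides.
ES_HOST="${ES_HOST:-localhost}"   # put a node of your cluster here
ES_PORT="${ES_PORT:-9200}"
ES_URL="http://${ES_HOST}:${ES_PORT}/bugs"

reset_bugs_index() {
    # Optional step: remove any previous index.
    curl -XDELETE "$ES_URL"
    # Create the index with the mappings shipped in this repository.
    # The relative path assumes you run this from the project root.
    curl -XPOST "$ES_URL" --data @configuration/es/bug_version.json
}
```

Nothing is sent until `reset_bugs_index` is called, so the file can be sourced and `ES_URL` checked first. Note that the function deletes the existing `bugs` index.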
- Configure Pentaho DI:
  - add a directory `.kettle` in your `$KETTLE_HOME`
  - there, create a file `kettle.properties`
  - in that file, add settings for `bugs_db_host`, `bugs_db_port`, `bugs_db_user`, `bugs_db_pass` and `bugs_db_name` for your bugzilla-database connection
  - add settings for `ES_NODES`, `ES_CLUSTER`, `ES_INDEX`
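The Pentaho DI configuration above could be scripted roughly as follows. This is a sketch under assumptions: the function name `write_kettle_properties` is hypothetical, every value is a placeholder to replace with your own, and the expected format of `ES_NODES` in particular is a guess. Only the key names come from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: writes a kettle.properties skeleton under <dir>/.kettle.
# All values are placeholders; only the key names come from this README.
write_kettle_properties() {
    target_dir="$1/.kettle"
    mkdir -p "$target_dir"
    # Refuse to overwrite an existing configuration.
    if [ -e "$target_dir/kettle.properties" ]; then
        echo "kettle.properties already exists, not overwriting" >&2
        return 1
    fi
    cat > "$target_dir/kettle.properties" <<'EOF'
bugs_db_host=db.example.com
bugs_db_port=3306
bugs_db_user=bugs_reader
bugs_db_pass=changeme
bugs_db_name=bugs
ES_NODES=localhost
ES_CLUSTER=elasticsearch
ES_INDEX=bugs
EOF
}
```

For example, `write_kettle_properties "$KETTLE_HOME"` creates the file once; edit the values afterwards to match your database and cluster.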
- If necessary, modify `bin/import_bugs.sh`, then run it to import the full data set.
- Later on, use `bin/update_bugs_incr.sh` to read incremental modifications from the MySQL database.
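The two import steps above can be tied together with a small wrapper. This is only a sketch: `etl_script_for` and `run_etl` are hypothetical names, and it assumes both scripts are executable, run from the project root, and need no arguments, none of which is confirmed here.

```shell
#!/bin/sh
# Hypothetical wrapper; assumes the scripts are executable, take no
# arguments, and are started from the project root.
etl_script_for() {
    case "$1" in
        full)        echo "bin/import_bugs.sh" ;;
        incremental) echo "bin/update_bugs_incr.sh" ;;
        *)           echo "usage: run_etl full|incremental" >&2; return 2 ;;
    esac
}

run_etl() {
    # Resolve the script for the requested mode, then run it.
    script=$(etl_script_for "$1") || return
    "./$script"
}
```

With this in place, `run_etl full` would be run once for the initial load, and `run_etl incremental` afterwards, for instance from a scheduler.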
- Some cases where a user's bugzilla ID changes mid-history for a bug can't be handled automatically, and should be added to `configuration/kettle/bugzilla_aliases.txt`. There are several alias-related scripts and transformations that help to detect these types of changes. See `bin/find_aliases.sh`, `bin/find_all_aliases.sh`, `transformations/find_aliases.ktr`, and `transformations/detect_new_aliases.ktr`.
- Mozilla Bug 804946 causes some trouble with the ETL. See Bug 804961 for details.