BibHelioTech is a program for recognition of temporal expressions and entities (satellites, instruments, regions) extracted from scientific articles in the field of heliophysics.
It was developed for IRAP (Institut de Recherche en Astrophysique et Planétologie, CNRS) in Toulouse.
Its main purpose is to retrieve this information, which is not currently available digitally, and to allow its visualisation on AMDA (http://amda.irap.omp.eu/).
STEP 1: install all dependencies
In your shell, run: pip install -r requirements.txt
Don't forget to install the SUTime Java dependencies; more details at: https://pypi.org/project/sutime/
Put the "english.sutime.txt" file under the SUTime install directory, in jars/stanford-corenlp-4.0.0-models.jar/edu/stanford/nlp/models/sutime/
STEP 2: Tesseract 5 installation (Ubuntu example)
sudo apt update
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt update
sudo apt install -y tesseract-ocr
tesseract --version
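The version check in the last command can also be scripted. The following is a minimal sketch, not part of BibHelioTech: the helper names parse_tesseract_major and installed_tesseract_major are hypothetical, and it assumes that the report printed by "tesseract --version" starts with a line such as "tesseract 5.3.0".

```python
import re
import shutil
import subprocess


def parse_tesseract_major(version_output):
    """Return the major version found in `tesseract --version` output, or None."""
    # Assumed format: a line such as "tesseract 5.3.0" or "tesseract v5.0.0".
    match = re.search(r"tesseract\s+v?(\d+)", version_output)
    return int(match.group(1)) if match else None


def installed_tesseract_major():
    """Run the `tesseract` found on PATH and return its major version, or None."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Read both streams, since the report may go to either one.
    return parse_tesseract_major(result.stdout + result.stderr)


if __name__ == "__main__":
    major = installed_tesseract_major()
    if major is None:
        print("tesseract was not found on PATH")
    elif major < 5:
        print(f"tesseract {major} found, but version 5 is expected")
    else:
        print(f"tesseract {major} found")
```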
STEP 3: GROBID installation
Install GROBID under ../
Follow the installation instructions at: https://grobid.readthedocs.io/en/latest/Install-Grobid/
Make sure that JVM 8 is the default Java version!
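To check which Java version is used by default, a small script can read the output of "java -version". This is a minimal sketch, not part of BibHelioTech or GROBID: the helper names parse_java_major and default_java_major are hypothetical, and the parsing assumes the usual report format, where Java 8 identifies itself as "1.8.0_x" and later releases as "11.0.2", "17", and so on.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier use the "1.x" scheme, so the major version is x.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def default_java_major():
    """Run the `java` found on PATH and return its major version, or None."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # `java -version` writes its report to stderr; stdout is read as a fallback.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = default_java_major()
    if major is None:
        print("java was not found on PATH")
    elif major != 8:
        print(f"Default Java is version {major}; JVM 8 is expected")
    else:
        print("Default Java is version 8")
```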
STEP 4: GROBID Python client installation
Install the GROBID Python client under ../
Follow the installation instructions at: https://github.com/kermitt2/grobid_client_python
Put heliophysics articles in PDF format under BibHelio_Tech/DATA/Papers.
Then run "MAIN.py".
Optionally, if you want AMDA catalogues per satellite,
run "SATS_catalogue_generator.py" as well.
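Before launching the pipeline, it can be useful to confirm that the input folder actually contains PDF files. This is a minimal sketch, not part of BibHelioTech: the helper name list_pdfs is hypothetical, and the path is assumed to be relative to the directory the script is run from.

```python
from pathlib import Path


def list_pdfs(papers_dir):
    """Return the sorted PDF files directly inside papers_dir (empty if missing)."""
    folder = Path(papers_dir)
    if not folder.is_dir():
        return []
    # Match the extension case-insensitively so "Paper.PDF" is also found.
    return sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # Assumption: run from the directory that contains BibHelio_Tech.
    pdfs = list_pdfs("BibHelio_Tech/DATA/Papers")
    if not pdfs:
        print("No PDF found under BibHelio_Tech/DATA/Papers")
    for pdf in pdfs:
        print(pdf.name)
```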
If you use or contribute to BibHelio_Tech, you agree to use it or share your contribution following this license.
[Axel Dablanc]: axel.alain.dablanc@gmail.com
[Vincent Génot]: vincent.genot@irap.omp.eu