- texter will
  - check the `is_texted` flag in the DB to determine whether a stored PDF document has already been converted to text and inserted into ES
  - download the PDF and save it in the source folder
  - convert the PDF to a text file and save it in the output folder
  - create a JSON file and save it in the json folder
  - insert the JSON data into the ES index
  - delete the PDF, text, and JSON files
  - update the `is_texted` flag in the DB
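The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `texter.py`: the function `process_pending`, the record layout, and the `download`, `pdf_to_text`, and `index` stand-ins are all hypothetical, and a plain list and dicts take the place of the real DB and ES index.

```python
import json
import tempfile
from pathlib import Path


def process_pending(records, download, pdf_to_text, index, workdir):
    """Run the texter steps for every record whose is_texted flag is false.

    records     : list of dicts with "id", "url", "is_texted" (stand-in for the DB)
    download    : callable(url) -> bytes (stand-in for the PDF download)
    pdf_to_text : callable(Path) -> str (stand-in for the PDF converter)
    index       : list collecting documents (stand-in for the ES index)
    workdir     : directory that holds the source, output, and json folders
    """
    workdir = Path(workdir)
    folders = {name: workdir / name for name in ("source", "output", "json")}
    for folder in folders.values():
        folder.mkdir(parents=True, exist_ok=True)

    processed = []
    for record in records:
        if record["is_texted"]:
            continue  # already converted and indexed, nothing to do

        pdf_path = folders["source"] / f"{record['id']}.pdf"
        txt_path = folders["output"] / f"{record['id']}.txt"
        json_path = folders["json"] / f"{record['id']}.json"

        # Download the PDF, convert it to text, then build the JSON document.
        pdf_path.write_bytes(download(record["url"]))
        txt_path.write_text(pdf_to_text(pdf_path), encoding="utf-8")
        doc = {"id": record["id"], "text": txt_path.read_text(encoding="utf-8")}
        json_path.write_text(json.dumps(doc), encoding="utf-8")

        # "Insert" the JSON data into the index stand-in.
        index.append(json.loads(json_path.read_text(encoding="utf-8")))

        # Clear the intermediate files and flip the flag last.
        for path in (pdf_path, txt_path, json_path):
            path.unlink()
        record["is_texted"] = True
        processed.append(record["id"])
    return processed


if __name__ == "__main__":
    records = [
        {"id": 1, "url": "http://example.invalid/a.pdf", "is_texted": False},
        {"id": 2, "url": "http://example.invalid/b.pdf", "is_texted": True},
    ]
    index = []
    with tempfile.TemporaryDirectory() as tmp:
        done = process_pending(
            records,
            download=lambda url: b"%PDF-1.4 placeholder",
            pdf_to_text=lambda path: "hello",
            index=index,
            workdir=tmp,
        )
    print(done, index)
```

Updating the flag only after the insert and cleanup means that a record interrupted midway stays unflagged and is picked up again on the next run.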
- configure the DB, ES, and AP server settings by editing the `configure.ini.templete` file
- rename the configuration file to `configure.ini`
- run `python texter.py`
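As an illustration of how such a file might be read, here is a small sketch using Python's standard `configparser`. The section names (`db`, `es`, `ap`) and all keys and values are assumptions made for this example; the real layout is whatever `configure.ini.templete` defines.

```python
import configparser

# Hypothetical contents; the real sections and keys come from the template file.
EXAMPLE_INI = """
[db]
host = localhost
port = 3306

[es]
host = localhost
port = 9200
index = documents

[ap]
host = localhost
port = 8080
"""

REQUIRED_SECTIONS = ("db", "es", "ap")


def load_settings(text):
    """Parse INI text and return one dict of string settings per section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    return {name: dict(parser[name]) for name in REQUIRED_SECTIONS}


if __name__ == "__main__":
    settings = load_settings(EXAMPLE_INI)
    print(settings["es"]["host"], settings["es"]["port"])
```

To read the renamed file from disk instead, `parser.read("configure.ini")` can replace `read_string`. Note that `configparser` returns every value as a string, so ports need an explicit `int(...)`.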
- If the DB holds a large number of PDFs, a run may take a long time to finish.
- If a PDF contains non-text content (e.g. graphs, shapes, tables), text conversion may fail.
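One way to cope with both caveats is to cap the number of PDFs handled per run and to isolate conversion failures so that a single bad file does not stop the batch. The sketch below is only an assumption about how that could look; `convert_batch`, its `limit` parameter, and the `convert` callable are hypothetical and not part of texter.

```python
def convert_batch(pdf_ids, convert, limit=None):
    """Convert up to `limit` PDFs, collecting failures instead of aborting.

    pdf_ids : list of identifiers for PDFs that still need converting
    convert : callable(pdf_id) -> str, may raise on PDFs it cannot handle
    limit   : maximum number of PDFs to process in this run (None = all)
    """
    converted, failed = [], []
    for pdf_id in pdf_ids[:limit]:
        try:
            converted.append((pdf_id, convert(pdf_id)))
        except Exception as exc:  # keep going; record the reason for later review
            failed.append((pdf_id, str(exc)))
    return converted, failed


if __name__ == "__main__":
    def fake_convert(pdf_id):
        # Stand-in converter: pretend PDF 2 contains only graphics.
        if pdf_id == 2:
            raise ValueError("no extractable text")
        return f"text of {pdf_id}"

    ok, bad = convert_batch([1, 2, 3, 4], fake_convert, limit=3)
    print(ok)
    print(bad)
```

Leaving failed PDFs unflagged would let a later run retry them, at the cost of retrying files that can never be converted.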
- Yusuke Hirata