content-extractor

Simple Apache Tika and Tesseract OCR powered file content extractor service. Service runs on port 5000 by default and has two endpoints:

post_file_list - consumes POST request with one or more files
post_json_list - consumes POST request with JSON payload which contains URL's to files needed to be downloaded and parsed

To run it on local machine Python 3+, Java 8+, Tesseract and Norwegian language support for Tesseract needed.

All configuration options passes via environmental variables.

TIKA_VERSION = 1.20 (required)
UPLOAD_URL - where to upload parsed data (only for post_json_list endpoint). All files will preserve their names with ".txt" added
FILE_URL - which data entity attribute contains url to file
FILE_NAME - which data entity attribute contains name of file
LOG_LEVEL - logging level ("INFO" by default)
FAIL_ON_ERROR - If need fail fast on parsing or other processing errors or not. True by default
TESSERACT_OCR_LANG - Language(s) to be used by tesseract OCR (need to be installed)
PDF_EXTRACT_INLINE_IMG instruct Tika to retrieve text from images in PDF file true/fals, true by default
PRESERVE_FILE_TYPE if resulted file name will carry information about original file type or not, False by default

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
service		service
utils		utils
.gitignore		.gitignore
.travis.yml		.travis.yml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
service.py		service.py

Provide feedback