Skip to content

timurgen/content-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

53 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build Status

content-extractor

Simple Apache Tika and Tesseract OCR powered file content extractor service. Service runs on port 5000 by default and has two endpoints:

  • post_file_list - consumes POST request with one or more files
  • post_json_list - consumes POST request with JSON payload which contains URL's to files needed to be downloaded and parsed

To run it on local machine Python 3+, Java 8+, Tesseract and Norwegian language support for Tesseract needed.

configuration options:

All configuration options passes via environmental variables.

  • TIKA_VERSION = 1.20 (required)
  • UPLOAD_URL - where to upload parsed data (only for post_json_list endpoint). All files will preserve their names with ".txt" added
  • FILE_URL - which data entity attribute contains url to file
  • FILE_NAME - which data entity attribute contains name of file
  • LOG_LEVEL - logging level ("INFO" by default)
  • FAIL_ON_ERROR - If need fail fast on parsing or other processing errors or not. True by default
  • TESSERACT_OCR_LANG - Language(s) to be used by tesseract OCR (need to be installed)
  • PDF_EXTRACT_INLINE_IMG instruct Tika to retrieve text from images in PDF file true/fals, true by default
  • PRESERVE_FILE_TYPE if resulted file name will carry information about original file type or not, False by default

About

No description or website provided.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published