Spreadsheet binary object extraction #303

ruebot · 2019-01-31T19:50:33Z

Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue focuses specifically on xls, xlsx, ods, and csv.

There may be some tweaks to this depending on the outcome of #298.


jrwiebe · 2019-02-13T22:22:49Z

Putting this here for my reference; feedback is welcome.

These are the spreadsheet MIME types Tika will identify. The Mozilla MIME type list was also consulted. Unless there are objections I think I'll extract all of these, including templates and MS Works spreadsheets.

Excel

application/vnd.ms-excel
application/vnd.ms-excel.workspace.3
application/vnd.ms-excel.workspace.4
application/vnd.ms-excel.sheet.2
application/vnd.ms-excel.sheet.3
application/vnd.ms-excel.sheet.4
application/vnd.ms-excel.addin.macroenabled.12
application/vnd.ms-excel.sheet.binary.macroenabled.12
application/vnd.ms-excel.sheet.macroenabled.12
application/vnd.ms-excel.template.macroenabled.12
application/vnd.ms-spreadsheetml

Open Office

application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/x-vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet
application/x-vnd.oasis.opendocument.spreadsheet

Other

application/x-tika-msworks-spreadsheet
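
As a rough illustration of how the types listed above could be used, here is a minimal Python sketch that gathers them into one lookup set and tests a Tika-reported MIME type against it. The names `SPREADSHEET_MIME_TYPES` and `is_spreadsheet` are hypothetical and do not come from the toolkit; the actual filter may be organized differently.

```python
# Hypothetical helper: the names below are illustrative, not the toolkit's API.
# The set contains exactly the MIME types listed in this comment.
SPREADSHEET_MIME_TYPES = frozenset({
    # Excel
    "application/vnd.ms-excel",
    "application/vnd.ms-excel.workspace.3",
    "application/vnd.ms-excel.workspace.4",
    "application/vnd.ms-excel.sheet.2",
    "application/vnd.ms-excel.sheet.3",
    "application/vnd.ms-excel.sheet.4",
    "application/vnd.ms-excel.addin.macroenabled.12",
    "application/vnd.ms-excel.sheet.binary.macroenabled.12",
    "application/vnd.ms-excel.sheet.macroenabled.12",
    "application/vnd.ms-excel.template.macroenabled.12",
    "application/vnd.ms-spreadsheetml",
    # Open Office (as grouped above)
    "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/x-vnd.oasis.opendocument.spreadsheet-template",
    "application/vnd.oasis.opendocument.spreadsheet-template",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/x-vnd.oasis.opendocument.spreadsheet",
    # Other
    "application/x-tika-msworks-spreadsheet",
})


def is_spreadsheet(tika_mime_type: str) -> bool:
    """Return True if the Tika-detected MIME type is one of the listed types."""
    # Normalize case and surrounding whitespace before the lookup.
    return tika_mime_type.strip().lower() in SPREADSHEET_MIME_TYPES
```

CSV is deliberately left out of the set, since it needs the separate check described below.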

CSV

Currently Tika only detects CSV if the parser is given a filename with the extension "CSV", although byte-based detection might be coming (TIKA-2826). I'll detect CSV by looking at the URL extension and checking if getMimeType() == "text/csv".
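
A minimal Python sketch of that CSV check, for reference. It assumes both conditions must hold (the URL path ends in ".csv" and the server-reported MIME type is "text/csv"); the comment above could also be read as either condition being enough. The function name `looks_like_csv` and its parameters are hypothetical, with `server_mime_type` standing in for the value returned by getMimeType().

```python
from urllib.parse import urlparse


def looks_like_csv(url: str, server_mime_type: str) -> bool:
    """Heuristic CSV check: URL extension plus server-reported MIME type.

    Assumption: both conditions are required. Swap `and` for `or` if a
    looser match is wanted.
    """
    # Look only at the path so query strings and fragments are ignored.
    path = urlparse(url).path
    has_csv_extension = path.lower().endswith(".csv")

    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = server_mime_type.split(";", 1)[0].strip().lower()

    return has_csv_extension and mime == "text/csv"
```

For example, `looks_like_csv("http://example.com/report.csv", "text/html")` is rejected under this reading because the reported MIME type does not match.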

jrwiebe · 2019-08-02T03:48:59Z

MIME type references for #303, #304, #305, #306, #307:

That should do it.

- Add WordProcessor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixture for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Back out 39831c2 (We _might_ not have to do this)

@jrwiebe

- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe, see #346 for all commits before squash


Add binary extraction DataFrames to PySpark.

- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
- Resolves #350
- Update README

ruebot added enhancement Scala feature DataFrames labels Jan 31, 2019

ruebot self-assigned this Aug 14, 2019

ruebot mentioned this issue Aug 15, 2019

Add office document binary extraction. #346

Merged

ianmilligan1 closed this as completed in #346 Aug 16, 2019

ruebot added a commit that referenced this issue Aug 20, 2019

Add binary extraction DataFrames to PySpark.

1176fd5

- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307

ruebot mentioned this issue Aug 20, 2019

Add binary extraction DataFrames to PySpark. #350

Merged
