Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify support other file formats, e.g. Xslx #466

Closed
hallvard opened this issue Feb 20, 2019 · 5 comments
Closed

Simplify support other file formats, e.g. Xslx #466

hallvard opened this issue Feb 20, 2019 · 5 comments

Comments

@hallvard
Copy link
Contributor

I'm looking into supporting Excel files (xslx) by using the Apache POI project. I would like to reuse some of the logic from the corresponding csv code, like CsvReadOptions and its Builder class, but that's pretty difficult. First, the CsvReadOptions has a private constructor (called by its Builder), so you cannot subclass it. Second, some options don't make sense for a format that's not line-oriented, e.g. separator and lineEnding. In addition, the customParser method in ColumnType takes an CsvReadOptions instance as argument, so it needs to be subclassed.

I propose making an AbstractReadOptions class that includes the things that other formats are most likely to need, and that is easy to subclass. And update the customParser method in ColumnType to use that class as argument instead.

@benmccann
Copy link
Collaborator

It sounds to me like you're looking at an old version of the code because we recently made some of these changes to support adding the JSON and fixed-width file readers. Can you sync your repo against the latest code from master?

@hallvard
Copy link
Contributor Author

hallvard commented Feb 21, 2019

Yes, you're right! It seems like the kind of changes I suggested are in the newest version. My plan is to create a new maven module with the XslxReader and XslxReadOptions in a tech.tablesaw.io.xslx package, so it needn't be part of the core. Then integrate it into the build and create a pull request. Sounds OK? Should I create a separate issue first?

@benmccann
Copy link
Collaborator

No need to create a separate issue

I'm not sure the best API for optional dependencies would be or how to best include them. We should probably follow the same pattern for the JSON, HTML, and CSV dependencies as well.

For including, we have the Table.smile() optional dependency that makes use of Maven's optional attribute. There's also the option to make a separate Maven dependency as you suggested.

For API, most stuff is done today via Table.read().csv(...), Table.read().json(...), Table.read().html(...), etc. If we have a separate Maven module, it'll have to be a bit different and we'll need to do something like new XslxReader().read(...). I'm curious to learn more about how Spark handles this because I think their API is similar to ours, but they have the ability to register new readers as plugins. I'm not quite sure how that works, but I'd be interested to learn more before we make a decision

@benmccann
Copy link
Collaborator

I think I'd probably lean towards including the POI dependency as an optional dependency like the Smile dependency and creating the reader in the main module alongside the json, csv, html readers

@lwhite1
Copy link
Collaborator

lwhite1 commented Feb 22, 2019 via email

@hallvard hallvard mentioned this issue Feb 23, 2019
1 task
lwhite1 pushed a commit that referenced this issue Mar 3, 2019
* Initial xlsx support, as discussed in issue #466

* Addresses some Codacy issues.

* Removed imports and improved assert message.

* Moved test file(s) to data folder, to see if that fixes travis build.

* Add missing value, when cell value isn't appended

* Improved test messages.

* Improved assert messages, added test for file with missing values.
Fixed problem with mapInto for missing value, shouldn't call function.

* Removed comments and replaced tabs with spaces.

* Replaced tabs with spaces (fixed Eclipse settings), improved code
according to comments.

* Removed final (and fixed my setup to not add them), and made other
changes according to comments.

* More improvements.

* Replaced use of int[] with private static class TableRange, removed use
of SHORT column type, replaced check prefix with assert for test helper
methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants