Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arff deserialiser #4627

Open
wants to merge 21 commits into
base: develop
Choose a base branch
from
Open

arff deserialiser #4627

wants to merge 21 commits into from

Conversation

gf712
Copy link
Member

@gf712 gf712 commented May 2, 2019

Simple ARFF deserialiser. See the spec here.

@karlnapf shall I add a ARFF file as an example and a unittest/meta example?

TODO:

  • datetime format
  • string
  • handle missing data

@gf712 gf712 force-pushed the arff_format branch 2 times, most recently from 714905e to 192281a Compare May 2, 2019 14:59
}
for (auto iter = line.begin(); iter != line.end(); ++iter)
{
if (iter->front() == '\'' || iter->front() == '\"')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vigsterkr does stuff like this work independent of platform?

std::vector<std::string> elems;
auto innner_string =
m_current_line.substr(strlen(m_attribute_string));
split(innner_string, " ,\t\r\f\v", std::back_inserter(elems));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vigsterkr and this?

@gf712 gf712 force-pushed the arff_format branch 3 times, most recently from 1667862 to a0a778e Compare May 2, 2019 20:47
@gf712
Copy link
Member Author

gf712 commented May 3, 2019

@karlnapf how does shogun handle missing data values? It is possible to have NaN values ("?" in arff files) ?

@karlnapf
Copy link
Member

karlnapf commented May 5, 2019

@karlnapf shall I add a ARFF file as an example and a unittest/meta example?

Yes you can add it under the toy data in data/toy for examples.
For unit test, I would rather not base it on a file ... although there seems to be no other sensible way or?

@karlnapf
Copy link
Member

karlnapf commented May 5, 2019

@karlnapf how does shogun handle missing data values? It is possible to have NaN values ("?" in arff files) ?

idk to be honest :D we might have to think/work a bit for that

@karlnapf
Copy link
Member

karlnapf commented May 5, 2019

Nice!

@gf712
Copy link
Member Author

gf712 commented May 8, 2019

For unit test, I would rather not base it on a file ... although there seems to be no other sensible way or?

For unittest I could try and use strings? I can then feed the stream directly, rather than a file. Would just have to refactor the class a bit though by adding a constructor for streams rather than file names

@karlnapf
Copy link
Member

karlnapf commented May 8, 2019

I think that’d be a nice feature anyways or?

@gf712
Copy link
Member Author

gf712 commented May 8, 2019

Yes, I need it for openml stuff!

@gf712 gf712 force-pushed the arff_format branch 3 times, most recently from 5ccd879 to 6b855f6 Compare May 9, 2019 20:04
@gf712
Copy link
Member Author

gf712 commented May 9, 2019

So datetime somewhat works, but might have to add strptime definition for the windows build? (see https://stackoverflow.com/a/321877). The conversion to the c++ convention of datetime format needs to be tested more too. I convert dates to unix timestamps btw, because it is the only representation that makes sense to me that can be stored in a numeric matrix

@gf712
Copy link
Member Author

gf712 commented May 9, 2019

@karlnapf I am not sure how to do the parsing of strings though, for that we would almost need a SGMatrix<variant>

@karlnapf
Copy link
Member

So datetime somewhat works, but might have to add strptime definition for the windows build? (see https://stackoverflow.com/a/321877). The conversion to the c++ convention of datetime format needs to be tested more too. I convert dates to unix timestamps btw, because it is the only representation that makes sense to me that can be stored in a numeric matrix

Great. Ill check it out

@karlnapf
Copy link
Member

karlnapf commented May 10, 2019

@karlnapf I am not sure how to do the parsing of strings though, for that we would almost need a SGMatrix<variant>

Uh that will be tricky with parameter framework. Also SG* types are meant as mere (continuous!) memory wrappers around basic types, not the fancy stuff they are abused to do.

So one thing might be to think about a matrix-like structure to handle non-primitives. I mean std::vector is fine with the param framework

We have started using variant in the pipeline...although that causes all sort of headaches (needs an override clone/equals, but no idea how to make put/get work atm), but that might be fixed in any.h. Maybe a good point to start thinking about how to handle variant .... @vigsterkr ?

@gf712 gf712 force-pushed the arff_format branch 3 times, most recently from 583a724 to 101ef8d Compare May 13, 2019 16:17
@gf712
Copy link
Member Author

gf712 commented May 13, 2019

@karlnapf @vigsterkr so the return type is now CCombinedFeatures, and I use std::vector<variant> internally.
Current features:

  1. handle numeric data with CDenseFeatures
  2. handle dates (excluding timezone name parsing because it requires libtz from the date library to be compiled into a library)
  3. handle strings with StringList and then StringFeatures

The missing data can be ignored for now I think, and just be skipped by the parser, and then one day we can add imputers.
The relational attributes I still need to look into.
And then I'll add a meta example too and that's it I think. @karlnapf can you think of an interesting dataset that shogun could handle?
At some point I can add a ARFFWriter, but it isn't really needed right now, and can be done once the new serialisation stuff is ready.

@gf712
Copy link
Member Author

gf712 commented May 13, 2019

also the error seems to be due to AveragedPerceptron? I am guessing there's some UB going on

@gf712 gf712 force-pushed the arff_format branch 2 times, most recently from 4174168 to ada3794 Compare May 14, 2019 13:47
src/shogun/io/ARFFFile.h Outdated Show resolved Hide resolved
class ARFFSerializer
{
public:
ARFFSerializer(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlnapf I had some time so I wrote this (pretty bad) serialiser.. Would it make sense to add a visitor pattern to SGObjects? Or at least for features?

@gf712 gf712 force-pushed the arff_format branch 2 times, most recently from 977b9f8 to 63015ec Compare May 23, 2019 07:11
switched to shared_ptr, std::variant, etc..
@stale
Copy link

stale bot commented Aug 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants