Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serializers convert floats to strings #4681

Open
david-cortes opened this issue Oct 14, 2021 · 8 comments
Open

Serializers convert floats to strings #4681

david-cortes opened this issue Oct 14, 2021 · 8 comments

Comments

@david-cortes
Copy link
Contributor

david-cortes commented Oct 14, 2021

In the R and C interfaces of LightGBM, the serialization functions convert floating point numbers to text representations in decimal format in order to save and load models (e.g. LGBM_BoosterSaveModel and LGBM_BoosterSaveModelToString). This loses precision in floating point numbers and can lead to small differences in predictions between a model used right after fitting it and a model that was loaded from a saved file or string.

As an additional problem, serializing through this system also means that it's not possible to know the size that the serialized bytes will have beforehand (that is, the model needs to be serialized in order to know how long will the buffer than holds need to be), which makes serialization in the C interface less efficient.

@jameslamb
Copy link
Collaborator

Thanks for writing this up!

Can you please provide some evidence for the problem you're talking about?

  • how did you notice this? (e.g. is there some code you can provide that reproduces the problem?)
  • are you able to point to specific lines in the functions you referenced, like LGBM_BoosterSaveModel, where you think this precision loss is happening?

Without that type of information, maintainers here will have to put some effort into figuring out things that I suspect you already know. Sharing such information could help us get to a fix more quickly.

@david-cortes
Copy link
Contributor Author

You can confirm it in lines like these:

str_buf << "threshold="

Alternatively, you can save the model to a file and inspect it in a text editor to see that values are saved in a decimal representation.

@jameslamb
Copy link
Collaborator

Thanks! We'd need to test this, but do you think this issue could be the cause of #4680?

@david-cortes
Copy link
Contributor Author

Don't know. Could be, if lgb.Dataset does the same thing, but haven't looked at its code. Although in that case, if it is only failing for windows, you could perhaps try playing with mingw's __USE_MINGW_ANSI_STDIO.

@jameslamb
Copy link
Collaborator

got it, thanks for the tip!

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Jan 8, 2022

Do you think we need a binary serialization format in addition to current text one?

Related: #4217.

There was unsuccessful attempt to adopt protobuf format many years ago: #372, #908 (reverted from master later).

Also related, our friends at XGBoost are trying to adopt Universal Binary JSON format as a binary serialization format: dmlc/xgboost#7545.

@david-cortes
Copy link
Contributor Author

I think it'd be a good addition, since apart from ruling out any potential case of mismatches between fresh and deserialized models, it would imply decreased memory usage and faster serialization/deserialization (especially important for the R interface since it now keeps a serialized copy by default).

However, I think it'd be enough with simply using a custom format by memcpying arrays and struct fields to a char* pointer. There's also the "cereal" library which auto-generates code for doing that using ostreams (like std::stringstream), but such an approach would have the downside of losing compatibility with earlier and future library versions with a different model structure.

@trivialfis
Copy link

but such an approach would have the downside of losing compatibility with earlier and future library versions with a different model structure.

It might be desirable to have a schema otherwise it can be difficult to debug compatibility issues in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants