Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve alternative JSON support #171

Closed
vyasr opened this issue Apr 1, 2019 · 12 comments
Closed

Improve alternative JSON support #171

vyasr opened this issue Apr 1, 2019 · 12 comments
Labels
enhancement New feature or request

Comments

@vyasr
Copy link
Contributor

vyasr commented Apr 1, 2019

Feature description

I'm opening this issue to track our discussion about using JSON libraries other than the built-in JSON library. Currently, signac supports rapidjson internally, but we should investigate carefully to determine which JSON libraries are the best alternatives.

Proposed solution

As a first step, we would want to add benchmarks for the existing support so that we can assess further changes.

Additional context

Any alternative solutions you might have considered or other information that might be helpful to understand this feature request.

@bdice
Copy link
Member

bdice commented May 28, 2019

This article is from 2015 but discusses some alternative libraries.
I ran the benchmarks from this on my Mac: https://github.com/akrylysov/python-json-benchmark

I also investigated newer alternatives (simdjson, hyperjson) but found none that were stable, simple to install, and matched the json API.

The results below suggest to me that ujson should be our library of choice: it's closer to json API than rapidjson, performs comparably for large objects, and outperforms rapidjson for small objects.

(note: In my understanding, simplejson is basically the same as the standard library json, but it's on PyPI and is kept more up to date)

Results

loads (large obj)

json 0.354s
simplejson 0.364s
ujson 0.421s
rapidjson 0.435s

dumps (large obj)

json 0.335s
simplejson 0.346s
ujson 0.213s
rapidjson 0.215s

loads (small objs)

json 0.449s
simplejson 0.571s
ujson 0.165s
rapidjson 0.341s

dumps (small objs)

json 0.624s
simplejson 1.073s
ujson 0.174s
rapidjson 0.277s

@csadorf
Copy link
Contributor

csadorf commented Jun 7, 2019

Looks like ujson is the right choice. I think moving forward we should probably introduce a configuration that enables the user to specify which library to use instead of just relying on "what can we import".

@dgasmith
Copy link

You should also check out orjson depending on use case.

@bdice
Copy link
Member

bdice commented Jun 28, 2019

@dgasmith Good call! Here is an updated plot. I recommend that we choose the optimal case for the "small" data, which has a size range that is similar to most signac statepoints and job documents. It's probably a choice between orjson and ujson.

image

@csadorf csadorf added the enhancement New feature or request label Jul 5, 2019
@b-butler b-butler modified the milestones: v1.1.1, v1.2.0 Jul 5, 2019
@csadorf csadorf modified the milestones: v1.2.0, v1.3.0 Jul 5, 2019
@b-butler b-butler modified the milestones: v1.3.0, v1.4.0 Jul 5, 2019
@csadorf
Copy link
Contributor

csadorf commented Jul 26, 2019

Another project to watch with respect to this issue: https://github.com/lemire/simdjson

@bdice
Copy link
Member

bdice commented Jul 26, 2019

My benchmark code doesn't work for simdjson. I'd prefer to stick to packages that explicitly implement all of the features in json that we use. The error is described in TkTech/pysimdjson#20

Other possible compatibility issues to watch out for / test carefully:

Numerical precision

>>> import math, json, ujson
>>> json.dumps(math.pi)
'3.141592653589793'
>>> ujson.dumps(math.pi)
'3.1415926536'

(This is configurable.)

Non-ASCII Characters

>>> import json, orjson
>>> json.dumps('好')
'"\\u597d"'
>>> orjson.dumps('好')
b'"\xe5\xa5\xbd"'

Datetime objects

>>> import datetime, json, orjson
>>> json.dumps(datetime.datetime.now())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bdice/miniconda3/envs/dice/lib/python3.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Users/bdice/miniconda3/envs/dice/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/bdice/miniconda3/envs/dice/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/bdice/miniconda3/envs/dice/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
>>> orjson.dumps(datetime.datetime.now())
b'"2019-07-26T10:29:09.710778"'

The orjson README also carefully documents other issues around 53-bit vs. 64-bit integers. It notes that for floating-point numbers, "ujson is inaccurate in both serialization and deserialization."

@csadorf
Copy link
Contributor

csadorf commented Jul 26, 2019

Oh, that is good to know! The built-in JSON package will serialize floating point numbers to whatever precision necessary for an accurate representation.

We should probably add unit tests for all of these types before enabling or switching to any other back-end.

@dgasmith
Copy link

Depending on the structure of your data you can also consider msgpack. Very useful for array structures.

@vyasr
Copy link
Contributor Author

vyasr commented Feb 6, 2020

Here are my thoughts on the steps to proceed with this issue, and the relevant concepts

  • We need to document exactly what JSON functionality what we require (essentially json.load, json.loads, and json.dumps right now, but I could imagine this changing).
  • We should write tests of the signac.core.json module that verify this behavior. Some of these tests will have to check the precision to which certain objects are encoded/decoded.
  • We should make the current JSON library used by signac a configurable option. One relatively simple way to enable this would be to just store the currently loaded json module in a module-level variable that could be set by a function json.set_json(MODULE). Another alternative is the use of a config variable in the signac config. I think the latter is probably cleaner in most cases, but then we would have to consider how the specified module would be loaded; in particular would it have to be on the PYTHONPATH? The reason I raise this question is that a use-case I could imagine is writing a small wrapper that uses e.g. simdjson for fast JSON reading but falls back to the native json library for writing.
  • I would suggest exposing the tests to the user in a manner that makes it easy to test different json modules.
  • Our existing rapidjson support should be moved to a separate module that users could configure their installation to use (however we choose from (3)) if they wanted to. However, we would remove the loading by default.
  • We should use the benchmarks that Bradley wrote as a guide for what libraries to provide our own support for.
  • It may be worthwhile to add a section to the documentation that addresses different libraries that we have looked at and the pros/cons of each of them (for users wishing to make this choice for their own projects).

@bdice
Copy link
Member

bdice commented Feb 7, 2020

@vyasr Thanks for documenting this so clearly! Another solution would be to remove "alternative" JSON support and close this issue until/unless users request it. Resolving this issue requires extreme attention to detail.

I believe that compatibility across platforms and locales is of critical importance, and the performance benefits are probably not important enough to most signac users. From some of the documented examples above, I believe that many "alternative" JSON libraries have quirks that prevent me from fully trusting their compatibility in serializing and deserializing floating point values, unicode characters, and special objects like datetimes. Resolving this in the way that @vyasr suggests is a good path forward if we decide it is worth doing, but at this time I do not think supporting alternative JSON backends is worth the significant effort that would be demanded in the form of testing, handling potential incompatibilities, and long-term maintenance.

I vote to deprecate rapidjson support and eventually remove our current implementation. If at some point we want to re-visit this, we can add a configurable option for JSON backends as @vyasr suggests, subject to appropriate scrutiny and testing.

@csadorf
Copy link
Contributor

csadorf commented Feb 7, 2020

I also vote to close this issue for now and deprecate support for rapidjson. The only thing I could see as a reasonable project is to slightly refactor the code before the release of signac 2.0, so that it would be easier to implement that in some future version should there be a demand for it.

@bdice
Copy link
Member

bdice commented Feb 9, 2020

I think there is some consensus on deprecating rapidjson -- I opened #285 to help make concrete steps towards deprecation. I will close this issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants