
deserialization perf improvement #81

Closed
wants to merge 1 commit

Conversation

@zigzago (Contributor) commented Jun 13, 2018

I've run a deserialization performance benchmark with JMH and found that most of the time is spent in LittleEndianInputStream#readUTF. With this PR I see a ~15% improvement.

  • Instead of creating a new decoder each time, cache it in StaticBuffers and reuse it (the main improvement; see the sketch below)
  • There is always an underflow at the end of the decode. If len == 0, skip the compact step, as it is not useful there (this avoids a System.arraycopy and a byte array allocation)
  • LittleEndianInputStream is only used by BsonParser, which is not thread-safe, so it seems reasonable to cache the StaticBuffers instance in the stream and avoid a ThreadLocal lookup for every string deserialization (except the first one)
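
As a rough illustration of the first point, here is a minimal sketch of the decoder-caching idea; the class and method names are illustrative, not the actual bson4jackson/StaticBuffers internals:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not the real StaticBuffers/LittleEndianInputStream code
class CachedUtf8Decoder {
  // One decoder created once and reused per (single-threaded) parser,
  // instead of calling StandardCharsets.UTF_8.newDecoder() for every string
  private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

  String decode(byte[] raw, int len) {
    decoder.reset();
    ByteBuffer in = ByteBuffer.wrap(raw, 0, len);
    // UTF-8 yields at most one char per byte, so len chars is always enough
    CharBuffer out = CharBuffer.allocate(len);
    decoder.decode(in, out, true);
    decoder.flush(out);
    out.flip();
    return out.toString();
  }
}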

HTH

@michel-kraemer (Owner)

Thanks for the PR. It looks very promising. I would like to run some experiments myself before merging it. I'll get back to you as soon as possible.

@michel-kraemer (Owner)

By the way, a 15% improvement is a lot. Do you have a test script or a test dataset so I can try to reproduce the benchmark?

@zigzago (Contributor, Author) commented Jun 15, 2018

My benchmark targets KMongo (CodecRegistryBenchmark and KMongoBenchmark) and is copied from the Jongo benchmark; a rough sketch of the harness shape is below.

You can see in this first result that the "jacksonFriend" score is 123775 with the current bson4jackson version. With the patch I get ~142000.
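
For reference, here is a minimal sketch of what such a JMH harness roughly looks like; this is not the actual KMongo/Jongo benchmark code, and the sample document is made up for illustration:

import java.util.Collections;
import java.util.Map;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import com.fasterxml.jackson.databind.ObjectMapper;
import de.undercouch.bson4jackson.BsonFactory;

@State(Scope.Benchmark)
public class BsonDeserBenchmark {
  private ObjectMapper mapper;
  private byte[] friendAsBytes;

  @Setup
  public void setup() throws Exception {
    BsonFactory fac = new BsonFactory();
    mapper = new ObjectMapper(fac);
    fac.setCodec(mapper);
    // serialize a small sample document once, so the benchmark only measures reads
    friendAsBytes = mapper.writeValueAsBytes(Collections.singletonMap("name", "John"));
  }

  @Benchmark
  public Map<?, ?> parseFriend() throws Exception {
    return mapper.readValue(friendAsBytes, Map.class);
  }
}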

@michel-kraemer (Owner)

Thanks for the test data. I ran my own benchmarks with your test data and saw an improvement of at most 2-3%. Sometimes I didn't see any improvement at all, or it was within the error margin. This is probably due to the way we tested: I didn't use a benchmark framework, I just used a for loop and measured the elapsed milliseconds.

Anyway, I'm still thinking about the implications of the change. Since you're caching StaticBuffers in LittleEndianInputStream, it may behave differently in a multi-threaded environment. In practice, this should not make a difference, since it's an internal class, but you never know who uses it where and for what purpose. So I'm wondering whether this is a breaking change and whether it's worth the improvement (if there is any at all).

Here's my benchmark code so you can reproduce my tests:

@Test
public void parseFriend() throws Exception {
  long start = System.currentTimeMillis();
  for (int i = 0; i < 100000; i++) {
    ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);
    BsonFactory fac = new BsonFactory();
    ObjectMapper mapper = new ObjectMapper(fac);
    fac.setCodec(mapper);
    Map<?, ?> m = mapper.readValue(bais, Map.class);
  }
  System.out.println((System.currentTimeMillis() - start));

  start = System.currentTimeMillis();
  for (int i = 0; i < 100000; i++) {
    ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);
    BsonFactory fac = new BsonFactory();
    ObjectMapper mapper = new ObjectMapper(fac);
    fac.setCodec(mapper);
    Map<?, ?> m = mapper.readValue(bais, Map.class);
  }
  System.out.println((System.currentTimeMillis() - start));

  start = System.currentTimeMillis();
  for (int i = 0; i < 100000; i++) {
    ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);
    BsonFactory fac = new BsonFactory();
    ObjectMapper mapper = new ObjectMapper(fac);
    fac.setCodec(mapper);
    Map<?, ?> m = mapper.readValue(bais, Map.class);
  }
  System.out.println((System.currentTimeMillis() - start));

  start = System.currentTimeMillis();
  for (int i = 0; i < 100000; i++) {
    ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);
    BsonFactory fac = new BsonFactory();
    ObjectMapper mapper = new ObjectMapper(fac);
    fac.setCodec(mapper);
    Map<?, ?> m = mapper.readValue(bais, Map.class);
  }
  System.out.println((System.currentTimeMillis() - start));
}

Maybe I did something wrong and my tests are useless? :) Let me know what you think.

@zigzago (Contributor, Author) commented Jun 21, 2018

I think there is an issue with your test. In a real app you usually have only one ObjectMapper for the whole application, since it is thread-safe and costly to instantiate. Here you are measuring the ObjectMapper instantiation cost more than the BSON deserialization performance.

From my point of view, a more accurate test would look like this:

   @Test
    public void parseFriend() throws Exception {
        BsonFactory fac = new BsonFactory();
        ObjectMapper mapper = new ObjectMapper(fac);
        fac.setCodec(mapper);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);

            Map<?, ?> m = mapper.readValue(bais, Map.class);
        }
        System.out.println((System.currentTimeMillis() - start));

        start = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);
            Map<?, ?> m = mapper.readValue(bais, Map.class);
        }
        System.out.println((System.currentTimeMillis() - start));

        start = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);
            Map<?, ?> m = mapper.readValue(bais, Map.class);
        }
        System.out.println((System.currentTimeMillis() - start));

        start = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            ByteArrayInputStream bais = new ByteArrayInputStream(FRIEND_AS_BYTE);
            Map<?, ?> m = mapper.readValue(bais, Map.class);
        }
        System.out.println((System.currentTimeMillis() - start));
    }

And with this fixed test I get better results :)

Without the patch (times in ms):
6745
5836
5792
5688

With the patch (times in ms):
5506
4804
4847
4871

Let me know if you get the same results.

@zigzago (Contributor, Author) commented Jul 3, 2018

@michel-kraemer were you able to reproduce the perf improvement with the updated test? I plan to publish a benchmark soon, and it would be great to have this improvement included (if you think it makes sense, of course).

michel-kraemer added a commit that referenced this pull request Jul 28, 2018
@michel-kraemer (Owner)

Yes, I was able to reproduce it, but as I said, I'm still worried about the possible implications of caching StaticBuffers in LittleEndianInputStream.

I did some experiments and noticed that caching the CharsetDecoder matters the most; the other changes improve performance almost unnoticeably (if at all). So I just pushed 32d2ab3. Please have a look at it.

stale bot commented Sep 26, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Sep 26, 2018
stale bot commented Oct 10, 2018

This issue has been automatically closed due to inactivity. Please feel free to reopen it if necessary. We apologize for any inconvenience caused.

stale bot closed this Oct 10, 2018