
Moving zstd and zstd_no_dict compression codecs out of experimental #7805

Closed · 6 tasks done
sarthakaggarwal97 opened this issue May 29, 2023 · 12 comments · Fixed by #7908
Labels: enhancement (Enhancement or improvement to existing feature or request), v2.9.0 (Issues and PRs related to version v2.9.0)

Comments

@sarthakaggarwal97 (Contributor) commented May 29, 2023

Is your feature request related to a problem? Please describe.
Currently, we have experimental support for the zstd and zstd_no_dict compression codecs, as mentioned in #3354. The request is to move the feature out of the sandbox so that users can create an index using the new codecs.

Describe the solution you'd like
The idea is to make the new compression codecs available to users by moving the current implementation out of the sandbox. With that, we will leverage the existing index.codec setting so that zstd and zstd_no_dict can be specified upon index creation. We will continue to support the existing zlib and lz4 codecs, with lz4 (BEST_SPEED) remaining the default.
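
For illustration, once the codecs are out of experimental, selecting one at index creation would look roughly like this (a minimal sketch; "my-index" is just a placeholder name, and the exact accepted values follow the index.codec setting described above):

```
PUT /my-index
{
  "settings": {
    "index.codec": "zstd"
  }
}
```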

Here are the outcomes of the benchmarks with these new codecs:

(Screenshot: benchmark results for the new codecs, 2023-05-29.)

cc: @mulugetam @backslasht @mgodwan

@reta (Collaborator) commented May 30, 2023

@sarthakaggarwal97 thanks a lot for publishing the compression gains. Could you please share the CPU / memory profiles as well? Thank you.

@Bukhtawar (Collaborator) commented:

+1. Latency is just one dimension; we need to understand whether there are trade-offs in the form of more cycles spent on compress/decompress.

@reta (Collaborator) commented May 30, 2023

Some benchmarks were also performed here https://issues.apache.org/jira/browse/LUCENE-8739, although the pull request never made it into Apache Lucene.

@sarthakaggarwal97 (Contributor, Author) commented May 31, 2023

@reta @Bukhtawar here is the percentage CPU utilization during indexing. The profiles were taken over a 5-minute interval during active indexing with the nyc_taxis dataset.

Summary of Compression Overheads across Codecs

| Codec | Compression Overhead (% CPU) |
|---|---|
| zlib | 10.48 |
| lz4 | 4.2 |
| zstd | 9.22 |
| zstd_no_dict | 2.55 |

The numbers denote the percentage of CPU utilized for compression. Please let me know if I can help with more information regarding the experiments.

@sarthakaggarwal97 (Contributor, Author) commented:

> Some benchmarks were also performed here https://issues.apache.org/jira/browse/LUCENE-8739, although the pull request never made it into Apache Lucene.

@reta I think the reason the PR never made it into Lucene was that they were looking for a pure Java implementation and didn't want to use libraries with JNI bindings in the lucene-core build.

@reta (Collaborator) commented May 31, 2023

> The numbers denote the percentage of CPU utilized for compression. Please let me know if I can help with more information regarding the experiments.

Thanks @sarthakaggarwal97, do you have Java heap (memory) stats?

@sarthakaggarwal97 (Contributor, Author) commented:

With respect to the implementation for this issue, there are two possible approaches I can see.

  1. Move the zstd sandboxed code directly into source/server:
    This approach requires minimal changes to the code already present in the sandbox, and it enables the new compression codecs through the existing index.codec setting simply by adding support for the new codecs.

  2. Introduce the Codecs as a new module:
    In this approach, we will introduce the new codecs (or even the old ones) as a module, which will eventually be plugged into the CodecService. For this, we will have to wire up a few more components, such as IndicesService, IndexService, IndexModule, and IndexShard, so that we can fetch the filtered codecs from the newly added module using the PluginsService. We should then be able to initialize the CodecService with the filtered/new codecs. This approach might also allow us to extend support to more codecs in the future (a rough sketch follows below).

I would request the community to review these implementation approaches and weigh in on which one would be preferable. Please also share any other approaches that should be taken into consideration.
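
To make option (2) more concrete, here is a minimal, hypothetical sketch of what a codec-provider extension point might look like. The CodecPlugin name, the getCodecs method, and the way CodecService would consume it are assumptions made for illustration only, not an existing OpenSearch API:

```java
import java.util.Map;
import java.util.function.Supplier;

import org.apache.lucene.codecs.Codec;

// Hypothetical extension point (illustrative only): a module/plugin that
// contributes additional codecs by name, e.g. "zstd" and "zstd_no_dict".
public interface CodecPlugin {
    // Codec name (as accepted by the index.codec setting) -> codec factory.
    Map<String, Supplier<Codec>> getCodecs();
}

// CodecService (or a factory around it) could then merge the contributed
// codecs with the built-in ones when an index is created, resolving the
// index.codec setting against the merged map.
```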

cc: @reta @dblock @backslasht @mgodwan

@dblock (Member) commented May 31, 2023

I like (2) of course because it's more generic and extensible, but I think I'd also merge (1) if it gives the feature to users earlier.

@reta (Collaborator) commented May 31, 2023

I would go with the 2nd option, Introduce the Codecs as a new module.

@mulugetam (Contributor) commented:

The quickest approach to make it accessible to users is to move it from sandbox/plugins to modules. However, the second option is more generic and can hopefully be done soon.

@backslasht (Contributor) commented:

I agree option 2 is the right long-term solution. Can it be done in two steps, wherein option 1 is done first, followed by option 2?

@sarthakaggarwal97 (Contributor, Author) commented:

Making the Custom Codecs pluggable would require further discussion about design and implementation.
As mentioned by a few folks, we will start with option (1), since it would provide the feature to users much sooner. I will raise the PR for it.
I have also created a separate issue, #7886, to track the pluggable module for Custom Codecs. This way, we can start the discussions, and it can be picked up after (1).

@reta added the v2.9.0 (Issues and PRs related to version v2.9.0) label on Jun 29, 2023