
Moving zstd and zstd_no_dict compression codecs out of experimental #7805

Closed · 6 tasks done
sarthakaggarwal97 opened this issue May 29, 2023 · 12 comments · Fixed by #7908
Labels: enhancement (Enhancement or improvement to existing feature or request), v2.9.0 (Issues and PRs related to version v2.9.0)

Comments

@sarthakaggarwal97 (Contributor) commented May 29, 2023

Is your feature request related to a problem? Please describe.
Currently, we have experimental support for the zstd and zstd_no_dict compression codecs, as mentioned in #3354. The request is to move the feature out of the sandbox so that users can create an index using the new codecs.

Describe the solution you'd like
The idea is to make the new compression codecs available to users by moving the current implementation out of the sandbox. With that, we will leverage the existing index.codec setting so that zstd and zstd_no_dict can be specified upon index creation. We will continue to support the existing zlib and lz4 codecs, with lz4 (BEST_SPEED) remaining the default.
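
For illustration, once the codecs are out of experimental, selecting one at index creation would look roughly like this (a minimal sketch; "my-index" is just a placeholder name, and the exact accepted values follow the index.codec setting described above):

```
PUT /my-index
{
  "settings": {
    "index.codec": "zstd"
  }
}
```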

Here are the outcomes of the benchmarks with these new codecs:

(Screenshot: benchmark results for the new codecs, 2023-05-29.)

cc: @mulugetam @backslasht @mgodwan

@reta (Collaborator) commented May 30, 2023

@sarthakaggarwal97 thanks a lot for publishing the compression gains. Could you please share the CPU / memory profiles as well? Thank you.

@Bukhtawar (Collaborator) commented:

+1. Latency is just one dimension; we need to understand whether there are trade-offs in the form of more cycles spent on compress/decompress.

@reta (Collaborator) commented May 30, 2023

Some benchmarks were also performed here https://issues.apache.org/jira/browse/LUCENE-8739, although the pull request never made it into Apache Lucene.

@sarthakaggarwal97 (Contributor, Author) commented May 31, 2023

@reta @Bukhtawar here is the percentage CPU utilization during indexing. The profiles were taken over a 5-minute interval during active indexing with the nyc_taxis dataset.

Summary of Compression Overheads across Codecs

| Codec | Compression Overhead (% CPU) |
|---|---|
| zlib | 10.48 |
| lz4 | 4.2 |
| zstd | 9.22 |
| zstd_no_dict | 2.55 |

The numbers denote the percentage of CPU utilized for compression. Please let me know if I can help with more information regarding the experiments.

@sarthakaggarwal97 (Contributor, Author) commented:

> Some benchmarks were also performed here https://issues.apache.org/jira/browse/LUCENE-8739, although the pull request never made it into Apache Lucene.

@reta I think the reason the PR never made it into Lucene was that they were looking for a pure Java implementation and didn't want to use libraries with JNI bindings in the lucene-core build.

@reta (Collaborator) commented May 31, 2023

> The numbers denote the percentage of CPU utilized for compression. Please let me know if I can help with more information regarding the experiments.

Thanks @sarthakaggarwal97, do you have Java heap (memory) stats?

@sarthakaggarwal97 (Contributor, Author) commented:

With respect to the implementation for this issue, there are two possible approaches I can see.

  1. Move the zstd sandboxed code directly into source/server:
    This approach requires minimal changes to the code already present in the sandbox, and it enables the new compression codecs through the existing index.codec setting simply by adding support for the new codecs.

  2. Introduce the Codecs as a new module:
    In this approach, we will introduce the new codecs (or even the old ones) as a module, which will eventually be plugged into the CodecService. For this, we will have to wire up a few more components, such as IndicesService, IndexService, IndexModule, and IndexShard, so that we can fetch the filtered codecs from the newly added module using the PluginsService. We should then be able to initialize the CodecService with the filtered/new codecs. This approach might also allow us to extend support to more codecs in the future (a rough sketch follows below).

I would request the community to review these implementation approaches and weigh in on which one would be preferable. Please also share any other approaches that should be taken into consideration.
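
To make option (2) more concrete, here is a minimal, hypothetical sketch of what a codec-provider extension point might look like. The CodecPlugin name, the getCodecs method, and the way CodecService would consume it are assumptions made for illustration only, not an existing OpenSearch API:

```java
import java.util.Map;
import java.util.function.Supplier;

import org.apache.lucene.codecs.Codec;

// Hypothetical extension point (illustrative only): a module/plugin that
// contributes additional codecs by name, e.g. "zstd" and "zstd_no_dict".
public interface CodecPlugin {
    // Codec name (as accepted by the index.codec setting) -> codec factory.
    Map<String, Supplier<Codec>> getCodecs();
}

// CodecService (or a factory around it) could then merge the contributed
// codecs with the built-in ones when an index is created, resolving the
// index.codec setting against the merged map.
```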

cc: @reta @dblock @backslasht @mgodwan

@dblock (Member) commented May 31, 2023

I like (2) of course because it's more generic and extensible, but I think I'd also merge (1) if it gives the feature to users earlier.

@reta (Collaborator) commented May 31, 2023

I would go with the 2nd option, Introduce the Codecs as a new module.

@mulugetam (Contributor) commented:

The quickest approach to make it accessible to users is to move it from sandbox/plugins to modules. However, the second option is more generic and can hopefully be done soon.

@backslasht (Contributor) commented:

I agree option 2 is the right long-term solution. Can it be done in two steps, wherein option 1 is done first, followed by option 2?

@sarthakaggarwal97 (Contributor, Author) commented:

Making the Custom Codecs pluggable would require further discussion about design and implementation.
As mentioned by a few folks, we will start with option (1), since it would provide the feature to users much sooner. I will raise the PR for it.
I have also created a separate issue, #7886, to track the pluggable module for Custom Codecs. This way, we can start the discussions, and it can be picked up after (1).

@reta added the v2.9.0 (Issues and PRs related to version v2.9.0) label on Jun 29, 2023