Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CDAP-21144] Spanner Compression for String & bytes type fields #15910

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

sidhdirenge
Copy link
Contributor

@sidhdirenge sidhdirenge commented Feb 20, 2025

Columns to be compressed should be added to cConf
table_name:column_name:compressor_type:(TO BE USED IN FUTURE)number_of_chunks

All this would be part of the compressionConfig

High level overview of changes:

  1. Specify the table.column in cConf that needs to be compressed
  2. In Spanner implementation, at the time of table creation, if field is present in cConf, we create one more column : column_name.compression_type.
  3. Based on the size, we compress (compression_type = null) or dont compress data for that particular field (compression_type = SNAPPY)

In case, we have any cx issue related to some new columns in future, we can append that column to the cConf property & thus enable compression.

Tested these changes using Distributed docker image

Unit tests to be added in follow up PR.
image

@sidhdirenge sidhdirenge requested a review from vsethi09 February 20, 2025 16:52
@sidhdirenge sidhdirenge added the build Triggers github actions build label Feb 20, 2025
<name>data.storage.properties.gcp-spanner.compression.config</name>
<value>application_specs:application_data:SNAPPY,provisioner_data:provisioner_task_info:SNAPPY,destination_fields_table.destination_data:SNAPPY,run_records.run_records_data:SNAPPY</value>
<description>
Represents table_name:column_name:compressor_type which possibly can cross spanner cell limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support for chunk size should be added now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be done in follow up PR

@sidhdirenge sidhdirenge changed the title [WIP] Spanner Compression for String & bytes type fields [CDAP-21144] Spanner Compression for String & bytes type fields Feb 21, 2025
/**
* Interface for various implementations of compressors used for spanner.
*/
public interface Compressor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should go in storage-spi?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now everything compression related, we want to limit to Spanner. We do not have use case for this for different implementations.

Also the compression config to be introduced later would have fields related to chunking etc - again spanner specific

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that but making it a spi will help anyone add it for any extensions in future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any extension can implement their own compressor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me evaluate this with the chunking related changes - and move it to spi if there is nothing too specific.

For chunking, we might have to pass in some other args apart from just the value.

/**
* Configuration class for compression settings.
*/
public class CompressionConfig {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar this should also go in storage-spi?

Every extension can have its own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these compressor configs are limited to spanner right now.

/**
* A {@link Compressor} that uses Snappy for compression and decompression.
*/
public class SnappyCompressor implements Compressor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add UTs for this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snappy UTs have some issue when I run on local mac - It is a known issue & it needs some extra libraries

Will take this up in follow up

[FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Mac and os.arch=aarch64

Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
9.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Copy link
Contributor

@tivv tivv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also disallow any range queries on compressed columns.

.append(escapeName(compressedFieldName))
.append(" ")
.append(getSpannerType(f.getType()));
if (primaryKeys.contains(fieldName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not be non-null as "no compression" value is null. I would probably disallow having primary keys compressed as there is a slight chance of collision between compressed and uncompressed value and making a two-column primary key index with nulls won't work well

@@ -388,6 +446,7 @@ private void executeDdlStatements(List<String> ddlStatements) throws IOException
LOG.debug("No ddl statements to execute");
return;
}
LOG.info("DDL statements {}", ddlStatements);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why info? It's not that important to customer. If you want to see it in raw logs, make it debug

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Triggers github actions build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants