[Feature] Support LZO compress on File Read (#5083)

Co-authored-by: Jia Fan <fanjiaeminem@qq.com>
apache · Oct 16, 2023 · a4a1901
1 parent eb6d4cf
commit a4a1901

Showing 24 changed files with 499 additions and 4 deletions.
diff --git a/docs/en/connector-v2/source/CosFile.md b/docs/en/connector-v2/source/CosFile.md
@@ -56,6 +56,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
 | common-options            |         | no       | -                   |
 | sheet_name                | string  | no       | -                   |
 | file_filter_pattern       | string  | no       | -                   |
+| compress_codec            | string  | no       | none                |
 
 ### path [string]
 
@@ -252,6 +253,16 @@ Reader the sheet of the workbook,Only used when file_format is excel.
 
 Filter pattern, which is used for filtering files.
 
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
+
 ## Example
 
 ```hocon

diff --git a/docs/en/connector-v2/source/FtpFile.md b/docs/en/connector-v2/source/FtpFile.md
@@ -49,6 +49,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | common-options            |         | no       | -                   |
 | sheet_name                | string  | no       | -                   |
 | file_filter_pattern       | string  | no       | -                   |
+| compress_codec            | string  | no       | none                |
 
 ### host [string]
 
@@ -228,6 +229,16 @@ Source plugin common parameters, please refer to [Source Common Options](common-
 
 Read the sheet of the workbook. Only used when file_format_type is excel.
 
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
+
 ## Example
 
 ```hocon

diff --git a/docs/en/connector-v2/source/HdfsFile.md b/docs/en/connector-v2/source/HdfsFile.md
@@ -57,6 +57,17 @@ Read data from hdfs file system.
 | schema                    | config  | no       | -                   | the schema fields of upstream data                                                                                                                                                                                                                                                                                                            |
 | common-options            |         | no       | -                   | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.                                                                                                                                                                                                                                      |
 | sheet_name                | string  | no       | -                   | Reader the sheet of the workbook,Only used when file_format is excel.                                                                                                                                                                                                                                                                         |
+| compress_codec            | string  | no       | none                | The compress codec of files                                                                                                                                                                                                                                                                                                                   |
+
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
 
 ### Tips
 

diff --git a/docs/en/connector-v2/source/Hive.md b/docs/en/connector-v2/source/Hive.md
@@ -44,6 +44,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
 | read_partitions               | list    | no       | -             |
 | read_columns                  | list    | no       | -             |
 | abort_drop_partition_metadata | boolean | no       | true          |
+| compress_codec                | string  | no       | none          |
 | common-options                |         | no       | -             |
 
 ### table_name [string]
@@ -85,6 +86,16 @@ The read column list of the data source, user can use it to implement field proj
 
 Flag to decide whether to drop partition metadata from Hive Metastore during an abort operation. Note: this only affects the metadata in the metastore, the data in the partition will always be deleted(data generated during the synchronization process).
 
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
+
 ### common options
 
 Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details

diff --git a/docs/en/connector-v2/source/LocalFile.md b/docs/en/connector-v2/source/LocalFile.md
@@ -50,6 +50,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
 | common-options            |         | no       | -                   |
 | sheet_name                | string  | no       | -                   |
 | file_filter_pattern       | string  | no       | -                   |
+| compress_codec            | string  | no       | none                |
 
 ### path [string]
 
@@ -230,6 +231,16 @@ Reader the sheet of the workbook,Only used when file_format_type is excel.
 
 Filter pattern, which is used for filtering files.
 
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
+
 ## Example
 
 ```hocon

diff --git a/docs/en/connector-v2/source/OssFile.md b/docs/en/connector-v2/source/OssFile.md
@@ -57,6 +57,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
 | common-options            |         | no       | -                   |
 | sheet_name                | string  | no       | -                   |
 | file_filter_pattern       | string  | no       | -                   |
+| compress_codec            | string  | no       | none                |
 
 ### path [string]
 
@@ -249,6 +250,16 @@ Source plugin common parameters, please refer to [Source Common Options](common-
 
 Read the sheet of the workbook. Only used when file_format_type is excel.
 
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
+
 ## Example
 
 ```hocon

diff --git a/docs/en/connector-v2/source/OssJindoFile.md b/docs/en/connector-v2/source/OssJindoFile.md
@@ -60,6 +60,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
 | common-options            |         | no       | -                   |
 | sheet_name                | string  | no       | -                   |
 | file_filter_pattern       | string  | no       | -                   |
+| compress_codec            | string  | no       | none                |
 
 ### path [string]
 
@@ -256,6 +257,16 @@ Reader the sheet of the workbook,Only used when file_format_type is excel.
 
 Filter pattern, which is used for filtering files.
 
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
+
 ## Example
 
 ```hocon

diff --git a/docs/en/connector-v2/source/S3File.md b/docs/en/connector-v2/source/S3File.md
@@ -214,6 +214,17 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | schema                          | config  | no       | -                                                     | The schema of upstream data.                                                                                                                                                                                                                                                                                                                                                                               |
 | common-options                  |         | no       | -                                                     | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.                                                                                                                                                                                                                                                                                                   |
 | sheet_name                      | string  | no       | -                                                     | Reader the sheet of the workbook,Only used when file_format is excel.                                                                                                                                                                                                                                                                                                                                      |
+| compress_codec                  | string  | no       | none                                                  |
+
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
 
 ## Example
 

diff --git a/docs/en/connector-v2/source/SftpFile.md b/docs/en/connector-v2/source/SftpFile.md
@@ -48,6 +48,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | common-options            |         | no       | -                   |
 | sheet_name                | string  | no       | -                   |
 | file_filter_pattern       | string  | no       | -                   |
+| compress_codec            | string  | no       | none                |
 
 ### host [string]
 
@@ -231,6 +232,16 @@ Reader the sheet of the workbook,Only used when file_format_type is excel.
 
 Filter pattern, which is used for filtering files.
 
+### compress_codec [string]
+
+The compression codec of the files to read. Supported values by file format:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:  
+  automatically recognizes the compression type, no additional settings required.
+
 ## Example
 
 ```hocon

diff --git a/...src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseSourceConfig.java b/...src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseSourceConfig.java
@@ -119,4 +119,10 @@ public class BaseSourceConfig {
                     .noDefaultValue()
                     .withDescription(
                             "File pattern. The connector will filter some files base on the pattern.");
+
+    public static final Option<CompressFormat> COMPRESS_CODEC =
+            Options.key("compress_codec")
+                    .enumType(CompressFormat.class)
+                    .defaultValue(CompressFormat.NONE)
+                    .withDescription("Compression codec");
 }
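
Review note: the read strategies later in this diff turn the configured `compress_codec` string into an enum with `CompressFormat.valueOf(compressCodec.toUpperCase())`. The sketch below illustrates the same lookup using only the JDK. Everything in it is an assumption made for illustration: `CompressCodecParser` and its nested `Codec` enum are hypothetical stand-ins, not the project's `CompressFormat`. It upper-cases with `Locale.ROOT` so the result does not depend on the JVM default locale, and it reports the accepted values when the name is unknown.

```java
import java.util.Arrays;
import java.util.Locale;

public class CompressCodecParser {
    /** Hypothetical stand-in for the connector's CompressFormat enum. */
    public enum Codec { NONE, LZO }

    /** Parses a user-supplied codec name case-insensitively; null or blank falls back to NONE. */
    public static Codec parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Codec.NONE;
        }
        // Locale.ROOT keeps the upper-casing independent of the JVM default locale.
        String normalized = raw.trim().toUpperCase(Locale.ROOT);
        try {
            return Codec.valueOf(normalized);
        } catch (IllegalArgumentException e) {
            // Re-throw with a message that names the option and lists the accepted values.
            throw new IllegalArgumentException(
                    "Unsupported compress_codec '" + raw + "', expected one of "
                            + Arrays.toString(Codec.values()),
                    e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("lzo"));
        System.out.println(parse(" None "));
    }
}
```

Whether a blank value should fall back to `NONE` or be rejected is a design choice; the sketch falls back because the option in this commit defaults to `none`.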
diff --git a/...n/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/JsonReadStrategy.java b/...n/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/JsonReadStrategy.java
@@ -22,6 +22,8 @@
 import org.apache.seatunnel.api.table.type.SeaTunnelRow;
 import org.apache.seatunnel.api.table.type.SeaTunnelRowType;
 import org.apache.seatunnel.common.exception.CommonErrorCode;
+import org.apache.seatunnel.connectors.seatunnel.file.config.BaseSourceConfig;
+import org.apache.seatunnel.connectors.seatunnel.file.config.CompressFormat;
 import org.apache.seatunnel.connectors.seatunnel.file.config.HadoopConf;
 import org.apache.seatunnel.connectors.seatunnel.file.exception.FileConnectorException;
 import org.apache.seatunnel.format.json.JsonDeserializationSchema;
@@ -30,14 +32,29 @@
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 
+import io.airlift.compress.lzo.LzopCodec;
+import lombok.extern.slf4j.Slf4j;
+
 import java.io.BufferedReader;
 import java.io.IOException;
+import java.io.InputStream;
 import java.io.InputStreamReader;
 import java.nio.charset.StandardCharsets;
 import java.util.Map;
 
+@Slf4j
 public class JsonReadStrategy extends AbstractReadStrategy {
     private DeserializationSchema<SeaTunnelRow> deserializationSchema;
+    private CompressFormat compressFormat = BaseSourceConfig.COMPRESS_CODEC.defaultValue();
+
+    @Override
+    public void init(HadoopConf conf) {
+        super.init(conf);
+        if (pluginConfig.hasPath(BaseSourceConfig.COMPRESS_CODEC.key())) {
+            String compressCodec = pluginConfig.getString(BaseSourceConfig.COMPRESS_CODEC.key());
+            compressFormat = CompressFormat.valueOf(compressCodec.toUpperCase());
+        }
+    }
 
     @Override
     public void setSeaTunnelRowTypeInfo(SeaTunnelRowType seaTunnelRowType) {
@@ -58,9 +75,24 @@ public void read(String path, Collector<SeaTunnelRow> output)
         FileSystem fs = FileSystem.get(conf);
         Path filePath = new Path(path);
         Map<String, String> partitionsMap = parsePartitionsByPath(path);
+        InputStream inputStream;
+        switch (compressFormat) {
+            case LZO:
+                LzopCodec lzo = new LzopCodec();
+                inputStream = lzo.createInputStream(fs.open(filePath));
+                break;
+            case NONE:
+                inputStream = fs.open(filePath);
+                break;
+            default:
+                log.warn(
+                        "Json file does not support this compress type: {}",
+                        compressFormat.getCompressCodec());
+                inputStream = fs.open(filePath);
+                break;
+        }
         try (BufferedReader reader =
-                new BufferedReader(
-                        new InputStreamReader(fs.open(filePath), StandardCharsets.UTF_8))) {
+                new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
             reader.lines()
                     .forEach(
                             line -> {

diff --git a/...n/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/TextReadStrategy.java b/...n/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/TextReadStrategy.java
@@ -28,6 +28,7 @@
 import org.apache.seatunnel.common.utils.DateUtils;
 import org.apache.seatunnel.common.utils.TimeUtils;
 import org.apache.seatunnel.connectors.seatunnel.file.config.BaseSourceConfig;
+import org.apache.seatunnel.connectors.seatunnel.file.config.CompressFormat;
 import org.apache.seatunnel.connectors.seatunnel.file.config.FileFormat;
 import org.apache.seatunnel.connectors.seatunnel.file.config.HadoopConf;
 import org.apache.seatunnel.connectors.seatunnel.file.exception.FileConnectorErrorCode;
@@ -39,19 +40,25 @@
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 
+import io.airlift.compress.lzo.LzopCodec;
+import lombok.extern.slf4j.Slf4j;
+
 import java.io.BufferedReader;
 import java.io.IOException;
+import java.io.InputStream;
 import java.io.InputStreamReader;
 import java.nio.charset.StandardCharsets;
 import java.util.Map;
 
+@Slf4j
 public class TextReadStrategy extends AbstractReadStrategy {
     private DeserializationSchema<SeaTunnelRow> deserializationSchema;
     private String fieldDelimiter = BaseSourceConfig.DELIMITER.defaultValue();
     private DateUtils.Formatter dateFormat = BaseSourceConfig.DATE_FORMAT.defaultValue();
     private DateTimeUtils.Formatter datetimeFormat =
             BaseSourceConfig.DATETIME_FORMAT.defaultValue();
     private TimeUtils.Formatter timeFormat = BaseSourceConfig.TIME_FORMAT.defaultValue();
+    private CompressFormat compressFormat = BaseSourceConfig.COMPRESS_CODEC.defaultValue();
     private int[] indexes;
 
     @Override
@@ -61,9 +68,25 @@ public void read(String path, Collector<SeaTunnelRow> output)
         FileSystem fs = FileSystem.get(conf);
         Path filePath = new Path(path);
         Map<String, String> partitionsMap = parsePartitionsByPath(path);
+        InputStream inputStream;
+        switch (compressFormat) {
+            case LZO:
+                LzopCodec lzo = new LzopCodec();
+                inputStream = lzo.createInputStream(fs.open(filePath));
+                break;
+            case NONE:
+                inputStream = fs.open(filePath);
+                break;
+            default:
+                log.warn(
+                        "Text file does not support this compress type: {}",
+                        compressFormat.getCompressCodec());
+                inputStream = fs.open(filePath);
+                break;
+        }
+
         try (BufferedReader reader =
-                new BufferedReader(
-                        new InputStreamReader(fs.open(filePath), StandardCharsets.UTF_8))) {
+                new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
             reader.lines()
                     .skip(skipHeaderNumber)
                     .forEach(
@@ -200,5 +223,9 @@ private void initFormatter() {
                     TimeUtils.Formatter.parse(
                             pluginConfig.getString(BaseSourceConfig.TIME_FORMAT.key()));
         }
+        if (pluginConfig.hasPath(BaseSourceConfig.COMPRESS_CODEC.key())) {
+            String compressCodec = pluginConfig.getString(BaseSourceConfig.COMPRESS_CODEC.key());
+            compressFormat = CompressFormat.valueOf(compressCodec.toUpperCase());
+        }
     }
 }
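
Review note: in both read strategies the stream returned by `fs.open(filePath)` is wrapped by the codec before the `try` block starts, so if wrapping throws, the raw stream is never closed. The sketch below shows one possible shape that opens the raw stream and wraps it inside a single try-with-resources. It uses only the JDK and rests on stated assumptions: `CompressedLineReader` and its `Codec` enum are hypothetical, and `GZIPInputStream` stands in for the LZO codec, which the JDK does not ship. In the connector the `GZIP` branch would correspond to `LzopCodec.createInputStream`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedLineReader {
    /** Hypothetical codec enum; GZIP is a stand-in for LZO, which is not part of the JDK. */
    public enum Codec { NONE, GZIP }

    /** Wraps the raw stream in the decompressor selected by the codec. */
    static InputStream wrap(Codec codec, InputStream raw) throws IOException {
        switch (codec) {
            case GZIP:
                return new GZIPInputStream(raw);
            case NONE:
            default:
                return raw;
        }
    }

    /** Reads all lines; the raw stream is closed even if wrapping it fails. */
    public static List<String> readLines(Codec codec, InputStream source) throws IOException {
        try (InputStream raw = source;
                BufferedReader reader =
                        new BufferedReader(
                                new InputStreamReader(wrap(codec, raw), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        }
    }

    /** Demo helper: gzip-compresses a UTF-8 string in memory. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("a,1\nb,2\n");
        System.out.println(readLines(Codec.GZIP, new ByteArrayInputStream(compressed)));
        System.out.println(
                readLines(Codec.NONE, new ByteArrayInputStream("c,3\n".getBytes(StandardCharsets.UTF_8))));
    }
}
```

Keeping the codec selection in one helper such as `wrap` would also avoid repeating the same `switch` in `JsonReadStrategy` and `TextReadStrategy`.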
diff --git a/.../java/org/apache/seatunnel/connectors/seatunnel/file/cos/source/CosFileSourceFactory.java b/.../java/org/apache/seatunnel/connectors/seatunnel/file/cos/source/CosFileSourceFactory.java
@@ -61,6 +61,7 @@ public OptionRule optionRule() {
                 .optional(BaseSourceConfig.DATETIME_FORMAT)
                 .optional(BaseSourceConfig.TIME_FORMAT)
                 .optional(BaseSourceConfig.FILE_FILTER_PATTERN)
+                .optional(BaseSourceConfig.COMPRESS_CODEC)
                 .build();
     }
 

diff --git a/.../java/org/apache/seatunnel/connectors/seatunnel/file/ftp/source/FtpFileSourceFactory.java b/.../java/org/apache/seatunnel/connectors/seatunnel/file/ftp/source/FtpFileSourceFactory.java
@@ -61,6 +61,7 @@ public OptionRule optionRule() {
                 .optional(BaseSourceConfig.DATETIME_FORMAT)
                 .optional(BaseSourceConfig.TIME_FORMAT)
                 .optional(BaseSourceConfig.FILE_FILTER_PATTERN)
+                .optional(BaseSourceConfig.COMPRESS_CODEC)
                 .build();
     }
 

diff --git a/...ava/org/apache/seatunnel/connectors/seatunnel/file/hdfs/source/HdfsFileSourceFactory.java b/...ava/org/apache/seatunnel/connectors/seatunnel/file/hdfs/source/HdfsFileSourceFactory.java
@@ -58,6 +58,7 @@ public OptionRule optionRule() {
                 .optional(BaseSourceConfig.DATETIME_FORMAT)
                 .optional(BaseSourceConfig.TIME_FORMAT)
                 .optional(BaseSourceConfig.FILE_FILTER_PATTERN)
+                .optional(BaseSourceConfig.COMPRESS_CODEC)
                 .build();
     }
 

diff --git a/.../java/org/apache/seatunnel/connectors/seatunnel/file/oss/source/OssFileSourceFactory.java b/.../java/org/apache/seatunnel/connectors/seatunnel/file/oss/source/OssFileSourceFactory.java
@@ -61,6 +61,7 @@ public OptionRule optionRule() {
                 .optional(BaseSourceConfig.DATETIME_FORMAT)
                 .optional(BaseSourceConfig.TIME_FORMAT)
                 .optional(BaseSourceConfig.FILE_FILTER_PATTERN)
+                .optional(BaseSourceConfig.COMPRESS_CODEC)
                 .build();
     }
 

diff --git a/.../java/org/apache/seatunnel/connectors/seatunnel/file/oss/source/OssFileSourceFactory.java b/.../java/org/apache/seatunnel/connectors/seatunnel/file/oss/source/OssFileSourceFactory.java
@@ -61,6 +61,7 @@ public OptionRule optionRule() {
                 .optional(BaseSourceConfig.DATETIME_FORMAT)
                 .optional(BaseSourceConfig.TIME_FORMAT)
                 .optional(BaseSourceConfig.FILE_FILTER_PATTERN)
+                .optional(BaseSourceConfig.COMPRESS_CODEC)
                 .build();
     }