SerDe
Andrei Tupitcyn edited this page Dec 23, 2015 · 1 revision
SerDe is short for Serializer/Deserializer. Parsek uses the SerDe interface to parse source data and to serialize PValue.
The Separated SerDe works with delimiter-separated values (DSV). The most popular DSV formats are CSV and TSV. The Separated SerDe is also compatible with the Hive Delimited format and supports complex types such as Map, Struct, and List.
Configuration key | Default | Description |
---|---|---|
fields | - | List of fields. Can be a list of field names or a list of field definitions. |
enclosure | " | Enclosure character. |
escape | \ | Escape character. |
delimiter | \u0001 | Delimiter character. The same as FIELDS TERMINATED BY for Hive. |
listDelimiter | \u0002 | The same as COLLECTION ITEMS TERMINATED BY for Hive. |
mapFieldDelimiter | \u0002 | The same as MAP KEYS TERMINATED BY for Hive. |
nullValue | "" | Value used to represent nulls. |
multiLine | false | Allow multiline output. Hive does not support multiline data. |
timeFormat | yyyy-MM-dd HH:mm:ss | Format used to parse and write time fields. For Unix timestamps, use the value "timestamp". |
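
To make the default delimiters concrete, here is a minimal sketch of how a record in the Hive Delimited layout could be split into fields and list items. `DelimitedSketch`, `splitFields`, and `splitList` are hypothetical names used only for illustration; they are not part of Parsek, and the sketch deliberately ignores the `enclosure` and `escape` settings.

```scala
import java.util.regex.Pattern

// Hypothetical helper, not Parsek code: illustrates the default delimiters only.
object DelimitedSketch {
  val delimiter: Char = '\u0001'      // default for `delimiter`
  val listDelimiter: Char = '\u0002'  // default for `listDelimiter`
  val nullValue: String = ""          // default for `nullValue`

  // Split a raw record into top-level fields; a field equal to nullValue becomes None.
  def splitFields(line: String): Vector[Option[String]] =
    line.split(Pattern.quote(delimiter.toString), -1).toVector
      .map(f => if (f == nullValue) None else Some(f))

  // Split one field into list items using the collection delimiter.
  def splitList(field: String): Vector[String] =
    if (field.isEmpty) Vector.empty
    else field.split(Pattern.quote(listDelimiter.toString), -1).toVector
}
```

The `-1` limit passed to `split` keeps trailing empty fields, so a record that ends with null columns still yields the expected number of fields.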
CsvSerDe is the same as the Separated SerDe, but with overridden defaults:
```scala
case class CsvSerDe(config: Config) extends DelimitedSerDeTrait {
  override val delimiter = ','
  override val listDelimiter = '|'
  override val mapFieldDelimiter = ':'
}
```
TsvSerDe is the same as the Separated SerDe, but with overridden defaults:
```scala
case class TsvSerDe(config: Config) extends DelimitedSerDeTrait {
  override val delimiter = '\t'
  override val listDelimiter = '|'
  override val mapFieldDelimiter = ':'
}
```
HiveTsvSerDe is the same as the Separated SerDe, but with overridden defaults and no enclosure character:
```scala
case class HiveTsvSerDe(config: Config) extends DelimitedSerDeTrait {
  override val delimiter = '\t'
  override val listDelimiter = '|'
  override val mapFieldDelimiter = ':'
  override val enclosure = CSVWriter.NO_ESCAPE_CHARACTER
}
```
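
The CSV, TSV, and Hive TSV variants share `'|'` as the list delimiter and `':'` as the map field delimiter. The sketch below shows how a single cell holding a List or Map might be decoded with those values. It assumes the Hive convention that map entries are separated by the list delimiter and each key is separated from its value by the map field delimiter; `ComplexCellSketch`, `parseList`, and `parseMap` are hypothetical names, not Parsek APIs.

```scala
import java.util.regex.Pattern

// Hypothetical helper, not Parsek code: decodes one cell with the CSV/TSV defaults.
object ComplexCellSketch {
  val listDelimiter: Char = '|'
  val mapFieldDelimiter: Char = ':'

  // "red|green|blue" -> List("red", "green", "blue")
  def parseList(cell: String): List[String] =
    if (cell.isEmpty) Nil
    else cell.split(Pattern.quote(listDelimiter.toString), -1).toList

  // "a:1|b:2" -> Map("a" -> "1", "b" -> "2"); an entry without ':' gets an empty value.
  def parseMap(cell: String): Map[String, String] =
    parseList(cell).map { entry =>
      val idx = entry.indexOf(mapFieldDelimiter.toInt)
      if (idx < 0) entry -> ""
      else entry.substring(0, idx) -> entry.substring(idx + 1)
    }.toMap
}
```

`Pattern.quote` matters here because `|` is a regular-expression metacharacter and `String.split` takes a regex.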
The JSON SerDe works with data in JSON format.
Configuration key | Default | Description |
---|---|---|
fields | - | List of field names. |
timeFormat | yyyy-MM-dd HH:mm:ss | Format used to parse and write time fields. For Unix timestamps, use the value "timestamp". |
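
As an illustration of the two `timeFormat` modes, the following sketch parses and writes time values with either a date pattern or the special value "timestamp". It is not Parsek's implementation: `TimeFormatSketch` is a hypothetical name, the use of `java.time` is an assumption, "timestamp" is assumed to mean seconds since the Unix epoch, and values are assumed to be in UTC.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not Parsek code. Assumes UTC and epoch seconds.
object TimeFormatSketch {
  // Parse a time field into epoch seconds.
  def parseTime(value: String, timeFormat: String): Long =
    if (timeFormat == "timestamp") value.toLong
    else LocalDateTime
      .parse(value, DateTimeFormatter.ofPattern(timeFormat))
      .toEpochSecond(ZoneOffset.UTC)

  // Write epoch seconds back out in the configured format.
  def formatTime(epochSeconds: Long, timeFormat: String): String =
    if (timeFormat == "timestamp") epochSeconds.toString
    else LocalDateTime
      .ofEpochSecond(epochSeconds, 0, ZoneOffset.UTC)
      .format(DateTimeFormatter.ofPattern(timeFormat))
}
```

With the default pattern, `TimeFormatSketch.parseTime("1970-01-01 00:00:00", "yyyy-MM-dd HH:mm:ss")` yields `0`, and `formatTime` reverses the conversion.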