Background
Currently, we have a "Spark source" (za.co.absa.pramen.core.source.SparkSource) that allows reading data sources located in Hadoop (HDFS, S3, etc.) as an external source.
But sometimes it is useful to have a "Spark sink" so that data from the metastore can be written as part of a pipeline.
Specifically, such a sink could be used to run a pipeline on premises and write data to S3.
Feature
Add a new sink, SparkSink, that simply does 'df.write(...)' with the options provided by the sink's configuration.
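For illustration, here is a minimal sketch (Spark/Scala) of the write logic such a sink could perform. The method name and parameters are hypothetical and simply mirror the configuration keys proposed in the example below; this is not the actual Pramen Sink interface.

import org.apache.spark.sql.DataFrame

// Hypothetical helper mirroring the proposed sink configuration keys.
def writeWithSparkSink(df: DataFrame,
                       outputPath: String,
                       format: String,
                       mode: String,
                       options: Map[String, String],
                       numberOfPartitions: Option[Int] = None,
                       recordsPerPartition: Option[Long] = None): Unit = {
  // At most one of the two partitioning settings is expected to be set.
  val dfToWrite = (numberOfPartitions, recordsPerPartition) match {
    case (Some(n), _) => df.repartition(n)
    case (None, Some(rpp)) =>
      // Counting triggers a Spark job; used here only to derive the partition count.
      val n = math.max(1, math.ceil(df.count().toDouble / rpp).toInt)
      df.repartition(n)
    case _ => df
  }

  dfToWrite.write
    .format(format)   // e.g. "csv"
    .mode(mode)       // e.g. "append"
    .options(options) // e.g. sep, quoteAll, header for CSV
    .save(outputPath) // e.g. "s3://bucket/prefix"
}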
Example
This is how it might look (not final).
sinks.conf:
pramen.sinks = [
  {
    name = "my_sink"
    factory.class = "za.co.absa.pramen.core.sink.SparkSink"

    # This is the format passed to 'df.write.format(...)'
    format = "csv"

    # This is the mode passed to 'df.write.format(...).mode(...)'
    mode = "append"

    ## Use at most one of these for 'df.repartition(...).write(...)':
    # Number of partitions
    number.of.partitions = 10
    # Records per partition
    records.per.partition = 1000000

    # These are additional options passed to the writer as 'df.write(...).options(...)'
    option {
      sep = "|"
      quoteAll = "false"
      header = "true"
    }
  }
]
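Given this configuration, the sink would effectively perform something like the following (a sketch; the output path comes from the table definition in pipeline.conf below, and only one of the two partitioning settings would be honored):

df.repartition(10)            // number.of.partitions = 10
  .write
  .format("csv")              // format = "csv"
  .mode("append")             // mode = "append"
  .option("sep", "|")         // entries from the 'option { }' block
  .option("quoteAll", "false")
  .option("header", "true")
  .save("s3://bucket/prefix") // output.path of the table in pipeline.conf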
pipeline.conf:
pramen.operations = [
  {
    name = "CSV sink"
    type = "sink"
    sink = "my_sink"
    schedule.type = "daily"

    dependencies = [
      {
        tables = [ users2 ]
        date.from = "@infoDate + 100"
        date.until = "@infoDate + 101"
        optional = true
      }
    ]

    tables = [
      {
        input.metastore.table = my_table
        output.path = "s3://bucket/prefix"
        date {
          from = "@infoDate - 5"
          to = "@infoDate"
        }
        # This overrides the options of the sink (see the sketch after this config for how the override could be resolved)
        sink {
          mode = "overwrite"
          # These are additional options passed to the writer as 'df.write(...).options(...)'
          option {
            escape = ""
            emptyValue = ""
          }
        }
      }
    ]
  }
]
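One way the per-table sink { } block could be merged with the sink-level defaults is via Typesafe Config's fallback mechanism. A sketch under that assumption (not necessarily how Pramen would implement it; the keys are the ones used in the examples above):

import com.typesafe.config.{Config, ConfigFactory}

// Sink-level defaults from sinks.conf
val sinkDefaults: Config = ConfigFactory.parseString(
  """mode = "append"
    |option {
    |  sep = "|"
    |  quoteAll = "false"
    |  header = "true"
    |}""".stripMargin)

// Table-level overrides from pipeline.conf
val tableOverrides: Config = ConfigFactory.parseString(
  """mode = "overwrite"
    |option {
    |  escape = ""
    |  emptyValue = ""
    |}""".stripMargin)

// Keys defined at the table level win; everything else falls back to the sink defaults.
val effective: Config = tableOverrides.withFallback(sinkDefaults)

effective.getString("mode")          // "overwrite"
effective.getString("option.sep")    // "|"
effective.getString("option.escape") // ""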