-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support COPY TO Externally Defined File Formats, add FileType trait #11060
Merged
Merged
Changes from all commits
Commits
Show all changes
23 commits
Select commit
Hold shift + click to select a range
a67d73b
wip create and register ext file types with session
devinjdangelo 85e64ba
Add contains function, and support in datafusion substrait consumer (…
Lordworms 05d329b
logical planning updated
devinjdangelo 027131b
compiling
devinjdangelo 54f5f8f
removing filetype enum
devinjdangelo 0274eec
compiling
devinjdangelo 4319881
working on tests
devinjdangelo 487da13
fix some tests
devinjdangelo 9527f82
test fixes
devinjdangelo 891afc3
cli fix
devinjdangelo 128bf2b
cli fmt
devinjdangelo e6d2fa3
Merge branch 'main' into external_file_types
devinjdangelo b065cd7
Update datafusion/core/src/datasource/file_format/mod.rs
devinjdangelo 485f754
Update datafusion/core/src/execution/session_state.rs
devinjdangelo da8e635
review comments
devinjdangelo 6dec6ef
review comments
devinjdangelo 013ef91
review comments
devinjdangelo 132af14
typo fix
devinjdangelo 493e82f
fmt
devinjdangelo bcc3119
fix err log style
devinjdangelo 77bcc54
merge main
devinjdangelo 11071ac
fmt
devinjdangelo 826721f
Merge remote-tracking branch 'apache/main' into external_file_types
alamb File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,7 +24,7 @@ use std::str::FromStr; | |
|
||
use crate::error::_config_err; | ||
use crate::parsers::CompressionTypeVariant; | ||
use crate::{DataFusionError, FileType, Result}; | ||
use crate::{DataFusionError, Result}; | ||
|
||
/// A macro that wraps a configuration struct and automatically derives | ||
/// [`Default`] and [`ConfigField`] for it, allowing it to be used | ||
|
@@ -1116,6 +1116,16 @@ macro_rules! extensions_options { | |
} | ||
} | ||
|
||
/// These file types have special built in behavior for configuration. | ||
/// Use TableOptions::Extensions for configuring other file types. | ||
#[derive(Debug, Clone)] | ||
pub enum ConfigFileType { | ||
CSV, | ||
#[cfg(feature = "parquet")] | ||
PARQUET, | ||
JSON, | ||
} | ||
|
||
/// Represents the configuration options available for handling different table formats within a data processing application. | ||
/// This struct encompasses options for various file formats including CSV, Parquet, and JSON, allowing for flexible configuration | ||
/// of parsing and writing behaviors specific to each format. Additionally, it supports extending functionality through custom extensions. | ||
|
@@ -1134,7 +1144,7 @@ pub struct TableOptions { | |
|
||
/// The current file format that the table operations should assume. This option allows | ||
/// for dynamic switching between the supported file types (e.g., CSV, Parquet, JSON). | ||
pub current_format: Option<FileType>, | ||
pub current_format: Option<ConfigFileType>, | ||
|
||
/// Optional extensions that can be used to extend or customize the behavior of the table | ||
/// options. Extensions can be registered using `Extensions::insert` and might include | ||
|
@@ -1152,10 +1162,9 @@ impl ConfigField for TableOptions { | |
if let Some(file_type) = &self.current_format { | ||
match file_type { | ||
#[cfg(feature = "parquet")] | ||
FileType::PARQUET => self.parquet.visit(v, "format", ""), | ||
FileType::CSV => self.csv.visit(v, "format", ""), | ||
FileType::JSON => self.json.visit(v, "format", ""), | ||
_ => {} | ||
ConfigFileType::PARQUET => self.parquet.visit(v, "format", ""), | ||
ConfigFileType::CSV => self.csv.visit(v, "format", ""), | ||
ConfigFileType::JSON => self.json.visit(v, "format", ""), | ||
} | ||
} else { | ||
self.csv.visit(v, "csv", ""); | ||
|
@@ -1188,12 +1197,9 @@ impl ConfigField for TableOptions { | |
match key { | ||
"format" => match format { | ||
#[cfg(feature = "parquet")] | ||
FileType::PARQUET => self.parquet.set(rem, value), | ||
FileType::CSV => self.csv.set(rem, value), | ||
FileType::JSON => self.json.set(rem, value), | ||
_ => { | ||
_config_err!("Config value \"{key}\" is not supported on {}", format) | ||
} | ||
ConfigFileType::PARQUET => self.parquet.set(rem, value), | ||
ConfigFileType::CSV => self.csv.set(rem, value), | ||
ConfigFileType::JSON => self.json.set(rem, value), | ||
}, | ||
_ => _config_err!("Config value \"{key}\" not found on TableOptions"), | ||
} | ||
|
@@ -1210,15 +1216,6 @@ impl TableOptions { | |
Self::default() | ||
} | ||
|
||
/// Sets the file format for the table. | ||
/// | ||
/// # Parameters | ||
/// | ||
/// * `format`: The file format to use (e.g., CSV, Parquet). | ||
pub fn set_file_format(&mut self, format: FileType) { | ||
self.current_format = Some(format); | ||
} | ||
|
||
/// Creates a new `TableOptions` instance initialized with settings from a given session config. | ||
/// | ||
/// # Parameters | ||
|
@@ -1249,6 +1246,15 @@ impl TableOptions { | |
clone | ||
} | ||
|
||
/// Sets the file format for the table. | ||
/// | ||
/// # Parameters | ||
/// | ||
/// * `format`: The file format to use (e.g., CSV, Parquet). | ||
pub fn set_config_format(&mut self, format: ConfigFileType) { | ||
self.current_format = Some(format); | ||
} | ||
|
||
/// Sets the extensions for this `TableOptions` instance. | ||
/// | ||
/// # Parameters | ||
|
@@ -1673,6 +1679,8 @@ config_namespace! { | |
} | ||
} | ||
|
||
pub trait FormatOptionsExt: Display {} | ||
|
||
#[derive(Debug, Clone, PartialEq)] | ||
#[allow(clippy::large_enum_variant)] | ||
pub enum FormatOptions { | ||
|
@@ -1698,28 +1706,15 @@ impl Display for FormatOptions { | |
} | ||
} | ||
|
||
impl From<FileType> for FormatOptions { | ||
fn from(value: FileType) -> Self { | ||
match value { | ||
FileType::ARROW => FormatOptions::ARROW, | ||
FileType::AVRO => FormatOptions::AVRO, | ||
#[cfg(feature = "parquet")] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it is certainly nice to avoid many of these |
||
FileType::PARQUET => FormatOptions::PARQUET(TableParquetOptions::default()), | ||
FileType::CSV => FormatOptions::CSV(CsvOptions::default()), | ||
FileType::JSON => FormatOptions::JSON(JsonOptions::default()), | ||
} | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
use std::any::Any; | ||
use std::collections::HashMap; | ||
|
||
use crate::config::{ | ||
ConfigEntry, ConfigExtension, ExtensionOptions, Extensions, TableOptions, | ||
ConfigEntry, ConfigExtension, ConfigFileType, ExtensionOptions, Extensions, | ||
TableOptions, | ||
}; | ||
use crate::FileType; | ||
|
||
#[derive(Default, Debug, Clone)] | ||
pub struct TestExtensionConfig { | ||
|
@@ -1777,7 +1772,7 @@ mod tests { | |
let mut extension = Extensions::new(); | ||
extension.insert(TestExtensionConfig::default()); | ||
let mut table_config = TableOptions::new().with_extensions(extension); | ||
table_config.set_file_format(FileType::CSV); | ||
table_config.set_config_format(ConfigFileType::CSV); | ||
table_config.set("format.delimiter", ";").unwrap(); | ||
assert_eq!(table_config.csv.delimiter, b';'); | ||
table_config.set("test.bootstrap.servers", "asd").unwrap(); | ||
|
@@ -1794,7 +1789,7 @@ mod tests { | |
#[test] | ||
fn csv_u8_table_options() { | ||
let mut table_config = TableOptions::new(); | ||
table_config.set_file_format(FileType::CSV); | ||
table_config.set_config_format(ConfigFileType::CSV); | ||
table_config.set("format.delimiter", ";").unwrap(); | ||
assert_eq!(table_config.csv.delimiter as char, ';'); | ||
table_config.set("format.escape", "\"").unwrap(); | ||
|
@@ -1807,7 +1802,7 @@ mod tests { | |
#[test] | ||
fn parquet_table_options() { | ||
let mut table_config = TableOptions::new(); | ||
table_config.set_file_format(FileType::PARQUET); | ||
table_config.set_config_format(ConfigFileType::PARQUET); | ||
table_config | ||
.set("format.bloom_filter_enabled::col1", "true") | ||
.unwrap(); | ||
|
@@ -1821,7 +1816,7 @@ mod tests { | |
#[test] | ||
fn parquet_table_options_config_entry() { | ||
let mut table_config = TableOptions::new(); | ||
table_config.set_file_format(FileType::PARQUET); | ||
table_config.set_config_format(ConfigFileType::PARQUET); | ||
table_config | ||
.set("format.bloom_filter_enabled::col1", "true") | ||
.unwrap(); | ||
|
@@ -1835,7 +1830,7 @@ mod tests { | |
#[test] | ||
fn parquet_table_options_config_metadata_entry() { | ||
let mut table_config = TableOptions::new(); | ||
table_config.set_file_format(FileType::PARQUET); | ||
table_config.set_config_format(ConfigFileType::PARQUET); | ||
table_config.set("format.metadata::key1", "").unwrap(); | ||
table_config.set("format.metadata::key2", "value2").unwrap(); | ||
table_config | ||
|
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the last vestige of the old
FileType
enum. It can be completely ignored if using a custom format.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 perhaps we should add some Trait to unify the handling of options for built in formats and custom formats 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially remove the
TableOptions
code and instead have eachFileFormatFactory
handle configuration. This is actually mostly the case in this PR already, butTableOptions
is a common helper.