
[SPARK-44732][SQL] Built-in XML data source support #41832

Closed
wants to merge 8 commits

Conversation

sandip-db (Contributor) commented Jul 3, 2023

What changes were proposed in this pull request?

XML is a widely used data format. An external spark-xml package (https://github.com/databricks/spark-xml) is available to read and write XML data in Spark. Making spark-xml built-in will provide a better user experience for Spark SQL and Structured Streaming. The proposal is to inline the code from the spark-xml package.

The PR has the following commits:
i) The first commit has the following:

  • Copy of spark-xml src files.
  • Update license
  • Scala style and format fixes
  • Change AnyFunSuite to SparkFunSuite

ii) Miscellaneous import and scala style fixes.
iii) Add library dependencies
iv) Resource file path fixes and change AnyFunSuite to SharedSparkSession or SQLTestUtils
v) Exclude XML test resource files from license check
vi) Change import from scala.jdk.Collections to scala.collection.JavaConverters

Why are the changes needed?

Built-in support for the XML data source provides a better user experience than having to import an external package.

Does this PR introduce any user-facing change?

Yes, it adds built-in support for the XML data source.

How was this patch tested?

Tested the new unit-tests that came with the imported spark-xml package.
Also ran ./dev/run-test

sandip-db (Contributor, Author)

@HyukjinKwon

HyukjinKwon (Member)

This would need an SPIP.

@HyukjinKwon HyukjinKwon marked this pull request as draft July 3, 2023 23:45
@sandip-db sandip-db marked this pull request as ready for review July 11, 2023 20:09
sandip-db (Contributor, Author)

> This would need an SPIP.

SPIP link
JIRA
Discussion Thread
Vote thread

HyukjinKwon (Member)

The test failure seems to be flaky. Mind retriggering https://github.com/sandip-db/spark/actions/runs/5745270542/job/15573043078 please, @sandip-db?

}

private def readUntilStartElement(): Boolean = {
currentStartTag = startTag
Member

QQ: is there an existing lib we can use for the parsing?

Contributor Author

XMLEventReader is being used to parse fields from XML records. It may be possible to use XMLEventReader to extract the records as well. The current scope of this PR is to inline spark-xml as-is.
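For illustration, the kind of pull-based record extraction discussed here (scanning a large document for each row-tag element) can be sketched with Python's standard-library incremental parser. This is a hedged sketch, not Spark's StAX-based implementation; the `person` row tag and sample document are assumptions:

```python
import io
import xml.etree.ElementTree as ET

def extract_records(xml_stream, row_tag):
    """Collect one dict per row-tag element while streaming the document."""
    records = []
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == row_tag:
            records.append({child.tag: child.text for child in elem})
            elem.clear()  # release children to keep memory bounded
    return records

doc = b"""<people>
  <person><name>Alice</name><age>25</age></person>
  <person><name>Bob</name><age>30</age></person>
</people>"""

rows = extract_records(io.BytesIO(doc), "person")
# rows == [{'name': 'Alice', 'age': '25'}, {'name': 'Bob', 'age': '30'}]
```

Clearing each element after it is emitted is what keeps this streaming rather than building the whole tree in memory.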

sandip-db (Contributor, Author)

> The test failure seems to be flaky. Mind retriggering https://github.com/sandip-db/spark/actions/runs/5745270542/job/15573043078 please, @sandip-db?

The following test (unrelated to this PR) is failing consistently. It was supposedly fixed by the Jira/PR, but it's continuing to fail even after the fix. How do we get past this? @HyukjinKwon

2023-08-03T17:34:49.5298626Z [info] *** 1 TEST FAILED ***
2023-08-03T17:34:49.5412181Z [error] Failed: Total 9521, Failed 1, Errors 0, Passed 9520, Ignored 27
2023-08-03T17:34:49.5674077Z [error] Failed tests:
2023-08-03T17:34:49.5674974Z [error] org.apache.spark.sql.execution.QueryExecutionSuite

StackTrace:
2023-08-03T17:07:15.5739831Z [info] - Logging plan changes for execution *** FAILED *** (11 milliseconds)
2023-08-03T17:07:15.5740917Z [info] testAppender.loggingEvents.exists(((x$10: org.apache.logging.log4j.core.LogEvent) => x$10.getMessage().getFormattedMessage().contains(expectedMsg))) was false (QueryExecutionSuite.scala:233)
2023-08-03T17:07:15.5741729Z [info] org.scalatest.exceptions.TestFailedException:
2023-08-03T17:07:15.5742406Z [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-08-03T17:07:15.5743140Z [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-08-03T17:07:15.5743968Z [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
2023-08-03T17:07:15.5744873Z [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
2023-08-03T17:07:15.5745601Z [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$34(QueryExecutionSuite.scala:233)
2023-08-03T17:07:15.5746281Z [info] at scala.collection.immutable.List.foreach(List.scala:431)
2023-08-03T17:07:15.5746979Z [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$31(QueryExecutionSuite.scala:232)

- Copy spark-xml sources
- Update license
- Scala style and format fixes
- AnyFunSuite --> SparkFunSuite
… or SQLTestUtils

Also converted AnyFunSuite to SharedSparkSession or SQLTestUtils
The following test in XmlSuite.scala failed with SharedSparkSession, but works with SQLTestUtils.
test("from_xml array basic test")
HyukjinKwon (Member)

@sandip-db would you mind adding some comments about what the new code is (e.g., pom.xml)? Then I think it'd be much easier to merge since most of the code is the already existing one.

I took a cursory look, and it seems pretty fine to me.

sandip-db (Contributor, Author) commented Aug 9, 2023

> @sandip-db would you mind adding some comments about what the new code is (e.g., pom.xml)? Then I think it'd be much easier to merge since most of the code is the already existing one.
>
> I took a cursory look, and it seems pretty fine to me.

Hi @HyukjinKwon and reviewers,
As indicated in the description, there is very little new code in this PR. Most of it is picked up from the spark-xml package.
The first commit is a vanilla copy of spark-xml with license and Scala style fixes.
Reviewers should focus only on the new code in the other commits, which doesn't change any functionality; those changes address dependency and test issues. If it helps, I can squash the other commits into one.

import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

case class XmlDataToCatalyst(
Member

Would need to register this to SQL expression. Can be done in a followup.

/**
* A collection of static functions for working with XML files in Spark SQL
*/
class XmlReader(private var schema: StructType,
Member

Actually I think we can remove the whole file but can be done separately in a followup.

Member

Should probably add a couple of methods in DataFrameReader/DataStreamReader

Contributor Author

I have a PR ready to be sent out with these and other changes you have suggested. Just have to get this merged first  :-)

fs.delete(filesystemPath, true)
} catch {
case e: IOException =>
throw new IOException(
Member

As a followup, we should leverage the error framework we added (see QueryExecutionErrors).

* Support functions for working with XML columns directly.
*/
// scalastyle:off: object.name
object functions {
Member

As a followup, we should move this function to org.apache.spark.sql.functions

import org.apache.spark.sql.execution.datasources.xml.util.{InferSchema, XmlFile}
import org.apache.spark.sql.types.{ArrayType, StructType}

package object xml {
Member

Would have to remove this file too, and move those utils such as schema_of_xml to somewhere else.

@@ -0,0 +1,336 @@
/*
Member

Would have to be placed together with CSVInferSchema at catalyst.

Member

the parsers too

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.sql.execution.datasources.xml.util
Member

We actually have it in the codebase already :-).

import org.apache.spark.sql.execution.datasources.xml.{XmlInputFormat, XmlOptions}
import org.apache.spark.sql.execution.datasources.xml.parsers.StaxXmlGenerator

private[xml] object XmlFile {
Member

We likely won't need this file either.

Contributor Author

All of the above comments will be addressed in this sub-task:
https://issues.apache.org/jira/browse/SPARK-44751

@@ -0,0 +1,20 @@
<people>
<person>
<age born=" 1990-02-24 "> 25 </age>
Member

I was so young :-)

public final class JavaXmlSuite {

private static final int numBooks = 12;
private static final String booksFile = "src/test/resources/test-data/xml-resources/books.xml";
Member

I wonder if we should just name it resources instead of xml-resources. Or create a sub directory?

HyukjinKwon (Member)

I think I'd prefer to merge this first, and address them as a followup so other people can work together. I will review one more time tmr but would be great if others can review this too.

@HyukjinKwon HyukjinKwon changed the title [SPARK-44265][SQL] Built-in XML data source support [SPARK-44732][SQL] Built-in XML data source support Aug 9, 2023

def this() = this(Map.empty)

val charset = parameters.getOrElse("charset", XmlOptions.DEFAULT_CHARSET)
Member

Contributor Author

Created a sub-task for documentation: https://issues.apache.org/jira/browse/SPARK-44752

* column is string-valued, or ArrayType[StructType] if column is an array of strings
* @param options key-value pairs that correspond to those supported by [[XmlOptions]]
*/
def from_xml(e: Column, schema: DataType, options: Map[String, String] = Map.empty): Column = {
Member

Should add Python and SparkR bindings, including Spark Connect.
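As a rough illustration of the from_xml semantics (an XML string column parsed into a struct according to a schema), here is a pure-Python sketch; the mini-schema encoding and type coercion are simplified assumptions, not Spark's implementation:

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-schema: field name -> Python type used for coercion.
SCHEMA = {"name": str, "age": int}

def from_xml_sketch(xml_string, schema):
    """Parse a single XML record into a dict, coercing leaf text by type."""
    root = ET.fromstring(xml_string)
    result = {}
    for field, field_type in schema.items():
        node = root.find(field)
        result[field] = field_type(node.text) if node is not None and node.text else None
    return result

row = from_xml_sketch("<person><name>Alice</name><age>25</age></person>", SCHEMA)
# row == {'name': 'Alice', 'age': 25}
```

Missing fields come back as None, loosely mirroring how a permissive parser yields nulls for absent columns.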

Contributor Author

HyukjinKwon (Member) left a comment

I have no more high-level comments. I will file JIRAs for my own comments tmr.

HyukjinKwon (Member)

Merged to master.

HyukjinKwon (Member)

@sandip-db I would appreciate it if you create sub-task JIRAs based on my comments when you find some time.

hvanhovell pushed a commit to hvanhovell/spark that referenced this pull request Aug 13, 2023
HyukjinKwon pushed a commit that referenced this pull request Aug 22, 2023
### What changes were proposed in this pull request?
This is the second PR related to the built-in XML data source implementation ([jira](https://issues.apache.org/jira/browse/SPARK-44751)).
The previous [PR](#41832) ported the spark-xml package.
This PR addresses the following:
- Implement FileFormat interface
- Address the review comments in the previous [XML PR](#41832)
- Moved from_xml and schema_of_xml to sql/functions
- Moved ".xml" to DataFrameReader/DataFrameWriter
- Removed old APIs like XmlRelation, XmlReader, etc.
- StaxXmlParser changes:
   - Use FailureSafeParser
   - Convert 'Row' usage to 'InternalRow'
   - Convert String to UTF8String
   - Handle MapData and ArrayData for MapType and ArrayType respectively
   - Use TimestampFormatter to parse timestamp
   - Use DateFormatter to parse date
- StaxXmlGenerator changes:
   - Convert 'Row' usage to 'InternalRow'
   - Handle UTF8String for StringType
   - Handle MapData and ArrayData for MapType and ArrayType respectively
   - Use TimestampFormatter to format timestamp
   - Use DateFormatter to format date
- Update XML tests accordingly because of the above changes
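The TimestampFormatter/DateFormatter items above amount to coercing the string leaves produced by the StAX parser into typed values. A hedged pure-Python sketch of that coercion step (the timestamp pattern and field names are illustrative assumptions, not Spark's configurable formatters):

```python
from datetime import date, datetime

def parse_timestamp(s):
    # Illustrative fixed pattern; Spark's TimestampFormatter is configurable.
    return datetime.strptime(s.strip(), "%Y-%m-%dT%H:%M:%S")

def parse_date(s):
    return date.fromisoformat(s.strip())

# Raw string leaves as a StAX-style parser might produce them
# (note the padded whitespace, as in the " 1990-02-24 " test fixture).
record = {"name": "Alice", "born": " 1990-02-24 ", "updated": "2023-08-22T10:30:00"}
parsers = {"born": parse_date, "updated": parse_timestamp}

typed = {k: parsers.get(k, lambda v: v)(v) for k, v in record.items()}
# typed == {'name': 'Alice', 'born': date(1990, 2, 24),
#           'updated': datetime(2023, 8, 22, 10, 30)}
```

Fields without a registered parser pass through unchanged, which is roughly how string-typed columns are left alone.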

### Why are the changes needed?
These changes are required to bring XML data source capabilities on par with CSV and JSON and to support features like streaming, which requires the FileFormat interface to be implemented.

### Does this PR introduce _any_ user-facing change?
Yes, it adds support for XML data source.

### How was this patch tested?
- Ran all the XML unit tests.
- GitHub Actions

Closes #42462 from sandip-db/xml-file-format-master.

Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024