
Add support for processing Photon event logs in Scala #1338

Open · wants to merge 16 commits into dev from spark-rapids-tools-251-support-photon-in-scala
Conversation

@parthosa (Collaborator) commented Sep 10, 2024

Contributes to #251. This PR adds support for qualifying Photon event logs in Scala.

Approach:

  • Mapped Photon operators to Spark operators using com.databricks.photon.PhotonSupport.
  • During Spark plan graph construction, identified Photon nodes and created:
    • PhotonSparkPlanGraphNode: extends SparkPlanGraphNode for non-WholeStageCodegen nodes.
    • PhotonSparkPlanGraphCluster: extends SparkPlanGraphCluster for Photon's WholeStageCodegen nodes.
  • Added PhotonPlanParser for non-WholeStageCodegen nodes:
    • Includes a parseNode method that parses Photon nodes using a Photon-specific parser (e.g., PhotonBroadcastNestedLoopJoinExecParser).
    • Falls back to the Spark CPU parser if no Photon-specific parser is defined (see the sketch after this list).
  • Added PhotonStageExecParser for WholeStageCodegen nodes.
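
A minimal sketch of that parse-with-fallback flow (the surrounding signatures are illustrative assumptions; PhotonPlanParser, PhotonSparkPlanGraphNode, ExecInfo, and PluginTypeChecker are the tool's internal types, not shown here):

object PhotonPlanParser {
  // Photon execs that need more than a simple rename get a dedicated parser.
  private val photonParsers:
      Map[String, (PhotonSparkPlanGraphNode, PluginTypeChecker, Long) => ExecInfo] =
    Map("PhotonBroadcastNestedLoopJoin" ->
      ((node, checker, sqlID) =>
        PhotonBroadcastNestedLoopJoinExecParser(node, checker, sqlID).parse))

  def parseNode(node: PhotonSparkPlanGraphNode, checker: PluginTypeChecker,
      sqlID: Long): ExecInfo = {
    photonParsers.get(node.photonName) match {
      case Some(photonParser) => photonParser(node, checker, sqlID)
      case None =>
        // No Photon-specific parser: node.name already carries the mapped
        // Spark name, so the Spark CPU parser can handle it directly.
        SQLPlanParser.parseSparkNode(node, checker, sqlID)
    }
  }
}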

Limitations:

  • Some mappings are one-to-many; the tool selects the first match.
  • Supports Databricks Runtime up to 13.3. Future versions will require separate mappings.

Output Changes:

  • No changes to the output files schema.
  • Photon event logs are now processed successfully.

Testing:

Unit Tests:

  • QualificationSuite: Added tests for Photon processing.
  • PhotonPlanParserSuite: Added tests for Photon nodes.

E2E Tests:

  • Updated Photon test case from skipped to success.

Manual Test Command:

java -Xmx8g -XX:+UseG1GC -cp "$SPARK_RAPIDS_TOOLS_JAR:$SPARK_HOME/jars/*" com.nvidia.spark.rapids.tool.qualification.QualificationMain <photon-event-log>

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa added feature request New feature or request core_tools Scope the core module (scala) labels Sep 10, 2024
@parthosa parthosa self-assigned this Sep 10, 2024
@parthosa parthosa added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Sep 10, 2024
@parthosa parthosa marked this pull request as ready for review September 10, 2024 23:20
core/src/main/resources/photonOperatorMapping.json (outdated review thread, resolved)
// Implicitly define JSON formats for deserialization using DefaultFormats
implicit val formats: Formats = DefaultFormats
// Extract and deserialize the JValue object into a Map[String, String]
// TODO: Instead of only extracting the first value, we should consider extracting all values
Collaborator:

So is this implying that the operators with multiple entries in the mapping file aren't actually used? Seems like a must-fix.

Collaborator:

Same question here about selecting the first match: is there a plan to fix this later?

Collaborator Author:

Discussed offline with Tom: for operators that have one-to-many mappings, we cannot distinguish between them, so we select the first one for now.
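
For context, a minimal json4s sketch of this first-match behavior (the JSON shape and operator names here are hypothetical, not the actual photonOperatorMapping.json contents):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object PhotonMappingSketch {
  implicit val formats: Formats = DefaultFormats

  // Hypothetical one-to-many entry: one Photon operator, several Spark candidates.
  val json = """{ "PhotonShuffleExchange": ["ShuffleExchangeExec", "AQEShuffleReadExec"] }"""

  // Deserialize, then keep only the first Spark operator per Photon operator.
  val firstMatch: Map[String, String] =
    parse(json).extract[Map[String, Seq[String]]].map {
      case (photonOp, sparkOps) => photonOp -> sparkOps.head
    }
}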

nodeName = sparkNode,
simpleString = planInfo.simpleString.replace(planInfo.nodeName, sparkNode),
children = planInfo.children,
metadata = planInfo.metadata,
Collaborator:

Does the metadata all line up, and the metrics? I wouldn't expect the metrics to be anywhere close.

Collaborator Author:

Could you explain what you mean by lining up the metadata and the metrics?

Collaborator:

Let's talk offline. I mean that if Photon execs have certain parameters, we need to make sure the CPU exec we replace them with has its parameters filled in correctly. Beyond parameters, there are also metrics that come out with each exec. In general I think the Photon ones keep the CPU metrics and add their own, but we need to verify that, especially if we use those metrics for other heuristics.

Collaborator Author:

  • Verified from NDS Photon runs that there are no Photon-specific parameters.
  • Verified metrics from Photon runs; the CPU metrics are kept intact.

} else {
  // If exprString has the format: Inner, BuildRight
  // Note: This format is present in Photon event logs
  (nestedLoopParameters(1).trim, nestedLoopParameters(0).trim)
Collaborator:

This perhaps goes back to my last comment: it seems we aren't really doing the full mapping, with parameters and everything? If we are going to convert, it seems better to convert fully over to a common format if possible. If that's not possible, then we need something like a shim layer to deal with it.

Collaborator Author:

  • By default, Photon-specific parsing is not needed in most cases; replacing the name with its equivalent Spark name is sufficient.
  • In case a Photon-specific parser is needed, we define a custom parser (e.g., classes like PhotonBroadcastNestedLoopJoinExecParser), as sketched below.
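
A rough sketch of such a custom parser, tying it to the argument-order snippet quoted above (the base class and the overridden method name are assumptions for illustration, not the PR's exact API):

case class PhotonBroadcastNestedLoopJoinExecParser(
    node: PhotonSparkPlanGraphNode,
    checker: PluginTypeChecker,
    sqlID: Long)
  extends BroadcastNestedLoopJoinExecParserBase(node, checker, sqlID) {

  // Photon event logs emit the parameters as "Inner, BuildRight" (join type
  // first), while Spark CPU logs emit "BuildRight, Inner".
  override protected def extractBuildAndJoinTypes(exprString: String): (String, String) = {
    val nestedLoopParameters = exprString.split(",", 2)
    (nestedLoopParameters(1).trim, nestedLoopParameters(0).trim)
  }
}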

@@ -46,6 +46,7 @@ import org.apache.spark.sql.rapids.tool.profiling.{ApplicationInfo, SparkPlanInf
object GenerateDot {
val GPU_COLOR = "#76b900" // NVIDIA Green
val CPU_COLOR = "#0071c5"
// TODO: Add color for Photon nodes
Collaborator:

Not sure why this is here? Are we printing the Photon plan or the converted plan, or do we have options for both?

Collaborator Author:

Removed this

nodeIdGenerator: AtomicLong,
nodes: mutable.ArrayBuffer[SparkPlanGraphNode],
edges: mutable.ArrayBuffer[SparkPlanGraphEdge],
parent: SparkPlanGraphNode,
subgraph: SparkPlanGraphCluster,
exchanges: mutable.HashMap[SparkPlanInfo, SparkPlanGraphNode]): Unit = {
// Replace Photon node names with Spark node names
// TODO: Skip this if app.isPhoton is false
Collaborator:

I would like to see this sooner rather than later if we can easily determine it up front.

Collaborator Author:

Changed the flow. We do not create a new instance of SparkPlanInfo here; we are only interested in the name.

@parthosa parthosa linked an issue Sep 11, 2024 that may be closed by this pull request
@amahussein (Collaborator) left a comment:

It would be nice to learn from our experience dealing with different platforms/versions and be ahead of the changes/feature requests.
I had some previous plans to use shims/plugins to extend the implementation for different operators/environments, and it never materialized because of priorities.
My 2 cents:

  • It is more accurate to parse an operator based on the environmentType. This is the reason I changed the arguments of SqlPlanParser to pass in the app, so that we can get this information. Ideally, SqlPlanParser should be shimmed since we have DB, DB-Photon, and other custom Spark distributions from other customers.
  • Versioning is another dimension. We have suffered from versioning a lot, especially in testing. I expect that Photon changes very frequently and that optimizations will vary from one release to another. We will be shooting ourselves in the foot if we repeat the same methodology.
  • Our use of trait ExecParser could be improved. There is so much copy-pasted code across the ExecParser classes. Ideally an abstract class should contain all the functions we need to apply, and then we only implement the code for what needs to be handled differently. This would reduce our code base significantly (see the sketch below).

photonOperatorMapping.json is a very good first step. In the future, as an improvement, it would be nice to see a generic way to plug in other frameworks.

Thanks @parthosa !
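
A rough sketch of the abstract-base idea suggested above (ExecParser, ExecInfo, and PluginTypeChecker are the tool's internal types; the checker helpers and the ExecInfo construction shown here are assumptions, not existing signatures):

abstract class BaseExecParser(
    node: SparkPlanGraphNode,
    checker: PluginTypeChecker,
    sqlID: Long) extends ExecParser {

  // Shared defaults; subclasses override only what differs per exec.
  def fullExecName: String = node.name + "Exec"
  def parseExpressions(exprString: String): Array[String] = Array.empty

  override def parse: ExecInfo = {
    val expressions = parseExpressions(node.desc)
    // isExecSupported/getSpeedupFactor are assumed helper names on the checker.
    val isSupported = checker.isExecSupported(fullExecName)
    val speedupFactor =
      if (isSupported) checker.getSpeedupFactor(fullExecName) else 1.0
    // ExecInfo construction shown schematically; the real signature differs.
    ExecInfo(sqlID, node.name, expressions.mkString(";"),
      speedupFactor, duration = None, node.id, isSupported, children = None)
  }
}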

@parthosa (Collaborator Author) commented Sep 17, 2024

Thanks @amahussein for the feedback.

It is more accurate to parse an operator based on the environmentType.

I agree. I was thinking of designing parsers specific to each Photon exec that extend the corresponding Spark exec parser, e.g.:

case class PhotonBroadcastNestedLoopJoinExecParser(
    node: PhotonSparkPlanGraphNode,
    checker: PluginTypeChecker,
    sqlID: Long)
  extends BroadcastNestedLoopJoinExecParserBase(node, checker, sqlID)

Versioning is another dimension. We have suffered from versioning a lot.

Currently we are using a mapping generated from a single Databricks Runtime version, 13.3. In the future, we could extend this with separate mapping files per version, along the lines sketched below.
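
A possible shape for that per-version extension (the file names and lookup logic are hypothetical; only the 13.3 mapping exists today):

object PhotonOperatorMappings {
  // One mapping resource per Databricks Runtime line, oldest first.
  private val byMajorVersion: Seq[(Int, String)] = Seq(
    13 -> "photonOperatorMapping.json"
  )

  // Pick the newest mapping at or below the requested runtime's major
  // version; fall back to the oldest known mapping otherwise.
  def resourceFor(dbrVersion: String): String = {
    val major = dbrVersion.takeWhile(_.isDigit).toInt
    byMajorVersion.takeWhile { case (v, _) => v <= major }
      .lastOption.map(_._2)
      .getOrElse(byMajorVersion.head._2)
  }
}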

@parthosa parthosa marked this pull request as draft September 18, 2024 17:18
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa force-pushed the spark-rapids-tools-251-support-photon-in-scala branch from fea97ec to 9bc9d10 Compare September 19, 2024 20:46
Comment on lines +34 to +35
name: Check for file over 4.0MiB
args: ['--maxkb=4000', '--enforce-all']
Collaborator Author:

Increasing the threshold: even with ultra zstd compression, we were unable to reduce the Photon event log size below 2GB.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa marked this pull request as ready for review September 19, 2024 22:04
/**
 * Parse the SparkPlanGraphNode and return ExecInfo.
 */
def parseSparkNode(
Collaborator Author:

Moved from parsePlanNode

@@ -0,0 +1,82 @@
/*
Collaborator Author:

Moved methods from SQLPlanParserSuite to BasePlanParserSuite

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
cindyyuanjiang previously approved these changes Sep 26, 2024

@cindyyuanjiang (Collaborator) left a comment:

Thanks @parthosa!

@amahussein (Collaborator) left a comment:

Thanks @parthosa
We can discuss the ToolsPlanGraph changes offline.

@@ -466,6 +467,7 @@ object QualOutputWriter {
val RECOMMENDED_WORKER_NODE_TYPE = "Recommended Worker Node Type"
val DRIVER_NODE_TYPE = "Driver Node Type"
val TOTAL_CORE_SEC = "Total Core Seconds"
val IS_PHOTON = "Photon App"
Collaborator:


Is it part of the requirements to add a binary flag for "Photon App" in the output files?
If not, then this should be part of the app properties Map[String, String]. For example, for a user running on Dataproc, the column will not be useful.

Collaborator Author:


Removed the binary flag, as we can detect this information from spark_properties.csv.

Comment on lines 198 to 201
val photonName: String,
val photonDesc: String,
sparkName: String,
sparkDesc: String,
Collaborator:


We can revisit this because I think it is redundant to keep this information at the node level; the mapping should be operator-scoped rather than node-scoped. This will also impact memory usage.

Collaborator Author:


sparkName and sparkDesc are passed to SparkPlanGraphNode as name and desc respectively.

This ensures that any call to node.name and node.desc returns the corresponding Spark name and desc.
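
Roughly, the node class has this shape (the id and metrics parameters are assumed from Spark's SparkPlanGraphNode constructor, not quoted from the PR):

class PhotonSparkPlanGraphNode(
    id: Long,
    val photonName: String,
    val photonDesc: String,
    sparkName: String,
    sparkDesc: String,
    metrics: Seq[SQLPlanMetric])
  // Passing the mapped Spark name/desc to the parent means node.name and
  // node.desc transparently resolve to the Spark equivalents downstream.
  extends SparkPlanGraphNode(id, sparkName, sparkDesc, metrics)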

@@ -290,7 +356,8 @@ object ToolsPlanGraph {
planInfo.nodeName,
planInfo.simpleString,
mutable.ArrayBuffer[SparkPlanGraphNode](),
metrics)
metrics,
isPhotonNode = isPhotonNode)
Collaborator:


We don't need isPhotonNode as an argument because it is set using an object's method. Then we can have different classes for Photon nodes.

Collaborator Author:


Removed the isPhotonNode argument, as we have different classes for Photon nodes.

@amahussein (Collaborator) left a comment:

Discussed some of the main concepts offline. The action items:

  • Adding a column "Is Photon App" to all SQLs and applications seems invasive.
  • Discussed the possibility of generating that as a property map that can be extended later to include any sort of information usable by QualX.

@parthosa (Collaborator Author) commented Oct 2, 2024

Converting this to a draft, as changes in PR-1362 will affect it.

@parthosa parthosa marked this pull request as draft October 2, 2024 22:47
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
…a-dev

# Conflicts:
#	core/src/main/scala/org/apache/spark/sql/rapids/tool/util/ToolsPlanGraph.scala
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa marked this pull request as ready for review October 9, 2024 00:14
@parthosa (Collaborator Author) commented Oct 9, 2024

@amahussein

Adding a column "Is Photon App" to all SQLs and applications seems invasive.

Removed the Is Photon App column from the output file.

Discussed the possibility of generating that as a property map that can be extended later to include any sort of information usable by QualX.

We decided that it is not needed, since downstream processes (QualX and the Python CLI) can detect a Photon app from spark_properties.csv, e.g. as sketched below.
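
For reference, a minimal sketch of that downstream detection (the property key and value format are how Databricks typically reports the runtime version; treat the exact key as an assumption):

object PhotonAppDetector {
  // Databricks runtimes report their version via this property; Photon
  // runtimes include "photon" in the value, e.g. "13.3.x-photon-scala2.12".
  private val DbrVersionKey = "spark.databricks.clusterUsageTags.sparkVersion"

  def isPhotonApp(sparkProperties: Map[String, String]): Boolean =
    sparkProperties.get(DbrVersionKey).exists(_.toLowerCase.contains("photon"))
}

// Example:
// PhotonAppDetector.isPhotonApp(
//   Map("spark.databricks.clusterUsageTags.sparkVersion" -> "13.3.x-photon-scala2.12"))
// returns true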

@parthosa parthosa removed the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Oct 9, 2024
Successfully merging this pull request may close these issues.

[FEA] Add qualification support for Databricks Photon event logs