Replies: 2 comments 1 reply
-
see: #307 (comment) Then to use the filter via programmatic config: spark.enableLineageTracking(
AgentConfig.builder()
.postProcessingFilter(myFilter)
.build()
) Assuming the filter is not specified elsewhere. If you need multiple filters, you would have to use a composite filter.
Beta Was this translation helpful? Give feedback.
0 replies
-
Hi %scala
import scala.util.parsing.json.JSON
import za.co.absa.spline.harvester.SparkLineageInitializer._
import za.co.absa.spline.agent.AgentConfig
import za.co.absa.spline.harvester.postprocessing.AbstractPostProcessingFilter
import za.co.absa.spline.harvester.postprocessing.PostProcessingFilter
import org.apache.commons.configuration.Configuration
import za.co.absa.spline.harvester.conf.StandardSplineConfigurationStack
import za.co.absa.spline.harvester.HarvestingContext
import za.co.absa.spline.producer.model.ExecutionPlan
import za.co.absa.spline.producer.model.ExecutionEvent
import za.co.absa.spline.producer.model.ReadOperation
import za.co.absa.spline.producer.model.WriteOperation
import za.co.absa.spline.producer.model.DataOperation
import za.co.absa.spline.harvester.ExtraMetadataImplicits._
import za.co.absa.spline.harvester.SparkLineageInitializer._
val notebookInformationJson = dbutils.notebook.getContext.toJson
val outerMap = JSON.parseFull(notebookInformationJson).getOrElse(0).asInstanceOf[Map[String,String]]
val tagMap = outerMap("tags").asInstanceOf[Map[String,String]]
val extraContextMap = outerMap("extraContext").asInstanceOf[Map[String,String]]
val notebookPath = extraContextMap("notebook_path").split("/")
val workspaceUrl=tagMap("browserHostName")
val workspaceName=dbutils.notebook().getContext().notebookPath.get
val notebookURL = tagMap("browserHostName")+"/?o="+tagMap("orgId")+tagMap("browserHash")
val user = tagMap("user")
val name = notebookPath(notebookPath.size-1)
val notebookInfo = Map("notebookURL" -> notebookURL,
"user" -> user,
"workspaceName" ->workspaceName,
"workspaceUrl" -> workspaceUrl,
"name" -> name,
"mounts" -> dbutils.fs.ls("/FileStore/tables").map(_.path),
"timestamp" -> System.currentTimeMillis)
val notebookInfoJson = scala.util.parsing.json.JSONObject(notebookInfo)
class CustomFilter extends PostProcessingFilter {
def this(conf: Configuration) = this()
override def processExecutionEvent(event: ExecutionEvent, ctx: HarvestingContext): ExecutionEvent =
event.withAddedExtra(Map("foo" -> "bar"))
override def processExecutionPlan(plan: ExecutionPlan, ctx: HarvestingContext ): ExecutionPlan =
plan.withAddedExtra(Map( "notebookInfo" -> notebookInfoJson))
override def processReadOperation(op: ReadOperation, ctx: HarvestingContext ): ReadOperation =
op.withAddedExtra(Map("foo" -> "bar"))
override def processWriteOperation(op: WriteOperation, ctx: HarvestingContext): WriteOperation =
op.withAddedExtra(Map("foo" -> "bar"))
override def processDataOperation(op: DataOperation, ctx: HarvestingContext ): DataOperation =
op.withAddedExtra(Map("foo" -> "bar"))
}
val myInstance = new CustomFilter()
spark.enableLineageTracking(
AgentConfig.builder()
.postProcessingFilter(myInstance)
.build()
) |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Hi
I have this code to get the notebook details into the Spline agent:
However from version 1.0.0 this is not supported
Do you have a way to rewrite it so that it is supported on 1.0.0?
Beta Was this translation helpful? Give feedback.
All reactions