Add Scala DF documentation for AUK derivatives. #34

Merged: 6 commits, Apr 14, 2020

Conversation

@ruebot ruebot commented Jan 1, 2020

@lintool @ianmilligan1 here are the first two. We still need to do the third derivative, and I'll move this out of draft when we get it done.

@SinghGursimran can you make this your next focus point in archivesunleashed/aut#223? Converting this (below) to DF?

val links = validPages
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph.asGraphml(links, "example.graphml")
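
For reference, a minimal sketch of what the `replaceAll` calls in the query above are meant to do, assuming the intended regex is `^\s*www\.` (written `"^\\s*www\\."` in a Scala string literal). The `stripWww` helper and the sample domains are hypothetical and not part of aut.

```scala
// Hypothetical helper (not part of aut): strips a leading "www." from a domain,
// tolerating leading whitespace, using the pattern assumed above.
def stripWww(domain: String): String =
  domain.replaceAll("^\\s*www\\.", "")

// Made-up sample inputs.
val examples = Seq("www.example.com", "  www.example.com", "archive.org", "awww.example.com")

// Only a "www." at the very start (after optional whitespace) is removed.
val stripped = examples.map(stripWww)
stripped.foreach(println)
```

Because the pattern is anchored with `^`, a domain such as `awww.example.com` is left untouched.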

@ruebot ruebot requested review from lintool and ianmilligan1 January 1, 2020 18:51

@ianmilligan1 ianmilligan1 left a comment

Tested and all checks out - I'll keep an eye on the draft PR @ruebot!

SinghGursimran commented Jan 2, 2020

@ruebot
DataFrame implementation of the above query:

import io.archivesunleashed._
import io.archivesunleashed.df._

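// Extracts the second comma-separated field (the target URL) from the stringified link.
val target = udf((vs: Any) => {
  var res = ""
  if (vs != null) {
    res = vs.toString.split(",")(1)
  }
  res
})

// Extracts the first comma-separated field (the source URL) from the stringified link,
// dropping its first character (the opening bracket of the string form).
val src = udf((vs: Any) => {
  var res = ""
  if (vs != null) {
    val s = vs.toString.split(",")(0)
    if (s.length() != 0) {
      res = s.drop(1)
    }
  }
  res
})

// Strips a leading "www." from a domain.
val modify = udf((str: String) => str.replaceAll("^\\s*www\\.", ""))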

val df = RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .select($"crawl_date", explode_outer(ExtractLinksDF($"url", $"content")).as("link"))

df.select(
    $"crawl_date",
    modify(ExtractDomainDF(src($"link"))).as("Source"),
    modify(ExtractDomainDF(target($"link"))).as("Target"))
  .filter($"Source" =!= "")
  .filter($"Target" =!= "")
  .groupBy("crawl_date", "Source", "Target")
  .count()
  .filter($"count" > 5)
  .orderBy(desc("count"))
  .show(20)

We do not have a DataFrame implementation for writing the graph yet. I will add that code and then update this.
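
The filtering, grouping, counting, and thresholding steps of the query above can be sketched on plain Scala collections, without Spark. This is only an illustration of the logic: the sample triples and the `threshold` name are made up, and the real pipeline runs on DataFrame columns rather than an in-memory `Seq`.

```scala
// Made-up sample data: (crawl_date, source domain, target domain) triples.
val edges: Seq[(String, String, String)] =
  Seq.fill(6)(("20200101", "example.com", "archive.org")) ++
  Seq.fill(2)(("20200101", "example.com", "github.com")) ++
  Seq(("20200101", "", "archive.org"))

// Hypothetical name for the cut-off used in the query (count > 5).
val threshold = 5

val counted: Seq[((String, String, String), Int)] = edges
  .filter { case (_, src, dest) => src.nonEmpty && dest.nonEmpty } // drop empty domains
  .groupBy(identity)                                // group identical (date, src, dest) triples
  .map { case (edge, group) => (edge, group.size) } // count occurrences per triple
  .filter { case (_, n) => n > threshold }          // keep only frequent edges
  .toSeq
  .sortBy { case (_, n) => -n }                     // most frequent first

counted.foreach(println)
```

With this sample data only the `example.com` to `archive.org` edge survives, since it is the only triple that occurs more than five times.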

@ruebot ruebot marked this pull request as ready for review April 14, 2020 12:18

ruebot commented Apr 14, 2020

DataFrame graphml tested locally with:

df:

import io.archivesunleashed._
import io.archivesunleashed.df._
import io.archivesunleashed.app._

sc.setLogLevel("INFO")

// Web archive collection; web graph.
val webgraph = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .webgraph()


val graph = webgraph
  .groupBy(
    $"crawl_date",
    RemovePrefixWWWDF(ExtractDomainDF($"src")).as("src_domain"),
    RemovePrefixWWWDF(ExtractDomainDF($"dest")).as("dest_domain"))
  .count()
  .filter(!($"dest_domain" === ""))
  .filter(!($"src_domain" === ""))
  .filter($"count" > 5)
  .orderBy(desc("count"))

WriteGraphML(graph.collect(), "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-df.graphml")

rdd:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
  
sc.setLogLevel("INFO")

// Web archive collection.
val warcs = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()


// GraphML.
val links = warcs
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraphML(links, "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-rdd.graphml")
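
To give a rough idea of the kind of file these calls produce, here is a hypothetical sketch that renders weighted domain edges as a GraphML string. It is not aut's `WriteGraphML` implementation: `escapeXml`, `toGraphML`, the `weight` key, and the sample edge are all assumptions made for illustration, and the crawl date is left out for brevity.

```scala
// Hypothetical sketch, NOT aut's WriteGraphML: renders weighted edges as GraphML text.

// Escapes the characters that are unsafe inside XML attribute values.
def escapeXml(s: String): String =
  s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

// Each edge is (source domain, target domain, link count).
def toGraphML(edges: Seq[(String, String, Int)]): String = {
  // Every distinct domain becomes one node.
  val nodes = edges.flatMap { case (src, dest, _) => Seq(src, dest) }.distinct
  val nodeXml = nodes.map(n => s"""    <node id="${escapeXml(n)}"/>""")
  // Every edge carries its count as a "weight" data element.
  val edgeXml = edges.map { case (src, dest, weight) =>
    s"""    <edge source="${escapeXml(src)}" target="${escapeXml(dest)}">
       |      <data key="weight">$weight</data>
       |    </edge>""".stripMargin
  }
  (Seq(
    """<?xml version="1.0" encoding="UTF-8"?>""",
    """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">""",
    """  <key id="weight" for="edge" attr.name="weight" attr.type="int"/>""",
    """  <graph id="G" edgedefault="directed">"""
  ) ++ nodeXml ++ edgeXml ++ Seq("  </graph>", "</graphml>")).mkString("\n")
}

// Made-up sample edge.
val graphml = toGraphML(Seq(("example.com", "archive.org", 6)))
println(graphml)
```

The resulting string could be written to a `.graphml` file and opened in a graph tool; the attributes that aut actually emits may differ.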

@ianmilligan1 ianmilligan1 left a comment

Works like a charm!

@ianmilligan1 ianmilligan1 merged commit f5b2652 into master Apr 14, 2020
@ianmilligan1 ianmilligan1 deleted the auk-derv-df branch April 14, 2020 21:38