Add Scala DF documentation for AUK derivatives. #34
Tested and all checks out - I'll keep an eye on the draft PR @ruebot!
@ruebot
import io.archivesunleashed._
import io.archivesunleashed.df._
val target = udf((vs: Any) => {
var res = ""
if(vs != null){
res = vs.toString.split(",")(1)
}
res
})
val src = udf((vs: Any) => {
var res = ""
if(vs != null){
val s = vs.toString.split(",")(0)
if(s.length() != 0)
res = s.drop(1)
}
res
})
// Strip a leading "www." (and any leading whitespace) from a domain.
val modify = udf((str: String) => str.replaceAll("^\\s*www\\.", ""))
val df = RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz",sc).webpages()
.select($"crawl_date",explode_outer(ExtractLinksDF($"url",$"content")).as("link"))
df.select($"crawl_date",modify(ExtractDomainDF(src($"link"))).as("Source"),modify(ExtractDomainDF(target($"link"))).as("Target"))
.filter($"Source" =!= "")
.filter($"Target" =!= "")
.groupBy("crawl_date","Source","Target")
.count()
.filter($"count" > 5)
.orderBy(desc("count"))
.show(20)
We do not have a DataFrame implementation for writing the result as a graph yet. I will add that code and then update this.
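The `src` and `target` UDFs above pull fields out of the string form of each link by splitting on commas. A minimal plain-Scala sketch of that parsing, without Spark, might look like the following. It assumes a link stringifies as `[source,target,anchor]`, which this thread does not confirm, and `LinkParts` with its methods are hypothetical names used only for illustration.

```scala
// Assumption: a link value stringifies as "[source,target,anchor]".
// LinkParts is a hypothetical helper, not part of aut.
object LinkParts {
  // Split once and guard against missing fields instead of indexing blindly.
  // Note: a URL that itself contains a comma would be split incorrectly.
  def parts(raw: Any): Array[String] =
    if (raw == null) Array.empty[String]
    else raw.toString.stripPrefix("[").stripSuffix("]").split(",", -1)

  // First field, or "" when the value is null.
  def source(raw: Any): String = parts(raw).headOption.getOrElse("")

  // Second field, or "" when there is no second field.
  def target(raw: Any): String = parts(raw).drop(1).headOption.getOrElse("")
}
```

Unlike indexing with `(1)` directly, this returns an empty string when a field is missing, which the later `=!= ""` filters would then drop.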
DataFrame GraphML tested locally with:
df:
import io.archivesunleashed._
import io.archivesunleashed.df._
import io.archivesunleashed.app._
sc.setLogLevel("INFO")
// Web archive collection; web graph.
val webgraph = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
.webgraph()
val graph = webgraph.groupBy(
$"crawl_date",
RemovePrefixWWWDF(ExtractDomainDF($"src")).as("src_domain"),
RemovePrefixWWWDF(ExtractDomainDF($"dest")).as("dest_domain"))
.count()
.filter(!($"dest_domain"===""))
.filter(!($"src_domain"===""))
.filter($"count" > 5)
.orderBy(desc("count"))
WriteGraphML(graph.collect(), "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-df.graphml")
rdd:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
// Web archive collection.
val warcs = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
.keepValidPages()
// GraphML.
val links = warcs
.map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1,
ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraphML(links, "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-rdd.graphml")
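The RDD version strips the `www.` prefix with a regex rather than a helper. A small sketch of what that regex does in plain Scala is below; `stripWww` is a hypothetical name for illustration and is not an aut function.

```scala
// Hypothetical helper mirroring the replaceAll call used in the RDD snippet.
// The pattern is anchored at the start, so only a leading "www." is removed,
// along with any whitespace that precedes it.
def stripWww(domain: String): String =
  domain.replaceAll("^\\s*www\\.", "")
```

Because of the `^` anchor, a `www.` that appears later in the host name is left untouched.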
@lintool @ianmilligan1 here are the first two. We still need to do the third derivative, and I'll move this out of draft when we get it done.
@SinghGursimran can you make this your next focus point in archivesunleashed/aut#223? Converting this (below) to DF?