Updates for https://github.com/archivesunleashed/aut/pull/387 #30

Merged 2 commits on Dec 5, 2019
current/text-analysis.md (14 additions, 14 deletions)
@@ -26,7 +26,7 @@ import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(r.getContentString)))
.saveAsTextFile("plain-text/")
```

@@ -53,7 +53,7 @@ import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-noheaders/")
```

@@ -67,7 +67,7 @@ import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
.extractValidPagesDF()
-  .select(RemoveHTML($"content"))
+  .select(RemoveHTMLDF($"content"))
.write
.option("header","true")
.csv("plain-text-noheaders/")
@@ -89,7 +89,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDomains(Set("www.archive.org"))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-domain/")
```
### Scala DF
@@ -114,7 +114,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepUrlPatterns(Set("(?i)http://www.archive.org/details/.*".r))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("details/")
```

@@ -138,7 +138,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDomains(Set("www.archive.org"))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, ExtractBoilerpipeText(RemoveHTTPHeaderRDD(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, ExtractBoilerpipeTextRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-no-boilerplate/")
```

@@ -165,8 +165,8 @@ import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
-  .keepDate(List("200804"), ExtractDate.DateComponent.YYYYMM)
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .keepDate(List("200804"), ExtractDateRDD.DateComponent.YYYYMM)
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-date-filtered-200804/")
```

@@ -177,8 +177,8 @@ import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
-  .keepDate(List("2008"), ExtractDate.DateComponent.YYYY)
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .keepDate(List("2008"), ExtractDateRDD.DateComponent.YYYY)
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-date-filtered-2008/")
```

@@ -189,8 +189,8 @@ import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
-  .keepDate(List("2008","2015"), ExtractDate.DateComponent.YYYY)
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .keepDate(List("2008","2015"), ExtractDateRDD.DateComponent.YYYY)
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-date-filtered-2008-2015/")
```

@@ -223,7 +223,7 @@ import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDomains(Set("www.archive.org"))
.keepLanguages(Set("fr"))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-fr/")
```

@@ -249,7 +249,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz",sc).keepValidPages()
.keepContent(Set("radio".r))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeaderRDD(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
.saveAsTextFile("plain-text-radio/")
```

current/toolkit-walkthrough.md (2 additions, 2 deletions)
@@ -197,7 +197,7 @@ Take some time to explore the various options and variables that you can swap in
Some options:

* **Keep URL Patterns**: Instead of domains, what if you wanted to have text relating to just a certain pattern? Substitute `.keepDomains` for a command like: `.keepUrlPatterns(Set("(?i)http://geocities.com/EnchantedForest/.*".r))`
- * **Filter by Date**: What if we just wanted data from 2006? You could add the following command after `.keepValidPages()`: `.keepDate(List("2006"), ExtractDate.DateComponent.YYYY)`
+ * **Filter by Date**: What if we just wanted data from 2006? You could add the following command after `.keepValidPages()`: `.keepDate(List("2006"), ExtractDateRDD.DateComponent.YYYY)`
* **Filter by Language**: What if you just want French-language pages? After `.keepDomains` add a new line: `.keepLanguages(Set("fr"))`.
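As a rough sketch, the three options above can be chained in a single script. This assumes the same RDD API used in the snippets elsewhere in this PR; the input glob and output directory are hypothetical placeholders:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Combine all three filters from the list above:
// URL pattern, crawl year, and language.
RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("(?i)http://geocities.com/EnchantedForest/.*".r))
  .keepDate(List("2006"), ExtractDateRDD.DateComponent.YYYY)
  .keepLanguages(Set("fr"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(r.getContentString)))
  .saveAsTextFile("/data/2006-fr-text") // hypothetical output directory
```

Each `keep*` call narrows the RDD before the `map`, so the order of the filters does not change the result, only how early records are discarded.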

For example, if we just wanted the French-language Liberal pages, we would run:
@@ -222,7 +222,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
.keepValidPages()
-  .keepDate(List("2006"), ExtractDate.DateComponent.YYYY)
+  .keepDate(List("2006"), ExtractDateRDD.DateComponent.YYYY)
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(r.getContentString)))
.saveAsTextFile("/data/2006-text")
```