Dataframe matchbox Implementations #387

Merged: 12 commits, Dec 5, 2019

SinghGursimran (Collaborator)

DataFrame implementations for ExtractDate, DetectLanguage and ExtractBoilerpipeText

For Testing:

ExtractDate:

import io.archivesunleashed._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions._


// Extract the year from each page's crawl date; print the first 3 rows untruncated.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .select(ExtractDateDF($"crawl_date", lit("YYYY")))
  .show(3, false)
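
The `lit("YYYY")` argument selects which part of the crawl date is returned. As a rough illustration only, and not the aut implementation, the sketch below assumes crawl dates are `YYYYMMDD` strings and that the component names are `YYYY`, `MM`, `DD` and `YYYYMM`. The names `ExtractDateSketch` and `extract` are hypothetical.

```scala
// Hypothetical stand-in for date-component extraction; not the aut source.
object ExtractDateSketch {
  // Assumes fullDate is a YYYYMMDD string; returns "" for null or too-short input.
  def extract(fullDate: String, component: String): String = {
    if (fullDate == null || fullDate.length < 8) {
      ""
    } else {
      component match {
        case "YYYY"   => fullDate.substring(0, 4) // year
        case "MM"     => fullDate.substring(4, 6) // month
        case "DD"     => fullDate.substring(6, 8) // day
        case "YYYYMM" => fullDate.substring(0, 6) // year and month
        case _        => fullDate.substring(0, 8) // fall back to the full date
      }
    }
  }

  def main(args: Array[String]): Unit = {
    println(extract("20080430", "YYYY")) // prints 2008
    println(extract("20080430", "MM"))   // prints 04
  }
}
```

A two-argument function of this shape could be exposed to DataFrames by wrapping it with `udf` from `org.apache.spark.sql.functions`, which is presumably what a `...DF` helper needs to do.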

DetectLanguage:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .select(DetectLanguageDF($"content"))
  .show(3, false)

ExtractBoilerpipeText:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .select(ExtractBoilerpipeTextDF($"content"))
  .show(3, false)

ruebot self-requested a review on December 4, 2019 21:28
codecov bot commented Dec 4, 2019

Codecov Report

Merging #387 into master will decrease coverage by 0.73%.
The diff coverage is 48.27%.

@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
- Coverage    76.7%   75.97%   -0.74%     
==========================================
  Files          41       41              
  Lines        1451     1469      +18     
  Branches      268      274       +6     
==========================================
+ Hits         1113     1116       +3     
- Misses        221      236      +15     
  Partials      117      117

ruebot (Member) left a comment

A couple of formatting and comment prose tweaks.

Tested, and works as expected. I'll get a PR in for the docs.

@@ -26,6 +26,7 @@ import java.util.Base64
/**
* UDFs for data frames.
*/

Member

Let's remove this blank line.

@@ -49,4 +49,29 @@ object ExtractDate {
""
}
}

/** Extracts the wanted date component from a date (for DataFrames).
Member

Let's reword this to:

Extracts a provided date component from a date (for DataFrames).

ruebot (Member) commented Dec 5, 2019

#387

ruebot merged commit 079cd24 into archivesunleashed:master on Dec 5, 2019
ianmilligan1 pushed a commit to archivesunleashed/aut-docs that referenced this pull request on Dec 5, 2019