Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ArchiveRecord classes; addresses #101 and #102 #107

Merged
merged 8 commits into from
Oct 29, 2017
Merged

Conversation

MapleOx
Copy link
Collaborator

@MapleOx MapleOx commented Oct 28, 2017

GitHub issue(s):

What does this Pull Request do?

Refactors unneeded archive record classes in the Scala code.

  • Removed ArcRecord and WarcRecord scala classes
  • Renamed GenericArchiveRecord to ArchiveRecord
  • Removed loadArc and loadWarc from RecordLoader
  • Baked keepValidPages into RecordLoader (see Bake keepValidPages() into RecordLoader #101)
  • Modified tests accordingly

How should this be tested?

TravisCI + code review

Additional Notes:

I got similar errors as @ruebot when changing loadArc to loadArchives in some of the tests. I found that this was due to the fact that ArcRecords are serializable but GenericArchiveRecords are not, because of the two fields

 var arcRecord: ARCRecord
 var warcRecord: WARCRecord

in GenericArchiveRecord which are not serializable.

So, for example, calling collect() works on RDDs containing ArcRecords but not on RDDs containing GenericArchiveRecords. I changed the offending tests so that they test for the same thing without calling collect()...

Interested parties

@ruebot @greebie @lintool

@codecov
Copy link

codecov bot commented Oct 28, 2017

Codecov Report

Merging #107 into master will decrease coverage by 2.07%.
The diff coverage is 82.92%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #107      +/-   ##
=========================================
- Coverage   67.67%   65.6%   -2.08%     
=========================================
  Files          38      36       -2     
  Lines         758     721      -37     
  Branches      142     140       -2     
=========================================
- Hits          513     473      -40     
- Misses        194     199       +5     
+ Partials       51      49       -2
Impacted Files Coverage Δ
...rchivesunleashed/spark/matchbox/ExtractGraph.scala 94.44% <ø> (ø) ⬆️
...ala/io/archivesunleashed/spark/rdd/RecordRDD.scala 70.37% <ø> (ø) ⬆️
...rchivesunleashed/spark/matchbox/RecordLoader.scala 70% <100%> (-5%) ⬇️
...ivesunleashed/spark/archive/io/ArchiveRecord.scala 80.55% <80.55%> (ø)
...rchivesunleashed/mapreduce/WacWarcInputFormat.java 65.85% <0%> (-9.76%) ⬇️
...archivesunleashed/mapreduce/WacArcInputFormat.java 75.6% <0%> (-9.76%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 010fe24...177bfbe. Read the comment docs.

val mime = Set ("text/plain", "image/jpeg")
val r = base.filter(x => !mime.contains(x.getMimeType)).collect()
//val r = base.filter(x => !mime.contains(x.getMimeType)).collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MapleOx this need to be removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruebot yes, you're right -- I'll delete that commented line.

@ruebot
Copy link
Member

ruebot commented Oct 28, 2017

ArcRecords are serializable but GenericArchiveRecords are not

Oof. I should have caught that. Nice work @MapleOx.

@ruebot
Copy link
Member

ruebot commented Oct 29, 2017

I'm good with this, but I'll wait to hear from @greebie before I merge since he created these tickets, and has a lot invested in testing.

@greebie
Copy link
Contributor

greebie commented Oct 29, 2017

LGTM - thanks @MapleOx !

@ruebot ruebot merged commit 5e99d63 into master Oct 29, 2017
@ruebot ruebot deleted the issue-102-c branch October 29, 2017 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants