-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor ArchiveRecord classes; addresses #101 and #102 #107
Conversation
… call collect() on RDD[ArchiveRecord] --- ArcRecords and WarcRecords are serializable, but ArchiveRecords are not...
Codecov Report
@@ Coverage Diff @@
## master #107 +/- ##
=========================================
- Coverage 67.67% 65.6% -2.08%
=========================================
Files 38 36 -2
Lines 758 721 -37
Branches 142 140 -2
=========================================
- Hits 513 473 -40
- Misses 194 199 +5
+ Partials 51 49 -2
Continue to review full report at Codecov.
|
val mime = Set ("text/plain", "image/jpeg") | ||
val r = base.filter(x => !mime.contains(x.getMimeType)).collect() | ||
//val r = base.filter(x => !mime.contains(x.getMimeType)).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MapleOx this need to be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ruebot yes, you're right -- I'll delete that commented line.
Oof. I should have caught that. Nice work @MapleOx. |
I'm good with this, but I'll wait to hear from @greebie before I merge since he created these tickets, and has a lot invested in testing. |
LGTM - thanks @MapleOx ! |
GitHub issue(s):
What does this Pull Request do?
Refactors unneeded archive record classes in the Scala code.
ArcRecord
andWarcRecord
scala classesGenericArchiveRecord
toArchiveRecord
loadArc
andloadWarc
fromRecordLoader
keepValidPages
intoRecordLoader
(see Bake keepValidPages() into RecordLoader #101)How should this be tested?
TravisCI + code review
Additional Notes:
I got similar errors as @ruebot when changing
loadArc
toloadArchives
in some of the tests. I found that this was due to the fact thatArcRecord
s are serializable butGenericArchiveRecord
s are not, because of the two fieldsin
GenericArchiveRecord
which are not serializable.So, for example, calling
collect()
works on RDDs containingArcRecords
but not on RDDs containingGenericArchiveRecords
. I changed the offending tests so that they test for the same thing without callingcollect()
...Interested parties
@ruebot @greebie @lintool