Add binary extraction DataFrames to PySpark. #350

ruebot · 2019-08-20T17:26:52Z

GitHub issue(s):

What does this Pull Request do?

Add binary extraction DataFrames to PySpark.

How should this be tested?

- TravisCI
- Fire up a Jupyter Notebook:

$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ~/bin/spark-2.4.3-bin-hadoop2.7/bin/pyspark --jars /home/nruest/Projects/au/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar --driver-class-path /home/nruest/Projects/au/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar --py-files /home/nruest/Projects/au/aut/target/aut.zip

Then do this for each of the additions, and make sure it works:

from aut import *

archive = WebArchive(sc, sqlContext, "/home/nruest/Projects/au/aut/src/test/resources/warc/")

df = archive.spreadsheets()
df.printSchema()
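The "do this for each of the additions" step can be scripted rather than repeated by hand. Below is a minimal sketch, not part of this PR: `check_extractors` is a hypothetical helper, and the only method name confirmed above is `spreadsheets`; any other names passed in must be taken from `src/main/python/aut/common.py`.

```python
def check_extractors(archive, method_names):
    """Call each named DataFrame method on `archive` and print its schema.

    `archive` is expected to be a WebArchive-like object whose methods
    return DataFrames. Returns the list of method names that raised,
    so an empty list means every call worked.
    """
    failed = []
    for name in method_names:
        try:
            # Look the method up by name, call it, and show the schema.
            df = getattr(archive, name)()
            print(name)
            df.printSchema()
        except Exception as exc:
            # Report the failure and continue with the next method.
            print(f"{name}: FAILED ({exc!r})")
            failed.append(name)
    return failed
```

In the notebook this would be called as `check_extractors(archive, ["spreadsheets"])`, with the list extended to cover the other methods this PR adds.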

Additional Notes:

Let's see what happens with the test coverage. I assume I'm going to have to add some things to src/test/scala/io/archivesunleashed/df/DataFrameLoaderTest.scala

- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307

codecov · 2019-08-20T17:43:31Z

Codecov Report

Merging #350 into master will decrease coverage by 0.76%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master     #350      +/-   ##
==========================================
- Coverage   75.52%   74.76%   -0.77%     
==========================================
  Files          39       39              
  Lines        1373     1387      +14     
  Branches      265      265              
==========================================
  Hits         1037     1037              
- Misses        220      234      +14     
  Partials      116      116
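The drop follows directly from the table: 14 lines were added and none of them are hit by tests. Below is a small sketch of the arithmetic, assuming coverage is computed as hits divided by lines (an assumption that matches the figures above, not a statement about Codecov's internals).

```python
def coverage_percent(hits, lines):
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines

# Figures taken from the report above.
before = coverage_percent(1037, 1373)       # master
after = coverage_percent(1037, 1373 + 14)   # this PR: 14 new lines, 0 new hits
drop = before - after

print(f"before={before:.3f}% after={after:.3f}% drop={drop:.3f} points")
```

The results agree with the report to within the last displayed digit; the small difference looks like truncation versus rounding in the report.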

ruebot · 2019-08-20T19:49:04Z

Here's an even better test: https://github.com/archivesunleashed/aut/wiki/Using-AUT-with-PySpark

(Just swap out 0.18.0 with the path to aut-0.17.1-SNAPSHOT-fatjar.jar and aut.zip)

src/main/python/aut/common.py

lintool · 2019-08-21T08:35:54Z

Comments about naming df's if it's not too late...

ianmilligan1

Ran through the PySpark documentation with the Jupyter command in this PR. All worked perfectly! 🎉

Add binary extraction DataFrames to PySpark.

1176fd5

- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307

ruebot requested a review from ianmilligan1 August 20, 2019 17:26

README updates

88ab028

ruebot mentioned this pull request Aug 21, 2019

Discussion: Idiom for loading DataFrames #231

Closed

lintool reviewed Aug 21, 2019

src/main/python/aut/common.py

Merge branch 'master' into images-pyspark

3b0a376

ianmilligan1 approved these changes Aug 21, 2019

ianmilligan1 merged commit eda185b into master Aug 21, 2019

ianmilligan1 deleted the images-pyspark branch August 21, 2019 13:14
