Exposing Scala DataFrames in PySpark #214

lintool · 2018-05-01T14:45:26Z

What does this Pull Request do?

As the title suggests, this PR exposes DataFrames in Scala for use in PySpark. This is a cleanup of initial prototyping done at the Toronto Datathon in April 2018.

How should this be tested?

It's not. This is an experimental feature that is completely independent of existing AUT capabilities.

lintool · 2018-05-01T14:49:23Z

Adding a reference to #209; see the discussion there on exactly how to use this new PySpark feature.
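For readers without #209 open, here is a rough sketch of the general pattern for calling a Scala loader from PySpark through Py4J. It is not the code from this PR. The class name `ScalaDataFrameBridge`, the `load` method, the `wrap_df` parameter, and the assumption that `DataFrameLoader` takes the Scala SparkContext as its only constructor argument are all hypothetical; `sc._jvm`, `sc._jsc.sc()`, and `pyspark.sql.DataFrame(jdf, sql_ctx)` are PySpark 2.x internals.

```python
class ScalaDataFrameBridge:
    """Hypothetical wrapper: call a Scala loader via Py4J, return a PySpark DataFrame."""

    def __init__(self, sc, sql_context, wrap_df=None):
        self.sql_context = sql_context
        # wrap_df turns a JVM DataFrame handle into a Python DataFrame.
        # In a real session it defaults to pyspark.sql.DataFrame (see load()).
        self.wrap_df = wrap_df
        # sc._jvm is PySpark's Py4J view of the JVM; sc._jsc.sc() is the
        # underlying Scala SparkContext. The constructor signature of
        # DataFrameLoader is an assumption made for this sketch.
        self.loader = sc._jvm.io.archivesunleashed.DataFrameLoader(sc._jsc.sc())

    def load(self, method_name, path):
        # Look up the Scala method by name; no specific method names are assumed.
        jdf = getattr(self.loader, method_name)(path)
        wrap = self.wrap_df
        if wrap is None:
            # Imported lazily so the sketch can be read without Spark installed.
            from pyspark.sql import DataFrame
            wrap = DataFrame
        return wrap(jdf, self.sql_context)
```

In a PySpark shell this would be used roughly as `ScalaDataFrameBridge(sc, sqlContext).load(<method name>, <path>)`, with the AUT jar on the driver classpath; the actual method names are whatever `DataFrameLoader.scala` defines.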

codecov · 2018-05-01T14:55:50Z

Codecov Report

Merging #214 into master will decrease coverage by 0.6%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master     #214      +/-   ##
==========================================
- Coverage   66.76%   66.16%   -0.61%     
==========================================
  Files          33       34       +1     
  Lines         659      665       +6     
  Branches      124      124              
==========================================
  Hits          440      440              
- Misses        178      184       +6     
  Partials       41       41

Impacted Files	Coverage Δ
...n/scala/io/archivesunleashed/DataFrameLoader.scala	`0% <0%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ef76758...b384b66.

ianmilligan1

I've played with this and it works nicely – I think it's a great start to the PySpark functionality, and will be nice to have in the main repo as experimental functionality.

greebie · 2018-05-01T15:29:11Z

I'm going to try this out this afternoon. I plan to make it part of my CSDH presentation / paper about switching from Scala to Python.

ruebot · 2018-05-02T13:20:10Z

Tested with:

N.B. With Spark 2.1.1 and 2.2.1 I had to remove the line containing .enableHiveSupport() from python/pyspark/shell.py, as prescribed in #209. This was not necessary with 2.3.0.
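If you would rather script that edit than do it by hand, something like the following might work. It is only a sketch: the helper names `drop_hive_support` and `patch_shell` are made up here, and it assumes the `.enableHiveSupport()` call sits on its own continuation line in shell.py, so check the file in your Spark install before running it.

```python
from pathlib import Path


def drop_hive_support(source: str) -> str:
    """Return source with every line containing .enableHiveSupport() removed."""
    kept = [
        line
        for line in source.splitlines(keepends=True)
        if ".enableHiveSupport()" not in line
    ]
    return "".join(kept)


def patch_shell(path: str) -> None:
    """Rewrite the file at path in place, keeping a .bak copy of the original."""
    target = Path(path)
    original = target.read_text()
    # Keep a backup next to the original so the change is easy to undo.
    Path(str(target) + ".bak").write_text(original)
    target.write_text(drop_hive_support(original))
```

Note that a whole line is dropped, so if your copy of shell.py chains the call on a single line (builder, enableHiveSupport, and getOrCreate together), this would remove the entire statement; edit by hand in that case.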

lintool added 3 commits April 27, 2018 13:27

DataFrameLoader - provides bridge to PySpark.

aae9fc7

Initial python classes for aut.

0ae613c

Better packaging of Python modules.

b384b66

lintool mentioned this pull request May 1, 2018

Bringing Scala DataFrames into PySpark #209

Closed

ianmilligan1 requested a review from ruebot May 1, 2018 15:03

ianmilligan1 approved these changes May 1, 2018

ruebot approved these changes May 2, 2018

ruebot merged commit 505c47a into master May 2, 2018

ruebot deleted the df-pytorch branch May 2, 2018 13:43

This was referenced May 2, 2018

Prevent encoding errors in PySpark #122

Closed

Register Scala functions for use in Pyspark #148

Closed

PySpark performance bottlenecks: counting values #130

Closed

DataFrame discussion: open thread #190

Closed
