Discussion: Idiom for loading DataFrames #231

Closed
lintool opened this issue May 21, 2018 · 6 comments

@lintool
Member

lintool commented May 21, 2018

In my original implementation I wrote a DataFrameLoader, but it seems to have rapidly fallen out of use... We should decide on the idiom we want for loading DataFrames.

Current implementation:

val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF()
// alternatively, extractValidPagesDF, extractHyperlinksDF, etc.

The downside is that the user has access to raw RDDs, since that is what loadArchives returns. This seems to be asking for trouble: users could end up mixing RDDs and DFs in unpredictable ways.

Another option would be to introduce a DF interface that does not give access to RDDs. Something like:

val df = DataFrameLoader.loadArchives("example.arc.gz", sc).images

The other nice feature is that we can have much shorter DF names like pages, links, images, image_links, etc. We don't need the DF suffix to disambiguate, because DataFrameLoader makes this clear. One more nice feature is the ability to selectively reduce scope down the road and hide RDDs from the user as we move completely over to DFs.

I'm leaning towards this design, but would be happy to hear opinions from others...
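To make the second option concrete, here is a minimal, dependency-free sketch of a loader that keeps the raw records private and exposes only short, named tabular views. It is an illustration under stated assumptions, not the project's implementation: ArchiveRecord, Table, LoadedArchive, and DataFrameLoaderSketch are hypothetical names, plain Scala collections stand in for the Spark RDD and DataFrame types, and loadArchives takes in-memory records rather than a path and SparkContext.

```scala
// Hypothetical stand-in for one archive record (would live in an RDD in Spark).
final case class ArchiveRecord(url: String, mimeType: String, links: Seq[String])

// Hypothetical stand-in for a DataFrame: column names plus rows of values.
final case class Table(columns: Seq[String], rows: Seq[Seq[String]])

// The raw records are private, so callers can only reach the named views
// below and cannot mix record-level and table-level operations.
final class LoadedArchive(private val records: Seq[ArchiveRecord]) {

  // HTML pages only.
  def pages: Table =
    Table(
      Seq("url", "mime_type"),
      records.filter(_.mimeType == "text/html").map(r => Seq(r.url, r.mimeType))
    )

  // One row per (source page, link target) pair.
  def links: Table =
    Table(
      Seq("src", "dest"),
      records.flatMap(r => r.links.map(dest => Seq(r.url, dest)))
    )

  // Records whose MIME type marks them as images.
  def images: Table =
    Table(
      Seq("url", "mime_type"),
      records.filter(_.mimeType.startsWith("image/")).map(r => Seq(r.url, r.mimeType))
    )
}

object DataFrameLoaderSketch {
  // Assumption: a real loader would read an archive file via Spark instead
  // of accepting records that are already in memory.
  def loadArchives(records: Seq[ArchiveRecord]): LoadedArchive =
    new LoadedArchive(records)

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      ArchiveRecord("http://example.org/", "text/html", Seq("http://example.org/a.png")),
      ArchiveRecord("http://example.org/a.png", "image/png", Seq.empty)
    )
    val images = loadArchives(sample).images
    println(images.columns.mkString(", "))
    images.rows.foreach(row => println(row.mkString(", ")))
  }
}
```

Because records is a private constructor field, an expression such as loadArchives(sample).records does not compile. That is the scope-reduction property described above: once the loader returns, only the tabular views are reachable.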

@ianmilligan1
Member

The other nice feature is that we can have much shorter DF names like pages, links, images, image_links, etc. We don't need the DF suffix to disambiguate, because DataFrameLoader makes this clear. One more nice feature is the ability to selectively reduce scope down the road and hide RDDs from the user as we move completely over to DFs.

I'm generally agnostic, but this pushes me into the camp of having a DF-specific interface. The second syntax example you gave, with the .images accessor, is very usable.

@ruebot
Member

ruebot commented May 22, 2018

Fine by me. I can see moving towards strict DataFrames helping out on the AUK side of things.

@jwli229
Contributor

jwli229 commented May 22, 2018

+1 for strict dataframes and hiding away RDDs

@ruebot
Member

ruebot commented Aug 21, 2019

I think #350 hits this, and/or resolves it. I'll leave that to @lintool

@lintool
Member Author

lintool commented Aug 21, 2019

👍

We can close this issue after #350 is merged.

@ianmilligan1
Member

Closed with e32ae17.
