restrict columns to read for pandas.read_parquet #18155

Merged
merged 13 commits into from
Nov 8, 2017

Conversation

hoffmann
Contributor

@hoffmann hoffmann commented Nov 7, 2017

@gfyoung gfyoung added the Enhancement, IO CSV (read_csv, to_csv) and IO Parquet (parquet, feather) labels and removed the IO CSV (read_csv, to_csv) label Nov 7, 2017
@@ -282,6 +282,17 @@ def test_compression(self, engine, compression):
         df = pd.DataFrame({'A': [1, 2, 3]})
         self.check_round_trip(df, engine, compression=compression)

+    def test_read_columns(self, engine, fp):
+        df = pd.DataFrame({'string': list('abc'),
Member

Reference issue number above.

@@ -188,6 +188,8 @@ def read_parquet(path, engine='auto', **kwargs):
     ----------
     path : string
         File path
+    columns: list
Member

Write out what the default is too, i.e. "list, default None".

@gfyoung
Member

gfyoung commented Nov 7, 2017

Will need a whatsnew note in 0.22.0

@pep8speaks

pep8speaks commented Nov 7, 2017

Hello @hoffmann! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 08, 2017 at 13:04 Hours UTC

Contributor

@jreback jreback left a comment

A few doc comments; otherwise LGTM.

@@ -109,7 +109,7 @@ I/O
^^^

- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
-
- :func:`read_parquet` now allows to specify the columns to read from a parquet file (:issue:`18154`)
Contributor

You can put this in 0.21.1.

@@ -188,6 +188,8 @@ def read_parquet(path, engine='auto', **kwargs):
     ----------
     path : string
         File path
+    columns: list, default=None
Contributor

Add a versionadded tag.

Contributor Author

done

@@ -109,7 +109,7 @@ I/O
^^^

- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
Contributor

Can you add a small example to the docs in io.rst as well?

Contributor Author

Done 21c5f5e

@codecov

codecov bot commented Nov 8, 2017

Codecov Report

Merging #18155 into master will increase coverage by <.01%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master   #18155      +/-   ##
==========================================
+ Coverage   91.41%   91.41%   +<.01%     
==========================================
  Files         163      163              
  Lines       50132    50132              
==========================================
+ Hits        45827    45830       +3     
+ Misses       4305     4302       -3
Flag        Coverage Δ
#multiple   89.23% <83.33%> (+0.02%) ⬆️
#single     40.33% <50%> (-0.06%) ⬇️

Impacted Files                   Coverage Δ
pandas/io/parquet.py             65.38% <83.33%> (ø) ⬆️
pandas/io/gbq.py                 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py             97.8% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py    65.2% <0%> (+1.81%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 488db6f...21c5f5e.

@codecov

codecov bot commented Nov 8, 2017

Codecov Report

Merging #18155 into master will decrease coverage by <.01%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master   #18155      +/-   ##
==========================================
- Coverage   91.41%    91.4%   -0.01%     
==========================================
  Files         163      163              
  Lines       50132    50068      -64     
==========================================
- Hits        45827    45767      -60     
+ Misses       4305     4301       -4
Flag        Coverage Δ
#multiple   89.21% <83.33%> (+0.01%) ⬆️
#single     40.33% <50%> (-0.06%) ⬇️

Impacted Files                        Coverage Δ
pandas/io/parquet.py                  65.38% <83.33%> (ø) ⬆️
pandas/io/gbq.py                      25% <0%> (-58.34%) ⬇️
pandas/core/frame.py                  97.8% <0%> (-0.1%) ⬇️
pandas/tseries/offsets.py             97.11% <0%> (-0.05%) ⬇️
pandas/core/indexes/datetimes.py      95.48% <0%> (-0.04%) ⬇️
pandas/core/indexes/timedeltas.py     91.17% <0%> (-0.02%) ⬇️
pandas/core/nanops.py                 96.67% <0%> (ø) ⬆️
pandas/core/indexes/datetimelike.py   97.11% <0%> (ø) ⬆️
pandas/core/indexes/period.py         92.89% <0%> (+0.01%) ⬆️
pandas/core/tools/timedeltas.py       98.41% <0%> (+0.02%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 488db6f...4b22c88.

         path, _, _ = get_filepath_or_buffer(path)
-        return self.api.parquet.read_table(path).to_pandas()
+        return self.api.parquet.read_table(path, columns).to_pandas()
Contributor

pass columns as a kwarg to read_table and to_pandas
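
To illustrate the request above, here is a minimal sketch of why forwarding `columns` by keyword is more robust than forwarding it by position. `FakeParquetModule`, its `read_table` signature, and `PyArrowImplSketch` are hypothetical stand-ins invented for this example; they are not pandas or pyarrow code.

```python
class FakeParquetModule:
    """Hypothetical stand-in for an engine module (not the pyarrow API)."""

    def read_table(self, source, nthreads=1, columns=None):
        # Deliberately puts another parameter before `columns` to show
        # what a positional call would get wrong.
        return {"source": source, "nthreads": nthreads, "columns": columns}


class PyArrowImplSketch:
    """Hypothetical wrapper mirroring the shape of the reviewed read()."""

    def __init__(self, api):
        self.api = api

    def read_positional(self, path, columns=None):
        # Positional: binds `columns` to whatever parameter comes second.
        return self.api.read_table(path, columns)

    def read_keyword(self, path, columns=None):
        # Keyword: independent of the engine's parameter order.
        return self.api.read_table(path, columns=columns)


impl = PyArrowImplSketch(FakeParquetModule())
positional = impl.read_positional("data.parquet", ["a", "b"])
keyword = impl.read_keyword("data.parquet", ["a", "b"])
```

With this stand-in, the positional call ends up passing the column list as `nthreads`, while the keyword call reaches `columns` as intended.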

@@ -188,6 +188,9 @@ def read_parquet(path, engine='auto', **kwargs):
     ----------
     path : string
         File path
+    columns: list, default=None
+        If not None, only these columns will be read from the file.
+        .. versionadded 0.21.1
Contributor

I think you need a blank line before the versionadded tag.

@@ -201,4 +204,4 @@ def read_parquet(path, engine='auto', **kwargs):
     """

     impl = get_engine(engine)
-    return impl.read(path)
+    return impl.read(path, columns)
Contributor

same

+        df = pd.DataFrame({'string': list('abc'),
+                           'int': list(range(1, 4))})
+
+        with tm.ensure_clean() as path:
Contributor

You don't need the fp argument here: engine already cycles through both engines.
Use check_round_trip and pass in the expected frame (and the columns kwarg).

Contributor

@jreback jreback left a comment

see comments

@jreback
Contributor

jreback commented Nov 8, 2017

you have a linting issue

         path, _, _ = get_filepath_or_buffer(path)
-        return self.api.parquet.read_table(path).to_pandas()
+        return self.api.parquet.read_table(path, columns=columns).to_pandas()
Contributor

I'd like to pass through kwargs as well. These won't be specifically named args; just pass them through to the engine to validate,
for both fp and pyarrow.
This could just be a simple test with row_groups.

Contributor Author

@hoffmann hoffmann Nov 8, 2017

OK, I think it is good to pass explicit options like columns, which are supported by both backends, and also to pass the kwargs so that additional engine-specific kwargs can be provided.

Have to look at the test case.
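
The design discussed here (explicit options shared by both backends, plus pass-through of engine-specific kwargs) could look roughly like the sketch below. `read_sketch` and the two fake engine functions are hypothetical stand-ins; only the `columns` and `row_groups` names come from this thread, and no claim is made about which real engine accepts which argument.

```python
def fake_engine_a(path, columns=None, row_groups=None):
    """Hypothetical engine that understands an extra row_groups option."""
    return {"engine": "a", "columns": columns, "row_groups": row_groups}


def fake_engine_b(path, columns=None):
    """Hypothetical engine that only understands columns."""
    return {"engine": "b", "columns": columns}


def read_sketch(path, engine_read, columns=None, **kwargs):
    # `columns` is explicit because every engine is assumed to support it.
    # Everything else is forwarded untouched; the engine validates it.
    return engine_read(path, columns=columns, **kwargs)


accepted = read_sketch(
    "data.parquet", fake_engine_a, columns=["a"], row_groups=[0]
)

try:
    read_sketch("data.parquet", fake_engine_b, columns=["a"], row_groups=[0])
    rejected = False
except TypeError:
    # The engine, not the wrapper, rejects the unknown keyword argument.
    rejected = True
```

The wrapper stays thin under this design: it does not need to know each engine's option names, at the cost of error messages that come from the engine rather than from the wrapper.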

Contributor

OK, that's fine.
I really want row_group support :) (next PR!)
Also, if you want: #17102

@jreback jreback added this to the 0.21.1 milestone Nov 8, 2017
@jreback
Contributor

jreback commented Nov 8, 2017

ping on green

@hoffmann
Contributor Author

hoffmann commented Nov 8, 2017

@jreback green.

If it's OK, I'd like to make the change to accept **kwargs in the read() function in a separate pull request, because it will require rewriting https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py#L191 so that it can handle **kwargs for to_parquet and read_parquet at the same time.

@jreback
Contributor

jreback commented Nov 8, 2017

If it's OK, I'd like to make the change to accept **kwargs in the read() function in a separate pull request, because it will require rewriting https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py#L191 so that it can handle **kwargs for to_parquet and read_parquet at the same time.

yep ok by me

@@ -4538,6 +4538,16 @@ Read from a parquet file.

result.dtypes

Read only certain columns of a parquet file.

.. ipython:: python
Contributor

In the next PR, can you add a versionadded tag here (for 0.21.1)?

@jreback jreback merged commit 5128fe6 into pandas-dev:master Nov 8, 2017
@jreback
Contributor

jreback commented Nov 8, 2017

thanks!

watercrossing pushed a commit to watercrossing/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
Development

Successfully merging this pull request may close these issues.

Enable to restrict columns for pandas.read_parquet
4 participants