BUG: max rows in read_csv #14131

Open
jangorecki opened this issue Aug 31, 2016 · 22 comments
Labels
Bug, IO CSV (read_csv, to_csv)

Comments

@jangorecki

jangorecki commented Aug 31, 2016

I have a 5e9-row csv file that I'm trying to load with pandas, which appears to load it only partially and produces a data frame with 2852516352 rows.

#Python 2.7.3 (default, Jun 22 2015, 19:33:41) 
#[GCC 4.6.3] on linux2
#Type "help", "copyright", "credits" or "license" for more information.
import pandas as pd
import os
import gc
import timeit
x = pd.read_csv("X5e9_2c.csv")
x.head()
#          KEY          X2
#0  2632858426 -4008534609
#1  3176486913 -3394302982
#2  3235589527 -4982919314
#3  1017229071  3793469468
#4  1039519144  2740983791
gc.collect()
#21
t_start = timeit.default_timer()
ans = x.sort_values('KEY')
print ans.shape[0]
#2852516352
t = timeit.default_timer() - t_start
x.shape[0]
#2852516352
t
#6073.774994134903

Expected behaviour would be either a properly read csv or an error; that would have saved ~100 minutes of processing in my case.

Output of my pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.0.3
setuptools: 17.0
Cython: None
numpy: 1.11.1
scipy: 0.16.0
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.7.3
boto: 2.38.0
pandas_datareader: None
@jreback
Contributor

jreback commented Aug 31, 2016

what is the error?

@jangorecki
Author

There is no error at all. I get a 2852516352-row data frame, where I should get a 5e9-row data frame.

@jreback
Contributor

jreback commented Aug 31, 2016

how do you know it contains 5e9 rows?

how does it parse a small chunk (pass nrows=10)?

I assume you have ~200 GB available (you generally need about 2x the memory to process), assuming 2 int64s per row.
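
For reference, a rough back-of-the-envelope size check (illustrative only, assuming the 2 int64 columns shown in the head() output above):

    # approximate size of the final DataFrame: 5e9 rows x 2 int64 columns x 8 bytes each
    rows, cols, itemsize = int(5e9), 2, 8
    print(rows * cols * itemsize / 2.0**30)   # ~74.5 GiB for the data alone
    # with the ~2x working-memory rule of thumb above, roughly 150 GiB during processing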

@jangorecki
Author

I was able to load that file in other tools and they produce a 5e9-row dataset.
I have 256GB of memory; the csv file is 103GB.
Here is nrows=10

x = pd.read_csv("X5e9_2c.csv", nrows=10)
x
#          KEY          X2
#0  2632858426 -4008534609
#1  3176486913 -3394302982
#2  3235589527 -4982919314
#3  1017229071  3793469468
#4  1039519144  2740983791
#5  2504311956  -364044414
#6    43002703 -1822527251
#7  2984147242 -1297084189
#8  2604109368 -2965381672
#9   178979971 -4855058881

@jreback
Contributor

jreback commented Aug 31, 2016

looks like an int32 overflow if I have to guess

@jreback
Contributor

jreback commented Aug 31, 2016

show .dtypes

@jreback
Contributor

jreback commented Aug 31, 2016

cc @gfyoung

@jangorecki
Author

x.dtypes
#KEY    int64
#X2     int64
#dtype: object

@jreback
Contributor

jreback commented Aug 31, 2016

In [4]: int(5e9)-np.iinfo(np.int32).max
Out[4]: 2852516353

so seems suspicious.

@jreback
Contributor

jreback commented Aug 31, 2016

as a work-around, I suspect this would work via chunks just fine, e.g.

pd.concat(pd.read_csv(.....chunksize=1e9))
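
A runnable sketch of that workaround (illustrative only; it reuses the file name from the report, chunksize must be an integer, and pd.concat still needs enough memory to hold the full result):

    import pandas as pd

    # read the csv in 1e9-row chunks, then stitch the pieces back into one DataFrame
    reader = pd.read_csv("X5e9_2c.csv", chunksize=int(1e9))
    x = pd.concat(reader, ignore_index=True)

Note that, as reported further down in this thread, the chunked read can still hit the same overflow on affected versions.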

@jreback
Contributor

jreback commented Aug 31, 2016

if you would like to try something: https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L1341

almost everything is a C int, which is 32-bit, so when this hits its max it rolls over, I think (and becomes negative).

So a solution (and probably not a big deal) is to change int -> int64_t for any row counters.

It may be more appropriate to use long long (or size_t), but I don't really know what the idiom is in C.

If you want to give that a try, that would be great.
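
For illustration, the rollover described here can be reproduced with numpy's fixed-width scalars (an analogy only - the parser's counter is a plain C int, not a numpy scalar):

    import numpy as np

    counter = np.int32(np.iinfo(np.int32).max)   # 2147483647
    # adding 1 wraps a signed 32-bit value around to its minimum
    # (numpy may emit a RuntimeWarning about the overflow)
    print(counter + np.int32(1))                 # -2147483648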

jreback added the Bug and IO CSV (read_csv, to_csv) labels Aug 31, 2016
jreback changed the title from "pandas silently fails to properly read big csv" to "BUG: max rows in read_csv" Aug 31, 2016
@gfyoung
Member

gfyoung commented Sep 1, 2016

@jreback : I think your prognosis is correct. The int -> int64_t patch for all line-count-related things could be the fix in the C code. I'm a little confused since he's using a 64-bit machine, so I'm not sure why it would max out at np.iinfo(np.int32).max.

@jangorecki : Does replacing int with int64_t do the trick? And out of curiosity, what happens if you explicitly pass in engine='python'?

@jreback
Contributor

jreback commented Sep 1, 2016

IIRC int on a 64-bit machine is still generally a 32-bit int.
long might be 64-bit, but I am not sure that is guaranteed either - I think it is only required to be at least 32-bit.
Only long long (which is 64-bit at minimum, like int64_t) is guaranteed to be at least 64 bits.
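
A quick way to check the actual widths on a given platform (a small illustrative check using only the standard library; on LP64 Linux this prints 4 8 8, while 64-bit Windows prints 4 4 8):

    import ctypes

    # byte widths of the C integer types on the current platform
    print(ctypes.sizeof(ctypes.c_int),
          ctypes.sizeof(ctypes.c_long),
          ctypes.sizeof(ctypes.c_longlong))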

@jreback
Contributor

jreback commented Sep 1, 2016

as an aside @jangorecki, you seem to be getting only ~1000 MB/min
this is pretty slow for IO throughput - are you reading off of NFS by any chance?

@gfyoung
Member

gfyoung commented Sep 1, 2016

@jreback : Perhaps...long is platform-dependent too IIRC. Regardless, I think we can agree that int64_t would ensure 64-bit.

@jangorecki
Author

@jreback concat will probably require twice as much memory, unless it can concat by reference? The timing shown in my code chunk did not refer to read_csv but to sort_values; reading was slow anyway, but I haven't measured it so I can't provide exact numbers. I can't check engine='python' right now, but your guess based on int(5e9)-np.iinfo(np.int32).max looks valid.

@frol

frol commented Sep 15, 2016

Just my two cents: you could use Dask DataFrames, which handle data in pandas DataFrame chunks and can parallelize operations and perform out-of-core computations.

@wesm
Member

wesm commented Sep 15, 2016

@frol this doesn't help if the user needs an in-memory pandas.DataFrame. Certain operations can indeed be performed out of core / in parallel with dask.dataframe, but it's a narrower scope.

long is 64-bit on most x86_64 Linux and macOS platforms but 32-bit on 64-bit Windows, and plain int is 32-bit on essentially all of them. We should make sure we are always using exact stdint.h types. Using int is OK only for things you know will never be arbitrarily large.

@kyleabeauchamp

I also recently hit this bug: I had a large (3e9-row) CSV that mysteriously ended up with the following shape after read_csv():

In [13]: x.shape
Out[13]: (2147483648, 1)

In [14]: np.iinfo(np.int32).max
Out[14]: 2147483647

@wesm
Member

wesm commented Feb 23, 2017

You're the big winner :)

@kyleabeauchamp

kyleabeauchamp commented Feb 23, 2017

FWIW, the naive chunked approach doesn't seem to overcome the overflow. Notice in the following snippet that the final chunk is basically (np.iinfo(np.int32).max + 1) % chunksize:

    import numpy as np
    import pandas as pd

    # csv_filename: path to the (gzipped) single-column csv
    x = []
    for chunk in pd.read_csv(csv_filename, dtype="uint16", names=["KEY"], chunksize=1000000000):
        xi = chunk["KEY"].values
        print(len(xi))
        x.append(xi)
    x = np.concatenate(x)

Output:

1000000000
1000000000
147483648

FYI I'm using a gzipped csv in this analysis.
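
The arithmetic behind that observation checks out (just verifying the relationship quoted above):

    import numpy as np

    chunksize = 1000000000
    print((np.iinfo(np.int32).max + 1) % chunksize)   # 147483648, matching the final chunk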

@gfyoung
Member

gfyoung commented Jul 14, 2017

xref #16798 for a related problem

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022