BUG: max rows in read_csv #14131

Open
jangorecki opened this issue Aug 31, 2016 · 22 comments
Labels
Bug, IO CSV (read_csv, to_csv)

Comments

@jangorecki

jangorecki commented Aug 31, 2016

I have a 5e9-row csv file that I'm trying to load with pandas, which appears to load it only partially and produces a data frame with 2852516352 rows.

#Python 2.7.3 (default, Jun 22 2015, 19:33:41) 
#[GCC 4.6.3] on linux2
#Type "help", "copyright", "credits" or "license" for more information.
import pandas as pd
import os
import gc
import timeit
x = pd.read_csv("X5e9_2c.csv")
x.head()
#          KEY          X2
#0  2632858426 -4008534609
#1  3176486913 -3394302982
#2  3235589527 -4982919314
#3  1017229071  3793469468
#4  1039519144  2740983791
gc.collect()
#21
t_start = timeit.default_timer()
ans = x.sort_values('KEY')
print ans.shape[0]
#2852516352
t = timeit.default_timer() - t_start
x.shape[0]
#2852516352
t
#6073.774994134903

Expected behaviour would be either a properly read csv or an error; that would have saved ~100 minutes of processing in my case.

Output of my pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.0.3
setuptools: 17.0
Cython: None
numpy: 1.11.1
scipy: 0.16.0
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.7.3
boto: 2.38.0
pandas_datareader: None
@jreback
Contributor

jreback commented Aug 31, 2016

what is the error?

@jangorecki
Author

There is no error at all. I get a 2852516352-row data frame, where I should get a 5e9-row data frame.

@jreback
Contributor

jreback commented Aug 31, 2016

how do you know it contains 5e9 rows?

how does it parse a small chunk (pass nrows=10)?

I assume you have ~200 GB available (you generally need about 2x the memory to process), assuming 2 int64s per row.
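
For reference, a rough back-of-the-envelope size check (illustrative only, assuming the 2 int64 columns shown in the head() output above):

    # approximate size of the final DataFrame: 5e9 rows x 2 int64 columns x 8 bytes each
    rows, cols, itemsize = int(5e9), 2, 8
    print(rows * cols * itemsize / 2.0**30)   # ~74.5 GiB for the data alone
    # with the ~2x working-memory rule of thumb above, roughly 150 GiB during processing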

@jangorecki
Author

I was able to load that file in other tools and they produce a 5e9-row dataset.
I have 256GB of memory; the csv file is 103GB.
Here is nrows=10

x = pd.read_csv("X5e9_2c.csv", nrows=10)
x
#          KEY          X2
#0  2632858426 -4008534609
#1  3176486913 -3394302982
#2  3235589527 -4982919314
#3  1017229071  3793469468
#4  1039519144  2740983791
#5  2504311956  -364044414
#6    43002703 -1822527251
#7  2984147242 -1297084189
#8  2604109368 -2965381672
#9   178979971 -4855058881

@jreback
Contributor

jreback commented Aug 31, 2016

looks like an int32 overflow if I have to guess

@jreback
Contributor

jreback commented Aug 31, 2016

show .dtypes

@jreback
Contributor

jreback commented Aug 31, 2016

cc @gfyoung

@jangorecki
Author

x.dtypes
#KEY    int64
#X2     int64
#dtype: object

@jreback
Contributor

jreback commented Aug 31, 2016

In [4]: int(5e9)-np.iinfo(np.int32).max
Out[4]: 2852516353

so seems suspicious.

@jreback
Contributor

jreback commented Aug 31, 2016

as a work-around, I suspect this would work via chunks just fine, e.g.

pd.concat(pd.read_csv(.....chunksize=1e9))
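
A runnable sketch of that workaround (illustrative only; it reuses the file name from the report, chunksize must be an integer, and pd.concat still needs enough memory to hold the full result):

    import pandas as pd

    # read the csv in 1e9-row chunks, then stitch the pieces back into one DataFrame
    reader = pd.read_csv("X5e9_2c.csv", chunksize=int(1e9))
    x = pd.concat(reader, ignore_index=True)

Note that, as reported further down in this thread, the chunked read can still hit the same overflow on affected versions.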

@jreback
Contributor

jreback commented Aug 31, 2016

if you would like to try something: https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L1341

almost everything is a C int, which is 32-bit, so when this hits its max it rolls over, I think (and becomes negative).

So a solution (and probably not a big deal) is to change int -> int64_t for any row counters.

It may be more appropriate to use long long (or size_t), but I don't really know what the idiom is in C.

If you want to give that a try, that would be great.
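
For illustration, the rollover described here can be reproduced with numpy's fixed-width scalars (an analogy only - the parser's counter is a plain C int, not a numpy scalar):

    import numpy as np

    counter = np.int32(np.iinfo(np.int32).max)   # 2147483647
    # adding 1 wraps a signed 32-bit value around to its minimum
    # (numpy may emit a RuntimeWarning about the overflow)
    print(counter + np.int32(1))                 # -2147483648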

jreback added the Bug and IO CSV (read_csv, to_csv) labels Aug 31, 2016
jreback changed the title from "pandas silently fails to properly read big csv" to "BUG: max rows in read_csv" Aug 31, 2016
@gfyoung
Member

gfyoung commented Sep 1, 2016

@jreback : I think your prognosis is correct. The int -> int64_t patch for all line-count-related things could be the fix in the C code. I'm a little confused since he's using a 64-bit machine, so I'm not sure why it would max out at np.iinfo(np.int32).max.

@jangorecki : Does replacing int with int64_t do the trick? And out of curiosity, what happens if you explicitly pass in engine='python'?

@jreback
Contributor

jreback commented Sep 1, 2016

IIRC int on a 64-bit machine is still generally a 32-bit int.
long might be 64-bit, but I am not sure that is guaranteed either - I think it is only required to be at least 32-bit.
Only long long (which is 64-bit at minimum, like int64_t) is guaranteed to be at least 64 bits.
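
A quick way to check the actual widths on a given platform (a small illustrative check using only the standard library; on LP64 Linux this prints 4 8 8, while 64-bit Windows prints 4 4 8):

    import ctypes

    # byte widths of the C integer types on the current platform
    print(ctypes.sizeof(ctypes.c_int),
          ctypes.sizeof(ctypes.c_long),
          ctypes.sizeof(ctypes.c_longlong))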

@jreback
Contributor

jreback commented Sep 1, 2016

as an aside @jangorecki, you seem to be getting only ~1000 MB/min
this is pretty slow for IO throughput - are you reading off of NFS by any chance?

@gfyoung
Member

gfyoung commented Sep 1, 2016

@jreback : Perhaps...long is platform-dependent too IIRC. Regardless, I think we can agree that int64_t would ensure 64-bit.

@jangorecki
Author

@jreback concat will probably require twice as much memory, unless it can concat by reference? The timing shown in my code chunk did not refer to read_csv but to sort_values; reading was slow anyway, but I haven't measured it so I can't provide exact numbers. I can't check engine='python' right now, but your guess based on int(5e9)-np.iinfo(np.int32).max looks valid.

@frol

frol commented Sep 15, 2016

Just my two cents: you could use Dask DataFrames, which handle data in pandas DataFrame chunks and can parallelize operations and perform out-of-core computations.

@wesm
Member

wesm commented Sep 15, 2016

@frol this doesn't help if the user needs an in-memory pandas.DataFrame. Certain operations can indeed be performed out of core / in parallel with dask.dataframe, but it's a narrower scope.

long is 64-bit on most x86_64 Linux and macOS platforms but 32-bit on 64-bit Windows, and plain int is 32-bit on essentially all of them. We should make sure we are always using exact stdint.h types. Using int is OK only for things you know will never be arbitrarily large.

@kyleabeauchamp

I also recently hit this bug: I had a large (3e9-row) CSV that mysteriously ended up with the following shape after read_csv():

In [13]: x.shape
Out[13]: (2147483648, 1)

In [14]: np.iinfo(np.int32).max
Out[14]: 2147483647

@wesm
Member

wesm commented Feb 23, 2017

You're the big winner :)

@kyleabeauchamp

kyleabeauchamp commented Feb 23, 2017

FWIW, the naive chunked approach doesn't seem to overcome the overflow. Notice in the following snippet that the final chunk is basically (np.iinfo(np.int32).max + 1) % chunksize:

    import numpy as np
    import pandas as pd

    # csv_filename: path to the (gzipped) single-column csv
    x = []
    for chunk in pd.read_csv(csv_filename, dtype="uint16", names=["KEY"], chunksize=1000000000):
        xi = chunk["KEY"].values
        print(len(xi))
        x.append(xi)
    x = np.concatenate(x)

Output:

1000000000
1000000000
147483648

FYI I'm using a gzipped csv in this analysis.
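
The arithmetic behind that observation checks out (just verifying the relationship quoted above):

    import numpy as np

    chunksize = 1000000000
    print((np.iinfo(np.int32).max + 1) % chunksize)   # 147483648, matching the final chunk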

@gfyoung
Member

gfyoung commented Jul 14, 2017

xref #16798 for a related problem

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022