[R] CSV parser got out of sync with chunker #39857

Closed
larry77 opened this issue Jan 30, 2024 · 9 comments · Fixed by #39892

@larry77

larry77 commented Jan 30, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Hello,
Unfortunately the example involves a large dataset: according to my tests, the error only appears once the number of lines read goes above 1.6 million.

The data can be downloaded as a compressed file from the link below (nothing dangerous in it):

https://e.pcloud.link/publink/show?code=XZqHIeZokLxWCpx940hw3y45fsKqJPAVK0X

Using a script I have had for quite some time, I want to open the TSV (tab-separated) file I get when I decompress the archive and then save it as a Parquet file without holding it entirely in memory.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

data <- open_dataset("export.tsv",
  format = "tsv",
  skip_rows = 1, 
  schema = schema(
    AID_MEASURE_ID = string(), 
    DATE_CREATED = string(), 
    DATE_GRANTED = string(), 
    AA_PUBLISHED_DATE = string(), 
    SERVER_REF = string(), 
    AM_TITLE = string(), 
    AM_TITLE_EN = string(), 
    STATUS = string(), 
    AM_PROC_TYPE_CD = string(), 
    COFINANCE = string(), 
    OBJECTIVE = string(), 
    OTHER_OBJECTIVE_EN = string(), 
    AID_INSTRUMENT = string(), 
    OTHER_AID_INSTRUMENT_EN = string(), 
    BENEFICIARY_NAME = string(), 
    BENEFICIARY_NAME_ENGLISH = string(), 
    BENEFICIARY_NATIONAL_ID = string(), 
    BENEFICIARY_NAT_ID_TYPE_SD = string(), 
    BENEFICIARY_TYPE_SD = string(), 
    COUNTRY_SD = string(), 
    REGION_SD = string(), 
    SECTOR_SD = string(), 
    GRANTED_AMOUNT_FROM_EUR = double(), 
    NOMINAL_AMOUNT_EUR_FROM = double(), 
    GRANT_RANGE = string(),
    GRANTED_AMOUNT_RANGE_DESC = string(),
    GRANTING_AUTHORITY_NAME = string(), 
    GRANTING_AUTHORITY_NAME_EN = string(), 
    NUTS_CD = string(), 
    GRANTING_AUTHORITY_COUNTRY = string()
  )
)

write_dataset(
  data,
  format = "parquet",
  path = ".",
  max_rows_per_file = 1e7
)
#> Error: Invalid: CSV parser got out of sync with chunker
    

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] arrow_14.0.0.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.4       cli_3.6.1         knitr_1.45        rlang_1.1.2      
#>  [5] xfun_0.41         purrr_1.0.2       styler_1.10.2     generics_0.1.3   
#>  [9] assertthat_0.2.1  glue_1.6.2        bit_4.0.5         htmltools_0.5.7  
#> [13] fansi_1.0.5       rmarkdown_2.25    R.cache_0.16.0    tibble_3.2.1     
#> [17] evaluate_0.23     fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.4  
#> [21] compiler_4.3.2    dplyr_1.1.3       fs_1.6.3          pkgconfig_2.0.3  
#> [25] R.oo_1.25.0       R.utils_2.12.2    digest_0.6.33     R6_2.5.1         
#> [29] utf8_1.2.4        reprex_2.0.2      tidyselect_1.2.0  pillar_1.9.0     
#> [33] magrittr_2.0.3    R.methodsS3_1.8.2 tools_4.3.2       withr_2.5.2      
#> [37] bit64_4.0.5

Created on 2024-01-30 with reprex v2.0.2

Any idea of what the issue may be? Thanks!

Component(s)

R

@thisisnic
Member

Thanks for reporting this, @larry77!
Can you confirm which version of the R package you're using? And have you used this code+data with an earlier version and it worked, or is this the first time you're running this?

@larry77
Author

larry77 commented Jan 31, 2024

Hello!
As per the reprex, I use arrow 14.0.0.2. All I can say is that I used an earlier version of the R arrow package on a slightly shorter dataset (same structure) and it worked. The present version also works on the shorter dataset. However, the present dataset is not pathological: I can read it with read_csv from readr without problems. It seems the problem arose once the dataset grew beyond 1.6 million lines.

@thisisnic
Member

@pitrou I took a look at the C++ code that raises this error, but couldn't quite figure out what had happened here - do you know what it might be?

@pitrou
Member

pitrou commented Feb 1, 2024

Hmm, I can reproduce this using PyArrow; I'll see whether I can diagnose it further.

Note, however, that reading this data file will require setting newlines_in_values, because some cell values span multiple lines.
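
For the R reprex above, the same option should be passable when opening the dataset. A rough, untested sketch: CsvParseOptions$create() and its newlines_in_values argument are part of the arrow R API, but exactly how these options are forwarded through open_dataset() may differ between arrow versions.

library(arrow)

# sch is the schema() object from the reprex above
data <- open_dataset(
  "export.tsv",
  format = "csv",      # "csv" with an explicit tab delimiter set below
  schema = sch,
  skip_rows = 1,
  parse_options = CsvParseOptions$create(
    delimiter = "\t",             # tab-separated input
    newlines_in_values = TRUE     # allow newlines inside quoted values
  )
)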

@pitrou
Member

pitrou commented Feb 1, 2024

Ok, the error message is weird, but it is really a consequence of having newlines in values.
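
To make that concrete: the default chunker delimits blocks at newlines, so a block boundary can land on a newline that sits inside a quoted value, and the parser then sees a truncated field. Below is a minimal sketch of the effect (editor's illustration with invented input and block size, using only documented pyarrow.csv options; whether the error actually fires depends on where the block boundaries land):

import io

import pyarrow.csv as pv

# Many short rows whose quoted second field contains a newline.
data = b'a,b\n' + b'1,"x\ny"\n' * 1000

# A tiny block size makes it likely that some block boundary falls on a
# newline inside a quoted value.
read_opts = pv.ReadOptions(block_size=61)

try:
    pv.read_csv(io.BytesIO(data), read_options=read_opts)
except Exception as exc:
    print(exc)  # expect the "got out of sync with chunker" error

# Declaring that quoted values may contain newlines lets the chunker
# delimit blocks correctly; the same input then reads fine.
table = pv.read_csv(
    io.BytesIO(data),
    read_options=read_opts,
    parse_options=pv.ParseOptions(newlines_in_values=True),
)
print(table.num_rows)  # 1000 data rows under header columns a, b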

@pitrou
Member

pitrou commented Feb 1, 2024

I'll put up a PR to improve the error message.

pitrou added a commit to pitrou/arrow that referenced this issue Feb 1, 2024
pitrou added a commit to pitrou/arrow that referenced this issue Feb 1, 2024
@pitrou
Member

pitrou commented Feb 1, 2024

Note that, once you enable the newlines_in_values option, reading the CSV file should be successful. For example with PyArrow:
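
(The snippet itself did not survive the page capture. Below is a minimal reconstruction of such a call, assuming a tab delimiter and printing via pandas, both inferred from the output that follows:)

import pyarrow.csv as pv

table = pv.read_csv(
    "export.tsv",
    parse_options=pv.ParseOptions(delimiter="\t", newlines_in_values=True),
)
print(table.to_pandas())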

        AID_MEASURE_ID DATE_CREATED DATE_GRANTED  ...                         GRANTING_AUTHORITY_NAME_EN NUTS_CD GRANTING_AUTHORITY_COUNTRY
0             SA.42315     16/09/16     30/08/16  ...                     Ministry of Industry and Trade                            Czechia
1             SA.42315     16/09/16     26/08/16  ...                     Ministry of Industry and Trade                            Czechia
2             SA.42328     19/09/16     16/08/16  ...  Ministry of Industry and Trade, Department of ...                            Czechia
3             SA.41602     21/09/16     01/07/16  ...                                              VLAIO                            Belgium
4             SA.41602     26/09/16     15/07/16  ...                                              VLAIO                            Belgium
...                ...          ...          ...  ...                                                ...     ...                        ...
1677781      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677782      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677783      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677784      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677785      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany

[1677786 rows x 30 columns]

@larry77
Author

larry77 commented Feb 1, 2024 via email

@kou changed the title from "[R]: CSV parser got out of sync with chunker" to "[R] CSV parser got out of sync with chunker" Feb 2, 2024
pitrou added a commit to pitrou/arrow that referenced this issue Feb 5, 2024
pitrou added a commit to pitrou/arrow that referenced this issue Feb 5, 2024
pitrou added a commit to pitrou/arrow that referenced this issue Feb 6, 2024
pitrou added a commit that referenced this issue Feb 6, 2024
…tion (#39892)

### Rationale for this change

When writing the CSV reader, we thought that the parser not finding the same line limits as the chunker should never happen, hence the terse "chunker out of sync" error message.

It turns out that, if the input contains multiline cell values and the `newlines_in_values` option was not enabled, the chunker can happily delimit a block on a newline that's inside a quoted string. The parser will then see truncated data and will stop parsing, yielding a parsed size that's smaller than the first block (see added comment in the code).

### What changes are included in this PR?

* Add some parser tests that showcase the condition encountered in GH-39857
* Improve error message to guide users towards the solution

### Are these changes tested?

There's no functional change; the error message itself isn't tested.

### Are there any user-facing changes?

No.

* Closes: #39857

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou added this to the 16.0.0 milestone Feb 6, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
… condition (apache#39892)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
… condition (apache#39892)

thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
… condition (apache#39892)

@dmontecino

Hey, thanks for this awesome package. Any news on this error?
