[R] Can not parse file #34291

Open
elgabbas opened this issue Feb 22, 2023 · 20 comments
Labels: Component: R, Type: usage (issue is a user question)

Comments

@elgabbas

elgabbas commented Feb 22, 2023

Hello,

I am trying to load a large CSV file (tab-delimited; 23 GB, 16M rows, 259 columns) using the arrow R package. I get this error early on while reading the file content:

Error in `read_delim_arrow()`: ! Invalid: CSV parse error: Row #834603: Expected 259 columns, got 322: 2417934775 DSS00439000014FB CC0_1_0 National Museum of Nat ...

This is the content of the line shown in the error above:

2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [\"Least concern - IUCN (2001)\"]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\t\"Entre Garnich et Windhof, chemin longeant la lisière du bois dit \"\"Lange Rés\"\" sur marnes liasiques\t\t\t\t\t\t\t\t49.6275\t5.96049\t\t\t\tLUGR\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tOriginal\t\tLéopold Reichling\t\t\t\t\t\t\t\t2702084\t\t\t\t\t\tJuncus tenuis Willd.\t\t\t\t\t\t\t\tPlantae\tTracheophyta\tLiliopsida\tPoales\tJuncaceae\t\tJuncus\tJuncus\t\t\ttenuis\t\t\tSPECIES\t\t\t\tACCEPTED\t\t\t962f59bc-f762-11e1-a439-00145eb45e9a\tLU\t2023-01-24T22:54:17.514Z\t\t\t\t\t\t\t\tCOUNTRY_DERIVED_FROM_COORDINATES;CONTINENT_DERIVED_FROM_COORDINATES;COLLECTION_MATCH_FUZZY\tStillImage\ttrue\tfalse\t2702084\t2702084\t6\t7707728\t196\t1369\t5353\t2701072\t\t2702084\tJuncus tenuis\tJuncus tenuis Willd.\tJuncus tenuis\t\tEML\t2023-01-24T22:54:17.514Z\t2023-01-06T10:14:02.331Z\tfalse\t\tLUX\tLuxembourg\tLUX.3_1\tLuxembourg\tLUX.3.1_1\tCapellen\tLUX.3.1.4_1\tGarnich\tNE\t

Do you think the problem is due to the runs of one, two, or three quote characters in the text? Or to the square brackets?
Could this be caused by the encoding?

Thanks.
Ahmed


EDIT: Here is reprex code for the issue:

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10802973/Arrow_parse_Example.txt", delim = "\t")

Component(s)

R

@elgabbas added the Type: usage label on Feb 22, 2023
@thisisnic changed the title from "Can not parse file" to "[R] Can not parse file" on Feb 22, 2023
@thisisnic (Member)

@elgabbas Please could you show us the exact line of code you're running to get the error?

Also, I think the line of data you've included there might be truncated; I ran stringr::str_count(text, "\t") on it to count the tabs, and there are 186; I'd expect to see 258 if the data had 259 columns.
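
A minimal sketch of that check against the full file (File_Path is a placeholder; adjust skip by one depending on whether the reported row number counts the header line):

line <- readr::read_lines(File_Path, skip = 834602, n_max = 1)  # pull the suspect row
stringr::str_count(line, "\t")  # expect 258 tabs for 259 columns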

@elgabbas (Author)

elgabbas commented Feb 22, 2023

Thanks @thisisnic ... I have updated the issue text with the exact content of the first line that has this problem.

This is the code I am using to read the file:

Occ <- read_delim_arrow(file = File_Path, delim = "\t")

Running this code now returns 258 (as expected):


"2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [\"Least concern - IUCN (2001)\"]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\t\"Entre Garnich et Windhof, chemin longeant la lisière du bois dit \"\"Lange Rés\"\" sur marnes liasiques\t\t\t\t\t\t\t\t49.6275\t5.96049\t\t\t\tLUGR\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tOriginal\t\tLéopold Reichling\t\t\t\t\t\t\t\t2702084\t\t\t\t\t\tJuncus tenuis Willd.\t\t\t\t\t\t\t\tPlantae\tTracheophyta\tLiliopsida\tPoales\tJuncaceae\t\tJuncus\tJuncus\t\t\ttenuis\t\t\tSPECIES\t\t\t\tACCEPTED\t\t\t962f59bc-f762-11e1-a439-00145eb45e9a\tLU\t2023-01-24T22:54:17.514Z\t\t\t\t\t\t\t\tCOUNTRY_DERIVED_FROM_COORDINATES;CONTINENT_DERIVED_FROM_COORDINATES;COLLECTION_MATCH_FUZZY\tStillImage\ttrue\tfalse\t2702084\t2702084\t6\t7707728\t196\t1369\t5353\t2701072\t\t2702084\tJuncus tenuis\tJuncus tenuis Willd.\tJuncus tenuis\t\tEML\t2023-01-24T22:54:17.514Z\t2023-01-06T10:14:02.331Z\tfalse\t\tLUX\tLuxembourg\tLUX.3_1\tLuxembourg\tLUX.3.1_1\tCapellen\tLUX.3.1.4_1\tGarnich\tNE\t" %>% 
  stringr::str_count("\t")

@elgabbas (Author)

Please note that I read the same data in chunks using readr and vroom without a similar error message. I would like to be able to load the full data without splitting it, which is why I started using arrow.

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

I am wondering whether the following example replicates this problem.

txt <- "a\tb\n1\t\t2"
readr::read_tsv(I(txt), show_col_types = FALSE)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
arrow::read_tsv_arrow(charToRaw(txt))
#> Error:
#> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1     2

#> Backtrace:
#>     ▆
#>  1. └─arrow (local) `<fn>`(file = charToRaw(txt), delim = "\t")
#>  2.   └─base::tryCatch(...)
#>  3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5.         └─value[[3L]](cond)
#>  6.           └─arrow:::augment_io_error_msg(e, call, schema = schema)
#>  7.             └─rlang::abort(msg, call = call)

Created on 2023-02-22 with reprex v2.0.2

@elgabbas

Please note that I read the same data in chunks using readr and vroom without a similar error message.

Did you get a warning message like the one in the example above (Warning: One or more parsing issues, call `problems()` on your data frame for details) when you read that file with readr or vroom?

@elgabbas (Author)

elgabbas commented Feb 22, 2023

I can reproduce the problem using the following code and the attached data:
Arrow_parse_Example.txt

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10802973/Arrow_parse_Example.txt", delim = "\t")
# Error in `read_delim_arrow()`: ! Invalid: CSV parse error: Row #2: Expected 259 columns, got 130: 2417934775																										DSS00439000014FB										CC0_1_0					National Museum of Nat ...

And similarly with read_tsv_arrow:

Occ <- arrow::read_tsv_arrow(file = "https://github.com/apache/arrow/files/10802973/Arrow_parse_Example.txt")
# > Error: ! Invalid: CSV parse error: Row #2: Expected 259 columns, got 130: 2417934775																										DSS00439000014FB										CC0_1_0					National Museum of Nat ...

@elgabbas (Author)

Thanks @eitsupi, I get exactly the same output when I run your code.

I don't think I see similar issues with readr, though. I am using the argument col_types = readr::cols(.default = "c") in readr to read all variables as character.

I also tried using a schema object with read_delim_arrow to load all columns as character, but the same error persists.
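
For reference, the schema attempt looked roughly like this (gbifID and abstract are column names from the file's header; a real schema would need one entry per column, and skip = 1 keeps the header line from being parsed as data):

sch <- arrow::schema(gbifID = arrow::string(), abstract = arrow::string())  # ... 257 more columns
Occ <- read_delim_arrow(file = File_Path, delim = "\t", schema = sch, skip = 1)
# The same parse error occurs: the column-count check happens while rows are
# being split, before the schema's types are ever applied.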

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

I think this issue means that vroom::vroom (readr::read_delim) has the ability to interpret consecutive \t characters as a single delimiter, whereas read_delim_arrow does not.

@elgabbas (Author)

@eitsupi I cannot tell the reason for sure, but I do not think that is the problem: read_delim_arrow was able to read earlier lines containing consecutive \t without problems.

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

I cannot tell the reason for sure, but I do not think that is the problem: read_delim_arrow was able to read earlier lines containing consecutive \t without problems.

@elgabbas Could you please upload such a file (one with several rows)? The file you just uploaded seems to contain only a header and one row.

@elgabbas (Author)

That's it... the problem happens with the second row only:
Arrow_parse_Example.txt
If I remove all quotation marks, I am able to load the data without a problem:
Arrow_parse_Example2.txt

@elgabbas (Author)

elgabbas commented Feb 22, 2023

Thanks @eitsupi and @thisisnic for your responses.

I think I now know the reason for this error, at least for this specific example row.

One of the fields contains five double-quote characters, e.g. "TEXT1 ""Quoted"". If I remove one of them, for example the last quote, I am able to read this line correctly.

Is it possible to programmatically ignore ALL single or double quotes (or just the unnecessary extra quote)?

Can this be handled using the quote argument?

Thanks


EDIT:
Using quote = "" results in no errors, but an escape character \ is added before each double quote. Is it possible to avoid this?
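
A minimal inline reprex of both behaviours, modelled on the problem row (three hypothetical columns):

txt <- "V1\tV2\tV3\n2417934775\t\"TEXT1 \"\"Quoted\"\" TEXT2\t49.6275\n"
# The opening quote of field 2 is never closed, so the parser reads past the
# tab before 49.6275 and the row comes up short on columns:
try(arrow::read_delim_arrow(charToRaw(txt), delim = "\t"))
# With quoting disabled, every quote is a literal character and the row
# splits cleanly into three fields:
arrow::read_delim_arrow(charToRaw(txt), delim = "\t", quote = "")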

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

@elgabbas Thank you for uploading this file.

Unfortunately, however, the first row fails to load on my end.
Could you show me the complete log from your session?

> arrow::read_tsv_arrow("Arrow_parse_Example.txt")
Error:
! Invalid: CSV parse error: Expected 259 columns, got 322: 2417931730                                                 DSS004390000131N                                                                         CC0_1_0                       National Museum of Nat ...
Run `rlang::last_error()` to see where the error occurred.

Since " is at the end of the first row, it seems that if I read by readr::read_tsv, the last column (eventType) of the first row will contain the contents of the second row.

> readr::read_tsv("Arrow_parse_Example.txt", show_col_types = FALSE)
# A tibble: 1 × 259                                                                                                  
      gbifID abstract accessR…¹ accru…² accru…³ accru…⁴ alter…⁵ audie…⁶ avail…⁷ bibli…⁸ confo…⁹ contr…˟ cover…˟ created
       <dbl> <lgl>    <lgl>     <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>  
1 2417931730 NA       NA        NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA     
# … with 245 more variables: creator <lgl>, date <lgl>, dateAccepted <lgl>, dateCopyrighted <lgl>,
#   dateSubmitted <lgl>, description <lgl>, educationLevel <lgl>, extent <lgl>, format <lgl>, hasFormat <lgl>,
#   hasPart <lgl>, hasVersion <lgl>, identifier <chr>, instructionalMethod <lgl>, isFormatOf <lgl>, isPartOf <lgl>,
#   isReferencedBy <lgl>, isReplacedBy <lgl>, isRequiredBy <lgl>, isVersionOf <lgl>, issued <lgl>, language <lgl>,
#   license <chr>, mediator <lgl>, medium <lgl>, modified <lgl>, provenance <lgl>, publisher <chr>, references <lgl>,
#   relation <lgl>, replaces <lgl>, requires <lgl>, rights <lgl>, rightsHolder <lgl>, source <lgl>, spatial <lgl>,
#   subject <lgl>, tableOfContents <lgl>, temporal <lgl>, title <lgl>, type <lgl>, valid <lgl>, institutionID <chr>, …
# ℹ Use `colnames()` to see all variable names

> readr::read_tsv("Arrow_parse_Example.txt", show_col_types = FALSE)$eventType
[1] "\n2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [Least concern - IUCN (2001)]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\tEntre Garnich et Windhof, chemin longeant la lisière du bois dit \"Lange Rés\" sur marnes liasiques"

@elgabbas (Author)

elgabbas commented Feb 22, 2023

Thanks @eitsupi

As I mentioned in the previous message, it seems the problem is due to an extra, unnecessary quote.
If I manually remove it (second example below: Arrow_parse_Example5.txt), I can load the data.

# This failed
Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t")
# Error in `read_delim_arrow()`: ! Invalid: CSV parse error: Row #3: Expected 3 columns, got 2: 2417934775	"TEXT1 ""Quoted"" TEXT2	49.6275

# This works
Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804096/Arrow_parse_Example5.txt", delim = "\t")

The only difference between the two files is the removal of the extra double quote.

Using quote = "", I was able to work around this specific issue, but this is how the data look now (which is not neat!):

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t", quote = "")
# A tibble: 2 × 3
# V1 V2                                V3
# 2417934775 "TEXT1\"\"NoQuoted\"\" TEXT2"   49.6
# 2417934775 "\"TEXT1 \"\"Quoted\"\" TEXT2"  49.6

I think I am close to an acceptable answer! The question now is how to avoid the resulting escape backslashes!

I could do something like:

Occ %>%
  dplyr::mutate_all(~stringr::str_replace_all(., "\"", ""))

to remove the escaped double quotes; however, I am unsure how long that would take to run on 17M rows!
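
A rough sketch of a vectorised base-R alternative (assuming Occ has been collected into a plain data frame of character columns); gsub() makes one pass over each column at C level instead of looping over rows:

# Strip literal double quotes from every column in place.
Occ[] <- lapply(Occ, function(col) gsub('"', "", col, fixed = TRUE))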

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

@elgabbas Glad you have found a solution.

Using quote = "", I was able to work around this specific issue, but this is how the data look now (which is not neat!):

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t", quote = "")
# A tibble: 2 × 3
# V1 V2                                V3
# 2417934775 "TEXT1\"\"NoQuoted\"\" TEXT2"   49.6
# 2417934775 "\"TEXT1 \"\"Quoted\"\" TEXT2"  49.6

I think I am close to an acceptable answer! The question now is how to avoid the resulting escape backslashes!

I suspect this is just tibble escaping the double quotes in the printed display. When it is not a tibble, it appears like this. Is this what you are looking for?

> read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t", quote = "") |> as.data.frame()
          V1                      V2      V3
1 2417934775 TEXT1""NoQuoted"" TEXT2 49.6275
2 2417934775 "TEXT1 ""Quoted"" TEXT2 49.6275

@elgabbas (Author)

Thanks @eitsupi
So far, I have not been able to save the data in another Arrow format to facilitate further analyses. I tried writing the dataset to Parquet or Feather objects, but RStudio always crashes because of a memory issue. I have an 8-core computer with 32 GB of RAM. Is there a way to write the data to, e.g., the Feather format more efficiently, without crashing?

@eitsupi (Contributor)

eitsupi commented Feb 23, 2023

So far, I have not been able to save the data in another Arrow format to facilitate further analyses. I tried writing the dataset to Parquet or Feather objects, but RStudio always crashes because of a memory issue. I have an 8-core computer with 32 GB of RAM. Is there a way to write the data to, e.g., the Feather format more efficiently, without crashing?

Since you seem to be able to read all the data from the CSV as a data frame, how about setting as_data_frame = FALSE to read it as an Arrow Table?
I think it will use less memory.

For example, we can convert to an Arrow IPC file (Feather V2) dataset without going through a data frame, as follows:

arrow::read_delim_arrow(
  "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt",
  delim = "\t",
  quote = "",
  as_data_frame = FALSE
) |>
  arrow::write_dataset("test", format = "arrow")

@elgabbas (Author)

Thanks @eitsupi ... this did not help in my case. Loading the data consumed too much memory and crashed my PC.

One possible solution is to loop through the values of one of the columns, filter the data on each value, and then save each subset to disk manually. I will try this and see:

arrow::open_dataset(sources = Path, format = "csv", delim = "\t", quote = "") %>%
  # (some dplyr filtering would go here) %>%
  arrow::write_dataset(path = OutPath, max_open_files = 100L, max_rows_per_file = 1000L, format = "arrow")
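
A hedged variant of the same idea: write_dataset() can do the per-value split itself via its partitioning argument, streaming record batches to per-value directories without collecting the whole table ("countryCode" is a hypothetical column name here):

arrow::open_dataset(sources = Path, format = "csv", delim = "\t", quote = "") %>%
  arrow::write_dataset(path = OutPath, format = "arrow",
                       partitioning = "countryCode",  # hypothetical column
                       max_rows_per_file = 1000000L)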

@eitsupi (Contributor)

eitsupi commented Feb 24, 2023

FYI, here is a comment about converting a huge CSV file into a Parquet file using Python:
#12653 (comment)

I am not sure whether the same thing is possible in R.
If you are in an environment where you can use pyarrow, this might be worth a try.

@Ph-ADD

Ph-ADD commented Feb 13, 2024

In case it's useful for someone else: I had a similar problem and ended up using xsv, a command-line tool, to split the CSV file, following the one-line command in this gist. I had to download the zip file, extract it, and move xsv.exe into the same directory as the job.csv file.

@ZekeMarshall

So far, I have not been able to save the data in another Arrow format to facilitate further analyses. I tried writing the dataset to Parquet or Feather objects, but RStudio always crashes because of a memory issue. I have an 8-core computer with 32 GB of RAM. Is there a way to write the data to, e.g., the Feather format more efficiently, without crashing?

Since you seem to be able to read all the data from the CSV as a data frame, how about setting as_data_frame = FALSE to read it as an Arrow Table? I think it will use less memory.

For example, we can convert to an Arrow IPC file (Feather V2) dataset without going through a data frame, as follows:

arrow::read_delim_arrow(
  "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt",
  delim = "\t",
  quote = "",
  as_data_frame = FALSE
) |>
  arrow::write_dataset("test", format = "arrow")

Hi @eitsupi and all,

The above solution, in which the argument delim = "\t" is added, worked for me when I received an almost identical error to the one @elgabbas reported.

Thanks!
