[R] Can not parse file #34291

Open
elgabbas opened this issue Feb 22, 2023 · 20 comments
Labels: Component: R, Type: usage (issue is a user question)

Comments

@elgabbas

elgabbas commented Feb 22, 2023

Hello,

I am trying to load a large CSV file (tab-delimited; 23 GB, 16M rows, 259 columns) using the arrow R package. I get this error early on while reading the file content:

Error in `read_delim_arrow()`: ! Invalid: CSV parse error: Row #834603: Expected 259 columns, got 322: 2417934775 DSS00439000014FB CC0_1_0 National Museum of Nat ...

This is the content of the line shown in the error above:

2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [\"Least concern - IUCN (2001)\"]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\t\"Entre Garnich et Windhof, chemin longeant la lisière du bois dit \"\"Lange Rés\"\" sur marnes liasiques\t\t\t\t\t\t\t\t49.6275\t5.96049\t\t\t\tLUGR\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tOriginal\t\tLéopold Reichling\t\t\t\t\t\t\t\t2702084\t\t\t\t\t\tJuncus tenuis Willd.\t\t\t\t\t\t\t\tPlantae\tTracheophyta\tLiliopsida\tPoales\tJuncaceae\t\tJuncus\tJuncus\t\t\ttenuis\t\t\tSPECIES\t\t\t\tACCEPTED\t\t\t962f59bc-f762-11e1-a439-00145eb45e9a\tLU\t2023-01-24T22:54:17.514Z\t\t\t\t\t\t\t\tCOUNTRY_DERIVED_FROM_COORDINATES;CONTINENT_DERIVED_FROM_COORDINATES;COLLECTION_MATCH_FUZZY\tStillImage\ttrue\tfalse\t2702084\t2702084\t6\t7707728\t196\t1369\t5353\t2701072\t\t2702084\tJuncus tenuis\tJuncus tenuis Willd.\tJuncus tenuis\t\tEML\t2023-01-24T22:54:17.514Z\t2023-01-06T10:14:02.331Z\tfalse\t\tLUX\tLuxembourg\tLUX.3_1\tLuxembourg\tLUX.3.1_1\tCapellen\tLUX.3.1.4_1\tGarnich\tNE\t

Do you think the problem is due to the runs of one, two, or three quote characters in the text? Or to the square brackets?
Could this be caused by the encoding?

Thanks.
Ahmed


EDIT: Here is reprex code for the issue:

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10802973/Arrow_parse_Example.txt", delim = "\t")

Component(s)

R

@elgabbas added the Type: usage label on Feb 22, 2023
@thisisnic changed the title from "Can not parse file" to "[R] Can not parse file" on Feb 22, 2023
@thisisnic (Member)

@elgabbas Please could you show us the exact line of code you're running to get the error?

Also, I think the line of data you've included there might be truncated; I ran stringr::str_count(text, "\t") on it to count the tabs, and there are 186; I'd expect to see 258 if the data had 259 columns.
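
A minimal sketch of that check against the full file (File_Path is a placeholder; adjust skip by one depending on whether the reported row number counts the header line):

line <- readr::read_lines(File_Path, skip = 834602, n_max = 1)  # pull the suspect row
stringr::str_count(line, "\t")  # expect 258 tabs for 259 columns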

@elgabbas (Author)

elgabbas commented Feb 22, 2023

Thanks @thisisnic ... I have updated the issue text with the exact content of the first line that has this problem.

This is the code I am using to read the file:

Occ <- read_delim_arrow(file = File_Path, delim = "\t")

Running this code now returns 258 (as expected):


"2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [\"Least concern - IUCN (2001)\"]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\t\"Entre Garnich et Windhof, chemin longeant la lisière du bois dit \"\"Lange Rés\"\" sur marnes liasiques\t\t\t\t\t\t\t\t49.6275\t5.96049\t\t\t\tLUGR\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tOriginal\t\tLéopold Reichling\t\t\t\t\t\t\t\t2702084\t\t\t\t\t\tJuncus tenuis Willd.\t\t\t\t\t\t\t\tPlantae\tTracheophyta\tLiliopsida\tPoales\tJuncaceae\t\tJuncus\tJuncus\t\t\ttenuis\t\t\tSPECIES\t\t\t\tACCEPTED\t\t\t962f59bc-f762-11e1-a439-00145eb45e9a\tLU\t2023-01-24T22:54:17.514Z\t\t\t\t\t\t\t\tCOUNTRY_DERIVED_FROM_COORDINATES;CONTINENT_DERIVED_FROM_COORDINATES;COLLECTION_MATCH_FUZZY\tStillImage\ttrue\tfalse\t2702084\t2702084\t6\t7707728\t196\t1369\t5353\t2701072\t\t2702084\tJuncus tenuis\tJuncus tenuis Willd.\tJuncus tenuis\t\tEML\t2023-01-24T22:54:17.514Z\t2023-01-06T10:14:02.331Z\tfalse\t\tLUX\tLuxembourg\tLUX.3_1\tLuxembourg\tLUX.3.1_1\tCapellen\tLUX.3.1.4_1\tGarnich\tNE\t" %>% 
  stringr::str_count("\t")

@elgabbas (Author)

Please note that I read the same data in chunks using readr and vroom without a similar error message. I would like to be able to load the full data without splitting it, which is why I started using arrow.

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

I am wondering whether the following example replicates this problem.

txt <- "a\tb\n1\t\t2"
readr::read_tsv(I(txt), show_col_types = FALSE)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
arrow::read_tsv_arrow(charToRaw(txt))
#> Error:
#> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1     2

#> Backtrace:
#>     ▆
#>  1. └─arrow (local) `<fn>`(file = charToRaw(txt), delim = "\t")
#>  2.   └─base::tryCatch(...)
#>  3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5.         └─value[[3L]](cond)
#>  6.           └─arrow:::augment_io_error_msg(e, call, schema = schema)
#>  7.             └─rlang::abort(msg, call = call)

Created on 2023-02-22 with reprex v2.0.2

@elgabbas

Please note that I read the same data in chunks using readr and vroom without a similar error message.

Did you get a warning message like the one in the example above (Warning: One or more parsing issues, call `problems()` on your data frame for details) when you read that file with readr or vroom?

@elgabbas (Author)

elgabbas commented Feb 22, 2023

I can reproduce the problem using the following code and the attached data:
Arrow_parse_Example.txt

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10802973/Arrow_parse_Example.txt", delim = "\t")
# Error in `read_delim_arrow()`: ! Invalid: CSV parse error: Row #2: Expected 259 columns, got 130: 2417934775																										DSS00439000014FB										CC0_1_0					National Museum of Nat ...

And similarly with read_tsv_arrow:

Occ <- arrow::read_tsv_arrow(file = "https://github.com/apache/arrow/files/10802973/Arrow_parse_Example.txt")
# > Error: ! Invalid: CSV parse error: Row #2: Expected 259 columns, got 130: 2417934775																										DSS00439000014FB										CC0_1_0					National Museum of Nat ...

@elgabbas (Author)

Thanks @eitsupi, I get exactly the same output when I run your code.

I don't think I see similar issues with readr, though. I am using the argument col_types = readr::cols(.default = "c") in readr to read all variables as character.

I also tried using a schema object with read_delim_arrow to load all columns as character, but the same error persists.
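
For reference, the schema attempt looked roughly like this (gbifID and abstract are column names from the file's header; a real schema would need one entry per column, and skip = 1 keeps the header line from being parsed as data):

sch <- arrow::schema(gbifID = arrow::string(), abstract = arrow::string())  # ... 257 more columns
Occ <- read_delim_arrow(file = File_Path, delim = "\t", schema = sch, skip = 1)
# The same parse error occurs: the column-count check happens while rows are
# being split, before the schema's types are ever applied.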

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

I think this issue means that vroom::vroom (readr::read_delim) has the ability to interpret consecutive \t characters as a single delimiter, whereas read_delim_arrow does not.

@elgabbas (Author)

@eitsupi I cannot tell the reason for sure, but I do not think that is the problem: read_delim_arrow was able to read earlier lines containing consecutive \t without problems.

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

I cannot tell the reason for sure, but I do not think that is the problem: read_delim_arrow was able to read earlier lines containing consecutive \t without problems.

@elgabbas Could you please upload such a file (one with several rows)? The file you just uploaded seems to contain only a header and one row.

@elgabbas (Author)

That's it... the problem happens with the second row only:
Arrow_parse_Example.txt
If I remove all quotation marks, I am able to load the data without a problem:
Arrow_parse_Example2.txt

@elgabbas (Author)

elgabbas commented Feb 22, 2023

Thanks @eitsupi and @thisisnic for your responses.

I think I now know the reason for this error, at least for this specific example row.

One of the fields contains five double-quote characters, e.g. "TEXT1 ""Quoted"". If I remove one of them, for example the last quote, I am able to read this line correctly.

Is it possible to programmatically ignore ALL single or double quotes (or just the unnecessary extra quote)?

Can this be handled using the quote argument?

Thanks


EDIT:
Using quote = "" results in no errors, but an escape character \ is added before each double quote. Is it possible to avoid this?
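
A minimal inline reprex of both behaviours, modelled on the problem row (three hypothetical columns):

txt <- "V1\tV2\tV3\n2417934775\t\"TEXT1 \"\"Quoted\"\" TEXT2\t49.6275\n"
# The opening quote of field 2 is never closed, so the parser reads past the
# tab before 49.6275 and the row comes up short on columns:
try(arrow::read_delim_arrow(charToRaw(txt), delim = "\t"))
# With quoting disabled, every quote is a literal character and the row
# splits cleanly into three fields:
arrow::read_delim_arrow(charToRaw(txt), delim = "\t", quote = "")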

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

@elgabbas Thank you for uploading this file.

Unfortunately, however, the first row fails to load on my end.
Could you show me the complete log from your session?

> arrow::read_tsv_arrow("Arrow_parse_Example.txt")
Error:
! Invalid: CSV parse error: Expected 259 columns, got 322: 2417931730                                                 DSS004390000131N                                                                         CC0_1_0                       National Museum of Nat ...
Run `rlang::last_error()` to see where the error occurred.

Since " is at the end of the first row, it seems that if I read by readr::read_tsv, the last column (eventType) of the first row will contain the contents of the second row.

> readr::read_tsv("Arrow_parse_Example.txt", show_col_types = FALSE)
# A tibble: 1 × 259                                                                                                  
      gbifID abstract accessR…¹ accru…² accru…³ accru…⁴ alter…⁵ audie…⁶ avail…⁷ bibli…⁸ confo…⁹ contr…˟ cover…˟ created
       <dbl> <lgl>    <lgl>     <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>  
1 2417931730 NA       NA        NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA     
# … with 245 more variables: creator <lgl>, date <lgl>, dateAccepted <lgl>, dateCopyrighted <lgl>,
#   dateSubmitted <lgl>, description <lgl>, educationLevel <lgl>, extent <lgl>, format <lgl>, hasFormat <lgl>,
#   hasPart <lgl>, hasVersion <lgl>, identifier <chr>, instructionalMethod <lgl>, isFormatOf <lgl>, isPartOf <lgl>,
#   isReferencedBy <lgl>, isReplacedBy <lgl>, isRequiredBy <lgl>, isVersionOf <lgl>, issued <lgl>, language <lgl>,
#   license <chr>, mediator <lgl>, medium <lgl>, modified <lgl>, provenance <lgl>, publisher <chr>, references <lgl>,
#   relation <lgl>, replaces <lgl>, requires <lgl>, rights <lgl>, rightsHolder <lgl>, source <lgl>, spatial <lgl>,
#   subject <lgl>, tableOfContents <lgl>, temporal <lgl>, title <lgl>, type <lgl>, valid <lgl>, institutionID <chr>, …
# ℹ Use `colnames()` to see all variable names

> readr::read_tsv("Arrow_parse_Example.txt", show_col_types = FALSE)$eventType
[1] "\n2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [Least concern - IUCN (2001)]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\tEntre Garnich et Windhof, chemin longeant la lisière du bois dit \"Lange Rés\" sur marnes liasiques"

@elgabbas (Author)

elgabbas commented Feb 22, 2023

Thanks @eitsupi

As I mentioned in the previous message, it seems the problem is due to an extra, unnecessary quote.
If I manually remove it (second example below: Arrow_parse_Example5.txt), I can load the data.

# This failed
Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t")
# Error in `read_delim_arrow()`: ! Invalid: CSV parse error: Row #3: Expected 3 columns, got 2: 2417934775	"TEXT1 ""Quoted"" TEXT2	49.6275

# This works
Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804096/Arrow_parse_Example5.txt", delim = "\t")

The only difference between the two files is the removal of the extra double quote.

Using quote = "", I was able to work around this specific issue, but this is how the data look now (which is not neat!):

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t", quote = "")
# A tibble: 2 × 3
# V1 V2                                V3
# 2417934775 "TEXT1\"\"NoQuoted\"\" TEXT2"   49.6
# 2417934775 "\"TEXT1 \"\"Quoted\"\" TEXT2"  49.6

I think I am close to an acceptable answer! The question now is how to avoid the resulting escape backslashes!

I could do something like:

Occ %>%
  dplyr::mutate_all(~stringr::str_replace_all(., "\"", ""))

to remove the escaped double quotes; however, I am unsure how long that would take to run on 17M rows!
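
A rough sketch of a vectorised base-R alternative (assuming Occ has been collected into a plain data frame of character columns); gsub() makes one pass over each column at C level instead of looping over rows:

# Strip literal double quotes from every column in place.
Occ[] <- lapply(Occ, function(col) gsub('"', "", col, fixed = TRUE))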

@eitsupi (Contributor)

eitsupi commented Feb 22, 2023

@elgabbas Glad you have found a solution.

Using quote = "", I was able to work around this specific issue, but this is how the data look now (which is not neat!):

Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t", quote = "")
# A tibble: 2 × 3
# V1 V2                                V3
# 2417934775 "TEXT1\"\"NoQuoted\"\" TEXT2"   49.6
# 2417934775 "\"TEXT1 \"\"Quoted\"\" TEXT2"  49.6

I think I am close to an acceptable answer! The question now is how to avoid the resulting escape backslashes!

I suspect this is just tibble escaping the double quotes in the printed display. When it is not a tibble, it appears like this. Is this what you are looking for?

> read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t", quote = "") |> as.data.frame()
          V1                      V2      V3
1 2417934775 TEXT1""NoQuoted"" TEXT2 49.6275
2 2417934775 "TEXT1 ""Quoted"" TEXT2 49.6275

@elgabbas (Author)

Thanks @eitsupi
So far, I have not been able to save the data in another Arrow format to facilitate further analyses. I tried writing the dataset to Parquet or Feather objects, but RStudio always crashes because of a memory issue. I have an 8-core computer with 32 GB of RAM. Is there a way to write the data to, e.g., the Feather format more efficiently, without crashing?

@eitsupi (Contributor)

eitsupi commented Feb 23, 2023

So far, I have not been able to save the data in another Arrow format to facilitate further analyses. I tried writing the dataset to Parquet or Feather objects, but RStudio always crashes because of a memory issue. I have an 8-core computer with 32 GB of RAM. Is there a way to write the data to, e.g., the Feather format more efficiently, without crashing?

Since you seem to be able to read all the data from the CSV as a data frame, how about setting as_data_frame = FALSE to read it as an Arrow Table?
I think it will use less memory.

For example, we can convert to an Arrow IPC file (Feather V2) dataset without going through a data frame, as follows:

arrow::read_delim_arrow(
  "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt",
  delim = "\t",
  quote = "",
  as_data_frame = FALSE
) |>
  arrow::write_dataset("test", format = "arrow")

@elgabbas (Author)

Thanks @eitsupi ... this did not help in my case. Loading the data consumed too much memory and crashed my PC.

One possible solution is to loop through the values of one of the columns, filter the data on each value, and then save each subset to disk manually. I will try this and see:

arrow::open_dataset(sources = Path, format = "csv", delim = "\t", quote = "") %>%
  # (some dplyr filtering would go here) %>%
  arrow::write_dataset(path = OutPath, max_open_files = 100L, max_rows_per_file = 1000L, format = "arrow")
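
A hedged variant of the same idea: write_dataset() can do the per-value split itself via its partitioning argument, streaming record batches to per-value directories without collecting the whole table ("countryCode" is a hypothetical column name here):

arrow::open_dataset(sources = Path, format = "csv", delim = "\t", quote = "") %>%
  arrow::write_dataset(path = OutPath, format = "arrow",
                       partitioning = "countryCode",  # hypothetical column
                       max_rows_per_file = 1000000L)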

@eitsupi (Contributor)

eitsupi commented Feb 24, 2023

FYI, here is a comment about converting a huge CSV file into a Parquet file using Python:
#12653 (comment)

I am not sure whether the same thing is possible in R.
If you are in an environment where you can use pyarrow, this might be worth a try.

@Ph-ADD

Ph-ADD commented Feb 13, 2024

In case it's useful for someone else: I had a similar problem and ended up using xsv, a command-line tool, to split the CSV file, following the one-line command in this gist. I had to download the zip file, extract it, and move xsv.exe into the same directory as the job.csv file.

@ZekeMarshall

So far, I have not been able to save the data in another Arrow format to facilitate further analyses. I tried writing the dataset to Parquet or Feather objects, but RStudio always crashes because of a memory issue. I have an 8-core computer with 32 GB of RAM. Is there a way to write the data to, e.g., the Feather format more efficiently, without crashing?

Since you seem to be able to read all the data from the CSV as a data frame, how about setting as_data_frame = FALSE to read it as an Arrow Table? I think it will use less memory.

For example, we can convert to an Arrow IPC file (Feather V2) dataset without going through a data frame, as follows:

arrow::read_delim_arrow(
  "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt",
  delim = "\t",
  quote = "",
  as_data_frame = FALSE
) |>
  arrow::write_dataset("test", format = "arrow")

Hi @eitsupi and all,

The above solution, in which the argument delim = "\t" is added, worked for me when I received an almost identical error to the one @elgabbas reported.

Thanks!
