[R] Can not parse file #34291
@elgabbas Could you please show us the exact line of code you're running to get the error? Also, I think the line of data you've included there might be truncated; I ran …
Thanks @thisisnic ... I updated the issue text; it now includes the exact content of the first line that has this problem. This is the code I am using to read the file:
Using this code now returns 258 (as expected).
Please note that I read the same data in chunks using …
I am wondering if the following example replicates this problem.

txt <- "a\tb\n1\t\t2"
readr::read_tsv(I(txt), show_col_types = FALSE)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> # A tibble: 1 × 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
arrow::read_tsv_arrow(charToRaw(txt))
#> Error:
#> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1 2
#> Backtrace:
#> ▆
#> 1. └─arrow (local) `<fn>`(file = charToRaw(txt), delim = "\t")
#> 2. └─base::tryCatch(...)
#> 3. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 4. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 5. └─value[[3L]](cond)
#> 6. └─arrow:::augment_io_error_msg(e, call, schema = schema)
#> 7. └─rlang::abort(msg, call = call)

Created on 2023-02-22 with reprex v2.0.2
Did you get a warning message like the example above?
I can reproduce the problem using the following code and attached data
And similarly with …
Thanks @eitsupi I get exactly the same output when I run your code, so I do not think I am hitting a similar issue. I am using this argument … I also tried to use a schema object with …
I think this issue means that …
@eitsupi I cannot say for sure what the reason is, but I do not think this is the problem.
@elgabbas Could you please upload such a file (one with several rows)? The file you just uploaded seems to contain only a header and one row.
That's it... The problem happens with the second row only.
Thanks @eitsupi and @thisisnic for your responses. I think I now know the reason for this error, at least for this specific row: one of the fields contains five double quotes, e.g. … Is it possible to ignore ALL single or double quotes altogether (or drop the unnecessary extra quote) programmatically? Can this be handled using the … argument? Thanks
EDIT: …
@elgabbas Thank you for uploading this file. Unfortunately, however, the first row fails to load on my end.

> arrow::read_tsv_arrow("Arrow_parse_Example.txt")
Error:
! Invalid: CSV parse error: Expected 259 columns, got 322: 2417931730 DSS004390000131N CC0_1_0 National Museum of Nat ...
Run `rlang::last_error()` to see where the error occurred.
Since …

> readr::read_tsv("Arrow_parse_Example.txt", show_col_types = FALSE)
# A tibble: 1 × 259
gbifID abstract accessR…¹ accru…² accru…³ accru…⁴ alter…⁵ audie…⁶ avail…⁷ bibli…⁸ confo…⁹ contr…˟ cover…˟ created
<dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 2417931730 NA NA NA NA NA NA NA NA NA NA NA NA NA
# … with 245 more variables: creator <lgl>, date <lgl>, dateAccepted <lgl>, dateCopyrighted <lgl>,
# dateSubmitted <lgl>, description <lgl>, educationLevel <lgl>, extent <lgl>, format <lgl>, hasFormat <lgl>,
# hasPart <lgl>, hasVersion <lgl>, identifier <chr>, instructionalMethod <lgl>, isFormatOf <lgl>, isPartOf <lgl>,
# isReferencedBy <lgl>, isReplacedBy <lgl>, isRequiredBy <lgl>, isVersionOf <lgl>, issued <lgl>, language <lgl>,
# license <chr>, mediator <lgl>, medium <lgl>, modified <lgl>, provenance <lgl>, publisher <chr>, references <lgl>,
# relation <lgl>, replaces <lgl>, requires <lgl>, rights <lgl>, rightsHolder <lgl>, source <lgl>, spatial <lgl>,
# subject <lgl>, tableOfContents <lgl>, temporal <lgl>, title <lgl>, type <lgl>, valid <lgl>, institutionID <chr>, …
# ℹ Use `colnames()` to see all variable names
> readr::read_tsv("Arrow_parse_Example.txt", show_col_types = FALSE)$eventType
[1] "\n2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [Least concern - IUCN (2001)]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\tEntre Garnich et Windhof, chemin longeant la lisière du bois dit \"Lange Rés\" sur marnes liasiques" |
Thanks @eitsupi As I mentioned in the previous message, it seems the problem is due to an extra, unnecessary quotation mark.
The only difference between the two files is the removal of the extra double quote. Using …
I think I am close to an acceptable answer! The question now is how to avoid the resulting escape backslash. I can use something like:
to remove the escaped double quote; however, I am unsure how long it will take to run this on 17M rows!
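The cleanup step described here can be sketched as follows (a minimal sketch; the column name `locality` and the sample strings are hypothetical, not taken from the attached file):

```r
# Collapse doubled double-quotes ("" -> ") in a character column.
# fixed = TRUE does a literal (non-regex) substitution, which is the
# fast path in gsub() and matters on a 17M-row table.
df <- data.frame(
  locality = c('bois dit ""Lange Res"" sur marnes', "no quotes here"),
  stringsAsFactors = FALSE
)
df$locality <- gsub('""', '"', df$locality, fixed = TRUE)
```

Note that the backslashes in the tibble printout are only display escaping; after `gsub()` the stored strings contain plain `"` characters.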
@elgabbas Glad you have found a solution.
I suspect this is just tibble escaping the double quotes in the display for clarity.

> read_delim_arrow(file = "https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt", delim = "\t", quote = "") |> as.data.frame()
V1 V2 V3
1 2417934775 TEXT1""NoQuoted"" TEXT2 49.6275
2 2417934775 "TEXT1 ""Quoted"" TEXT2 49.6275
Thanks @eitsupi
Since you seem to be able to read all the data from the CSV as a data frame, how about setting …? For example, we can convert to an Arrow IPC file (Feather V2) dataset without going through a data frame as follows.

arrow::read_delim_arrow(
"https://github.com/apache/arrow/files/10804095/Arrow_parse_Example4.txt",
delim = "\t",
quote = "",
as_data_frame = FALSE
) |>
  arrow::write_dataset("test", format = "arrow")
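If the write succeeds, the resulting dataset can then be queried lazily without loading everything into memory (a sketch under the assumption that the snippet above has already written the "test" dataset; the filter value is taken from the example rows earlier in the thread):

```r
# Open the Arrow dataset written above and pull only matching rows.
# open_dataset() scans lazily, so the filter is applied before any
# data is materialised in R.
library(dplyr)

arrow::open_dataset("test", format = "arrow") |>
  dplyr::filter(V1 == 2417934775) |>
  dplyr::collect()
```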
Thanks @eitsupi ... This did not help in my case: loading the data consumed too much memory and crashed my PC. One possible solution is to loop through the values of one of the columns, filter the data on each value, then save each subset to disk manually. I will apply this and see.
FYI, here is a comment about converting a huge CSV file into a Parquet file using Python. I am not sure if the same thing is possible in R.
Hi @eitsupi and all, the above solution, in which the argument delim = "\t" is added, worked for me when I hit an almost identical error to @elgabbas's. Thanks!
Hello,
I am trying to load a large CSV file (tab-delimited; 23 GB, 16M rows, 259 columns) using the arrow R package. I get this error early on while reading the file content.
This is the content of the line shown in the previous error:
Do you think the problem is due to the use of 1, 2, or 3 quotes in the text? Due to the square brackets?
Could this be because of the encoding?
Thanks.
Ahmed
EDIT: This is a reprex code for the issue:
Component(s)
R