Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quoting bug #37

Closed
cpkio opened this issue Sep 4, 2020 · 1 comment
Closed

Quoting bug #37

cpkio opened this issue Sep 4, 2020 · 1 comment

Comments

@cpkio
Copy link

cpkio commented Sep 4, 2020

I have found out that quote symbol as a first symbol of a field can break the reader, if this quote is not a field quoting, but an ordinary quote. Tested with v1.13.1.

The source data is:

Код формы;Наименование показателя;Единицы измерения;Комментарий
1-МТС;"""░░░░░░ - ░░░░"", ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░░. ░░░.)";░░░. ░░░.;
1-МТС;"""░░░░░░ - ░░░░"", ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░.)";░░.;
1-МТС;"""░░░░░░ - ░░░░""; ░░░░░░░░░░ (░░░░░░░░░░) ░░░░░░░░░ ░░░░░░░░░░░░░░░░░░░░ ░░░░░░░, (░░░. ░░░.)";░░░. ░░░.;
1-МТС;"""░░░░░░ - ░░░░""; ░░░░░░░ ░░░░░ ░ ░░░ (░░░░░░░░░) ░░ ░░░░░ ░░░░, (░░.)";░░.;
1-МТС;"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░, ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░░. ░░░.)";░░░. ░░░.;
1-МТС;"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░, ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░.)";░░.;
1-МТС;"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░; ░░░░░░░░░░ (░░░░░░░░░░) ░░░░░░░░░ ░░░░░░░░░░░░░░░░░░░░ ░░░░░░░, (░░░. ░░░.)";░░░. ░░░.;
1-МТС;"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░; ░░░░░░░ ░░░░░ ░ ░░░ (░░░░░░░░░) ░░ ░░░░░ ░░░░, (░░.)";░░.;

This is a semicolon-separated csv. I've replaced the text with blocks for convinience.

Let's update this file test_trgt.csv:

csvq -Q -d ; -D ; 'update test_trgt set `Комментарий` = '+++' where `Код формы` = '1-МТС'

The file will become this:

Код формы;Наименование показателя;Единицы измерения;Комментарий
1-МТС;"░░░░░░ - ░░░░", ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░░. ░░░.);░░░. ░░░.;+++
1-МТС;"░░░░░░ - ░░░░", ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░.);░░.;+++
1-МТС;"""░░░░░░ - ░░░░""; ░░░░░░░░░░ (░░░░░░░░░░) ░░░░░░░░░ ░░░░░░░░░░░░░░░░░░░░ ░░░░░░░, (░░░. ░░░.)";░░░. ░░░.;+++
1-МТС;"""░░░░░░ - ░░░░""; ░░░░░░░ ░░░░░ ░ ░░░ (░░░░░░░░░) ░░ ░░░░░ ░░░░, (░░.)";░░.;+++
1-МТС;"░░░░" ░ ░░░░░░ ░░░░░░░░░░░░░, ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░░. ░░░.);░░░. ░░░.;+++
1-МТС;"░░░░" ░ ░░░░░░ ░░░░░░░░░░░░░, ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░.);░░.;+++
1-МТС;"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░; ░░░░░░░░░░ (░░░░░░░░░░) ░░░░░░░░░ ░░░░░░░░░░░░░░░░░░░░ ░░░░░░░, (░░░. ░░░.)";░░░. ░░░.;+++
1-МТС;"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░; ░░░░░░░ ░░░░░ ░ ░░░ (░░░░░░░░░) ░░ ░░░░░ ░░░░, (░░.)";░░.;+++

As you can see, on the second line the field quotes and double quotes (as quoted quotes) have been removed.

Let's try to read this resulting file:

csvq -d ';' 'select * from test_trgt'

This will result an error: [L:1 C:15] data parse error in file … test_trgt.csv: line 2, column 21: unexpected " in field.

I have tried any variants of -k and -Q options, but cannot make it read again correctly. The file is correct, field does not need to be quoted, but quote mark as a first field symbol seems to be breaking it.

@mithrandie
Copy link
Owner

The command option "–enclose-all(-Q)" has an effect on newly created files specified by the "--out" option and in CREATE TABLE statements, and for existing files, the information determined at the time of loading is used.

$ csvq -d ';' fields test_trgt.csv

                 Fields in test_trgt.csv
-----------------------------------------------------
 Type: Table
 Path: /tmp/test_trgt.csv
 Format: CSV     Delimiter: ';'   Enclose All: false
 Encoding: UTF8  LineBreak: LF    Header: true
 Status: Fixed
 Fields:
   1. Код формы
   2. Наименование показателя
   3. Единицы измерения
   4. Комментарий

As a workaround, you can use ALTER TABLE statement to set ENCLOSE_ALL attribute for the updated file.

$ csvq -d ';' 'update t set t.4 = "+++" from test_trgt t; alter table test_trgt set enclose_all to true;'
8 records updated on "/tmp/test_trgt.csv".

                   Attributes Updated in test_trgt
-----------------------------------------------------
 Path: /tmp/test_trgt.csv
 Format: CSV     Delimiter: ';'   Enclose All: true
 Encoding: UTF8  LineBreak: LF    Header: true

Commit: file "/tmp/test_trgt.csv" is updated.
$ cat test_trgt.csv
"Код формы";"Наименование показателя";"Единицы измерения";"Комментарий"
"1-МТС";"""░░░░░░ - ░░░░"", ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░░. ░░░.)";"░░░. ░░░.";"+++"
"1-МТС";"""░░░░░░ - ░░░░"", ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░.)";"░░.";"+++"
"1-МТС";"""░░░░░░ - ░░░░""; ░░░░░░░░░░ (░░░░░░░░░░) ░░░░░░░░░ ░░░░░░░░░░░░░░░░░░░░ ░░░░░░░, (░░░. ░░░.)";"░░░. ░░░.";"+++"
"1-МТС";"""░░░░░░ - ░░░░""; ░░░░░░░ ░░░░░ ░ ░░░ (░░░░░░░░░) ░░ ░░░░░ ░░░░, (░░.)";"░░.";"+++"
"1-МТС";"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░, ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░░. ░░░.)";"░░░. ░░░.";"+++"
"1-МТС";"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░, ░ ░░░ ░░░░░ ░░░░░░░░░░░░░ ░░░ ░░ ░░░░░░░░░░ ░░░░░░(░░.)";"░░.";"+++"
"1-МТС";"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░; ░░░░░░░░░░ (░░░░░░░░░░) ░░░░░░░░░ ░░░░░░░░░░░░░░░░░░░░ ░░░░░░░, (░░░. ░░░.)";"░░░. ░░░.";"+++"
"1-МТС";"""░░░░"" ░ ░░░░░░ ░░░░░░░░░░░░░; ░░░░░░░ ░░░░░ ░ ░░░ (░░░░░░░░░) ░░ ░░░░░ ░░░░, (░░.)";"░░.";"+++"

When saving a file with ENCLOSE_ALL set to false, the values in the file are automatically enclosed only if its value contains a character specified as a delimiter.
Normally this is not a problem. However, if a value starts with a double quote, the value is assumed to be enclosed in double quotes, which causes a problem with the interpretation of the csv.

I will fix to automatically enclose values containing double-quotes.

mithrandie added a commit that referenced this issue Sep 5, 2020
- Enclose csv values containing double-quotes automatically. ([Github #37](#37))
mithrandie added a commit that referenced this issue Sep 5, 2020
- Enclose csv values containing double-quotes automatically. ([Github #37](#37))

- Discontinued to distribute binary for the following OS/Arch.
  - darwin/386
  - linux/mips
  - linux/mips64
  - linux/mips64le
  - linux/mipsle
  - linux/s390x
mithrandie added a commit that referenced this issue Sep 5, 2020
- Enclose csv values containing double-quotes automatically. ([Github #37](#37))

- Discontinued to distribute binary for the following OS/Arch.
  - darwin/386
  - linux/mips
  - linux/mips64
  - linux/mips64le
  - linux/mipsle
  - linux/s390x
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants