Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"CREATE TABLE" does not append newline to header row in new CSV file. #22

Closed
derekmahar opened this issue Dec 14, 2019 · 11 comments
Closed

Comments

@derekmahar
Copy link
Contributor

derekmahar commented Dec 14, 2019

In the CSV file that it creates, csvq SQL statement CREATE TABLE does not append a newline character to the header that it generates in the first row:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ cat /tmp/test
a,b,c$ rm /tmp/test

Note that the $ prompt immediately follows the last column name in the header instead of appearing on the next line because CREATE TABLE does not append a newline character to the header row.

As a consequence of this issue, should a program other than csvq append a data row to the new table without first appending a newline to the header (or prepending a newline to the data row), the data row will immediately follow the header on the same row:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ printf "1,2,3\n" >> /tmp/test
$ cat /tmp/test
a,b,c1,2,3
$ rm /tmp/test

A simple workaround to this issue is to append a newline to the new CSV file after CREATE TABLE creates it, but before appending any data rows:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ echo >> /tmp/test
$ printf "1,2,3\n" >> /tmp/test
$ cat /tmp/test
a,b,c
1,2,3
$ rm /tmp/test

Another solution would be to prepend each new data row with a newline, but omit the trailing newline:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ printf "\n1,2,3" >> /tmp/test
$ printf "\n4,5,6" >> /tmp/test
$ cat /tmp/test
a,b,c
1,2,3
4,5,6$ rm /tmp/test

Consistent with CREATE TABLE, SQL statement INSERT INTO also follows this convention of prepending each data row with newline:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ csvq --repository /tmp "INSERT INTO test VALUES (4,5,6)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test
a,b,c
1,2,3
4,5,6$ rm /tmp/test

To avoid this issue altogether, instead of using CREATE TABLE to create the CSV file and write the header row, we could "manually" create the file and write the header row with a trailing newline:

$ printf "a,b,c\n" > /tmp/test
$ printf "1,2,3\n" >> /tmp/test
$ cat /tmp/test
a,b,c
1,2,3
$ rm /tmp/test

There are several disadvantages to this manual approach:

  1. Redirection overwrites an existing CSV file
  2. Redirection cannot be applied to another SQL database like SQLite
  3. Does not benefit from transaction management
  4. Requires a code change to change the raw CSV header structure

The newline convention that csvq SQL statements CREATE TABLE and INSERT INTO follows, while internally consistent, is not consistent with external shell commands.

@mithrandie
Copy link
Owner

mithrandie commented Dec 14, 2019

First, the current behavior of csvq is as follows.

  • SELECT query appends a line break at the end of the output except for some exceptions.
  • Update queries, including CREATE TABLE query, don’t append line breaks.

In the csvq, SELECT query is positioned as a statement that outputs the result and passes it to the outside. On the other hand, other queries do not do such a special processing because these queries complete operations within a transaction of csvq and do not link directly with the outside.

If we strictly follow the posix standard, update queries should also append line breaks at the end of the files. However, line breaks are now often used as line separators rather than line endings, and I can't decide which behavior is better. At least, I think that adding a line break at the end of a file if needed is easier than removing a line break at the end of a file when not need.

By the way, I don’t understand exactly what you want to do. Transaction management of csvq affects only operations with in the csvq processing, and the external commands should not be relevant.

@derekmahar
Copy link
Contributor Author

derekmahar commented Dec 14, 2019

If we strictly follow the posix standard, update queries should also append line breaks at the end of the files.

Does the POSIX standard specify this as a convention? RFC 4180 states that a line break (CRLF) is optional on the last line of a CSV file, but doesn't specify a line break is optional for the header row. csvq follows the former convention, but it's unclear whether csvq violates the latter.

However, line breaks are now often used as line separators rather than line endings, and I can't decide which behavior is better.

What is the difference between line separator and line ending in this context? What newline convention do other CVS processing tools follow? Is csvq consistent with these tools?

At least, I think that adding a line break at the end of a file if needed is easier than removing a line break at the end of a file when not need.

Yes, I agree.

By the way, I don’t understand exactly what you want to do. Transaction management of csvq affects only operations with in the csvq processing, and the external commands should not be relevant.

I guess my point about transaction management is not relevant here because my examples mix csvq and external shell operations which are not subject to csvq transactions. However, I think the other disadvantages that I list are relevant.

@derekmahar
Copy link
Contributor Author

What newline convention do other CVS processing tools follow? Is csvq consistent with these tools?

Miller doesn't complain about empty CSV files that csvq creates:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ mlr --csv cat /tmp/test
$ rm /tmp/test

Miller can also process non-empty CSV files that csvq creates, though unlike csvq, Miller appends a trailing line break to the last data row:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ mlr --csv cat /tmp/test
a,b,c
1,2,3
$ rm /tmp/test

Unlike csvq SQL command CREATE TABLE, it seems that Miller has no mechanism for creating a new CSV file.

@mithrandie
Copy link
Owner

The POSIX standard defines about a line as follows.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

3.206 Line

A sequence of zero or more non- characters plus a terminating character.

The RFC 4180 defines about csv format as follows.

https://tools.ietf.org/html/rfc4180

  1. The last record in the file may or may not have an ending line break.

It’s header Line is as normal lines.

  1. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines.

Csv is used by various tools including GUI as well as shell commands. It is not uncommon for users to write related processes themselves.
It can’t be fitted all tools, and I think the important thing is internal consistency.

@derekmahar
Copy link
Contributor Author

The POSIX standard defines about a line as follows.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

3.206 Line
A sequence of zero or more non- characters plus a terminating character.

Okay, then it seems as you pointed out earlier, csvq does not follow this convention.

It’s header Line is as normal lines.

  1. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines.

I must have missed this point in RFC 4180. csvq also treats the header line as a "normal" line so it apparently does follow the line break conventions of RFC 4180.

Csv is used by various tools including GUI as well as shell commands. It is not uncommon for users to write related processes themselves.

Users like me, for example. :)

It can’t be fitted all tools, and I think the important thing is internal consistency.

Yes, I agree, but I think a CSV tool should at least try to follow the conventions that most tools follow. While csvq (and RFC 4180) violate the POSIX line definition, csvq fortunately does follow the line break conventions of RFC 4180 which is arguably more important.

@derekmahar
Copy link
Contributor Author

derekmahar commented Dec 14, 2019

csvkit also doesn't complain about CSV files that csvq creates:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
$ cat /tmp/test | csvclean
No errors.
$ cat /tmp/test | csvlook
| a | b | c |
| - | - | - |
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test | csvclean
No errors.
$ cat /tmp/test | csvlook
|    a | b | c |
| ---- | - | - |
| True | 2 | 3 |
$ rm /tmp/test

By default, csvkit command csvformat appends a line break to header and data rows in a CSV file:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ cat /tmp/test | csvformat
a,b,c
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test | csvformat
a,b,c
1,2,3
$ rm /tmp/test

Like Miller, csvkit doesn't have a command to create a CSV file with a given a header.

@derekmahar
Copy link
Contributor Author

csvtool also doesn't complain about CSV files that csvq generates, though like Miller and csvkit, it appends a line break to header and data rows:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvtool cat /tmp/test
a,b,c
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ csvtool cat /tmp/test
$ rm /tmp/test

@derekmahar
Copy link
Contributor Author

derekmahar commented Dec 14, 2019

  • While csvq follows a different line break convention than other popular CSV processing tools, these tools can process files that csvq creates.
  • While csvq doesn't follow the line definition in the POSIX standard, it does follow the CSV structure that RFC 4180 specifies.
  • I will adapt my shell scripts to the line break convention that csvq follows.

@mithrandie
Copy link
Owner

I changed my mind. The following changes will be implemented in the next release.

  • Update queries append a line break at the end of a file by default.
  • Add a command option to specify whether to add a line break or not. It is not yet determined whether this option affects the output to standard output.

@mithrandie mithrandie reopened this Dec 16, 2019
@derekmahar
Copy link
Contributor Author

derekmahar commented Dec 16, 2019

Thank you. This idea is a sensible compromise.

mithrandie added a commit that referenced this issue Jan 12, 2020
#22])

With this change, the command option "--strip-ending-line-break" has been added.
mithrandie added a commit that referenced this issue Jan 12, 2020
- Append a newline at the end of created and updated files. ([Github #22](#22))
- Add the command option "--strip-ending-line-break".
@derekmahar
Copy link
Contributor Author

Thank you for this fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants