bug: CSV import and data consistency issue or encoding issue #4676

Closed
1 of 2 tasks
alitrack opened this issue Apr 2, 2022 · 2 comments · Fixed by #4701
Assignees
Labels
A-query Area: databend query C-bug Category: something isn't working

Comments

@alitrack

alitrack commented Apr 2, 2022

Search before asking

  • I had searched in the issues and found no similar issues.

Version

0.7.3

What's Wrong?

Data imported from a CSV file may come out garbled.

image

What You Expected?

image

How to Reproduce?

  • create table
create table books
(
    title VARCHAR(255),
    author VARCHAR(255),
    date VARCHAR(255)
);
  • import data
echo curl -H \"insert_sql:insert into book_db.books format CSV\" -H \"skip_header:0\" -H \"field_delimiter:','\" -H \"record_delimiter:'\n'\" -F  \"upload=@./books_gbk.csv\" -XPUT http://127.0.0.1:8081/v1/streaming_load|bash

books.zip

Anything Else?

  • An encoding check is needed.
  • The ontime dataset has the same issue, because the data for October 2001 contains non-ASCII characters.
  • It would be better to support compressed CSV, for example gzip and bzip2.
  • Unaccent support.

image
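
The encoding check requested above could also be done on the client before uploading. The following is a minimal sketch, not Databend code: the function name `is_valid_utf8` and the chunk size are assumptions made for illustration. It streams the file through Python's incremental UTF-8 decoder, so large files such as the ontime CSVs do not need to fit in memory.

```python
import codecs


def is_valid_utf8(path, chunk_size=1 << 20):
    """Return True if the file at `path` decodes cleanly as UTF-8.

    Hypothetical helper for a pre-upload check; not part of Databend.
    """
    # The incremental decoder keeps partial multi-byte sequences between
    # chunks, so a character split across a chunk boundary is handled.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            try:
                # On the final (empty) read, final=True makes the decoder
                # reject a multi-byte sequence truncated at end of file.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError:
                return False
            if not chunk:
                return True
```

A file such as `books_gbk.csv` would be expected to fail this check, which signals that it needs converting before it is sent to `/v1/streaming_load`.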

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@alitrack alitrack added the C-bug Category: something isn't working label Apr 2, 2022
@BohuTANG BohuTANG added the A-query Area: databend query label Apr 2, 2022
@sundy-li
Member

sundy-li commented Apr 2, 2022

  1. You can use iconv -f GBK -t UTF-8 a_gbk.csv -o b_utf8.csv to convert the file to UTF-8; Databend can't detect the character encoding.

  2. Compression (gzip, bzip2) can be supported in the future.
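
Where `iconv` is not available, the same conversion can be sketched in Python. This is an illustration under stated assumptions, not part of Databend: the function name `convert_to_utf8`, its parameters, and the output file name are hypothetical, and the source encoding must be supplied by the caller because it cannot be detected reliably.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="gbk", chunk_chars=1 << 16):
    """Re-encode a text file from `src_encoding` to UTF-8.

    Hypothetical helper; raises UnicodeDecodeError if the input is not
    valid in `src_encoding`.
    """
    # newline="" disables newline translation on both sides, so the CSV
    # record delimiters are written exactly as they were read.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            text = src.read(chunk_chars)
            if not text:
                break
            dst.write(text)


if __name__ == "__main__":
    # books_gbk.csv is the file from this issue; the output name is made up.
    convert_to_utf8("books_gbk.csv", "books_utf8.csv")
```

For a file in another encoding, pass a different `src_encoding`, for example `"iso-8859-1"`.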

@alitrack
Author

alitrack commented Apr 3, 2022

  1. You can use iconv -f GBK -t UTF-8 a_gbk.csv -o b_utf8.csv to convert the file to UTF-8; Databend can't detect the character encoding.
  2. Compression (gzip, bzip2) can be supported in the future.

If only UTF-8 is supported, the manual should say so, and the tutorial should also do the conversion: the ontime CSV is ISO-8859-1 and contains non-ASCII characters.

By the way, books_gbk.csv was converted with iconv for testing.
