This repository has been archived by the owner on May 25, 2022. It is now read-only.

CSV parser does not handle multiline entries #312

Closed
atoulme opened this issue Nov 22, 2021 · 2 comments · Fixed by #425
Labels
enhancement New feature or request

Comments

@atoulme

atoulme commented Nov 22, 2021

The CSV parser is unable to handle CSV entries that span multiple lines, when a field of the CSV contains newline characters.

@djaglowski
Member

@atoulme, do you have any suggestions for how it should behave?

@atoulme
Author

atoulme commented Dec 24, 2021

Below is a workaround I have applied for the time being. Ideally, I'd like the csv_parser to do this without my having to remove the header line, and to handle multiline elements implicitly.

    filelog:
        include: [ /output/*.csv ]
        start_at: beginning
        multiline:
            line_start_pattern: "^\"[^\"]"
        operators:
            # remove the header line
            -   id: remove_header
                type: filter
                expr: '$$body matches "^AuthorID,Author,Date,Content,Attachments,Reactions$"'
                output: csv
            # parse each line as a record
            -   id: csv
                type: csv_parser
                header: AuthorID,Author,Date,Content,Attachments,Reactions
                timestamp:
                    parse_from: Date
                    layout_type: epoch
                    layout: s
                    preserve: true

Ideally, I would like the csv_parser to accept a multiline marker for the beginning of an entry.
The pattern I use right now, "^\"[^\"]", treats a line as the start of a new entry when it begins with a double quote that is not immediately followed by another double quote. CSV escapes double quotes by doubling them, so this entry would be correctly parsed:

    Header1,Header2
    "foo","
    ""bar"""
    "bob","alice"
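
To make the splitting concrete, here is a rough Python sketch that emulates what the `line_start_pattern` above does to this sample. It is not the filelog implementation; `split_entries` is a hypothetical helper, and the assumption that text before the first match comes out as its own entry is what motivates the header filter in the config.

```python
import csv
import io
import re

# Same pattern as line_start_pattern in the config: a double quote at the
# start of a line that is not followed by another double quote.
LINE_START = re.compile(r'^"[^"]', re.MULTILINE)


def split_entries(text):
    """Split raw text into entries, each beginning where LINE_START matches.

    Assumption: text before the first match (here, the header line) is
    returned as its own leading entry.
    """
    if not text:
        return []
    starts = [m.start() for m in LINE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b].rstrip("\n") for a, b in zip(bounds, bounds[1:])]


sample = 'Header1,Header2\n"foo","\n""bar"""\n"bob","alice"\n'
entries = split_entries(sample)

# Parse every entry except the header as one CSV record.
rows = [next(csv.reader(io.StringIO(e, newline=""))) for e in entries[1:]]
for row in rows:
    print(row)
```

The line `""bar"""` does not match the pattern, so it stays attached to the `"foo"` entry, and the CSV reader then sees the embedded newline as part of the second field.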

This is not perfect. If the first field of an entry itself begins with a double quote, the line starts with an escaped quote and won't be picked up as a new entry:

    Header1,Header2
    """foo""","bar"

However, in my case the first item in each line is always a number, so I avoid this condition.
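
The limitation can be checked with a small Python sketch. The second function is a hypothetical alternative, not something the csv_parser is known to do: it assumes well-formed CSV, where every complete record contains an even number of double quotes, and joins lines until the count balances. The `"123","hello"` line is made-up data standing in for a record whose first item is a number.

```python
import re

# The workaround's pattern, applied to one line at a time.
LINE_START = re.compile(r'^"[^"]')


def matches_line_start(line):
    """True if the workaround's pattern would treat this line as a new entry."""
    return LINE_START.match(line) is not None


def group_by_quote_parity(lines):
    """Hypothetical alternative: join lines until the double quotes balance.

    An odd running count of double quotes means a quoted field is still
    open, so the next line belongs to the same record.
    """
    records, pending = [], []
    for line in lines:
        pending.append(line)
        if "\n".join(pending).count('"') % 2 == 0:
            records.append("\n".join(pending))
            pending = []
    if pending:  # unterminated quoted field at end of input
        records.append("\n".join(pending))
    return records


# A first field that begins with an escaped quote is not recognized.
missed = matches_line_start('"""foo""","bar"')
# A quoted number as the first item is recognized (made-up sample line).
numeric_first = matches_line_start('"123","hello"')

lines = ['Header1,Header2', '"""foo""","bar"', '"foo","', '""bar"""']
records = group_by_quote_parity(lines)
print(missed, numeric_first, records)
```

Counting quotes does not depend on how the first field begins, but it only holds if every quote in the data is either a field delimiter or a doubled escape.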
