CSV parser does not handle multiline entries #312

atoulme · 2021-11-22T17:18:34Z

The CSV parser cannot handle entries that span multiple lines, which happens when a field contains newline characters.

djaglowski · 2021-11-28T13:21:07Z

@atoulme, do you have any suggestions for how it should behave?

atoulme · 2021-12-24T00:28:56Z

Below is a workaround I have applied for the time being. Ideally, I'd like the csv_parser to do this without the header line having to be removed, and to handle multiline elements implicitly.

    filelog:
        include: [ /output/*.csv ]
        start_at: beginning
        multiline:
            line_start_pattern: "^\"[^\"]"
        operators:
            # remove the header line
            -   id: remove_header
                type: filter
                expr: '$$body matches "^AuthorID,Author,Date,Content,Attachments,Reactions$"'
                output: csv
            # parse each line as a record
            -   id: csv
                type: csv_parser
                header: AuthorID,Author,Date,Content,Attachments,Reactions
                timestamp:
                    parse_from: Date
                    layout_type: epoch
                    layout: s
                    preserve: true

Ideally, I would like the csv_parser to accept a multiline marker for the beginning of an entry.
The one I use right now, "^\"[^\"]", treats a line as the start of a new entry when it begins with a double quote that is not immediately followed by another double quote. CSV escapes double quotes by doubling them, so a continuation line that begins with an escaped quote is not mistaken for a new entry, and this entry would be correctly parsed:

    Header1,Header2
    "foo","
    ""bar"""
    "bob","alice"

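To illustrate how this pattern groups physical lines into entries, here is a minimal Python sketch. It is not the filelog receiver's implementation: `split_entries` is a hypothetical helper, and it assumes that any text before the first matching line (here, the header) is emitted as its own entry.

```python
import re

# The pattern from the workaround: a line that starts with a double quote
# that is not immediately followed by another double quote.
LINE_START = re.compile(r'^"[^"]')


def split_entries(text):
    """Hypothetical helper: start a new entry at each line matching LINE_START."""
    entries = []
    for line in text.splitlines():
        if LINE_START.match(line) or not entries:
            # A new entry begins here (or this is the very first line).
            entries.append(line)
        else:
            # Continuation line: glue it onto the current entry.
            entries[-1] += "\n" + line
    return entries


sample = 'Header1,Header2\n"foo","\n""bar"""\n"bob","alice"\n'
for entry in split_entries(sample):
    print(repr(entry))
```

In this sketch the line `""bar"""` starts with two double quotes, so it does not match the pattern and is appended to the `"foo","` entry instead of starting a new one.
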
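This is not perfect. If the content of the first field itself begins with a double quote, the line starts with three double quotes (the opening quote followed by an escaped quote), so the pattern does not match and the line won't be picked up as a new entry: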

    Header1,Header2
    """foo""","bar"

However, in my case the first item in each line is always a number, so I avoid this condition.
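
For comparison, here is a rough Python sketch of one way a parser could find record boundaries without a start-of-line pattern: keep reading lines until the number of double quotes seen is even, which means every quoted field is closed. This is only an illustration under the assumption of well-formed, RFC 4180 style input; it is not necessarily how the csv_parser does or should work, and `split_records` and `parse_record` are hypothetical names.

```python
import csv
import io
import re

LINE_START = re.compile(r'^"[^"]')  # pattern from the workaround


def split_records(text):
    """Hypothetical helper: close a record once its double quotes balance."""
    records, pending, quotes = [], [], 0
    for line in text.splitlines():
        pending.append(line)
        quotes += line.count('"')
        if quotes % 2 == 0:  # every quoted field opened so far is closed
            records.append("\n".join(pending))
            pending, quotes = [], 0
    if pending:  # unterminated quoted field at end of input
        records.append("\n".join(pending))
    return records


def parse_record(record):
    """Parse one record (possibly multiline) into its list of fields."""
    return next(csv.reader(io.StringIO(record, newline="")), [])


tricky = '"""foo""","bar"'
print(LINE_START.match(tricky))  # None: the workaround pattern misses this line

sample = 'Header1,Header2\n"foo","\n""bar"""\n' + tricky + "\n"
for record in split_records(sample):
    print(parse_record(record))
```

Because an escaped quote is written as two quote characters, it never changes the parity of the count, so only the quotes that open and close a field decide where a record ends.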

djaglowski added the enhancement New feature or request label Jan 4, 2022

This was referenced Mar 9, 2022

Add support for parsing multiline csv records #425

Merged

CSV parser is too eager when reading files being written to #423

Closed

djaglowski closed this as completed in #425 Mar 10, 2022
