Invoke fuse to ensure output of heterogeneous csv #1271

Closed
mccanne opened this issue Sep 14, 2020 · 1 comment · Fixed by #1908

@mccanne
Collaborator

mccanne commented Sep 14, 2020

The CSV writer should be able to output zng data that contains different record types. Including the new fuse processor at the end of the ZQL pipeline ensures this is possible today. However, fuse requires making two passes through the data, which has a performance cost and delays the immediate stream of output. Power users who are confident their data already conforms to a single record definition may want to avoid this penalty.
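To make the two-pass cost concrete, here is a minimal Python sketch of the idea, not zq's actual implementation. It assumes records are plain dicts and that the fused header is simply the union of field names in first-seen order; the names `fuse_columns` and `write_fused_csv` are hypothetical. Nothing can be written until the first pass has seen every record, which is why output is delayed.

```python
import csv
import io


def fuse_columns(records):
    """Pass 1: collect the union of field names, in first-seen order."""
    columns = []
    seen = set()
    for record in records:
        for name in record:
            if name not in seen:
                seen.add(name)
                columns.append(name)
    return columns


def write_fused_csv(records, out):
    """Pass 2: write every record under the fused header.

    `records` must be re-iterable (for example a list) because it is read twice.
    """
    columns = fuse_columns(records)
    # Fields a record lacks are written as empty cells (restval="").
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Two records of different shapes, loosely modeled on the stats and pe logs.
records = [
    {"_path": "stats", "ts": "2018-03-24T17:15:20.600725Z", "peer": "zeek", "mem": 74},
    {"_path": "pe", "ts": "2018-03-24T17:15:54.475076Z", "machine": "I386"},
]
buf = io.StringIO()
write_fused_csv(records, buf)
print(buf.getvalue(), end="")
```

The column ordering here is an assumption of the sketch; the real fuse processor may order the fused fields differently.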

As a group we discussed adding a flag to control this behavior when CSV output format is requested. In one mode, fuse would always be added to the pipeline implicitly, even if the user didn't request it, ensuring successful CSV output no matter what. There was consensus that the Brim app would use this mode for CSV export. The other mode would follow the current behavior: only a single pass is made through the data, and output stops as soon as a record is encountered in the stream that doesn't match the schema of the header already printed, at which point the user sees a message that effectively tells them to rework their query or explicitly add fuse. There still seemed to be some room for debate on whether zq at the command line should also default to the "always fuse" behavior planned for the Brim app or whether the zq default should be this more "power user" mode.
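The single-pass mode described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not zq's code: records are plain dicts, a "schema" is reduced to the ordered list of field names (real record types also carry field types), and `write_strict_csv` and `NonUniformRecords` are hypothetical names. The error text is the message zq printed in this situation.

```python
import csv
import io


class NonUniformRecords(Exception):
    """Raised when a record does not match the header already written."""


def write_strict_csv(records, out):
    """Single pass: take the header from the first record, stop at the first mismatch."""
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for record in records:
        names = list(record)
        if header is None:
            # The first record fixes the header; it is written immediately.
            header = names
            writer.writerow(header)
        elif names != header:
            raise NonUniformRecords(
                "csv output requires uniform records but different types encountered"
            )
        writer.writerow([record[name] for name in header])


# A one-shot iterator works here because the data is read only once.
stream = iter([
    {"_path": "stats", "peer": "zeek"},
    {"_path": "pe", "machine": "I386"},
])
buf = io.StringIO()
message = None
try:
    write_strict_csv(stream, buf)
except NonUniformRecords as err:
    message = str(err)
print(buf.getvalue(), end="")
print(message)
```

Because the header is committed after the first record, rows can stream out immediately, but anything already written cannot be widened when a differently shaped record arrives later.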

brim-bot pushed a commit to brimdata/zui that referenced this issue Sep 17, 2020
…by philrz

This is an auto-generated commit with a zq dependency update. The zq PR
brimdata/zed#1300, authored by @philrz,
has been merged.

Output format changes: Add "csv", remove "types"

While verifying brimdata/zed#1237, I noticed that CSV is not yet listed among the output formats. I wondered if maybe we were intentionally holding off on revealing it until we address brimdata/zed#1271, but it seems useful enough in its present form that I'm proposing here that we reveal it now.

I'd also recalled seeing @mccanne mention recently that `types` was removed as an output format. Indeed, as of `zq` commit `4bce00d`:

```
$ zq -version
Version: v0.21.0-27-g4bce00d
```

Therefore I'm also taking that out while I'm at it.
@philrz philrz changed the title heterogeneous csv Invoke fuse to ensure output of heterogeneous csv Sep 22, 2020
@philrz philrz added this to the Brim v0.19.0 milestone Sep 22, 2020
@philrz philrz modified the milestones: Brim v0.19.0, Brim v0.20.0 Sep 28, 2020
@nwt nwt self-assigned this Nov 18, 2020
@philrz philrz modified the milestones: Brim v0.20.0, Brim v0.21.0 Dec 2, 2020
@philrz philrz modified the milestones: Brim v0.21.0, Brim v0.22.0 Dec 17, 2020
brim-bot pushed a commit to brimdata/zui that referenced this issue Dec 30, 2020
This is an auto-generated commit with a zq dependency update. The zq PR
brimdata/zed#1880, authored by @nwt,
has been merged.

Extract Fuser from proc/fuse.Proc

Groundwork for brimdata/zed#1271.
@nwt nwt closed this as completed in #1908 Dec 30, 2020
brim-bot pushed a commit to brimdata/zui that referenced this issue Dec 30, 2020
This is an auto-generated commit with a zq dependency update. The zq PR
brimdata/zed#1908, authored by @nwt,
has been merged.

Invoke fuse for CSV output

Send records through a `fuse.Fuser` for
1. `/search?format=csv` in `zqd`
2. `zq -f csv` unless `-csvfuse=false`

Depends on brimdata/zed#1880.

Closes brimdata/zed#1271.
@philrz
Contributor

philrz commented Dec 31, 2020

Verified in zq commit 1e501e85.

Circling back to the previous behavior, in the last GA zq release tagged v0.26.0, invoking -f csv with heterogeneous data halted the output. Using the zq-sample-data:

```
$ zq -version
Version: v0.26.0

$ zq -f csv pe.log.gz stats.log.gz
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered
```

Now with the benefit of the enhancement, the output proceeds to completion.

```
$ zq -version
Version: v0.26.0-25-g1e501e85

$ zq -f csv pe.log.gz stats.log.gz
_path,ts,id,machine,compile_ts,os,subsystem,is_exe,is_64bit,uses_aslr,uses_dep,uses_code_integrity,uses_seh,has_import_table,has_export_table,has_cert_table,has_debug_data,section_names,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,,,,,,,,,,,,,,,,,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
pe,2018-03-24T17:15:54.475076Z,FC6cOXTjuh6OdYwu5,I386,2010-07-12T21:46:18Z,Windows 95 or NT 4.0,WINDOWS_GUI,T,F,F,F,F,T,T,F,F,F,".text,.data,.rdata,.bss,.idata",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:37.127059Z,FBRCYv3eG0d8TEWHy9,AMD64,2011-02-10T08:03:04Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,T,F,T,T,T,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:37.068955Z,FD0dMoWO5DNexKwNb,AMD64,2010-11-20T09:45:06Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,T,F,T,T,F,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:36.94092Z,F4ehaa3sa8zKtCWcF9,AMD64,2010-11-20T09:45:00Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,F,F,T,T,F,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
...
stats,2018-03-24T17:35:20.601137Z,,,,,,,,,,,,,,,,,zeek,282,5467567,3398705931,,,,1535999,1535998,4239,146,305,193639,4731,2510,879701,25895,35230,88,6,0,455128,0,0,0
```

And if the user is confident their data conforms to a single record definition and hence wants to avoid the two passes through the data needed to guarantee a fused schema, they can invoke the -csvfuse=false option, which in this case once again halts the output as we saw before.

```
$ zq -f csv -csvfuse=false pe.log.gz stats.log.gz
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered
```

Thanks @nwt!
