You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've been doing some recent work transforming data using Zed to prepare it for time-series plotting. The attached sample data fuel.csv illustrates the challenge.
$ cat fuel.csv
SURVEY_DATE,PRODUCT_ID,PRODUCT_NAME,PRICE,VAT,EXCISE,NET,CHANGE
2005-01-03,1,"Euro-Super 95",1115.75,185.96,558.64,371.15,-1.57
2005-01-03,2,"Automotive gas oil",1018.28,169.71,403.21,445.36,-0.33
2005-01-03,3,"Heating gas oil",948.5,158.08,403.21,387.21,-22.55
2005-01-03,5,LPG,552.5,92.08,156.62,303.8,0.22
2005-01-03,6,"Residual fuel oil",553.25,50.3,166.84,336.11,-12.21
2005-01-03,8,"Heavy fuel oil",229.52,0,31.39,198.13,-5.37
2005-01-10,1,"Euro-Super 95",1088,181.33,558.64,348.03,-27.75
2005-01-10,2,"Automotive gas oil",1004.39,167.4,403.21,433.78,-13.89
2005-01-10,3,"Heating gas oil",947.94,157.99,403.21,386.74,-0.56
2005-01-10,5,LPG,552.57,92.09,156.62,303.86,0.07
2005-01-10,6,"Residual fuel oil",554.22,50.38,166.84,337,0.97
2005-01-10,8,"Heavy fuel oil",238.37,0,31.39,206.98,8.85
As we can see, the SURVEY_DATE column repeats six times, once for each category of fuel. However, many time-series tools have a much easier time reading such data if a "wide" schema is used with a single time/date stamp and multiple metrics per row. So in this case if we wanted to isolate just the PRICE data, we'd prefer it to look something like:
$ zq -version
Version: v1.7.0-2-g36dd506e
$ zq -f csv '
map(|{PRODUCT_NAME:PRICE}|) by SURVEY_DATE
| over map with d=SURVEY_DATE => (
yield {key:[key],value}
| collect(this)
| unflatten(this)
| put SURVEY_DATE:=d
)' fuel.csv
LPG,Euro-Super 95,Heavy fuel oil,Heating gas oil,Residual fuel oil,Automotive gas oil,SURVEY_DATE
552.5,1115.75,229.52,948.5,553.25,1018.28,2005-01-03
552.57,1088,238.37,947.94,554.22,1004.39,2005-01-10
That multi-line Zed idiom has been handy and I've since used it in multiple contexts (e.g., the https://github.com/brimdata/grafana-zed-datasource README). However, it's not the most self-documenting thing and I'm concerned that it might make the Zed language look needlessly complex to new users. Therefore I'd propose an aggregate function that provides this functionality. As a starting idea for a name, I'd propose widen() since the end data format is sometimes referred to as a "wide schema" in time-series materials.
It has also been noted by the team that as an alternative to delivering this in a purpose-built aggregation function, we could wrap the idiom in a parameterized subgraph (#4152).
The text was updated successfully, but these errors were encountered:
I've been doing some recent work transforming data using Zed to prepare it for time-series plotting. The attached sample data fuel.csv illustrates the challenge.
As we can see, the
SURVEY_DATE
column repeats six times, once for each category of fuel. However, many time-series tools have a much easier time reading such data if a "wide" schema is used with a single time/date stamp and multiple metrics per row. So in this case if we wanted to isolate just thePRICE
data, we'd prefer it to look something like:That multi-line Zed idiom has been handy and I've since used it in multiple contexts (e.g., the https://github.com/brimdata/grafana-zed-datasource README). However, it's not the most self-documenting thing and I'm concerned that it might make the Zed language look needlessly complex to new users. Therefore I'd propose an aggregate function that provides this functionality. As a starting idea for a name, I'd propose
widen()
since the end data format is sometimes referred to as a "wide schema" in time-series materials.It has also been noted by the team that as an alternative to delivering this in a purpose-built aggregation function, we could wrap the idiom in a parameterized subgraph (#4152).
The text was updated successfully, but these errors were encountered: