-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40891: [JS] Store Dates as TimestampMillisecond #40892
Conversation
js/src/type.ts
Outdated
@@ -406,7 +406,7 @@ type Timestamps = Type.Timestamp | Type.TimestampSecond | Type.TimestampMillisec | |||
/** @ignore */ | |||
interface Timestamp_<T extends Timestamps = Timestamps> extends DataType<T> { | |||
TArray: Int32Array; | |||
TValue: number; | |||
TValue: number | Date; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't the TValue
type be overridden just for the TimestampMillisecond
class below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted to number again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to change the type from number to Date? IIRC it's on users to convert numbers to the corresponding timezone, so returning number | Date
makes the type-safe way to do that more difficult for users.
If we always return numbers, users can convert to Dates (if necessary) themselves. IIRC constructing date instances is relatively expensive esp. if using custom locales, so users should opt-in to that behavior if they need it?
We do construct dates for DateMillisecond already so I figured it would make sense to do the same for TimestampMillisecond.
I did not measure it but wit surprises me a bit since I expected it to just be a lightweight wrapper around timestamps. I'd be okay with just retuning numbers to be fast but also to introduce an iterator for vectors that returns convenient types in JS that may be slower but easier to work with (in a follow up PR). Creating a million Dates takes about 5ms in bun on my machine (M1 Mac). I node, 40ms! Getting 1M Dates from a TimestampMillisecond vector instead of timestamps takes 26 instead of 15ms (in bun on my machine). In node 54ms instead of 14ms. So yeah, much slower in node. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 2caec86. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Merge after apache#40892. This pull request also changes Dates to return timestamps instead of Date instances (similar to Timestamps and for the same reason. * GitHub Issue: apache#40959
Fixes apache#40891 Tested with ```ts const date = new Date("2023-03-29T12:34:56Z"); console.log("original", date) console.log("=> vec") const vec = arrow.vectorFromArray([date]) console.log(vec.toArray()) console.log(vec.toJSON()) console.log(vec.type) console.log(vec.get(0)) console.log("=> vec2") const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond) console.log(vec2.toArray()) console.log(vec.toJSON()) console.log(vec2.type) console.log(vec2.get(0)) console.log("=> table") const table = arrow.tableFromJSON([{ date }]) console.log(table.toArray()) console.log(table.schema.fields[0].type) console.log(table.getChildAt(0)?.get(0)) console.log("=> table2") const table2 = arrow.tableFromIPC(arrow.tableToIPC(table)); console.log(table2.toArray()) console.log(table2.schema.fields[0].type) console.log(table2.getChildAt(0)?.get(0)) console.log("=> table3") const table3 = new arrow.Table({ dates: vec2 }) console.log(table3.toArray()) console.log(table3.schema.fields[0].type) console.log(table3.getChildAt(0)?.get(0)) ``` ``` => table [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] TimestampMillisecond { typeId: 10, unit: 1, timezone: undefined, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Timestamp", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table2 [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] Timestamp_ { typeId: 10, unit: 1, timezone: null, toString: [Function: toString], ArrayType: [class Int32Array], children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table3 [ {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] DateMillisecond { typeId: 8, unit: 1, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Date", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z ``` * GitHub Issue: apache#40891
Merge after apache#40892. This pull request also changes Dates to return timestamps instead of Date instances (similar to Timestamps and for the same reason. * GitHub Issue: apache#40959
Fixes apache#40891 Tested with ```ts const date = new Date("2023-03-29T12:34:56Z"); console.log("original", date) console.log("=> vec") const vec = arrow.vectorFromArray([date]) console.log(vec.toArray()) console.log(vec.toJSON()) console.log(vec.type) console.log(vec.get(0)) console.log("=> vec2") const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond) console.log(vec2.toArray()) console.log(vec.toJSON()) console.log(vec2.type) console.log(vec2.get(0)) console.log("=> table") const table = arrow.tableFromJSON([{ date }]) console.log(table.toArray()) console.log(table.schema.fields[0].type) console.log(table.getChildAt(0)?.get(0)) console.log("=> table2") const table2 = arrow.tableFromIPC(arrow.tableToIPC(table)); console.log(table2.toArray()) console.log(table2.schema.fields[0].type) console.log(table2.getChildAt(0)?.get(0)) console.log("=> table3") const table3 = new arrow.Table({ dates: vec2 }) console.log(table3.toArray()) console.log(table3.schema.fields[0].type) console.log(table3.getChildAt(0)?.get(0)) ``` ``` => table [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] TimestampMillisecond { typeId: 10, unit: 1, timezone: undefined, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Timestamp", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table2 [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] Timestamp_ { typeId: 10, unit: 1, timezone: null, toString: [Function: toString], ArrayType: [class Int32Array], children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table3 [ {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] DateMillisecond { typeId: 8, unit: 1, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Date", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z ``` * GitHub Issue: apache#40891
Merge after apache#40892. This pull request also changes Dates to return timestamps instead of Date instances (similar to Timestamps and for the same reason. * GitHub Issue: apache#40959
Fixes apache#40891 Tested with ```ts const date = new Date("2023-03-29T12:34:56Z"); console.log("original", date) console.log("=> vec") const vec = arrow.vectorFromArray([date]) console.log(vec.toArray()) console.log(vec.toJSON()) console.log(vec.type) console.log(vec.get(0)) console.log("=> vec2") const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond) console.log(vec2.toArray()) console.log(vec.toJSON()) console.log(vec2.type) console.log(vec2.get(0)) console.log("=> table") const table = arrow.tableFromJSON([{ date }]) console.log(table.toArray()) console.log(table.schema.fields[0].type) console.log(table.getChildAt(0)?.get(0)) console.log("=> table2") const table2 = arrow.tableFromIPC(arrow.tableToIPC(table)); console.log(table2.toArray()) console.log(table2.schema.fields[0].type) console.log(table2.getChildAt(0)?.get(0)) console.log("=> table3") const table3 = new arrow.Table({ dates: vec2 }) console.log(table3.toArray()) console.log(table3.schema.fields[0].type) console.log(table3.getChildAt(0)?.get(0)) ``` ``` => table [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] TimestampMillisecond { typeId: 10, unit: 1, timezone: undefined, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Timestamp", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table2 [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] Timestamp_ { typeId: 10, unit: 1, timezone: null, toString: [Function: toString], ArrayType: [class Int32Array], children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table3 [ {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] DateMillisecond { typeId: 8, unit: 1, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Date", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z ``` * GitHub Issue: apache#40891
Merge after apache#40892. This pull request also changes Dates to return timestamps instead of Date instances (similar to Timestamps and for the same reason. * GitHub Issue: apache#40959
Fixes apache#40891 Tested with ```ts const date = new Date("2023-03-29T12:34:56Z"); console.log("original", date) console.log("=> vec") const vec = arrow.vectorFromArray([date]) console.log(vec.toArray()) console.log(vec.toJSON()) console.log(vec.type) console.log(vec.get(0)) console.log("=> vec2") const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond) console.log(vec2.toArray()) console.log(vec.toJSON()) console.log(vec2.type) console.log(vec2.get(0)) console.log("=> table") const table = arrow.tableFromJSON([{ date }]) console.log(table.toArray()) console.log(table.schema.fields[0].type) console.log(table.getChildAt(0)?.get(0)) console.log("=> table2") const table2 = arrow.tableFromIPC(arrow.tableToIPC(table)); console.log(table2.toArray()) console.log(table2.schema.fields[0].type) console.log(table2.getChildAt(0)?.get(0)) console.log("=> table3") const table3 = new arrow.Table({ dates: vec2 }) console.log(table3.toArray()) console.log(table3.schema.fields[0].type) console.log(table3.getChildAt(0)?.get(0)) ``` ``` => table [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] TimestampMillisecond { typeId: 10, unit: 1, timezone: undefined, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Timestamp", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table2 [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] Timestamp_ { typeId: 10, unit: 1, timezone: null, toString: [Function: toString], ArrayType: [class Int32Array], children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table3 [ {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] DateMillisecond { typeId: 8, unit: 1, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Date", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z ``` * GitHub Issue: apache#40891
Merge after apache#40892. This pull request also changes Dates to return timestamps instead of Date instances (similar to Timestamps and for the same reason. * GitHub Issue: apache#40959
Fixes apache#40891 Tested with ```ts const date = new Date("2023-03-29T12:34:56Z"); console.log("original", date) console.log("=> vec") const vec = arrow.vectorFromArray([date]) console.log(vec.toArray()) console.log(vec.toJSON()) console.log(vec.type) console.log(vec.get(0)) console.log("=> vec2") const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond) console.log(vec2.toArray()) console.log(vec.toJSON()) console.log(vec2.type) console.log(vec2.get(0)) console.log("=> table") const table = arrow.tableFromJSON([{ date }]) console.log(table.toArray()) console.log(table.schema.fields[0].type) console.log(table.getChildAt(0)?.get(0)) console.log("=> table2") const table2 = arrow.tableFromIPC(arrow.tableToIPC(table)); console.log(table2.toArray()) console.log(table2.schema.fields[0].type) console.log(table2.getChildAt(0)?.get(0)) console.log("=> table3") const table3 = new arrow.Table({ dates: vec2 }) console.log(table3.toArray()) console.log(table3.schema.fields[0].type) console.log(table3.getChildAt(0)?.get(0)) ``` ``` => table [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] TimestampMillisecond { typeId: 10, unit: 1, timezone: undefined, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Timestamp", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table2 [ {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] Timestamp_ { typeId: 10, unit: 1, timezone: null, toString: [Function: toString], ArrayType: [class Int32Array], children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z => table3 [ {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)} ] DateMillisecond { typeId: 8, unit: 1, toString: [Function: toString], ArrayType: [class Int32Array], [Symbol(Symbol.toStringTag)]: "Date", children: null, OffsetArrayType: [class Int32Array], } 2023-03-29T12:34:56.000Z ``` * GitHub Issue: apache#40891
Merge after apache#40892. This pull request also changes Dates to return timestamps instead of Date instances (similar to Timestamps and for the same reason. * GitHub Issue: apache#40959
Fixes #40891
Tested with