perf: option to improve Date Sorting by pre-parsing date items only once #1691

ghiscoding · 2024-09-24T02:37:49Z

Motivation and Context

It all started from this Angular-Slickgrid, and reproduceable, issue

The current Date Sorting can be extremely slow when dates are string that need to be parsed and converted to real Dates (i.e. US Date Format) before the Sorting process can then work and that is probably something like an O(exp n10). This PR is improving perf to about O(n10) by removing the multiple parsing of the same dates, and so we can lose the negative side effect of an exponential execution. First of all I'm guessing it's about n10 because the default browser .sort() will revisit the same items multiple times, let's say 10 times, to complete its sorting process, so I'm just using n10 as a arbitrary number (in reality this number is probably a lot bigger depending on your dataset size).

Now the fact that we revisit the same items multiple times has the very bad side effect of reparsing the same date items over and over (for example when sorting 100 items, I saw the same item being revisited about 10 times to complete the sorting process and I'm assuming this is a lot worst with much larger dataset). With this PR, we can scale that down to O(n10) by pre-parsing the date items only once and so eliminating the exponential side effect. In other words, the slow part is when we parse string dates into real Date object which is required for the Sort to work. So with current implementation, most of the time spent is for parsing dates from a string to a Date type but that is not the only time lost, the default browser .sort() method which is what we use to sort the data, will revisit the same item dataContext over and over. Because the same item is being revisited many times, it will reparse the same item dates over and over which is extremely bad for performance...

So what can we do to make this faster with a more reasonable time? and by reasonable time I mean that we can get roughly the same perf as sorting a number column (which is currently a lot quicker than sorting a date column). The implementation is explained below.

How is this implemented

Provide a new grid option named preParseDateColumns that can be either a boolean or a string

boolean (i.e. parse "start" date string and reassign it as a Date object on the same "start" prop)
string (i.e. set to "__", it will parse a "start" date string and assign it as a Date object to a new "__start" prop)

For example, let say that we have this dataset and we set preParseDateColumns: '__' and we have this array

data = [
  { id: 0, start: '02/28/24', finish: '03/02/24' },
  { id: 1, start: '01/14/24', finish: '02/13/24' },
]

What we'll do is that as soon as a date column triggers a Sort, it will then start the pre-parsing execution by looping through each items and parse each item string date(s) and convert them to Date objects and assign it back to the dataset (by mutation), and by doing this pre-parse, we are telling the application to only do this pre-parsing just once which makes it much quicker on further Sort executions because it doesn't have to parse the same dates anymore (sorting Date objects is a lot quicker because a Date object contains a UNIX timestamp that we then use for the Sort and that step is very fast).

So, if we defined preParseDateColumns: '__' grid option (defined as a string), then our updated items array would be the following (where Date below is a Date object of the input dates):

data = [
-  { id: 0, start: '02/28/24', finish: '03/02/24' },
-  { id: 1, start: '01/14/24', finish: '02/13/24' },
+  { id: 0, start: '02/28/24', finish: '03/02/24', __start: Date, __finish: Date },
+  { id: 1, start: '01/14/24', finish: '02/13/24', __start: Date, __finish: Date },
]

However if we defined preParseDateColumns: true (boolean) then our updated items array would rather be:

data = [
-  { id: 0, start: '02/28/24', finish: '03/02/24' },
-  { id: 1, start: '01/14/24', finish: '02/13/24' },
+  { id: 0, start: Date, finish: Date },
+  { id: 1, start: Date, finish: Date },
]

Finals

So what perf do we get out of this? With a dataset of 50k rows with 2 columns start/finish, we get these perf improvements

# before
ASC sort: 15011.368896484375 ms
DESC sort: 4121.112060546875 ms

---

# after code change (becomes 2 steps)
# 1. parse all dataset items
mutate: 1396.18994140625 ms

# 2. sort
ASC sort: 188.916015625 ms
DESC sort: 87.218994140625 ms

So we drop from 15sec to about 1.5sec which is quite a drop and 10x better perf, the other thing to note is that the large waiting time to sort without pre-parsing will be carried on every other sort, however with the new approach, we only need to pre-parse once. Also the other thing to note is that this pre-parsing becomes beneficial with very large dataset because the issue exponential.

Summary

Why does this help? Mainly because the native browser .sort() will revisit the same items many times until they are all sorted (for example, when sorting 100 items, I saw it revisiting some items like 10 times, I could imagine this being a lot more visits when our dataset is 50k or even more) and with the current Slickgrid implementation, it will parse them over and over, and as you can imagine parsing the same items multiple times it is especially bad for performance (exponentially bad if we could say). However, if we parse each of these items only a single time and never more than once (with the code above), then it becomes a lot more performant. This becomes O(n10), however what we currently have is probably closer to an O(exp n10) and a exponential perf is always bad and we should avoid it as much as possible.

Considerations

User might not see much differences with small dataset but will see a huge difference with large dataset because of it's exponential nature. So I think a good thing to do would be to show a console warning if the option is not enabled (because it's not enabled by default) when the dataset is larger than about 5K.

Perf Preview

TODOs

stackblitz · 2024-09-24T02:37:52Z

Run & review this pull request in StackBlitz Codeflow.

codecov · 2024-09-24T02:41:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.8%. Comparing base (45ff6b0) to head (b83fbc3).
Report is 24 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #1691     +/-   ##
========================================
+ Coverage    99.8%   99.8%   +0.1%     
========================================
  Files         187     187             
  Lines       31075   31149     +74     
  Branches     9754    9779     +25     
========================================
+ Hits        30986   31060     +74     
  Misses         89      89

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ghiscoding · 2024-09-24T03:03:35Z

@zewa666 so this is what I came up with, it's a decent improvement and all the reasoning is described above. So for 50k, we can see a drop from 15sec to 1.5sec which is quite decent drop.

Next step, the pre-parsing could be a bit slow on the initial load and a potential improvement could be to use Web Workers to do the work in the background which I'll probably take a look at that next.

For now the grid option above is disabled, if I could get it working with Web Workers (never really done any of that yet) then I might enable it by default OR not since it mutates the dataset, I might be better to just document it instead.

Can this be improved further? I guess we could improve it more by using other type of sort (like quick sort or whatever else) but I'm not sure that we would be able to use that in combo with the native browser .sort() method and so I think it would overcomplicate the code and I would prefer to stick with the default .sort() for that reason. Another side note, most other sort types (quick sort, radix, etc...) are written for number sorting and I'm not even sure it would be possible to deviate from the default .sort() for that reason, so I might as well just go with the above approach from above which is to pre-parse all date items and still get a decent perf gain.

zewa666 · 2024-09-24T14:52:24Z

that looks good. radix could perhaps squeeze out a bit more than bubble sort. alternatively doing the sorting on gpu with stuff like gpu.js might be an option. but 1.5sec sounds absolutely reasonable for me.

ghiscoding · 2024-09-24T15:00:05Z

that looks good. radix could perhaps squeeze out a bit more than bubble sort. alternatively doing the sorting on gpu with stuff like gpu.js might be an option. but 1.5sec sounds absolutely reasonable for me.

yeah but the thing is that all of these sort types are made to sort numbers only and that wouldn't work for all grid column types. It would work for Dates since at the end I'm converting the parsed Date to a Unix timestamp for sorting, but still the question is more about, how would I even integrate this with the native DataView.sort()?

slickgrid-universal/packages/common/src/core/slickDataview.ts

Lines 330 to 337 in 79d1995

    
           /** Sort Method to use by the DataView */ 
        
           sort(comparer: (a: TData, b: TData) => number, ascending?: boolean): void { 
        
             this.sortAsc = ascending; 
        
             this.sortComparer = comparer; 
        
             if (ascending === false) { 
        
               this.items.reverse(); 
        
             } 
        
             this.items.sort(comparer);

But anyway, I think my approach is acceptable as you mentioned, so I might just keep this. This new approach does have "side effect" though, which is to mutate the dataset by adding (or overwriting) extra date properties that is then used to sort but that should be ok, I'm already mutating for other features like Tree Data, Row Detail and barely anyone complained (at least they don't when they understand why I do it, which is mainly for perf reasons). Another thing I have to look into is when we do CRUD, it would need to reparse that row which I can do easily 1 at a time.

zewa666 · 2024-09-24T15:02:24Z

but again having those extra steps is acceptable when working with large amounts of data imo.

ghiscoding · 2024-09-24T15:20:58Z

if you know how we could connect other sort types in a generic way to the default DataView sort, then I'd be interested to receive a PR for sure. Or did you referenced these "extra steps" as my new approach in this PR?

zewa666 · 2024-09-24T17:48:44Z

yeah I meant your approach with preparing the data for easier/faster sorting

- also display console warning when dataset is larger than 5K items and pre-parse isn't enabled

perf: improve Date Sorting by optionally pre-parsing date items

6678139

ghiscoding marked this pull request as draft September 24, 2024 02:55

ghiscoding changed the title ~~perf: improve Date Sorting by optionally pre-parsing date items~~ perf: improve Date Sorting by optionally pre-parsing date items, from O(log n2) to O(n2) Sep 24, 2024

ghiscoding changed the title ~~perf: improve Date Sorting by optionally pre-parsing date items, from O(log n2) to O(n2)~~ perf: improve Date Sorting by pre-parsing date items, from O(log n2) to O(n2) Sep 24, 2024

ghiscoding changed the title ~~perf: improve Date Sorting by pre-parsing date items, from O(log n2) to O(n2)~~ perf: improve Date Sorting by pre-parsing date items only once Sep 24, 2024

chore: move preParseDateItems() to CollectionService

eb5a4f4

ghiscoding mentioned this pull request Sep 25, 2024

fix: Date parsing shouldn't fail & dateIso type should accept time #1695

Merged

ghiscoding added 7 commits September 25, 2024 19:42

Merge branch 'master' into perf/improve-date-sorting-speed

a6797dd

chore: add unit test coverage

885c124

chore: remove unused import

218fa77

chore: add missing unit test & remove grid options usage in collection

e8a0d67

chore: pre-parse dates only on first date column sort

db99343

chore: rename preParse function in unit tests

a10b0e1

Merge branch 'master' into perf/improve-date-sorting-speed

8310e5d

ghiscoding changed the title ~~perf: improve Date Sorting by pre-parsing date items only once~~ perf: option to improve Date Sorting by pre-parsing date items only once Sep 27, 2024

ghiscoding added 4 commits September 27, 2024 21:41

chore: get rid of jest-extended by using built-in matchers

e8585cc

docs: add pre-parse dates documentation

498fecc

chore: pre-parse date strings on 1st date column sort call

a635ba1

- also display console warning when dataset is larger than 5K items and pre-parse isn't enabled

chore: add Cypress tests for pre-parsing of dates

b83fbc3

ghiscoding marked this pull request as ready for review September 28, 2024 06:06

ghiscoding merged commit c197415 into master Sep 28, 2024
9 checks passed

ghiscoding deleted the perf/improve-date-sorting-speed branch September 28, 2024 06:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: option to improve Date Sorting by pre-parsing date items only once #1691

perf: option to improve Date Sorting by pre-parsing date items only once #1691

ghiscoding commented Sep 24, 2024 •

edited

Loading

stackblitz bot commented Sep 24, 2024

codecov bot commented Sep 24, 2024 •

edited

Loading

ghiscoding commented Sep 24, 2024 •

edited

Loading

zewa666 commented Sep 24, 2024

ghiscoding commented Sep 24, 2024 •

edited

Loading

zewa666 commented Sep 24, 2024

ghiscoding commented Sep 24, 2024 •

edited

Loading

zewa666 commented Sep 24, 2024

perf: option to improve Date Sorting by pre-parsing date items only once #1691

perf: option to improve Date Sorting by pre-parsing date items only once #1691

Conversation

ghiscoding commented Sep 24, 2024 • edited Loading

Motivation and Context

How is this implemented

Finals

Summary

Considerations

Perf Preview

TODOs

stackblitz bot commented Sep 24, 2024

codecov bot commented Sep 24, 2024 • edited Loading

Codecov Report

ghiscoding commented Sep 24, 2024 • edited Loading

zewa666 commented Sep 24, 2024

ghiscoding commented Sep 24, 2024 • edited Loading

zewa666 commented Sep 24, 2024

ghiscoding commented Sep 24, 2024 • edited Loading

zewa666 commented Sep 24, 2024

ghiscoding commented Sep 24, 2024 •

edited

Loading

codecov bot commented Sep 24, 2024 •

edited

Loading

ghiscoding commented Sep 24, 2024 •

edited

Loading

ghiscoding commented Sep 24, 2024 •

edited

Loading

ghiscoding commented Sep 24, 2024 •

edited

Loading