Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: option to improve Date Sorting by pre-parsing date items only once #1691

Merged
merged 13 commits into from
Sep 28, 2024

Conversation

ghiscoding
Copy link
Owner

@ghiscoding ghiscoding commented Sep 24, 2024

Motivation and Context

It all started from this Angular-Slickgrid, and reproduceable, issue

The current Date Sorting can be extremely slow when dates are string that need to be parsed and converted to real Dates (i.e. US Date Format) before the Sorting process can then work and that is probably something like an O(exp n10). This PR is improving perf to about O(n10) by removing the multiple parsing of the same dates, and so we can lose the negative side effect of an exponential execution. First of all I'm guessing it's about n10 because the default browser .sort() will revisit the same items multiple times, let's say 10 times, to complete its sorting process, so I'm just using n10 as a arbitrary number (in reality this number is probably a lot bigger depending on your dataset size).

Now the fact that we revisit the same items multiple times has the very bad side effect of reparsing the same date items over and over (for example when sorting 100 items, I saw the same item being revisited about 10 times to complete the sorting process and I'm assuming this is a lot worst with much larger dataset). With this PR, we can scale that down to O(n10) by pre-parsing the date items only once and so eliminating the exponential side effect. In other words, the slow part is when we parse string dates into real Date object which is required for the Sort to work. So with current implementation, most of the time spent is for parsing dates from a string to a Date type but that is not the only time lost, the default browser .sort() method which is what we use to sort the data, will revisit the same item dataContext over and over. Because the same item is being revisited many times, it will reparse the same item dates over and over which is extremely bad for performance...

So what can we do to make this faster with a more reasonable time? and by reasonable time I mean that we can get roughly the same perf as sorting a number column (which is currently a lot quicker than sorting a date column). The implementation is explained below.

How is this implemented

Provide a new grid option named preParseDateColumns that can be either a boolean or a string

  • boolean (i.e. parse "start" date string and reassign it as a Date object on the same "start" prop)
  • string (i.e. set to "__", it will parse a "start" date string and assign it as a Date object to a new "__start" prop)

For example, let say that we have this dataset and we set preParseDateColumns: '__' and we have this array

data = [
  { id: 0, start: '02/28/24', finish: '03/02/24' },
  { id: 1, start: '01/14/24', finish: '02/13/24' },
]

What we'll do is that as soon as a date column triggers a Sort, it will then start the pre-parsing execution by looping through each items and parse each item string date(s) and convert them to Date objects and assign it back to the dataset (by mutation), and by doing this pre-parse, we are telling the application to only do this pre-parsing just once which makes it much quicker on further Sort executions because it doesn't have to parse the same dates anymore (sorting Date objects is a lot quicker because a Date object contains a UNIX timestamp that we then use for the Sort and that step is very fast).

So, if we defined preParseDateColumns: '__' grid option (defined as a string), then our updated items array would be the following (where Date below is a Date object of the input dates):

data = [
-  { id: 0, start: '02/28/24', finish: '03/02/24' },
-  { id: 1, start: '01/14/24', finish: '02/13/24' },
+  { id: 0, start: '02/28/24', finish: '03/02/24', __start: Date, __finish: Date },
+  { id: 1, start: '01/14/24', finish: '02/13/24', __start: Date, __finish: Date },
]

However if we defined preParseDateColumns: true (boolean) then our updated items array would rather be:

data = [
-  { id: 0, start: '02/28/24', finish: '03/02/24' },
-  { id: 1, start: '01/14/24', finish: '02/13/24' },
+  { id: 0, start: Date, finish: Date },
+  { id: 1, start: Date, finish: Date },
]

Finals

So what perf do we get out of this? With a dataset of 50k rows with 2 columns start/finish, we get these perf improvements

# before
ASC sort: 15011.368896484375 ms
DESC sort: 4121.112060546875 ms

---

# after code change (becomes 2 steps)
# 1. parse all dataset items
mutate: 1396.18994140625 ms

# 2. sort
ASC sort: 188.916015625 ms
DESC sort: 87.218994140625 ms

So we drop from 15sec to about 1.5sec which is quite a drop and 10x better perf, the other thing to note is that the large waiting time to sort without pre-parsing will be carried on every other sort, however with the new approach, we only need to pre-parse once. Also the other thing to note is that this pre-parsing becomes beneficial with very large dataset because the issue exponential.

Summary

Why does this help? Mainly because the native browser .sort() will revisit the same items many times until they are all sorted (for example, when sorting 100 items, I saw it revisiting some items like 10 times, I could imagine this being a lot more visits when our dataset is 50k or even more) and with the current Slickgrid implementation, it will parse them over and over, and as you can imagine parsing the same items multiple times it is especially bad for performance (exponentially bad if we could say). However, if we parse each of these items only a single time and never more than once (with the code above), then it becomes a lot more performant. This becomes O(n10), however what we currently have is probably closer to an O(exp n10) and a exponential perf is always bad and we should avoid it as much as possible.

Considerations

User might not see much differences with small dataset but will see a huge difference with large dataset because of it's exponential nature. So I think a good thing to do would be to show a console warning if the option is not enabled (because it's not enabled by default) when the dataset is larger than about 5K.

Perf Preview

brave_cePbwkvlKP

TODOs

  • move code from vanilla-grid-bundle to one of the service for reusability, maybe the CollectionService
  • split re-parsing into 2 functions so that we can reparse a single item
    • call single item parse after calling any CRUD and expose it to external user (show in docs)
  • must work with Editors.date with boolean or string as pre-parse setting
  • probably better to parse just before sorting any date column to avoid having to wait even longer when loading the grid
    • the function will be public so that the user can call it at any time if he wants (internally we have a bool flag to know if it already ran)
  • can we use Web Workers? instead, we can simply expose the preParsing function to the external user and let them decide whenever they want to preParse (or keep default preParsing on 1st Date sort like mentioned above)
  • add unit tests
  • add E2E tests
  • add Documentations
  • remove perf console log
  • warnLargeUnparsedDates() for maybe larger than 5K
  • note: this won't work with Grid Preset to preset a date column sort

Copy link

stackblitz bot commented Sep 24, 2024

Review PR in StackBlitz Codeflow Run & review this pull request in StackBlitz Codeflow.

Copy link

codecov bot commented Sep 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.8%. Comparing base (45ff6b0) to head (b83fbc3).
Report is 24 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1691     +/-   ##
========================================
+ Coverage    99.8%   99.8%   +0.1%     
========================================
  Files         187     187             
  Lines       31075   31149     +74     
  Branches     9754    9779     +25     
========================================
+ Hits        30986   31060     +74     
  Misses         89      89             

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ghiscoding ghiscoding marked this pull request as draft September 24, 2024 02:55
@ghiscoding
Copy link
Owner Author

ghiscoding commented Sep 24, 2024

@zewa666 so this is what I came up with, it's a decent improvement and all the reasoning is described above. So for 50k, we can see a drop from 15sec to 1.5sec which is quite decent drop.

Next step, the pre-parsing could be a bit slow on the initial load and a potential improvement could be to use Web Workers to do the work in the background which I'll probably take a look at that next.

For now the grid option above is disabled, if I could get it working with Web Workers (never really done any of that yet) then I might enable it by default OR not since it mutates the dataset, I might be better to just document it instead.

Can this be improved further? I guess we could improve it more by using other type of sort (like quick sort or whatever else) but I'm not sure that we would be able to use that in combo with the native browser .sort() method and so I think it would overcomplicate the code and I would prefer to stick with the default .sort() for that reason. Another side note, most other sort types (quick sort, radix, etc...) are written for number sorting and I'm not even sure it would be possible to deviate from the default .sort() for that reason, so I might as well just go with the above approach from above which is to pre-parse all date items and still get a decent perf gain.

@ghiscoding ghiscoding changed the title perf: improve Date Sorting by optionally pre-parsing date items perf: improve Date Sorting by optionally pre-parsing date items, from O(log n2) to O(n2) Sep 24, 2024
@ghiscoding ghiscoding changed the title perf: improve Date Sorting by optionally pre-parsing date items, from O(log n2) to O(n2) perf: improve Date Sorting by pre-parsing date items, from O(log n2) to O(n2) Sep 24, 2024
@ghiscoding ghiscoding changed the title perf: improve Date Sorting by pre-parsing date items, from O(log n2) to O(n2) perf: improve Date Sorting by pre-parsing date items only once Sep 24, 2024
@zewa666
Copy link
Contributor

zewa666 commented Sep 24, 2024

that looks good. radix could perhaps squeeze out a bit more than bubble sort. alternatively doing the sorting on gpu with stuff like gpu.js might be an option. but 1.5sec sounds absolutely reasonable for me.

@ghiscoding
Copy link
Owner Author

ghiscoding commented Sep 24, 2024

that looks good. radix could perhaps squeeze out a bit more than bubble sort. alternatively doing the sorting on gpu with stuff like gpu.js might be an option. but 1.5sec sounds absolutely reasonable for me.

yeah but the thing is that all of these sort types are made to sort numbers only and that wouldn't work for all grid column types. It would work for Dates since at the end I'm converting the parsed Date to a Unix timestamp for sorting, but still the question is more about, how would I even integrate this with the native DataView.sort()?

/** Sort Method to use by the DataView */
sort(comparer: (a: TData, b: TData) => number, ascending?: boolean): void {
this.sortAsc = ascending;
this.sortComparer = comparer;
if (ascending === false) {
this.items.reverse();
}
this.items.sort(comparer);

But anyway, I think my approach is acceptable as you mentioned, so I might just keep this. This new approach does have "side effect" though, which is to mutate the dataset by adding (or overwriting) extra date properties that is then used to sort but that should be ok, I'm already mutating for other features like Tree Data, Row Detail and barely anyone complained (at least they don't when they understand why I do it, which is mainly for perf reasons). Another thing I have to look into is when we do CRUD, it would need to reparse that row which I can do easily 1 at a time.

@zewa666
Copy link
Contributor

zewa666 commented Sep 24, 2024

but again having those extra steps is acceptable when working with large amounts of data imo.

@ghiscoding
Copy link
Owner Author

ghiscoding commented Sep 24, 2024

if you know how we could connect other sort types in a generic way to the default DataView sort, then I'd be interested to receive a PR for sure. Or did you referenced these "extra steps" as my new approach in this PR?

@zewa666
Copy link
Contributor

zewa666 commented Sep 24, 2024

yeah I meant your approach with preparing the data for easier/faster sorting

@ghiscoding ghiscoding changed the title perf: improve Date Sorting by pre-parsing date items only once perf: option to improve Date Sorting by pre-parsing date items only once Sep 27, 2024
@ghiscoding ghiscoding marked this pull request as ready for review September 28, 2024 06:06
@ghiscoding ghiscoding merged commit c197415 into master Sep 28, 2024
9 checks passed
@ghiscoding ghiscoding deleted the perf/improve-date-sorting-speed branch September 28, 2024 06:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants