-
-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: option to improve Date Sorting by pre-parsing date items only once #1691
Conversation
Run & review this pull request in StackBlitz Codeflow. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1691 +/- ##
========================================
+ Coverage 99.8% 99.8% +0.1%
========================================
Files 187 187
Lines 31075 31149 +74
Branches 9754 9779 +25
========================================
+ Hits 30986 31060 +74
Misses 89 89 ☔ View full report in Codecov by Sentry. |
@zewa666 so this is what I came up with, it's a decent improvement and all the reasoning is described above. So for 50k, we can see a drop from 15sec to 1.5sec which is quite decent drop. Next step, the pre-parsing could be a bit slow on the initial load and a potential improvement could be to use Web Workers to do the work in the background which I'll probably take a look at that next. For now the grid option above is disabled, if I could get it working with Web Workers (never really done any of that yet) then I might enable it by default OR not since it mutates the dataset, I might be better to just document it instead. Can this be improved further? I guess we could improve it more by using other type of sort (like quick sort or whatever else) but I'm not sure that we would be able to use that in combo with the native browser |
that looks good. radix could perhaps squeeze out a bit more than bubble sort. alternatively doing the sorting on gpu with stuff like gpu.js might be an option. but 1.5sec sounds absolutely reasonable for me. |
yeah but the thing is that all of these sort types are made to sort numbers only and that wouldn't work for all grid column types. It would work for Dates since at the end I'm converting the parsed Date to a Unix timestamp for sorting, but still the question is more about, how would I even integrate this with the native slickgrid-universal/packages/common/src/core/slickDataview.ts Lines 330 to 337 in 79d1995
But anyway, I think my approach is acceptable as you mentioned, so I might just keep this. This new approach does have "side effect" though, which is to mutate the dataset by adding (or overwriting) extra date properties that is then used to sort but that should be ok, I'm already mutating for other features like Tree Data, Row Detail and barely anyone complained (at least they don't when they understand why I do it, which is mainly for perf reasons). Another thing I have to look into is when we do CRUD, it would need to reparse that row which I can do easily 1 at a time. |
but again having those extra steps is acceptable when working with large amounts of data imo. |
if you know how we could connect other sort types in a generic way to the default DataView sort, then I'd be interested to receive a PR for sure. Or did you referenced these "extra steps" as my new approach in this PR? |
yeah I meant your approach with preparing the data for easier/faster sorting |
- also display console warning when dataset is larger than 5K items and pre-parse isn't enabled
Motivation and Context
It all started from this Angular-Slickgrid, and reproduceable, issue
The current Date Sorting can be extremely slow when dates are string that need to be parsed and converted to real Dates (i.e. US Date Format) before the Sorting process can then work and that is probably something like an O(exp n10). This PR is improving perf to about O(n10) by removing the multiple parsing of the same dates, and so we can lose the negative side effect of an exponential execution. First of all I'm guessing it's about n10 because the default browser
.sort()
will revisit the same items multiple times, let's say 10 times, to complete its sorting process, so I'm just using n10 as a arbitrary number (in reality this number is probably a lot bigger depending on your dataset size).Now the fact that we revisit the same items multiple times has the very bad side effect of reparsing the same date items over and over (for example when sorting 100 items, I saw the same item being revisited about 10 times to complete the sorting process and I'm assuming this is a lot worst with much larger dataset). With this PR, we can scale that down to O(n10) by pre-parsing the date items only once and so eliminating the exponential side effect. In other words, the slow part is when we parse string dates into real
Date
object which is required for the Sort to work. So with current implementation, most of the time spent is for parsing dates from a string to aDate
type but that is not the only time lost, the default browser.sort()
method which is what we use to sort the data, will revisit the same item dataContext over and over. Because the same item is being revisited many times, it will reparse the same item dates over and over which is extremely bad for performance...So what can we do to make this faster with a more reasonable time? and by reasonable time I mean that we can get roughly the same perf as sorting a number column (which is currently a lot quicker than sorting a date column). The implementation is explained below.
How is this implemented
Provide a new grid option named
preParseDateColumns
that can be either aboolean
or astring
boolean
(i.e. parse "start" date string and reassign it as aDate
object on the same "start" prop)string
(i.e. set to "__", it will parse a "start" date string and assign it as aDate
object to a new "__start" prop)For example, let say that we have this dataset and we set
preParseDateColumns: '__'
and we have this arrayWhat we'll do is that as soon as a date column triggers a Sort, it will then start the pre-parsing execution by looping through each items and parse each item string date(s) and convert them to
Date
objects and assign it back to the dataset (by mutation), and by doing this pre-parse, we are telling the application to only do this pre-parsing just once which makes it much quicker on further Sort executions because it doesn't have to parse the same dates anymore (sortingDate
objects is a lot quicker because aDate
object contains a UNIX timestamp that we then use for the Sort and that step is very fast).So, if we defined
preParseDateColumns: '__'
grid option (defined as a string), then our updated items array would be the following (where Date below is aDate
object of the input dates):However if we defined
preParseDateColumns: true
(boolean) then our updated items array would rather be:Finals
So what perf do we get out of this? With a dataset of 50k rows with 2 columns start/finish, we get these perf improvements
So we drop from 15sec to about 1.5sec which is quite a drop and 10x better perf, the other thing to note is that the large waiting time to sort without pre-parsing will be carried on every other sort, however with the new approach, we only need to pre-parse once. Also the other thing to note is that this pre-parsing becomes beneficial with very large dataset because the issue exponential.
Summary
Why does this help? Mainly because the native browser
.sort()
will revisit the same items many times until they are all sorted (for example, when sorting 100 items, I saw it revisiting some items like 10 times, I could imagine this being a lot more visits when our dataset is 50k or even more) and with the current Slickgrid implementation, it will parse them over and over, and as you can imagine parsing the same items multiple times it is especially bad for performance (exponentially bad if we could say). However, if we parse each of these items only a single time and never more than once (with the code above), then it becomes a lot more performant. This becomes O(n10), however what we currently have is probably closer to an O(exp n10) and a exponential perf is always bad and we should avoid it as much as possible.Considerations
User might not see much differences with small dataset but will see a huge difference with large dataset because of it's exponential nature. So I think a good thing to do would be to show a console warning if the option is not enabled (because it's not enabled by default) when the dataset is larger than about 5K.
Perf Preview
TODOs
can we use Web Workers?instead, we can simply expose the preParsing function to the external user and let them decide whenever they want to preParse (or keep default preParsing on 1st Date sort like mentioned above)