
improve performance of count_data #728

Closed
wants to merge 3 commits into from

Conversation

ZhongChaoqiang
Contributor

@ZhongChaoqiang commented Apr 28, 2021

What problem does this PR solve?

When we precisely count the data in a large table, it can take minutes or even hours.

What is changed and how does it work?

Actually, we only need the count of the data, so we only need to transfer the count from the server to the client rather than the detailed data.
In our tests, this is about 10x faster than before.
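
As an illustration of the idea (this is not the actual change in this PR; the function and variable names are hypothetical), the server can walk its data locally and return only an integer, so the raw key-value pairs never cross the network:

```cpp
#include <cstdint>
#include <memory>

#include <rocksdb/db.h>

// Count the entries in [start, stop) locally on the server; only the final
// counter is sent back to the client instead of every key-value pair.
uint64_t count_entries(rocksdb::DB *db,
                       const rocksdb::Slice &start,
                       const rocksdb::Slice &stop)
{
    uint64_t count = 0;
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
    for (it->Seek(start); it->Valid() && it->key().compare(stop) < 0; it->Next()) {
        ++count;
    }
    return count;
}
```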

Tests
  • Unit test
  • Manual test (add detailed scripts or steps below)
  1. Create a table with millions of rows.
  2. Run "count_data -c -f".
Related changes
  • Need to update the documentation
  • Need to be included in the release note

@ZhongChaoqiang marked this pull request as draft April 29, 2021 01:29
@ZhongChaoqiang marked this pull request as ready for review April 30, 2021 01:32
@ZhongChaoqiang
Contributor Author

@neverchanje @levy5307
Could you help review this? Thanks.

@foreverneverer
Contributor

@ZhongChaoqiang This is a good idea. We have discussed it and think that:

  1. Adding code into on_scan will make on_scan even more bloated; we are now planning to refactor the scan logic.
  2. Using the scan RPC to count data does not seem like an elegant design. Would it be better to add a new RPC and only reuse the code of on_scan (which requires the refactor in item 1)?
  3. The C++ shell will be abandoned; we suggest adding the client change to https://github.com/pegasus-kv/admin-cli instead.

@levy5307
Contributor

levy5307 commented May 11, 2021

I agree with @shuo-jia.

Besides, precisely counting data is not a commonly used scenario; a count estimate is enough in most cases.

@ZhongChaoqiang
Contributor Author

@shuo-jia @levy5307
Thanks for your review.
In our scenario, precisely counting data is a frequent operation after we bulkload SST files, and the table holds a large amount of data, so it often takes a long time. The performance of precise counting is therefore very important to us.
Refactoring scan is a better idea. Should I rework the optimization into another RPC, or close this PR?

@levy5307
Contributor

Yes, I think it would be good to open a new pull request that adds a new RPC. @ZhongChaoqiang

@ZhongChaoqiang
Contributor Author

@levy5307 @shuo-jia
Besides improving the performance of count_data, we developed this feature for another scenario: quickly querying the number of KV pairs that match a scan condition (for example, a scan over a specified range or with a prefix condition).
If we used a separate RPC, it would duplicate too much of the scan functionality. So could you take another look at whether it would be more appropriate to keep this inside the existing scan functionality? The code would be much simpler that way. Thanks!

@foreverneverer
Contributor

For the duplicated code, could we consider extracting it so it can be reused? Would that work? @ZhongChaoqiang

@ZhongChaoqiang
Contributor Author

It is not mainly a question of code. Several scan interfaces are involved, for example get_scanner/async_get_scanner/get_unordered_scanners/async_get_unordered_scanners. If the count feature does not go through the scan path, we may have to add multiple count interfaces to mirror the existing scan interfaces, because the count interface's query conditions need to stay consistent with those of the scan interface. Doesn't that make things more complicated? @shuo-jia

@foreverneverer
Contributor

I don't quite follow. Scan gets one RPC and count gets another; count and scan share the same "iterator", so the scan client interfaces can stay completely unchanged. Wouldn't it be enough to just add a count API?
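
A rough sketch of the shared-iterator layout suggested here, assuming hypothetical type and handler names (this is not the actual Pegasus code): both the existing scan handler and a new count handler drive the same iteration routine, so only a count API needs to be added on top.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct key_value { std::string key; std::string value; };
struct scan_options { std::string start_key; std::string stop_key; };

struct scan_request   { scan_options options; };
struct scan_response  { std::vector<key_value> kvs; };
struct count_request  { scan_options options; };
struct count_response { int64_t count = 0; };

// Toy in-memory "store" so the sketch is self-contained; a real server would
// iterate its storage engine here instead.
static std::vector<key_value> g_store;

// The single iteration routine shared by both RPC handlers.
static void for_each_entry(const scan_options &opts,
                           const std::function<void(const key_value &)> &cb)
{
    for (const auto &kv : g_store) {
        if (kv.key >= opts.start_key && kv.key < opts.stop_key) {
            cb(kv);
        }
    }
}

// Existing behaviour: materialize and return every matching key-value pair.
void on_scan(const scan_request &req, scan_response &resp)
{
    for_each_entry(req.options, [&](const key_value &kv) { resp.kvs.push_back(kv); });
}

// New RPC: reuse the same iteration but only accumulate a counter.
void on_count(const count_request &req, count_response &resp)
{
    for_each_entry(req.options, [&](const key_value &) { ++resp.count; });
}
```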

@Smityz assigned Smityz and unassigned Smityz Sep 13, 2021
@acelyc111
Member

@ZhongChaoqiang Hi, can this work be continued? Our use cases also have a strong need for this feature.

@acelyc111
Member

I think sharing the RPC is fine. The real question is where the count value should live. Serializing it in struct get_scanner_request is not a great fit; it may be better to add a flag in scan_request indicating whether to return only the count, and to carry the count result back in scan_response.
The newly added fields are all optional. When the server is an old version that does not recognize the new fields, it will follow the old logic and return the raw KV data; a new client can use this to detect whether the server supports the feature, and if not, the business-logic layer can return a failure.
The cleaner approach is of course still to add an independent RPC, but get_scanner can still be reused.
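
As an illustration of the backward-compatibility point above, here is a small sketch in C++ (field and function names are hypothetical, not the actual Pegasus thrift definitions): the new fields are optional, and the client decides whether the server honored the count-only request by checking the response.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical extensions to the scan structures; Pegasus defines these in
// thrift, where the new fields would be declared optional.
struct scan_request_ext
{
    // ... existing scan fields omitted ...
    bool only_return_count = false; // new optional flag: return a count instead of kv pairs
};

struct scan_response_ext
{
    // ... existing scan fields omitted ...
    std::vector<std::pair<std::string, std::string>> kvs; // raw data from an old server
    int64_t kv_count = -1; // new optional field: filled only when the server knows the flag
};

// An old server ignores only_return_count and ships the raw kv data back, so a
// missing count tells the client the feature is unsupported; the business
// layer can then fall back to client-side counting or report a failure.
bool server_supports_count_only(const scan_response_ext &resp)
{
    return resp.kv_count >= 0;
}
```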

@acelyc111
Member

This has been implemented by #1091, so I am closing this PR.
Thanks @ZhongChaoqiang

@acelyc111 closed this Sep 6, 2022