
improve performance of count_data #728

Closed
wants to merge 3 commits into from

Conversation

ZhongChaoqiang
Contributor

@ZhongChaoqiang commented Apr 28, 2021

What problem does this PR solve?

When we precisely count the data in a large table, it can take minutes or even hours.

What is changed and how does it work?

Actually, we only need the count of the data, so we only need to transfer the count from the server to the client rather than the detailed data.
In our tests, this is about 10x faster than before.
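
As an illustration of the idea (this is not the actual change in this PR; the function and variable names are hypothetical), the server can walk its data locally and return only an integer, so the raw key-value pairs never cross the network:

```cpp
#include <cstdint>
#include <memory>

#include <rocksdb/db.h>

// Count the entries in [start, stop) locally on the server; only the final
// counter is sent back to the client instead of every key-value pair.
uint64_t count_entries(rocksdb::DB *db,
                       const rocksdb::Slice &start,
                       const rocksdb::Slice &stop)
{
    uint64_t count = 0;
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
    for (it->Seek(start); it->Valid() && it->key().compare(stop) < 0; it->Next()) {
        ++count;
    }
    return count;
}
```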

Tests
  • Unit test
  • Manual test (add detailed scripts or steps below)
  1. Create a table with millions of rows.
  2. Run "count_data -c -f".
Related changes
  • Need to update the documentation
  • Need to be included in the release note

@ZhongChaoqiang marked this pull request as draft April 29, 2021 01:29
@ZhongChaoqiang marked this pull request as ready for review April 30, 2021 01:32
@ZhongChaoqiang
Contributor Author

@neverchanje @levy5307
Could you help review this? Thanks.

@foreverneverer
Contributor

@ZhongChaoqiang This is a good idea. We have discussed it and think that:

  1. Adding code into on_scan will make on_scan even more bloated; we are now planning to refactor the scan logic.
  2. Using the scan RPC to count data does not seem like an elegant design. Would it be better to add a new RPC and only reuse the code of on_scan (which requires the refactor in item 1)?
  3. The C++ shell will be abandoned; we suggest adding the client change to https://github.com/pegasus-kv/admin-cli instead.

@levy5307
Contributor

levy5307 commented May 11, 2021

I agree with @shuo-jia.

Besides, precisely counting data is not a commonly used scenario; a count estimate is enough in most cases.

@ZhongChaoqiang
Contributor Author

@shuo-jia @levy5307
Thanks for your review.
In our scenario, precisely counting data is a frequent operation after we bulkload SST files, and the table holds a large amount of data, so it often takes a long time. The performance of precise counting is therefore very important to us.
Refactoring scan is a better idea. Should I rework the optimization into another RPC, or close this PR?

@levy5307
Contributor

Yes, I think it would be good to open a new pull request that adds a new RPC. @ZhongChaoqiang

@ZhongChaoqiang
Contributor Author

@levy5307 @shuo-jia
Besides improving the performance of count_data, we developed this feature for another scenario: quickly querying the number of KV pairs that match a scan condition (for example, a scan over a specified range or with a prefix condition).
If we used a separate RPC, it would duplicate too much of the scan functionality. So could you take another look at whether it would be more appropriate to keep this inside the existing scan functionality? The code would be much simpler that way. Thanks!

@foreverneverer
Contributor

For the duplicated code, could we consider extracting it so it can be reused? Would that work? @ZhongChaoqiang

@ZhongChaoqiang
Contributor Author

It is not mainly a question of code. Several scan interfaces are involved, for example get_scanner/async_get_scanner/get_unordered_scanners/async_get_unordered_scanners. If the count feature does not go through the scan path, we may have to add multiple count interfaces to mirror the existing scan interfaces, because the count interface's query conditions need to stay consistent with those of the scan interface. Doesn't that make things more complicated? @shuo-jia

@foreverneverer
Contributor

I don't quite follow. Scan gets one RPC and count gets another; count and scan share the same "iterator", so the scan client interfaces can stay completely unchanged. Wouldn't it be enough to just add a count API?
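
A rough sketch of the shared-iterator layout suggested here, assuming hypothetical type and handler names (this is not the actual Pegasus code): both the existing scan handler and a new count handler drive the same iteration routine, so only a count API needs to be added on top.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct key_value { std::string key; std::string value; };
struct scan_options { std::string start_key; std::string stop_key; };

struct scan_request   { scan_options options; };
struct scan_response  { std::vector<key_value> kvs; };
struct count_request  { scan_options options; };
struct count_response { int64_t count = 0; };

// Toy in-memory "store" so the sketch is self-contained; a real server would
// iterate its storage engine here instead.
static std::vector<key_value> g_store;

// The single iteration routine shared by both RPC handlers.
static void for_each_entry(const scan_options &opts,
                           const std::function<void(const key_value &)> &cb)
{
    for (const auto &kv : g_store) {
        if (kv.key >= opts.start_key && kv.key < opts.stop_key) {
            cb(kv);
        }
    }
}

// Existing behaviour: materialize and return every matching key-value pair.
void on_scan(const scan_request &req, scan_response &resp)
{
    for_each_entry(req.options, [&](const key_value &kv) { resp.kvs.push_back(kv); });
}

// New RPC: reuse the same iteration but only accumulate a counter.
void on_count(const count_request &req, count_response &resp)
{
    for_each_entry(req.options, [&](const key_value &) { ++resp.count; });
}
```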

@Smityz assigned Smityz and unassigned Smityz Sep 13, 2021
@acelyc111
Member

@ZhongChaoqiang Hi, can this work be continued? Our use cases also have a strong need for this feature.

@acelyc111
Member

I think sharing the RPC is fine. The real question is where the count value should live. Serializing it in struct get_scanner_request is not a great fit; it may be better to add a flag in scan_request indicating whether to return only the count, and to carry the count result back in scan_response.
The newly added fields are all optional. When the server is an old version that does not recognize the new fields, it will follow the old logic and return the raw KV data; a new client can use this to detect whether the server supports the feature, and if not, the business-logic layer can return a failure.
The cleaner approach is of course still to add an independent RPC, but get_scanner can still be reused.
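
As an illustration of the backward-compatibility point above, here is a small sketch in C++ (field and function names are hypothetical, not the actual Pegasus thrift definitions): the new fields are optional, and the client decides whether the server honored the count-only request by checking the response.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical extensions to the scan structures; Pegasus defines these in
// thrift, where the new fields would be declared optional.
struct scan_request_ext
{
    // ... existing scan fields omitted ...
    bool only_return_count = false; // new optional flag: return a count instead of kv pairs
};

struct scan_response_ext
{
    // ... existing scan fields omitted ...
    std::vector<std::pair<std::string, std::string>> kvs; // raw data from an old server
    int64_t kv_count = -1; // new optional field: filled only when the server knows the flag
};

// An old server ignores only_return_count and ships the raw kv data back, so a
// missing count tells the client the feature is unsupported; the business
// layer can then fall back to client-side counting or report a failure.
bool server_supports_count_only(const scan_response_ext &resp)
{
    return resp.kv_count >= 0;
}
```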

@acelyc111
Member

This has been implemented by #1091, so I am closing this PR.
Thanks @ZhongChaoqiang

@acelyc111 closed this Sep 6, 2022