Robust IO cost measurements in estimator #5995

Open · 7 of 11 tasks · Tracked by #8819
jakmeier opened this issue Jan 4, 2022 · 13 comments
Assignee: jakmeier
Labels: A-params-estimator (Area: runtime params estimator) · C-tracking-issue (Category: a tracking issue) · T-contract-runtime (Team: issues relevant to the contract runtime team)

@jakmeier (Contributor) commented Jan 4, 2022

We have to make sure our IO measurements for the parameter estimator are:

  • isolated from what runs before (e.g. setup) and afterwards
  • immune to artifacts caused by the storage architecture

To get there, I think I first need to understand our storage layers better, and we should probably reflect on what the goal of the measurement is.

Todos

  1. We have to understand the factors that cause IO.
    • Can we control the background threads for IO?
    • How does the shape of the Trie affect storage accesses?
    • What storage access patterns produce which IO patterns?
    • What is the cheapest way for an attacker to produce lots of IO?
    • (and many more questions)
  2. Define what we want to measure
    • What are the assumptions behind IO measurements? E.g. which caches are enabled? What is the prior state?
  3. Ensure the measurements in all estimations follow the defined assumptions and are appropriately isolated

Background

When running the parameter estimator inside QEMU with the ICount metric, it reports IO bytes read and written. This is done by intercepting syscalls, which should give the most accurate picture of how much actual IO is triggered.

However, there are a number of challenges with this approach. Simply looking at the counters before and after the execution of some code does not give an accurate picture. There are at least three effects we should take into account.

  1. IO is not produced evenly throughout execution. Often, IO comes in large blocks with zero IO in between. This is due to caches and write buffers in different layers.
  2. The prior state of the database changes how expensive it is to access an entry in that DB. Both the shape of the Trie as well as the shape of the RocksDB LSM-Tree are relevant here.
  3. RocksDB handles IO in background threads, including flushes/write-backs as well as compaction.
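
To make "looking at the counters before and after" concrete, here is a minimal sketch of a delta-based IO measurement. It reads the Linux per-process accounting in /proc/self/io rather than intercepting syscalls in QEMU like the estimator does, so it is for illustration only; the three effects above apply either way.

```rust
use std::fs;

/// Read (read_bytes, write_bytes) from /proc/self/io (Linux-only accounting).
fn io_bytes() -> (u64, u64) {
    let text = fs::read_to_string("/proc/self/io").expect("needs Linux task IO accounting");
    let field = |name: &str| -> u64 {
        text.lines()
            .find(|line| line.starts_with(name))
            .and_then(|line| line.split_whitespace().nth(1))
            .and_then(|value| value.parse().ok())
            .unwrap_or(0)
    };
    (field("read_bytes:"), field("write_bytes:"))
}

/// Delta measurement around a workload. Because of caches, write buffers and
/// RocksDB background threads, the bytes may hit the kernel before, after, or
/// outside this window entirely, which is exactly the problem described above.
fn measure_io<F: FnOnce()>(workload: F) -> (u64, u64) {
    let (read_before, write_before) = io_bytes();
    workload();
    let (read_after, write_after) = io_bytes();
    (read_after - read_before, write_after - write_before)
}
```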

Prior Status Quo

Our IO measurements already disable (some) caches in an attempt to address point 1. Point 2 is somewhat addressed by using the same state every time, which at least makes the estimations consistent. Point 3 is not well understood right now, especially in the context of the estimator (at least by me).

Work items

Proposal / ideas

@jakmeier added the T-contract-runtime and A-params-estimator labels on Jan 4, 2022
@jakmeier self-assigned this on Jan 4, 2022
@bowenwang1996 (Collaborator) commented:

The work that @Longarithm did on visualizing the trie could be helpful here

@jakmeier (Contributor, Author) commented Jan 7, 2022

Some thoughts regarding background threads:

  • IO happens in background threads for good reasons. Moving it to the main thread would maybe make our estimations more consistent but ultimately less accurate. (I.e., the main thread in our estimator lab setup would then be slowed down by work it will not do in practice.)
  • Another approach is to control when exactly IO happens instead of where.
    • cancel_all_background_work with wait=true should allow us to block until pending IO tasks have finished, which is useful after setup work (see the sketch below)
    • Manually triggering compaction is also worth mentioning here. But careful: this messes with the compaction policy and might lead to unrealistic shapes of the LSM tree.
  • With IO offloaded from the main thread, this cost is in many cases orthogonal to execution cost. Thus, I could also see a strategy to estimate IO cost independently from execution cost and take whichever of the two is higher as the final estimation.
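
A sketch of the "control when IO happens" idea, assuming the rust-rocksdb bindings (which wrap the C API calls named above); this is not the estimator's actual code:

```rust
use rocksdb::DB;

/// Quiesce RocksDB between the setup phase and the measurement phase.
/// Sketch only; method names follow the rust-rocksdb bindings, which wrap
/// RocksDB's C API (flush, compact_range, cancel_all_background_work).
fn quiesce(db: &DB) {
    // Flush memtables so pending setup writes are not attributed to the
    // measurement that follows.
    db.flush().expect("flush failed");

    // Optionally force a full compaction. Careful: this bypasses the normal
    // compaction policy and can produce an LSM shape that never occurs in
    // production.
    db.compact_range::<&[u8], &[u8]>(None, None);

    // Block until all background jobs (flushes, compactions) have finished.
    db.cancel_all_background_work(true);
}
```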

@bowenwang1996 (Collaborator) commented:

> With IO offloaded from the main thread, this cost is in many cases orthogonal to execution cost. Thus, I could also see a strategy to estimate IO cost independently from execution cost and take whichever of the two is higher as the final estimation.

Yeah I think that makes sense

@jakmeier (Contributor, Author) commented:

I measured RocksDB insertion performance for sequential keys with 1kB and 100kB values.
The number of insertions per iteration is 1M and 10k respectively, so that each case adds up to 1GB per iteration.
Then I measured 100 iterations. Here is the summary:

| Case | avg | min | max |
| --- | --- | --- | --- |
| 1kB gcloud | 18.828 s | 16.776 s | 22.121 s |
| 100kB gcloud | 17.619 s | 16.816 s | 20.667 s |
| 1kB gcloud, forced compaction | 31.692 s | 27.294 s | 33.224 s |
| 100kB gcloud, forced compaction | 32.444 s | 31.779 s | 33.138 s |
| 1kB local | 7.783 s | 3.395 s | 19.513 s |
| 100kB local | 4.945 s | 1.517 s | 14.731 s |
| 1kB local, forced compaction | 10.873 s | 6.761 s | 18.747 s |
| 100kB local, forced compaction | 9.427 s | 4.022 s | 20.302 s |

The cases with forced compaction include a manually triggered compaction after the insertions, adding to the total time.

The local machine uses a consumer NVMe SSD and an AMD Ryzen 5 3600 6-core processor (2200-4703 MHz).
The cloud machine uses a Google-managed SSD persistent disk and 4 vCPUs of an Intel Cascade Lake, Intel(R) Xeon(R) CPU @ 2.80GHz.

Statistical considerations

Reported are arithmetic mean (avg), minimum and maximum measurements for a 1GB iteration.

However, the distributions are definitely not Gaussian, judging by the histograms I looked at. The right tail is much longer, so take the average with a grain of salt. The median is generally quite a bit below the average.

On the other hand, there are no gaps in the distribution. I.e., all values between min and max have some chance to be observed, and the probability decreases almost monotonically towards the larger times.

Source code for rerunning

The source code of the benchmark is on a separate branch so far.
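
For orientation, a much simplified standalone sketch of the same benchmark idea (sequential keys, fixed-size values, one timed iteration), assuming the rust-rocksdb crate; the real code on the branch is wired into the estimator instead:

```rust
use rocksdb::DB;
use std::time::{Duration, Instant};

/// Insert `count` sequential keys with `value_size`-byte values and return the
/// elapsed wall-clock time. Illustrative only; the actual benchmark on the
/// branch is integrated into the estimator's measurement framework.
fn insert_iteration(db: &DB, count: u64, value_size: usize) -> Duration {
    let value = vec![0u8; value_size];
    let start = Instant::now();
    for i in 0..count {
        db.put(i.to_be_bytes(), &value).expect("put failed");
    }
    start.elapsed()
}

fn main() {
    let db = DB::open_default("/tmp/rocksdb-insert-bench").expect("open failed");
    // 1M x 1kB and 10k x 100kB both add up to ~1GB per iteration.
    println!("1kB values:   {:?}", insert_iteration(&db, 1_000_000, 1_000));
    println!("100kB values: {:?}", insert_iteration(&db, 10_000, 100_000));
}
```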

Example command:

# Assumes data_dump_0001.bin has been generated and should be reused for all iterations
for run in {1..100}; do cargo run --release -p runtime-params-estimator --features required -- --home ~/.near --metric time --costs RocksDbInsertValueKByte --pr-data-path data_dump_0001.bin --debug rocksdb; done  >> rocksdb_insert_value_kb.txt
awk -f summary.awk rocksdb_insert_value_kb.txt

AWK script to format output:

# summary.awk -- condense the estimator output of many runs into one summary line.
/setup/{c=0};                     # a new run's setup phase starts: reset the current cost
/Cost/{
        # Normalize the reported cost to milliseconds.
        if($3~/ms/){ c = substr($3,1,length($3)-2) }
        else{ c = 1000 * substr($3,1,length($3)-1) }
};
# Record LSM level file counts before (c==0) and after (c!=0) the measurement.
/level/{if (c==0){lb[$5]=$1;} else {la[$5]=$1}}
/Finished/{
        counts++;
        sums+=c;
        if(max<c) { max=c; }
        if(min>c || min==0) { min=c; }
}
END {
        # Print: number of runs | avg | min | max (converted back to seconds)
        printf "%3d | %7.3f s | %7.3f s | %7.3f s |\n", counts, sums/counts/1000, min/1000, max/1000
}

Interpretation

We shouldn't read too much into the exact numbers here, since these will vary with workload and hardware. But to gain some intuition, each second reported here corresponds to 1 microsecond per 1kB, or 1ms per 1MB. If we were to translate this to gas cost, the 33.224s measurement would lead to 3.3224 TGas per MB inserted.

Now compare that to what we currently charge for 10x 100kB insertion:

  • 10 * storage_write_base (64'196'736'000 Gas)
  • 1'000'000 * storage_write_value_byte (31'018'539 Gas)

Total: 31.66 TGas per MB
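
For reference, the arithmetic behind that total:

```rust
fn main() {
    let base = 10u64 * 64_196_736_000;            // 10 * storage_write_base
    let value_bytes = 1_000_000u64 * 31_018_539;  // 1'000'000 * storage_write_value_byte
    assert_eq!(base + value_bytes, 31_660_506_360_000); // ~31.66 TGas
}
```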

This is around 10 times more than my measurements showed even in the absolute worst case, which seems like a good place to be right now. But it also hints that we could perhaps afford to lower storage write costs, even when accounting for overhead and a safety margin.

Key Insights

  • RocksDB insert operations in isolation take significant time but are well-covered by current cost parameters.
  • Compaction also takes significant time.
  • Even for this artificially simple case, RocksDB performance has a high variance. This can be reduced by running tests on the cloud and forcing compaction.

Additional Details

What is reported in this section shouldn't be relevant for the conclusions above. However, for the extra-curious reader (and for my own benefit, to sort my thoughts), I wrote down some more of my analysis on this topic in the spoilers below.

LSMT shape analysis

It turns out that a large part of the variance in the measurements is correlated with differences in the shape of the underlying LSM tree within RocksDB.

Example for 100kB local, forced compaction:

| # occurrences | L0/L1/L2 files after setup --> after benchmark | avg | min | max |
| --- | --- | --- | --- | --- |
| 90 | 0/0/33 --> 0/0/48 | 9.782 s | 4.022 s | 20.302 s |
| 10 | 0/0/32 --> 0/0/47 | 6.226 s | 4.607 s | 8.952 s |

The above table shows a very small difference in the shape of the tree after the setup, just one file fewer in level 2 of the LSM tree (and an equally small difference at the end of the benchmark). This was enough to bring the average in the smaller sample down to 6.2s, while the larger sample averages 9.8s. On the other hand, the minimum is nevertheless ~15% higher in the smaller sample. (Commenting on the max is difficult: because of the small sample size of the second group, a less extreme max is expected to some degree.)

There are many more different shapes when looking at the results without manual compaction.

Example for 100kB local:

| # occurrences | L0/L1/L2 files after setup --> after benchmark | avg | min | max |
| --- | --- | --- | --- | --- |
| 25 | 2/2/20 --> 2/2/33 | 3.115 s | 1.517 s | 6.709 s |
| 14 | 2/9/19 --> 2/2/33 | 4.304 s | 1.552 s | 9.522 s |
| 7 | 2/9/19 --> 3/3/28 | 6.717 s | 2.491 s | 14.731 s |
| 7 | 2/2/20 --> 3/2/29 | 5.988 s | 2.520 s | 7.653 s |
| 5 | 2/2/20 --> 3/1/33 | 5.498 s | 1.702 s | 6.893 s |
| 4 | 4/8/12 --> 3/3/28 | 10.236 s | 2.811 s | 13.794 s |
| 4 | 2/16/12 --> 3/3/28 | 7.557 s | 2.605 s | 12.591 s |
| 3 | 2/2/20 --> 3/3/28 | 5.356 s | 4.709 s | 6.546 s |
| 3 | 2/15/13 --> 3/3/28 | 8.417 s | 5.912 s | 12.812 s |
| 2 | 2/9/19 --> 4/1/32 | 2.566 s | 2.079 s | 3.052 s |
| 2 | 2/18/10 --> 3/3/28 | 8.983 s | 8.886 s | 9.080 s |
| 2 | 2/15/13 --> 3/2/32 | 2.074 s | 2.001 s | 2.146 s |
| 2 | 2/13/15 --> 3/3/28 | 6.997 s | 6.745 s | 7.249 s |
| 2 | 2/10/18 --> 4/1/32 | 2.169 s | 2.116 s | 2.223 s |
| 2 | 2/10/18 --> 2/2/33 | 1.894 s | 1.564 s | 2.225 s |

(16 further shapes with a single occurrence each are omitted.)

The inserted data was exactly the same in every run, for the setup and the actual measurements. Thus, for now, I can only assume these differences come from varying CPU and SSD timings.

There is a trend that groups where the final tree has more files in L2 (e.g. 32) finish faster than those with fewer (e.g. 28). But it is not at all clear whether the different shapes are the cause of this difference in performance, or the other way around.

Looking at the same setup on the Google Cloud VM, the shapes are much more consistent, even without forced compaction.

Example for 100kB gcloud, forced compaction:

| # occurrences | L0/L1/L2 files after setup --> after benchmark | avg | min | max |
| --- | --- | --- | --- | --- |
| 98 | 0/0/33 --> 0/0/48 | 32.455 s | 31.779 s | 33.138 s |
| 2 | 0/0/32 --> 0/0/47 | 31.917 s | 31.916 s | 31.919 s |

Example for 100kB gcloud:

| # occurrences | L0/L1/L2 files after setup --> after benchmark | avg | min | max |
| --- | --- | --- | --- | --- |
| 86 | 2/9/19 --> 3/3/28 | 17.459 s | 16.816 s | 18.194 s |
| 8 | 2/9/19 --> 4/2/28 | 17.490 s | 17.483 s | 17.501 s |
| 5 | 4/8/12 --> 3/3/28 | 20.659 s | 20.647 s | 20.667 s |
| 1 | 2/2/20 --> 2/3/29 | 17.251 s | 17.251 s | 17.251 s |

My next steps

This first step was mostly for me to learn and intuitively understand what I can expect from RocksDB performance measurements. Next, I will use what I learned here to integrate robust IO measurements into the parameter estimator.

@bowenwang1996 (Collaborator) commented:

> Now compare that to what we currently charge for 10x 100kB insertion:
>
> • 10 * storage_write_base (64'196'736'000 Gas)
> • 1'000'000 * storage_write_value_byte (31'018'539 Gas)

@jakmeier What about storage_write_key_byte?

@jakmeier (Contributor, Author) commented:

Good point, I should have mentioned that. In this instance, all keys are shorter than 10 bytes. The associated cost is less than 10 * 70'482'867 Gas and thus completely negligible for the purpose of this high-level comparison of raw RocksDB performance vs. current gas cost.

@jakmeier (Contributor, Author) commented:

An interesting find regarding read performance: Writes that have not been flushed & compacted negatively impact read performance.

Here are two plots produced using the script rocksdb_read_benchmark.sh.

rocksdb_read_benchmark.pdf
rocksdb_read_benchmark_up_to_6kb.pdf

Clearly, forcing a flush and compaction before the read measurement starts is beneficial. In other words, cleaning up the RocksDB structure measurably affects read performance. This raises the question: should this overhead, caused by earlier writes, be accounted for in the gas cost for reads?

For now, the new estimations RocksDbReadValueByte and RocksDbWriteValueByte that I added to the parameter estimator assume compaction is not enforced. This means writes are estimated as cheaper (as shown in the table from last week) but reads as more expensive, as shown in the new plots. I think this best reflects how we currently use RocksDB.

But the discussion on how we want to treat storage costs in the future is ongoing. Personally, I think we should enforce compaction at a certain interval (e.g. after every block), add the compaction cost to the write gas cost, and assume reads always happen on a clean state. This should be more stable and reliable than what we have today.

@bowenwang1996 (Collaborator) commented:

@jakmeier FYI, there is a plan to have a separate hashtable-based storage for reads that would allow for better performance.

@jakmeier (Contributor, Author) commented:

Another issue I discovered in the last few weeks is that the estimator currently cannot run with large DB stores. If I insert 8 million accounts, the OOM killer kicks in on a machine with 16GB RAM. I think we load the entire DB into memory for the test setup, which seems generally bad for storage estimation.

@bowenwang1996 (Collaborator) commented:

> I think we load the entire DB into memory for the test setup, which seems generally bad for storage estimation.

Indeed. Could we fix that?

@stale (bot) commented May 22, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@jakmeier (Contributor, Author) commented:

> > I think we load the entire DB into memory for the test setup, which seems generally bad for storage estimation.
>
> Indeed. Could we fix that?

I looked into this a bit more today. With a large DB, creating a new store from a state dump often takes the majority of the time of an estimation run. Unfortunately, my attempts to optimize this have failed so far.

  1. Streaming the state dump and writing it in batches is significantly slower, but it reduces the memory footprint (see the sketch after this list).
  2. Creating a RocksDB directory once at the beginning and then copying it on the filesystem level is much faster for each iteration, although the first round is much slower. Crucially, this also adds significant variance to the measurements because the SST files of RocksDB are then subject to copy-on-write optimizations by the filesystem.
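
For completeness, a sketch of what "writing it in batches" means in approach 1, assuming the rust-rocksdb crate; the record iterator and batch size are illustrative, not the estimator's actual interface:

```rust
use rocksdb::{WriteBatch, WriteOptions, DB};

/// Stream records from a state dump and write them in fixed-size batches
/// instead of materializing the whole dump in memory first.
fn write_in_batches(db: &DB, records: impl Iterator<Item = (Vec<u8>, Vec<u8>)>) {
    let mut write_opts = WriteOptions::default();
    write_opts.disable_wal(true); // setup data does not need the write-ahead log

    let mut batch = WriteBatch::default();
    for (key, value) in records {
        batch.put(&key, &value);
        if batch.len() >= 10_000 {
            db.write_opt(std::mem::take(&mut batch), &write_opts)
                .expect("batch write failed");
        }
    }
    db.write_opt(batch, &write_opts).expect("batch write failed");
}
```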

In conclusion, it would ultimately be better to avoid the need for huge databases in the estimator. Creating huge DBs has not reliably produced worst-case access times anyway. For example, #4771 shows that even small amounts of data can produce really bad cases, far worse than what a DB with 50M accounts produces.

My current plan is to either find specific bad cases with small data, or otherwise rely on counting DB operations and assuming fixed latencies (roughly the model sketched below). The next step forward for this is to implement #7058.
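
To illustrate what counting DB operations with fixed latencies would look like as a cost model (the constants are hypothetical and would have to come from dedicated estimations):

```rust
/// Hypothetical per-operation gas latencies; illustrative values only.
const GAS_PER_DB_READ: u64 = 20_000_000_000;
const GAS_PER_DB_WRITE: u64 = 40_000_000_000;

/// IO gas derived from counted DB operations rather than measured wall-clock IO.
fn io_gas(db_reads: u64, db_writes: u64) -> u64 {
    db_reads * GAS_PER_DB_READ + db_writes * GAS_PER_DB_WRITE
}
```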

@jakmeier (Contributor, Author) commented Aug 26, 2022

I am happy to declare that I finally reached the point where "all worst-case assumptions" proves to be too pessimistic.

Combining these worst known cases

  1. an unreasonably bad SST file layout,
  2. all nearcore caches disabled,
  3. a RocksDB state that is at least a couple dozen GBs large,
  4. cleared RocksDB buffers and caches,
  5. a cleared OS page cache (see the sketch below),

leads to a processing time of more than 1s for an empty action receipt. Clearly, this is going too far. There is no way this could be reproduced in production. And yet, in the estimator we can reproduce it. See #4771.
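
For context on item 5: clearing the OS page cache boils down to asking the kernel to drop it via /proc/sys/vm/drop_caches. A sketch of a hypothetical helper (the estimator may do this differently):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::process::Command;

/// Hypothetical helper: drop the OS page cache so that subsequent reads have
/// to go to disk. Linux-only and requires root; in a benchmark this would run
/// between setup and measurement.
fn clear_os_page_cache() -> std::io::Result<()> {
    // Write back dirty pages first; only clean pages can be dropped.
    Command::new("sync").status()?;
    // "3" drops the page cache plus dentries and inodes.
    OpenOptions::new()
        .write(true)
        .open("/proc/sys/vm/drop_caches")?
        .write_all(b"3")
}
```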

I think the way forward is to reconsider what a reasonably pessimistic assumption is. I would go with enabled caches, an (almost) fully compacted RocksDB, and only selective cache clearing where it makes sense. I will post more updates once I have had the chance to play around with these ideas.
