Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the new block endpoint for history fill #162

Merged
merged 15 commits into from
Dec 21, 2023
Merged

Conversation

edmundnoble
Copy link
Contributor

@edmundnoble edmundnoble commented Aug 16, 2023

This PR uses chainweb-node's new block headers with payloads endpoint during the history fill operations. Doing so simplifies the code and presumably reduces the CPU overhead over the CW-N and CW-D involved in the process.

The next step is to migrate the listen process from the header stream to the header with payloads stream, but it is not included in this PR.

@enobayram
Copy link
Contributor

@edmundnoble Thank you very much for opening this PR! I've been waiting for the corresponding CW-N endpoints to go into a release and now that they are I'll look into updating, verifying and merging this PR.

@enobayram
Copy link
Contributor

@edmundnoble, I've used a CW-D instance from this branch to backfill the database over night. Today I've noticed that it currently leaves a lot of gaps of blocks ~10-20 heights tall. Seems like this is caused by the fact that we're trying to fetch chunks of 360 blocks by asking for maxheight - minheight = 360 ranges, but CW-N also seems to limit the block endpoint response at exactly 360 items max. The end result is that in the presence of orphans, CW-N sends multiple blocks for some heights, fills up the 360-item limit short of fulfilling the range we asked for. Is this the intended behavior for this endpoint?

I'm planning to fix the issue with CW-D by shrinking our chunks, chunks of 360 is unnecessarily large anyway and I think it also inflates the memory consumption unnecessarily.

@edmundnoble
Copy link
Contributor Author

Unfortunately, there is no paginated endpoint which includes orphans, which would be the natural way to deal with this issue. You can sort of get around this by asking for maxheight - minheight = 120, because there are almost never more than 2 orphans at a given height.

@enobayram
Copy link
Contributor

I also opted for a chunk size of 120, which solved the problem of gaps as expected. I'll leave this fill running for a little more and look for any issues, otherwise it seems like this PR is very close to completion.

@enobayram enobayram changed the title WIP: use new block endpoints Use new block endpoints Dec 20, 2023
@enobayram
Copy link
Contributor

I think this PR is ready to be merged. I've been running a chainweb-data process from this PR in server mode as well as in fill mode for the last 24 hours or so against an (initially) empty Postgres SCHEMA inside a database that has a chainweb-data instance from the last release populating another SCHEMA for comparing the populated data.

During those 24 hours, chainweb-data managed to backfill roughly 50% of current mainnet. The machine I ran this on has other services running on it, so it's hard to measure fill performance differences, but the populated data doesn't seem to have any issues based on the following checks:

No gaps in the data

I've run the following query to check if there are any gaps in the populated data:

SELECT chainid, height, next_height                                                                                
FROM (                                                                                                                    
    SELECT                                                   
        chainid,                                                                                                          
        height,                                                                                                                                                                                                                                      
        LEAD(height) OVER (PARTITION BY chainid ORDER BY height DESC) AS next_height                                      
    FROM                                                     
        blocks                                               
) AS subquery                                                
WHERE                                                        
    (height - next_height) > 1                               
ORDER BY                                                     
    chainid ASC,                                             
    height DESC LIMIT 20;                                                                                                 
                                                                                                                          
 chainid | height | next_height                              
---------+--------+-------------                             
(0 rows)                                                                                                                  

(Essentially the same query performed by chainweb-data for discovering the gaps in the data)

Populated data is identical to the data from the previous release

SELECT count(*)
     , count(*) FILTER (WHERE (SELECT to_jsonb(b) = to_jsonb(b2) FROM cwd.blocks b2 WHERE b2.hash = b.hash))
     , count(*) FILTER (WHERE NOT (SELECT to_jsonb(b) = to_jsonb(b2) FROM cwd.blocks b2 WHERE b2.hash = b.hash))
FROM blocks b 
;

  count   |  count   | count                                 
----------+----------+-------                                
 34508643 | 34498688 |     0                                 
(1 row)                                                      

Seems like we've ended up with some additional blocks (I think this new endpoint provides all orphans seen), but all the blocks we've copied that are common with the release CW-D have all of their fields identical.

SELECT count(*)
     , count(*) FILTER (WHERE (SELECT to_jsonb(t) = to_jsonb(t2) FROM cwd.transactions t2 WHERE t2.block = t.block AND t2.requestkey = t.requestkey))
     , count(*) FILTER (WHERE NOT (SELECT to_jsonb(t) = to_jsonb(t2) FROM cwd.transactions t2 WHERE t2.block = t.block AND t2.requestkey = t.requestkey))
FROM transactions t 
WHERE t.height > 4100000
;

  count   |  count   | count 
----------+----------+-------
 22884000 | 22825320 |     0
(1 row)

Similar query for a subset of the transactions, no discrepancy found.

The query was a little trickier for the events table, because that table contains the paramtext column, which is rendered from JSON to string on the Haskell side, and the field order changed between versions. That's why I had to remove that from the json representation we use for comparison in order to avoid false negatives:

SELECT count(*)
     , count(*) FILTER (WHERE (SELECT to_jsonb(e) - 'paramtext' = to_jsonb(e2) - 'paramtext' FROM cwd.events e2 WHERE e2.block = e.block AND e2.requestkey = e.requestkey AND e2.idx = e.idx))
     , count(*) FILTER (WHERE NOT (SELECT to_jsonb(e) - 'paramtext' = to_jsonb(e2) - 'paramtext' FROM cwd.events e2 WHERE e2.block = e.block AND e2.requestkey = e.requestkey AND e2.idx = e.idx))
FROM events e 
WHERE e.height > 4300000
;
  count  |  count  | count 
---------+---------+-------
 7695345 | 7674529 |     0
(1 row)

Copy link
Contributor

@enobayram enobayram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you very much for the PR @edmundnoble

@enobayram enobayram merged commit 4ab8ea8 into master Dec 21, 2023
6 checks passed
@enobayram enobayram deleted the edmund/block-endpoints branch December 21, 2023 10:04
@enobayram enobayram changed the title Use new block endpoints Use the new block endpoint for history fill Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants