Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A story of two indexers #511

Merged
merged 11 commits into from
Jun 16, 2022

Conversation

raduom
Copy link
Contributor

@raduom raduom commented Jun 10, 2022

Pre-submit checklist:

  • Branch
    • Tests are provided (if possible)
    • Commit sequence broadly makes sense
    • Key commits have useful messages
    • Formatting, PNG optimization, etc. are updated
  • PR
    • Self-reviewed the diff
    • Useful pull request description
    • Reviewer requested

@raduom raduom changed the title Raduom/a story of two indexers A story of two indexers Jun 10, 2022
@raduom raduom marked this pull request as ready for review June 10, 2022 11:36
@raduom raduom force-pushed the raduom/a-story-of-two-indexers branch from 91a8efa to a130a76 Compare June 10, 2022 11:55
Comment on lines +62 to +63
optionsUtxoPath :: Maybe FilePath,
optionsDatumPath :: Maybe FilePath
Copy link
Contributor

@koslambrou koslambrou Jun 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, is it possible (or desirable) to put the data from the indexers in the same database?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of SQLite, most likely not due to higher contention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

coordinator <- initialCoordinator indexerCount
when (isJust datumPath) $ do
ch <- atomically . dupTChan $ _channel coordinator
void . forkIO . datumWorker coordinator ch $ fromJust datumPath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do the dupTChan inside the datumWorker instead of passing it as a parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually thinking about how this works, this is a possible multi-threading bug. Let me fix it first :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. So there is no bug.

Moving the channel creation in the function would mean I need to define an inner loop function which I preferred not to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

void . forkIO . datumWorker coordinator ch $ fromJust datumPath
when (isJust utxoPath) $ do
ch <- atomically . dupTChan $ _channel coordinator
void . forkIO . utxoWorker coordinator ch $ fromJust utxoPath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Comment on lines +216 to +228
:: Maybe FilePath
-> Maybe FilePath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach does not seem extensible for multiple indexers. Why not pass a list of workers? They seem to have a similar interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can probably be done, but it's not trivial. You need to provide specific configurations for each indexer, and the startup may also be different (not the case here). The issue mentioned 2 indexers and I am not sure if supporting multiple indexers is on the bird path, but if it is, it should be a separate issue, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

-> TChan (ChainSyncEvent (BlockInMode CardanoMode))
-> FilePath
-> IO ()
datumWorker Coordinator{_barrier} ch path = Datum.open path (Datum.Depth 2160) >>= innerLoop
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The security param 2160 shouldn't be hardcoded right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why it shouldn't be hardcoded in this case. Would you like to expose it as a command-line argument? One for all indexers, or one command-line argument for each? I can write the code if you think it's important to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the security param to be same for all indexers in order to guaranty that rollbacks will only modify the data that is in memory.
I'll add a story about querying the security param from the local node directly (without needing to pass it as a CLI param).

Copy link
Contributor

@andreabedini andreabedini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if making the two indexes explicitly concurrent is worth the effort. If we keep concurrency implict (because most of the computation is pure) the entire PR boils down to

withStream bla bla $ L.impurely S.foldM_ (mconcat [consumer1, consumer2]) 

It's not clear to me that explicit concurrency brings an advange in terms of performace. I have found some discussions 1 2 about SQLite and multithreading but I haven't dug into it yet.

AFAIU each indexer opens a new sqlite database so there should not be any contention at the connection level, is this right?

I have noticed you use Control.Concurrent.STM.TChan (like I did in plutus-streaming). This is an unbounded channel and it's memory usage will grow without limits if the producer is faster than the consumer. I believe this is likely here. Can you either use a bounded channel or make sure this arbitrary grow doesn't happen in other means?

@andreabedini andreabedini dismissed their stale review June 14, 2022 05:28

Dismissing my "request changes" to not block author while I am not available

@raduom raduom force-pushed the raduom/a-story-of-two-indexers branch from a130a76 to 6a1b6e1 Compare June 14, 2022 05:30
@raduom
Copy link
Contributor Author

raduom commented Jun 14, 2022

I am wondering if making the two indexes explicitly concurrent is worth the effort. If we keep concurrency implict (because most of the computation is pure) the entire PR boils down to.

One of the main complaints that the users of the PAB / chain-index have is related to scalability / efficiency of our implementation. This is why I opted for a slightly less composable implementation, but one that uses the CPU more efficiently.

It's not clear to me that explicit concurrency brings an advange in terms of performace. I have found some discussions 1 2 about SQLite and multithreading but I haven't dug into it yet.

I think it does as you are running two indexers in parallel (CPU usage is quite high on all indexers, so parallelizing it to several cores seems like a win). The only reason why this would not be the case is if the disk is not fast enough, but I would not think that to be the case.

We could test it though, but I would suggest we do that in another issue if anyone thinks the above reasoning is not very sound.

AFAIU each indexer opens a new sqlite database so there should not be any contention at the connection level, is this right?

Right.

I have noticed you use Control.Concurrent.STM.TChan (like I did in plutus-streaming). This is an unbounded channel and it's memory usage will grow without limits if the producer is faster than the consumer. I believe this is likely here. Can you either use a bounded channel or make sure this arbitrary grow doesn't happen in other means?

The TChans are artificially bounded by the semaphore. I have also tested the implementation for memory leaks on the main net and (after fixing the streaming memory leak) everything runs smoothly.

@raduom raduom force-pushed the raduom/a-story-of-two-indexers branch from a1e92fe to 92fdf6e Compare June 14, 2022 09:25
@andreabedini andreabedini self-requested a review June 14, 2022 09:27
@raduom raduom requested a review from koslambrou June 14, 2022 20:21
@raduom raduom force-pushed the raduom/a-story-of-two-indexers branch from 92fdf6e to 265f074 Compare June 15, 2022 15:24
@raduom raduom merged commit f20efc3 into IntersectMBO:main Jun 16, 2022
@raduom raduom deleted the raduom/a-story-of-two-indexers branch June 16, 2022 05:07
koslambrou pushed a commit that referenced this pull request Jun 22, 2022
* Remove dependency on quickspec.

* Extract inputs and output from block

* Implement handlers.

* Remove dependency on fromCardanoTx.

* Add some lost queries.

* Add indexer options.

* Add options for running the indexers.

* The two indexers should now run in parallel.

* Fixed a concurrency bug.

* Add some comments.

* Format plutus-chain-index.cabal
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants