Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove node streams-based ingest code to prepare for post-ingest notifications #485

Merged
merged 9 commits into from
May 8, 2023

Conversation

jwalgran
Copy link
Collaborator

@jwalgran jwalgran commented May 4, 2023

Related Issue(s):

Proposed Changes:

  1. Remove logic from ingestItem and have it call ingestItems with a single-item list.
  2. Remove node streams-based ingest by extraction functionality into lib/ingest.js and calling it directly from ingestItems
  3. Remove batch processing functionality (at least temporarily)

My initial approach to preparing the code to support post-ingest notifications was to make a minimally invasive change that expanded the existing node streams-based implementation. Theoretically converting the Writable subclass to a Duplex subclass so that items would be written to the DB and then the results would be made available to downstream consumer ( eventually an SNS notification) would be small in scope.

I abandoned this approach for a number of reasons.

  • Unresolved hanging problem in the Duplex implementation. The first item would be processed but failed to move forward.
    • Google searching did not lead to an effective solution
    • I also tried using ChatGPT to suggest solution that resulted in the same hanging issues
  • The current project maintainers do not have a regular working knowledge of the node streams API
  • The stream processing switched between single and bulk operations in a way that is was not directly controllable. Code within the node streams API decides when to call _write to process a single item or _writev to process multiple items.
  • Streams are most helpful when dealing with large inputs because they allow avoiding the need to hold the entire set of data to be processed in memory. In this case the data to be processed is already all in memory.
  • The INGEST_BATCH_SIZE variable was misleading. It was only used to set highWaterMark of the SearchDatabaseWritableStream

This PR removes the node-stream based processing by extracting functionality from the lib/databaseStream module and moving it into lib/ingest where it can be called directly.

Unmerged PR #131 was a valuable reference but I did not adapt it directly because it was opened against a much older version of the codebase and it mixed removing the node streams-based implementation with SNS notifications. Follow up PRs to add this notification functionality will likely borrow more heavily from this previous attempt.

ingestItem and ingestItems differed only in their logging, so to reduce duplication I have removed all the logic from ingestItem and it now just calls ingestItems

Batch processing is removed from this implementation. ingestItems saves items to the database sequentially, one at a time. Allowing items to be created in parallel without using the bulk API resulted in test failures. If this is a significant performance issue in a high-volume setting it can be restored in a future pull request. Though not a load test, I did time the running of ./bin/system-tests.sh 3 time on both the main branch and this branch as a minimal test that this did not change performance in small-load situations.

main
real 1m4.807s
real 0m58.459s
real 0m59.376s

jcw/remove-stream-based-ingest
real 0m57.250s
real 0m57.398s
real 0m58.861s

PR Checklist:

  • I have added my changes to the CHANGELOG or a CHANGELOG entry is not required.

@jwalgran jwalgran marked this pull request as ready for review May 4, 2023 21:03
@jwalgran jwalgran marked this pull request as draft May 4, 2023 21:04
jwalgran added 7 commits May 4, 2023 14:04
All the functionality has been extracted from this module so it can now be
removed.

We replicate the sequential processing of the previous stream-based
implmenetation by adding the `asyncMapInSequence` helper.

This commit does not reimplement the `bulk` feature of the database client
previously used by the `_writev` mentod in the writable stream subclass.
@jwalgran jwalgran force-pushed the jcw/remove-stream-based-ingest branch from 1c6559b to 3b0ae0b Compare May 4, 2023 21:13
@jwalgran jwalgran marked this pull request as ready for review May 4, 2023 21:14
@jwalgran jwalgran requested a review from philvarner May 4, 2023 21:14
src/lib/ingest.js Outdated Show resolved Hide resolved
jwalgran added 2 commits May 8, 2023 10:43
This single item convenience menthod is not use anywhere in the codebase.
@jwalgran jwalgran force-pushed the jcw/remove-stream-based-ingest branch from 3b0ae0b to 81b9ded Compare May 8, 2023 17:43
@jwalgran
Copy link
Collaborator Author

jwalgran commented May 8, 2023

Thanks for the review @philvarner. I pushed 0d22f41 with removes ingestItem. It turns out this was not used anywhere in the codebase so there was no need to change any other code.

I implemented a fix for #486 in separate PR #492 and targeted it at this branch, just to clearly separate the two.

Both this PR and #492 are ready for review.

@jwalgran jwalgran merged commit a833b98 into main May 8, 2023
@jwalgran jwalgran deleted the jcw/remove-stream-based-ingest branch May 8, 2023 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants