feat(badger): add support for Badger version 4. The default remains Badger version 2 to ensure backward compatibility. #12316

Open

wants to merge 66 commits into base: feat/faster-datastore-with-badger

Conversation

snissn
Contributor

@snissn snissn commented Jul 27, 2024

Performance improvement: 12m16.290s vs 19m51.546s to import the chain.

time LOTUS_CHAINSTORE_BADGERVERSION=4 ./lotus daemon --remove-existing-chain --halt-after-import --import-snapshot ../forest_snapshot_mainnet_2024-07-23_height_4113378.forest.car.zst
real	12m16.290s
user	34m10.020s
sys	7m31.235s

 du -h ~/.lotus
1.7G	/home/mikers/.lotus/sqlite
12K	/home/mikers/.lotus/keystore
228K	/home/mikers/.lotus/datastore/metadata
119G	/home/mikers/.lotus/datastore/chain
119G	/home/mikers/.lotus/datastore
19M	/home/mikers/.lotus/heapprof
35M	/home/mikers/.lotus/journal
121G	/home/mikers/.lotus
time LOTUS_CHAINSTORE_BADGERVERSION=2 ./lotus daemon --remove-existing-chain --halt-after-import --import-snapshot ../forest_snapshot_mainnet_2024-07-23_height_4113378.forest.car.zst 
real	19m51.546s
user	23m37.446s
sys	3m45.004s
 
du -h ~/.lotus
1.7G	/home/mikers/.lotus/sqlite
12K	/home/mikers/.lotus/keystore
232K	/home/mikers/.lotus/datastore/metadata
122G	/home/mikers/.lotus/datastore/chain
122G	/home/mikers/.lotus/datastore
19M	/home/mikers/.lotus/heapprof
35M	/home/mikers/.lotus/journal
123G	/home/mikers/.lotus
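
For reference, computed from the timings and sizes above (rounded): (19m51.546s − 12m16.290s) / 19m51.546s = 455.3s / 1191.5s ≈ 38% less wall-clock time to import with Badger v4, and the chain datastore comes out about 3 GB (≈2.5%) smaller (119G vs 122G).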

Related Issues

Proposed Changes

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • Update CHANGELOG.md or signal that this change does not need it.
    • If the PR affects users (e.g., new feature, bug fix, system requirements change), update the CHANGELOG.md and add details to the UNRELEASED section.
    • If the change does not require a CHANGELOG.md entry, do one of the following:
      • Add [skip changelog] to the PR title
      • Add the label skip/changelog to the PR
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

TODO at the moment:

@snissn snissn changed the title from "feat: add support for badger version 4 while keeping version 2 support" to "feat: Add support for Badger version 4. The default remains Badger version 2 to ensure backward compatibility." on Jul 27, 2024
@snissn
Contributor Author

snissn commented Aug 13, 2024

thanks @ZenGround0 I think this is in a good place to be merged. I don't have merge access but please use the squash option to bring this PR in as a single commit when it is merged.

Member

@BigLep BigLep left a comment

I'm no Lotus veteran so apologies in advance for any newbie questions or mistakes.

A few things:

User communication

  1. I think we need to expand the changelog entry. I put down some potential text.
  2. Is there a stronger way to communicate this is experimental? I'm wondering if the config key and env variable can be prefixed with "Experimental". That gives a clear signal and then also gives maintainers liberty to change config keys/semantics for when it's not experimental.
  3. I think it would be good to have a "Add badger v4 support" tracking issue. A first task can be adding experimental support (which this PR does) but it can also list any followup tasks we need to do to fully add Badger v4 support. It also serves as a good place to store the rationale for this work, benchmarks, etc. rather than having it embedded in PR comments.
  4. I think it would be good to have @rjan90 review/approve, at least from the user communication/rollout perspective.

KV store improvements

  1. I understand the suggestion to not overload this PR - sounds good. Do we have a backlog item tracking the space improvement suggestions?
  2. No pressure, but is this something you're planning to do a fast-followup with?
  3. My understanding of @ribasushi 's point is that if we're already going to be asking users at some point to do a migration when they update to badger v4, we should bundle in other things that also require a migration (e.g., key changes). I'm wondering: how is this going to play out in time? I'm assuming the user message will be: anytime you change the value of one of these flags (LOTUS_CHAINSTORE_BADGERVERSION or LOTUS_CHAINSTORE_KEY_OPTIMIZATION), one needs to set up a new datastore. If we ever want to change the default values then we'll need to provide a migration path (ideally automated while keeping a backup). I just want to make sure we're thinking ahead for how we minimize user impact and account for the total work to fully complete this so we can remove old code paths. (I think this is the kind of discussion/planning that would be good in an overarching issue.)

CHANGELOG.md Outdated
Member

I think this needs expansion. Maybe badger v4 should be its own section? Maybe something like:

Experimental Badger v4 Support

With filecoin-project/lotus#12316, users can now opt-in to using Badger v4 instead of v2 for the datastore.

Why upgrade?

  • Performance improvement - Initial benchmarks showed 40% faster time to import a snapshot. (I'm making stuff up here...) I/O and CPU utilization anecdotally reduced by 10% on the same workload (link to Grafana).
  • Active development - Badger v2 was released ~4 years ago, but v4 is where the community's active development is happening to improve disk read/write times and memory efficiency.

How to upgrade?
The v2 and v4 datastores are incompatible. Because Badger datastores are plain directories, it's advised to first copy your v2 datastore. Then enable v4 with LOTUS_CHAINSTORE_BADGERVERSION=4, download a recent mainnet snapshot (link to snapshot directory), and import it using the lotus command.

If you run into any problems, please report them by opening an issue. You can also roll back with LOTUS_CHAINSTORE_BADGERVERSION=2 and by copying back the v2 directory.
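
A minimal sketch of those manual steps, assuming the default ~/.lotus repo path and an illustrative snapshot filename (mainnet_snapshot.car.zst):

# Badger datastores are plain directories, so back up the existing v2 chain store first
cp -r ~/.lotus/datastore/chain ~/.lotus/datastore/chain.v2.bak

# start with Badger v4 and repopulate the empty blockstore from a snapshot
LOTUS_CHAINSTORE_BADGERVERSION=4 ./lotus daemon --remove-existing-chain --import-snapshot ./mainnet_snapshot.car.zst

# to roll back: stop the daemon, restore the backed-up v2 directory, and start with LOTUS_CHAINSTORE_BADGERVERSION=2 again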

# versions can only be done when the blockstore is empty and will be repopulated from a snapshot
# or chain sync. It cannot be upgraded or downgraded with existing data and there is currently
# no automatic migration tooling.
# This is an experimental feature and should not be used in production.
Member

I think it would be good to link to the tracking issue where someone can go for finding updates on what is remaining for moving this out of experimental.

@snissn
Contributor Author

snissn commented Aug 13, 2024

@BigLep Thank you for your feedback and suggestions! I appreciate your guidance as I navigate these areas—I'm still learning, so your input is invaluable.

User Communication

  1. Expanding the Changelog Entry
    I understand where you're coming from. After reviewing the general format of the CHANGELOG.md, it seems that this PR aligns with the typical entries. However, I find your notes very insightful. As we gather positive feedback on the Badger upgrade from the community, it would be great to collaborate on enhancing our documentation strategy to make the rollout as smooth as possible.

  2. Communicating Experimental Features
    I agree that it's crucial to signal the experimental nature of this feature clearly. I believe the current setup already makes this quite evident, especially with the recent changes by @rvagg, which include prominent warnings. The config.toml that's generated by lotus config default now has this text:

[Chainstore]
  # EXPERIMENTAL FEATURE. USE WITH CAUTION
  # BadgerVersion switches the version of the Badger database engine. The default is 2 which has
  # been well-tested and widely deployed. Switching this to version 4 will enable the new version
  # of Badger, but the blockstore will be incompatible with Lotus running with version 2. Switching
  # versions can only be done when the blockstore is empty and will be repopulated from a snapshot
  # or chain sync. It cannot be upgraded or downgraded with existing data and there is currently
  # no automatic migration tooling.
  # This is an experimental feature and should not be used in production.
  #
  # type: int
  # env var: LOTUS_CHAINSTORE_BADGERVERSION
  #BadgerVersion = 2

But I’m always open to refining this if needed!

  3. Tracking Issue for Badger v4 Support
    That’s a great idea. I think creating a tracking issue will be very beneficial once we have some feedback on Badger v4’s performance and reliability. This way, we can plan our next steps more effectively.

  4. Involving @rjan90
    Definitely, having @rjan90 review this would be valuable, particularly from the perspective of user communication and rollout.

KV Store Improvements

  1. Tracking Space Improvements
    I’m not sure if there’s a backlog item for this or not.

  2. Follow-up on Space Improvements
    I can't commit to following up quickly, and I'd prefer to keep this discussion focused on the current PR for now.

  3. Migration Planning
    You raise an excellent point about the importance of minimizing user impact during migrations. I think this is something we should consider carefully, especially as it relates to the timing and communication around these changes. Depending on the user’s setup, the approach might vary significantly, so planning ahead is essential to ensure a smooth transition for everyone.

@rvagg
Member

rvagg commented Aug 13, 2024

I'm also concerned about the key efficiency improvements coming on top of this as a separate PR after this is merged. It puts two separate hard-breaking changes in master. If we're doing both then we should do them both pretty closely together, and ideally at the same time. At a minimum I'd like this not to go into a release without the key efficiency improvements also going with it. Unless we come to an agreement that we're not going to try and achieve that any time soon. Reading sentiment and adding my own to it suggests we would like to get it done asap since we have this opportunity with the breakage here.

I agree that tying that functionality to this PR makes it too large and less focused, and it's unfair to lump that burden on @snissn here, but maybe this is a case for a feature branch where we can review and land multiple pieces before they make it into master and accidentally into separate releases.

@BigLep
Member

BigLep commented Aug 14, 2024

Thanks @snissn . Responding to a few things:

User Communication

Expanding the Changelog Entry: it seems that this PR aligns with the typical entries

By the time we get to doing a release, certain newly-added features have text added in. https://github.com/filecoin-project/lotus/blob/master/CHANGELOG.md#v1281--2024-07-24 is an example with F3. I accept that F3 is bigger than a Badger update, but we want to avoid having to do documentation editorializing in the crunch of trying to get a release out. I think it's something that needs to be thought about and done as part of the code writing process.

Communicating Experimental Features

I agree that the config.toml docs have good disclaimers, but someone doesn't see that when they use environment variables, and it also doesn't give a clear signal that things may change before we make it non-experimental. If we get this into the name (LOTUS_EXPERIMENTAL_CHAINSTORE_BADGERVERSION / Experimental.BadgerVersion) this is very clear even if they don't read the doc comments, and it gives a clear break for when we release a non-experimental version.

(I realize this is a larger discussion about how experimental features are handled in general in Lotus. I don't know the history here, and so this is something that would be good to have Lotus maintainers chime in on. My bias is coming from what I saw be clear for users in projects like Kubo (example).)
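
For illustration only (this prefixed naming is just the suggestion above, not something this PR implements), opting in would then look something like:

# hypothetical experimental-prefixed env var, mirroring LOTUS_CHAINSTORE_BADGERVERSION
LOTUS_EXPERIMENTAL_CHAINSTORE_BADGERVERSION=4 ./lotus daemon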

Tracking Issue for Badger v4 Support

I don't think we need/should wait for feedback on Badger v4's performance and reliability. This is a big change, and thus ideally I think this PR should have been opened with an issue explaining the work and providing a central place to capture decision points, findings, next steps, etc.

KV Store Improvements

It sounds like maintainers need to:

  1. Create a backlog item
  2. Determine how interlinked this badger update needs to be with the KV store improvements (i.e., a feature branch where both workstreams happen, or allow them to land in master separately but off by default)
  3. Determine a rollout strategy. I think some of the key things here are:
    1. Is there one config flag (ENABLE_AWESOME_DATASTORE_IMPROVEMENTS) which couples the Badger update + key shrinkage improvements, or multiple config flags (one for each)?
    2. What would be the path for getting rid of old code paths? Would we do some sort of automated migration on startup of a new Lotus version, provide written manual directions that need to be followed, etc.?

If there are Lotus maintainers at the 2024-08-14 maintainer call today, I'll see if I can get answers or input on any of the above and ensure there is clarity on the next steps for landing this PR (and where to land it to).

@BigLep
Copy link
Member

BigLep commented Aug 15, 2024

Below are some notes I took from Lotus maintainer conversation on 2024-08-14. Others can fill in if I'm misrepresenting:

  1. It's great to see experimentation and work to make the datastore better as this a known major pain point for users. We want to see contributions here and want to see this area of the code get better. Thank you @snissn for working to move the ball forward! It would be great to see other experiments by the community here as well (e.g., Postgres-backed datastore).
  2. The snag we're running into is whether the disruptiveness of this change is worth it. If this didn't require a user to do a manual migration, this would be easy to approve and merge. After a limited time window, we could remove old code, keep the new code, and users would benefit while not increasing the maintenance burden on maintainers.
  3. To determine whether this change is worth it, we'd also want to look at:
    • How much does it impact splitstore compaction speed?
    • How much does v4 impact runtime CPU and memory?
    • Is the execution time of tipsets impacted in a material way? (although this isn't really a user concern)
      It sounded like there was maybe an A/B-testing framework that you have used in the past with Lotus, @snissn?
      I know you provided the snapshot import improvement times, but that is a pure Go pathway, and we don't think it is representative of other important scenarios.
  4. Related to the previous point, we want to make sure we're clear on: "who is the user that is really benefiting enough to warrant the time for this feature?" Yes, faster snapshot import times are nice, but our understanding is that that hasn't been a make-or-break issue for users. (Please correct if you've observed otherwise.) This is why we raised the splitstore compaction time measurement, as improving that would be really nice for all.
  5. Because of the user and maintenance burden that this change brings, a larger narrative of the benefits and the rollouts steps is needed so we can look at it to help answer is it worth it. I can understand that help from maintainers may be needed to help generate that and think this through. Given maintainers are tied up with other areas currently but also to help this PR land rather than stay open while additional testing and followups are done, we think @rvagg's idea of a feature branch makes sense here. Maybe we name it feat/faster-datastore-with-badger? This is also where other improvements like key shrinkage could land.

Thanks again @snissn .

@rvagg
Member

rvagg commented Aug 15, 2024

feat/faster-datastore-with-badger has been made; this PR could be retargeted.

Re feature branch and how to proceed, here's some additional thoughts on how we can be productive with that:

  • We should review PRs into that branch like we do with master because any eventual merge into master is likely going to be very large and maybe impossible given its eventual scope. Ongoing reviews for incremental changes would help alleviate that burden by building trust in the branch over time.
  • Likewise, keeping the branch updated will be helpful.
  • As @BigLep suggests, we need to make sure that this moves the needle enough to warrant the additional maintenance and support burden of having yet another permutation of possible Lotus configuration in the wild. So we need to work on quantifying that a bit more, perhaps there's enough here across the scenarios that matter to make it worth a push.
  • A migration from 2 -> 4 would certainly help matters and may become necessary for a path forward here so we don't have perpetual v2 support if we deem v4 worthwhile.
  • Getting feedback from people using this would be really helpful; we have such a variety of node users out there that there may be a subset for whom this is a significant enough win to justify pushing further toward a release.
  • The key efficiency improvements may make this worthwhile, since we can bundle a hard-break in blockstore with significant enough improvements to warrant the pain: do smaller keys make the in-memory indexes small enough to make a dent on concerns of larger node operators? Archival node operators in particular have the problem with badger that the indexes don't fit in memory and badger apparently insists on swapping them in and out when it needs to go fishing.
  • As @Stebalien was pointing out in our discussion about this recently, for average users, most blockstore usage comes via the Rust layer, which is an inversion of the original Lotus design: Badger was a reasonable option when cgo was being avoided, but now we have Rust code that needs to call back into Go just to load data from the blockstore. This means that the ultimate "fix" for Lotus' blockstore woes exists in a non-Go blockstore that Go has the burden of calling in to. This may not make life better for archival node operators, but there may be an alternative path there where they get an external blockstore that they can scale via traditional means (e.g. postgres).

@ZenGround0
Contributor

@BigLep @rvagg glad to see you guys thinking about integration holistically and seriously.

However I think we are overthinking it. A very standard development usage of lotus is 1) sync from snapshot, 2) run for a while, 3) lose sync, tear down, go back to 1. And SPs are good at spinning lotus up and down. We're really overestimating the utility of database migration from my pov. I think it would be a waste of work to write a migration, personally.

I don't think we need to block on making v4 even better, this is an experimental feature, why can't we just break v4 badger support in the future by making breaking improvements? The feature is clearly marked experimental. This feature is totally independent of the default so if someone wants to try v4 badger why stop them?

I don't think we need to do so much work quantifying this experimental feature until we commit to working on it ourselves. It makes sense to do this investigation before we commit to working on this. We're going to get organic feedback from people in the network only if we make the change available.

What about saying "if this causes more than 2 unhappy user reports we revert this" to limit maintenance overhead?

And hey maybe I'm underthinking it. To me it just feels like we're indefinitely stopping merging a complete-enough open source contribution without great reason.

@ZenGround0
Contributor

A little added context: @snissn is trying to get more performance-testing information from RPC providers, and getting an experimental feature on master makes the process of convincing them to test a lot easier.

@trruckerfling

Great discussion above. It seems we all agree early user feedback is needed. Can we merge this as an experimental feature so we can start getting early user feedback?

We can decide on migration as needed after that early feedback comes in. Concretely, Ankr and Protofire seemed keen to try this when we last spoke a month ago, and Mike can share this with them once merged.

@BigLep
Member

BigLep commented Aug 16, 2024

Thanks all for the comments here. Some thoughts below synthesized from conversations I've had more recently (meaning I can stand behind anything written here but any utility was likely due to others):

It seems like the target user we're hoping to benefit here is RPC providers. A couple of things on this:

  1. There is other/separate ongoing work seeking to benefit this userbase as well in Meta Issue: Fixing high impact correctness and performance problems in ETH RPC API for snapshot synced nodes #12293. This was deemed to be the most impactful single area to apply limited maintainer time.
  2. Does the RPC provider user care about having to restart from a snapshot? I'm assuming not, but I wanted to check as that helps gauge how heavily we need to weigh the "breakage" this causes if it's enabled. I saw @ZenGround0 say "And SPs are good at spinning lotus up and down". I wanted to make sure that applies to RPC providers too.

Maintainers are aware of other ways to improve RPC provider experience as well beyond this badger update, including things like:

  1. disable fsync on the chainstore: perf: chain: Disable fsync on the chainstore by default #7965
  2. reduce the index size (discussed in this PR)
  3. disable mmap

It is encouraging to see people pick up the torch on the blockstore problem. The concern is that we're poking at the edges which still consumes non-trivial time and bandwidth for marginal improvements to something that’s just structurally bad. The worry is that minor incremental fixes will increase the permutations of how people run lotus, which will make maintainer work harder in the long run. For example, "are you running badger 2 or 4? are you running the version with fsync defaulting to off or have you set it to on? is mmap enabled on your node? do you have eth API enabled? are you running with historical events? are you saving messages to the blockstore? what’s your splitstore mode?”

If doing any combination of things can be shown to make a meaningful difference to the life of node operators, then that would be great news and would give weight to scheduling more time to take a proper look at this and think holistically about integrating it back into master/main in a non-experimental way. Without better info, there isn't maintainer conviction to see this moving the needle enough on the “lotus sucks” meter, or adjusting the “filecoin is a hard chain to run” sentiment, or dispelling the “forest can do the same work with a fraction of the resources” meme.

Knowing that a holistic strategy for dealing with the blockstore problem is likely what's needed, but that there's no one committed in 2024Q3 to make that happen, I still think the dedicated feature branch path is a good compromise. I think the couple of RPC providers like Ankr and Protofire who are keen to experiment should be able to. Hopefully we can get some good news from those experiments to help inform priorities later in the year? Does that work @snissn and keep you sufficiently unblocked?

rvagg and others added 3 commits August 19, 2024 17:25
…lease) (#12400)

* fix: lotus-miner: remove provecommit1 method (#12251)

* remove provecommit1

* add changelog

* update precommit and commit params

* fix lint error

* fix commit params

* dep: f3: Update go-f3 to 0.0.6, enable it on mainnet (#12295)

* Update go-f3 to 0.0.6

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>

* Enable F3 in passive configuration in mainnet config

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>

* Add changelog

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>

* add new butterfly assets

---------

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>
Co-authored-by: Jennifer Wang <jiayingw703@gmail.com>

* retract v1.28.0

* update v1.28.0 changelog and add v1.28.1

* Update CHANGELOG.md

* wip - update f3

* don't convert bigint type

We now use the same one in GPBFT.

* update docs

* fix wrong param name

* update butterfy assets

* update go-f3

* update changelog

* update version

* fix typo

* Update CHANGELOG.md

Co-authored-by: Steven Allen <steven@stebalien.com>

* Update CHANGELOG.md

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update CHANGELOG.md

Co-authored-by: Rod Vagg <rod@vagg.org>

* apply f3 patch

* chore: bump versions and make gen/docsgen-cli

chore: bump versions and make gen/docsgen-cli

* chore: update v1.28.2 changelog

chore: update v1.282. changelog

* feat: f3: update go-f3 to 0.2.0 (#12390)

* Update go-f3 to 0.2.0

Includes:
 - fix for excessive bandwidth usage
 - significant performance improvements
 - minor consensus fixes

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>

* add changelog

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>

* chore(f3): update to final released version

---------

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>
Co-authored-by: Steven Allen <steven@stebalien.com>

* fix!: sealer: handle initialisation error without panic

storage/pipeline.NewPreCommitBatcher and storage/pipeline.New now have an additional
error return to deal with errors arising from fetching the sealing config.

* add breaking API upgrade warning to the ChangeLog

* NewCommitBatcher now has an additional
error return to deal with errors arising from fetching the sealing config.

* fix: miner: Fix DDO pledge math (#12341)

* Power is units of Space * Time so multiply by deal duration

* fix: miner: Fix DDO pledge math

* appease the changelog checker

* Fix gen

---------

Co-authored-by: zenground0 <ZenGround0@users.noreply.github.com>

* chore: fix lint error

- Updated the logging statement in `testOutOfGasError` to correctly reference `build.BlockGasLimit` instead of `buildconstants.BlockGasLimit`.

* fix: update changelog to reference bandwidth issue ticket

fix: update changelog to reference bandwidth issue ticket

* Update CHANGELOG.md

Co-authored-by: Steve Loeppky <biglep@filoz.org>

* Update CHANGELOG.md

* chore: make gen and make docsgen-cli

Run `make gen` and `make docsgen-cli`

---------

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>
Co-authored-by: LexLuthr <88259624+LexLuthr@users.noreply.github.com>
Co-authored-by: Jakub Sztandera <oss@kubuxu.com>
Co-authored-by: Jennifer Wang <jiayingw703@gmail.com>
Co-authored-by: Jiaying Wang <42981373+jennijuju@users.noreply.github.com>
Co-authored-by: Steven Allen <steven@stebalien.com>
Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: aarshkshah1992 <aarshkshah1992@gmail.com>
Co-authored-by: Łukasz Magiera <magik6k@users.noreply.github.com>
Co-authored-by: zenground0 <ZenGround0@users.noreply.github.com>
Co-authored-by: Steve Loeppky <biglep@filoz.org>
This exposes bandwidth metrics via async callback to avoid
allocating/reporting metrics on any hot-paths. I'm using open telemetry
as we've already setup a bridge for F3 and opencensus is deprecated in
favor of open telemetry (so we're going to slowly move over anyways).
@snissn snissn changed the base branch from master to feat/faster-datastore-with-badger August 19, 2024 20:23
@snissn
Contributor Author

snissn commented Aug 19, 2024

Hi - I have updated the PR to target feat/faster-datastore-with-badger.

I hope that helps!!

qwdsds and others added 3 commits August 20, 2024 10:06
* chore: update Node version

chore: update Node version

* chore: cleanup unreleased changelog section

chore: cleanup unreleased changelog section
@BigLep
Member

BigLep commented Aug 23, 2024

Circling back to next steps now that I'm back...

My understanding is that @snissn is doing this work currently as an extra/side project outside of the context of a particular team's priorities and would like to get feedback on how this change performs in a production setting.

It's to this end that the idea of a feature branch was suggested (and has been created).

Regardless of the branch though, the more pertinent question is how the branch stays updated. Given Lotus maintainers aren't focused on this more holistic blockstore effort currently (comment above), they will leave it up to those interested in experimenting here to keep up with master or the latest release / production release.

(In retrospect it may have been better not to do this as a "feature branch" as that maybe gives the impression that this feature is more official than it is currently. But I guess having a feature branch enables actually merging this PR vs. leaving it open indefinitely. 🤷)

Whether in a feat branch or mikers/BadgerVersions, I think it's important to communicate with anyone who is trying this functionality out that:

  1. There currently isn't commitment or plan for how this gets into master. I wouldn't want someone taking time to experiment here to think that we're on the verge of getting this into an official release.
  2. Data that will help with this decision (but is not the only consideration) is a production operator showing CPU, memory, and compaction speed of some nodes using v2 vs. those using v4.
  • Ideally this would be done as an in-parallel A/B test.
  • (If there are other impactful metrics that operators observe, those are welcome too.)
  • Let's capture this info in the currently-missing issue justifying/explaining this work.

@rvagg: if you're good with the above, can you please take on merging?

@trruckerfling

Following up here:

  • user testing
    • @snissn spoke with Ankr and Protofire to agree to test this and share feedback (thanks all!)
    • have also tagged on the maintainer questions from @BigLep @rvagg and added y'all, so feel free to follow there
    • ideally, this gives us clarity into v4 upgrade need and readiness
  • feature branch maintenance
    • this PR has been retargeted onto the feature branch as requested. Unfortunately, we can't commit bandwidth to maintain this branch
    • depending on user testing feedback, we can help with updates to this PR as needed

BigLep and others added 3 commits August 28, 2024 11:43
* docs: updates about branches and where to target PRs

CONTRIBUTING.md: calling out the release flow doc earlier and more of why a contributor should read it.
LOTUS_RELEASE_FLOW: being more precise and organized about branch strategy.
RELEASE_ISSUE_TEMPLATE: added comment on what we expect to see when backporting from release branch to master.

* Update LOTUS_RELEASE_FLOW.md

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update LOTUS_RELEASE_FLOW.md

Co-authored-by: Rod Vagg <rod@vagg.org>

* Incorporating review feedback.  Stripped colors and made clear that we're cherry picking.

* Removed one more color reference

---------

Co-authored-by: Rod Vagg <rod@vagg.org>
@BigLep
Member

BigLep commented Sep 4, 2024

Next steps per 2024-08-28 sync between FilOz and Fil-B teams:

  • @rvagg to review so this can be merged into a feature branch
  • A strong signal is then needed from infra providers that this is indeed worth pursuing further to get it into master. (This has been discussed in more detail earlier in the PR.)

Additional notes:

  • @snissn is the only person so far who has raised their hand to try and drive it forward with infra providers, but Lotus maintainers will keep this feature in mind as they engage with infra providers as well.
  • Lotus maintainers haven't committed to keep this feature branch updated.
