feat: customizable fd_limit as env var #10962

Closed
wants to merge 56 commits

Conversation

rtsainear

This PR introduces the ability to customize the file descriptor (FD) limit for the neard process through an environment variable. Currently, neard has a hardcoded file descriptor hard limit of 65,000. While this default accommodates typical operation, edge cases such as the resharding event have demonstrated that archival nodes can exhaust this limit by opening multiple RocksDB instances, leading to "too many open files" errors.

Motivation
During intensive operations like resharding, archival nodes encounter the FD hard limit, resulting in failure due to "too many open files" errors. Providing operators with the flexibility to adjust the FD limit as needed based on their configuration and operational demands will enhance the stability and adaptability of neard.

Changes
Environment Variable FD_LIMIT: Operators can now specify an FD_LIMIT environment variable to set the file descriptor limit, offering a way to increase it beyond the default 65,000 when necessary.
Default Behavior: In the absence of this environment variable, the file descriptor limit will remain at the default setting of 65,000. This ensures backward compatibility and maintains current performance expectations under standard operational conditions.
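The behavior described above can be sketched as follows. The `FD_LIMIT` variable name and the 65,000 default come from this PR; the function names and the fallback-on-invalid-value behavior are illustrative assumptions, not neard's actual implementation, and the `setrlimit` call itself (which would need the `libc` or `rlimit` crate) is only mentioned in a comment.

```rust
use std::env;

/// Default hard limit neard has used so far.
const DEFAULT_FD_LIMIT: u64 = 65_000;

/// Parse a raw FD_LIMIT value, falling back to the default when it is
/// absent or not a valid number. (Hypothetical helper for illustration.)
fn fd_limit_from(raw: Option<&str>) -> u64 {
    raw.and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(DEFAULT_FD_LIMIT)
}

/// Read the desired FD limit from the FD_LIMIT environment variable.
fn desired_fd_limit() -> u64 {
    fd_limit_from(env::var("FD_LIMIT").ok().as_deref())
}

fn main() {
    // In neard this value would then be passed to something like
    // setrlimit(RLIMIT_NOFILE, ...); that call is omitted here.
    println!("fd limit: {}", desired_fd_limit());
}
```

With the variable unset, `desired_fd_limit()` returns 65,000, preserving backward compatibility.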

posvyatokum and others added 30 commits January 22, 2024 16:08
* Do not apply default size increase introduced in #10373 to
`view_trie_cache` since view calls are not latency-sensitive
* Opt out shard 1 from the increase since it only contains aurora
account and based on the current metrics has very low cache miss rate
even with a size of 50MB
* Configure shard 3 override of 3GB to also apply after resharding
…ic (#10468)

#9316 extracted
maybe_mark_block_invalid() as a helper function, but it changed the
behavior so that near_num_invalid_blocks is incremented even if we're
not marking a block as invalid. Fix it by putting the metric increment
back inside the if block, as it was before.
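The shape of the fix can be sketched like this. The names `maybe_mark_block_invalid` and `num_invalid_blocks` follow the text above, but the surrounding types are simplified stand-ins, not nearcore's actual chain or metrics API.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the near_num_invalid_blocks metric.
#[derive(Default)]
struct Metrics {
    num_invalid_blocks: u64,
}

#[derive(Default)]
struct Chain {
    invalid_blocks: HashSet<u64>,
    metrics: Metrics,
}

impl Chain {
    /// The metric increment lives inside the `if`, so it only fires when
    /// a block is actually marked invalid -- the pre-#9316 behavior.
    fn maybe_mark_block_invalid(&mut self, block: u64, is_invalid: bool) {
        if is_invalid {
            self.invalid_blocks.insert(block);
            self.metrics.num_invalid_blocks += 1;
        }
    }
}

fn main() {
    let mut chain = Chain::default();
    chain.maybe_mark_block_invalid(1, false); // valid block: no increment
    chain.maybe_mark_block_invalid(2, true); // invalid block: increment
    println!("invalid blocks counted: {}", chain.metrics.num_invalid_blocks);
}
```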
…lts (#10472)

the command `neard --home {home_dir} init --chain-id mainnet
--download-config` downloads the config at

https://s3-us-west-1.amazonaws.com/build.nearprotocol.com/nearcore-deploy/mainnet/config.json
and then saves a config file built from that one to
{home_dir}/config.json. This is expected to work even when particular
fields are missing by filling them in with defaults, which is why many
config fields are marked with some sort of `#[serde(default)]` or
`#[serde(default = default_fn)]`. But the `ExperimentalConfig` doesn't
fill in these defaults. The existing `#[serde(default)]` over that field
in `near_network::config_json::Config` will have us fill it in if it's
totally missing, but we get an error if it's present with some fields
missing, which is the case today:

```
Error: Failed to initialize configs

Caused by:
    config.json file issue: Failed to deserialize config from /tmp/n/config.json: Error("missing field `network_config_overrides`", line: 98, column: 5)
```

Fix it by adding a `#[serde(default)]` to each of the fields in
`ExperimentalConfig`
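The fix is a per-field `#[serde(default)]`, so a partially populated object deserializes cleanly instead of erroring on missing fields. The sketch below is a declaration fragment assuming the `serde` crate; the field names are illustrative, not nearcore's exact `ExperimentalConfig`.

```rust
use serde::Deserialize;

/// Each field carries its own #[serde(default)], so `{ "inbound_disabled": true }`
/// deserializes even though `network_config_overrides` is missing.
#[derive(Deserialize, Default)]
struct ExperimentalConfig {
    #[serde(default)]
    inbound_disabled: bool,
    #[serde(default)]
    network_config_overrides: Vec<String>,
}
```

The existing struct-level `#[serde(default)]` on the parent only helps when the whole object is absent; these field-level attributes cover the partially-present case.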
On release triggered workflow runs, the HEAD is detached and no local
branch is present.
Due to this, BRANCH var ends up as an empty string and this causes
failures to publish artifacts:
https://github.com/near/nearcore/actions/runs/7640475082/job/20815674136
++ git branch --show-current
+ BRANCH=
++ git rev-parse HEAD
+ COMMIT=c869dd6c1e942f21e27a74c7e47a698de
++ uname


this leads to incorrect S3 paths:
`s3://build.nearprotocol.com/nearcore/$(uname)/${BRANCH}/latest` ->
`s3://build.nearprotocol.com/nearcore/Linux//latest`
#10495 was meant to fix the builds
triggered by release events.
With the actions/checkout GHA action, only a single commit is fetched by
default, so the branch match is missing.
To fetch all history for all branches and tags, set fetch-depth to 0
for both the binary and docker image release jobs.

This was
[tested](https://github.com/near/andrei-playground/actions/runs/7696882172/job/20972653224)
on a private repo.
<img width="676" alt="Screenshot 2024-01-29 at 13 42 09"
src="https://github.com/near/nearcore/assets/122784628/941c1cd8-285e-4853-a9e7-a6ea6885c838">
Removing spam from info logs.
1. GC shouldn't announce itself every block in `info` mode
2. `BlockResponse` handle shouldn't spam 10 times per block
3. `EpochOutOfBounds` is a normal error in
`is_last_block_in_finished_epoch`, and we shouldn't flag it.

I haven't verified it, but it seems that restarting the node during
catchup does not resume the catchup; rather, it restarts the whole
resharding from scratch. While suboptimal, I'm not aiming to fix that
now. This PR merely moves the deletion of the state snapshot to after
catchup is finished. This way the restarted resharding can succeed.
In testnet we hit an issue during resharding with split storage nodes
where they failed to create the snapshot required for resharding. This
PR is a fix for this issue.

Link to zulip thread:
https://near.zulipchat.com/#narrow/stream/308695-pagoda.2Fprivate/topic/cutting.201.2E37.2E0/near/420092380

Creation of snapshot happens via the function call
`checkpoint_hot_storage_and_cleanup_columns` which takes a hot_store and
creates a snapshot out of it. Later we open the snapshot
`opener.open_in_mode(Mode::ReadWriteExisting)`.

The function `open_in_mode` -> `ensure_kind` -> `is_valid_kind_archive`
was the point of failure. This is evident from the log line
```
Feb 06 14:19:29 testnet-rpc-archive-public-02-asia-east1-b-fc341f59 neard[1753]: 2024-02-06T14:19:29.215599Z ERROR state_snapshot: State snapshot creation failed err=Hot database kind should be RPC but got Some(Hot). Did you forget to set archive on your store opener?
```

There are three types of nodes, RPC, legacy archival and split storage
nodes.

The function call to `get_default_kind` gets us the DbKind for the new
snapshot hot storage while opening it.

To get the snapshot storage to open properly, we need to handle the
three types of storages. We need to set the value of `archive`
appropriately which is passed to the storage opener and ensure we set
the correct storage type of the snapshot.

The table below talks about the expected values of each of these fields.

| Node type | Hot storage kind | Required snapshot storage type | archive |
| -- | -- | -- | -- |
| RPC | DbKind::RPC | RPC | false |
| Legacy Archival | DbKind::Archival | Legacy Archival | true |
| Split Storage | DbKind::Hot | RPC | false |

Where our code went wrong for split storage nodes was in the call to
`is_valid_kind_archive`: for split storage the DbKind is Hot and archive
is set to false.

The relaxed check basically states: if we are creating a
snapshot of the hot storage of a split storage node, convert it into an
RPC storage.

Testing: Ad hoc testing where I manually set the storage DbKind as Hot
and checked that the fix works.
Additionally, added an integration test that manually sets the DbKind as
Hot, Archive, and RPC and checks whether resharding happens.
…#10611)

This PR fixes the issue that occurs when we restart a node that is
doing resharding and is in the catchup phase.

High level issue: When we restart a node in the epoch when resharding is
happening, we go through the whole state_sync process from the
beginning, which includes resharding. Once building of the child trie is
completed, we then do a catchup, apply the split state to the child trie
and delete the split state changes for resharding
[here](https://github.com/near/nearcore/blob/e00dcaa72cfed35831b1e72760d21bb8152f1049/chain/chain/src/chain_update.rs#L399).

Zulip thread link:
https://near.zulipchat.com/#narrow/stream/308695-pagoda.2Fprivate/topic/Problems.20after.20resharding.20restart/near/421312417

This is the implementation of Option 1 in the thread.

The key idea here is to not delete
`DBCol::StateChangesForSplitStates` during catchup of individual blocks
but rather at the end of the catchup phase. This implies that if we
restart the node in the middle of the catchup, we would still have all
the split state information in `DBCol::StateChangesForSplitStates`, and
nothing would have been deleted.
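The deferred-deletion idea can be sketched with an in-memory stand-in for the column; the struct and method names below are hypothetical, and a `BTreeMap` stands in for the actual RocksDB column.

```rust
use std::collections::BTreeMap;

/// Stand-in for DBCol::StateChangesForSplitStates: block height -> changes.
#[derive(Default)]
struct SplitStateChanges {
    by_block: BTreeMap<u64, Vec<String>>,
}

impl SplitStateChanges {
    fn insert(&mut self, block: u64, changes: Vec<String>) {
        self.by_block.insert(block, changes);
    }

    /// Apply the changes for one block during catchup WITHOUT deleting them,
    /// so a restart mid-catchup still finds every entry.
    fn apply_for_block(&self, block: u64) -> Option<&Vec<String>> {
        self.by_block.get(&block)
    }

    /// Only once the whole catchup phase is done do we clear the column.
    fn finish_catchup(&mut self) {
        self.by_block.clear();
    }
}

fn main() {
    let mut col = SplitStateChanges::default();
    col.insert(1, vec!["change-a".into()]);
    col.insert(2, vec!["change-b".into()]);
    col.apply_for_block(1);
    // A restart here would still see both blocks' changes.
    assert_eq!(col.by_block.len(), 2);
    col.finish_catchup();
    assert!(col.by_block.is_empty());
}
```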

TODO: Testing on mocknet.
…estart (#10684)

The root cause of the resharding issue in mocknet testing seems to be
that when the node is restarted the flat storage is not recreated.
Without flat storage when applying the first block of the new epoch we
get different state roots than the nodes that do have flat storage.

This PR introduces creating the flat storage for the children shards, aka
next epoch shards. This depends on the flat storage status that is
set at the end of resharding
[here](https://github.com/near/nearcore-private/blob/1a4756e1acbebc073f312fde1457be47e1da3fc0/chain/chain/src/resharding.rs#L180).

I failed to reproduce the issue so we'll need to test it in mocknet
unfortunately. We should test all the cases (no restart, restart during
resharding, restart during catchup, restart post catchup). cc
@marcelo-gonzalez

Ideally we should also error out when trying to apply a chunk without
flat storage but I'm not brave enough to do it in 1.37 :)

I also changed shard_id to shard_uid in a few places but this needs to
happen on a wider scale. I'll do that separately too.
… shard. (#10696)

Removing the assertion and allowing flat storage to be created multiple
times for a shard. This is needed to fix an issue where the node is
restarted in the middle of resharding. The flat storage may already be
created for a subset of shards, but unless all shards are finished,
resharding will get restarted. Because the flat storage was created for
those shards, it will be created on node startup as well as after the
second resharding is finished.

This is not a perfect solution and not particularly clean. The best
alternative seems to be to implement resuming of resharding, where we
don't restart resharding for shards that have finished. That is a more
complex change, and we want to get this PR into the release, so for now
I'm sticking to the simplest approach.

This seems to be safe because even though the flat storage for children
shards is created it's not used anywhere.

Sanity check - do we ever check the existence of flat storage for a
shard for anything?
nagisa and others added 26 commits March 8, 2024 15:06
… the concurrency of the contract runtime too much (#10736)

This is a backport of #10733 for
inclusion in 1.37.x series.
This PR is to be cherry-picked into release 1.38.0 for splitting shard 2
into two parts.

Hot and kai-ching both fall on shard 2 which has been causing a lot of
congestion.

Zulip thread:
https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/constant.20congestion.20on.20shard.202/near/425367222

---------

Co-authored-by: wacban <wac.banasik@gmail.com>
…10685)

During resharding we maintain flat storage for both the parent and
children shards. In order to disambiguate and not mix the metrics for
shards with the same ids we should use shard_uid instead of shard_id.
…7.x release (#10754)

* #10476
* #10481

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
As these metrics were recently changed and continuity of data is already
broken, we can add the "near_" prefix to the prometheus metrics, which
has been missing all along.

@wacban I wonder how painful it is to add this to 1.38 for convenience?

Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
… of a in-memory cache of loaded artifacts (#9244)" (#10788)

This reverts commit ad67e6b.
The logic of the `tx_status_fetch` function is: we poll the tx_status
method and wait for the desired level of tx finality.
`tx_status_fetch` is used in several places, including the
`broadcast_tx_commit` RPC method.
`broadcast_tx_commit` RPC method.

With the chunk congestions we have right now, we return
`UNKNOWN_TRANSACTION` error to `broadcast_tx_commit` after 20 seconds of
waiting, which is both sad and weird.

The error we store in `tx_status_result` is not good enough to show to
the user; otherwise we would break from the loop with it immediately.
If we reach the timeout boundary, I suggest always returning a timeout
error.
Ideally, we need to rewrite `chain.get_final_transaction_result` method
completely.
But for now, let's start at least with supporting
`TxExecutionStatus::Included` status.
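The suggested timeout behavior can be sketched as below. The function signature, error enum, and the polling-closure shape are all simplified stand-ins, not nearcore's actual `tx_status_fetch`; the point is only that at the deadline we return `Timeout` rather than the last stored error.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum TxStatusError {
    Timeout,
    UnknownTransaction,
}

/// Poll `check` until the tx reaches the desired finality or `timeout`
/// elapses. At the timeout boundary we always return Timeout, instead of
/// surfacing whatever (possibly misleading) intermediate error was last
/// stored -- e.g. UNKNOWN_TRANSACTION during chunk congestion.
fn tx_status_fetch(
    mut check: impl FnMut() -> Result<String, TxStatusError>,
    timeout: Duration,
) -> Result<String, TxStatusError> {
    let deadline = Instant::now() + timeout;
    loop {
        match check() {
            Ok(outcome) => return Ok(outcome),
            Err(_intermediate) => {
                if Instant::now() >= deadline {
                    return Err(TxStatusError::Timeout);
                }
            }
        }
    }
}

fn main() {
    // A congested chunk keeps reporting UnknownTransaction; the caller
    // still sees a timeout error, not UNKNOWN_TRANSACTION.
    let result = tx_status_fetch(
        || Err(TxStatusError::UnknownTransaction),
        Duration::from_millis(10),
    );
    println!("{:?}", result);
}
```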
… keys (#10803)

If the `columns_to_keep` arg of
`checkpoint_hot_storage_and_cleanup_columns()` is `Some`, then we delete
all the data in every other column. Then if snapshot compaction is
enabled in the configs, we rely on that to clean up the files on disk.
Instead of doing that, we can just call `drop_cf()` on every unwanted
column family, and the associated sst files will be removed without the
need for any compactions.

So this moves the `columns_to_keep` arg to
`near_store::db::Database::create_checkpoint()`, and has the rocksdb
implementation of that trait call `drop_cf()` on unwanted column
families. These column families are then immediately recreated in
`checkpoint_hot_storage_and_cleanup_columns()` by the call to
`StoreOpener::open_in_mode()`, but the data on disk is gone.

This also means we can get rid of the state snapshot options in the
config, since they were only ever intended to clean up the unwanted
files, which aren't there anymore.
@rtsainear rtsainear requested a review from a team as a code owner April 5, 2024 15:53
@rtsainear rtsainear requested a review from akhi3030 April 5, 2024 15:53
@rtsainear rtsainear closed this Apr 5, 2024
mooori pushed a commit to mooori/nearcore that referenced this pull request Apr 16, 2024
(Snyk-generated upgrade PR body from the referencing repository: upgrade react-router from 6.17.0 to 6.18.0.)