feat: customizable fd_limit as env var #10962

Closed
wants to merge 56 commits into from

Commits on Jan 22, 2024

  1. cut CHANGELOG.md

    posvyatokum committed Jan 22, 2024
    843a44c
  2. modify testnet delay

    posvyatokum committed Jan 22, 2024
    6c2d981
  3. config: adjust shard cache sizes (#10409)

    * Do not apply default size increase introduced in #10373 to
    `view_trie_cache` since view calls are not latency-sensitive
    * Opt out shard 1 from the increase since it only contains aurora
    account and based on the current metrics has very low cache miss rate
    even with a size of 50MB
    * Configure shard 3 override of 3GB to also apply after resharding
    pugachAG authored and posvyatokum committed Jan 22, 2024
    d7edd8e
  4. fix(metrics): fix false positives in the near_num_invalid_blocks metric (#10468)
    
    #9316 extracted
    maybe_mark_block_invalid() as a helper function, but it changed the
    behavior so that near_num_invalid_blocks is incremented even if we're
    not marking a block as invalid. So fix it by just putting that metric
    increment inside the if block like it was before
    marcelo-gonzalez authored and posvyatokum committed Jan 22, 2024
    65b89a2
  5. 174ce77
  6. set voting date

    posvyatokum committed Jan 22, 2024
    06168f7
  7. 1.37.0-rc.1

    posvyatokum committed Jan 22, 2024
    e8021d5
  8. protobuf-patch

    saketh-are authored and posvyatokum committed Jan 22, 2024
    40aa586

Commits on Jan 23, 2024

  1. 390b452

Commits on Jan 24, 2024

  1. fix(configs): fill in missing values in ExperimentalConfig with defaults (#10472)
    
    the command `neard --home {home_dir} init --chain-id mainnet
    --download-config` downloads the config at
    
    https://s3-us-west-1.amazonaws.com/build.nearprotocol.com/nearcore-deploy/mainnet/config.json
    and then saves a config file built from that one to
    {home_dir}/config.json. This is expected to work even when particular
    fields are missing by filling them in with defaults, which is why many
    config fields are marked with some sort of `#[serde(default)]` or
    `#[serde(default = default_fn)]`. But the `ExperimentalConfig` doesn't
    fill in these defaults. The existing `#[serde(default)]` over that field
    in `near_network::config_json::Config` will have us fill it in if it's
    totally missing, but we get an error if it's present with some fields
    missing, which is the case today:
    
    ```
    Error: Failed to initialize configs
    
    Caused by:
        config.json file issue: Failed to deserialize config from /tmp/n/config.json: Error("missing field `network_config_overrides`", line: 98, column: 5)
    ```
    
    Fix it by adding a `#[serde(default)]` to each of the fields in
    `ExperimentalConfig`
    marcelo-gonzalez authored and posvyatokum committed Jan 24, 2024
    e1fc6bc
  2. fix merge conflict

    posvyatokum committed Jan 24, 2024
    c869dd6

Commits on Jan 28, 2024

  1. 71d0dc7

Commits on Jan 29, 2024

  1. Fix: get branch on RELEASE event triggered workflow (#10495)

    On release triggered workflow runs, the HEAD is detached and no local
    branch is present.
    Due to this, BRANCH var ends up as an empty string and this causes
    failures to publish artifacts:
    https://github.com/near/nearcore/actions/runs/7640475082/job/20815674136
    ++ git branch --show-current
    + BRANCH=
    ++ git rev-parse HEAD
    + COMMIT=c869dd6c1e942f21e27a74c7e47a698de
    ++ uname
    
    
    this leads to incorrect S3 paths:
    `s3://build.nearprotocol.com/nearcore/$(uname)/${BRANCH}/latest` ->
    `s3://build.nearprotocol.com/nearcore/Linux//latest`
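
The path problem described above can be reproduced in a few lines. The following is a minimal sketch, not the fix this commit applies (the commit changes how the workflow determines the branch); `naive_prefix` and `checked_prefix` are hypothetical helper names, and the guard only illustrates why an empty `BRANCH` should be rejected rather than interpolated.

```rust
/// Mirrors the unguarded interpolation: an empty branch yields a `//` segment.
fn naive_prefix(os: &str, branch: &str) -> String {
    format!("s3://build.nearprotocol.com/nearcore/{}/{}/latest", os, branch)
}

/// Hypothetical guard: refuse to build a path when the branch is empty,
/// which is what `git branch --show-current` prints on a detached HEAD.
fn checked_prefix(os: &str, branch: &str) -> Result<String, String> {
    let branch = branch.trim();
    if branch.is_empty() {
        return Err("branch name is empty (detached HEAD?)".to_string());
    }
    Ok(naive_prefix(os, branch))
}
```

Calling `checked_prefix("Linux", "")` returns an error, while `naive_prefix("Linux", "")` silently produces the `Linux//latest` path quoted in the commit message.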
    andrei-near authored and posvyatokum committed Jan 29, 2024
    1bdb64f
  2. 563091a
  3. 1.37.0-rc.2

    posvyatokum committed Jan 29, 2024
    f4a9a68
  4. ea69991

Commits on Jan 30, 2024

  1. Fix: neard-release pipeline (#10521)

    #10495 was meant to fix the builds
    triggered by release events.
    With the actions/checkout GitHub Action, only a single commit is fetched by
    default, so no branch matches.
    To fetch all history for all branches and tags, this sets fetch-depth to 0
    for both the binary and Docker image release jobs.
    
    This was
    [tested](https://github.com/near/andrei-playground/actions/runs/7696882172/job/20972653224)
    on a private repo.
    andrei-near authored and posvyatokum committed Jan 30, 2024
    08a2812
  2. remove spammy info logs (#10529)

    Removing spam from info logs.
    1. GC shouldn't announce itself every block in `info` mode
    2. `BlockResponse` handle shouldn't spam 10 times per block
    3. `EpochOutOfBounds` is a normal error in
    `is_last_block_in_finished_epoch`, and we shouldn't flag it.
    posvyatokum committed Jan 30, 2024
    b097942

Commits on Jan 31, 2024

  1. 1.37.0-rc.3

    posvyatokum committed Jan 31, 2024
    430a22f
  2. refactor shard management

    posvyatokum committed Jan 31, 2024
    276b7e7

Commits on Feb 1, 2024

  1. fix(resharding): delay deleting the snapshot until after catchup (#10545)

    I haven't verified it, but it seems that restarting the node during
    catchup does not resume the catchup but rather restarts the whole
    resharding from scratch. While this is suboptimal, I'm not aiming to fix it
    now. This PR merely moves the deletion of the state snapshot to after
    catchup is finished. This way the restarted resharding can succeed.
    wacban authored and posvyatokum committed Feb 1, 2024
    4e06d0f
  2. 61fef83

Commits on Mar 5, 2024

  1. 7b4ee1e
  2. [resharding] Fix opening snapshot for split storage nodes (#10588)

    In testnet we hit an issue during resharding with split storage nodes
    where they failed to create the snapshot required for resharding. This
    PR is a fix for this issue.
    
    Link to zulip thread:
    https://near.zulipchat.com/#narrow/stream/308695-pagoda.2Fprivate/topic/cutting.201.2E37.2E0/near/420092380
    
    Creation of snapshot happens via the function call
    `checkpoint_hot_storage_and_cleanup_columns` which takes a hot_store and
    creates a snapshot out of it. Later we open the snapshot
    `opener.open_in_mode(Mode::ReadWriteExisting)`.
    
    The function `open_in_mode` -> `ensure_kind` -> `is_valid_kind_archive`
    was the point of failure. This is evident from the log line
    ```
    Feb 06 14:19:29 testnet-rpc-archive-public-02-asia-east1-b-fc341f59 neard[1753]: 2024-02-06T14:19:29.215599Z ERROR state_snapshot: State snapshot creation failed err=Hot database kind should be RPC but got Some(Hot). Did you forget to set archive on your store opener?
    ```
    
    There are three types of nodes, RPC, legacy archival and split storage
    nodes.
    
    The call to `get_default_kind` gives us the DbKind for the new
    snapshot hot storage when opening it.
    
    To get the snapshot storage to open properly, we need to handle the
    three types of storages. We need to set the value of `archive`
    appropriately which is passed to the storage opener and ensure we set
    the correct storage type of the snapshot.
    
    The table below talks about the expected values of each of these fields.
    
    | Node type | Hot storage kind | Required snapshot storage type | archive |
    | -- | -- | -- | -- |
    | RPC | DbKind::RPC | RPC | false |
    | Legacy Archival | DbKind::Archival | Legacy Archival | true |
    | Split Storage | DbKind::Hot | RPC | false |
    
    The place where our code went wrong with the split storage node was in
    function call to `is_valid_kind_archive` where the DbKind is Hot for
    split storage and archive is set as false.
    
    The relaxed check basically states: if we are creating a
    snapshot of the hot storage from a split storage node, convert it into an
    RPC storage.
    
    Testing: Ad hoc testing where I manually set the storage DbKind to Hot
    and checked that the fix works.
    Additionally added an integration test that manually sets the DbKind to
    Hot, Archive and RPC and checks whether resharding happens.
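
The table in this commit message can be read as a small lookup. The sketch below encodes it under stated assumptions: the `DbKind` enum here is a local stand-in using the variant names from the table (not nearcore's actual type), and `snapshot_params` is a hypothetical helper, not a function from the codebase.

```rust
/// Local stand-in for the hot-storage kinds named in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DbKind {
    RPC,
    Archival,
    Hot,
}

/// Returns (snapshot storage kind, `archive` flag) for a given hot-storage kind,
/// following the three rows of the table.
fn snapshot_params(hot_kind: DbKind) -> (DbKind, bool) {
    match hot_kind {
        DbKind::RPC => (DbKind::RPC, false),
        DbKind::Archival => (DbKind::Archival, true),
        // Split storage: the snapshot of the hot store is opened as RPC storage.
        DbKind::Hot => (DbKind::RPC, false),
    }
}
```

The third arm is the case that previously failed: a `Hot` database combined with `archive = false`.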
    Shreyan Gupta authored and posvyatokum committed Mar 5, 2024
    1035db5
  3. [resharding] Handle case of restarting node during resharding catchup (#10611)
    
    This PR fixes an issue that occurs when we restart a node that is
    resharding and in the catchup phase.
    
    High level issue: When we restart a node in the epoch when resharding is
    happening, we go through the whole state_sync process
    from the beginning, which includes resharding. Once building of the child
    trie is completed, we then do a catchup, apply the split state to the
    child trie and delete the split state changes for resharding
    [here](https://github.com/near/nearcore/blob/e00dcaa72cfed35831b1e72760d21bb8152f1049/chain/chain/src/chain_update.rs#L399).
    
    Zulip thread link:
    https://near.zulipchat.com/#narrow/stream/308695-pagoda.2Fprivate/topic/Problems.20after.20resharding.20restart/near/421312417
    
    This is the implementation of Option 1 in the thread.
    
    The key idea here is to not delete
    `DBCol::StateChangesForSplitStates` during catchup of individual blocks
    but rather at the end of the catchup phase. This implies that if we
    restart the node in the middle of the catchup, we still have all the
    split state information in `DBCol::StateChangesForSplitStates`, since it
    has not been deleted.
    
    TODO: Testing on mocknet.
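
The idea of deferring deletion to the end of catchup can be sketched with an in-memory model. Everything here is an assumption made for illustration: `SplitStateChanges` is a hypothetical stand-in for the `DBCol::StateChangesForSplitStates` column, state changes are plain strings, and the child trie is a `Vec`.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the split-state-changes column.
struct SplitStateChanges {
    by_height: BTreeMap<u64, Vec<String>>,
}

impl SplitStateChanges {
    fn new() -> Self {
        SplitStateChanges { by_height: BTreeMap::new() }
    }

    /// Records one state change for the block at `height`.
    fn record(&mut self, height: u64, change: &str) {
        self.by_height.entry(height).or_default().push(change.to_string());
    }

    /// Applies the changes of one block to the child trie.
    /// The entry is deliberately left in place so a restart can replay it.
    fn apply_block(&self, height: u64, child_trie: &mut Vec<String>) {
        if let Some(changes) = self.by_height.get(&height) {
            child_trie.extend(changes.iter().cloned());
        }
    }

    /// Called once, after the last block of catchup has been applied.
    fn finish_catchup(&mut self) {
        self.by_height.clear();
    }

    /// Number of blocks whose changes are still stored.
    fn remaining(&self) -> usize {
        self.by_height.len()
    }
}
```

Because `apply_block` never removes anything, a node restarted after applying only some blocks can rebuild the child trie from the first block again; only `finish_catchup` discards the data.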
    Shreyan Gupta authored and posvyatokum committed Mar 5, 2024
    87bf8fe
  4. fix(resharding): create flat storage for children shards after node restart (#10684)
    
    The root cause of the resharding issue in mocknet testing seems to be
    that when the node is restarted the flat storage is not recreated.
    Without flat storage when applying the first block of the new epoch we
    get different state roots than the nodes that do have flat storage.
    
    This PR introduces creating the flat storage for the children shards aka
    next epoch shards. This depends on the flat storage status that is
    set at the end of resharding
    [here](https://github.com/near/nearcore-private/blob/1a4756e1acbebc073f312fde1457be47e1da3fc0/chain/chain/src/resharding.rs#L180).
    
    I failed to reproduce the issue so we'll need to test it in mocknet
    unfortunately. We should test all the cases (no restart, restart during
    resharding, restart during catchup, restart post catchup). cc
    @marcelo-gonzalez
    
    Ideally we should also error out when trying to apply a chunk without
    flat storage but I'm not brave enough to do it in 1.37 :)
    
    I also changed shard_id to shard_uid in a few places but this needs to
    happen on a wider scale. I'll do that separately too.
    wacban authored and posvyatokum committed Mar 5, 2024
    9e98b54
  5. fix(resharding): Allow creating the flat storage multiple times for a shard. (#10696)
    
    Removing the assertion and allowing flat storage to be created multiple
    times for a shard. This is needed to fix an issue when a node is restarted
    in the middle of resharding. The flat storage may already be created for
    a subset of shards, but unless all are finished, resharding will get
    restarted. Because the flat storage was created for those shards, it
    will be created on node startup as well as after the second resharding
    is finished.
    
    This is not a perfect solution and not particularly clean. The best
    alternative seems to be to implement resuming of resharding where we
    don't restart resharding for shards that were finished. This is a more
    complex change and we want to get this PR into the release, so for now
    I'm sticking to the simplest approach.
    
    This seems to be safe because even though the flat storage for children
    shards is created it's not used anywhere.
    
    Sanity check - do we ever check the existence of flat storage for a
    shard for anything?
    wacban authored and posvyatokum committed Mar 5, 2024
    f3597b8
  6. 69ed989
  7. 1.37.0

    posvyatokum committed Mar 5, 2024
    0f8e073

Commits on Mar 7, 2024

  1. bump crates version

    posvyatokum committed Mar 7, 2024
    3ecedff

Commits on Mar 8, 2024

  1. [1.37] runtime: temporarily workaround the LimitedMemoryPool limiting the concurrency of the contract runtime too much (#10736)
    
    This is a backport of #10733 for
    inclusion in 1.37.x series.
    nagisa authored Mar 8, 2024
    2567e70
  2. 915aea7
  3. 1.37.1

    posvyatokum committed Mar 8, 2024
    9f9de3b
  4. 77f40fa

Commits on Mar 11, 2024

  1. [resharding] Nightshade V3 shard layout for protocol version 65 (#10725)

    This PR is to be cherry-picked into release 1.38.0 for splitting shard 2
    into two parts.
    
    Hot and kai-ching both fall on shard 2 which has been causing a lot of
    congestion.
    
    Zulip thread:
    https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/constant.20congestion.20on.20shard.202/near/425367222
    
    ---------
    
    Co-authored-by: wacban <wac.banasik@gmail.com>
    2 people authored and marcelo-gonzalez committed Mar 11, 2024
    1f98f92
  2. bump protocol version for 1.38.0 (#10714)

    Bump protocol version to 65
    VanBarbascu authored and marcelo-gonzalez committed Mar 11, 2024
    9577d06
  3. 91018d9
  4. 1.38.0-rc.1

    marcelo-gonzalez committed Mar 11, 2024
    63eeb87

Commits on Mar 12, 2024

  1. port a4bc90b

    bowenwang1996 authored and marcelo-gonzalez committed Mar 12, 2024
    054ab4e
  2. fix(metrics): changed shard id to shard uid in flat storage metrics (#10685)
    
    During resharding we maintain flat storage for both the parent and
    children shards. In order to disambiguate and not mix the metrics for
    shards with the same ids we should use shard_uid instead of shard_id.
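
A short sketch of why a bare shard id is ambiguous as a metric label during resharding. The `ShardUId` struct below is a local stand-in with assumed field names, the `s{shard_id}.v{version}` label format is an assumption, and both label helpers are hypothetical.

```rust
use std::fmt;

/// Local stand-in: a shard id paired with the shard layout version it belongs to.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ShardUId {
    version: u32,
    shard_id: u32,
}

impl fmt::Display for ShardUId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumed label format; the point is only that the version is included.
        write!(f, "s{}.v{}", self.shard_id, self.version)
    }
}

/// Label keyed by the bare shard id: parent and child shards can collide.
fn label_by_shard_id(uid: ShardUId) -> String {
    uid.shard_id.to_string()
}

/// Label keyed by the full shard uid: distinct across shard layouts.
fn label_by_shard_uid(uid: ShardUId) -> String {
    uid.to_string()
}
```

A parent shard `{ version: 1, shard_id: 2 }` and a child shard `{ version: 2, shard_id: 2 }` share the id-based label but get different uid-based labels, so their flat storage metrics no longer mix.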
    wacban authored and marcelo-gonzalez committed Mar 12, 2024
    5798dd6
  3. backport: crates 0.20.1 had two important fixes that has to be in 1.37.x release (#10754)
    
    * #10476
    * #10481
    
    ---------
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
    3 people authored and marcelo-gonzalez committed Mar 12, 2024
    b9bbb62
  4. fa18c12
  5. fix: near_ prefix to flat storage metrics (#10763)

    As these metrics were recently changed and continuity of data is broken,
    we can add the "near_" prefix to the Prometheus metrics, which had been
    missing all along.
    
    @wacban I wonder how painful is it to add this to 1.38 for convenience?
    
    Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
    2 people authored and marcelo-gonzalez committed Mar 12, 2024
    ed7a80f
  6. 1.38.0-rc.2

    marcelo-gonzalez committed Mar 12, 2024
    78122a6

Commits on Mar 14, 2024

  1. Revert "near-vm: use a recycling pool of shared code memories instead of a in-memory cache of loaded artifacts (#9244)" (#10788)
    
    This reverts commit ad67e6b.
    nagisa authored Mar 14, 2024
    bb02713

Commits on Mar 15, 2024

  1. rpc: return timeout error from tx_status_fetch (#10789)

    All `tx_status_fetch` does is poll the tx_status method
    and wait for the desired level of tx finality.
    `tx_status_fetch` is used in several places including
    `broadcast_tx_commit` RPC method.
    
    With the chunk congestions we have right now, we return
    `UNKNOWN_TRANSACTION` error to `broadcast_tx_commit` after 20 seconds of
    waiting, which is both sad and weird.
    
    The error we store in `tx_status_result` is not good enough to show
    to the user; otherwise we would break from the loop with it immediately.
    If we reach the timeout boundary, I suggest always returning a timeout
    error.
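
The polling behaviour described above can be sketched as a generic loop. This is an illustration under assumptions, not the actual `tx_status_fetch` code: `fetch_with_timeout` and `FetchError` are hypothetical names, and the poll closure returns `None` for any intermediate result (including errors not worth surfacing).

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type: the only failure surfaced to the caller is a timeout.
#[derive(Debug, PartialEq, Eq)]
enum FetchError {
    Timeout,
}

/// Polls until `poll` yields a final outcome or `timeout` elapses.
/// Intermediate results are discarded, so hitting the deadline always
/// produces `FetchError::Timeout` rather than the last intermediate error.
fn fetch_with_timeout<T>(
    mut poll: impl FnMut() -> Option<T>,
    timeout: Duration,
    interval: Duration,
) -> Result<T, FetchError> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(outcome) = poll() {
            return Ok(outcome);
        }
        if Instant::now() >= deadline {
            return Err(FetchError::Timeout);
        }
        thread::sleep(interval);
    }
}
```

With this shape, a caller such as `broadcast_tx_commit` would see a timeout after the waiting period instead of an `UNKNOWN_TRANSACTION`-style error left over from an earlier poll.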
    telezhnaya authored and marcelo-gonzalez committed Mar 15, 2024
    c5517fb
  2. support Included transaction status (#10796)

    Ideally, we need to rewrite `chain.get_final_transaction_result` method
    completely.
    But for now, let's start at least with supporting
    `TxExecutionStatus::Included` status.
    telezhnaya authored and marcelo-gonzalez committed Mar 15, 2024
    6f905b5
  3. fix(snapshots): drop unwanted column families instead of deleting all keys (#10803)
    
    If the `columns_to_keep` arg of
    `checkpoint_hot_storage_and_cleanup_columns()` is `Some`, then we delete
    all the data in every other column. Then if snapshot compaction is
    enabled in the configs, we rely on that to clean up the files on disk.
    Instead of doing that, we can just call `drop_cf()` on every unwanted
    column family, and the associated sst files will be removed without the
    need for any compactions.
    
    So this moves the `columns_to_keep` arg to
    `near_store::db::Database::create_checkpoint()`, and has the rocksdb
    implementation of that trait call `drop_cf()` on unwanted column
    families. These column families are then immediately recreated in
    `checkpoint_hot_storage_and_cleanup_columns()` by the call to
    `StoreOpener::open_in_mode()`, but the data on disk is gone.
    
    This also means we can get rid of the state snapshot options in the
    config, since they were only ever intended to clean up the unwanted
    files, which aren't there anymore.
    marcelo-gonzalez committed Mar 15, 2024
    a56cd4a

Commits on Mar 18, 2024

  1. add state snapshot tool

    posvyatokum authored and marcelo-gonzalez committed Mar 18, 2024
    7fec892
  2. afc8f93
  3. b6b6f4f
  4. 301c01a
  5. d639d51
  6. 1.38.0

    marcelo-gonzalez committed Mar 18, 2024
    aac5e42

Commits on Mar 26, 2024

  1. c64b022
  2. 0f28ce9