Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: Only decrement the counter of an array if its address has not changed #7297

Merged
merged 21 commits into from
Feb 13, 2025

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented Feb 5, 2025

Description

Problem*

Trying to reproduce an issue we observed a while ago that array ref counts can go down to 0 and then to usize::MAX with an underflow if there are more dec_rc than inc_rc. Once that happens it is very hard to argue that subsequent ref-count values will make sense, and in particular that when the RC will equal 1 again, then it means the array is okay to be treated as mutable.

I observed usize::MAX while debugging some of the issues while syncing aztec-packages.

Summary*

Generated an extra constraint in Brillig after DecreaseRc that checks that the value is at least 1. I added this to find counter examples; I realise that it would increase the bytecode size to protect against something we should not be doing in the first place, so ideally this check can be removed, or made pedantic. In the end I added a --enable-brillig-debug-assertions flag to turn this check on (it's off by default), and threaded it through the application in a new BrilligOptions together with the flag to enable debug printing.

Came up with a fix for the status quo explained in this comment, which is to only carry out the decrement of the RC if an array_set has not created a new array; if it did then it would have also adjusted the original RC.

The fix is breaking because of the changing in the SSA format: dec_rc has an extra parameter now.

Additional Context

There were some fixes related to RC's in AztecProtocol/aztec-packages#11294 and with those in place there is only one test in this repo which fails, which is perhaps ironically the reference_counts test. With the fixes to the DIE pass removed, the brillig_rc_regression_6123 also fails with the same error.

Here's what I saw investigating the reference_counts test failure:

$ cd test_programs/execution_success/reference_counts
$ cargo run -q -p nargo_cli -- --program-dir . execute --inliner-aggressiveness 0 --force-brillig
0x00
0x03
0x04
0x06
0x06
error: Assertion failed: array ref-count went to zero
   ┌─ /Users/aakoshh/Work/aztec/noir/test_programs/execution_success/reference_counts/src/main.nr:46:5

46 │     println(array2[0]);
   │     ------------------

   = Call stack:
     1. /Users/aakoshh/Work/aztec/noir/test_programs/execution_success/reference_counts/src/main.nr:11:5
     2. /Users/aakoshh/Work/aztec/noir/test_programs/execution_success/reference_counts/src/main.nr:46:5

Failed assertion

Looking at the SSA what stands out is the following.

In the test we have this part:

/// Borrow the same array mutably through both parameters, inc_rc is necessary here, although
/// only one is needed to bring the rc from 1 to 2.
fn borrow_mut_two(array1: &mut [Field; 3], array2: &mut [Field; 3], rc_before_call: u32) {
    assert_refcount(*array1, rc_before_call + 1);
    assert_refcount(*array2, rc_before_call + 1);
    array1[0] = 5;
    array2[0] = 6;
 ....

So we write 5 and 6 into the same array through two different references.

Before Mem2Reg (2nd) it appears in the SSA like so:

    v3 = make_array [Field 0, Field 1, Field 2] : [Field; 3]
    v4 = allocate -> &mut [Field; 3]
    store v3 at v4
    ...
    v11 = load v4 -> [Field; 3]
    inc_rc v11
    v12 = load v4 -> [Field; 3]
    ...
    v17 = array_set v12, index u32 0, value Field 5
    store v17 at v4
    v18 = load v4 -> [Field; 3]
    v20 = array_set v18, index u32 0, value Field 6
    store v20 at v4
    v21 = load v4 -> [Field; 3]
    dec_rc v21
    ...

There is one inc_rc and one dec_rc which act on the v3 array, loaded through the v4 reference.

IIUC the RC being 2 means that array_set will create a copy, but because it stores it back into v4, both array_set will write to the same reference and thus only the last will survive.
What I don't fully understand is that I thought that when we make a copy, the RC of the new array is 1, so when we call dec_rc v21 I'm not sure it's not decrementing 1 to 0.

After the Mem2Reg (2nd) pass, the storing of 6 into the array turns into the creation of a new array instead:

    v3 = make_array [Field 0, Field 1, Field 2] : [Field; 3]
    v4 = allocate -> &mut [Field; 3]
    ...
    inc_rc v3
    ...
    v12 = make_array [Field 5, Field 1, Field 2] : [Field; 3]
    v14 = make_array [Field 6, Field 1, Field 2] : [Field; 3]
    dec_rc v14

Notice how dec_rc v14 is now acting on a new array. If that array had a ref-count of 1 start with, then it's now 0.

Documentation*

Check one:

  • No documentation needed.
  • Documentation included in this PR.
  • [For Experimental Features] Documentation to be submitted in a separate PR.

PR Checklist*

  • I have tested the changes locally.
  • I have formatted the changes with Prettier and/or cargo fmt on default settings.

@aakoshh
Copy link
Contributor Author

aakoshh commented Feb 5, 2025

@michaeljklein I tried running this on aztec-packages on the af/sync-noir-11294 branch (which includes the fixes I did). Chose a test which I knew did not require external setup:

./noir/bootstrap.sh
./noir-projects/bootstrap.sh
./yarn-project/bootstrap.sh
cd yarn-project/end-to-end
yarn test:e2e e2e_fees/failures.test.ts 

All 4 tests in this quite failed:

[17:14:20.049] DEBUG: wasm-simulator execution failed {"hash":3655590760533525500,"message":"Circuit execution failed: array ref-count went to zero"}
    error: {
      "callStack": [
        "0.2656",
        "0.5981",
        "0.17882"
      ],
      "rawAssertionPayload": {
        "selector": "15035963205125037332",
        "data": []
      },
      "brilligFunctionId": 0
    }
[17:14:20.049] DEBUG: bb-prover:wasm:bundle Failed to simulate PrivateKernelInnerArtifact {"circuitName":"private-kernel-inner"}
    error: {
      "callStack": [
        "0.2656",
        "0.5981",
        "0.17882"
      ],
      "rawAssertionPayload": {
        "selector": "15035963205125037332",
        "data": []
      },
      "brilligFunctionId": 0
    }
[17:14:20.050] ERROR: pxe:service Error: Error: Circuit execution failed: array ref-count went to zero
    at Object.<anonymous>.module.exports.__wbg_constructor_a10f2b77c63b8d5e (/mnt/user-data/akosh/aztec-packages/noir/packages/acvm_js/nodejs/acvm_js.js:646:17)
    at acvm_js::js_execution_error::JsExecutionError::new::hac8616e8116f28b4 (wasm://wasm/0093ac6a:1:1193712)
    at acvm_js::execute::ProgramExecutor<B>::execute_circuit::{{closure}}::h7328db332ff65aa6 (wasm://wasm/0093ac6a:1:396686)
    at acvm_js::execute::execute_program_with_native_program_and_return::{{closure}}::h1dbd573980cda30b (wasm://wasm/0093ac6a:1:969231)
    at acvm_js::execute::execute_program_with_native_type_return::{{closure}}::he3dce0d8e60779ba (wasm://wasm/0093ac6a:1:1135771)
    at wasm_bindgen_futures::future_to_promise::{{closure}}::{{closure}}::hb7075f80bb40c2ad (wasm://wasm/0093ac6a:1:882654)
    at wasm_bindgen_futures::queue::Queue::new::{{closure}}::h3f454fd38ec14ae9 (wasm://wasm/0093ac6a:1:1235576)
    at wasm_bindgen::convert::closures::invoke1_mut::h88a56384c96b757b (wasm://wasm/0093ac6a:1:1572449)
    at __wbg_adapter_22 (/mnt/user-data/akosh/aztec-packages/noir/packages/acvm_js/nodejs/acvm_js.js:220:10)
    at real (/mnt/user-data/akosh/aztec-packages/noir/packages/acvm_js/nodejs/acvm_js.js:205:20) {
  callStack: [ '0.2656', '0.5981', '0.17882' ],
  rawAssertionPayload: { selector: '15035963205125037332', data: [] },
  brilligFunctionId: 0
}

I'll push the branch to see the full extent of it in AztecProtocol/aztec-packages#11763

@jfecher
Copy link
Contributor

jfecher commented Feb 5, 2025

This is more reason to experiment with #7287 I think. Ref counts being these implicit values passed around along with arrays has lead to footguns in the past and:

What I don't fully understand is that I thought that when we make a copy, the RC of the new array is 1, so when we call dec_rc v21 I'm not sure it's not decrementing 1 to 0.

Certainly sounds like such a footgun as well.

@jfecher
Copy link
Contributor

jfecher commented Feb 5, 2025

As a bonus, I suspect once we do fix ref counts we should see a brillig speedup as well since I'd expect a well-behaved ref count to hover around 1 much more often since it is no longer drifting off.

@aakoshh
Copy link
Contributor Author

aakoshh commented Feb 6, 2025

Basically every test in aztec-packages fails with this assertion: https://github.com/AztecProtocol/aztec-packages/actions/runs/13164114024

@aakoshh
Copy link
Contributor Author

aakoshh commented Feb 6, 2025

@michaeljklein I did the experiment you suggested to detect whether the RC goes negative by moving the assertion AztecProtocol/aztec-packages@6217ca5

I'm afraid the results on aztec-packages are similarly bleak, this is not a false flag: https://github.com/AztecProtocol/aztec-packages/actions/runs/13178930935

In the noir repo none of the integration tests fail.

@aakoshh aakoshh changed the title feat(experimental): Add constraint on minimum value after dec_rc feat(experimental): Only decrement the counter of an array if its address has not changed Feb 6, 2025
@aakoshh aakoshh changed the title feat(experimental): Only decrement the counter of an array if its address has not changed fix: Only decrement the counter of an array if its address has not changed Feb 6, 2025
@aakoshh
Copy link
Contributor Author

aakoshh commented Feb 6, 2025

4652543 changes how the DecrementRc is carried out by the Brillig VM so that it only decrements if the variable to be decremented points at the same one we incremented when we entered the function. Otherwise an array_set probably detached them.

Copy link
Contributor

github-actions bot commented Feb 6, 2025

Changes to number of Brillig opcodes executed

Generated at commit: b5e594461ac8d5cc988218ff10487222811c3a56, compared to commit: 97afa52f5212be2d05af26b9e8dde9c3ea7a1d2e

🧾 Summary (10% most significant diffs)

Program Brillig opcodes (+/-) %
reference_counts_inliner_zero +955 ❌ +131.91%
reference_counts_inliner_min +852 ❌ +87.84%

Full diff report 👇
Program Brillig opcodes (+/-) %
reference_counts_inliner_zero 1,679 (+955) +131.91%
reference_counts_inliner_min 1,822 (+852) +87.84%
reference_counts_inliner_max 786 (+348) +79.45%
brillig_rc_regression_6123_inliner_min 374 (+6) +1.63%
brillig_rc_regression_6123_inliner_max 290 (+2) +0.69%
brillig_rc_regression_6123_inliner_zero 290 (+2) +0.69%
regression_6674_2_inliner_min 1,053 (+2) +0.19%
regression_6674_3_inliner_min 2,008 (+2) +0.10%

Copy link
Contributor

github-actions bot commented Feb 6, 2025

Changes to Brillig bytecode sizes

Generated at commit: b5e594461ac8d5cc988218ff10487222811c3a56, compared to commit: 97afa52f5212be2d05af26b9e8dde9c3ea7a1d2e

🧾 Summary (10% most significant diffs)

Program Brillig opcodes (+/-) %
reference_counts_inliner_max +384 ❌ +75.59%
reference_counts_inliner_zero +476 ❌ +71.47%

Full diff report 👇
Program Brillig opcodes (+/-) %
reference_counts_inliner_max 892 (+384) +75.59%
reference_counts_inliner_zero 1,142 (+476) +71.47%
reference_counts_inliner_min 1,282 (+443) +52.80%
brillig_rc_regression_6123_inliner_min 242 (+6) +2.54%
brillig_rc_regression_6123_inliner_max 172 (+3) +1.78%
brillig_rc_regression_6123_inliner_zero 172 (+3) +1.78%
regression_6674_2_inliner_min 309 (+3) +0.98%
regression_6674_3_inliner_min 870 (+3) +0.35%

github-actions[bot]

This comment was marked as resolved.

github-actions[bot]

This comment was marked as resolved.

@aakoshh aakoshh marked this pull request as ready for review February 6, 2025 17:15
@aakoshh aakoshh requested a review from a team February 6, 2025 17:15
Copy link
Contributor

@jfecher jfecher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not confident this is the correct fix or whether its applying a bandaid over a broken system, which could still be broken somewhere else after.

If it improves on what we currently have it could be good as a temporary fix but I'd like to see some test cases we currently fail with for a change like this to show that.

Blocking this PR before we have such tests but I wont "request changes" since I'm off next week to avoid keeping this blocked then once tests are added.

@aakoshh
Copy link
Contributor Author

aakoshh commented Feb 7, 2025

It's meant as a band-aid over a clear case that can cause the ref count to be decreased to zero, but it doesn't prevent the ref count from creeping up when there are no dec_rc issued at all, which is the case for mutable borrows for example (only references are treated with an inc+dec pair). I hope this will prevent accidental mutations when one of these counts ends up coming back to 1 when it should be 2 or 3.

The test failure can be provoked like this:

  1. Change the condition in brillig_block.rs to if true || array_register == orig_array_register { codegen_dec_rc(self.brillig_context);
  2. Run cd test_programs/execution_success/reference_counts && cargo run -q -p nargo_cli -- --program-dir . execute --force-brillig --silence-warnings --enable-brillig-debug-assertions

I'll try to write a test using the array_refcount 👍

@aakoshh aakoshh changed the title fix: Only decrement the counter of an array if its address has not changed fix!: Only decrement the counter of an array if its address has not changed Feb 7, 2025
@aakoshh aakoshh requested a review from TomAFrench February 10, 2025 14:58
@aakoshh aakoshh requested a review from a team February 11, 2025 13:07
Copy link
Contributor

@vezenovm vezenovm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After going through the discussions on this PR and the issue, I do think this is a reasonable temporary fix. Especially as we are planning to experiment with an overhaul of our RCs. Having the original array as part of the dec_rc does look provide us some connection to the original inc_rc for which the dec_rc was intended.

I have a couple nits and clarifying questions surrounding the test you added, but otherwise I am good to merge.

@aakoshh aakoshh requested a review from vezenovm February 13, 2025 12:56
@aakoshh aakoshh added this pull request to the merge queue Feb 13, 2025
Merged via the queue into master with commit 93d1740 Feb 13, 2025
103 checks passed
@aakoshh aakoshh deleted the af/decrease-rc-to-zero branch February 13, 2025 14:59
TomAFrench added a commit that referenced this pull request Feb 13, 2025
* master:
  chore: avoid u128s in brillig memory (#7363)
  chore: update docs about integer overflows (#7370)
  fix!: Only decrement the counter of an array if its address has not changed (#7297)
  fix: let LSP read `noirfmt.toml` for formatting files (#7355)
  chore: deprecate keccak256 (#7361)
AztecBot added a commit to AztecProtocol/aztec-packages that referenced this pull request Feb 13, 2025
chore: update docs about integer overflows (noir-lang/noir#7370)
fix!: Only decrement the counter of an array if its address has not changed (noir-lang/noir#7297)
fix: let LSP read `noirfmt.toml` for formatting files (noir-lang/noir#7355)
TomAFrench added a commit to AztecProtocol/aztec-packages that referenced this pull request Feb 13, 2025
Automated pull of development from the
[noir](https://github.com/noir-lang/noir) programming language, a
dependency of Aztec.
BEGIN_COMMIT_OVERRIDE
chore: avoid u128s in brillig memory
(noir-lang/noir#7363)
chore: update docs about integer overflows
(noir-lang/noir#7370)
fix!: Only decrement the counter of an array if its address has not
changed (noir-lang/noir#7297)
fix: let LSP read `noirfmt.toml` for formatting files
(noir-lang/noir#7355)
END_COMMIT_OVERRIDE

---------

Co-authored-by: Tom French <15848336+TomAFrench@users.noreply.github.com>
TomAFrench added a commit that referenced this pull request Feb 14, 2025
* master: (42 commits)
  fix: give "correct" error when trying to use AsTraitPath (#7360)
  chore: avoid u128s in brillig memory (#7363)
  chore: update docs about integer overflows (#7370)
  fix!: Only decrement the counter of an array if its address has not changed (#7297)
  fix: let LSP read `noirfmt.toml` for formatting files (#7355)
  chore: deprecate keccak256 (#7361)
  feat: `FunctionDefinition::as_typed_expr` (#7358)
  feat(performance): Check sub operations against induction variables (#7356)
  chore: avoid doing all brillig integer arithmetic on u128s (#7357)
  feat(cli): Add `--target-dir` option (#7350)
  fix(ssa): Make the lookback feature opt-in (#7190)
  feat(performance): Use unchecked ops based upon known induction variables (#7344)
  chore: mark sha256 as deprecated from the stdlib (#7351)
  fix: incorrect secondary file in LSP errors (#7347)
  chore: Basic test for MSM in Noir to catch performance improvements and regressions (#7341)
  fix(cli): Only lock the packages selected in the workspace (#7345)
  chore: remove some unused types and functions in the AST (#7339)
  chore: remove foreign calls array from Brillig VM constructor (#7337)
  chore(ci): Add Vecs and vecs to cspell (#7342)
  chore: redo typo PR by osrm (#7238)
  ...
TomAFrench added a commit that referenced this pull request Feb 14, 2025
…rom-brillig

* master:
  chore: allow opting in to displaying benchmark comments (#7399)
  chore: box `ExprValue` in `Value` enum (#7388)
  chore: pull out refactored methods from u128 branch (#7385)
  feat: require safety comments instead of safety doc comments (#7295)
  fix(ssa): Do not deduplicate division by a zero constant (#7393)
  chore: document traits required to be in scope (#7387)
  fix: field zero division in brillig (#7386)
  chore: box `ParserError`s in `InterpreterError` (#7373)
  chore: remove unnecessary dereferencing within brillig vm (#7375)
  fix: give "correct" error when trying to use AsTraitPath (#7360)
  chore: avoid u128s in brillig memory (#7363)
  chore: update docs about integer overflows (#7370)
  fix!: Only decrement the counter of an array if its address has not changed (#7297)
  fix: let LSP read `noirfmt.toml` for formatting files (#7355)
  chore: deprecate keccak256 (#7361)
  feat: `FunctionDefinition::as_typed_expr` (#7358)
  feat(performance): Check sub operations against induction variables (#7356)
  chore: avoid doing all brillig integer arithmetic on u128s (#7357)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants