Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early frees on CPU Implementations #3193

Closed
Tracked by #3678
teoxoy opened this issue Nov 8, 2022 · 31 comments · Fixed by #3626 or #5251
Closed
Tracked by #3678

Early frees on CPU Implementations #3193

teoxoy opened this issue Nov 8, 2022 · 31 comments · Fixed by #3626 or #5251
Assignees
Labels
api: dx12 Issues with DX12 or DXGI area: correctness We're behaving incorrectly type: bug Something isn't working

Comments

@teoxoy
Copy link
Member

teoxoy commented Nov 8, 2022

Found in #3174 (comment)
Related: #3031, #2285

Description
Getting DX12 errors

[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: A command allocator 0x000001FD96C262F0:'Unnamed ID3D12CommandAllocator Object' is being reset before previous executions associated with the allocator have completed. [ EXECUTION ERROR #552: COMMAND_ALLOCATOR_SYNC]
[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12Resource2::<final-release>: CORRUPTION: An ID3D12Resource object (0x000001FD96C25E80:'(wgpu internal) Staging') is referenced by GPU operations in-flight on Command Queue (0x000001FD96B64D10:'Unnamed ID3D12CommandQueue Object').  It is not safe to final-release objects that may have GPU operations pending.  This can result in application instability. [ EXECUTION ERROR #921: OBJECT_DELETED_WHILE_STILL_IN_USE]

or

[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: The command allocator cannot be reset because a command list is currently being recorded with the allocator. [ EXECUTION ERROR #543: COMMAND_ALLOCATOR_CANNOT_RESET]

depending if queue.submit is called in the repro below.

This feels like a timing issue (also pointed out by @kvark in #2285 (comment)) since I could only reproduce this locally by increasing array_size to 2048. I also can't reproduce the issue on actual hardware (tried on an Nvidia dGPU and Intel iGPU).

Repro steps

use wgpu::{
    BindGroupDescriptor, BindGroupEntry, BindGroupLayoutDescriptor, BindGroupLayoutEntry,
    BindingType, BufferBindingType, BufferDescriptor, BufferUsages, CommandEncoderDescriptor,
    ComputePassDescriptor, ComputePipelineDescriptor, DownlevelFlags, Limits,
    PipelineLayoutDescriptor, ShaderModuleDescriptor, ShaderSource, ShaderStages,
};

use crate::common::{initialize_test, TestParameters, TestingContext};

#[test]
fn test() {
    initialize_test(
        TestParameters::default()
            .downlevel_flags(DownlevelFlags::COMPUTE_SHADERS)
            .limits(Limits::downlevel_defaults()),
        test_impl,
    );
}

const SRC: &'static str = r#"
let array_size = 2048u; // increase this if you can't reproduce

var<workgroup> w_mem: array<u32, array_size>;

@group(0) @binding(0)
var<storage, read_write> output: u32;

@compute @workgroup_size(1)
fn main() {
    w_mem = array<u32, array_size>();
    workgroupBarrier();

    var is_zero = true;
    for(var i = 0u; i < array_size; i++) {
        is_zero &= w_mem[i] == 0u;
    }
    output = u32(!is_zero);
}
"#;

fn test_impl(ctx: TestingContext) {
    let bgl = ctx
        .device
        .create_bind_group_layout(&BindGroupLayoutDescriptor {
            label: None,
            entries: &[BindGroupLayoutEntry {
                binding: 0,
                visibility: ShaderStages::COMPUTE,
                ty: BindingType::Buffer {
                    ty: BufferBindingType::Storage { read_only: false },
                    has_dynamic_offset: false,
                    min_binding_size: None,
                },
                count: None,
            }],
        });

    let output_buffer = ctx.device.create_buffer(&BufferDescriptor {
        label: None,
        size: 4,
        usage: BufferUsages::COPY_DST | BufferUsages::STORAGE,
        mapped_at_creation: false,
    });

    let bg = ctx.device.create_bind_group(&BindGroupDescriptor {
        label: None,
        layout: &bgl,
        entries: &[BindGroupEntry {
            binding: 0,
            resource: output_buffer.as_entire_binding(),
        }],
    });

    let pll = ctx
        .device
        .create_pipeline_layout(&PipelineLayoutDescriptor {
            label: None,
            bind_group_layouts: &[&bgl],
            push_constant_ranges: &[],
        });

    let sm = ctx.device.create_shader_module(ShaderModuleDescriptor {
        label: None,
        source: ShaderSource::Wgsl(SRC.into()),
    });

    let pipeline = ctx
        .device
        .create_compute_pipeline(&ComputePipelineDescriptor {
            label: None,
            layout: Some(&pll),
            module: &sm,
            entry_point: "main",
        });

    ctx.queue
        .write_buffer(&output_buffer, 0, bytemuck::cast_slice(&[1])); // if this is here, we get OBJECT_DELETED_WHILE_STILL_IN_USE

    let mut encoder = ctx
        .device
        .create_command_encoder(&CommandEncoderDescriptor::default());

    let mut cpass = encoder.begin_compute_pass(&ComputePassDescriptor::default());
    cpass.set_pipeline(&pipeline);
    cpass.set_bind_group(0, &bg, &[]);
    cpass.dispatch_workgroups(1, 1, 1);
    drop(cpass);

    ctx.queue.submit(Some(encoder.finish())); // if this is removed, we get COMMAND_ALLOCATOR_CANNOT_RESET
}

Expected vs observed behavior
No errors.

Platform
Windows 11, wgpu master (08b160c)

@teoxoy teoxoy added the api: dx12 Issues with DX12 or DXGI label Nov 8, 2022
@teoxoy
Copy link
Member Author

teoxoy commented Dec 21, 2022

This might end up having the same fix as #3160 (which is also easier to reproduce).

@cwfitzgerald cwfitzgerald added type: bug Something isn't working area: correctness We're behaving incorrectly labels Dec 26, 2022
@cwfitzgerald
Copy link
Member

This is reproing in CI now on linux: https://github.com/gfx-rs/wgpu/actions/runs/5173769981/jobs/9319358001?pr=3830

@cwfitzgerald cwfitzgerald changed the title Early frees on DX12 with WARP Early frees on CPU Implementations Jun 8, 2023
@teoxoy
Copy link
Member Author

teoxoy commented Aug 16, 2023

The COMMAND_ALLOCATOR_CANNOT_RESET error will most likely go away with #4023.

@cwfitzgerald
Copy link
Member

This is unfortunately still reproducing on CI (see #4728)

@cwfitzgerald cwfitzgerald reopened this Nov 20, 2023
@Dinnerbone
Copy link
Contributor

Confirmed that this is still happening. This happens extremely often with Ruffle and it's starting to make our visual tests quite flaky

@cwfitzgerald
Copy link
Member

Now that the arcanization dust has settled, we should be able to properly investigate this.

@jimblandy
Copy link
Member

It seems like CI for #5222 hits this reliably in the minimum_buffer_binding_size_dispatch test.

@jimblandy
Copy link
Member

jimblandy commented Feb 9, 2024

Some background. This is not a coherent explanation of anything, just me writing down what seemed possibly relevant:

Direct3D 12's complaint is that users are not permitted to call ID3D12CommandAllocator::Reset on a command allocator "if there is an actively recording command list referencing the command allocator" (docs).

In the d3d12 crate

  • CommandAllocator::reset calls ID3D12CommandAllocator::reset.

  • Device::create_graphics_command_list calls ID3D12Device::CreateCommandList.

  • GraphicsCommandList::close calls ID3D12GraphicsCommandList::Close. I assume this puts it in the "closed" state.

  • GraphicsCommandList::reset calls ID3D12GraphicsCommandList::Reset. Based on the docs, I believe GraphicsCommandList::reset makes the command list "actively recording" again:

    Before an app calls Reset, the command list must be in the "closed" state.
    ...
    After Reset succeeds, the command list is left in the "recording" state.

In other words, both allocators and lists have reset methods, and creating a fresh list or calling reset on an extant list means we can't call reset on its allocator any more, until we have called close on that list.

So in wgpu_hal's terms, I believe this means that we cannot call d3d12::CommandAllocator::reset unless we have called d3d12::GraphicsCommandList::close on all the graphics command lists created by calling d3d12::Device::create_graphics_command_list on that allocator.

In the wgpu_hal::dx12 module

The CommandEncoder type:

  • Owns a d3d12::CommandAllocator, created and dropped with the CommandEncoder.

  • Has an Option<GraphicsCommandList>, which is the allocator's currently recording command list, if one exists. Per the docs:

    A given allocator can be associated with no more than one currently recording command list at a time

    I gather this is that.

  • Maintains a Vec of recycled d3d12::GraphicsCommandLists, ready for reuse. Obviously, these had better not be "currently recording".

CommandEncoder methods of interest are:

  • reset_all, which is the only call to CommandAllocator::reset. That's the function that raises the error that this issue exists to complain about. It also pushes a bunch of GraphicsCommandList values from dx12::CommandBuffers onto the recycled list.
  • begin_encoding:
    • is the only place that calls d3d12::Device::create_graphics_command_list.
    • is the only place that draws from the list of recycled command lists. It calls reset on them.
    • is the only place that sets CommandEncoder's currently recording command list.
  • discard_encoding takes the current recording list, calls close on it, and puts it on the recycled list.
  • end_encoding also takes the current recording list, calls close on it, and returns it as a dx12::CommandBuffer. This is the only place that constructs dx12::CommandBuffer values.

Device methods of interest are:

  • destroy_command_encoder, which calls close on the CommandEncoder's currently recording command list, if any, and then drops the CommandEncoder.

The three uses of close described above (discard_encoding, end_encoding and destroy_command_encoder) are the only uses of GraphicsCommandList::close.

The only calls to GraphicsCommandList::reset or d3d12::Device::create_graphics_command_list are in begin_encoding, so that is the only place that creates GraphicsCommandLists in the "recording" state.

Graphics command list state invariants:

  • The CommandEncoder's currently recording command list is always either None or in the "recording" state:
    • It is only set in begin_encoding, which either creates a fresh GraphicsCommandList, or takes one off the recycled list and calls reset on it. Either way, the result is a "recording" command list.
    • It is only closed in discard_encoding and end_encoding, which both take the value.
  • A dx12::CommandBuffer's GraphicsCommandList is always in the "closed" state, since CommandBuffer is only constructed in end_encoding, which just saw a close call succeed.

Further, the CommandEncoder's currently recording command list, if there is one, is the only "recording" command list for that CommandEncoder's allocator.

To provoke the error, we have to reach CommandEncoder::reset_all while there is a "recording" command list. In other words, we have to call reset_all while the CommandEncoder has a currently recording command list.

The reset_all method is used numerous places throughout wgpu_core, all of which maintain an A::CommandEncoder and a list of A::CommandBuffer values constructed from it (a typical example is BakedCommands). These uses all call reset_all on the encoder and pass its associated buffers along.

@jimblandy
Copy link
Member

So I think all that's necessary for this bug to occur is for wgpu_core to call reset_all on a CommandEncoder while it's still got a currently recording command list --- i.e. CommandEncoder::list needs to be Some.

@jimblandy
Copy link
Member

I found a workaround for #5222, but https://github.com/jimblandy/wgpu/tree/repro-wgpu-3193 has the code that was crashing on CI.

@jimblandy
Copy link
Member

Confirmed that that branch still crashes. That suggests that this change removed the behavior that triggers the bug.

@jimblandy

This comment was marked as outdated.

@torokati44

This comment was marked as outdated.

@jimblandy

This comment was marked as outdated.

@torokati44

This comment was marked as outdated.

@ErichDonGubler
Copy link
Member

ErichDonGubler commented Feb 13, 2024

Confirmed that that branch still crashes. That suggests that this change removed the behavior that triggers the bug.

@jimblandy: Similar to what's noted in the large repro case in the OP, the major difference in that change is the addition of a call to CommandEncoder::finish before ending the test. I can confirm that on my machine; the issue reproduces faithfully on the DX12 backend without it, and with it, the test passes.

@ErichDonGubler
Copy link
Member

ErichDonGubler commented Feb 14, 2024

Investigating now. I suspect this has to do with the order in which we attempt to free resources. I have a WIP PR that seems to fix the issue by forcing a command encoder to be discarded when command_encoder_drop is called: #5251

As a note, I'm taking time off from my day job for the next 7 days. My attention on this issue is going to be sparse until then, but perhaps I've given the rest of us a nice springboard to a quick resolution, if anybody has bandwidth in the meantime? EDIT: Already merged. 😂

@ErichDonGubler
Copy link
Member

N.B. that we already have a repro. case in the form of the wgpu_tests::encoder::drop_encoder_after_error test by @bradwerth, where the DX12 failure noted in the OP is specifically noted as an accepted failure case unique to the DX12 backend.

@Dinnerbone
Copy link
Contributor

Dinnerbone commented Mar 1, 2024

Looks like there's still another cause somewhere. This is on 0.19.3:

[WARN  wgpu_hal::auxil::dxgi::exception] ID3D12Device::CreateHeap: Specifying D3D12_CPU_PAGE_PROPERTY_WRITE_COMBINE on pDesc can have a large performance impact. [ STATE_CREATION WARNING #1318: WRITE_COMBINE_PERFORMANCE_WARNING]
[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: A command allocator 0x00000168759105E0:'Unnamed ID3D12CommandAllocator Object' is being reset before previous executions associated with the allocator have completed. [ EXECUTION ERROR #552: COMMAND_ALLOCATOR_SYNC]
[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: A command allocator 0x0000016875916EF0:'Unnamed ID3D12CommandAllocator Object' is being reset before previous executions associated with the allocator have completed. [ EXECUTION ERROR #552: COMMAND_ALLOCATOR_SYNC]
[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: A command allocator 0x00000168759157E0:'Unnamed ID3D12CommandAllocator Object' is being reset before previous executions associated with the allocator have completed. [ EXECUTION ERROR #552: COMMAND_ALLOCATOR_SYNC]
[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: A command allocator 0x0000016875913E40:'Unnamed ID3D12CommandAllocator Object' is being reset before previous executions associated with the allocator have completed. [ EXECUTION ERROR #552: COMMAND_ALLOCATOR_SYNC]
[ERROR wgpu_hal::auxil::dxgi::exception] ID3D12Resource2::<final-release>: CORRUPTION: An ID3D12Resource object (0x000001687739A330:'Unnamed Object') is referenced by GPU operations in-flight on Command Queue (0x0000016846B71910:'Unnamed ID3D12CommandQueue Object').  It is not safe to final-release objects that may have GPU operations pending.  This can result in application instability. [ EXECUTION ERROR #921: OBJECT_DELETED_WHILE_STILL_IN_USE]

@torokati44
Copy link
Contributor

@cwfitzgerald cwfitzgerald reopened this Mar 2, 2024
@ErichDonGubler
Copy link
Member

ErichDonGubler commented Mar 2, 2024

Perhaps we're not correctly working with the lifetime of discarding command encodings yet WRT fences, then? At least for the DX12 backend, the error message makes sense to me, with that hypothesis: we're not sync'ing on a D3D12 fence to guarantee that backend work is actually finished before we try to reset. MSDN docs. that seem noteworthily relevant:

The example code in the overview seems particularly interesting.

FTR: I've also seen this issue in Firefox on DX12 testing for CTS, though I don't have a log link handy, ATM.

@ErichDonGubler
Copy link
Member

Unassigning from myself to reflect that I'm not giving this active attention, ATM.

@ErichDonGubler ErichDonGubler removed their assignment Mar 11, 2024
@Imberflur
Copy link
Contributor

Imberflur commented Apr 15, 2024

I'm somewhat consistently seeing this when running this test on a windows laptop:

 wgpu-test::wgpu-test [Executed] [Dx12/Microsoft Basic Render Driver/2] wgpu_test::clear_texture::clear_texture_uncompressed_gles
[2024-04-15T00:39:18Z ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandAllocator::Reset: A command allocator 0x000002CB36321FC0:'Unnamed ID3D12CommandAllocator Object' is being reset before previous executions associated with the allocator have completed. [ EXECUTION ERROR #552: COMMAND_ALLOCATOR_SYNC]
[2024-04-15T00:39:18Z ERROR wgpu_hal::auxil::dxgi::exception] ID3D12Resource2::<final-release>: CORRUPTION: An ID3D12Resource object (0x000002CB3575AAD0:'(wgpu internal) initializing unmappable buffer') is referenced by GPU operations in-flight on Command Queue (0x000002CB36BC5650:'Unnamed ID3D12CommandQueue Object').  It is not safe to final-release objects that may have GPU operations pending.  This can result in application instability. [ EXECUTION ERROR #921: OBJECT_DELETED_WHILE_STILL_IN_USE]

Notably, the test takes ~70 seconds before it fails.

edit: after I posted this it stopped consistently failing 😭

@teoxoy
Copy link
Member Author

teoxoy commented Jul 16, 2024

I think #5251 resolved the COMMAND_ALLOCATOR_CANNOT_RESET error.

For the remaining COMMAND_ALLOCATOR_SYNC -> OBJECT_DELETED_WHILE_STILL_IN_USE error, I see a pattern in the logs:

These all point to improper recycling of encoders in pending writes which I guess I recently fixed in 61739d9 (#5910). At the time I wasn't aware that this was an invariant of encoders but it makes sense in retrospect.

@Dinnerbone @torokati44 @Imberflur could you try trunk or any commit after the one referenced above to see if this issue was resolved?

@teoxoy teoxoy self-assigned this Jul 16, 2024
@torokati44
Copy link
Contributor

torokati44 commented Jul 16, 2024

I'm seeing a different error at the moment:

thread '<unnamed>' panicked at C:\Users\runneradmin\.cargo\git\checkouts\wgpu-53e70f8674b08dd4\f44f52a\wgpu-core\src\resource.rs:900:9:
assertion failed: data.len() >= self.size.get() as usize

https://github.com/torokati44/ruffle/actions/runs/9959259396/job/27515702109#step:8:4168

EDIT: But the ones we used to see this error on (stage3d_raytrace and stage3d_sampler_partial_upload) are now passing on the first try! 🎉

@teoxoy
Copy link
Member Author

teoxoy commented Jul 16, 2024

Hmm, can you try 347d902 instead? I probably made a false assumption in 6f16ea4, will check.

@torokati44
Copy link
Contributor

Sure thing: https://github.com/torokati44/ruffle/actions/runs/9960381053/job/27519492645

@torokati44
Copy link
Contributor

Would you look at that, it actually passed! 🥳

@teoxoy
Copy link
Member Author

teoxoy commented Jul 16, 2024

I think we can finally close this then!

@teoxoy teoxoy closed this as completed Jul 16, 2024
@torokati44
Copy link
Contributor

yay! patch release when? 😛

@teoxoy
Copy link
Member Author

teoxoy commented Jul 16, 2024

Next release is scheduled for tomorrow actually, I don't think we can easily do a patch release since there have been a bunch of refactors to that area of the code.

teoxoy added a commit to teoxoy/wgpu that referenced this issue Jul 17, 2024
The size of the given `data` might be less than the size of the staging buffer.
This issue became apparent with the refactor in 6f16ea4 (gfx-rs#5946) since there is now an assert in `StagingBuffer.write()`.

Ruffle ran into this in gfx-rs#3193 (comment).
cwfitzgerald pushed a commit that referenced this issue Jul 17, 2024
The size of the given `data` might be less than the size of the staging buffer.
This issue became apparent with the refactor in 6f16ea4 (#5946) since there is now an assert in `StagingBuffer.write()`.

Ruffle ran into this in #3193 (comment).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: dx12 Issues with DX12 or DXGI area: correctness We're behaving incorrectly type: bug Something isn't working
Projects
Status: Done
7 participants