Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieval deal fails with "deal data transfer failed: incomplete response" - miner side view #5530

Closed
stuberman opened this issue Feb 5, 2021 · 13 comments
Labels
area/markets/client area/markets/retrieval team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs

Comments

@stuberman
Copy link

Describe the bug
When a client requests retrieval of a sealed deal, miner ignored unsealed sector, unseals sealed sector and miner see error "deal data transfer failed: incomplete response". This occurs for different clients. Each deal is 32GiB and miner works fine.

To Reproduce
Steps to reproduce the behavior:

  1. Watch miner logs as various clients try to retrieve their deals.
  2. See error

Expected behavior

  1. Unsealed stored sectors would extract deal and send to client with success.
  2. Sealed sectors will unseal and extract deal and send to client with success.

Screenshots
See attached file for various logs and displays

Version (run lotus version):
lotus-miner version 1.4.1+mainnet+git.ba571794d (and earlier versions)

Additional context
Have seen this on worker and miner. There are no other tasks running (no sealing right now) and plenty of available disk space >
Unseal.txt
1 TiB NVMe in each sealing repo.

@flyworker
Copy link

we are doing some retrieval testing today.

@f8-ptrk
Copy link
Contributor

f8-ptrk commented Feb 5, 2021

i can firm the "incomplete response" behavior after unsealing for hours

@stuberman
Copy link
Author

stuberman commented Feb 5, 2021

Here is another retrieval deal that failed today - sector 305 deal ID 1369755 - 2 GB deal
Stored sealed and unsealed sector final log entry: ""state": "DealStatusErrored", "message": "deal data transfer failed: incomplete response"}"

Unseal sector 305.txt

@dirkmc
Copy link
Contributor

dirkmc commented Feb 5, 2021

Thanks for the bug reports. Right now we're concentrating on getting batch publish deals out but we'll be looking into retrieval deal issues next.

@dirkmc
Copy link
Contributor

dirkmc commented Feb 11, 2021

The fix for this should be coming in the next release: #5417

@dirkmc
Copy link
Contributor

dirkmc commented Feb 11, 2021

Note that there is also work being done to improve unseal performance: filecoin-project/rust-fil-proofs#1402

@dirkmc dirkmc changed the title Retrieval deal fails - miner side view Retrieval deal fails with "deal data transfer failed: incomplete response" - miner side view Feb 17, 2021
@raulk raulk added the team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs label Feb 23, 2021
@stuberman
Copy link
Author

stuberman commented Mar 5, 2021

Fast retrieval failure today (on miner side - not sure who client is)
lotus-miner version

Daemon: 1.5.0+mainnet+git.55b85f314+api1.0.1
Local: lotus-miner version 1.5.0+mainnet+git.55b85f314

Fast retrieval failure.txt

@dirkmc
Copy link
Contributor

dirkmc commented Mar 5, 2021

Thanks @stuberman

Surfacing that in the previous comment the error on the miner side is

2021-03-05T00:25:37.594Z	INFO	graphsync	responsemanager/responsemanager.go:477	response failed: %!w(traversal.SkipMe={})

I need to investigate but I believe graphsync emits this error when it's unable to find a block.
So it seems that the provider unsealed the data but then graphsync was unable to find all of the blocks for the data.

@stuberman
Copy link
Author

@dirkmc I do get graphsync errors on a regular basis - below are recent graphsync events.
I am a miner (provider?) and able to work with someone who wants to test to replicate or get more details.
As far as I know all of the sectors and deals are intact, even if there are errors retrieving them.
I just looked at the storage array and ran a scrub which returned no errors.

grep graphsync /pool/logs/miner.log

2021-03-04T01:55:33.734Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-04T01:55:37.183Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:55:48.987Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:56:04.489Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:56:05.165Z WARN dt_graphsync graphsync/graphsync.go:150 graphsync request context cancelled, channel Id: 12D3KooWQzMJbLno5cudsqTUEsrycN1U3gHVtayRj4dZjRnEvrfc-12D3KooWQvdhZSSArWTSBiyjVZa3mAmzoLHXV2QS9nFoaGLv3fL7-47
2021-03-04T01:56:17.083Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:56:33.370Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:56:44.607Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:56:58.517Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:57:04.437Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:57:18.759Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T01:57:19.754Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: message too large
2021-03-04T02:35:17.346Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-04T04:09:44.114Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-04T15:22:29.650Z WARN dt_graphsync graphsync/graphsync.go:150 graphsync request context cancelled, channel Id: 12D3KooWQzMJbLno5cudsqTUEsrycN1U3gHVtayRj4dZjRnEvrfc-12D3KooWQvdhZSSArWTSBiyjVZa3mAmzoLHXV2QS9nFoaGLv3fL7-47
2021-03-04T15:23:15.149Z INFO graphsync messagequeue/messagequeue.go:176 cant open message sender to peer 12D3KooWQzMJbLno5cudsqTUEsrycN1U3gHVtayRj4dZjRnEvrfc: routing: not found
2021-03-04T23:43:30.523Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-04T23:44:17.408Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T00:25:37.594Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T00:26:17.408Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T01:26:03.408Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T01:27:17.409Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T01:52:02.086Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T01:53:24.701Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T04:31:28.429Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T04:35:17.408Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T04:37:48.338Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T04:40:17.409Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T05:31:10.132Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T05:32:08.685Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T09:59:22.657Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T10:00:17.408Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset
2021-03-05T10:18:49.250Z INFO graphsync responsemanager/responsemanager.go:477 response failed: %!w(traversal.SkipMe={})
2021-03-05T10:23:30.747Z INFO graphsync impl/graphsync.go:299 Graphsync ReceiveError: stream reset

pool: pool
state: ONLINE
scan: scrub repaired 0B in 0 days 08:03:36 with 0 errors on Wed Mar 3 00:25:01 2021
config:

NAME                                    STATE     READ WRITE CKSUM
pool                                    ONLINE       0     0     0
  raidz1-0                              ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKRJ0VT  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_Y5G6U7DC  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_Y5G6RNUC  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKRB2YT  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKNTH1T  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKR28KT  ONLINE       0     0     0
  raidz1-1                              ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKMZBJT  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKAZZPT  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKM10TT  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKNJR6T  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKMGXBT  ONLINE       0     0     0
    scsi-SATA_WDC_WUH721414AL_QGKNHUST  ONLINE       0     0     0
spares
  scsi-SATA_WDC_WUH721414AL_Y5G4PAJC    AVAIL   

errors: No known data errors

@dirkmc
Copy link
Contributor

dirkmc commented Mar 31, 2021

Possible root cause: filecoin-project/go-fil-markets#528

@stuberman were you seeing this issue when you have unseal price > 0?

@stuberman
Copy link
Author

I was seeing this when my retrieval price was above 0, I have recently set the price to 0 so the retrieval deals will not fail. I have not seen the error "incomplete response" since.
I am running 1.5.3 rc2 plus GPU2

@dirkmc
Copy link
Contributor

dirkmc commented Mar 31, 2021

Ok thanks, that's very helpful. I'm going to close this as a dup of #5829

@dirkmc dirkmc closed this as completed Mar 31, 2021
@jimpick
Copy link
Contributor

jimpick commented Mar 31, 2021

@stuberman That was one of my retrieval testing deals ... so I should have some logs on the client side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/markets/client area/markets/retrieval team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs
Projects
None yet
Development

No branches or pull requests

7 participants