
fix: restore portable backup error #20115

Closed
wants to merge 5 commits

Conversation

rickif
Contributor

@rickif rickif commented Nov 20, 2020

This PR fixes a restore problem in InfluxDB 1.x.

influxd restore -portable -host [host] -db [db] -newdb [newdb] [backup_path]

Restoring a portable backup with this command may fail with an error like the one below:

2020/11/20 13:55:02 error updating meta: updating metadata on influxd service failed: err=read tcp x.x.x.x:52286->x.x.x.x:8088: read: connection reset by peer, n=16
restore: updating metadata on fctsdb service failed: err=read tcp x.x.x.x:52286->x.x.x.x:8088: read: connection reset by peer, n=16

The corresponding influxd error log looks like this:

ts=2020-11-20T13:55:02.555262Z lvl=info msg="failed to decode meta: proto: meta.ShardInfo: illegal tag 0 (wire type 0)" log_id=0QQzP3qG000 service=snapshot

This problem happens at random, but more frequently on networks with high latency. net.(*TCPConn).Read does not guarantee that exactly len(buf) bytes have been read when the call returns, which leads to this problem. Replacing the net.(*TCPConn).Read call with io.ReadFull fixes it.
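The contrast between the two calls can be sketched in a few lines, using testing/iotest to stand in for a laggy TCP connection (this is an illustration only, not code from the PR):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"testing/iotest"
)

// demo contrasts a single Read with io.ReadFull against a reader that
// delivers one byte per call, standing in for a slow TCP connection.
func demo() (int, int) {
	payload := []byte("abcdefgh")
	buf := make([]byte, 8)

	// A bare Read may legally return fewer than len(buf) bytes.
	n, _ := iotest.OneByteReader(bytes.NewReader(payload)).Read(buf)

	// io.ReadFull loops internally until buf is completely filled.
	n2, _ := io.ReadFull(iotest.OneByteReader(bytes.NewReader(payload)), buf)
	return n, n2
}

func main() {
	n, n2 := demo()
	fmt.Println(n, n2) // 1 8
}
```

The first call returns after a single byte, exactly the "early return" seen on high-latency links; the second always comes back with a full buffer or an error.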

  • CHANGELOG.md updated with a link to the PR (not the Issue)
  • Well-formatted commit messages
  • Rebased/mergeable
  • Tests pass
  • http/swagger.yml updated (if modified Go structs or API)
  • Feature flagged (if modified API)
  • Documentation updated or issue created (provide link to issue/pr)
  • Signed CLA (if not already signed)

@rickif rickif force-pushed the fix/restore-portable-backup branch from 2a56a3a to 69e2054 Compare November 20, 2020 07:47
@danxmoran
Contributor

Hi @rickif thanks for the contribution! While we review, could you push an update to the CHANGELOG.md file with a reference to this PR?

@rickif rickif force-pushed the fix/restore-portable-backup branch from 69e2054 to 8f961fa Compare November 21, 2020 06:56
@rickif rickif changed the title fix: restore portable backup failed fix: restore portable backup error Nov 21, 2020
@rickif
Contributor Author

rickif commented Nov 21, 2020

@danxmoran OK, the CHANGELOG.md is updated now.

Contributor

@danxmoran danxmoran left a comment


@rickif this change makes sense to me in theory from reading the docs on Read and ReadFull, but I'm not sure what to expect in practice from the use-case you mention (restoring on a high-latency network). In situations where Read used to return "early", will ReadFull now sit and wait for more data? Or will it still result in an error, but the error will be easier to understand?

Do you feel up to try adding tests to services/snapshotter/service.go? The 2 cases I'd ideally want covered are:

  1. conn closes before ReadFull reads enough bytes to fill typ[:]
  2. conn doesn't "contain" enough bytes to fill typ[:] when ReadFull is called, but more data arrives afterwards
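One possible way to simulate the second case without a real socket is net.Pipe, which provides a synchronous in-memory connection pair (a hypothetical sketch, not the PR's actual test code; the readHeader helper is illustrative):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// readHeader simulates case 2: the peer delivers a 4-byte header in two
// fragments, and io.ReadFull must wait for the second fragment to arrive.
func readHeader() ([4]byte, error) {
	client, server := net.Pipe()
	go func() {
		defer server.Close()
		server.Write([]byte{1, 2}) // first fragment
		time.Sleep(10 * time.Millisecond)
		server.Write([]byte{3, 4}) // rest arrives later
	}()

	var typ [4]byte
	// A bare client.Read here could return after only 2 bytes;
	// io.ReadFull keeps reading until typ[:] is full.
	_, err := io.ReadFull(client, typ[:])
	return typ, err
}

func main() {
	typ, err := readHeader()
	fmt.Println(typ, err) // [1 2 3 4] <nil>
}
```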

@rickif rickif force-pushed the fix/restore-portable-backup branch from 409486a to e7ef85d Compare November 24, 2020 10:06
@rickif
Contributor Author

rickif commented Nov 24, 2020

@danxmoran I agree with you and have added some test code. But I have doubts about the cases you mentioned.

  1. Is it necessary to cover the first case? According to the godoc, if an EOF happens after reading some but not all of the bytes, ReadFull returns ErrUnexpectedEOF. The result is clear and well defined by Go.
  2. On Linux, a read() call may return fewer bytes than requested, which can occur when fewer bytes are available in the socket than were asked for. The cause lies in kernel mode, which makes this circumstance hard to simulate in a test. I tried dividing the payload and sending the pieces one by one, with a Sleep() after each send, but that did not work at all. Do you have any good ideas about this?

@danxmoran
Contributor

Is it necessary to cover the first case? According to the godoc, If an EOF happens after reading some but not all the bytes, ReadFull returns ErrUnexpectedEOF. The result is clear and well-defined by golang.

A test case covering only a call to ReadFull would indeed be redundant. The pieces I'm interested in covering are the functions wrapping the ReadFull calls. My impression is that before this change, those functions would either complete successfully (but return partial/corrupted data) or fail with a confusing error. The functions still fail with this change, but with (hopefully) a more clear explanation in the error. If it's easy to capture the new expectation in a test, it'd help guard against future regressions.

That said, I don't want to gatekeep your contribution for too long. If setting up test cases is going to take a significant amount of mocking & other test machinery, I'm ok merging without additional tests. In that case, could you add a comment above each of the ReadFull calls explaining why ReadFull over Read?

@danxmoran
Contributor

But it does not work at all. I wonder whether you have any good idea about this?

It sounds like this will be more trouble than it's worth, I wouldn't spend more time on it. Thanks for investigating!

@rickif
Contributor Author

rickif commented Nov 24, 2020

The functions still fail with this change, but with (hopefully) a more clear explanation in the error

This change is not aimed at providing a clearer explanation; it mainly solves the problem that conn.Read() may return too early without enough data, which leads to a data-corruption error even though the data is valid. As you mentioned before, io.ReadFull blocks until the expected amount of data has been read, so the data can be parsed correctly.

@rickif rickif force-pushed the fix/restore-portable-backup branch from 43a6e42 to ce0d052 Compare November 24, 2020 15:19
Contributor

@danxmoran danxmoran left a comment


Thanks @rickif, one suggestion to curate the CHANGELOG, and some questions in the new test code.

CHANGELOG.md (review comment resolved)
}

c := snapshotter.NewClient(l.Addr().String())
if _, err := c.UpdateMeta(req, bytes.NewReader(metaBytes)); err != nil {
Is there anything useful we could assert about the success case here?

Contributor Author

@rickif rickif Dec 2, 2020


Yeah, I've added some code to cover the success case.

services/snapshotter/service_test.go (review comment resolved)
@rickif rickif force-pushed the fix/restore-portable-backup branch from af50c02 to 1336bc0 Compare December 2, 2020 14:42
@rickif
Contributor Author

rickif commented Dec 2, 2020

Thanks @rickif, one suggestion to curate the CHANGELOG, and some questions in the new test code.

@danxmoran Thanks for suggestions. The PR has been updated!

danxmoran
danxmoran previously approved these changes Dec 2, 2020
Contributor

@danxmoran danxmoran left a comment


LGTM!

@rickif
Contributor Author

rickif commented Dec 30, 2020

@danxmoran Hi, the PR has been pending for a long time. Is there any problem?

@rickif rickif force-pushed the fix/restore-portable-backup branch 2 times, most recently from 2e38acd to 67bb4d3 Compare April 1, 2021 02:21
@rickif
Contributor Author

rickif commented Apr 1, 2021

@dgnorton Hi, would you please review this PR?

@rickif rickif force-pushed the fix/restore-portable-backup branch from 67bb4d3 to 671aaa7 Compare April 1, 2021 02:42
@rickif rickif force-pushed the fix/restore-portable-backup branch from df29ce3 to 1aa3707 Compare April 1, 2021 02:53
@danxmoran
Contributor

@rickif very sorry for the delay, I'm not sure how I lost track of this one. I'll do a final review pass now.

@lesam lesam mentioned this pull request Jul 29, 2021
@lesam lesam self-requested a review July 29, 2021 21:17
@rickif
Contributor Author

rickif commented Jul 30, 2021

@rickif very sorry for the delay, I'm not sure how I lost track of this one. I'll do a final review pass now.

That would be very nice.

@lesam
Contributor

lesam commented Jul 30, 2021

Re-opened this PR as #21991 so that I can update the changelog, otherwise it is good.

@lesam lesam closed this Jul 30, 2021