br: fix lightning split large csv file error and adjust s3 seek result #27769
@@ -648,6 +648,17 @@ func (r *s3ObjectReader) Close() error {
	return r.reader.Close()
}

// eofReader is an io.ReadCloser that always returns io.EOF.
type eofReader struct{}

func (r eofReader) Read([]byte) (n int, err error) {
	return 0, io.EOF
}

func (r eofReader) Close() error {
	return nil
}

// Seek implements the io.Seeker interface.
//
// Currently, tidb-lightning depends on this method to read parquet files from s3 storage.
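For orientation, here is a minimal standalone sketch (not part of the PR) showing that this eofReader satisfies io.ReadCloser and what a caller sees once it has been swapped in: every Read reports io.EOF and Close is a no-op.

package main

import (
	"fmt"
	"io"
)

// eofReader mirrors the type added in the diff: a reader that is already
// exhausted, so every Read reports io.EOF and Close does nothing.
type eofReader struct{}

func (r eofReader) Read([]byte) (n int, err error) { return 0, io.EOF }
func (r eofReader) Close() error                   { return nil }

func main() {
	var rc io.ReadCloser = eofReader{} // compile-time check: eofReader is an io.ReadCloser
	buf := make([]byte, 16)
	n, err := rc.Read(buf)
	fmt.Println(n, err == io.EOF) // prints: 0 true
	_ = rc.Close()
}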
@@ -666,6 +677,18 @@ func (r *s3ObjectReader) Seek(offset int64, whence int) (int64, error) {
	if realOffset == r.pos {
		return realOffset, nil
	} else if realOffset >= r.rangeInfo.Size {
		// See: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
		// Because s3's GetObject interface doesn't allow a range that matches zero-length data,
		// if the position is out of range we need to always return io.EOF after the seek operation.

		// close the current reader and replace it with one that only reports EOF
		if err := r.reader.Close(); err != nil {
			log.L().Warn("close s3 reader failed, will ignore this error", logutil.ShortError(err))
		}

		r.reader = eofReader{}
		return r.rangeInfo.Size, nil
	}

	// if the seek is ahead by no more than 64k, we discard that data

Review reply on the return value above: Fixed. I think we should return the real position, which is …
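The branch that follows this hunk (only its leading comment is visible above) handles small forward seeks by reading and discarding the intervening bytes instead of reopening the ranged GetObject request. A minimal sketch of that pattern, with hypothetical names and the 64 KiB figure taken from the comment rather than from the PR's actual constant:

package main

import "io"

// maxSkipBytes is a hypothetical threshold based on the "64k" in the code
// comment: forward seeks shorter than this are cheaper to serve by draining
// the open stream than by issuing a new ranged GetObject request.
const maxSkipBytes = 64 * 1024

// skipForward reports whether a forward seek of n bytes was served by simply
// reading and discarding bytes from the already-open stream. If n is larger
// than the threshold, it does nothing and the caller should reopen the
// request at the target offset instead.
func skipForward(r io.Reader, n int64) (bool, error) {
	if n <= 0 || n > maxSkipBytes {
		return false, nil
	}
	_, err := io.CopyN(io.Discard, r, n)
	return err == nil, err
}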
Review discussion on the chunk-split threshold:

What happens if the file size is slightly bigger than int64(cfg.Mydumper.MaxRegionSize)*11/10? 🤣

Then the file will be split by cfg.Mydumper.MaxRegionSize, so the second chunk size is about 1/10 * cfg.Mydumper.MaxRegionSize.

If the file size is slightly bigger than int64(cfg.Mydumper.MaxRegionSize)*2, the third chunk will be very small; will this be a problem?

Not a big problem. The common case is that the data export tool (like dumpling or mydumper) sets the exported file size with cfg.Mydumper.MaxRegionSize, but the output file size might be slightly bigger or smaller, so we can avoid splitting off a lot of small chunks.

Make that 11/10 a named constant...

That would make the code a bit ugly because 11/10 is a float. I added a code comment to explain why the threshold needs to be increased 😅

You can make "10" a constant and set the upper limit to MaxRegionSize + MaxRegionSize/10.
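A minimal sketch of that last suggestion, with hypothetical names (the PR itself keeps the inline *11/10 expression and documents it in a comment): expressing the 10% slack as a named integer divisor keeps the threshold in integer arithmetic.

package main

import "fmt"

// regionSizeSlackDivisor is a hypothetical constant expressing the 10% tolerance:
// a file is only split when it exceeds MaxRegionSize by more than
// MaxRegionSize/regionSizeSlackDivisor, so files that are just slightly larger
// than the export tool's target size are not cut into a tiny trailing chunk.
const regionSizeSlackDivisor = 10

// splitThreshold returns the size above which a CSV file would be split.
func splitThreshold(maxRegionSize int64) int64 {
	return maxRegionSize + maxRegionSize/regionSizeSlackDivisor
}

func main() {
	const maxRegionSize int64 = 256 << 20       // e.g. a 256 MiB region size
	fmt.Println(splitThreshold(maxRegionSize))  // 295279001, i.e. roughly 256 MiB + 10%
}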