Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate tasks from diffs for exporting #803

Merged
merged 16 commits into from
Oct 14, 2020
Merged

Conversation

arielshaqed
Copy link
Contributor

Part of #534.

Reviewers:

  • @guy-har for everything
  • @nopcoder for changes to diff format, and for Go, and I hope also everything.

@arielshaqed arielshaqed force-pushed the feature/534-diff-to-tasks branch from a4c1e0d to 61caf0d Compare October 12, 2020 14:35
@arielshaqed
Copy link
Contributor Author

hypnotoad says: ignore test failures, review code!

"fmt"
"strings"

rinqueue "github.com/erikdubbelboer/ringqueue"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting - why rinqueue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-happened because of something Go-ish. There is a bug in ringqueue that the package is actually named rinqueue. Anything else causes fun with gofmt while saving etc.


const successFilename = "_lakefs_success"

const CopyActipn = "export:copy"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. s/Actipn/Action/
  2. can scope them in one const - usually, it just telling me they are all part of the same gang
  3. optional add a type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1+2 are great. Not doing 3, because Action is actually a string on TaskData.

return *ret, true
}

func makeDirMatchCache(pred func(path string) bool) *dirMatchCache {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func makeDirMatchCache(pred func(path string) bool) *dirMatchCache {
func newDirMatchCache(pred func(path string) bool) *dirMatchCache {

any reason you prefer 'make' and not the usual 'new' prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value factories are called make in C++. Wait, you mean we're not using C++?!
(Done)

Comment on lines 61 to 85
dir = dirname(dir)
var ok bool
if ret, ok = dmc.upMatchCache[dir]; ok {
break
}
if dmc.pred(dir) {
copy := dir
ret = &copy
break
}
if dir == "" {
break
}
}
for dir = dirname(filename); dir != "" && (ret == nil || dir != *ret); dir = dirname(dir) {
dmc.upMatchCache[dir] = ret
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does pred can return true for empty dir?
if empty dir got no meaning and we do not cache the result for empty dir, I think we can move the dir == "" check to be after dir = dirname(dir)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point: it certainly does make sense to return true for the empty directory.

Exporting the object and writing a test for it, so I can fix its behaviour (in the sense of not changing, and also in the sense of making it behave correctly).

return &dirMatchCache{pred: pred, upMatchCache: make(map[string]*string)}
}

// generateTasksFromDiffs converts diffs into many tasks that depend on startTaskID, with a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/generateTasksFromDiffs/GenerateTasksFromDiffs/

var ErrBadTypeConversion = errors.New("bad type")

// nolint: stylecheck
func (dst *TaskStatusCodeValue) Scan(src interface{}) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, all the receivers have the same name (and optionally - one letter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. But the sql package does it this way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more than one letter - but I don't think for the same type have a different name for the receiver (src, dst)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it's pgtype that does this for Scan and Value. Will keep this code as it does follow the gold standard.

return fmt.Errorf("cannot convert %T to TaskStatusCodeValue: %w", src, ErrBadTypeConversion)
}

if !(sc == TaskPending || sc == TaskInProgress || sc == TaskAborted || sc == TaskCompleted) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't commit on previous lines but check the code-doc of these of does not have the matched constant name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand what I need to do here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the tasks here got wrong code-doc comment - ex: TaskAborted is ABORTED

	// TaskPending indicates a task is waiting for an actor to perform it (new or being
	// retried)
	TaskPending TaskStatusCodeValue = "pending"
	// IN_PROGRESS indicates a task is being performed by an actor.
	TaskInProgress TaskStatusCodeValue = "in-progress"
	// ABORTED indicates an actor has aborted this task with message, will not be reissued
	TaskAborted TaskStatusCodeValue = "aborted"
	// TaskCompleted indicates an actor has completed this task with message, will not reissued
	TaskCompleted TaskStatusCodeValue = "completed"
	// TaskInvalid is used by the API to report errors
	TaskInvalid TaskStatusCodeValue = "[invalid]"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks! I went from C++ style to Go style and missed the comments. It's part of parade, so on #812.

t.Helper()
rows, err := db.Query(`SELECT id FROM tasks WHERE id LIKE format('%s%%', $1::text)`, prefix)
if err != nil {
t.Errorf("[I] select remaining IDs for prefix %s: %s", prefix, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fatalf

for rows.Next() {
var id parade.TaskID
if err = rows.Scan(&id); err != nil {
t.Errorf("[I] scan ID value: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fatalf - not sure you can continue to scan after that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? It's just a broken row (e.g. parse failure). rows.Next is welcome to fail if the database is 🔥 on fire.

So you're actually right, the code is wrong, and I added a test that rows were successfully scanned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that calling Next after Scan that failed will probably exit. The code reader that sees Errorf will think that this error is something we can live with and continue - in this case adding the id to the slice, which is not what we wanted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done, because of a difference in how we two think about test failures.

This is how Moketa the code monkey once tried to teach me how to test. Basically, stop the test only when continuing the test will fail on testing code. Otherwise continue and print more error messages to get more data about the failure. In this case I don't mind pushing a dud id onto the slice: it won't be the first error, but the remaining error messages might help me understand what went wrong. E.g. otherwise I don't see any of the other, possibly successfully-parsed, IDs.

Comment on lines 255 to 264
if err := dst.Scan(c.in); err != nil {
if !errors.Is(err, c.err) {
t.Errorf("got err %v, expected %v", err, c.err)
}
} else {
if dst != c.out {
t.Errorf("expected %s, got %s", c.out, dst)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err := dst.Scan(c.in); err != nil {
if !errors.Is(err, c.err) {
t.Errorf("got err %v, expected %v", err, c.err)
}
} else {
if dst != c.out {
t.Errorf("expected %s, got %s", c.out, dst)
}
}
err := dst.Scan(c.in)
if !errors.Is(err, c.err) {
t.Fatalf("got err %v, expected %v", err, c.err)
}
if err == nil && dst != c.out {
t.Fatalf("expected %s, got %s", c.out, dst)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Taking it but keeping it an error rather than a fatal. It gives me less indentation (which is good for my 80-column brain) and it tests the case where an error was expected but not received. Thanks!

@codecov-io
Copy link

codecov-io commented Oct 13, 2020

Codecov Report

Merging #803 into master will increase coverage by 0.69%.
The diff coverage is 89.40%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #803      +/-   ##
==========================================
+ Coverage   41.87%   42.57%   +0.69%     
==========================================
  Files         133      134       +1     
  Lines       10435    10583     +148     
==========================================
+ Hits         4370     4506     +136     
- Misses       5483     5491       +8     
- Partials      582      586       +4     
Impacted Files Coverage Δ
catalog/diff.go 18.51% <ø> (ø)
metastore/diff.go 0.00% <0.00%> (ø)
parade/ddl.go 72.50% <87.50%> (+1.14%) ⬆️
export/tasks_generator.go 90.97% <90.97%> (ø)
catalog/cataloger_commit.go 81.81% <0.00%> (+3.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1cea346...d7eb030. Read the comment docs.

Copy link
Contributor Author

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @nopcoder ! PTAL...

"fmt"
"strings"

rinqueue "github.com/erikdubbelboer/ringqueue"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-happened because of something Go-ish. There is a bug in ringqueue that the package is actually named rinqueue. Anything else causes fun with gofmt while saving etc.


const successFilename = "_lakefs_success"

const CopyActipn = "export:copy"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1+2 are great. Not doing 3, because Action is actually a string on TaskData.

Comment on lines 61 to 85
dir = dirname(dir)
var ok bool
if ret, ok = dmc.upMatchCache[dir]; ok {
break
}
if dmc.pred(dir) {
copy := dir
ret = &copy
break
}
if dir == "" {
break
}
}
for dir = dirname(filename); dir != "" && (ret == nil || dir != *ret); dir = dirname(dir) {
dmc.upMatchCache[dir] = ret
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point: it certainly does make sense to return true for the empty directory.

Exporting the object and writing a test for it, so I can fix its behaviour (in the sense of not changing, and also in the sense of making it behave correctly).

return *ret, true
}

func makeDirMatchCache(pred func(path string) bool) *dirMatchCache {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value factories are called make in C++. Wait, you mean we're not using C++?!
(Done)

// "generate success" task after generating all files in each directory that matches
// generateSuccessFor.
func GenerateTasksFromDiffs(exportID string, dstPrefix string, diffs catalog.Differences, generateSuccessFor func(path string) bool) ([]parade.TaskData, error) {
const initialSize = 100000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it matters, so reduced 100x.

return nil, fmt.Errorf("%s: failed to serialize %+v: %w", successPath, data, err)
}
bodyStr := string(body)
totalDependencies := td // copy to get new address each time
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh? "I just want an lvalue reference...".

var ErrBadTypeConversion = errors.New("bad type")

// nolint: stylecheck
func (dst *TaskStatusCodeValue) Scan(src interface{}) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. But the sql package does it this way

return fmt.Errorf("cannot convert %T to TaskStatusCodeValue: %w", src, ErrBadTypeConversion)
}

if !(sc == TaskPending || sc == TaskInProgress || sc == TaskAborted || sc == TaskCompleted) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand what I need to do here.

for rows.Next() {
var id parade.TaskID
if err = rows.Scan(&id); err != nil {
t.Errorf("[I] scan ID value: %s", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? It's just a broken row (e.g. parse failure). rows.Next is welcome to fail if the database is 🔥 on fire.

So you're actually right, the code is wrong, and I added a test that rows were successfully scanned.

Comment on lines 255 to 264
if err := dst.Scan(c.in); err != nil {
if !errors.Is(err, c.err) {
t.Errorf("got err %v, expected %v", err, c.err)
}
} else {
if dst != c.out {
t.Errorf("expected %s, got %s", c.out, dst)
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Taking it but keeping it an error rather than a fatal. It gives me less indentation (which is good for my 80-column brain) and it tests the case where an error was expected but not received. Thanks!

Correctly handle enum values and report errors nicely.  This will give better behaviour when
adding new enum values, and also makes code type-safer which is nice.
Only useful for tests, but still necessary for them.
These should *not* decrement total_dependencies, they've already been counted on num_signals.
This will be _partially_ filled in when computing differences, according to some requested set
of fields.  This will allow various applications of differences.
It's actually manually generated, and we missed it again on master.
Code was correct, but efficiency might not matter here.
This is a fairly major change to `DirMatchCache`, so explicitly test it (and capitalize
`dirMatchCache` to export it...).
Turns out this is Go not C++, and constructors are not named like C++ value factories.
Remove dependency on `ringqueue`, which doesn't even name its package correctly.  This might
take a bit more memory, and/or be more CPU cache friendly.  So it probably does not matter.
@arielshaqed arielshaqed force-pushed the feature/534-diff-to-tasks branch from 0cc7e51 to d7eb030 Compare October 14, 2020 06:13
Copy link
Contributor Author

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Pulling without #795. @nopcoder ended up reviewing that too, so it doesn't necessarily need much further review. But it's still better off as a separate PR.

@arielshaqed arielshaqed merged commit 6763612 into master Oct 14, 2020
@arielshaqed arielshaqed deleted the feature/534-diff-to-tasks branch October 14, 2020 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants