Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaky tests: state_application, unicode_normalization #3353

Closed
martinetd opened this issue Aug 11, 2024 · 4 comments · Fixed by #3541
Closed

flaky tests: state_application, unicode_normalization #3353

martinetd opened this issue Aug 11, 2024 · 4 comments · Fixed by #3541

Comments

@martinetd
Copy link
Contributor

Hello,

We run tests as part of the nixos build and it would appear some tests are flaky:

---- scheduler::answering::test::state_application stdout ----
thread 'scheduler::answering::test::state_application' panicked at rslib/src/scheduler/answering/mod.rs:593:9:
assertion `left == right` failed
  left: Normal(Learning(LearnState { remaining_steps: 2, scheduled_secs: 60, elapsed_secs: 0, memory_state: None }))
 right: Normal(Learning(LearnState { remaining_steps: 2, scheduled_secs: 60, elapsed_secs: 1, memory_state: None }))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    scheduler::answering::test::state_application
---- media::check::test::unicode_normalization stdout ----
thread 'media::check::test::unicode_normalization' panicked at rslib/src/media/check.rs:723:13:
assertion `left == right` failed
  left: MediaCheckOutput { unused: ["ぱぱ.jpg"], missing: ["foo[.jpg", "normal.jpg"], missing_media_notes: [NoteId(1581236386334)], renamed: {"は\u{309a}は\u{309a}.jp>
 right: MediaCheckOutput { unused: [], missing: ["foo[.jpg", "normal.jpg"], missing_media_notes: [NoteId(1581236386334)], renamed: {"は\u{309a}は\u{309a}.jpg": "ぱぱ.>
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    media::check::test::unicode_normalization

I could reproduce the state_application one, running it in a tight loop (grabbing the exact test command from cargo test -vv to avoid the cargo overhead) reproduces within 10-60s on my laptop (~1/250 - 1/2000 runs)

That can be worked around by masking the elapsed_time explicitly but it's not great, looking at the code self.set_elapsed_secs_equal might interfere with the test if it runs right at a second boundary?

Unfortunately I couldn't reproduce the second one through sheer running it in a loop, either other tests interfere with some global state? or it could be an environment problem, I'm not sure.
I also had a quick look and ぱぱ.jpg apparently shouldn't have been unused but I haven't gone far enough to figure what makes things unused and why it would be the case here...

@dae
Copy link
Member

dae commented Aug 11, 2024

The second one is perhaps a race condition between writing the file, and the FS metadata being visible. ext4? We'll probably need to add a small delay, or force a flush to disk.

@martinetd
Copy link
Contributor Author

File system developer hat on, a file system must make metadata changes visible immediately after the IO returned, even if it's not flushed to disk, or that's a bug in the FS. I'm not aware of such bugs recently so I'd be surprised we ran into one... But that is a valuable hint; if unused is based on file timestamp there are some filesystems with a second granularitywhere it could be easy to imagine that a timestamp didn't change during the duration of the test...

Unfortunately I just tried on one (after fiddling a bit, ext3 works with utf8 and has second granularity), and I can't reproduce with that either; I've checked with strace that the timestamp returned by stat() doesn't change during the duration of the test and the test still passes

$ strace -f -e %filesystem -v ... 2>&1 | grep '\\343'
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 8
[pid 3188683] statx(10, "\343\201\257\343\202\232\343\201\257\343\202\232.jpg", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_blksize=1024, stx_attributes=0, stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=16392, stx_size=12, stx_blocks=4, stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED|STATX_ATTR_AUTOMOUNT|STATX_ATTR_MOUNT_ROOT|STATX_ATTR_VERITY|STATX_ATTR_DAX, stx_atime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_ctime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_mtime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_rdev_major=0, stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=1, stx_mnt_id=0x6fa}) = 0
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg", O_RDONLY|O_CLOEXEC) = 11
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 11
[pid 3188683] statx(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_blksize=1024, stx_attributes=0, stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=16393, stx_size=12, stx_blocks=4, stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED|STATX_ATTR_AUTOMOUNT|STATX_ATTR_MOUNT_ROOT|STATX_ATTR_VERITY|STATX_ATTR_DAX, stx_atime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_ctime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_mtime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_rdev_major=0, stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=1, stx_mnt_id=0x6fa}) = 0
[pid 3188683] unlink("/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg") = 0
[pid 3188683] statx(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7f8a79bfd210) = -1 ENOENT (No such file or directory)
[pid 3188683] statx(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_blksize=1024, stx_attributes=0, stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=16393, stx_size=12, stx_blocks=4, stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED|STATX_ATTR_AUTOMOUNT|STATX_ATTR_MOUNT_ROOT|STATX_ATTR_VERITY|STATX_ATTR_DAX, stx_atime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_ctime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_mtime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_rdev_major=0, stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=1, stx_mnt_id=0x6fa}) = 0
[pid 3188683] unlinkat(4, "\343\201\261\343\201\261.jpg", 0) = 0

Is there another timestamp in the media.db ?

@martinetd
Copy link
Contributor Author

rslib/src/collection/timestamps.rs apparently inserts times in milliseconds in media.db, so on the very powerful builders I could see the tests running in <1ms -- adding a 1ms delay somewhere could work around the issue if using a more precise timestamp is more work.

@dae
Copy link
Member

dae commented Oct 26, 2024

@abdnh if this interests you, would you mind looking into this when you have a chance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants