Skip to content
This repository has been archived by the owner on Sep 16, 2024. It is now read-only.

Refactor src/index.rs and Enhance Documentation #87

Merged
merged 30 commits into from
Mar 20, 2024
Merged

Conversation

tareknaser
Copy link
Collaborator

@tareknaser tareknaser commented Feb 16, 2024

Description

This PR focuses on refactoring the src/index.rs file.

Changes

The changes primarily include:

  • Define constants at the beginning of the file
  • Add doc comments for structs, fields, and methods
  • Use serde::{Deserialize, Serialize} for loading and storing of the index file
  • Rename the discover_paths function to discover_files
  • Refactor certain parts of the code to improve readability, including reducing indentation where possible
  • Add a new test to ensure that only canonical paths are present in the index
  • Use of TempDir for testing
  • Updating platform conditions for improved cross-compatibility (#[cfg(target_family = "unix")])

Edit

Fixes #89 with 633335a

…Update

Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Copy link

Benchmark for c3d41c0

Click to view benchmark
Test Base PR %
compute_bytes_large/compute_bytes 138.9±1.32µs 140.1±2.81µs +0.86%
compute_bytes_medium/compute_bytes 27.7±0.39µs 27.0±0.35µs -2.53%
compute_bytes_small/compute_bytes 127.4±1.20ns 127.7±2.16ns +0.24%
index_build/index_build/tests/ 157.7±0.66µs 161.5±1.33µs +2.41%
tests/lena.jpg/compute_bytes 13.3±0.09µs 13.5±0.28µs +1.50%
tests/test.pdf/compute_bytes 111.1±0.69µs 136.8±0.54µs +23.13%

src/index.rs Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
src/index.rs Outdated
}
let file = File::open(index_path)?;
let reader = BufReader::new(file);
let index: ResourceIndex = serde_json::from_reader(reader)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right that this changes the current file format to JSON?

In the main branch, index is stored in plain-text format:

1675087824710 58922-3276384608 Info/crdt_vs_ot.jpg
1675087824610 125308-3041567246 Info/Watson-Crick & Hoogsteen.png
1675087831180 386929-63288935 Screenshots/Screenshot_20210717_120856.jpg
1675087831440 945565-3484272718 Screenshots/Screenshot_20220111_040205_com.adobe.reader.jpg
1675087832200 57186-2804307164 Screenshots/lenovo-fedora-20210427-023231-183683953.png
1675087832990 107862-4031618865 Screenshots/lenovo-fedora-20210328-050520-182040364.png
1675087833830 343648-561346264 Screenshots/Screenshot_20210804_003308.jpg
1675087834130 257481-490338418 Info/Technologies.jpg
1675087846130 415620-3764250547 Gif/сitrus.gif

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found that storing and loading the same index with the main branch doesn't produce the same object. Not a matter of sorting but timestamps might be different as well. The code in the main branch for storing and loading contains more logic that just writing the struct to a file.
The idea here was to make ResourceIndex implement Serialize. with this setup, we can serialize to various formats

but It's up to you, we can revert to the old file format if you prefer it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your idea, I just don't want to write huge JSON to disk.

Some compact format would be great: postcard, CBOR, what else could serve this purpose?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On other hand, if we use automatic serialization, then the format will be excessive most likely: 2 mappings means creating two separate structures. Right now, we write only one table and derive 2 mappings from it during reading.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On other hand, if we use automatic serialization, then the format will be excessive

Good point. I noticed that as well.

I now reverted ResourceIndex::store() and ResourceIndex::load() to the old code. I just kept the code to make them serializable to have this possibility later.

I just wanted to add some notes about the current implementation of the 2 methods:

  • we currently don't store or load information about collisions in ResourceIndex
  • I noticed some inconsistencies when trying to store an object of ResourceIndex and then load it. I noticed some differences not only in order but also in modified_time. I think this may be a problem with how we parse the time

I could move this conversation to an issue and work on a separate PR to solve it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tareknaser this is interesting

I noticed some inconsistencies when trying to store an object of ResourceIndex and then load it. I noticed some differences not only in order but also in modified_time. I think this may be a problem with how we parse the time

Could you tell more about the timestamps discrepancies? Probably steps to reproduce with ark-cli? Yeah, we could create separate issue if this is a bug indeed.

we currently don't store or load information about collisions in ResourceIndex

Yes, but is it an issue? Collisions should be derivable from the table itself, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw test case fn index_entry_order() could help debug lines ordering.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you tell more about the timestamps discrepancies? Probably steps to reproduce with ark-cli? Yeah, we could create separate issue if this is a bug indeed.

I actually noticed this difference when I wrote a simple test
Here is the test to reproduce

#[test]
fn resource_index_load_store() {
    let temp_dir = TempDir::new("arklib_test")
        .expect("Failed to create temporary directory");
    let temp_dir = temp_dir.into_path();

    create_file_at(
        temp_dir.to_owned(),
        Some(FILE_SIZE_1),
        Some(FILE_NAME_1),
    );
    let index = ResourceIndex::build(temp_dir.to_owned());


    index
        .store()
        .expect("Should store index successfully");


    let loaded_index = ResourceIndex::load(temp_dir.to_owned())
        .expect("Should load index successfully");

    // Assert that the loaded index is equal to the original index
    assert_eq!(index, loaded_index);
}

When I run this test, I get this output

thread 'index::tests::resource_index_load_store' panicked at src/index.rs:824:9:
assertion `left == right` failed
  left: ResourceIndex { id2path: {ResourceId { data_size: 10, crc32: 3817498742 }: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt"}, path2id: {"/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt": IndexEntry { modified: SystemTime { tv_sec: 1709340915, tv_nsec: 542881544 }, id: ResourceId { data_size: 10, crc32: 3817498742 } }}, collisions: {}, root: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX" }
 right: ResourceIndex { id2path: {ResourceId { data_size: 10, crc32: 3817498742 }: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt"}, path2id: {"/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt": IndexEntry { modified: SystemTime { tv_sec: 1709340915, tv_nsec: 542000000 }, id: ResourceId { data_size: 10, crc32: 3817498742 } }}, collisions: {}, root: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX" }

Notice how tv_nsec is different

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like rounding.. If you could investigate why is it rounded, it would be great.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand now
We're currently storing and loading timestamps in milliseconds, but SystemTime relies on Timespec under the hood which is:

struct Timespec {
    tv_sec: i64,
    tv_nsec: Nanoseconds,
}

I confirmed locally that if we store and parse the timestamp as nanoseconds, we don't run into this issue. I'll double-check before pushing the code.

src/index.rs Outdated Show resolved Hide resolved
src/index.rs Show resolved Hide resolved
pub fn index_new(&mut self, path: &dyn AsRef<Path>) -> Result<IndexUpdate> {
log::debug!("Indexing a new path");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use name track_addition if it's same function as here https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L359

Comment on lines +365 to 382
/// Updates a single entry in the index with a new resource located at the
/// specified path, replacing the old resource associated with the given
/// ID.
///
/// # Restrictions
///
/// The caller must ensure that:
/// * the index is up-to-date except for this single path
/// * the path has been indexed before
/// * the path maps into `old_id`
/// * the content by the path has been modified
///
/// # Errors
///
/// Returns an error if the path does not exist, if the path is a directory
/// or an empty file, if the index cannot find the specified path, or if
/// the content of the path has not been modified.
pub fn update_one(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me, that we should separate this method into track_deletion and track_update as it was done in #72

https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L393
https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L416

The idea is that track methods serve as performance workaround when end-users know what they are doing. On other hand, update_all should be efficient enough to be called all the time. But from API point of view, it track methods are also simpler sometimes.

OPEN FOR DISCUSSION

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so the API will look like:

  • track_addition(), track_update() and track_deletion() when the user wants to be specific about the action
  • update_all() otherwise

This looks like a clean API but I have some questions about other public methods

  • what are the use cases for ResourceIndex::provide()?
  • we can now remove forget_id() in favor of track_deletion() or they do different tasks?

Copy link
Member

@kirillt kirillt Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tareknaser

  • ResourceIndex::provide() is the starting point for any user/app. It combines index loading and updating in single step. This could be done by separate calls, but it seems to me that users will always need updated index before they do something. I can imagine rare use-cases where they need more control, so we should expose separate steps, too.

  • Name forget_id is confusing now. In this PR, it is what is called track_deletion in the other PR. We have private function forget_path that is symmetric to insert_entry, both are called from track methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResourceIndex::provide() is the starting point for any user/app. It combines index loading and updating in single step.

Got it
Thanks for clearing this out.

Name forget_id is confusing now.

The initial idea for this PR was to rewrite the file src/index.rs as it is and then implement the changes to replace #72
Do you prefer to include all the changes in the current PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... It's difficult to keep in mind several PRs because the scope is pretty big.

Let's incorporate useful parts of #72 and unit tests from there into this PR.

src/index.rs Show resolved Hide resolved
// canonicalize the path to avoid duplicates
match fs::canonicalize(&path) {
Ok(canonical_path) => {
discovered_files.insert(canonical_path, entry);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After canonicalization, we should trim the common prefix ("root").

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's related to #86. Do you want me to implement it in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, we can do it in separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After spending some time modifying this PR based on #72, I've realized that it might not be a good idea to combine all the changes from both PRs into a single PR. #72 has significant API changes for ResourceIndex, and I think we should discuss these changes further in #91 before proceeding.

This PR focuses only on refactoring certain parts of the code to enhance readability. It doesn't alter any logic for arklib so it might be a good idea to keep it simple.
What do you think?

src/index.rs Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Copy link

Benchmark for 1c39010

Click to view benchmark
Test Base PR %
compute_bytes_large/compute_bytes 572.8±1.88µs 137.4±1.54µs -76.01%
compute_bytes_medium/compute_bytes 27.5±0.45µs 26.8±0.43µs -2.55%
compute_bytes_small/compute_bytes 128.6±1.53ns 129.2±3.06ns +0.47%
index_build/index_build/tests/ 157.8±0.82µs 171.0±1.28µs +8.37%
tests/lena.jpg/compute_bytes 13.4±0.07µs 13.4±0.09µs 0.00%
tests/test.pdf/compute_bytes 339.3±16.35µs 108.6±1.37µs -67.99%

src/index.rs Outdated Show resolved Hide resolved
fn forget_path(
&mut self,
path: &CanonicalPath,
path: &Path,
old_id: ResourceId,
) -> Result<IndexUpdate> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I would return just deleted collection, because added is always empty here.
  2. Maybe name it forget_entry since path and id have same importance?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would return just deleted collection, because added is always empty here.

That's why I suggest finalizing the API schema in #91 before implementing changes in a separate PR. #72 modifies IndexUpdate, adds its Default implementation, and integrates it into multiple API methods.

Trying to merge #72 changes into this PR while making further modifications along the way could become complicated.

src/index.rs Outdated Show resolved Hide resolved
Comment on lines +566 to +592
/// Removes the given resource ID from the index and returns an update
/// containing the deleted entries
pub fn forget_id(&mut self, old_id: ResourceId) -> Result<IndexUpdate> {
log::debug!("Forgetting a single entry in the index");

// Collect all paths associated with the old ID
let mut old_paths = Vec::new();
for (path, entry) in &self.path2id {
if entry.id == old_id {
old_paths.push(path.clone());
}
}

// Remove entries from path2id and id2path
for path in &old_paths {
self.path2id.remove(path);
}
self.id2path.remove(&old_id);

let mut deleted = HashSet::new();
deleted.insert(old_id);

Ok(IndexUpdate {
added: HashMap::new(),
deleted,
})
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From index/update-one branch:

    /// The caller must ensure that:
    /// * the index is up-to-date except this single id
    /// * the resource with this id has been indexed before
    /// * the resource with this id doesn't exist anymore
    /// 
    /// Should only be used if reactive updates are not possible.
    pub fn track_deletion(&mut self, id: ResourceId) -> Result<IndexUpdate> {
        log::debug!("Tracking a single deletion in the index");

        let indexed_path = self.id2path.get(&id);
        if indexed_path.is_none() {
            return Err(ArklibError::Path(
                "The id cannot be found in the index".into(),
            )); 
        }   

        let indexed_path = indexed_path.unwrap().clone();
        self.forget_entry(indexed_path.as_canonical_path(), id)?;

        Ok(IndexUpdate::default().deleted(id))
    }   

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in forget_id in #72 don't work on their own because ResourceIndex::forget_entry() and IndexUpdate were also modified.

Signed-off-by: Tarek <tareknaser360@gmail.com>
Copy link

github-actions bot commented Mar 2, 2024

Benchmark for 3ca26c4

Click to view benchmark
Test Base PR %
compute_bytes_large/compute_bytes 472.9±1.29µs 141.5±1.06µs -70.08%
compute_bytes_medium/compute_bytes 26.8±0.22µs 27.5±0.20µs +2.61%
compute_bytes_small/compute_bytes 129.4±2.68ns 129.1±0.89ns -0.23%
index_build/index_build/tests/ 158.8±5.51µs 173.6±1.02µs +9.32%
tests/lena.jpg/compute_bytes 13.3±0.07µs 13.3±0.17µs 0.00%
tests/test.pdf/compute_bytes 111.1±4.27µs 110.1±0.47µs -0.90%

Previously, the IndexEntry::modified field was parsed and loaded from miliseconds, which caused a rounding issue when comparing the loaded index with the stored one.
This was due to SystemTime relying on Timespec, which includes nanoseconds.

To address this issue, this commit updates the parsing and loading of IndexEntry::modified to use nanoseconds instead of miliseconds.

Additionally, a new test resource_index_load_store has been added to verify that the stored index is equal to the loaded one.

Signed-off-by: Tarek <tareknaser360@gmail.com>
…as broken invariants

Signed-off-by: Tarek <tareknaser360@gmail.com>
The function load_raw_metadata() is only called in unit tests and is not used elsewhere in the codebase.

Signed-off-by: Tarek <tareknaser360@gmail.com>
Copy link

github-actions bot commented Mar 8, 2024

Benchmark for b6ce103

Click to view benchmark
Test Base PR %
compute_bytes_large/compute_bytes 141.5±1.56µs 141.5±2.31µs 0.00%
compute_bytes_medium/compute_bytes 27.8±0.22µs 27.9±0.57µs +0.36%
compute_bytes_small/compute_bytes 134.5±1.54ns 134.9±4.61ns +0.30%
index_build/index_build/tests/ 160.4±0.91µs 172.3±0.68µs +7.42%
tests/lena.jpg/compute_bytes 13.9±0.07µs 13.9±0.08µs 0.00%
tests/test.pdf/compute_bytes 114.7±1.19µs 117.5±0.39µs +2.44%

@kirillt
Copy link
Member

kirillt commented Mar 11, 2024

About change from milliseconds to nanoseconds in timestamps: we should double-check that we really can have nanoseconds on mobile platforms. I don't remember precisely, but it seems we had an issue with nanoseconds due to implicit rounding to milliseconds. Could be so even on Linux.

We should investigate why this threshold was introduced: 267b844#diff-701633ed548410624cfcae90d38a986adbd60deb5aeedd92355fe2e7702a9c18L34

src/index.rs Show resolved Hide resolved
src/storage/meta.rs Outdated Show resolved Hide resolved
load_raw_metadata was a method defined to read the metadata file and return its content.
This commit removes it and its unit test because it is not used anywhere in the code and seems to be redundant

Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser
Copy link
Collaborator Author

About change from milliseconds to nanoseconds in timestamps: we should double-check that we really can have nanoseconds on mobile platforms. I don't remember precisely, but it seems we had an issue with nanoseconds due to implicit rounding to milliseconds. Could be so even on Linux.

Before making any changes to arklib on how time is stored, we need to understand the impact of storing time in nanosecond on mobile systems.

If we must keep the millisecond precision, we have two options to consider:

  • We can keep the current implementation, which stores time in milliseconds. Only problem is that IndexEntry objects won't be equal when writing to ark file and then reading from it.
  • We can track only milliseconds when retrieving last_modified of each entry. This can be done by converting the SystemTime to milliseconds in scan_entry as follows:
let duration = modified
        .duration_since(UNIX_EPOCH)
        .expect("SystemTime before UNIX EPOCH!")
        .as_millis();
let modified =
	UNIX_EPOCH + std::time::Duration::from_millis(duration as u64);

@kirillt
Copy link
Member

kirillt commented Mar 19, 2024

@tareknaser we could use nanoseconds, but:

  1. Do we really need such a precision? Can several filesystem writes happen with time difference of e.g. 10ns? I hope milliseconds is enough.
  2. We also should care about compatibility between different platforms. I'm not sure that Android provides developers with timestamps better than milliseconds.

I could imagine us needing microseconds to timestamp realistically filesystem events. Would we have the rounding issues in this case?

We can track only milliseconds when retrieving last_modified of each entry. This can be done by converting the SystemTime to milliseconds in scan_entry

This sounds like the way to go.

We can keep the current implementation, which stores time in milliseconds. Only problem is that IndexEntry objects won't be equal when writing to ark file and then reading from it.

Is this sill a problem if we go the way with converting to milliseconds in scan_entry?

@tareknaser
Copy link
Collaborator Author

Do we really need such a precision? Can several filesystem writes happen with time difference of e.g. 10ns? I hope milliseconds is enough.

Yes I think milliseconds is enough. Only issue is #87 (comment)

Is this sill a problem if we go the way with converting to milliseconds in scan_entry?

No. that should solve it

tareknaser and others added 3 commits March 20, 2024 05:37
The code previously retrieved the last modified time with nanosecond precision.
To ensure compatibility across different file systems, particularly on platforms like Android, where nanosecond precision might not be supported or could lead to compatibility issues, the code now adjusts the precision to milliseconds only.

Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser tareknaser merged commit bf3500d into main Mar 20, 2024
1 of 2 checks passed
@tareknaser tareknaser deleted the rewrite-index branch March 20, 2024 21:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Empty index is generated if name of the root folder starts with .
2 participants