# Data Movement: different behavior when destination directories do not exist #130
I think this becomes a problem of understanding:

```
$ tree
.
|-- dest
`-- src
    |-- node-0
    |   `-- data.out
    `-- node-1
        `-- data.out

4 directories, 2 files
$ dcp src/node-0 dest/my-job
...
$ tree
.
|-- dest
|   `-- my-job
|       `-- data.out
`-- src
    |-- node-0
    |   `-- data.out
    `-- node-1
        `-- data.out

5 directories, 3 files
$ dcp src/node-1 dest/my-job
...
$ tree
.
|-- dest
|   `-- my-job
|       |-- data.out
|       `-- node-1
|           `-- data.out
`-- src
    |-- node-0
    |   `-- data.out
    `-- node-1
        `-- data.out

6 directories, 4 files
```
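The asymmetry above mirrors the long-standing `cp -r` convention: if the destination does not exist, the source's *contents* become the destination; if it does exist, the source directory itself is copied *into* it. A minimal reproduction with plain `cp` (no MPI required; all paths are illustrative, standing in for the `dcp` session above):

```shell
tmp=$(mktemp -d)              # scratch area so the demo is self-contained
cd "$tmp"
mkdir -p src/node-0 src/node-1
echo a > src/node-0/data.out
echo b > src/node-1/data.out
mkdir dest

# First copy: dest/my-job does not exist, so node-0's *contents* become my-job
cp -r src/node-0 dest/my-job

# Second copy: dest/my-job now exists, so node-1 is copied *into* it
cp -r src/node-1 dest/my-job

ls -R dest
```

After this runs, `dest/my-job/data.out` sits at the top level while node-1's file lands at `dest/my-job/node-1/data.out` — the same mixed layout the `tree` output above shows for `dcp`.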
This begs the question: should the destination directories be required to preexist? If you were to go one level deeper, `dcp src/node-0 dest/my-job/rank0`:

```
[2024-02-16T20:34:50] [0] [/deps/mpifileutils/src/common/mfu_param_path.c:582] ERROR: Destination parent directory is not writable `/home/mpiuser/dest/my-job' (errno=2 No such file or directory)
[2024-02-16T20:34:50] [0] [/deps/mpifileutils/src/dcp/dcp.c:479] ERROR: Invalid src/dest paths provided. Exiting run: MFU_ERR(-1001)
```

If this was a really long running job with a …
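The same failure mode is easy to reproduce with plain `cp`: a copy whose destination *parent* directory is missing fails immediately with ENOENT, just like the `dcp` error above. A sketch (paths hypothetical):

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p src/node-0
echo a > src/node-0/data.out
mkdir dest                      # note: dest/my-job is NOT created

# Copying one level deeper than the missing directory fails up front,
# analogous to dcp's "Destination parent directory is not writable" error
if cp -r src/node-0 dest/my-job/rank0 2>err.log; then
    echo "unexpectedly succeeded"
else
    echo "copy failed as expected"
fi
cat err.log
```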
Might be related: hpc/mpifileutils#416
As discussed in the Flux meeting today, we need to investigate doing two things:

Creating the directory to ensure the destination exists is required behavior to work around how …

We are in agreement that we need a way to head off user mistakes early on; otherwise the data movement could fail during the CopyOut. That `mkdir` could fail if the user does not have permission to create the directory in the given location. Things could also change between steps 1 and 2, where the user application changes the directory structure (for example) — not exactly sure what we can do there. We can't guard against everything the user application is capable of.

**Additional idea: lost+found**

After further discussion on our end, we've decided to explore implementing lost+found alongside step No. 2. In the event that step 2 or the data movement encounters any issues, lost+found will be activated to salvage the data. Given the potential for various events to occur between steps 1 and 2, we believe that performing validation upfront is unnecessary. With lost+found implemented, users will have the option to retrieve their data if they encounter any issues.
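A sketch of the pre-create step discussed above, assuming a plain shell wrapper (the real implementation lives in the data movement controller; the destination path and the trailing `dcp` invocation are illustrative only):

```shell
# Hypothetical pre-create step: ensure the copy_out destination exists
# before launching dcp, so every parallel copy sees the same layout.
dest="dest/my-job"   # illustrative destination path

if ! mkdir -p "$dest" 2>/dev/null; then
    # mkdir can fail (e.g. no permission in the parent); surface it early,
    # before any data movement has started
    echo "cannot create $dest; aborting before data movement" >&2
    exit 1
fi
echo "destination ready: $dest"
# dcp src/node-0 "$dest"   # the actual copy would follow (requires MPI)
```

`mkdir -p` is idempotent, so it is safe even if several data movement operations race to run it.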
A `mkdir` function has been added, along with ensuring index mount directories are created on the destination when copying out from gfs2/xfs filesystems.
Closing this; opened #151 for the lost+found part.
When doing something like the following, where the `my-job/` directory does not exist at `/lus/global/user/` (the destination for the `copy_out` directive):

The resulting copy out operation can look like this:

We get 4 total files, but one of them is at the root level of `my-job/`, and the index mount directory (i.e. `rabbit-node-1-0`) did not get copied over. In this case, no harm, no foul (but confusing), since all the `data.out` files are present. But this can also result in something like this:

Not good.

To break this down, the job/workflow will end up creating 4 NnfDatamovements in the `DataOut` state since it is run on 4 compute nodes. That is, 4 different data movement operations to move each compute node's data from the rabbit to the global lustre filesystem. Those 4 data movements will run (almost) in parallel. They would look something like this for 2 computes per rabbit:

One (or more) of these operations will win. That first time through, the `my-job` directory does not exist, so that first `dcp` operation is essentially a directory copy:

Since `my-job` doesn't exist, the contents of the source are copied directly to the `my-job` directory. This is the reason for the lone `data.out` at the root level above.

Afterwards, each subsequent `dcp` operation will perform the same request, but that directory now exists, which results in the index mount directory being copied over.
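For contrast, when the destination directory already exists before any copy starts, there is no race to "win" and every operation lands the same way. A minimal sketch with plain `cp` standing in for `dcp` (sequential here; the parallel case behaves the same once the directory preexists):

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p src/node-0 src/node-1
echo a > src/node-0/data.out
echo b > src/node-1/data.out

# Pre-create the destination so no copy wins the directory-creation race
mkdir -p dest/my-job

cp -r src/node-0 dest/my-job   # lands as dest/my-job/node-0/data.out
cp -r src/node-1 dest/my-job   # lands as dest/my-job/node-1/data.out
ls -R dest
```

Both index mount directories now appear under `my-job/`, with no stray `data.out` at the root level.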