
pydataverse upgrade broke OnlineDataverseDataset.upload_file() #320

Closed

mih opened this issue Jul 19, 2024 · 20 comments


mih commented Jul 19, 2024

We pass replace_file() the following instructions:

datafile.set({
    # if we do not give `label`, it would use the local filename
    'label': remote_path.name,
    # the model enforces this property, despite `label` being the
    # effective setter, and despite it being ignored and replaced
    # with the local filename
    'filename': remote_path.name,
    'directoryLabel': str(remote_path.parent),
    'pid': self._dsid,
})

It appears that directoryLabel gets ignored with pyDataverse==0.3.3.


mih commented Jul 19, 2024

What might cause the problem here is a "fix" made for gdcc/pyDataverse#171 that came with v0.3.2 in gdcc/pyDataverse#174.

I am not sure what is happening here. I thought the API of replace_file changed, but the documentation of identifier is misleading:

[screenshot: the replace_datafile documentation for the identifier parameter]

It is not the identifier of the dataset, but of the file.
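
So a call matching what the docstring actually means would look something like this (a sketch; the file id and paths are placeholders):

# `identifier` names the file to be replaced, not the dataset
response = api.replace_datafile(
    identifier='2357440',           # placeholder: database id of the existing *file*
    filename='local/path/file.txt',  # placeholder path
    json_str=json_str,
    is_filepid=False,               # True when passing a file PID instead
)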


mih commented Jul 19, 2024

I cannot determine what pid in the metadata record is supposed to be (it is mandatory). I tried switching it to be the file id, but that has no impact.


mih commented Jul 19, 2024

And something is out of sync with the underlying Dataverse API. Here is the validation error I get when intentionally butchering the metadata record:

E           jsonschema.exceptions.ValidationError: 'filename' is a required property
E           
E           Failed validating 'required' in schema:
E               {'$schema': 'http://json-schema.org/draft-07/schema',
E                '$id': 'https://github.com/GDCC/pyDataverse/schemas/json/datafile_upload_schema.json',
E                'type': 'object',
E                'title': 'Datafile JSON upload schema',
E                'description': 'Describes the full Datafile metadata JSON file '
E                               'structure for a Dataverse API upload.',
E                'default': {},
E                'required': ['pid', 'filename'],
E                'additionalProperties': False,
E                'properties': {'description': {'$id': '#/properties/description',
E                                               'type': 'string'},
E                               'categories': {'$id': '#/properties/categories',
E                                              'type': 'array',
E                                              'additionalItems': False,
E                                              'items': {'anyOf': [{'$id': '#/properties/categories/items/anyOf/0',
E                                                                   'type': 'string'}],
E                                                        '$id': '#/properties/categories/items'}},
E                               'restrict': {'$id': '#/properties/restrict',
E                                            'type': 'boolean'},
E                               'pid': {'$id': '#/properties/pid', 'type': 'string'},
E                               'filename': {'$id': '#/properties/filename',
E                                            'type': 'string'},
E                               'label': {'$id': '#/properties/label',
E                                         'type': 'string'},
E                               'directoryLabel': {'$id': '#/properties/directoryLabel',
E                                                  'type': 'string'}}}
E           

However https://guides.dataverse.org/en/latest/api/native-api.html#replacing-files shows that a metadata record like this is possible:

{"description":"My description.","categories":["Data"],"forceReplace":false}

It has neither of the properties that the pyDataverse schema requires, and includes a forceReplace key that the schema does not allow.

The tests added for replace_datafile() in pyDataverse in gdcc/pyDataverse@7242082 bypass this issue by using a plain json.dumps().
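
For illustration, a payload built that way (mirroring the record from the guide) sidesteps the model's schema validation entirely; a minimal sketch:

import json

# A plain dict serialized with json.dumps() is never checked against
# pyDataverse's datafile upload schema, so forceReplace can be
# included and pid/filename omitted, just like in the guide's example.
json_str = json.dumps({
    "description": "My description.",
    "categories": ["Data"],
    "forceReplace": False,
})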


JR-1991 commented Jul 19, 2024

Indeed, the documentation is just confusing, because it says "... of the dataset" when it is actually of the file.

Regarding the directoryLabel issue, I am a bit puzzled, because the body of the request is not touched at any point. Do other properties of the body you send change, e.g. the description?

Trying to reproduce locally now and will get back soon.


JR-1991 commented Jul 19, 2024

I found the bug! Dataverse requires the metadata to be sent as form-data. It seems that using the json argument in HTTPX is not enough. However, when passed to the data parameter, the issue is resolved, and all metadata is updated correctly.
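
In httpx terms, the difference looks roughly like this (a sketch; url, json_str, api_token, and local_path are placeholders, not the actual pyDataverse internals):

import httpx

# Broken: `json=` serializes the metadata as an application/json body,
# which this Dataverse endpoint does not read.
# httpx.post(url, json={"jsonData": json_str}, files=...)

# Working: `data=` puts the metadata into a multipart/form-data field
# next to the file, which is what Dataverse expects.
with open(local_path, "rb") as fh:
    response = httpx.post(
        url,
        data={"jsonData": json_str},
        files={"file": fh},
        headers={"X-Dataverse-key": api_token},
    )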

I will verify that this change does not cause any problems in other areas and will then prepare a pull request to fix this as soon as possible. 😊


JR-1991 commented Jul 19, 2024

I've created a draft pull request. Can you verify that it is working correctly now?

gdcc/pyDataverse#203


mih commented Jul 19, 2024

With the tentative fix in pydataverse we arrive at 5 remaining failures:

FAILED datalad_dataverse/tests/test_add_sibling_dataverse.py::test_asdv_invalid_calls - assert 'doi:no-ffing-datalad-way-this-exists not found' in "CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex initre...
FAILED datalad_dataverse/tests/test_pydataverse.py::test_file_handling - httpx.HTTPStatusError: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/datasets/:persistentId/add?persistentId=doi%3A10...
FAILED datalad_dataverse/tests/test_pydataverse.py::test_file_removal - AssertionError: failed to upload file 400: {'status': 'ERROR', 'message': 'Error in parsing provided json'}
FAILED datalad_dataverse/tests/test_remote.py::test_remote[yes] - datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex testremote --fast mydv -c a...
FAILED datalad_dataverse/tests/test_remote.py::test_remote[no] - datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex testremote --fast mydv -c a...

It looks a bit as if the next blocker may be in the present implementation of file removal. Needs more investigation.


JR-1991 commented Jul 19, 2024

Can you provide details of the failed tests? I will be on vacation from today until next week Monday. Happy to get these fixed asap when back 😊


mih commented Jul 19, 2024

Thank you. I will, once I know and cannot figure it out myself. Happy vacation!


adswa commented Jul 22, 2024

I plan to look at the failing tests this afternoon.


adswa commented Jul 22, 2024

I'll comment here for now, but I'm happy to re-file anything of relevance in more appropriate places.

I found an issue with pydataverse.api.upload_datafile, or rather a mismatch between actual behavior and the behavior expected according to the Add a file to a dataset API. In datalad-dataverse's test_file_removal (and test_file_handling), we call upload_datafile using only the dataset identifier and filename:

response = dataverse_admin_api.upload_datafile(
    identifier=dataverse_dataset,
    filename=fpath,
)

The signature of pydataverse.api.upload_datafile and the API guide - at least to me - read as if this is all that's required. However, I found that pydataverse.api.upload_datafile's json_str cannot be None, otherwise the API returns a 400 status code. Debugging inside of pydataverse.api.upload_datafile: the call fails when json_str is None (its default value), but succeeds when it's the example payload from the Python script in the API guide, or even any arbitrary key-value pair:

(Pdb) p url
'https://demo.dataverse.org/api/v1/datasets/:persistentId/add?persistentId=doi:10.70122/FK2/SG3AT3'
(Pdb) n
> /home/adina/repos/pyDataverse/pyDataverse/api.py(1851)upload_datafile()
-> return self.post_request(
(Pdb) p files
{'file': <_io.BufferedReader name='/tmp/pytest-of-adina/pytest-3/test_file_removal0/dummy.txt'>}
(Pdb) p json_str
None
(Pdb) self.post_request(url, data={"jsonData": json_str}, files=files, auth=True)
<Response [400 Bad Request]>
(Pdb) self.post_request(url, data={"jsonData": '{"description": "Blue skies!", "categories": ["Lily", "Rosemary", "Jack of Hearts"]}'}, files=files, auth=True)
<Response [200 OK]>
(Pdb) self.post_request(url, data={"jsonData": '{"some":"thing"}'}, files=files, auth=True)
<Response [200 OK]>

(I was also a bit confused about the 200 response, as the API Guide says "You should expect a 201 (“CREATED”) response and JSON indicating the database id that has been assigned to your newly uploaded file.", but datalad-dataverse code was already adjusted to this and checks for a 200.)
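
A tolerant status check along those lines would cover both readings (a sketch):

# The guide promises 201 CREATED, but demo.dataverse.org answers 200 OK
# here, so accept both until this is settled upstream.
assert response.status_code in (200, 201), \
    f"failed to upload file {response.status_code}: {response.json()}"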

Simply adding a random JSON payload to the upload_datafile call fixes two of the failing tests (test_file_handling, test_file_removal):

diff --git a/datalad_dataverse/tests/test_pydataverse.py b/datalad_dataverse/tests/test_pydataverse.py
index 93e1719..a6a94fd 100644
--- a/datalad_dataverse/tests/test_pydataverse.py
+++ b/datalad_dataverse/tests/test_pydataverse.py
@@ -83,14 +83,16 @@ def check_duplicate_file_deposition(api, dsid, tmp_path):
 
     response = api.upload_datafile(
         identifier=dsid,
-        filename=tmp_path / 'nonunique1.txt'
+        filename=tmp_path / 'nonunique1.txt',
+        json_str='{"some":"thing"}'
     )
     # we do not expect issues here
     response.raise_for_status()
     # now upload the second file with the same content
     response = api.upload_datafile(
         identifier=dsid,
-        filename=tmp_path / 'nonunique2.txt'
+        filename=tmp_path / 'nonunique2.txt',
+        json_str='{"some":"thing"}'
     )
     response.raise_for_status()
 
@@ -109,6 +111,7 @@ def check_upload(api, dsid, fcontent, fpath, src_md5):
     response = api.upload_datafile(
         identifier=dsid,
         filename=fpath,
+        json_str='{"some":"thing"}'
     )
     # worked
     assert response.status_code == 200
@@ -161,6 +164,7 @@ def test_file_removal(
     response = dataverse_admin_api.upload_datafile(
         identifier=dataverse_dataset,
         filename=fpath,
+        json_str='{"some":"thing"}'
     )
     # worked
     assert response.status_code == 200, \
@@ -184,6 +188,7 @@ def test_file_removal(
     response = dataverse_admin_api.upload_datafile(
         identifier=dataverse_dataset,
         filename=fpath,
+        json_str='{"some":"thing"}'
     )
     assert response.status_code == 200, \
         f"failed to upload file {response.status_code}: {response.json()}"

adswa added a commit that referenced this issue Jul 23, 2024
The backstory is in #320 (comment).
A change outside of datalad-dataverse resulted in a 400 Bad Request whenever
pydataverse's upload_file() function was called without passing anything
as json_str, and its default value None was used.
In the pydataverse tests we perform a simplistic upload, where this parameter
hasn't yet been used. This change adds empty JSON to make the API call succeed.

adswa commented Jul 23, 2024

The failure in test_add_sibling_dataverse.py::test_asdv_invalid_calls is curious (probably because I'm not used to debugging special remotes): The test expects a CommandError when trying to add a non-existent dataset on Dataverse. And it expects the DOI to be reported in the exception (assert 'doi:no-ffing-datalad-way-this-exists not found' in str(ve.value)).

What it gets, currently, is a CommandError from git annex initremote:

<ExceptionInfo CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex initremote dataverse-storage type=exter...dataverse-storage 
failed'] [err: 'git-annex: external special remote error: Cannot find dataset
initremote: 1 failed'] tblen=12>

This in turn comes from

if not dv_ds.status_code < 400:
    raise RuntimeError("Cannot find dataset")

The status code is 404 Not Found, as should be expected.

I'm not sure what changed in between, or where the DOI was previously reported.
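
A minimal sketch of how the special remote could surface it again (assuming a doi variable holding the queried persistent identifier):

if not dv_ds.status_code < 400:
    # `doi` is assumed to hold the queried persistent identifier;
    # including it shows callers which dataset lookup failed
    raise RuntimeError(f"Cannot find dataset {doi}")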

I'm not sure if I'm making it too easy for myself with this patch:

diff --git a/datalad_dataverse/tests/test_add_sibling_dataverse.py b/datalad_dataverse/tests/test_add_sibling_dataverse.py
index d5b4ed3..bdbac15 100644
--- a/datalad_dataverse/tests/test_add_sibling_dataverse.py
+++ b/datalad_dataverse/tests/test_add_sibling_dataverse.py
@@ -25,7 +25,7 @@ def test_asdv_invalid_calls(
             credential="dataverse",
             **ckwa
         )
-    assert 'doi:no-ffing-datalad-way-this-exists not found' in str(ve.value)
+    assert 'Cannot find dataset' in str(ve.value)
 
 
 @pytest.mark.parametrize("mode", ["annex", "filetree"])


adswa commented Jul 23, 2024

The failing tests in git-annex's testremote are:

E               storeKey when already present:                   FAIL
E                 Exception: external special remote error: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/files/2357440/replace?User-Agent=pydataverse&key=<redacted>'

(or the following when I use gdcc/pyDataverse#201 where the authentication is moved to the header:

E               storeKey when already present:                   FAIL
E                 Exception: external special remote error: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/files/2357403/replace'

)
and I've struggled to debug this inside the special remote. It currently looks like the internal call to PyDataverse's replace_datafile() yields a bad request.

if replace_id is not None:
    response = self._api.replace_datafile(
        identifier=replace_id,
        filename=local_path,
        json_str=datafile.json(),
        # we are shipping the database fileid (int)
        is_filepid=False,
    )

The API Guide's curl call with an ID succeeds, so maybe the way that pydataverse calls replace_datafile is the issue, but I haven't yet figured out how.

shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 24, 2024
shoeffner added a commit to shoeffner/datalad-dataverse that referenced this issue Jul 24, 2024
The docker compose setup as used in pyDataverse uses a different storage
backend than demo.dataverse.org, so the storageIdentifier reports
a non-s3:// address.

This commit detects the two most common ways of specifying localhost
(leaving out 127.1 and the like) to allow for `local://` storageIdentifiers.

Relates to datalad#320.

shoeffner commented Jul 24, 2024

upload fix

I used the main branch (0621e2d) and installed pyDataverse from @JR-1991's branch (gdcc/pyDataverse#203).
I was then able to fix this issue on @JR-1991's branch on pyDataverse's side, which I think is somewhat nicer for end users than having to work around it as @adswa shows in #322, but it essentially comes down to the same issue: passing "no metadata" (None) causes a Bad Request.
You can find my updates to @JR-1991's branch in gdcc/pyDataverse#207 as a merge request to gdcc/pyDataverse#203.

local docker compose setup fix

(see #323)
Further, the test_pydataverse.py::test_file_handling would still fail for me because I was not testing with the demo.dataverse.org server, but a local server instead – I changed the check for the storageIdentifier to also allow local:// instead of s3://demo-dataverse. One could even argue that this assert is not that important, but I didn't want to decide on that.
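
The relaxed assertion amounts to something like this (a sketch; storage_id is assumed to hold the storageIdentifier read from the file metadata):

# demo.dataverse.org uses an s3 backend, while the docker compose setup
# stores locally, so both prefixes have to pass.
assert storage_id.startswith(('s3://demo-dataverse', 'local://'))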

@adswa's observed 400 Bad Request

I checked the error @adswa is facing now:

E               storeKey when already present:                   FAIL
E                 Exception: external special remote error: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/files/342/replace'

by printing the response text if the error is a 400:

{"status":"ERROR","message":"Error! You may not replace a file with a file that has duplicate content."}

So it would be better to somehow get this error message into git-annex to debug such issues.
Anyway, the issue seems to be that one can (no longer?) replace a file with identical content.
The docs don't seem to mention this: https://guides.dataverse.org/en/6.3/developers/s3-direct-upload-api.html, but it seems they have "forceReplace":"true" in their example, which the DVObject does not provide in its .json() method.
Even changing json_str=datafile.json(), to this beautiful expression (which makes me wonder if a json string is the best way to pass this information):

json_str=json.dumps(json.loads(datafile.json()) | {"forceReplace": "true"}),  # yes, the API docs have "true" string, not a bool

does not seem to solve it and still causes the duplicate content error. I think this is something worth mentioning upstream at Dataverse – I think a 409 Conflict might be a better suited response code here. Plus, the docs should mention this.
Alas, I checked the dataverse code and the error message is stored in the string bundle as new_file_same_as_replacement and emitted here:

https://github.com/IQSS/dataverse/blob/f7cd9f5f80ffcd6d3dbfbe75b8a83131ee1f55b8/src/main/java/edu/harvard/iq/dataverse/datasetutility/AddReplaceFileHelper.java#L1327-L1335

This piece of code has not changed in the last four years, so I guess there is something else which now causes this to be treated as an error – in fact, the comment above it even says that this would be a warning, not an error.

Do you roughly know since when the test (test_remote.py::test_remote[yes]) has been failing? We might be able to pin it down further by going through the Dataverse code. Otherwise, the best idea might be to either detect this error and delete the file, then upload a new one, or simply not upload a file for which the md5 already matches.
Alternatively, updating git-annex testremote --fast mydv could work, such that it attempts to upload a file with a different hash or at least expects a duplicate upload like this to fail.
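
A rough sketch of detecting the error and treating the transfer as already satisfied (one of the options above; hypothetical, placed next to the existing replace_datafile call):

# Hypothetical fallback: if Dataverse rejects the replacement because
# the content is identical, the remote already holds these exact bytes.
response = self._api.replace_datafile(
    identifier=replace_id,
    filename=local_path,
    json_str=datafile.json(),
    is_filepid=False,
)
if response.status_code == 400 \
        and 'duplicate content' in response.text:
    # nothing to store; the key is effectively present
    return
response.raise_for_status()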

Sidenote

(Unrelated sidenote: I also attempted to merge or rebase #280 onto main but git would refuse to do so due to unrelated histories or various conflicts on the initial commit, but I didn't dig deeper.)


adswa commented Aug 23, 2024

Thanks A LOT for all this digging, @shoeffner!

Do you roughly know since when the test (test_remote.py::test_remote[yes]) has been failing? We might be able to pin it down further by going through the Dataverse code. Otherwise, the best idea might be to either detect this error and delete the file, then upload a new one, or simply not upload a file for which the md5 already matches.

I think I found a mention in the API docs, finally: https://guides.dataverse.org/en/latest/user/dataset-management.html#duplicate-files

Beginning with Dataverse Software 5.0, the way a Dataverse installation handles duplicate files (filename and checksums) is changing to be more flexible. Specifically:

  • Files with the same checksum can be included in a dataset, even if the files are in the same directory.
  • Files with the same filename can be included in a dataset as long as the files are in different directories.
  • If a user uploads a file to a directory where a file already exists with that directory/filename combination, the Dataverse installation will adjust the file path and names by adding “-1” or “-2” as applicable. This change will be visible in the list of files being uploaded.
  • If the directory or name of an existing or newly uploaded file is edited in such a way that would create a directory/filename combination that already exists, the Dataverse installation will display an error.
  • If a user attempts to replace a file with another file that has the same checksum, an error message will be displayed and the file will not be able to be replaced.
  • If a user attempts to replace a file with a file that has the same checksum as a different file in the dataset, a warning will be displayed.

In practice, though, this shouldn't affect real use cases.
upload_file() calls replace_datafile when we have a replace_id:
https://github.com/datalad/datalad-dataverse/blob/bf402b8636f3fb13b0d46107bd7cda489d2f3b46/datalad_dataverse/dataset.py#L199C9-L223
The replace_id gets set in transfer_store or transferexport_store if the remote path already exists on Dataverse. However, git-annex would run a checkpresent operation before, which determines whether the key in question already exists on the remote, and then not call the transfer operations. Unless my thinking is wrong, this means that identical file names with identical content would never get replaced, simply because this is how git-annex works.

Sadly, there is no elegant option to alter the tests - the storeKey when already present test of git-annex's testremote is run even with the smaller test suite of testremote --fast - and as far as I know, there is no way to disable specific tests of the test suite. I'm proposing a rather ugly check via a PR now that verifies whether the right number of tests has failed, but I'd be happy if someone has a better idea.


pdurbin commented Aug 23, 2024

This may be small comfort but @shoeffner @JR-1991 and I talked about this issue on Wednesday. It was first on the agenda if you want to try watching some of the recording. I'm honestly having a little trouble following what changed and why the test stopped working. Please feel free to reach out at https://chat.dataverse.org if you want to talk! Also, gdcc/pyDataverse#203 was merged yesterday. Hope it helps!


shoeffner commented Aug 23, 2024 via email


adswa commented Aug 25, 2024

If I understand it correctly, it appears that since DV 5, it is no longer possible to upload identical file contents for an existing file, but the failing test does just that.

Yes, exactly. The test that fails is git-annex's testremote, which we call in one of our tests; it is git-annex's test suite for special remotes. Four out of 125 tests in this suite try replacing files with identical files and fail.

However, my current thinking is that despite these tests failing, it does not affect real-world use. It just needs a workaround in our tests.


adswa commented Sep 19, 2024

Closing this, thanks for everyone's work on fixing things from various angles! Everything is released and functioning.

adswa closed this as completed Sep 19, 2024

JR-1991 commented Sep 19, 2024

Thanks for making us aware of the problem! :)
