
pydataverse upgrade broke OnlineDataverseDataset.upload_file() #320

Closed

mih opened this issue Jul 19, 2024 · 20 comments


mih commented Jul 19, 2024

We pass replace_file() the following instructions:

datafile.set({
    # if we do not give `label`, it would use the local filename
    'label': remote_path.name,
    # the model enforces this property, despite `label` being the
    # effective setter, and despite it being ignored and replaced
    # with the local filename
    'filename': remote_path.name,
    'directoryLabel': str(remote_path.parent),
    'pid': self._dsid,
})

It appears that directoryLabel gets ignored with pyDataverse==0.3.3.


mih commented Jul 19, 2024

What might cause the problem here is a "fix" made for gdcc/pyDataverse#171 that came with v0.3.2 in gdcc/pyDataverse#174.

I am not sure what is happening here. I thought the API of replace_file changed, but the documentation of identifier is misleading:

[screenshot: the replace_datafile documentation for the identifier parameter]

It is not the identifier of the dataset, but of the file.
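
So a call matching what the docstring actually means would look something like this (a sketch; the file id and paths are placeholders):

# `identifier` names the file to be replaced, not the dataset
response = api.replace_datafile(
    identifier='2357440',           # placeholder: database id of the existing *file*
    filename='local/path/file.txt',  # placeholder path
    json_str=json_str,
    is_filepid=False,               # True when passing a file PID instead
)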


mih commented Jul 19, 2024

I cannot determine what pid in the metadata record is supposed to be (it is mandatory). I tried switching it to be the file id, but that has no impact.


mih commented Jul 19, 2024

And something is out of sync with the underlying Dataverse API. Here is the validation error I get when intentionally butchering the metadata record:

E           jsonschema.exceptions.ValidationError: 'filename' is a required property
E           
E           Failed validating 'required' in schema:
E               {'$schema': 'http://json-schema.org/draft-07/schema',
E                '$id': 'https://github.com/GDCC/pyDataverse/schemas/json/datafile_upload_schema.json',
E                'type': 'object',
E                'title': 'Datafile JSON upload schema',
E                'description': 'Describes the full Datafile metadata JSON file '
E                               'structure for a Dataverse API upload.',
E                'default': {},
E                'required': ['pid', 'filename'],
E                'additionalProperties': False,
E                'properties': {'description': {'$id': '#/properties/description',
E                                               'type': 'string'},
E                               'categories': {'$id': '#/properties/categories',
E                                              'type': 'array',
E                                              'additionalItems': False,
E                                              'items': {'anyOf': [{'$id': '#/properties/categories/items/anyOf/0',
E                                                                   'type': 'string'}],
E                                                        '$id': '#/properties/categories/items'}},
E                               'restrict': {'$id': '#/properties/restrict',
E                                            'type': 'boolean'},
E                               'pid': {'$id': '#/properties/pid', 'type': 'string'},
E                               'filename': {'$id': '#/properties/filename',
E                                            'type': 'string'},
E                               'label': {'$id': '#/properties/label',
E                                         'type': 'string'},
E                               'directoryLabel': {'$id': '#/properties/directoryLabel',
E                                                  'type': 'string'}}}
E           

However https://guides.dataverse.org/en/latest/api/native-api.html#replacing-files shows that a metadata record like this is possible:

{"description":"My description.","categories":["Data"],"forceReplace":false}

It has neither of the properties that the pyDataverse schema requires, and includes a forceReplace key that the schema does not allow.

The tests added for replace_datafile() in pyDataverse in gdcc/pyDataverse@7242082 bypass this issue by using a plain json.dumps().
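
For illustration, a payload built that way (mirroring the record from the guide) sidesteps the model's schema validation entirely; a minimal sketch:

import json

# A plain dict serialized with json.dumps() is never checked against
# pyDataverse's datafile upload schema, so forceReplace can be
# included and pid/filename omitted, just like in the guide's example.
json_str = json.dumps({
    "description": "My description.",
    "categories": ["Data"],
    "forceReplace": False,
})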


JR-1991 commented Jul 19, 2024

Indeed, the documentation is just confusing, because it says "... of the dataset" when it is actually of the file.

Regarding the directoryLabel issue, I am a bit puzzled, because the body of the request is not touched at any point. Do other properties of the body you send change, e.g. the description?

Trying to reproduce locally now and will get back soon.


JR-1991 commented Jul 19, 2024

I found the bug! Dataverse requires the metadata to be sent as form-data. It seems that using the json argument in HTTPX is not enough. However, when passed to the data parameter, the issue is resolved, and all metadata is updated correctly.
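
In httpx terms, the difference looks roughly like this (a sketch; url, json_str, api_token, and local_path are placeholders, not the actual pyDataverse internals):

import httpx

# Broken: `json=` serializes the metadata as an application/json body,
# which this Dataverse endpoint does not read.
# httpx.post(url, json={"jsonData": json_str}, files=...)

# Working: `data=` puts the metadata into a multipart/form-data field
# next to the file, which is what Dataverse expects.
with open(local_path, "rb") as fh:
    response = httpx.post(
        url,
        data={"jsonData": json_str},
        files={"file": fh},
        headers={"X-Dataverse-key": api_token},
    )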

I will verify that this change does not cause any problems in other areas and will then prepare a pull request to fix this as soon as possible. 😊


JR-1991 commented Jul 19, 2024

I've created a draft pull request. Can you verify that it is working correctly now?

gdcc/pyDataverse#203


mih commented Jul 19, 2024

With the tentative fix in pydataverse we arrive at 5 remaining failures:

FAILED datalad_dataverse/tests/test_add_sibling_dataverse.py::test_asdv_invalid_calls - assert 'doi:no-ffing-datalad-way-this-exists not found' in "CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex initre...
FAILED datalad_dataverse/tests/test_pydataverse.py::test_file_handling - httpx.HTTPStatusError: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/datasets/:persistentId/add?persistentId=doi%3A10...
FAILED datalad_dataverse/tests/test_pydataverse.py::test_file_removal - AssertionError: failed to upload file 400: {'status': 'ERROR', 'message': 'Error in parsing provided json'}
FAILED datalad_dataverse/tests/test_remote.py::test_remote[yes] - datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex testremote --fast mydv -c a...
FAILED datalad_dataverse/tests/test_remote.py::test_remote[no] - datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex testremote --fast mydv -c a...

It looks a bit as if the next blocker may be in the present implementation of file removal. Needs more investigation.


JR-1991 commented Jul 19, 2024

Can you provide details of the failed tests? I will be on vacation from today until next week Monday. Happy to get these fixed asap when back 😊


mih commented Jul 19, 2024

Thank you. I will, once I know and cannot figure it out myself. Happy vacation!


adswa commented Jul 22, 2024

I plan to look at the failing tests this afternoon.


adswa commented Jul 22, 2024

I'll comment here for now, but I'm happy to re-file anything of relevance in more appropriate places.

I found an issue with pydataverse.api.upload_datafile, or rather a mismatch between actual behavior and the behavior expected according to the Add a file to a dataset API. In datalad-dataverse's test_file_removal (and test_file_handling), we call upload_datafile using only the dataset identifier and filename:

response = dataverse_admin_api.upload_datafile(
    identifier=dataverse_dataset,
    filename=fpath,
)

The signature of pydataverse.api.upload_datafile and the API guide - at least to me - read as if this is all that's required. However, I found that pydataverse.api.upload_datafile's json_str cannot be None, otherwise the API returns a 400 status code. Debugging inside of pydataverse.api.upload_datafile: the call fails when json_str is None (its default value), but succeeds when it's the example payload from the Python script in the API guide, or even any arbitrary key-value pair:

(Pdb) p url
'https://demo.dataverse.org/api/v1/datasets/:persistentId/add?persistentId=doi:10.70122/FK2/SG3AT3'
(Pdb) n
> /home/adina/repos/pyDataverse/pyDataverse/api.py(1851)upload_datafile()
-> return self.post_request(
(Pdb) p files
{'file': <_io.BufferedReader name='/tmp/pytest-of-adina/pytest-3/test_file_removal0/dummy.txt'>}
(Pdb) p json_str
None
(Pdb) self.post_request(url, data={"jsonData": json_str}, files=files, auth=True)
<Response [400 Bad Request]>
(Pdb) self.post_request(url, data={"jsonData": '{"description": "Blue skies!", "categories": ["Lily", "Rosemary", "Jack of Hearts"]}'}, files=files, auth=True)
<Response [200 OK]>
(Pdb) self.post_request(url, data={"jsonData": '{"some":"thing"}'}, files=files, auth=True)
<Response [200 OK]>

(I was also a bit confused about the 200 response, as the API Guide says "You should expect a 201 (“CREATED”) response and JSON indicating the database id that has been assigned to your newly uploaded file.", but datalad-dataverse code was already adjusted to this and checks for a 200.)
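
A tolerant status check along those lines would cover both readings (a sketch):

# The guide promises 201 CREATED, but demo.dataverse.org answers 200 OK
# here, so accept both until this is settled upstream.
assert response.status_code in (200, 201), \
    f"failed to upload file {response.status_code}: {response.json()}"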

Simply adding a random JSON payload to the upload_datafile call fixes two of the failing tests (test_file_handling, test_file_removal):

diff --git a/datalad_dataverse/tests/test_pydataverse.py b/datalad_dataverse/tests/test_pydataverse.py
index 93e1719..a6a94fd 100644
--- a/datalad_dataverse/tests/test_pydataverse.py
+++ b/datalad_dataverse/tests/test_pydataverse.py
@@ -83,14 +83,16 @@ def check_duplicate_file_deposition(api, dsid, tmp_path):
 
     response = api.upload_datafile(
         identifier=dsid,
-        filename=tmp_path / 'nonunique1.txt'
+        filename=tmp_path / 'nonunique1.txt',
+        json_str='{"some":"thing"}'
     )
     # we do not expect issues here
     response.raise_for_status()
     # now upload the second file with the same content
     response = api.upload_datafile(
         identifier=dsid,
-        filename=tmp_path / 'nonunique2.txt'
+        filename=tmp_path / 'nonunique2.txt',
+        json_str='{"some":"thing"}'
     )
     response.raise_for_status()
 
@@ -109,6 +111,7 @@ def check_upload(api, dsid, fcontent, fpath, src_md5):
     response = api.upload_datafile(
         identifier=dsid,
         filename=fpath,
+        json_str='{"some":"thing"}'
     )
     # worked
     assert response.status_code == 200
@@ -161,6 +164,7 @@ def test_file_removal(
     response = dataverse_admin_api.upload_datafile(
         identifier=dataverse_dataset,
         filename=fpath,
+        json_str='{"some":"thing"}'
     )
     # worked
     assert response.status_code == 200, \
@@ -184,6 +188,7 @@ def test_file_removal(
     response = dataverse_admin_api.upload_datafile(
         identifier=dataverse_dataset,
         filename=fpath,
+        json_str='{"some":"thing"}'
     )
     assert response.status_code == 200, \
         f"failed to upload file {response.status_code}: {response.json()}"

adswa added a commit that referenced this issue Jul 23, 2024
The backstory is in #320 (comment).
A change outside of datalad-dataverse resulted in a 400 Bad Request whenever
pydataverse's upload_file() function was called without passing anything
as json_str, and its default value None was used.
In the pydataverse tests we perform a simplistic upload, where this parameter
hasn't yet been used. This change adds empty JSON to make the API call succeed.

adswa commented Jul 23, 2024

The failure in test_add_sibling_dataverse.py::test_asdv_invalid_calls is curious (probably because I'm not used to debugging special remotes): The test expects a CommandError when trying to add a non-existent dataset on Dataverse. And it expects the DOI to be reported in the exception (assert 'doi:no-ffing-datalad-way-this-exists not found' in str(ve.value)).

What it gets, currently, is a CommandError from git annex initremote:

<ExceptionInfo CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex initremote dataverse-storage type=exter...dataverse-storage 
failed'] [err: 'git-annex: external special remote error: Cannot find dataset
initremote: 1 failed'] tblen=12>

This in turn comes from

if not dv_ds.status_code < 400:
    raise RuntimeError("Cannot find dataset")

The status code is 404 Not Found, as should be expected.

I'm not sure what changed in between, or where the DOI was previously reported.
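
A minimal sketch of how the special remote could surface it again (assuming a doi variable holding the queried persistent identifier):

if not dv_ds.status_code < 400:
    # `doi` is assumed to hold the queried persistent identifier;
    # including it shows callers which dataset lookup failed
    raise RuntimeError(f"Cannot find dataset {doi}")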

I'm not sure if I'm making it too easy for myself with this patch:

diff --git a/datalad_dataverse/tests/test_add_sibling_dataverse.py b/datalad_dataverse/tests/test_add_sibling_dataverse.py
index d5b4ed3..bdbac15 100644
--- a/datalad_dataverse/tests/test_add_sibling_dataverse.py
+++ b/datalad_dataverse/tests/test_add_sibling_dataverse.py
@@ -25,7 +25,7 @@ def test_asdv_invalid_calls(
             credential="dataverse",
             **ckwa
         )
-    assert 'doi:no-ffing-datalad-way-this-exists not found' in str(ve.value)
+    assert 'Cannot find dataset' in str(ve.value)
 
 
 @pytest.mark.parametrize("mode", ["annex", "filetree"])


adswa commented Jul 23, 2024

The failing tests in git-annex's testremote are:

E               storeKey when already present:                   FAIL
E                 Exception: external special remote error: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/files/2357440/replace?User-Agent=pydataverse&key=<redacted>'

(or the following when I use gdcc/pyDataverse#201 where the authentication is moved to the header:

E               storeKey when already present:                   FAIL
E                 Exception: external special remote error: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/files/2357403/replace'

)
and I've struggled to debug this inside the special remote. It currently looks like the internal call to PyDataverse's replace_datafile() yields a bad request.

if replace_id is not None:
    response = self._api.replace_datafile(
        identifier=replace_id,
        filename=local_path,
        json_str=datafile.json(),
        # we are shipping the database fileid (int)
        is_filepid=False,
    )

The API Guide's curl call with an ID succeeds, so maybe the way that pydataverse calls replace_datafile is the issue, but I haven't yet figured out how.

shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 24, 2024
shoeffner added a commit to shoeffner/datalad-dataverse that referenced this issue Jul 24, 2024
The docker compose setup as used in pyDataverse uses a different storage
backend than demo.dataverse.org, so the storageIdentifier reports
a non-s3:// address.

This commit detects the two most common ways of specifying localhost
(leaving out 127.1 and the like) to allow for `local://` storageIdentifiers.

Relates to datalad#320.

shoeffner commented Jul 24, 2024

upload fix

I used the main branch (0621e2d) and installed pyDataverse from @JR-1991's branch (gdcc/pyDataverse#203).
I was then able to fix this issue on @JR-1991's branch on pyDataverse's side, which I think is somewhat nicer for end users than having to work around it as @adswa shows in #322, but it essentially comes down to the same issue: passing "no metadata" (None) causes a Bad Request.
You can find my updates to @JR-1991's branch in gdcc/pyDataverse#207 as a merge request to gdcc/pyDataverse#203.

local docker compose setup fix

(see #323)
Further, the test_pydataverse.py::test_file_handling would still fail for me because I was not testing with the demo.dataverse.org server, but a local server instead – I changed the check for the storageIdentifier to also allow local:// instead of s3://demo-dataverse. One could even argue that this assert is not that important, but I didn't want to decide on that.
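
The relaxed assertion amounts to something like this (a sketch; storage_id is assumed to hold the storageIdentifier read from the file metadata):

# demo.dataverse.org uses an s3 backend, while the docker compose setup
# stores locally, so both prefixes have to pass.
assert storage_id.startswith(('s3://demo-dataverse', 'local://'))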

@adswa's observed 400 Bad Request

I checked the error @adswa is facing now:

E               storeKey when already present:                   FAIL
E                 Exception: external special remote error: Client error '400 Bad Request' for url 'https://demo.dataverse.org/api/v1/files/342/replace'

by printing the response text if the error is a 400:

{"status":"ERROR","message":"Error! You may not replace a file with a file that has duplicate content."}

So it would be better to somehow get this error message into git-annex to debug such issues.
Anyway, the issue seems to be that one can (no longer?) replace a file with identical content.
The docs don't seem to mention this: https://guides.dataverse.org/en/6.3/developers/s3-direct-upload-api.html, but it seems they have "forceReplace":"true" in their example, which the DVObject does not provide in its .json() method.
Even changing json_str=datafile.json(), to this beautiful expression (which makes me wonder if a json string is the best way to pass this information):

json_str=json.dumps(json.loads(datafile.json()) | {"forceReplace": "true"}),  # yes, the API docs have "true" string, not a bool

does not seem to solve it and still causes the duplicate content error. I think this is something worth mentioning upstream at Dataverse – I think a 409 Conflict might be a better suited response code here. Plus, the docs should mention this.
Alas, I checked the dataverse code and the error message is stored in the string bundle as new_file_same_as_replacement and emitted here:

https://github.com/IQSS/dataverse/blob/f7cd9f5f80ffcd6d3dbfbe75b8a83131ee1f55b8/src/main/java/edu/harvard/iq/dataverse/datasetutility/AddReplaceFileHelper.java#L1327-L1335

This piece of code has not changed in the last four years, so I guess there is something else which now causes this to be treated as an error – in fact, the comment above it even says that this would be a warning, not an error.

Do you roughly know since when the test (test_remote.py::test_remote[yes]) has been failing? We might be able to pin it down further by going through the Dataverse code. Otherwise, the best idea might be to either detect this error and delete the file, then upload a new one, or simply not upload a file for which the md5 already matches.
Alternatively, updating git-annex testremote --fast mydv could work, such that it attempts to upload a file with a different hash or at least expects a duplicate upload like this to fail.
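
A rough sketch of detecting the error and treating the transfer as already satisfied (one of the options above; hypothetical, placed next to the existing replace_datafile call):

# Hypothetical fallback: if Dataverse rejects the replacement because
# the content is identical, the remote already holds these exact bytes.
response = self._api.replace_datafile(
    identifier=replace_id,
    filename=local_path,
    json_str=datafile.json(),
    is_filepid=False,
)
if response.status_code == 400 \
        and 'duplicate content' in response.text:
    # nothing to store; the key is effectively present
    return
response.raise_for_status()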

Sidenote

(Unrelated sidenote: I also attempted to merge or rebase #280 onto main but git would refuse to do so due to unrelated histories or various conflicts on the initial commit, but I didn't dig deeper.)


adswa commented Aug 23, 2024

Thanks A LOT for all this digging, @shoeffner!

Do you roughly know since when the test (test_remote.py::test_remote[yes]) has been failing? We might be able to pin it down further by going through the Dataverse code. Otherwise, the best idea might be to either detect this error and delete the file, then upload a new one, or simply not upload a file for which the md5 already matches.

I think I found a mention in the API docs, finally: https://guides.dataverse.org/en/latest/user/dataset-management.html#duplicate-files

Beginning with Dataverse Software 5.0, the way a Dataverse installation handles duplicate files (filename and checksums) is changing to be more flexible. Specifically:

  • Files with the same checksum can be included in a dataset, even if the files are in the same directory.
  • Files with the same filename can be included in a dataset as long as the files are in different directories.
  • If a user uploads a file to a directory where a file already exists with that directory/filename combination, the Dataverse installation will adjust the file path and names by adding “-1” or “-2” as applicable. This change will be visible in the list of files being uploaded.
  • If the directory or name of an existing or newly uploaded file is edited in such a way that would create a directory/filename combination that already exists, the Dataverse installation will display an error.
  • If a user attempts to replace a file with another file that has the same checksum, an error message will be displayed and the file will not be able to be replaced.
  • If a user attempts to replace a file with a file that has the same checksum as a different file in the dataset, a warning will be displayed.

In practice, though, this shouldn't affect real use cases.
upload_file() calls replace_datafile when we have a replace_id:
https://github.com/datalad/datalad-dataverse/blob/bf402b8636f3fb13b0d46107bd7cda489d2f3b46/datalad_dataverse/dataset.py#L199C9-L223
The replace_id gets set in transfer_store or transferexport_store if the remote path already exists on Dataverse. However, git-annex would run a checkpresent operation before, which determines whether the key in question already exists on the remote, and then not call the transfer operations. Unless my thinking is wrong, this means that identical file names with identical content would never get replaced, simply because this is how git-annex works.

Sadly, there is no elegant option to alter the tests - the storeKey when already present test of git-annex's testremote is run even with the smaller test suite of testremote --fast - and as far as I know, there is no way to disable specific tests of the test suite. I'm proposing a rather ugly check via a PR now that verifies whether the right number of tests has failed, but I'd be happy if someone has a better idea.


pdurbin commented Aug 23, 2024

This may be small comfort but @shoeffner @JR-1991 and I talked about this issue on Wednesday. It was first on the agenda if you want to try watching some of the recording. I'm honestly having a little trouble following what changed and why the test stopped working. Please feel free to reach out at https://chat.dataverse.org if you want to talk! Also, gdcc/pyDataverse#203 was merged yesterday. Hope it helps!


shoeffner commented Aug 23, 2024 via email


adswa commented Aug 25, 2024

If I understand it correctly, it appears that since DV 5, it is no longer possible to upload identical file contents for an existing file, but the failing test does just that.

Yes, exactly. The test that fails is git-annex's testremote, which we call in one of our tests; it is git-annex's test suite for special remotes. Four out of 125 tests in this suite try replacing files with identical files and fail.

However, my current thinking is that despite these tests failing, it does not affect real-world use. It just needs a workaround in our tests.


adswa commented Sep 19, 2024

Closing this, thanks for everyone's work on fixing things from various angles! Everything is released and functioning.

adswa closed this as completed Sep 19, 2024

JR-1991 commented Sep 19, 2024

Thanks for making us aware of the problem! :)
