Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the save() method to set relative paths for resources. Fixes frictionlessdata#41. #51

Merged

Conversation

jonrmitchell
Copy link
Contributor

Overview

Update the save() method to set relative paths for resources. Clones the current package so the path changes only apply to the outputted package, which is what they do in the datapackage-js library (https://github.com/frictionlessdata/datapackage-js/blob/master/src/package.js#L266) Fixes #41.

This is only the second time I've ever submitted a pull request so if there's anything I'm doing wrong please let me know.


Please preserve this line to notify @DiegoPino (lead of this repository)

@DiegoPino
Copy link
Collaborator

Hi @jonrmitchell, thanks for your contribution. Use case makes sense and quick first look at the code looks good. I have some time today in my afternoon to do a proper code review and human test. Stay tuned and thx again!

@DiegoPino DiegoPino added the bug label Apr 23, 2021
@DiegoPino DiegoPino self-assigned this Apr 23, 2021
@DiegoPino DiegoPino added this to the v0.2.0 milestone Apr 23, 2021
Copy link
Collaborator

@DiegoPino DiegoPino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jonrmitchell I added a few comments but mostly question. Happy to discuss them (no rush). I have not yet tested the code but probably what requires my better understanding of your code/logic is the path assignment question, in the case where you have files and URIs and the copied Datapackage may get path overwritten and loose the URIs. I may be missing something. Good work here! We will get this done =)

$base = tempnam(sys_get_temp_dir(), 'datapackage-zip-');
$files = [
'datapackage.json' => $base.'datapackage.json',
];
$ri = 0;
foreach ($this as $resource) {
foreach ($packageCopy->resources() as $resource) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonrmitchell hi. Since this class extends /Iterator I think you can iterate directly over $packageCopy here without having to request all resources. The cursor will start at 0 and may be more optimal for large Packages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that makes sense and I'd be fine with that.

@@ -210,19 +217,24 @@ public function save($zip_filename)
}
}

protected function copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are here, this could need a PHP DOC saying what it does. I know it is obvious but makes future contributor's work easier. Like the fact that it creates a new Object but skips validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, docblock, of course. I can add that. I can fix that in a new commit and put it in this same pull request, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Once the pull is open you can keep pushing commits into your local/forked same branch.

}
$resource->descriptor()->path = count($resourceFiles) == 1 ? $resourceFiles[0] : $resourceFiles;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may need to test this further but let me see if I understand this: path property according to the schema can either be a string or an array of strings.

"path": {
"propertyOrder": 30,
"title": "Path",
"description": "A reference to the data for this resource, as either a path as a string, or an array of paths as strings. of valid URIs.",
"oneOf": [
{
"title": "Path",
"description": "A fully qualified URL, or a POSIX file path..",
"type": "string",
"examples": [
"{\n \"path\": \"file.csv\"\n}\n",
"{\n \"path\": \"http://example.com/file.csv\"\n}\n"
],
"context": "Implementations need to negotiate the type of path provided, and dereference the data accordingly."
},

If you have only one resource file for a resource you use the the file path/string, if not you use the array, but if fileNames (save) is empty (should not happen really) but What happens if the resource has also e.g URIs for the same resource?
let's say mytable.csv and https://myrepo.org/data/mytable.csv?

Maybe you should take the original paths from the descriptor first and only replace the original absolute paths that match the files you are iterating over but keep the URIs if any?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, I hadn't considered the case when path is a URI. I'll make sure that's covered in the tests and then work to fix the implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I played around with remote URLs some and it turns out the BaseDatapackage->save() method currently writes remote resources to disk as part of saving (even before my pull request changes). As I understand it, that's not how it's supposed to work, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. You are right, but i would say (personal opinion) this may be seen as a not so good documented expected behavior/facility that can save some steps for people building Packages. Since fopen in PHP can deal with URLs same way as with a local file you get this for free. A datastream level on ::save you already saw this

public function save($filename)
{
$target = fopen($filename, 'w');
stream_copy_to_stream($this->fopenResource, $target);
fclose($target);
}

Of course it would be nice if one could flag if the local path is going to be the only one to be kept or also the original remote one,
but does not mean (I guess) that you can not still for a local resource/file add an additional ->path manually to mimic that behavior.

Not sure how the Python and JS versions deal with this. I may want to check with their interfaces to ensure we do this the same way. Probably a different ISSUE/Pull.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be useful to be able to download remote files and save them locally into the datapackage. Perhaps that'd be a good parameter/option for the Datapackage->save method? Or on the resource object with maybe a makeLocal() call or similar. Either way I don't think it should be the default.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. probably at the Datastream level we should have flag. Will dig into the other language implementations and see how they do this. Will let you know how this goes. thx

],
(object) [
'name' => 'my-renamed-tabular-resource',
'path' => "resource-1.csv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where were should test for paths that are also file name and URI maybe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Thanks for taking the time to review this!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to you! This is all great and the needed changes are not many. Good work

…ackage->save to not attempt to save remote resources;

Added a test to verify that http resources work as expected.
Made DatapackageTest to use php-generated temporary files rather than hard-coded names
Added a few docblocks
@DiegoPino
Copy link
Collaborator

@jonrmitchell i have tomorrow a bit of of a busy day but can give your updates a spin on Tuesday morning (EST time) and come back. All looking good so far!

@jonrmitchell
Copy link
Contributor Author

@DiegoPino thanks! I'm not in a big rush really, this is just a package I'd like to use on a side project (if we can get to where it meets my use cases). If it takes a little longer to get reviewed, I don't mind.

@DiegoPino
Copy link
Collaborator

@jonrmitchell so sorry for the delay. It has been a crazy workweek (and still not finished). The code looks good and all works. I need to refresh my memory so will give the code a second read tomorrow and will then approve and merge. If you want to talk about the optional argument for remote resources to be downloaded/not downloaded, happy to give that a look. Still, to preserve current (expected functionality) we probably will want to make the default to download, but happy to be spoken out of that. I need to check the other packages and see their interfaces, its easier to make a point if there is consistency between language. Again sorry, this will be merged by tomorrow. Thanks!

@jonrmitchell
Copy link
Contributor Author

jonrmitchell commented May 6, 2021

No worries, thanks for taking a look.

From looking at the python library, to_zip() does not download remote files:
https://github.com/frictionlessdata/frictionless-py/blob/main/frictionless/package.py#L587-L589

Also, from what I can tell, the python library won't move files into a temporary location for zipping either (unless explicitly listed as in-memory resources). I think it expects you to already have the data files with a path relative to the datapackage.json file
https://github.com/frictionlessdata/frictionless-py/blob/main/frictionless/package.py#L617-L618
and then zipping is a simple matter of directly zipping what is on-disk.

And if this is the intent... then perhaps this entire pull request is misguided, and adding a resource with an absolute path and expecting it to "fix" the path to be relative should not work, but paths for resources need to be expected to be used and exported exactly as provided. That would explain why this functionality that I thought was so basic could be missing.

Another thing I recognized from reviewing the python library is that save() is too simplistic and missing nuance (does it save over what's there? does it make a copy? etc) and instead having a toZip() method would more closely explain what is going to happen when called.

In any case don't feel like you need to merge this tomorrow, let's talk this through.

@jonrmitchell
Copy link
Contributor Author

But then again, if resources weren't meant to be downloaded when zipped, well that behavior wasn't working before I came along either... so I don't know.

@DiegoPino
Copy link
Collaborator

@jonrmitchell it took me (again) some time to think about your last post and here is my take:

  • Not every language specific implementation is going to be the same and the Python one for sure is meant to be more "manual" than this one. I would say the JS one may be closer to PHP in terms of expectations.
  • fopen() in php deals with remote sources as with files. PHP devs know that. Probably the original coder had/not had that in mind but I do not see this as an issue that should be solved by this Pull.
  • Your pull solves the relative path problem which is in touch (correctly) with the specs and it does it well
  • Future work can be done to allow remote Only resources (special class? or even better at Datastream level...) instead of trying to build the logic into the current implementation. For PHP developers having an automatic download when adding a URL given that the Resource Itself is not marked (or markable) as remote only is Ok for me for now and please feel free to open an ISSUE if that is a concern.

Time to merge.This is all good work and very thankful for your patience. Let's keep in touch to make this package better =)

@DiegoPino DiegoPino merged commit b60bd4e into frictionlessdata:master May 18, 2021
@jonrmitchell
Copy link
Contributor Author

Sounds good. Thanks for taking the time to review it.

@jonrmitchell jonrmitchell deleted the use-relative-paths-on-save branch May 18, 2021 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Saving a datapackage does not point "path" to the new local resource
2 participants