Read version number from the schema #159
Conversation
Looks like #24 was what added …
One issue with the script that generates the …
It looks like the …
8782825 to d73fa35
Parquet writers usually include the name of the implementation in the metadata, so it won't be bitwise equal across implementations:

import pandas as pd
import pyarrow.parquet as pq
df = pd.DataFrame({'a': [1, 2, 3, 4]})
df.to_parquet('tmp.parquet')
pq.read_metadata('tmp.parquet').created_by
# 'parquet-cpp-arrow version 10.0.1'
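Building on that snippet, one way to sidestep the created_by difference could be to compare the parsed "geo" key-value metadata rather than the raw file bytes. This is just an illustrative sketch with hypothetical file names, not code from this PR:

```python
import json

import pyarrow.parquet as pq


def geo_metadata(path):
    """Return the parsed 'geo' key-value metadata from a (Geo)Parquet file."""
    kv = pq.read_metadata(path).metadata  # dict with bytes keys/values
    return json.loads(kv[b"geo"])


# Hypothetical paths: compare freshly generated output against the committed example.
assert geo_metadata("generated.parquet") == geo_metadata("examples/example.parquet")
```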
It's probably not good to rely on this for the example data. For one, we don't guarantee it to be stable, as we might update it with new naturalearth data releases (which is maybe not a problem for our use case), but the dataset also has some political problems (like Crimea belonging to Russia, which is a reason we would prefer to remove it from geopandas in the long term). Maybe we can find some other online data source to use as an example?
I personally prefer conda, and would rather not use poetry. But I also don't think it is worth supporting both, since in the end it is just for running this script, which is not something one has to do often. So I am OK with continuing to use poetry here. But we should then maybe avoid relying on GDAL (fiona) to read in the example data, to keep the installation of the env for the scripts simpler (e.g. if we read from a GeoJSON file, GDAL is not necessarily needed).
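For illustration, reading the example data from GeoJSON without GDAL could look roughly like the following. The file name and CRS are hypothetical, and this assumes geopandas/shapely (but not fiona/pyogrio) are available:

```python
import json

import geopandas as gpd

# Parse the GeoJSON with the standard library instead of fiona (GDAL).
with open("example-source.geojson") as f:
    features = json.load(f)["features"]

# from_features builds the geometries via shapely, so no GDAL bindings are needed here.
gdf = gpd.GeoDataFrame.from_features(features, crs="EPSG:4326")
print(gdf.head())
```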
I agree that we could reduce dependencies and simplify creation of the example file. And I agree that something besides political boundaries would be good. Let's take this up separately from the changes here.
@kylebarron or @jorisvandenbossche - Any more comments on these changes? We definitely have more to discuss, and probably more to do, related to CI, dependencies, example data, etc. But I'm wondering if we can get in the changes that reduce the duplicated version identifiers to simplify the release process (#146). I updated the description of the PR to include more details on the changes and their motivation in case that helps.
Yes, all my suggestions about a different data set / not using GDAL were just ideas that can certainly be left for other issues/PRs
The goal of this branch is to reduce the number of places the version number is repeated. Here is a summary of the changes and motivation for each:
- The `examples/example.py` script currently includes the version number. I changed this to read from `format-specs/schema.json` instead (see the sketch after this list). The `examples` directory has an `environment.yml` file that apparently allows `examples/example.py` to be run, but because there was no readme, I moved the script to the `scripts` directory, where there are instructions on how to install Python dependencies.
- The `tests/test_json_schema.py` script currently includes the version number. As above, I changed this to read from `format-specs/schema.json` instead. The script also has dependencies and no instructions on how to install them, so I moved it to the `scripts` directory, where this is documented, and updated the dependency list there.
- I updated the `.github/workflows/scripts.yml` workflow to run the script that generates the example data before asserting that the example metadata matches expectations. This should catch cases where someone updates the expected metadata but not the example, or vice versa (if someone changes the metadata schema but forgets to change both, ideally the validator job should catch that).
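As a rough sketch of the "read the version from the schema" change described above: the exact JSON path to the version constant and the relative location of the schema file are assumptions here, not necessarily what this PR does.

```python
import json
from pathlib import Path

# Assumed layout: this script lives in scripts/ and the schema in format-specs/.
SCHEMA_PATH = Path(__file__).resolve().parent.parent / "format-specs" / "schema.json"


def read_version() -> str:
    """Return the GeoParquet version declared in the JSON schema."""
    with open(SCHEMA_PATH) as f:
        schema = json.load(f)
    # Assumed structure; adjust to wherever the schema actually pins the version.
    return schema["properties"]["version"]["const"]


if __name__ == "__main__":
    print(read_version())
```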