fix: fix logic to check for nested GTFS files in ZIP #1972

Merged

sylvansson
Contributor

Summary:

This PR fixes a bug with our logic to check whether a ZIP file we're loading has GTFS files in a subfolder. It looks like ZipInputStream.getNextEntry doesn't always return subfolders, depending on how the ZIP file was created. The subfolder and ZIP file having the same name in #1912 was a red herring.

```
$ unzip -l piercetransit-wa-us--flex-v2.zip
Archive:  piercetransit-wa-us--flex-v2.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      170  11-28-2023 15:22   piercetransit-wa-us--flex-v2/timetables.txt
       81  11-28-2023 15:22   piercetransit-wa-us--flex-v2/fare_attributes.txt
       18  11-28-2023 15:22   piercetransit-wa-us--flex-v2/stop_attributes.txt
       56  11-28-2023 15:22   piercetransit-wa-us--flex-v2/transfers.txt
      183  11-28-2023 15:22   piercetransit-wa-us--flex-v2/agency.txt
       12  11-28-2023 15:22   piercetransit-wa-us--flex-v2/areas.txt
       54  11-28-2023 15:22   piercetransit-wa-us--flex-v2/fare_rules.txt
      437  11-28-2023 15:22   piercetransit-wa-us--flex-v2/calendar_dates.txt
     4367  11-28-2023 15:22   piercetransit-wa-us--flex-v2/stop_times.txt
      374  11-28-2023 15:22   piercetransit-wa-us--flex-v2/location_groups.txt
      137  11-28-2023 15:22   piercetransit-wa-us--flex-v2/directions.txt
       53  11-28-2023 15:22   piercetransit-wa-us--flex-v2/frequencies.txt
       18  11-28-2023 15:22   piercetransit-wa-us--flex-v2/farezone_attributes.txt
      895  11-28-2023 15:22   piercetransit-wa-us--flex-v2/shapes.txt
      983  11-28-2023 15:22   piercetransit-wa-us--flex-v2/trips.txt
      355  11-28-2023 15:22   piercetransit-wa-us--flex-v2/feed_info.txt
     2051  11-28-2023 15:22   piercetransit-wa-us--flex-v2/locations.geojson
      104  11-28-2023 15:22   piercetransit-wa-us--flex-v2/runcut.txt
     2170  11-28-2023 15:22   piercetransit-wa-us--flex-v2/stops.txt
      117  11-28-2023 15:22   piercetransit-wa-us--flex-v2/linked_datasets.txt
      131  11-28-2023 15:22   piercetransit-wa-us--flex-v2/calendar_attributes.txt
       62  11-28-2023 15:22   piercetransit-wa-us--flex-v2/timetable_stop_order.txt
     1745  11-28-2023 15:22   piercetransit-wa-us--flex-v2/booking_rules.txt
      265  11-28-2023 15:22   piercetransit-wa-us--flex-v2/calendar.txt
      520  11-28-2023 15:22   piercetransit-wa-us--flex-v2/routes.txt
---------                     -------
    15358                     25 files

$ mv piercetransit-wa-us--flex-v2.zip foobar.zip
$ unzip -l foobar.zip
Archive:  foobar.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      170  11-28-2023 15:22   piercetransit-wa-us--flex-v2/timetables.txt
       81  11-28-2023 15:22   piercetransit-wa-us--flex-v2/fare_attributes.txt
       18  11-28-2023 15:22   piercetransit-wa-us--flex-v2/stop_attributes.txt
       56  11-28-2023 15:22   piercetransit-wa-us--flex-v2/transfers.txt
      183  11-28-2023 15:22   piercetransit-wa-us--flex-v2/agency.txt
       12  11-28-2023 15:22   piercetransit-wa-us--flex-v2/areas.txt
       54  11-28-2023 15:22   piercetransit-wa-us--flex-v2/fare_rules.txt
      437  11-28-2023 15:22   piercetransit-wa-us--flex-v2/calendar_dates.txt
     4367  11-28-2023 15:22   piercetransit-wa-us--flex-v2/stop_times.txt
      374  11-28-2023 15:22   piercetransit-wa-us--flex-v2/location_groups.txt
      137  11-28-2023 15:22   piercetransit-wa-us--flex-v2/directions.txt
       53  11-28-2023 15:22   piercetransit-wa-us--flex-v2/frequencies.txt
       18  11-28-2023 15:22   piercetransit-wa-us--flex-v2/farezone_attributes.txt
      895  11-28-2023 15:22   piercetransit-wa-us--flex-v2/shapes.txt
      983  11-28-2023 15:22   piercetransit-wa-us--flex-v2/trips.txt
      355  11-28-2023 15:22   piercetransit-wa-us--flex-v2/feed_info.txt
     2051  11-28-2023 15:22   piercetransit-wa-us--flex-v2/locations.geojson
      104  11-28-2023 15:22   piercetransit-wa-us--flex-v2/runcut.txt
     2170  11-28-2023 15:22   piercetransit-wa-us--flex-v2/stops.txt
      117  11-28-2023 15:22   piercetransit-wa-us--flex-v2/linked_datasets.txt
      131  11-28-2023 15:22   piercetransit-wa-us--flex-v2/calendar_attributes.txt
       62  11-28-2023 15:22   piercetransit-wa-us--flex-v2/timetable_stop_order.txt
     1745  11-28-2023 15:22   piercetransit-wa-us--flex-v2/booking_rules.txt
      265  11-28-2023 15:22   piercetransit-wa-us--flex-v2/calendar.txt
      520  11-28-2023 15:22   piercetransit-wa-us--flex-v2/routes.txt
---------                     -------
    15358                     25 files
```

Closes #1912

Expected behavior:

We get an invalid_input_files_in_subfolder notice even if the ZIP file does not contain a standalone entry for the subfolder.
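
For illustration only, here is a minimal Java sketch of one way to get the expected behavior above. It is not the code merged in this PR. It assumes that a file counts as nested whenever its entry name contains a `/`, so the check never depends on a separate directory entry being present. The class and method names (`NestedGtfsCheck`, `hasFilesInSubfolder`, `buildZip`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class NestedGtfsCheck {

  /** Returns true if any file entry sits under a folder, judged by its path alone. */
  static boolean hasFilesInSubfolder(InputStream in) throws IOException {
    try (ZipInputStream zip = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        // Explicit directory entries may or may not exist, so they are ignored.
        if (entry.isDirectory()) {
          continue;
        }
        // Assumption: ZIP entry names use '/' as the path separator.
        if (entry.getName().contains("/")) {
          return true;
        }
      }
    }
    return false;
  }

  /** Builds an in-memory ZIP that holds file entries only (no directory entries). */
  static byte[] buildZip(String... names) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      for (String name : names) {
        zip.putNextEntry(new ZipEntry(name));
        zip.write("x".getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
      }
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] nested = buildZip("feed/agency.txt", "feed/stops.txt");
    byte[] flat = buildZip("agency.txt", "stops.txt");
    System.out.println(hasFilesInSubfolder(new ByteArrayInputStream(nested)));
    System.out.println(hasFilesInSubfolder(new ByteArrayInputStream(flat)));
  }
}
```

The `nested` archive mirrors the Pierce Transit listing: every file carries a folder prefix, yet the archive has no entry for the folder itself. A real check would probably also need to decide how to treat archives that mix root-level and nested files; this sketch does not address that.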

Testing:

Before:
image

After:
image

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification(https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@qcdyx
Contributor

qcdyx commented Feb 11, 2025

Hey @skalexch, could you take a look at the 14 datasets that contain new errors? (You can see all of them by clicking on the arrow.)

New Errors (14 out of 1808 datasets, ~1%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

| Dataset | Notice Code |
| --- | --- |
| gh-ashanti-ingerop-gtfs-1814 | invalid_input_files_in_subfolder |
| jp-hokkaido-donan-bus-gtfs-1019 | invalid_input_files_in_subfolder |
| pt-porto-metro-do-porto-gtfs-2147 | invalid_input_files_in_subfolder |
| us-california-city-of-wasco-gtfs-1788 | invalid_input_files_in_subfolder |
| us-california-flex-v2-developer-test-feed-1-gtfs-1817 | invalid_input_files_in_subfolder |
| us-california-flex-v2-developer-test-feed-2-gtfs-1818 | invalid_input_files_in_subfolder |
| us-california-flex-v2-developer-test-feed-3-gtfs-1819 | invalid_input_files_in_subfolder |
| us-colorado-greeley-evans-transit-get-gtfs-612 | invalid_input_files_in_subfolder |
| us-florida-citrus-county-transit-gtfs-630 | invalid_input_files_in_subfolder |
| us-florida-lakexpress-gtfs-342 | invalid_input_files_in_subfolder |
| us-georgia-cobb-community-transit-cct-gtfs-354 | invalid_input_files_in_subfolder |
| us-georgia-xpress-gtfs-2355 | invalid_input_files_in_subfolder |
| us-michigan-detroit-people-mover-gtfs-417 | invalid_input_files_in_subfolder |
| us-virginia-jaunt-inc-gtfs-1324 | invalid_input_files_in_subfolder |

@skalexch

@qcdyx The screenshot below shows the affected datasets and, above them, the folders that I extracted from them. I also included mdb-2854 as a control. For all of the affected datasets, the GTFS files do appear to sit inside a subfolder. For the control dataset, the extracted folder has the same name as the ZIP file, which means that the files reside in the root directory.
Screenshot 2025-02-11 at 11 50 18 AM

Please note that I could not download mdb-612 and mdb-1324.

@sylvansson sylvansson force-pushed the 1912-fix-nested-gtfs-file-detection branch from c1342e4 to 13bf3f5 Compare February 11, 2025 22:33
Contributor

@qcdyx qcdyx left a comment

LGTM. Thanks for your contribution!

@qcdyx qcdyx merged commit 22ee726 into MobilityData:master Feb 12, 2025
134 checks passed
Successfully merging this pull request may close these issues.

Notice invalid_input_files_in_subfolder not triggered if zip file and subfolder have the same name
3 participants