Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get v1 data, partial implementation #18

Merged
merged 26 commits into from
Aug 9, 2024
Merged

Get v1 data, partial implementation #18

merged 26 commits into from
Aug 9, 2024

Conversation

e-kotov
Copy link
Member

@e-kotov e-kotov commented Aug 7, 2024

Hi @Robinlovelace , I have finished some of the work on v1 data and tried to envision how it would be united with current code for v2, as well as how I would suggest the v2 code may be changed and improved upon. Could you please review what I have for v1 (and minor changes to v2) below?

Setup:

remotes::install_github("Robinlovelace/spanishoddata@get-v1-data")
Sys.setenv(SPANISH_OD_DATA_DIR = "/your/cache/path")

Available data / metadata

V1 Data

Fetch available data for version 1 with an optional check for local files:

v1 <- spod_available_data_v1(check_local_files = TRUE)

This function changes the default paths of the MITMA data locally to arrange it in a hive style for faster filtering and/or import with duckdb/arrow.

Function Wrapping Suggestion

Consider wrapping both functions below into a single function with a version argument, for example, spod_available_data(ver = 1). Alternatively, you could name them spod_available_data_v1() and spod_available_data_v2(), or spod_get_metadata_v1() and spod_get_metadata_v2(). Another option could be spod_metadata_v1() and spod_metadata_v2(). The current function handles the XML file cache by checking when it was last downloaded and fetches it automatically if it is older than one day. It might make sense to add an argument to load the latest cached XML for offline use.

V2 Data

Fetch metadata for version 2:

v2 <- spod_get_metadata()

I have not modified this function much, but I plan to add the path-changing feature for v2 data as well if it is appreciated in v1. The few changes I have made include updating the XML file name so it differs from v1, renaming the file mask to data_links, and removing the current_timestamp argument from the XML fetching function, as the current timestamp should be used automatically.

The naming of both functions is still up for discussion, and I am open to suggestions.

Implemented Functions for V1 Data

Get Latest V1 File List

v1_file <- spod_get_latest_v1_file_list()

This is only really for internal use or for advanced users, so I would have this explicit long name and change to similar for v2, or even combine the functions into one.

Available Data for V1

v1 <- spod_available_data_v1() # with optional file check for local files as noted in the beginning of the message

Zones Loader

The zones loader with English parameter names and their variations that are converted to Spanish using internal helper functions. These functions load the shapefiles, check the geometry validity, and replace the uppercase ID with lowercase 'id'. In the future, I plan to attach metadata such as municipality names.

zones_municip <- spod_get_zones_v1("muni")
zones_municip <- spod_get_zones_v1("municip")
zones_municip <- spod_get_zones_v1("municipalities")
zones_districts <- spod_get_zones_v1("dist")
zones_districts <- spod_get_zones_v1("distr")
zones_districts <- spod_get_zones_v1("districts")

V2 Zones

For v2 zones, more adjustments are needed, such as attaching accompanying names and population numbers metadata that are available alongside the files. This is a TODO.

Download Function

The new download function handles most non-spatial data, automatically checks for the required data version, accepts multiple types of zoning, and processes date arguments automatically.

spod_download_data(type = "od", zones = "districts", dates = "2020032[0-5]")
spod_download_data(type = "od", zones = "municip", dates = "20200302_20200305")
spod_download_data(type = "od", zones = "municip", dates = "2020-03-02_2020-03-05")
spod_download_data(type = "od", zones = "municip", dates = c(start = "2020-03-02", end = "2020-03-05"))

To download any two dates without interpreting them as a range, use the following. More than two dates are always interpreted as an arbitrary sequence and not converted to a range.

spod_download_data(type = "od", zones = "municip", dates = c("2020-03-02", "2020-03-05"))

I think this advanced handling of dates would make the package more accessible for beginners, as I do not expect everyone to easily grasp what a regex is and they may be more comfortable using simpler dates argument.

There are also "testthat" tests for the underlying function that handles the magic behind the dates argument. That includes tests for data incompatibility (i.e. the helper functions check if the requested date range is spanning across multiple data versions and stops with a warning).

Next Steps

The next step is to implement an equivalent of spod_get_zones() and similar functions for v1 data that would rely on the already downloaded data or download it automatically if it is missing, this will use the same mechanics that I implemented for the download function as they would simply wrap it. I would appreciate your feedback on the current implementation and naming of the functions, as well as my suggestions on the overall structure of the package.

We can discuss during our next scheduled meeting.

e-kotov added 24 commits July 31, 2024 14:11
@Robinlovelace
Copy link
Collaborator

Good stuff @e-kotov, thanks for the clear statement of changes and options, will test in due course, likely on Friday.

@Robinlovelace Robinlovelace self-assigned this Aug 7, 2024
@e-kotov
Copy link
Member Author

e-kotov commented Aug 7, 2024

@Robinlovelace great! Take your time. Meanwhile, I added some last minute additions to finally pass the R CMD check. We now have a snapshot of metadata bundled with the package (compressed xmls are tiny and xml2::read_xml() reads them directly from gz just fine). These can be used for future internal tests of some other functions too.

@Robinlovelace
Copy link
Collaborator

Sounds good, more soon!

@e-kotov e-kotov added this to the v1 data (2020-2021) support milestone Aug 7, 2024
@Robinlovelace
Copy link
Collaborator

OK, going to give this a quick test and hopefully review now. Have quite limited time, ~15 minutes, due to family commitments later in the day.

@Robinlovelace
Copy link
Collaborator

Update here: looking great!

Copy link
Collaborator

@Robinlovelace Robinlovelace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an epic set of changes Egor, many thanks! Big +1, all tested on README and no further comment in limited time I have today. Let's get it merged to keep ball rolling but will leave that to you.

DESCRIPTION Show resolved Hide resolved
NAMESPACE Show resolved Hide resolved
R/download_data.R Show resolved Hide resolved
R/download_data.R Show resolved Hide resolved
R/folders.R Show resolved Hide resolved
R/get.R Show resolved Hide resolved
R/get.R Show resolved Hide resolved
R/get.R Show resolved Hide resolved
R/get_v1_data.R Show resolved Hide resolved
R/get_v1_data.R Show resolved Hide resolved
@e-kotov
Copy link
Member Author

e-kotov commented Aug 9, 2024

Ok, merging, but new big updates are coming up)

@e-kotov e-kotov merged commit 2ce6f68 into main Aug 9, 2024
2 checks passed
@e-kotov
Copy link
Member Author

e-kotov commented Aug 12, 2024

Partially fulfils #5

@Robinlovelace
Copy link
Collaborator

1 version down, 1 to go!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants