spatial join in neighbourhood analysis code can retain points with no grid ID, causing type issues #214

Closed
carlhiggs opened this issue Mar 13, 2023 · 0 comments

Comments

@carlhiggs
Collaborator

While experimenting with running Melbourne using an official Australian 1 km population grid instead of the GHSL grid (this can be done; I've now verified it works, see #213), I observed a bug in the neighbourhood analysis subprocess: points that weren't matched by the spatial join were retained with NA values as their grid ID. As a result, the grid ID column is treated as float64, and in turn can't be cast to int (or int64) at a later processing stage.
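The dtype problem can be reproduced without any spatial data. The following is a minimal sketch using a plain pandas left merge as a stand-in for the spatial join; the `points` and `matches` frames and their values are made up for illustration and are not taken from the project.

```python
import pandas as pd

# Hypothetical sample points; point 3 has no matching grid cell.
points = pd.DataFrame({'point_id': [1, 2, 3]})
matches = pd.DataFrame({'point_id': [1, 2], 'grid_id': [10, 11]})

# A left join keeps the unmatched point and fills its grid_id with NaN,
# which forces the integer column to float64.
joined = points.merge(matches, on='point_id', how='left')
joined_dtype = str(joined['grid_id'].dtype)

# Casting a column that contains NaN to int64 raises a ValueError.
try:
    joined['grid_id'].astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# Dropping the unmatched rows first makes the cast possible again.
cleaned = joined.dropna(subset=['grid_id'])
cleaned_ids = cleaned['grid_id'].astype('int64')
```

Dropping the NA rows before casting is the same order of operations that the updated function uses.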

Here is the function as currently implemented:

https://github.com/global-healthy-liveable-cities/global-indicators/blob/7c7b974b2c0b47dba20646c3d9c0e77a5b0b0b93/process/subprocesses/setup_sp.py#L19-L40

Here is an updated version, configured to drop NA values (non-matches) by default:

import geopandas as gpd


def spatial_join_index_to_gdf(
    gdf, join_gdf, join_type='within', dropna=True
):
    """Append to a geodataframe the named index of another using spatial join.

    Parameters
    ----------
    gdf: GeoDataFrame
    join_gdf: GeoDataFrame
    join_type: str (default 'within')
    dropna: bool (default True)
        If True, drop records of gdf that matched no feature in join_gdf.

    Returns
    -------
    GeoDataFrame
    """
    gdf_columns = list(gdf.columns)
    index_name = join_gdf.index.name
    # A left join keeps unmatched records, with NA as their joined index value
    gdf = gpd.sjoin(gdf, join_gdf, how='left', predicate=join_type)
    gdf = gdf[gdf_columns + ['index_right']].copy()
    gdf.columns = gdf_columns + [index_name]
    if dropna:
        gdf = gdf[~gdf[index_name].isna()].copy()
    # Restore the original index dtype; only possible once no NA values remain
    if not gdf[index_name].isna().any():
        gdf[index_name] = gdf[index_name].astype(join_gdf.index.dtype)
    return gdf

This also removes the explicit 'right_index_name' argument, which doesn't need to be provided because the index name can be read from join_gdf itself.

While writing this, it occurred to me that an inner join might have done the same thing. In any case, with the above change the code ran successfully for the Melbourne test case, and I also confirmed that re-running a new analysis for Las Palmas produced the same results as before.
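To illustrate the inner join idea, here is a small sketch comparing a left join followed by dropping unmatched rows against an inner join. The toy grid, its `pop_est` values and the point coordinates are invented for the example, and it only checks that both approaches keep the same points in this simple case; it has not been checked against the project's data.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# Hypothetical two-cell grid with a named index, as in the issue.
grid = gpd.GeoDataFrame(
    {'pop_est': [100, 50]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    index=pd.Index([10, 11], name='grid_id'),
    crs='EPSG:3857',
)
# Three hypothetical sample points; the last lies outside the grid.
points = gpd.GeoDataFrame(
    {'name': ['a', 'b', 'c']},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs='EPSG:3857',
)

# Left join keeps every point; unmatched points get NA in the joined columns.
left = gpd.sjoin(points, grid, how='left', predicate='within')
left_kept = left.dropna(subset=['pop_est'])

# Inner join keeps only the points that matched a grid cell.
inner = gpd.sjoin(points, grid, how='inner', predicate='within')

same_points = sorted(left_kept.index) == sorted(inner.index)
print(same_points)
```

One possible reason to keep the left join with an explicit `dropna` flag is that callers can still opt in to retaining unmatched points.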

While doing the above I also refactored the ID filtering.

This replaced the following code:

https://github.com/global-healthy-liveable-cities/global-indicators/blob/7c7b974b2c0b47dba20646c3d9c0e77a5b0b0b93/process/subprocesses/_12_neighbourhood_analysis.py#L267-L297

with this function in setup_sp.py:

def filter_ids(df, query, message):
    """Filter df using a pandas query string, reporting how many records were discarded."""
    print(message)
    pre_discard = len(df)
    df = df.query(query)
    post_discard = len(df)
    print(
        f'  {pre_discard - post_discard} sample points discarded, '
        f'leaving {post_discard} remaining.',
    )
    return df

and this code that uses it in _12_neighbourhood_analysis.py:

    # index.tolist() yields plain Python values, so each interpolated list is a valid query literal
    samplePointsData = filter_ids(
        df=samplePointsData,
        query=f"""grid_id not in {grid.query(f'pop_est < {population["pop_min_threshold"]}').index.tolist()}""",
        message='Restrict sample points to those not located in grids with a population below '
        f"the minimum threshold value ({population['pop_min_threshold']})...",
    )
    samplePointsData = filter_ids(
        df=samplePointsData,
        query=f"""n1 in {gdf_nodes_simple.index.tolist()} and n2 in {gdf_nodes_simple.index.tolist()}""",
        message='Restrict sample points to those with two associated sample nodes...',
    )
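As a rough, self-contained illustration of how the two filters behave, the sketch below repeats `filter_ids` and applies it to toy data. The frames `sample_points` and `grid`, the `valid_nodes` list and the threshold of 5 are all hypothetical stand-ins for the real `samplePointsData`, `grid`, `gdf_nodes_simple` and `population["pop_min_threshold"]`. It builds the ID lists with `index.tolist()` so that plain Python integers are interpolated into the query strings.

```python
import pandas as pd


def filter_ids(df, query, message):
    """Filter df using a pandas query string, reporting how many records were discarded."""
    print(message)
    pre_discard = len(df)
    df = df.query(query)
    post_discard = len(df)
    print(
        f'  {pre_discard - post_discard} sample points discarded, '
        f'leaving {post_discard} remaining.',
    )
    return df


# Hypothetical sample points with a grid ID and two associated node IDs.
sample_points = pd.DataFrame(
    {'grid_id': [10, 10, 11, 12], 'n1': [1, 2, 3, 4], 'n2': [2, 3, 4, 99]},
    index=pd.Index([100, 101, 102, 103], name='point_id'),
)
# Hypothetical grid; cell 11 falls below the assumed threshold of 5.
grid = pd.DataFrame(
    {'pop_est': [50, 3, 20]}, index=pd.Index([10, 11, 12], name='grid_id'),
)
pop_min_threshold = 5
valid_nodes = [1, 2, 3, 4]  # stand-in for gdf_nodes_simple.index.tolist()

low_pop_ids = grid.query(f'pop_est < {pop_min_threshold}').index.tolist()
after_pop = filter_ids(
    df=sample_points,
    query=f'grid_id not in {low_pop_ids}',
    message='Discarding points in low-population grid cells...',
)
after_nodes = filter_ids(
    df=after_pop,
    query=f'n1 in {valid_nodes} and n2 in {valid_nodes}',
    message='Discarding points without two valid nodes...',
)
```

Here the first filter removes point 102 (grid cell 11) and the second removes point 103 (node 99 is not in the node list).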

I'll do a commit and pull request referencing this issue with the above fixes shortly.
