Improve zipcode counting notebook by adding GPU backed WKT parser #1130

Merged
208 changes: 145 additions & 63 deletions notebooks/ZipCodes_Stops_PiP_cuSpatial.ipynb
@@ -135,33 +135,114 @@
"metadata": {},
"outputs": [],
"source": [
"#Import CSV of Zipcodes\n",
"# Import CSV of ZipCodes\n",
"d_zip = cudf.read_csv(\n",
" path_of(\"USA_Zipcodes_2019_Tiger.csv\"),\n",
" usecols=[\"WKT\", \"ZCTA5CE10\", \"INTPTLAT10\", \"INTPTLON10\"])\n",
"d_zip.INTPTLAT10 = d_zip.INTPTLAT10.astype(\"float\")\n",
"d_zip.INTPTLON10 = d_zip.INTPTLON10.astype(\"float\")"
]
},
{
"cell_type": "markdown",
"id": "50b8d8bc-378f-4faa-b60c-e8f0ff507b2a",
"metadata": {},
"source": [
"The geometries are stored in [Well Known Text (WKT)](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry) format.\n",
"Parsing the geoseries to geometry objects on host is possible, but can be very slow (uncomment to run):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "583856a2-0b29-48c1-ade7-3840f640c5e7",
"metadata": {},
"outputs": [],
"source": [
"# Load WKT as shapely objects\n",
"h_zip = d_zip.to_pandas()\n",
"h_zip[\"WKT\"] = h_zip[\"WKT\"].apply(wkt.loads)\n",
"h_zip = gpd.GeoDataFrame(h_zip, geometry=\"WKT\", crs='epsg:4326')\n",
"# %%time\n",
"# # Load WKT as shapely objects\n",
"# h_zip = d_zip.to_pandas()\n",
"# h_zip[\"WKT\"] = h_zip[\"WKT\"].apply(wkt.loads)\n",
"# h_zip = gpd.GeoDataFrame(h_zip, geometry=\"WKT\", crs='epsg:4326')\n",
"\n",
"# # Transfer back to GPU with cuSpatial\n",
"# d_zip = cuspatial.from_geopandas(h_zip)"
]
},
{
"cell_type": "markdown",
"id": "6fdaedfc-2d7f-4d73-a9b8-e3a8131bea2f",
"metadata": {},
"source": [
"Instead, we can use cudf list and string method to parse the wkt into coordinates and build a geoseries.\n",
"Without roundtripping to host, cudf provides ~40x speed up by computing on GPU. \n",
"\n",
"# Transfer back to GPU with cuSpatial\n",
"d_zip = cuspatial.from_geopandas(h_zip)"
"Reference machine: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz v.s. NVIDIA Tesla V100 SXM2 32GB\n",
"\n",
"Caveats: geopandas also perform coordinate transform when loading WKT, since the dataset CRS is natively epsg:4326, loading on device can skip this step."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "fd3a5139-3b8b-4311-b966-7d2f08bff21f",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.5 s, sys: 890 ms, total: 2.39 s\n",
"Wall time: 2.37 s\n"
]
}
],
"source": [
"%%time\n",
"def parse_multipolygon_WKT_cudf(wkts, dtype=\"f8\"):\n",
" def offsets_from_listlen(list_len):\n",
" return cudf.concat([cudf.Series([0]), list_len.cumsum()])\n",
" \n",
" def traverse(s, split_pat, regex=False):\n",
" \"\"\"Traverse one level lower into the geometry hierarchy,\n",
" using `split_pat` as the child delimiter.\n",
" \"\"\"\n",
" s = s.str.split(split_pat, regex=regex)\n",
" list_len = s.list.len()\n",
" return s.explode(), list_len\n",
" \n",
" wkts = (wkts.str.lstrip(\"MULTIPOLYGON \") \n",
" .str.strip(\"(\") \n",
" .str.strip(\")\"))\n",
" # split into list of polygons\n",
" wkts, num_polygons = traverse(wkts, \"\\)\\),\\s?\\(\\(\", regex=True)\n",
" # split polygons into rings\n",
" wkts, num_rings = traverse(wkts, \"\\),\\s?\\(\", regex=True)\n",
" # split coordinates into lists\n",
" wkts, num_coords = traverse(wkts, \",\", regex=True)\n",
" # split into x-y coordinates\n",
" wkts = wkts.str.split(\" \")\n",
" wkts = wkts.explode().astype(cp.dtype(dtype))\n",
" \n",
" # compute ring_offsets\n",
" ring_offsets = offsets_from_listlen(num_coords)\n",
" # compute part_offsets\n",
" part_offsets = offsets_from_listlen(num_rings)\n",
" # compute geometry_offsets\n",
" geometry_offsets = offsets_from_listlen(num_polygons)\n",
" \n",
" return cuspatial.GeoSeries.from_polygons_xy(\n",
" wkts, ring_offsets, part_offsets, geometry_offsets)\n",
"\n",
"d_wkt = parse_multipolygon_WKT_cudf(d_zip.WKT)\n",
"d_zip.WKT = d_wkt"
]
},
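Note on the cell above: the split/explode/cumsum pattern may be easier to follow in a CPU-only form. The sketch below is not part of this PR and does not use cudf or cuSpatial; `parse_multipolygons` and `offsets_from_lengths` are hypothetical names chosen for illustration. It assumes well-formed `MULTIPOLYGON (((...)))` input and reuses the two regex delimiters from the notebook to show how per-level child counts become the ring, part, and geometry offsets.

```python
import re
from itertools import accumulate


def offsets_from_lengths(lengths):
    """Turn child counts into offsets, e.g. [1, 2] -> [0, 1, 3]."""
    return [0, *accumulate(lengths)]


def parse_multipolygons(wkts):
    """Parse MULTIPOLYGON WKT strings into flat xy values plus offsets.

    Hypothetical CPU-only illustration; assumes every input is a
    well-formed 'MULTIPOLYGON (((...)))' string.
    """
    xy, ring_lens, part_lens, geom_lens = [], [], [], []
    for wkt in wkts:
        # Drop the literal tag, then the outermost "(((" and ")))".
        body = wkt.strip()[len("MULTIPOLYGON"):].strip()[3:-3]
        # Level 1: polygons within the multipolygon.
        polygons = re.split(r"\)\),\s?\(\(", body)
        geom_lens.append(len(polygons))
        for polygon in polygons:
            # Level 2: rings within a polygon.
            rings = re.split(r"\),\s?\(", polygon)
            part_lens.append(len(rings))
            for ring in rings:
                # Level 3: coordinate pairs within a ring.
                coords = [pair.split() for pair in ring.split(",")]
                ring_lens.append(len(coords))
                for x, y in coords:
                    xy.extend((float(x), float(y)))
    return (
        xy,
        offsets_from_lengths(ring_lens),   # ring -> first point index
        offsets_from_lengths(part_lens),   # polygon -> first ring index
        offsets_from_lengths(geom_lens),   # multipolygon -> first polygon index
    )


example = (
    "MULTIPOLYGON (((0 0, 1 0, 1 1, 0 0)), "
    "((2 2, 3 2, 3 3, 2 2), (2.2 2.2, 2.4 2.2, 2.4 2.4, 2.2 2.2)))"
)
xy, ring_offsets, part_offsets, geometry_offsets = parse_multipolygons([example])
print(ring_offsets)      # [0, 4, 8, 12]
print(part_offsets)      # [0, 1, 3]
print(geometry_offsets)  # [0, 2]
```

Ring offsets count coordinate pairs rather than individual floats, which matches how `num_coords` is collected in the notebook before the final x-y split.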
{
"cell_type": "code",
"execution_count": 8,
"id": "a13b228d-9a60-4f32-b548-fa6f4240e75e",
"metadata": {
"tags": []
@@ -175,7 +256,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"id": "33da801e-01a3-4c9f-bbba-0c61dc7677d9",
"metadata": {
"tags": []
@@ -330,7 +411,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 10,
"id": "c0cadafb-acae-41d6-bbca-c10a8201699c",
"metadata": {
"tags": []
@@ -342,7 +423,7 @@
"text": [
"/raid/wangm/dev/rapids/cuspatial/python/cuspatial/cuspatial/core/spatial/indexing.py:174: UserWarning: scale -1 is less than required minimum scale 0.009837776664632286. Clamping to minimum scale\n",
" warnings.warn(\n",
"/raid/wangm/dev/rapids/cuspatial/python/cuspatial/cuspatial/core/spatial/join.py:150: UserWarning: scale -1 is less than required minimum scale 0.009837776664632286. Clamping to minimum scale\n",
"/raid/wangm/dev/rapids/cuspatial/python/cuspatial/cuspatial/core/spatial/join.py:146: UserWarning: scale -1 is less than required minimum scale 0.009837776664632286. Clamping to minimum scale\n",
" warnings.warn(\n"
]
}
@@ -365,7 +446,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 11,
"id": "d2571a4a-a898-4e04-9fd2-21eb6b7a7f3e",
"metadata": {
"tags": []
@@ -377,7 +458,7 @@
"(1762, 33144)"
]
},
"execution_count": 10,
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
@@ -406,7 +487,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 12,
"id": "370ee37c-1311-4f54-9b0c-afd862c489aa",
"metadata": {
"tags": []
@@ -418,7 +499,7 @@
"text": [
"/raid/wangm/dev/rapids/cuspatial/python/cuspatial/cuspatial/core/spatial/indexing.py:174: UserWarning: scale -1 is less than required minimum scale 0.0029100948550503493. Clamping to minimum scale\n",
" warnings.warn(\n",
"/raid/wangm/dev/rapids/cuspatial/python/cuspatial/cuspatial/core/spatial/join.py:150: UserWarning: scale -1 is less than required minimum scale 0.0029100948550503493. Clamping to minimum scale\n",
"/raid/wangm/dev/rapids/cuspatial/python/cuspatial/cuspatial/core/spatial/join.py:146: UserWarning: scale -1 is less than required minimum scale 0.0029100948550503493. Clamping to minimum scale\n",
" warnings.warn(\n"
]
}
@@ -436,7 +517,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"id": "5674f74a-9315-4e1f-ac0d-c45a1b97ae3e",
"metadata": {
"tags": []
@@ -471,49 +552,49 @@
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-121.858094</td>\n",
" <td>37.280787</td>\n",
" <td>95136</td>\n",
" <td>-117.649068</td>\n",
" <td>33.494571</td>\n",
" <td>92675</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-121.856648</td>\n",
" <td>37.278295</td>\n",
" <td>95136</td>\n",
" <td>-117.649226</td>\n",
" <td>33.494498</td>\n",
" <td>92675</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-121.855441</td>\n",
" <td>37.280375</td>\n",
" <td>95136</td>\n",
" <td>-117.649102</td>\n",
" <td>33.494483</td>\n",
" <td>92675</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-121.856343</td>\n",
" <td>37.283195</td>\n",
" <td>95136</td>\n",
" <td>-117.646427</td>\n",
" <td>33.494877</td>\n",
" <td>92675</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-121.856604</td>\n",
" <td>37.281005</td>\n",
" <td>95136</td>\n",
" <td>-117.647351</td>\n",
" <td>33.499920</td>\n",
" <td>92675</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" x y ZCTA5CE10\n",
"0 -121.858094 37.280787 95136\n",
"1 -121.856648 37.278295 95136\n",
"2 -121.855441 37.280375 95136\n",
"3 -121.856343 37.283195 95136\n",
"4 -121.856604 37.281005 95136\n",
"0 -117.649068 33.494571 92675\n",
"1 -117.649226 33.494498 92675\n",
"2 -117.649102 33.494483 92675\n",
"3 -117.646427 33.494877 92675\n",
"4 -117.647351 33.499920 92675\n",
"(GPU)"
]
},
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
@@ -534,7 +615,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 14,
"id": "247d716c-4718-4aba-8d4f-5f816852194d",
"metadata": {
"tags": []
@@ -544,15 +625,15 @@
"data": {
"text/plain": [
"ZCTA5CE10\n",
"94901 131\n",
"94535 205\n",
"95112 103\n",
"95407 126\n",
"93933 205\n",
"91107 13\n",
"91941 29\n",
"93730 17\n",
"94512 3\n",
"92553 43\n",
"Name: stop_count, dtype: int32"
]
},
"execution_count": 13,
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
@@ -565,7 +646,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 15,
"id": "ccf31694-275d-4987-a318-79bc1ea79e73",
"metadata": {
"tags": []
@@ -599,39 +680,48 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 16,
"id": "2d3d09b1-d42c-471d-b197-d3d705b2b109",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# host_df = stop_counts_and_bounds.to_geopandas()\n",
"# host_df = host_df.rename({\"WKT\": \"geometry\"}, axis=1).set_geometry(\"geometry\")\n",
"# host_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "6a9cb2d6-c7d3-4063-9b24-47101dba0044",
"metadata": {},
"outputs": [],
"source": [
"# # Visualize the Dataset\n",
"\n",
"# # Move dataframe to host for visualization\n",
"# host_df = stop_counts_and_bounds.to_geopandas()\n",
"# host_df = host_df.rename({\"WKT\": \"geometry\"}, axis=1)\n",
"# host_df.head()\n",
"\n",
"# # Geo Center of CA: 120°4.9'W 36°57.9'N\n",
"# view_state = pdk.ViewState(\n",
"# **{\"latitude\": 33.96500, \"longitude\": -118.08167, \"zoom\": 6, \"maxZoom\": 16, \"pitch\": 95, \"bearing\": 0}\n",
"# )\n",
"\n",
"# gpd_layer = pdk.Layer(\n",
"# \"GeoJsonLayer\",\n",
"# data=host_df[[\"geometry\", \"stop_count\", \"ZCTA5CE10\"]],\n",
"# data=host_df,\n",
"# get_polygon=\"geometry\",\n",
"# get_elevation=\"stop_count\",\n",
"# extruded=True,\n",
"# elevation_scale=50,\n",
"# get_fill_color=[227,74,51],\n",
"# get_line_color=[255, 255, 255],\n",
"# auto_highlight=True,\n",
"# auto_highlight=False,\n",
"# filled=True,\n",
"# wireframe=True,\n",
"# pickable=True\n",
"# )\n",
"\n",
"# tooltip = {\"html\": \"<b>Stop Sign Count:</b> {stop_count} <br> <b>ZipCode: {ZCTA5CE10}\"}\n",
"# tooltip = {\"html\": \"<b>Stop Sign Count:</b> {stop_count} <br> <b>ZipCode:</b> {ZCTA5CE10}\"}\n",
"\n",
"# r = pdk.Deck(\n",
"# gpd_layer,\n",
@@ -640,7 +730,7 @@
"# tooltip=tooltip,\n",
"# )\n",
"\n",
"# r.to_html(\"geopandas_layer.html\", notebook_display=False)"
"# r.to_html(\"geopandas_layer.html\", notebook_display=True)"
]
},
{
@@ -652,14 +742,6 @@
"\n",
"![stop_per_state_map](https://github.com/isVoid/cuspatial/raw/notebook/zipcode_counting/notebooks/stop_states.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7611be2-6dbe-40a5-ae9e-51283737d3f2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -678,7 +760,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.10.10"
}
},
"nbformat": 4,