The quadord
key’s properties give us the following promises:
-
Its descendants always come after it in sort order
-
Once the order comes to a tile that is not one of its descendants, that tile will never again be relevant.
Put another way: all relevant records for a tile come in an unbroken stretch from the first time its key is seen, up to (but not including) the first time a non-descendant key is seen.
While it may seem like we need to keep around a lot of candidate tiles for matching, in fact the only tiles we need to retail
A good way to think of this is to pretend we were doing a depth-first enumeration of each tile and its descendants. (Which we are — we just skip all the keys that don’t show up). At every point up down and across the tree, the only relevant rows are for the current tile and its ancestors, and we only need to keep a tile around while we investigate its descendants
(TODO: make a similar diagram showing this point more specifically.)
If you paid good attention back in CS312, hearing "depth first search" makes you think "I wonder if I’m about to use a Stack data structure". If so, give yourself a gold star. We’re not going to actually do a depth-first iteration
-
Visit each stack and pop elements until you hit one that is either the same as or an ancestor of our key (or until the stack is empty)
-
Push the current element onto its stack
-
Iterate over all elements in the other stack, emitting a tuple for each match.
Let’s walk through it. As with the other walkthroughs in this chapter, we’re going to only use four zoom levels and write keys in our "string handle" notation. The string-handle keys have exactly the same sorting properties as our quadord
keys do, so the explanation carries over.
quad slot place | -- stack 0: tiles for first slot (rooms) -- ! -- stack 1: people slot -- 0 0 West Wing | 0 WestWing ! 00 0 Kitchen | 0 WestWing 00 Kitchen ! 01 0 Kitchen | 0 WestWing 01 Kitchen ! 0120 1 Mme La Rose | 0 WestWing 01 Kitchen ! 0120 Mme La Rose 0123 0 Pantry | 0 WestWing 01 Kitchen 0123 Pantry ! 0132 0 Pantry | 0 WestWing 01 Kitchen 0132 Pantry ! 0133 0 Pantry | 0 WestWing 01 Kitchen 0133 Pantry ! 0133 1 Mr Saffron | 0 WestWing 01 Kitchen 0133 Pantry ! 0133 Mr Saffron 02 0 Dining | 0 WestWing 02 Dining ! 030 0 Closet | 0 WestWing 030 Closet ! 0300 1 Red Kelly | 0 WestWing 030 Closet ! 0300 Red Kelly 031 0 Stairs | 0 WestWing 031 Stairs ! 032 0 Closet | 0 WestWing 032 Closet ! 033 0 Stairs | 0 WestWing 033 Stairs ! ... | ! 10 0 Conservatory | 10 Conservatory ! 10 0 East Wing | 10 Conservatory 10 EastWing ! 110 0 Conservatory | 110 Conservatory ! 110 0 East Wing | 110 Conservatory 110 EastWing ! 112 0 Conservatory | 112 Conservatory ! 112 0 East Wing | 112 Conservatory 112 EastWing ! 1123 1 Ms Peach | 112 Conservatory 112 EastWing ! 1123 Ms Peach
The 0 West Wing
covers all of the 0*
blocks, so it comes first in line. There’s nothing to sweep off the stacks, and nothing to pair with, so we just push it onto stack 0. Next is the 00 Kitchen
block, covering 00
. We keep the 0 WestWing
block (as 0
is a prefix of 00
), push 00 Kitchen
onto stack 0, and since there’s nothing to pair with, continue.
The 01 Kitchen
block is next. It evicts the neighboring 00 Kitchen
block, but not its ancestor 0 WestWing
block. Since there’s still nothing in the people
stack, we push 01 Kitchen
and move on. Mme La Rose’s record, the first we’ve seen from the peopl slot, now finally gets the party started. Both keys on the stack (0
and 01
) are ancestors of 0120
, and so nothing is swept and we push Mme La Rose onto the people table. The matching phase generates pairs indicating that Mme La Rose is 01 Kitchen
and 0 WestWing
at the time of the incident.
The 0120 Pantry
tile sweeps its predecessor but produces no matches, as do the next two Pantry tiles.
0133 Mr Saffron
finds himself paired with three containing shapes: the WestWing, Kitchen and Pantry. The next step continues with the 02 Dining Room
in the lower left of the 0
block. This sweeps out both the Pantry and the Kitchen from the stack, but retains the parent 0 WestWing
.
We’ve supplied a few more of the blocks — trace through them until you’re getting the hang of it. Let’s skip ahead though. The diagonal south wall separating the Dining room from the Lounge means that the 2002, 2003, 2011, and 2012 blocks contain parts of each room, and it’s useful to see where the ambiguity is resolved.
-
| ! 20 0 West Wing | 20 WestWing ! 2000 0 Dining | 20 WestWing 2000 Dining ! 2001 0 Dining | 20 WestWing 2001 Dining ! 2002 0 Dining | 20 WestWing 2002 Dining ! 2002 0 Lounge | 20 WestWing 2002 Dining 2002 Lounge ! 2003 0 Dining | 20 WestWing 2003 Dining ! 2003 0 Lounge | 20 WestWing 2003 Dining 2003 Lounge ! 2003 1 Dr Jade | 20 WestWing 2003 Dining 2003 Lounge ! 2003 Dr Jade 2003 1 Sr Azul | 20 WestWing 2003 Dining 2003 Lounge ! 2003 Dr Jade 2003 Sr Azul 2010 0 Dining | 20 WestWing 2010 Dining !
The 2003 Lounge
record comes up with the 20 WestWing
and 2003 Dining
records already on the stack. The 2003 Dining
record has the same tile id, and so according to our rules it is not evicted. The 2003 Dr Jade
record sweeps nothing from the stack and generates pairs for the WestWing, Dining Room and Lounge. The 2003 Sr Azul
record sweeps nothing from either stack (even the Dr Jade
record). Its only pairings are with WestWing, Dining Room and Lounge, though — the continued presence of 2003 Dr Jade
in the second stack has nothing to do with the matchmaking.
Here is a sample of the pairs that come out of all of this:
0 WestWing 0120 55 30 Mme La Rose 01 Kitchen 0120 55 30 Mme La Rose 0 WestWing 0133 90 40 Mr Saffron 01 Kitchen 0133 90 40 Mr Saffron 0133 Pantry 0133 90 40 Mr Saffron ... 20 WestWing 2003 20 115 Dr Jade 2003 Dining 2003 20 115 Dr Jade 2003 Lounge 2003 20 115 Dr Jade 20 WestWing 2003 23 122 Sr Azul 2003 Dining 2003 23 122 Sr Azul 2003 Lounge 2003 23 122 Sr Azul
The output of this step only considered tile membership, and so Dr Jade and Sr Azul are each listed in candidate pairings with the Dining Room and the Lounge. That’s by design; our next step is to use each room’s geometry object to filter out the non-matches.
peep_rooms_f = FILTER peep_rooms_g BY GeoIntersects(room.geom, peep.pt);
Using only Hadoop’s built-in sort and little memory overhead, we were able to assemble records into groups even when they weren’t contiguous in the sort order.
An important difference from the conventional COGROUP comes in how we designed the sorting keys and data structures. In a conventional COGROUP, we order the data by the partition key, then the table slot (all records from the left-mentioned input precede those from the last-mentioned input), then any secondary sort keys. That means we don’t need a data structure for the last-mentioned input and don’t even hold its records in memory — all possible matches for a record from the last input are already sitting hot in RAM ready to make beautiful output tuples. In the spatial COGROUP, we partition on the coarse zoom-level prefix, then sort on the full quadord
key before the table slot index. Since the keys must be sorted to support the depth-first-like traversal, it’s likely that matching rows from each slot will intermingle. So while the regular COGROUP doesn’t have to allocate a data structure for the records in its last-mentioned input, a spatial join of two tables needs to maintain two stacks.
Frequently, geospatial data is, for practical reasons, sampled at discrete points but should be understood to represent measurements at all points in space. For example, the measurements in the NCDC datasets are gathered at locations chosen for convenience or value — in some cases, neighboring stations are separated by blocks, in other cases by hundreds of miles. It is useful to be able to reconstruct the underlying spatial distribution from point-sample measurements.
Given a set of locations — broadcast towers, 7-11 stores, hospitals — it is also useful to be able to determine, for any point in space, which of those objects is nearest. When the distribution of objects is even, this is straightforward: choose a bounding box or quad tile you are sure will encompass the point in question and all candidate locations, then choose the nearest candidate. When the distribution is highly uneven, though, the bounding box that works well in rural Montana may return overwhelmingly many results in midtown Manhattan.
We can solve both those problems with a single elegant approach known as Voronoi partitioning. Given a set of seed locations, the Voronoi partitioning returns a set of polygons with the following properties:
-
The polygon’s ‘partition’ is the space divided such that every piece of the plane belongs to exactly one polygon.
-
There is exactly one polygon for each seed location and all points within it are closer to that seed location than to any other seed location.
-
All points on the boundary of two polygons are equidistant from the two neighboring seed locations; and all vertices where Voronoi polygons meet are equidistant from the respective seed locations.
This effectively precomputes the “nearest x” problem: For any point in question, find the unique polygon within which it resides (or rarely, the polygon boundaries upon which it lies). Breaking those polygons up by quad tile at a suitable zoom level makes it easy to either store them in HBase (or equivalent) for fast querying or as data files optimized for a spatial JOIN.
It also presents a solution to the spatial sampling problem by assigning the measurements taken at each sample location to its Voronoi region. You can use these piece-wise regions directly or follow up with some sort of spatial smoothing as your application requires. Let’s dive in and see how to do this in practice.
The weather station data is sampled at each weather station, and forms our best estimate for the surrounding region’s weather.
So weather data is gathered at a point, but imputes information about a region. You can’t just slap each point down on coarse-grained tiles — the closest weather station might lie just over on the next quad, and you’re writing a check for very difficult calculations at run time.
We also have a severe version of the multiscale problem. The coverage varies wildly over space: a similar number of weather stations cover a single large city as cover the entire Pacific ocean. It also varies wildly over time: in the 1970s, the closest weather station to Austin, TX was about 150 km away in San Antonio. Now, there are dozens in Austin alone.
There’s an elegant solution known as a Voronoi diagram [1].
The Voronoi diagram covers the plane with polygons, one per point — I’ll call that the "centerish" of the polygon. Within each polygon, you are closer to its centerish than any other. By extension, locations on the boundary of each Voronoi polygon are equidistant from the centerish on either side; polygon corners are equidistant from centerishes of all touching polygons [2].
Also known as local auto-correlation
Gi-star (Gettis-Ord)