While spatially aggregating pointwise data required nothing terribly sophisticated, doing the same for objects with spatial extent will motivate a couple new tricks. As an example, let’s take agricultural production for each country [1] and smooth it onto a uniform grid. (We’ll find some interesting connections to this data in the final chapter).
Fair warning: though the first version of this script we demonstrate for you is correct as it stands, it will be sub-optimal. But we want to demonstate the problem and then discuss its elegant solution.
The key first step is to assemble a table giving the weight of each region’s contribution to the tile.
-
Break up the geographic shapes onto tiles. Luxembourg breaks up into two tiles representing XXX% and 1-XXX% of its area. Russia breaks into XXX tiles (!).
-
Bring all fragments one each tile together. For example, the southern of the tiles holding Luxembourg also holds parts of Belgium, France, and Germany.
-
Find the area fraction of each segment on its tile along with the combined shapes. Our example four-country tile is covered by XXX% Luxembourg, XXX% Belgium, XXX% France, and XXX% Germany. Meanwhile, most of Russia’s tiles lie 100% within its interior.
(... code ...) quad_area_frac: fragment's pull / sum of fragments' pulls (id) (name) (area on quad) (area of whole) (fraction of occupied area on quad) (shape)
With this table in hand, a simple join of the each region’s statistics with the table of weights gives the aggregated fraction. For rate or aggregated statistics, such as commodity price index or proportion of imports to exports, we only need to weight the contribution by fraction-on-tile. The relative price index of eggs in Luxembourg in 2011 was 118, 135 in France 170 in Germany and 144 in Belgium, and so our four-country exemplar scale has an averaged price index of XXX (EQUATION).
For counting statistics such as "number of head of cattle", we need to first use the fraction-of-country figure to find the contributed count, and then use the Fraction-on-tile to generate the aggregate. It isn’t fair to pit all of Germany’s XXX head against Luxembourg’s XXX on that tile; we should consider that Germany weighed in with a proportional XXX head (XXX%) against Luxembourg’s XXX head (XXX%).
Note
|
Yes, it’s problematic that we are smearing all of each country’s production uniformly throughout its interior. Florida grows far more cane sugar than Alaska. Your options? First, it might not matter. Predictive analytics algorithms are often quite happy to accept imperfect data if it means more signal. The audience for an "Every possible statistic about where I’m Standing" mobile app would have to accept that we don’t quite have everything yet; just be upfront about the data’s imperfections. If that’s not acceptable, you can try to find better data: through commercial vendors, or through each country’s agricultural statistics website, if any. A big expense either way. Too often, people are reluctant to pursue the middle ground of supplementing the data with their own expertise. Torturing insufficient data into a misleading result is a mortal sin, but acting on only that which can be quantified will send you to perdition just the same. We all just agreed that there are no sugar cane farms in Alaska — finding out what things Alaska, Greenland, Siberia and the Australian Outback do produce, and excluding those regions from your analysis otherwise, would yield a large improvement at modest effort. Even if we don’t know exactly how production is distributed within a country, we know that (a) this only matters for countries so large they span different ecologies; (b) the smaller countries give you a good sense of where those ecologies are. You could hold out the too-large countries from a first pass and use that background distribution to allocate This is only justifiable as a way to turn a bad estimate into a serviceable one, but sometimes that’s enough. The best way to get good at this is to watch masters at work. Besides peer-reviewed articles in your field, one of the best things you can do to advance along your path is to regularly read the data-journalism articles on the FiveThirtyEight blog. Nate Silver and Emily Oster especially excel at supplementing what they can measure with what they know, without either muddying the plot or corrupting the outcome. [2] |
Running the job will produce output like the graph at the right (REF). It should also produce a sense of frustration at the run time and output size.
It’s all those tiny trivial tiles from Russia, the US, and other countries with a large interior. As our approach stands right now, 200-something countries produced XXX number of output tiles. Joining that to the 2.3 million rows of crop production data produces big data from small, which is just the opposite of what we like to do.
Instead of breaking down
The QuadDecompose UDF accepts an outer and an inner zoom level. Every region will be broken down to at least the coarser zoom level; you’ll use this grid size as the partition key (in map/reduce) or group key (in Pig). No region will be broken into tiles smaller than the inner zoom level, which is important for both managing the data volume and for ensuring we don’t group data into finer bins than it can support (see the section on "Choosing a Histogram Bin Size" (REF)).
In the graphic (REF), you can see the result: the top two images, showing the tiles decomposed naively (on the left) and hierarchically (on the right), are identical. The lower row has the same data but with a light border around each tile in the heatmap. The large interior portions of Russia, China, the USA, Brazil and others are now represented by coarser zoom-level blocks. Even France and Argentina manage to line up their borders fortuitously for a zoom-level 6 (TODO check size) block. In all, we reduced the number of tiles from XXX down to XXX without impacting accuracy. TODO: move to join part below.
Since every point on the map is covered by at most one region for the current data set, you’ll find that every tile is either (a) wholly in the interior of a country, or (b) at the finest zoom level: there’s no way for Russia to send data off to a huge zoom-level 5 tile while Latvia sends data to one of its zoom-level 7 children.
The zoom-level chosen for each tile in the FOREACH…QuadDecompose
pass was the correct granularity for the result.
Let’s look closer at the quadtree scheme.
Our Reindeer friends, with their ample free time and bizarre sense of humor, sometimes vacation at a quaint vacation estate known as Spatial Manor. On the one hand, every once in a while somebody attacks a fellow guest with candlestick or rope or whatever’s close at hand. On the other hand, though, most people aren’t killed and get to enjoy the rousing sport of solving a mystery!
We can apply our spatial analysis tools to its simple geometry and help the investigators make a data-backed decision. Here is a map of the grounds.
Features following a constant bearing in any direction — Manhattan’s Broadway, or the borderlines of Algeria or Nevada — remain straight lines on the map. This is important for navigational purposes
The locality properties of quadtiles indexing
Typically you will store the shape clipped to the given quad-tile. metadata about that region — population, metric tons of bananas exported annually, an image of its flag, lyrics to its national anthem — is stored under the same key but in independent columns footnote:[typically the regions are heavyweight, heavily requested and read-only, so they deserve their own table or at least their own column family.
You don’t have to break What we do is When tile 0231_1 is requested
0230_103 us (California 0230_11X us 0230_2XX pacific 0230_30* 0320_31* 0320_32X 0320_33* 0231_0XX US 0231_1xx US 0231_2** Mexico, SW US, pacific 0231_3** Mexico, SW US, Gulf of Mexico 0231_322 Monterey 0231_323 Tampico 0231_33 Caribbean east of Mexico 0232_3333 4 between Hawaii and Baka calif
Querying on the ZL-5 tile 0231_0 (using key 0231_033), any of its four children (using keys 0231_003, 0231_013, 0231_023, or 0231_033) or any of the sixteen ZL-7 descendants 0231_000 through 0231_033 will retrieve a single tile. Querying on its parent 0231 will return tiles for 0231_0 and 0231_1, and a collection of zoom level 6 and 7 tiles covering the southwest US, Mexico and the Gulf of Mexico.
TODO: screenshot.
Our data is stored with no duplication, but
Its easy to decompose or clip a super-tile to a requested (finer) zoom level. The longitude just divides normally along the tile: a hypothetical tile from 16 to 32 would have spatial coordinates 16, 18, 20, …. The latitude cut points do not subdivide directly, but only need to be calculated for one edge: if tile 0230_00 has bottom edge XX latitude, so does tile 0230_01, 0230_10, 0231_11, and others in its horizontal grid row.
The quadtile trick is to the world into four and make a Z pattern across them:
Within each of those, make a Z again:
As you go along, index each tile, as shown in Quadtile Numbering:
This is a 1-d index into a 2-d space! What’s more, nearby points in space are typically nearby in index value. By applying Hadoop’s fundamental locality operation — sorting — geographic locality falls out of numerical locality.
Note: you’ll sometimes see people refer to quadtile coordinates as X/Y/Z
or Z/X/Y
; the 'Z' here refers to zoom level, not a traditional third coordinate.
Each grid cell is contained in its parent
You can also think of it as a tree:
The quadkey is a string of 2-bit tile selectors for a quadtile
@example infochimps_hq = Geo::Place.receive("Infochimps HQ", -97.759003, 30.273884) infochimps_hq.quadkey(8) # ⇒ "02313012"
First, some preliminaries:
EARTH_RADIUS = 6371000 # meters MIN_LONGITUDE = -180 MAX_LONGITUDE = 180 MIN_LATITUDE = -85.05112878 MAX_LATITUDE = 85.05112878 ALLOWED_LONGITUDE = (MIN_LONGITUDE..MAX_LONGITUDE) ALLOWED_LATITUDE = (MIN_LATITUDE..MAX_LATITUDE) TILE_PIXEL_SIZE = 256
# Width or height in number of tiles def map_tile_size(zl) 1 << zl end
The maximum latitude this projection covers is plus/minus 85.05112878
degrees. With apologies to the elves of chapter (TODO: ref), this is still well north of Alert, Canada, the northernmost populated place in the world (latitude 82.5 degrees, 817 km from the North Pole).
It’s straightforward to calculate tile_x indices from the longitude (because all the brutality is taken up in the Mercator projection’s severe distortion).
Finding the Y tile index requires a slightly more complicated formula:
This makes each grid cell be an increasingly better locally-flat approximation to the earth’s surface, palliating the geographers anger at our clumsy map projection.
In code:
# Convert longitude, latitude in degrees to _floating-point_ tile x,y coordinates at given zoom level def lat_zl_to_tile_yf(longitude, latitude, zl) tile_size = map_tile_size(zl) xx = (longitude.to_f + 180.0) / 360.0 sin_lat = Math.sin(latitude.to_radians) yy = Math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * Math::PI) # [ (map_tile_size(zl) * xx).floor, (map_tile_size(zl) * (0.5 - yy)).floor ] end
# Convert from tile_x, tile_y, zoom level to longitude and latitude in # degrees (slight loss of precision). # # Tile coordinates may be floats or integer; they must lie within map range. def tile_xy_zl_to_lng_lat(tile_x, tile_y, zl) tile_size = map_tile_size(zl) raise ArgumentError, "tile index must be within bounds ((#{tile_x},#{tile_y}) vs #{tile_size})" unless ((0..(tile_size-1)).include?(tile_x)) && ((0..(tile_size-1)).include?(tile_x)) xx = (tile_x.to_f / tile_size) yy = 0.5 - (tile_y.to_f / tile_size) lng = 360.0 * xx - 180.0 lat = 90 - 360 * Math.atan(Math.exp(-yy * 2 * Math::PI)) / Math::PI [lng, lat] end
Note
|
Take care to put coordinates in the order "longitude, latitude", maintaining consistency with the (X, Y) convention for regular points. Natural english idiom switches their order, a pernicious source of error — but the convention in geographic systems is unambiguously to use |
-
The quadkey’s length is its zoom level.
-
To zoom out (lower zoom level, larger quadtile), just truncate the quadkey: austin at ZL=8 has quadkey "02313012"; at ZL=3, "023"
-
Nearby points typically have "nearby" quadkeys: up to the smallest tile that contains both, their quadkeys will have a common prefix. If you sort your records by quadkey,
-
Nearby points are nearby-ish on disk. (hello, HBase/Cassandra database owners!) This allows efficient lookup and caching of "popular" regions or repeated queries in an area.
-
the tiles covering a region can be covered by a limited, enumerable set of range scans. For map-reduce programmers, this leads to very efficient reducers
-
-
The quadkey is the bit-interleaved combination of its tile ids:
tile_x 58 binary 0 0 1 1 1 0 1 0 tile_y 105 binary 0 1 1 0 1 0 0 1 interleaved binary 00 10 11 01 11 00 01 10 quadkey 0 2 3 1 3 0 1 2 # "02313012" packed 11718
-
You can also form a "packed" quadkey — the integer formed by interleaving the bits as shown above. At zoom level 15, the packed quadkey is a 30-bit unsigned integer — meaning you can store it in a pig
int
; for languages with anunsigned int
type, you can go to zoom level 16 before you have to use a less-efficient type. Zoom level 15 has a resolution of about one tile per kilometer (about 1.25 km/tile near the equator; 0.75 km/tile at London’s latitude). It takes 1 billion tiles to tile the world at that scale. -
a limited number of range scans suffice to cover any given area
-
each grid cell’s parents are a 2-place bit shift of the grid index itself.
A 64-bit quadkey — corresponding to zoom level 32 — has an accuracty of better than 1 cm over the entire globe. In some intensive database installs, rather than storing longitude and latitude separately as floating-point numbers, consider storing either the interleaved packed quadkey, or the individual 32-bit tile ids as your indexed value. The performance impact for Hadoop is probably not worth it, but for a database schema it may be.
Though quadtile properties do vary, the variance is modest within most of the inhabited world:
The (ref table) gives the full coordinates at every zoom level for our exemplar set.