TODO:
Will be reorganizing below in this order:
-
do a "nearness" query example,
-
reveal that it is such a thing known as the spatial join, and broaden your mind as to how you think about locality.
-
cover the geographic data model, GeoJSON etc.
-
Spatial concept of Quadtiles — none of the mechanics of the projection yet
-
Something with Points and regions, using quadtiles
-
Actual mechanics of Quadtile Projection — from lng/lat to quadkey
-
mutiscale quadkey assignment
-
(k-means will move to ML chapter)
-
complex nearness — voronoi cells and weather data
also TODO: untangle the following two paragraphs, and figure out whether to put them at beginning or end (probably as sidebar, at beginning)
It not only unwinds two dimensions to one, but any system it to spatial analysis in more dimensions — see [brain_example], which also extends the coordinate handling to three dimensions
Geographic data shows up in the form of
-
Points — a pair of coordinates. When given as an ordered pair (a "Position"), always use
[longitude,latitude]
in that order, matching the familiarX,Y
order for mathematical points. When it’s a point with other metadata, it’s a Place [1], and the coordinates are named fields. -
Paths — an array of points
[[longitude,latitude],[longitude,latitude],…]
-
Region — an array of paths, understood to connect and bound a region of space.
[
. Your array will be of length one unless there are holes or multiple segments -
"Feature" — a generic term for "Point or Path or Region".
-
"Bounding Box" (or
bbox
) — a rectangular bounding region,[-5.0, 30.0, 5.0, 40.0]
*
Note
|
Features of Features
The term "feature" is somewhat muddied — to a geographer, "feature" indicates a thing being described (places, regions, paths are all geographic features). In the machine learning literature, "feature" describes a potentially-significant attribute of a data element (manufacturer, top speed and weight are features of a car). Since we’re here as data scientists dabbling in geography, we’ll reserve the term "feature" for its machine learning sense only just say "object" in place of "geographic feature" (and ). |
Geospatial Information Science ("GIS") is a deep subject, ////Say how, like, ", which focuses on the study of…" Amy////treated here shallowly — we’re interested in models that have a geospatial context, not in precise modeling of geographic features themselves. Without apology we’re going to use the good-enough WGS-84 earth model and a simplistic map projection. We’ll execute again the approach of using existing traditional tools on partitioned data, and Hadoop to reshape and orchestrate their output at large scale. [2]
Doing a "what’s nearby" query on a large dataset is difficult unless you can ensure the right locality. Large-scale geodata processing in hadoop starts with the quadtile grid system, a simple but powerful idea.
Doing a "what’s nearby" query on a large dataset is difficult. No matter how you divide up the data, some features that are nearby in geographic distance will become far away in data locality.
We also need to teach our elephant a new trick for providing data locality
Sort your places west to east, and
Large-scale geodata processing in hadoop starts with the quadtile grid system, a simple but powerful idea.
We’ll start by adopting the simple, flat Mercator projection — directly map longitude and latitude to (X,Y). This makes geographers cringe, because of its severe distortion at the poles, but its computational benefits are worth it.
Now divide the world into four and make a Z pattern across them:
Within each of those, make a Z again:
As you go along, index each tile, as shown in Quadtile Numbering:
This is a 1-d index into a 2-d space! What’s more, nearby points in space are typically nearby in index value. By applying Hadoop’s fundamental locality operation — sorting — geographic locality falls out of numerical locality.
Note: you’ll sometimes see people refer to quadtile coordinates as X/Y/Z
or Z/X/Y
; the 'Z' here refers to zoom level, not a traditional third coordinate.
////Introduce/buffer a bit first — like, "The following approach can also be used to analyze x, y, or z…" Root in real-world applications, first. Amy////
Let’s put Hadoop into practice for something really important: understanding where a likely alien invasion will take place. The National UFO Reporting Center has compiled a dataset of 60,000+ documented UFO sightings, with metadata. We can combine that with the 7 million labelled points of interest in the Geonames dataset: airports and zoos, capes to craters, schools, churches and more.
Going in to this, I predict that UFO sightings will generally follow the population distribution (because you need people around to see them) but that sightings in cities will be under-represented per capita. I also suspect UFO sightings will be more likely near airports and military bases, and in the southwestern US. We will restrict attention only to the continental US; coverage of both datasets is spotty elsewhere, which will contaminate our results.
Looking through some weather reports, visibilities of ten to fifteen kilometers (6-10 miles) are a reasonable midrange value; let’s use that distance to mean "nearby". Given this necessarily-fuzzy boundary, let’s simplify matters further by saying two objects are nearby if one point lies within the 20-km-per-side bounding box centered on the other:
+---------+---------+ | B | | | | | | | + A + | | C | | | | | | B is nearby A; C is not. Sorry, C. +---------+---------+ |- 10 km -|
What we will do is partition the world by quadtile, and ensure that each candidate pair of points arrives at the same quadtile.
Our mappers will send the highly-numerous geonames points directly to their quadtile, where they will wait individually. But we can’t send each UFO sighting only to the quadtile it sits on: it might be nearby a place on a neighboring tile.
If the quadtiles are always larger than our nearbyness bounding box, then it’s enough to just look at each of the four corners of our bounding box; all candidate points for nearbyness must live on the 1-4 quadtiles those corners touch. Consulting the geodata ready reference (TODO: ref) later in the book, zoom level 11 gives a grid size of 13-20km over the continental US, so it will serve.
So for UFO points, we will use the bbox_for_radius
helper to get the left-top and right-bottom points, convert each to quadtile id’s, and emit the unique 1-4 tiles the bounding box covers.
Example values:
longitude latitude left top right bottom nw_tile_id se_tile_id ... ... ... ...
Data is cheap and code is expensive, so for these 60,000 points we’ll just serialize out the bounding box coordinates with each record rather than recalculate them in the reducer. We’ll discard most of the UFO sightings fields, but during development let’s keep the location and time fields in so we can spot-check results.
Mapper output:
////Introduce this - (it’s true, you’ll need to reorient the reader pretty consistently). "Here, we are looking for…" Amy////
The reducer is now fairly simple. Each quadtile will have a handful of UFO sightings, and a potentially large number of geonames places to test for nearbyness. The nearbyness test is straightforward:
# from wukong/geo helpers
class BoundingBox def contains?(obj) ( (obj.longitude >= left) && (obj.latitude <= top) && (obj.longitude <= right) && (obj.latitude >= btm) end end
# nearby_ufos.rb
class NearbyReducer
def process_group(group) # gather up all the sightings sightings = [] group.gather(UfoSighting) do |sighting| sightings << sighting end # the remaining records are places group.each do |place| sighted = false sightings.each do |sighting| if sighting.contains?(place) sighted = true yield combined_record(place, sighting) end end yield unsighted_record(place) if not sighted end end
def combined_record(place, sighting) (place.to_tuple + [1] + sighting.to_tuple) end def unsighted_record(place) place.to_tuple + [0] end end
For now I’m emitting the full place and sighting record, so we can see what’s going on. In a moment we will change the combined_record
method to output a more disciplined set of fields.
Output data:
...
We now have a set of [place, sighting]
pairs, and we want to understand how the distribution of coincidences compares to the background distribution of places.
(TODO: don’t like the way I’m currently handling places near multiple sightings)
That is, we will compare the following quantities:
count of sightings count of features for each feature type, count of records for each feature type, count of records near a sighting
The dataset at this point is small enough to do this locally, in R or equivalent; but if you’re playing along at work your dataset might not be. So let’s use pig.
place_sightings = LOAD "..." AS (...);
features = GROUP place_sightings BY feature;
feature_stats = FOREACH features { sighted = FILTER place_sightings BY sighted; GENERATE features.feature_code, COUNT(sighted) AS sighted_count, COUNT_STAR(sighted) AS total_count ; };
STORE feature_stats INTO '...';
results:
-
TODO move results over from cluster …
We’ll represent geographic features in two different ways, depending on focus:
-
If the geography is the focus — it’s a set of features with data riding sidecar — use GeoJSON data structures.
-
If the object is the focus — among many interesting fields, some happen to have a position or other geographic context — use a natural Wukong model.
-
If you’re drawing on traditional GIS tools, if possible use GeoJSON; if not use the legacy format it forces, and a lot of cursewords as you go.
GeoJSON is a new but well-thought-out geodata format; here’s a brief overview. The GeoJSON spec is about as readable as I’ve seen, so refer to it for anything deeper.
The fundamental GeoJSON data structures are:
module GeoJson class Base ; include Wukong::Model ; end class FeatureCollection < Base field :type, String field :features, Array, of: Feature field :bbox, BboxCoords end class Feature < Base field :type, String, field :geometry, Geometry field :properties field :bbox, BboxCoords end class Geometry < Base field :type, String, field :coordinates, Array, doc: "for a 2-d point, the array is a single `(x,y)` pair. For a polygon, an array of such pairs." end # lowest value then highest value (left low, right high; class BboxCoords < Array def left ; self[0] ; end def btm ; self[1] ; end def right ; self[2] ; end def top ; self[3] ; end end end
GeoJSON specifies these orderings for features:
-
Point:
[longitude, latitude]
-
Polygon:
[ ]
— you must repeat the first point. The first array is the outer ring; other paths in the array are interior rings or holes (eg South Africa/Lesotho). For regions with multiple parts (US/Alaska/Hawaii) use a MultiPolygon. -
Bbox:
[left, btm, right, top]
, ie[xmin, ymin, xmax, ymax]
An example hash, taken from the spec:
{ "type": "FeatureCollection", "features": [ { "type": "Feature", "properties": {"prop0": "value0"}, "geometry": {"type": "Point", "coordinates": [102.0, 0.5]} }, { "type": "Feature", "properties": { "prop0": "value0", "prop1": {"this": "that"} }, "bbox": [ "geometry": { "type": "Polygon", "coordinates": [ [ [-10.0, 0.0], [5.0, -1.0], [101.0, 1.0], [100.0, 1.0], [-10.0, 0.0] ] ] } } ] }
Each grid cell is contained in its parent
You can also think of it as a tree:
The quadkey is a string of 2-bit tile selectors for a quadtile
@example infochimps_hq = Geo::Place.receive("Infochimps HQ", -97.759003, 30.273884) infochimps_hq.quadkey(8) # ⇒ "02313012"
First, some preliminaries:
EARTH_RADIUS = 6371000 # meters MIN_LONGITUDE = -180 MAX_LONGITUDE = 180 MIN_LATITUDE = -85.05112878 MAX_LATITUDE = 85.05112878 ALLOWED_LONGITUDE = (MIN_LONGITUDE..MAX_LONGITUDE) ALLOWED_LATITUDE = (MIN_LATITUDE..MAX_LATITUDE) TILE_PIXEL_SIZE = 256
# Width or height in number of tiles def map_tile_size(zl) 1 << zl end
The maximum latitude this projection covers is plus/minus 85.05112878
degrees. With apologies to the elves of chapter (TODO: ref), this is still well north of Alert, Canada, the northernmost populated place in the world (latitude 82.5 degrees, 817 km from the North Pole).
It’s straightforward to calculate tile_x indices from the longitude (because all the brutality is taken up in the Mercator projection’s severe distortion).
Finding the Y tile index requires a slightly more complicated formula:
This makes each grid cell be an increasingly better locally-flat approximation to the earth’s surface, palliating the geographers anger at our clumsy map projection.
In code:
# Convert longitude, latitude in degrees to _floating-point_ tile x,y coordinates at given zoom level def lat_zl_to_tile_yf(longitude, latitude, zl) tile_size = map_tile_size(zl) xx = (longitude.to_f + 180.0) / 360.0 sin_lat = Math.sin(latitude.to_radians) yy = Math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * Math::PI) # [ (map_tile_size(zl) * xx).floor, (map_tile_size(zl) * (0.5 - yy)).floor ] end
# Convert from tile_x, tile_y, zoom level to longitude and latitude in # degrees (slight loss of precision). # # Tile coordinates may be floats or integer; they must lie within map range. def tile_xy_zl_to_lng_lat(tile_x, tile_y, zl) tile_size = map_tile_size(zl) raise ArgumentError, "tile index must be within bounds ((#{tile_x},#{tile_y}) vs #{tile_size})" unless ((0..(tile_size-1)).include?(tile_x)) && ((0..(tile_size-1)).include?(tile_x)) xx = (tile_x.to_f / tile_size) yy = 0.5 - (tile_y.to_f / tile_size) lng = 360.0 * xx - 180.0 lat = 90 - 360 * Math.atan(Math.exp(-yy * 2 * Math::PI)) / Math::PI [lng, lat] end
Note
|
Take care to put coordinates in the order "longitude, latitude", maintaining consistency with the (X, Y) convention for regular points. Natural english idiom switches their order, a pernicious source of error — but the convention in geographic systems is unambiguously to use |
-
Exemplars
-
Tokyo
-
San Francisco
-
The Posse East Bar in Austin, TX [3]
-
-
The quadkey’s length is its zoom level.
-
To zoom out (lower zoom level, larger quadtile), just truncate the quadkey: austin at ZL=8 has quadkey "02313012"; at ZL=3, "023"
-
Nearby points typically have "nearby" quadkeys: up to the smallest tile that contains both, their quadkeys will have a common prefix. If you sort your records by quadkey,
-
Nearby points are nearby-ish on disk. (hello, HBase/Cassandra database owners!) This allows efficient lookup and caching of "popular" regions or repeated queries in an area.
-
the tiles covering a region can be covered by a limited, enumerable set of range scans. For map-reduce programmers, this leads to very efficient reducers
-
-
The quadkey is the bit-interleaved combination of its tile ids:
tile_x 58 binary 0 0 1 1 1 0 1 0 tile_y 105 binary 0 1 1 0 1 0 0 1 interleaved binary 00 10 11 01 11 00 01 10 quadkey 0 2 3 1 3 0 1 2 # "02313012" packed 11718
-
You can also form a "packed" quadkey — the integer formed by interleaving the bits as shown above. At zoom level 15, the packed quadkey is a 30-bit unsigned integer — meaning you can store it in a pig
int
; for languages with anunsigned int
type, you can go to zoom level 16 before you have to use a less-efficient type. Zoom level 15 has a resolution of about one tile per kilometer (about 1.25 km/tile near the equator; 0.75 km/tile at London’s latitude). It takes 1 billion tiles to tile the world at that scale. -
a limited number of range scans suffice to cover any given area
-
each grid cell’s parents are a 2-place bit shift of the grid index itself.
A 64-bit quadkey — corresponding to zoom level 32 — has an accuracty of better than 1 cm over the entire globe. In some intensive database installs, rather than storing longitude and latitude separately as floating-point numbers, consider storing either the interleaved packed quadkey, or the individual 32-bit tile ids as your indexed value. The performance impact for Hadoop is probably not worth it, but for a database schema it may be.
# converts from even/odd state of tile x and tile y to quadkey. NOTE: bit order means y, x BIT_TO_QUADKEY = { [false, false] => "0", [false, true] => "1", [true, false] => "2", [true, true] => "3", } # converts from quadkey char to bits. NOTE: bit order means y, x QUADKEY_TO_BIT = { "0" => [0,0], "1" => [0,1], "2" => [1,0], "3" => [1,1]}
# Convert from tile x,y into a quadkey at a specified zoom level def tile_xy_zl_to_quadkey(tile_x, tile_y, zl) quadkey_chars = [] tx = tile_x.to_i ty = tile_y.to_i zl.times do quadkey_chars.push BIT_TO_QUADKEY[[ty.odd?, tx.odd?]] # bit order y,x tx >>= 1 ; ty >>= 1 end quadkey_chars.join.reverse end
# Convert a quadkey into tile x,y coordinates and level def quadkey_to_tile_xy_zl(quadkey) raise ArgumentError, "Quadkey must contain only the characters 0, 1, 2 or 3: #{quadkey}!" unless quadkey =~ /\A[0-3]*\z/ zl = quadkey.to_s.length tx = 0 ; ty = 0 quadkey.chars.each do |char| ybit, xbit = QUADKEY_TO_BIT[char] # bit order y, x tx = (tx << 1) + xbit ty = (ty << 1) + ybit end [tx, ty, zl] end
Though quadtile properties do vary, the variance is modest within most of the inhabited world:
The (ref table) gives the full coordinates at every zoom level for our exemplar set.
The smallest tile that fully encloses a set of points is given by the tile with the largest common quadtile prefix. For example, the University of Texas (quad 0231_3012_0331_1131
) and my office (quad 0231_3012_0331_1211
) are covered by the tile 0231_3012_0331_1
.
When points cross major tile boundaries, the result is less pretty. Austin’s airport (quad 0231301212221213
) shares only the zoom-level 8 tile 02313012
:
To find the distance between two points on the globe, we use the Haversine formula
in code:
# Return the haversine distance in meters between two points def haversine_distance(left, top, right, btm) delta_lng = (right - left).abs.to_radians delta_lat = (btm - top ).abs.to_radians top_rad = top.to_radians btm_rad = btm.to_radians
aa = (Math.sin(delta_lat / 2.0))**2 + Math.cos(top_rad) * Math.cos(btm_rad) * (Math.sin(delta_lng / 2.0))**2 cc = 2.0 * Math.atan2(Math.sqrt(aa), Math.sqrt(1.0 - aa)) cc * EARTH_RADIUS end
# Return the haversine midpoint in meters between two points def haversine_midpoint(left, top, right, btm) cos_btm = Math.cos(btm.to_radians) cos_top = Math.cos(top.to_radians) bearing_x = cos_btm * Math.cos((right - left).to_radians) bearing_y = cos_btm * Math.sin((right - left).to_radians) mid_lat = Math.atan2( (Math.sin(top.to_radians) + Math.sin(btm.to_radians)), (Math.sqrt((cos_top + bearing_x)**2 + bearing_y**2))) mid_lng = left.to_radians + Math.atan2(bearing_y, (cos_top + bearing_x)) [mid_lng.to_degrees, mid_lat.to_degrees] end
# From a given point, calculate the point directly north a specified distance def point_north(longitude, latitude, distance) north_lat = (latitude.to_radians + (distance.to_f / EARTH_RADIUS)).to_degrees [longitude, north_lat] end
# From a given point, calculate the change in degrees directly east a given distance def point_east(longitude, latitude, distance) radius = EARTH_RADIUS * Math.sin(((Math::PI / 2.0) - latitude.to_radians.abs)) east_lng = (longitude.to_radians + (distance.to_f / radius)).to_degrees [east_lng, latitude] end
(TODO: Section under construction)
This section will show how to
-
efficiently segment region polygons (county boundaries, watershed regions, etc) into grid cells
-
store data pertaining to such regions in a grid-cell form: for example, pivoting a population-by-county table into a population-of-each-overlapping-county record on each quadtile.