Let’s return to the first example from the book — identifying the geographic flavor of words using wikipedia — with actual code and more detail.
taking as a whole the terms that have a strong geographic flavor, we should largely see cultural terms (foods, sports, etc)
Terms like "beach" or "mountain" will clearly
Common words like "couch" or "hair"
Words like 'town' or 'street' will be
You don’t have to stop exploring when you find a new mystery, but no data exploration is complete until you uncover at least one.
Next, we’ll choose some exemplars: familiar records to trace through "Barbeque" should cover ;
The Wikipedia corpus is large, unruly — thirty million human-edited article
It’s also <remark>TODO verify</remark>200 million links
-
article → wordbag
-
join on page data to get geolocation
-
use pagelinks to get larger pool of implied geolocations
-
turn geolocations into quadtile keys
-
aggregate topics by quadtile
-
take summary statistics aggregated over term and quadkey
-
combine those statistics to identify terms that occur more frequently than the base rate would predict
-
explore and validate the results
-
filter to find strongly-flavored words, and other reductions of the data for visualization