Let’s return to the first example from the book — identifying the geographic flavor of words using wikipedia — with actual code and more detail.

taking as a whole the terms that have a strong geographic flavor, we should largely see cultural terms (foods, sports, etc)

Terms like "beach" or "mountain" will clearly

Common words like "couch" or "hair"

Words like 'town' or 'street' will be

You don’t have to stop exploring when you find a new mystery, but no data exploration is complete until you uncover at least one.

Next, we’ll choose some exemplars: familiar records to trace through "Barbeque" should cover ;

The Wikipedia corpus is large, unruly — thirty million human-edited article

It’s also <remark>TODO verify</remark>200 million links

article → wordbag
join on page data to get geolocation
use pagelinks to get larger pool of implied geolocations
turn geolocations into quadtile keys
aggregate topics by quadtile
take summary statistics aggregated over term and quadkey
combine those statistics to identify terms that occur more frequently than the base rate would predict
explore and validate the results
filter to find strongly-flavored words, and other reductions of the data for visualization

Provide feedback

Saved searches