Graph propogation

Since the connections among pages are robot-legible, links within topics can be read to imply a geolocation — the movie "Dazed and Confused" (which took place in austin) and the artist Janis Joplin (who got her start in Austin) can be identifiably paired with the loose geolocation of Austin, TX.

Basic pagerank

link:page_counts.pig[role=include]

Exercises

If you run the server_logs/page_page_edges script on the server_logs/star_wars_kid dataset, you’ll get a graph connecting pages to pages, weighted by the number of times a user who visited one also visited the other.

Before you dive in, browse through the data a bit and think about what you expect:

How should the graph characteristics of an HTML page differ from those of a .wmv (video) file?
What pages should be the most prominent?
What types of pages should have the highest clustering coefficient?
By looking through the data, or by reasoning from what you know, which pages should have high similarity?

Now, answer those questions:

Calculate the degree distribution, etc of the graph, broken down by file type.
Find the pagerank. How well does it agree with the degree distribution?
Make a table listing each file on the site along with its intrinsic and graph features: file type, size, visit count, clustering coefficient, degree and pagerank.
Run a clustering algorithm on the page co-visit graph.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

15c-pagerank.asciidoc

15c-pagerank.asciidoc

Graph propogation

Basic pagerank

Exercises

Files

15c-pagerank.asciidoc

Latest commit

History

15c-pagerank.asciidoc

File metadata and controls

Graph propogation

Basic pagerank

Exercises