Public Data Sets on AWS

Wikipedia Page Traffic Statistics


Contains 7 months of hourly pageview statistics for all articles in Wikipedia

Submitted By: Peter N. Skomoroch  
US Snapshot ID (Linux/Unix): snap-753dfc1c
Size: 320 GB
Creation Date: 06/06/2009
Last Updated: 06/06/2009
License: GNU Free Documentation License 1.3
Source: Data Wrangling

Wikipedia Traffic Statistics Dataset

This dataset contains a 320 GB sample of the data used to power trendingtopics.org. It includes 7 months of hourly page traffic statistics for over 2.5 million Wikipedia articles (~1 TB uncompressed), along with the associated Wikipedia content, link graph, and metadata.

Compiled by Peter Skomoroch at Data Wrangling, LLC on May 31, 2009

To mount the snapshot:

	localmachine $ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1a
	localmachine $ ec2-attach-volume vol-ec123456 -i i-df123456 -d /dev/sdf
	root@domU-XX-XX-XX-XX-XX-XX:/mnt# mkdir /mnt/wikidata
	root@domU-XX-XX-XX-XX-XX-XX:/mnt# mount /dev/sdf /mnt/wikidata

Contents of the snapshot:

Like Wikipedia itself, all text content is licensed under the GNU Free Documentation License (GFDL). All statistics and link data are also licensed under the GFDL: http://www.gnu.org/copyleft/fdl.html

wikidata/wikistats (260G)

Contains hourly Wikipedia article traffic statistics covering the 7-month period from October 1, 2008 to April 30, 2009. This data is regularly logged from the Wikipedia squid proxy by Domas Mituzas.

Each log file is named with the date and time of collection, for example: pagecounts-20090430-230000.gz
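The collection time can be recovered from a file name with a few lines of Python. This is a minimal sketch, assuming the name always follows the pattern pagecounts-YYYYMMDD-HHMMSS.gz seen in the example; the helper name parse_log_timestamp is hypothetical, and the time zone is not stated in this document, so the result is a naive datetime.

```python
from datetime import datetime


def parse_log_timestamp(filename: str) -> datetime:
    """Return the collection time encoded in a pagecounts file name.

    Assumes the pattern pagecounts-YYYYMMDD-HHMMSS.gz (inferred from the
    single example in this document). The result carries no time zone.
    """
    # Drop any leading directory and the .gz suffix, if present.
    stem = filename.rsplit("/", 1)[-1]
    if stem.endswith(".gz"):
        stem = stem[:-3]

    parts = stem.split("-")
    if len(parts) != 3 or parts[0] != "pagecounts":
        raise ValueError(f"unexpected file name: {filename!r}")

    _, date_part, time_part = parts
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
```

Sorting file names by the parsed timestamp is one way to process the hourly logs in chronological order.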

Each line has 4 space-separated fields: projectcode, pagename, pageviews, bytes

	en Barack_Obama 997 123091092
	en Barack_Obama%27s_first_100_days 8 850127
	en Barack_Obama,_Jr 1 144103
	en Barack_Obama,_Sr. 37 938821
	en Barack_Obama_%22HOPE%22_poster 4 81005
	en Barack_Obama_%22Hope%22_poster 5 102081
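A line in this format can be split into typed fields as follows. This is a minimal sketch, assuming the fields are whitespace-separated and that page names are percent-encoded, as the sample lines suggest; the names PageCount, num_bytes, and parse_pagecount_line are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import unquote


class PageCount(NamedTuple):
    project: str    # projectcode, e.g. "en"
    page: str       # pagename with percent-escapes decoded
    views: int      # pageviews
    num_bytes: int  # the "bytes" field


def parse_pagecount_line(line: str) -> PageCount:
    """Parse one 'projectcode pagename pageviews bytes' record."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    project, page, views, num_bytes = fields
    # %27 becomes an apostrophe, %22 a double quote, and so on.
    return PageCount(project, unquote(page), int(views), int(num_bytes))


record = parse_pagecount_line("en Barack_Obama%27s_first_100_days 8 850127")
print(record.page, record.views)  # Barack_Obama's_first_100_days 8
```

Decoding the page name is optional. Keeping the raw, encoded form may be preferable when the name is used as a join key against other files in the snapshot.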

wikidata/wikilinks (1.1G)

Contains a Wikipedia link graph dataset provided by Henry Haselgrove.

These files contain all links between proper English-language Wikipedia pages, that is, pages in "namespace 0". This includes disambiguation pages and redirect pages.

In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is:

    from1: to11 to12 to13 ...
    from2: to21 to22 to23 ...
    ...

where from1 is an integer labelling a page that has links from it, and to11 to12 to13 ... are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt.
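The format described above can be read with a short Python sketch. It assumes that "the n-th line" is counted from 1 and that titles-sorted.txt is UTF-8 text; neither detail is stated here, so both should be checked against the README in the snapshot. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_links_line(line: str) -> Tuple[int, List[int]]:
    """Parse 'from: to1 to2 ...' into (from_id, [to_ids])."""
    source, sep, targets = line.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in line: {line!r}")
    return int(source), [int(t) for t in targets.split()]


def load_titles(path: str) -> List[str]:
    """Read titles-sorted.txt into a list, one title per line.

    Assumes UTF-8 encoding. The whole file is held in memory.
    """
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def title_for(titles: List[str], n: int) -> str:
    """Return the title for page id n, assuming ids are 1-based line numbers."""
    if n < 1:
        raise IndexError(f"page id must be >= 1, got {n}")
    return titles[n - 1]
```

Loading every title into a list trades memory for constant-time lookups. For occasional lookups, scanning the file line by line would avoid holding it all in memory.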

wikidata/wikidump (29G)

Contains raw Wikipedia dumps, along with some processed versions, using data from: http://en.wikipedia.org/wiki/Wikipedia_database

See the README files in the corresponding subdirectories for more details.

- Data Wrangling -

