How to download Wikipedia


So you're looking for some dummy data? Well, how about downloading Wikipedia?!


There are over 2 million pages on Wikipedia. Don't try to crawl the site; it won't let you. No robots allowed!


Go to http://download.wikipedia.org and you'll see a list of all the databases. If you're looking for the English one, it's "enwiki". There's a whole bunch of stuff you can download, but the file you generally want is "pages-articles.xml.bz2". It contains the current versions of article content, and it's the archive most mirror sites will probably want. The latest version at the time of writing is 1.7GB.
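Once the file is on disk you probably don't want to decompress the whole thing just to peek inside. Here is a minimal Python sketch that streams pages straight out of the .bz2 archive. It assumes the dump follows the usual MediaWiki export layout (a root element containing <page> elements, each with a <title> and a <revision>/<text>); the function names iter_pages and local_name are made up for this example, so treat it as a starting point, not the official way to read a dump.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) for each <page> in a bz2-compressed XML dump.

    Assumes each <page> holds one <title> and one <revision>/<text>,
    which is what a current-versions-only dump should look like.
    """
    # bz2.open decompresses on the fly, so the archive is never fully
    # unpacked to disk or loaded into memory at once.
    with bz2.open(dump_path, "rb") as stream:
        title, text = "", ""
        # iterparse reports each element as soon as its closing tag is read.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text or ""
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = "", ""
                # Drop the finished page's children to keep memory use low.
                elem.clear()
```

To try it, point it at wherever you saved the file, for example: for title, text in iter_pages("pages-articles.xml.bz2"): print(title). That path is just a placeholder for your own download location.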


Now you can run some decent content through your search engine or proof-of-concept application!


 

Comments (4)
  1. Thank you for highlighting this, this is so cool!

  2. bradsmith007 says:

    Not a problem Hannes 🙂

  3. Very cool. Don’t forget you can use DataDude to generate data too… I believe it’s now out as CTP 7

  4. Update: DataDude is now RTM 1.0 🙂

