Wikipedia is an amazing data set for all kinds of research that go far beyond text mining. The best thing about Wikipedia is that it is licensed under a Creative Commons license, so you are allowed to download it and reuse it in almost any way you want. The articles have almost no spelling mistakes and a clear structure with meaningful headings and subheadings. This makes Wikipedia a frequently used data set in computer science. No surprise, then, that I decided to download and examine Wikipedia. I first wanted to gain experience in natural language processing. Furthermore, I wanted to test some graph mining algorithms and obtain some statistics about my mother tongue, German.
Even though it is well documented how to download this great data set, there are some tiny obstacles that made me struggle once in a while. For the experienced data miner these challenges will probably be easy to master, but I still think it is worthwhile blogging about them.
Don’t crawl Wikipedia, please!
After reading Toby Segaran’s book “Programming Collective Intelligence” about two years ago, I wanted to build my first simple web crawler and download Wikipedia by crawling it. After installing Python and the Beautiful Soup library that Toby recommends, I realized that my script could not download any Wikipedia pages. I also didn’t get a meaningful error message that I could have typed into Google. After a moment of thinking I realized that Wikipedia might not be happy with too many unwanted crawlers, since crappy crawlers can produce a lot of load on the web servers. So I had a quick look at http://de.wikipedia.org/robots.txt and quickly realized that Wikipedia is not too happy about strangers crawling and downloading it.
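By the way, a crawler can check robots.txt by itself before requesting anything. Here is a minimal sketch using Python’s standard urllib.robotparser; the article URL and the user-agent string are just made up for the example:

```python
import urllib.robotparser

# Parse Wikipedia's robots.txt and ask whether a given crawler
# would be allowed to fetch an article page at all.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://de.wikipedia.org/robots.txt")
rp.read()

article = "https://de.wikipedia.org/wiki/Data-Mining"  # any article URL
print(rp.can_fetch("MyLittleCrawler/0.1", article))
```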
I had heard that database dumps of Wikipedia are available for download. So why not download a database dump, install a web server on my notebook and crawl my local copy of Wikipedia? That should be much faster anyway.
Before moving on to downloading a database dump, I tried to change my script so that it would send “better” HTTP headers to Wikipedia when requesting pages. Not because I wanted to keep crawling Wikipedia anyway; I just wanted to see whether I could trick them. But even after setting my user agent to Mozilla I was not able to download a single Wikipedia page with my Python script.
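For what it’s worth, this is roughly what the header trick looks like with Python’s standard urllib; the user-agent string is just an example, and again: please don’t use this to crawl Wikipedia.

```python
import urllib.request

# Request one article page with an explicit User-Agent header instead
# of the default "Python-urllib/x.y" that servers can easily filter out.
url = "https://de.wikipedia.org/wiki/Data-Mining"
req = urllib.request.Request(
    url, headers={"User-Agent": "Mozilla/5.0 (compatible; test-script)"}
)

with urllib.request.urlopen(req) as response:
    html = response.read()

print(len(html), "bytes downloaded")
```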
Wikipedia is huge!
Even though I went for the German Wikipedia, which isn’t even half the size of the English one, I ran into serious trouble due to the sheer amount of data. As we know, data mining usually is not hard because the algorithms are so difficult, but because the amount of data is so huge. I would consider Wikipedia a relatively small data set, but as stated above, it is big enough to cause problems.
After downloading the correct database dump, which was about 2 GB in size, I had to unpack it. Amazingly, no archive program was able to extract the 7.9 GB XML file that contains all current Wikipedia articles. I realized that switching to my Linux operating system might have been a better idea, so I put the file on my external hard drive and rebooted. Well, Linux didn’t work either: after exactly 4 GB the unpacking process would stop. Even though I am aware of the fact that 2^32 bytes = 4 GB, I was confused and asked Yann for advice. He immediately asked whether I was using Windows or Linux. I told him that I had just switched to Linux, but then it dawned on me as well: my external hard drive is formatted with FAT32, which cannot handle files bigger than 4 GB.
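In hindsight, if you only want to peek into the dump, you don’t even have to materialize the uncompressed 7.9 GB XML at all: Python’s bz2 module can stream through the compressed file directly. A minimal sketch, assuming the usual file name of the German dump:

```python
import bz2

# Stream through the compressed dump line by line without ever writing
# the uncompressed 7.9 GB XML to disk (which also sidesteps the FAT32
# 4 GB file size limit).
pages = 0
with bz2.open("dewiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        if "<page>" in line:
            pages += 1

print(pages, "pages found in the dump")
```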
After copying the compressed database dump to my Linux file system, the unpacking problem was solved. I installed MediaWiki on my local system in order to have all the necessary database tables. MediaWiki also comes with an import script. This script is PHP based and incredibly slow: about two articles per second get parsed and imported into the database. One million articles would therefore need about 138 hours, or more than five and a half days. Even for a small data set and experiment this is unacceptable. Fortunately there is also a Java tool called mwdumper, which can process about 70 articles per second. The first time I ran it, it crashed after 150’000 articles. I am still not sure what caused the crash, but I decided to tweak the MySQL settings in /etc/mysql/my.cnf a little. After assigning more memory to MySQL I started the import a second time, only to realize that it couldn’t continue importing into the half-filled tables. So I truncated all tables and restarted the whole import once more. Right now it is still running, but it has already imported 1’384’000 articles and my system still seems stable.
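For reference, the import boils down to piping mwdumper’s SQL output into MySQL. Here is a small Python sketch of that pipeline; the jar, dump, database and user names are placeholders, and the exact --format option should be taken from the mwdumper documentation for your MediaWiki version:

```python
import subprocess

# Pipe mwdumper's SQL output straight into MySQL -- the same thing the
# usual shell one-liner does:
#   java -jar mwdumper.jar --format=sql:1.5 dump.xml.bz2 | mysql -u root -p wikidb
dumper = subprocess.Popen(
    ["java", "-jar", "mwdumper.jar", "--format=sql:1.5",
     "dewiki-latest-pages-articles.xml.bz2"],
    stdout=subprocess.PIPE,
)
importer = subprocess.Popen(
    ["mysql", "-u", "root", "-p", "wikidb"],  # mysql prompts for the password
    stdin=dumper.stdout,
)
dumper.stdout.close()  # let mwdumper receive SIGPIPE if mysql exits early
importer.communicate()
dumper.wait()
```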
Summary: How to download Wikipedia in a nutshell
- Install some Linux on your computer. I recommend Ubuntu.
- Use the package manager to install MySQL, Apache and PHP.
- Install and set up MediaWiki (this can also be done via the package manager).
- Read http://en.wikipedia.org/wiki/Wikipedia:Database_download – unlike the German version, the file size issues are discussed within the article.
- Find the Wikipedia dump that you want to download on the above page (a small download sketch follows after this list).
- Use the Java program mwdumper and read its instructions.
- Install Java if you haven’t already (can be done with the package manager).
- Donate some money to the Wikimedia Foundation, or at least contribute to Wikipedia by correcting some articles.
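As mentioned in the list, the download of the dump file itself can also be scripted. A minimal sketch for the German pages-articles dump; the URL follows the usual pattern on dumps.wikimedia.org, so adjust language and file name to whatever you picked:

```python
import shutil
import urllib.request

# Stream the compressed dump to disk in 1 MB chunks instead of loading
# the whole ~2 GB file into memory.
url = ("https://dumps.wikimedia.org/dewiki/latest/"
       "dewiki-latest-pages-articles.xml.bz2")

with urllib.request.urlopen(url) as response, \
        open("dewiki-latest-pages-articles.xml.bz2", "wb") as out:
    shutil.copyfileobj(response, out, length=1024 * 1024)

print("download finished")
```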
So obviously I haven’t even started with the really interesting research on the German Wikipedia. But I thought it might already be interesting to share the experience so far with you. A nice but not surprising side effect, by the way, is that the local copy of Wikipedia is amazingly fast. After I have crawled and indexed my local Wikipedia and transferred the data into a format that I can work with more easily, I might write another article, and maybe I will even publish a dump of my data structures.