Data sets

Extracting 2 social network graphs from the Democratic National Committee Email Corpus on Wikileaks
https://www.rene-pickhardt.de/extracting-2-social-network-graphs-from-the-democratic-national-committee-email-corpus-on-wikileaks/ (Thu, 28 Jul 2016)
tl;dr version: source code at github!
A couple of days ago a data set was released on Wikileaks consisting of about 23 thousand emails sent within the Democratic National Committee, which are supposed to demonstrate how the DNC was actively trying to prevent Bernie Sanders from becoming the Democratic candidate for the general election. I am interested in who the people with a lot of influence are, so I decided to have a closer look at the data.
Yesterday I crawled the data set and processed it. I extracted two graphs in the Konect format. Since I am not sure whether I am legally allowed to publish the processed data sets, I will only link to the source code so you can generate them yourself; if you don't know how to run the code but need the data, drop me a mail. I also hope that Jérôme Kunegis will do an analysis of the networks and include them in Konect.

First we have the temporal graph

This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another person, together with a timestamp of when this happened. If a person puts n recipients in CC, n edges are added to the graph.

rpickhardt$ wc -l temporalGraph.tsv
39338 temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org DavisM@dnc.org 1 17 May 2016 19:51:22
ShapiroA@dnc.org KaplanJ@dnc.org 1 4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com EMail-Vetting_D@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com LykinsT@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com ReifE@dnc.org 1 13 May 2016 21:27:16

Clearly the format is: sender TAB receiver TAB 1 TAB date
The data is currently not sorted by the fourth column, but this can easily be done. Clearly an email network is directed and can have multi-edges.
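
To make the extraction step a bit more concrete, here is a minimal Python sketch (standard library only) of how such edges could be pulled from a directory of raw .eml files. The directory name is just a placeholder; the actual processing code linked in the Code section below builds on a more complete email library and handles many more corner cases.

import glob
import os
from email import policy
from email.parser import BytesParser
from email.utils import getaddresses, parsedate_to_datetime

def extract_edges(maildir):
    """Yield one (sender, recipient, timestamp) triple per To/CC address."""
    for path in glob.glob(os.path.join(maildir, "*.eml")):
        with open(path, "rb") as f:
            msg = BytesParser(policy=policy.default).parse(f)
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        date = msg.get("Date")
        if not senders or not date:
            continue
        sender = senders[0][1]                       # first From address
        when = parsedate_to_datetime(date)
        for _, addr in recipients:                   # one edge per recipient
            if addr:
                yield sender, addr, when

if __name__ == "__main__":
    for s, r, t in extract_edges("dnc-emails"):      # placeholder directory
        print(f"{s}\t{r}\t1\t{t:%d %b %Y %H:%M:%S}")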

Second we have the weighted co-recipient network

Looking at the data I discovered that many mails have more than one recipient, so I thought it would be nice to look at the social network structure by counting how often two people occur together in the recipient list of an email. This can reveal a lot about the social network structure of the DNC.

rpickhardt$ wc -l weightedCCGraph.tsv
20864 weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org MirandaL@dnc.org 848
MirandaL@dnc.org PaustenbachM@dnc.org 848
WalkerE@dnc.org PaustenbachM@dnc.org 624
PaustenbachM@dnc.org WalkerE@dnc.org 624
WalkerE@dnc.org MirandaL@dnc.org 596

Clearly the format is: recipient1 TAB recipient2 TAB count
where count counts how often recipient1 and recipient2 have appeared together in the recipient list of a mail.
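
The counting step itself is straightforward. Here is a small sketch, assuming the per-mail recipient lists have already been extracted as in the snippet above; it writes both orderings of each pair, just like the head of the file shown here.

from collections import Counter
from itertools import combinations

def co_recipient_counts(recipient_lists):
    """Count how often two addresses appear together in one recipient list."""
    counts = Counter()
    for recipients in recipient_lists:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[(a, b)] += 1
    return counts

def write_weighted_graph(counts, path="weightedCCGraph.tsv"):
    with open(path, "w") as out:
        for (a, b), n in counts.most_common():
            out.write(f"{a}\t{b}\t{n}\n")    # both directions, as in the
            out.write(f"{b}\t{a}\t{n}\n")    # head of the file above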
 

Simple statistics

There have been

  • 1226 senders
  • 1384 recipients
  • 2030 people

included in the mails. The top 7 senders are:

MirandaL@dnc.org 1482
ComerS@dnc.org 1449
ParrishD@dnc.org 750
DNCPress@dnc.org 745
PaustenbachM@dnc.org 608
KaplanJ@dnc.org 600
ManriquezP@dnc.org 567

And the top 7 receivers are:

MirandaL@dnc.org 2951
Comm_D@dnc.org 2439
ComerS@dnc.org 1841
PaustenbachM@dnc.org 1550
KaplanJ@dnc.org 1457
WalkerE@dnc.org 1110
kaplanj@dnc.org 987

As you can see, kaplanj@dnc.org and KaplanJ@dnc.org both occur in the data set, so, as I mention in the Roadmap section at the end of the article, more clean-up of the data might be necessary to get a more precise picture.
Still, at first glimpse the data looks pretty natural. In the following I provide a diagram showing the rank frequency plot of senders and receivers. One can see that some people are far more active than others. Also, the recipient curve lies above the sender curve, which makes sense since every mail has exactly one sender but at least one recipient.

You can also see the rank / co-occurrence count diagram of the co-recipient network. When the ranks go above 2000, the usual network structure picture changes a little bit. I have no plausible explanation for this; maybe it is due to the fact that the data dump is not complete. Still, the data looks pretty natural to me, so further investigation might make sense.
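
For reference, a rank frequency plot like the one for senders and recipients can be produced along the following lines (the co-occurrence version works the same way on the third column of weightedCCGraph.tsv); matplotlib is assumed to be available.

from collections import Counter
import matplotlib.pyplot as plt

senders, recipients = Counter(), Counter()
with open("temporalGraph.tsv") as f:
    for line in f:
        sender, recipient = line.split("\t")[:2]
        senders[sender] += 1
        recipients[recipient] += 1

for counter, label in [(senders, "senders"), (recipients, "recipients")]:
    freqs = sorted(counter.values(), reverse=True)   # mail count by rank
    plt.loglog(range(1, len(freqs) + 1), freqs, label=label)

plt.xlabel("rank")
plt.ylabel("number of mails")
plt.legend()
plt.savefig("rank-frequency.png")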

Code

The crawler code is a two-liner: just some wget and sleep magic.
The Python code for processing the mails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on github.

Roadmap

  • Use the Generalized Language Model Toolkit to build Language Models on the data
  • Compare with the social graph from Twitter – many email addresses, or at least names, will be linked to Twitter accounts. Comparing the Twitter network with the email network might reveal the differences between internal and external communication
  • Improve the quality of the data, i.e. do a better clean-up. Sometimes people in the recipient list have more than one email address; currently these are treated as two different people. On the other hand, sometimes email addresses are missing and only names are included; these could probably be inferred from the other mail addresses. Also, names in this case serve as unique identifiers, so if two different people are called 'Bob' they become one person in the data set. A very small first step of this clean-up is sketched right after this list.
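
The sketch below only normalizes the raw address strings, which already merges cases such as kaplanj@dnc.org and KaplanJ@dnc.org. Merging several addresses that belong to the same person, or resolving bare names, needs real entity resolution and is left for later.

def normalize_address(raw):
    """Lower-case an address and strip an optional display name.

    'Some Name <KaplanJ@dnc.org>' and 'kaplanj@dnc.org' collapse to the same
    string; different addresses of one person are NOT merged here.
    """
    addr = raw.strip()
    if "<" in addr and addr.endswith(">"):
        addr = addr[addr.rindex("<") + 1:-1]
    return addr.lower()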
Analyzing the final and intermediate results of the iversity MOOC Fellowship online voting
https://www.rene-pickhardt.de/analyzing-the-final-and-intermediate-results-of-the-iversity-mooc-fellowship-online-voting/ (Thu, 23 May 2013)
As written before, Steffen and I participated in the online voting for the MOOC fellowship. Today the competition finished and I would like to say thank you to everyone who participated in the voting, in particular to the 435 people supporting our course. I never imagined that so many people would be interested in our course!
The voting period ran from May 1st until today. During this period the user interface of the iversity website changed several times, providing us users with different kinds of information about the voting. Since I observed a drastic change in the rankings on May 9th, and since the process and scores have not been very transparent, I decided on that very day to collect some data about the rankings. I have already done some quick analysis of the data and found some interesting facts, but I am running out of time right now to conduct an extensive data analysis. So I will share the data set and release it into the public domain:
http://rene-pickhardt.de/mooc.tar.bz2 (33MB)
If you download the archive and extract it you will find a folder for every hour after May 9th. In every folder you will find 26 html files representing the ranking of the courses at that time, together with a transaction log of the http requests that were made to download those 26 html files. There are 26 html files because 10 courses were displayed per page and 255 courses participated.
During the time of data collection my web server had 2 or 3 short downtimes, so it is possible that some data points are missing.
I have also written a "dirty hack" and pushed it to github; it extracts the interesting information from the downloaded html files:

  1. There is a file rank.tsv (334 kb) that contains the hourly ranking of every course.
  2. There is a file vote.tsv (113 kb) that contains, for every course and on an hourly basis (between May 20th and today), the number of votes the course had acquired. The period covered by vote.tsv is so short because the vote counts were only exposed in the html files during this time. A small sketch for plotting these files follows right after this list.
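
As a starting point, here is a rough sketch of how the rank of a single course could be plotted over time from rank.tsv. The assumed column layout (course, timestamp, rank per row) and the course key are guesses, so check the file produced by the script first.

import csv
from collections import defaultdict
import matplotlib.pyplot as plt

# assumed layout: course <TAB> timestamp <TAB> rank -- adapt to the real file
series = defaultdict(list)
with open("rank.tsv") as f:
    for course, timestamp, rank in csv.reader(f, delimiter="\t"):
        series[course].append((timestamp, int(rank)))

course = "web science"                      # hypothetical course key
ranks = [r for _, r in sorted(series[course])]
plt.plot(ranks)
plt.gca().invert_yaxis()                    # rank 1 at the top
plt.xlabel("hours since May 9th")
plt.ylabel("rank")
plt.savefig("course-rank.png")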

Skimming the data with my eyes there are already some facts that make me very curious for a deeper data analysis:

  1. Some courses gained several hundred votes within a short period of time (usually only 2 or 3 hours), even though most courses (especially those showing such sudden gains) otherwise stayed far below 1000 votes overall.
  2. It is also interesting to see how much variation there has been over the last couple of days.
  3. I have not crawled the view counts of the courses' YouTube videos, and even now, after noticing the following, I did not take a snapshot of them. Still, it is interesting that there is such a large difference in conversion rate: the top courses seem to have far more votes than views of their application video, whereas some really high-class and outstanding applications, like the ones by Christian Spannagel (math) or Oliver Vornberger (algorithms and data structures), have two or three times as many views on YouTube as votes, and about the same number of YouTube views as the top-voted courses.

I am pretty sure there are some more interesting facts, and maybe someone else has collected a better data set over the complete period of time, including YouTube snapshots as well as Facebook and Twitter mentions.
Since I have been asked several times already: here are the final rankings to download and also as a table in the blog post:

  Rank Course name Number of votes
1 sectio chirurgica anatomie interaktiv 8013
2 internationales agrarmanagement 2 7557
3 ingenieurmathematik fur jedermann 2669
4 harry potter and issues in international politics 2510
5 online surgery 2365
6 l3t s mooc der offene online kurs uber das lernen und lehren mit technologien 2270
7 design 101 or design basics 2 2216
8 einfuhrung in das sozial und gesundheitswesen sozialraume entdecken und entwickeln 2124
9 changeprojekte planen nachhaltige entwicklung durch social entrepreneurship 2083
10 social work open online course swooc14 2059
11 understanding sustainability environmental problems collective action and institutions 1912
12 the dance of functional programming languaging with haskell and python 1730
13 zyklenbasierte grundung systematische entwicklung von geschaftskonzepten 1698
14 a virtual living lab course for sustainable housing and lifestyle 1682
15 family politics domestic life revolution and dictatorships between 1900 1950 1476
16 h2o extrem 1307
17 dark matter in galaxies the last mystery 1261
18 algorithmen und datenstrukturen 1207
19 psychology of judgment and decision making 1168
20 the future of storytelling 1164
21 web engineering 1152
22 die autoritat der wissenschaften eine einfuhrung in das wissenschaftstheoretische denken 2 1143
23 magic and logic of music a comprehensive course on the foundations of music and its place in life 1138
24 nmooc nachhaltigkeit fur alle 1130
25 sovereign bond pricing 1115
26 soziale arbeit eine einfuhrung 1034
27 mathematische denk und arbeitsweisen in geometrie und arithmetik 1016
28 social entrepreneurship wir machen gesellschaftlichen wandel moglich 1010
29 molecular gastronomy an experimental lecture about food food processing and a bit of physiology 984
30 fundamentals of remote sensing for earth observation 920
31 kompetenzkurs ernahrungswissenschaft 891
32 erfolgreich studieren 879
33 deciphering ancient texts in the digital age 868
34 qualitative methods 861
35 karl der grosse pater europae 855
36 who am i mind consciousness and body between science and philosophy 837
37 programmieren mit java 835
38 systemisches projektmanagement 811
39 lernen ist sexy 764
40 modelling and simulation using matlab one mooc more brains an interdisciplinary course not just for experts 760
41 suchmaschinen verstehen 712
42 hands on course on embedded computing systems with raspberry pi 679
43 introduction to mixed methods and doing research online 676
44 game ai 649
45 game theory and experimental economic research 633
46 cooperative innovation 613
47 blue engineering ingenieurinnen und ingenieure mit sozialer und okologischer verantwortung 612
48 my car the unkown technical being 612
49 gesundheit ein besonderes gut eine multidisziplinare erkundung des deutschen gesundheitssystems 608
50 teaching english as a foreign language tefl part i pronunciation 597
51 wie kann lesen gelernt gelehrt und gefordert werden lesesozialisation lesedidaktik und leseforderung vom grundschulunterricht bis zur erwachsenenbildung 593
52 the european dream 576
53 education of the present what is the future of education 570
54 faszination kristalle und symmetrie 561
55 italy today a girlfriend in a coma a walk through today s italy 557
56 dna from structure to therapy 556
57 grundlagen der mensch computer interaktion 549
58 malnutrition in developing countries 548
59 marketing als strategischer erfolgsfaktor von der produktinnovation bis zur kundenbindung 540
60 environmental ethics for scientists 540
61 stem cells in biology and medicine 528
62 praxiswissen fur den kunstlerischen alltagsdschungel 509
63 physikvision 506
64 high five evidence based practice 505
65 future climate water 484
66 diversity and communication challenges for integration and mobility 477
67 social entrepreneurship 469
68 die kunst des argumentierens 466
69 der hont feat mit dem farat wek wie kinder schreiben und lesen lernen 455
70 antikrastination moocen gegen chronisches aufschieben 454
71 exercise for a healthier life 454
72 the startup source code 438
73 web science 435
74 medizinische immunologie 433
75 governance in and through human rights 431
76 europe in the world law and policy aspects of the eu in global governance 419
77 komplexe welt strukturen selbstorganisation und chaos 419
78 mooc basics of surgery want to become a real surgeon 416
79 statistical data analysis for the humanities 414
80 business math r edux 406
81 analyzing behavioral dynamics non linear approaches to social and cognitive sciences 402
82 space technology 397
83 der erzahler materialitat und virtualitat vom mittelalter bis zur gegenwart 396
84 kriminologie 395
85 von e mail skype und xing kommunikation fuhrung und berufliche zusammenarbeit im netz 394
86 wissenschaft erzahlen das phanomen der grenze 392
87 nachhaltige entwicklung 389
88 die nachste gesellschaft gesellschaft unter bedingungen der elektrizitat des computers und des internets 388
89 die grundrechte 376
90 medienbildung und mediendidaktik grundbegriffe und praxis 368
91 bubbles everywhere speculative bubbles in financial markets and in everyday life 364
92 the heart of creativity 363
93 physik und weltraum 358
94 sim suchmaschinenimplementierung als mooc 354
95 order of magnitude physics from atomic nuclei to the universe 350
96 entwurfsmethodik eingebetteter systeme 343
97 monte carlo methods in finance 335
98 texte professionell mit latex erstellen 331
99 wissenschaftlich arbeiten wissenschaftlich schreiben 330
100 e x cite join the game of social research 330
101 forschungsmethoden 323
102 complex problem solving 321
103 programmieren lernen mit effekt 317
104 molecular devices and machines 317
105 wie man erfolgreich ein startup aufbaut 315
106 grundlagen der prozeduralen und objektorientierten programmierung 314
107 introduction to disability studies 314
108 eu2c the european union explained by two partners cologne and cife 313
109 the english language a linguistic introduction 2 311
110 allgemeine betriebswirtschaftslehre 293
111 interaction design open design 293
112 how we learn nowadays possibilities and difficulties 288
113 foundations of educational technology 288
114 projektmanagement und designbasiertes lernen 281
115 human rights 278
116 kompetenz des horens technische gehorbildung 278
117 it infrastructure management 276
118 a media history in 10 artefacts 274
119 introduction to the practice of statistics and regression 271
120 what is a good society introduction to social philosophy 268
121 modellierungsmethoden in der wirtschaftsinformatik 265
122 objektorientierte programmierung von web anwendungen von anfang an 262
123 intercultural diversity networking vielfalt interkulturell vernetzen 260
124 foundations of entrepreneurship 259
125 business communication for impact and results 257
126 gamification 257
127 creativity and design in innovation management 256
128 mechanik i 252
129 global virtual project management 252
130 digital signal processing for everyone 249
131 kompetenzen fur klimaschutz anpassung 248
132 digital economy and social innovation 246
133 synthetic biology 245
134 english phonetics and phonology 245
135 leibspeisen nahrung im wandel der zeiten molekule brot kase fleisch schokolade und andere lebensmittel 243
136 critical decision making in the contemporary globalized world 238
137 einfuhrung in die allgemeine betriebswirtschaftslehre schwerpunkt organisation personalmanagement und unternehmensfuhrung 236
138 didaktisches design 235
139 an invitation to complex analysis 235
140 grundlagen der programmierung teil 1 234
141 allgemein und viszeralchirurgie 233
142 mathematik 1 fur ingenieure 231
143 consumption and identity you are what you buy 231
144 vampire fictions 230
145 grundlagen der anasthesiologie 228
146 marketing strategy and brand management 227
147 political economy an introduction 225
148 gesundheit 221
149 object oriented databases 219
150 lebenswelten perspektiven fur menschen mit demenz 217
151 applications of graphs to real life problems 210
152 introduction to epidemiology epimooc 207
153 network security 207
154 global civics 207
155 wissenschaftliches arbeiten 204
156 annaherungen an zukunfte wie lassen sich mogliche wahrscheinliche und wunschbare zukunfte bestimmen 202
157 einstieg wissenschaft 200
158 engineering english 199
159 das erklaren erklaren wie infografik klart erklart und wissen vermittelt 198
160 betriebswirtschaftliche und rechtliche grundlagen fur das nonprofit management 192
161 art and mathematics 191
162 vom phanomen zum modell mathematische modellierung von natur und alltag an ausgewahlten beispielen 190
163 design interaktiver medien technische grundlagen 189
164 business englisch 187
165 erziehung sehen analysieren gestalten 184
166 basic clinical research methods 184
167 ordinary differential equations and laplace transforms 180
168 mathematische logik 179
169 die geburt der materie in der evolution des universums 179
170 innovationsmanagement von kleinen und mittelstandischen unternehmen kmu 176
171 introduction to qualitative methods in the social sciences 175
172 advert retard wirkung industrieller interessen auf rationale arzneimitteltherapie 175
173 animation beyond the bouncing ball 174
174 entropie einfuhrung in die physikalische chemie 172
175 edufutur education for a sustainable future 165
176 social network effects on everyday life 164
177 pharmaskills for africa 163
178 nachhaltige energiewirtschaft 162
179 qualitat in der fruhpadagogik auf den anfang kommt es an 158
180 dementias 157
181 beyond armed confrontation multidisciplinary approaches and challenges from colombia s conflict 154
182 investition und finanzierung 150
183 praxis des wissensmanagements 149
184 gutenberg to google the social construction of the communciations revolution 145
185 value innovation and blue oceans 145
186 kontrapunkt 144
187 shakespeare s politics 142
188 jetzt erst recht wissen schaffen uber recht 141
189 rechtliche probleme von sozialen netzwerken 138
190 augmented tuesday suppers 137
191 positive padagogik 137
192 digital storytelling mit bewegenden bildern erzahlen 136
193 wirtschaftsethik 134
194 energieeffizientes bauen 134
195 advising startups 133
196 urban design and communication 133
197 bildungsreform 2 0 132
198 mooc management basics 130
199 healthy teeth a life long course of preventive dentistry 129
200 digitales tourismus marketing 127
201 the arctic game the struggle for control over the melting ice 127
202 disease mechanisms 127
203 special operations from raids to drones 125
204 introduction to geospatial technology 120
205 social media marketing strategy smms 119
206 korpusbasierte analyse sprechsprachlichen problemlosungsverhaltens 116
207 introduction to marketing 115
208 creative coding 114
209 mooc meets 3d 110
210 unternehmenswert die einzig sinnvolle spitzenkennzahl fur unternehmen 110
211 forming behaviour gestaltung und konzeption von web applications 109
212 technology demonstration 108
213 lebensmittelmikrobiologie und hygiene 105
214 estudi erfolgreich studieren mit dem internet 105
215 moderne geldtheorie eine paische perspektive 103
216 kollektive intelligenz 103
217 geschichte der optischen medien 100
218 alter und soziale arbeit 99
219 semantik eine theorie visueller kommunikation 97
220 erziehung und beratung in familie und schule 96
221 foreign language learning in indian context 95
222 bildgebende verfahren 92
223 applied biology 92
224 bildung in der wissensgesellschaft gerechtigkeit 92
225 standortmanagement 92
226 europe a solution from history 90
227 methodology of research in international law 90
228 when african americans came to paris 90
229 contemporary architecture 89
230 past recent encounters turkey and germany 88
231 wars to end all wars 83
232 online learning management systems 82
233 software applications 81
234 business in germany 78
235 requirements engineering 77
236 anything relationship management xrm 77
237 global standards and local practices 76
238 prodima professionalisation of disaster medicine and management 75
239 cytology with a virtual correlative light and electron microscope 75
240 the organisation of innovation 75
241 sensors for all 75
242 diagnostik in der beruflichen bildung 73
243 scientific working 71
244 escience saxony lectures 71
245 internet marketing strategy how to gain influence and spread your message online 69
246 grundlagen des e business 69
247 principles of public health 64
248 methods for shear wave velocity measurements in urban areas 64
249 democracy in america 64
250 building typology studies gebaudelehre 63
251 multi media based learning environments at the interface of science and practice hamburg university of applied sciences prof dr andrea berger klein 61
252 math mooc challenge 60
253 the value of the social 58
254 dienstleistungsmanagement und informationssysteme 57
255 ict integration in education systems e readiness e integration e transformation 56
Download Google n gram data set and neo4j source code for storing it
https://www.rene-pickhardt.de/download-google-n-gram-data-set-and-neo4j-source-code-for-storing-it/ (Sun, 27 Nov 2011)
At the end of September I discovered an amazing data set which is provided by Google! It is called the Google n gram data set. Even though the English Wikipedia article about n-grams needs some clean-up, it explains nicely what an n-gram is:
http://en.wikipedia.org/wiki/N-gram
The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.
This data set is very well described on the official google n gram page which I also include as an iframe directly here on my blog.

So let me rather talk about some possible applications of this source of pure gold:
I forwarded this data set to two high school students whom I taught last summer at the DSA. They are now working on a project for a German student competition, using the n-grams and neo4j to predict sentences and to help people type faster.
The idea is that once a user has started to type a sentence, the n-gram statistics can be used to predict, in a semantically and syntactically correct way, what the next word will be, and in this way increase the typing speed by making suggestions to the user. This will be particularly useful on all these mobile devices where typing is really annoying.
You can find some source code of the newer version at: https://github.com/renepickhardt/typology/tree/develop
Note that this is just a primitive algorithm that processes the n-grams and stores the information in a neo4j graph database. Interestingly, it can already produce decent recommendations, and it uses less storage space than the n-grams data set, since the graph format is much more natural (and also because we did not store all of the data contained in the n-grams in neo4j, e.g. n-grams of different years have been aggregated).
From what I know, the roadmap is now very clear: normalize the weights, use a weighted sum of all the different kinds of n-grams for prediction, and use supervised machine learning to learn those weights. As training data, corpora from different domains could be used (e.g. the Wikipedia corpus as a general-purpose corpus, or a corpus from a certain domain for a special purpose).
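
To make the weighted-sum idea a bit more concrete, here is a small sketch of linearly interpolated next-word prediction over plain n-gram counts. The interpolation weights are exactly the quantities that would be normalized and then learned; the neo4j storage layer is left out entirely, and the example weights are arbitrary.

from collections import Counter, defaultdict

class InterpolatedNgramPredictor:
    def __init__(self, max_order=3):
        self.max_order = max_order
        # counts[n][context] is a Counter of words seen after that (n-1)-word context
        self.counts = defaultdict(lambda: defaultdict(Counter))

    def train(self, tokens):
        for n in range(1, self.max_order + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                self.counts[n][context][tokens[i + n - 1]] += 1

    def predict(self, history, weights=(0.2, 0.3, 0.5), topk=3):
        """Score candidates by a weighted sum of relative n-gram frequencies."""
        scores = Counter()
        for n, weight in zip(range(1, self.max_order + 1), weights):
            context = tuple(history[-(n - 1):]) if n > 1 else ()
            nexts = self.counts[n].get(context)
            if not nexts:
                continue
            total = sum(nexts.values())
            for word, count in nexts.items():
                scores[word] += weight * count / total
        return scores.most_common(topk)

p = InterpolatedNgramPredictor()
p.train("the quick brown fox jumps over the lazy dog".split())
print(p.predict("over the".split()))    # 'lazy' should come out on top

Higher-order n-grams that actually match the typed history dominate the score, while the lower orders act as a fallback when the longer contexts have not been seen.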
If you have any suggestions regarding the students' work and their approach of using graph databases and neo4j to process and store n-grams as well as to predict sentences, feel free to join the discussion right here!

Download network graph data sets from Konect – the Koblenz Network Collection
https://www.rene-pickhardt.de/download-network-graph-data-sets-from-konect-the-koblenz-network-colection/ (Mon, 26 Sep 2011)
UPDATE: now with a link to the PhD thesis. At the time of blogging the thesis was not yet published. Thanks to Patrick Durusau for pointing out the missing link.
One of the first things I did at my institute when starting my PhD program was to read the PhD thesis of Jérôme Kunegis. For a mathematician, a nice piece of work to read. For his thesis he analyzed the evolution of networks. Over the last years Jérôme has collected several (119!) data sets with network graphs, all with different properties.
He provides the data sets and some basic statistics at http://konect.uni-koblenz.de/networks
Sometimes the edges are directed, sometimes they have timestamps, sometimes even content. Some graphs are bipartite, and the graphs come from different application domains such as trust, social networks, web graphs, co-citation, semantics, features, ratings and communication.
Jérôme has also calculated metrics such as degree distributions and eigenvalue distributions for all these data sets and provides links to the sources. In most cases he also states the paper for which a data set was created.
All in all, the Koblenz Network Collection is a very nice resource for everyone who is doing a PhD and wants to analyze graphs, do data mining, or just needs some data sets for evaluation and testing! I am excited to see how this site will evolve over time and want to thank Jérôme for making this data available!

Download TREC (= Text Retrieval Conference) Data Set
https://www.rene-pickhardt.de/download-trec-text-retrieval-conference-data-set/ (Sun, 04 Sep 2011)
Being back at university I get to see more and more data sets. Originally I wanted to use the data sets category of my blog to provide an unordered list of these publicly available data sets, sort of as a personal reminder. For some reason I never really did that, but I am now about to change that.
To start with, I have nothing special and something you all probably already know: the TREC data set. The Text Retrieval Conference makes many data sets publicly available so that people can study and run tests on them.
You can find the data sets at: http://trec.nist.gov/data.html

How to download Wikipedia
https://www.rene-pickhardt.de/how-to-download-wikipedia/ (Wed, 16 Feb 2011)
Wikipedia is an amazing data set for all kinds of research going far beyond text mining. The best thing about Wikipedia is that it is licensed under a Creative Commons license, so you are allowed to download Wikipedia and use it in any way you want. The articles have almost no spelling mistakes and a great structure with meaningful headings and subheadings. This makes Wikipedia a frequently used data set in computer science. No surprise that I decided to download and examine it. I first wanted to gain experience in natural language processing; furthermore I wanted to test some graph mining algorithms and obtain some statistics about my mother tongue, German.
Even though it is very well documented how to download this great data set, there are some tiny obstacles that made me struggle once in a while. For the experienced data miner these challenges will probably be easy to master, but I still think it is worthwhile blogging about them.
Don’t crawl Wikipedia please!
After reading Toby Segaran's book "Programming Collective Intelligence" about 2 years ago I wanted to build my first simple web crawler and download Wikipedia by crawling it. After installing Python and the Beautiful Soup library recommended by Toby, I realized that my script could not download Wikipedia pages. I also didn't get any meaningful error message that I could have typed into Google. After a moment of thinking I realized that Wikipedia might not be happy with too many unwanted crawlers, since crappy crawlers can produce a lot of load on the web servers. So I had a quick look at http://de.wikipedia.org/robots.txt and quickly realized that Wikipedia is not too happy with strangers crawling and downloading it.
I had once heard that database dumps of Wikipedia are available for download. So why not download a database dump, install a web server on my notebook and crawl my local version of Wikipedia? This should be much faster anyway.
Before going over to downloading a database dump, I tried to change my script to send "better" http headers to Wikipedia when requesting pages. That was not because I wanted to go on crawling Wikipedia anyway; I just wanted to see whether I would be able to trick them. Even though I set my user agent to Mozilla, I was not able to download a single Wikipedia page with my Python script.

Wikipedia is huge!

Even though I went for the German Wikipedia, which doesn't even have half the size of the English one, I ran into serious trouble due to the huge amount of data. As we know, data mining usually is not complex because the algorithms are so difficult, but rather because the amount of data is so huge. I would consider Wikipedia a relatively small data set, but as stated above, it is big enough to cause problems.
After downloading the correct database dump, which was about 2 GB in size, I had to unpack it. Amazingly, no zip program was able to extract the 7.9 GB XML file that contains all current Wikipedia articles. I realized that switching to my Linux operating system might have been a better idea, so I put the file on my external hard drive and rebooted my system. Well, Linux didn't work either: after exactly 4 GB the unpacking process would stop. Even though I am aware of the fact that 2^32 bytes = 4 GB, I was confused and asked Yann for advice, and he immediately asked whether I was using Windows or Linux. I told him that I had just switched to Linux, but then it also came to my mind: my external hard drive uses FAT32 as its file system, which cannot handle files bigger than 4 GB.
After copying the zipped database dump to my Linux file system the unpacking problem was solved. I installed MediaWiki on my local system in order to have all the necessary database tables. MediaWiki also comes with an import script. This script is PHP based and incredibly slow: about two articles per second get parsed and imported into the database, so 1 million articles would need about 138 hours, or more than five and a half days. For a small data set and experiment this is unacceptable. Fortunately Wikipedia also provides a Java tool called mwdumper which can process about 70 articles per second. The first time I ran this program it crashed after 150,000 articles. I am still not sure what caused the crash, but I decided to tweak the MySQL settings in /etc/mysql/my.cnf a little bit. After assigning more memory to MySQL I started the import a second time, only to realize that it could not continue importing the dump. After truncating all tables I restarted the whole import process. Right now it is still ongoing, but it has already imported 1,384,000 articles and my system still seems stable.

Summary: How to download Wikipedia in a nutshell

  1. Install some Linux on your computer. I recommend Ubuntu
  2. Use the package manager to install MySQL, Apache and PHP
  3. Install and set up MediaWiki (this can also be done via the package manager)
  4. Read http://en.wikipedia.org/wiki/Wikipedia:Database_download – unlike the German version, this article discusses the file size issues
  5. Find the Wikipedia dump that you want to download on the above page
  6. Use the Java program MWDumper and read the instructions.
  7. Install Java if not already done (can be done with package manager)
  8. Donate some money to the Wikimedia foundation or at least contribute to Wikipedia by correcting some articles.

So obviously I haven't even started with the really interesting research on the German Wikipedia, but I thought it might already be interesting to share the experience I have had so far. A nice but not surprising side effect, by the way, is that the local version of Wikipedia is amazingly fast. Once I have crawled and indexed Wikipedia and transferred the data to some format that I can use more conveniently, I might write another article, and maybe I will even publish a dump of my data structures.
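
Not part of the setup described above, but for readers who only want the raw article text and would rather skip the whole MediaWiki/MySQL import: a minimal Python sketch along the following lines streams pages straight out of the compressed XML dump. The file name is a placeholder and the namespace handling is deliberately crude.

import bz2
import xml.etree.ElementTree as ET

DUMP = "dewiki-latest-pages-articles.xml.bz2"    # placeholder file name

def iter_articles(path):
    """Yield (title, wikitext) pairs without loading the whole dump into memory."""
    with bz2.open(path, "rb") as f:
        title = None
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]     # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                yield title, elem.text or ""
            elif tag == "page":
                elem.clear()                      # free memory page by page

count = 0
for title, text in iter_articles(DUMP):
    count += 1
print(count, "pages seen")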

Download Data Sets on this blog
https://www.rene-pickhardt.de/download-data-sets-on-this-blog/ (Mon, 14 Feb 2011)
On the Internet you can download some very interesting data sets that computer scientists and others use for scientific or economic reasons. One application, for example, might be to test data mining algorithms or to create new knowledge from the data. In other cases data sets can be used to create mash-ups. A mash-up is an application which combines several data sets to create an interesting new piece of information. In this section I will try to collect a bunch of different data sets that are available for download on the web. In some cases I might also include a short tutorial on how to use and process the respective data set, though I think that in most cases the publisher of a data set will already give you enough information on how to use it. Since my major at university is Web Science, semantic data and linked open data will be of special significance in this section.

Some data sets might only be available through an API. If no good tutorials exist for it, I will introduce the API in these cases.

Finally, if you think I am missing out on some good data sets, please contact me and tell me about them immediately! That would be very helpful!
