java – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

GWT + database connection in Servlet ContextListener – Auto Complete Video Tutorial Part 5

Rene — Mon, 24 Jun 2013 11:44:47 +0000

Finally we have all the basics that are needed for building an Autocomplete service, and now comes the juicy part. From now on we are looking at how to make it fast and robust. In the current approach we open a new data base connection for every HTTP request. It takes quite some time to lock the data base (at least when using neo4j in embedded mode) and then to run the query, and we never get a chance to benefit from the caching strategy of the data base.
In this tutorial I will introduce you to the concept of a ContextListener (a ServletContextListener in the Servlet API). Roughly speaking, it is a hook that runs when the web application starts and stops, and together with the ServletContext it gives us a way of storing objects in the global memory of the servlet container as key-value pairs. Once we understand this, the roadmap is very clear. We can store objects like data base connections or search indices in the memory of our web server. As far as I currently understand, this could also be used to implement some server-side caching. I have not yet done any benchmarking of how fast retrieving objects from the context works in Tomcat. Also, this method of caching does not scale horizontally as well as memcached does.
Anyway, have fun learning about the context listener.
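To make the idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the servlet API so that it runs on its own: AppContext is a hypothetical stand-in for javax.servlet.ServletContext (whose real methods are also called setAttribute and getAttribute), DbContextListener mimics the contextInitialized and contextDestroyed callbacks of a ServletContextListener, and FakeDatabase is a made-up placeholder for the embedded neo4j instance. It is an illustration of the pattern, not the code from the video.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for javax.servlet.ServletContext: a global key-value store.
class AppContext {
    private final Map<String, Object> attributes = new ConcurrentHashMap<>();

    void setAttribute(String key, Object value) { attributes.put(key, value); }
    Object getAttribute(String key) { return attributes.get(key); }
    void removeAttribute(String key) { attributes.remove(key); }
}

// Hypothetical expensive resource; it counts how often it has been opened.
class FakeDatabase {
    static int openCount = 0;
    private boolean open = true;

    FakeDatabase() { openCount++; }
    boolean isOpen() { return open; }
    void shutdown() { open = false; }
    String query(String prefix) { return "suggestions for " + prefix; }
}

// Mirrors the two callbacks of a ServletContextListener.
class DbContextListener {
    static final String DB_KEY = "database";

    // Called once when the web application starts: open the data base a single time.
    void contextInitialized(AppContext ctx) {
        ctx.setAttribute(DB_KEY, new FakeDatabase());
    }

    // Called once when the web application stops: release the data base again.
    void contextDestroyed(AppContext ctx) {
        FakeDatabase db = (FakeDatabase) ctx.getAttribute(DB_KEY);
        if (db != null) {
            db.shutdown();
            ctx.removeAttribute(DB_KEY);
        }
    }
}

// Mirrors a servlet handling one request: it reuses the shared connection.
class SuggestionHandler {
    static String handle(AppContext ctx, String prefix) {
        FakeDatabase db = (FakeDatabase) ctx.getAttribute(DbContextListener.DB_KEY);
        return db.query(prefix);
    }
}
```

In a real deployment the listener class would implement javax.servlet.ServletContextListener and be registered in web.xml (or with @WebListener from Servlet 3.0 on), and a servlet would fetch the shared object via getServletContext().getAttribute(...).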

If you have any suggestions, comments or thoughts, or even know of some solid benchmarks about caching using the ServletContext (I did a quick web search for a few minutes and didn’t find any), feel free to contact me and discuss this!

Analyzing the final and intermediate results of the iversity MOOC Fellowship online voting

Rene — Thu, 23 May 2013 23:07:24 +0000

As written before, Steffen and I participated in the online voting for the MOOC fellowship. Today the competition finished and I would like to say thank you to everyone who participated in the voting, in particular to the 435 people supporting our course. I never imagined that so many people would be interested in our course!
The voting period went from May 1st until today. During this period the user interface of the iversity website changed several times, providing different kinds of information about the voting to us users. Since I observed a drastic change in rankings on May 9th, and since the process and scores have not been very transparent, I decided on that very day to collect some data about the rankings. I already did some quick analysis on the data and found some interesting facts, but I am running out of time right now to conduct an extensive data analysis. So I am releasing the data set into the public domain:
http://rene-pickhardt.de/mooc.tar.bz2 (33MB)
If you download the archive and extract it you’ll find folders for every hour after May 9th. In every folder you will find 26 html files representing the ranking of the courses at that time and a transaction log of the http requests which were made to download the 26 html files. There are 26 html files since 10 courses were displayed per page and we had 255 courses participating.
During the time of data collection my web server had 2 or 3 short down times, so it is possible that some data points are missing.
I already wrote a “dirty hack” which extracts the interesting information from the downloaded html files and pushed it to github.

There is a file rank.tsv (334 KB) that contains the hourly ranking of every course.
There is a file vote.tsv (113 KB) that contains, for every course on an hourly basis (between May 20th and today), the number of votes the course had acquired. The period covered by vote.tsv is so short because the votes were only available in the html files during this time.

Skimming the data with my eyes there are already some facts that make me very curious for a deeper data analysis:

Some courses gained several hundred votes within a short period of time (usually only 2 or 3 hours), whereas most courses (especially those gaining such a large number of votes) often stayed far under 1000 votes in total.
Also it is interesting to see how much variation has been going on in the last couple of days.
I have not crawled the view counts of the courses’ Youtube videos, and even after observing the following I did not take a snapshot of them. Still, it is interesting that there is such a large difference in conversion rate. The top courses in particular seem to have many more votes than they have views of the application video, whereas some really high class and outstanding applications like the ones from Christian Spannagel (Math) or Oliver Vornberger (Algorithms and data structures) have two or three times as many views on Youtube as votes. In fact, they have about the same number of views on Youtube as the top voted courses.

I am pretty sure there are some more interesting facts, and maybe someone else has collected a better data set covering the complete period of time and including Youtube snapshots as well as Facebook and Twitter mentions.
Since I have been asked several times already: here are the final rankings to download and also as a table in the blog post:

Rank	Course name	Number of votes
1	sectio chirurgica anatomie interaktiv	8013
2	internationales agrarmanagement 2	7557
3	ingenieurmathematik fur jedermann	2669
4	harry potter and issues in international politics	2510
5	online surgery	2365
6	l3t s mooc der offene online kurs uber das lernen und lehren mit technologien	2270
7	design 101 or design basics 2	2216
8	einfuhrung in das sozial und gesundheitswesen sozialraume entdecken und entwickeln	2124
9	changeprojekte planen nachhaltige entwicklung durch social entrepreneurship	2083
10	social work open online course swooc14	2059
11	understanding sustainability environmental problems collective action and institutions	1912
12	the dance of functional programming languaging with haskell and python	1730
13	zyklenbasierte grundung systematische entwicklung von geschaftskonzepten	1698
14	a virtual living lab course for sustainable housing and lifestyle	1682
15	family politics domestic life revolution and dictatorships between 1900 1950	1476
16	h2o extrem	1307
17	dark matter in galaxies the last mystery	1261
18	algorithmen und datenstrukturen	1207
19	psychology of judgment and decision making	1168
20	the future of storytelling	1164
21	web engineering	1152
22	die autoritat der wissenschaften eine einfuhrung in das wissenschaftstheoretische denken 2	1143
23	magic and logic of music a comprehensive course on the foundations of music and its place in life	1138
24	nmooc nachhaltigkeit fur alle	1130
25	sovereign bond pricing	1115
26	soziale arbeit eine einfuhrung	1034
27	mathematische denk und arbeitsweisen in geometrie und arithmetik	1016
28	social entrepreneurship wir machen gesellschaftlichen wandel moglich	1010
29	molecular gastronomy an experimental lecture about food food processing and a bit of physiology	984
30	fundamentals of remote sensing for earth observation	920
31	kompetenzkurs ernahrungswissenschaft	891
32	erfolgreich studieren	879
33	deciphering ancient texts in the digital age	868
34	qualitative methods	861
35	karl der grosse pater europae	855
36	who am i mind consciousness and body between science and philosophy	837
37	programmieren mit java	835
38	systemisches projektmanagement	811
39	lernen ist sexy	764
40	modelling and simulation using matlab one mooc more brains an interdisciplinary course not just for experts	760
41	suchmaschinen verstehen	712
42	hands on course on embedded computing systems with raspberry pi	679
43	introduction to mixed methods and doing research online	676
44	game ai	649
45	game theory and experimental economic research	633
46	cooperative innovation	613
47	blue engineering ingenieurinnen und ingenieure mit sozialer und okologischer verantwortung	612
48	my car the unkown technical being	612
49	gesundheit ein besonderes gut eine multidisziplinare erkundung des deutschen gesundheitssystems	608
50	teaching english as a foreign language tefl part i pronunciation	597
51	wie kann lesen gelernt gelehrt und gefordert werden lesesozialisation lesedidaktik und leseforderung vom grundschulunterricht bis zur erwachsenenbildung	593
52	the european dream	576
53	education of the present what is the future of education	570
54	faszination kristalle und symmetrie	561
55	italy today a girlfriend in a coma a walk through today s italy	557
56	dna from structure to therapy	556
57	grundlagen der mensch computer interaktion	549
58	malnutrition in developing countries	548
59	marketing als strategischer erfolgsfaktor von der produktinnovation bis zur kundenbindung	540
60	environmental ethics for scientists	540
61	stem cells in biology and medicine	528
62	praxiswissen fur den kunstlerischen alltagsdschungel	509
63	physikvision	506
64	high five evidence based practice	505
65	future climate water	484
66	diversity and communication challenges for integration and mobility	477
67	social entrepreneurship	469
68	die kunst des argumentierens	466
69	der hont feat mit dem farat wek wie kinder schreiben und lesen lernen	455
70	antikrastination moocen gegen chronisches aufschieben	454
71	exercise for a healthier life	454
72	the startup source code	438
73	web science	435
74	medizinische immunologie	433
75	governance in and through human rights	431
76	europe in the world law and policy aspects of the eu in global governance	419
77	komplexe welt strukturen selbstorganisation und chaos	419
78	mooc basics of surgery want to become a real surgeon	416
79	statistical data analysis for the humanities	414
80	business math r edux	406
81	analyzing behavioral dynamics non linear approaches to social and cognitive sciences	402
82	space technology	397
83	der erzahler materialitat und virtualitat vom mittelalter bis zur gegenwart	396
84	kriminologie	395
85	von e mail skype und xing kommunikation fuhrung und berufliche zusammenarbeit im netz	394
86	wissenschaft erzahlen das phanomen der grenze	392
87	nachhaltige entwicklung	389
88	die nachste gesellschaft gesellschaft unter bedingungen der elektrizitat des computers und des internets	388
89	die grundrechte	376
90	medienbildung und mediendidaktik grundbegriffe und praxis	368
91	bubbles everywhere speculative bubbles in financial markets and in everyday life	364
92	the heart of creativity	363
93	physik und weltraum	358
94	sim suchmaschinenimplementierung als mooc	354
95	order of magnitude physics from atomic nuclei to the universe	350
96	entwurfsmethodik eingebetteter systeme	343
97	monte carlo methods in finance	335
98	texte professionell mit latex erstellen	331
99	wissenschaftlich arbeiten wissenschaftlich schreiben	330
100	e x cite join the game of social research	330
101	forschungsmethoden	323
102	complex problem solving	321
103	programmieren lernen mit effekt	317
104	molecular devices and machines	317
105	wie man erfolgreich ein startup aufbaut	315
106	grundlagen der prozeduralen und objektorientierten programmierung	314
107	introduction to disability studies	314
108	eu2c the european union explained by two partners cologne and cife	313
109	the english language a linguistic introduction 2	311
110	allgemeine betriebswirtschaftslehre	293
111	interaction design open design	293
112	how we learn nowadays possibilities and difficulties	288
113	foundations of educational technology	288
114	projektmanagement und designbasiertes lernen	281
115	human rights	278
116	kompetenz des horens technische gehorbildung	278
117	it infrastructure management	276
118	a media history in 10 artefacts	274
119	introduction to the practice of statistics and regression	271
120	what is a good society introduction to social philosophy	268
121	modellierungsmethoden in der wirtschaftsinformatik	265
122	objektorientierte programmierung von web anwendungen von anfang an	262
123	intercultural diversity networking vielfalt interkulturell vernetzen	260
124	foundations of entrepreneurship	259
125	business communication for impact and results	257
126	gamification	257
127	creativity and design in innovation management	256
128	mechanik i	252
129	global virtual project management	252
130	digital signal processing for everyone	249
131	kompetenzen fur klimaschutz anpassung	248
132	digital economy and social innovation	246
133	synthetic biology	245
134	english phonetics and phonology	245
135	leibspeisen nahrung im wandel der zeiten molekule brot kase fleisch schokolade und andere lebensmittel	243
136	critical decision making in the contemporary globalized world	238
137	einfuhrung in die allgemeine betriebswirtschaftslehre schwerpunkt organisation personalmanagement und unternehmensfuhrung	236
138	didaktisches design	235
139	an invitation to complex analysis	235
140	grundlagen der programmierung teil 1	234
141	allgemein und viszeralchirurgie	233
142	mathematik 1 fur ingenieure	231
143	consumption and identity you are what you buy	231
144	vampire fictions	230
145	grundlagen der anasthesiologie	228
146	marketing strategy and brand management	227
147	political economy an introduction	225
148	gesundheit	221
149	object oriented databases	219
150	lebenswelten perspektiven fur menschen mit demenz	217
151	applications of graphs to real life problems	210
152	introduction to epidemiology epimooc	207
153	network security	207
154	global civics	207
155	wissenschaftliches arbeiten	204
156	annaherungen an zukunfte wie lassen sich mogliche wahrscheinliche und wunschbare zukunfte bestimmen	202
157	einstieg wissenschaft	200
158	engineering english	199
159	das erklaren erklaren wie infografik klart erklart und wissen vermittelt	198
160	betriebswirtschaftliche und rechtliche grundlagen fur das nonprofit management	192
161	art and mathematics	191
162	vom phanomen zum modell mathematische modellierung von natur und alltag an ausgewahlten beispielen	190
163	design interaktiver medien technische grundlagen	189
164	business englisch	187
165	erziehung sehen analysieren gestalten	184
166	basic clinical research methods	184
167	ordinary differential equations and laplace transforms	180
168	mathematische logik	179
169	die geburt der materie in der evolution des universums	179
170	innovationsmanagement von kleinen und mittelstandischen unternehmen kmu	176
171	introduction to qualitative methods in the social sciences	175
172	advert retard wirkung industrieller interessen auf rationale arzneimitteltherapie	175
173	animation beyond the bouncing ball	174
174	entropie einfuhrung in die physikalische chemie	172
175	edufutur education for a sustainable future	165
176	social network effects on everyday life	164
177	pharmaskills for africa	163
178	nachhaltige energiewirtschaft	162
179	qualitat in der fruhpadagogik auf den anfang kommt es an	158
180	dementias	157
181	beyond armed confrontation multidisciplinary approaches and challenges from colombia s conflict	154
182	investition und finanzierung	150
183	praxis des wissensmanagements	149
184	gutenberg to google the social construction of the communciations revolution	145
185	value innovation and blue oceans	145
186	kontrapunkt	144
187	shakespeare s politics	142
188	jetzt erst recht wissen schaffen uber recht	141
189	rechtliche probleme von sozialen netzwerken	138
190	augmented tuesday suppers	137
191	positive padagogik	137
192	digital storytelling mit bewegenden bildern erzahlen	136
193	wirtschaftsethik	134
194	energieeffizientes bauen	134
195	advising startups	133
196	urban design and communication	133
197	bildungsreform 2 0	132
198	mooc management basics	130
199	healthy teeth a life long course of preventive dentistry	129
200	digitales tourismus marketing	127
201	the arctic game the struggle for control over the melting ice	127
202	disease mechanisms	127
203	special operations from raids to drones	125
204	introduction to geospatial technology	120
205	social media marketing strategy smms	119
206	korpusbasierte analyse sprechsprachlichen problemlosungsverhaltens	116
207	introduction to marketing	115
208	creative coding	114
209	mooc meets 3d	110
210	unternehmenswert die einzig sinnvolle spitzenkennzahl fur unternehmen	110
211	forming behaviour gestaltung und konzeption von web applications	109
212	technology demonstration	108
213	lebensmittelmikrobiologie und hygiene	105
214	estudi erfolgreich studieren mit dem internet	105
215	moderne geldtheorie eine paische perspektive	103
216	kollektive intelligenz	103
217	geschichte der optischen medien	100
218	alter und soziale arbeit	99
219	semantik eine theorie visueller kommunikation	97
220	erziehung und beratung in familie und schule	96
221	foreign language learning in indian context	95
222	bildgebende verfahren	92
223	applied biology	92
224	bildung in der wissensgesellschaft gerechtigkeit	92
225	standortmanagement	92
226	europe a solution from history	90
227	methodology of research in international law	90
228	when african americans came to paris	90
229	contemporary architecture	89
230	past recent encounters turkey and germany	88
231	wars to end all wars	83
232	online learning management systems	82
233	software applications	81
234	business in germany	78
235	requirements engineering	77
236	anything relationship management xrm	77
237	global standards and local practices	76
238	prodima professionalisation of disaster medicine and management	75
239	cytology with a virtual correlative light and electron microscope	75
240	the organisation of innovation	75
241	sensors for all	75
242	diagnostik in der beruflichen bildung	73
243	scientific working	71
244	escience saxony lectures	71
245	internet marketing strategy how to gain influence and spread your message online	69
246	grundlagen des e business	69
247	principles of public health	64
248	methods for shear wave velocity measurements in urban areas	64
249	democracy in america	64
250	building typology studies gebaudelehre	63
251	multi media based learning environments at the interface of science and practice hamburg university of applied sciences prof dr andrea berger klein	61
252	math mooc challenge	60
253	the value of the social	58
254	dienstleistungsmanagement und informationssysteme	57
255	ict integration in education systems e readiness e integration e transformation	56

Building an Autocompletion on GWT screencast Part 2: Invoking The Remote Procedure Call

Rene — Tue, 12 Mar 2013 07:25:00 +0000

Hey everyone, after posting my first screencast in this series, which reviewed the basic process for creating remote procedure calls in GWT, we are now finally starting with the real tutorial for building an autocomplete service.
This tutorial (again hosted on wikipedia) covers the basic user interface, meaning

how to integrate a SuggestBox instead of a text field into the GWT Starter project
how to set up the necessary pieces (extending a SuggestOracle) to fire a remote procedure call that requests suggestions once the user has typed something
how to override the necessary methods of the SuggestOracle class

So here we go with the second part of the screencast which you can of course directly download from wikipedia:

Feel free to ask questions, give comments and improve the screencast!

Building an Autocompletion on GWT screencast Part 1: Getting Warm – Reviewing remote procedure calls

Rene — Tue, 19 Feb 2013 09:11:29 +0000

Quite a while ago I promised to create some screencasts on how to build a (personalized) Autocompletion in GWT. Even though the screencasts were recorded quite some time ago, I had to wait with publishing them for various reasons.
Finally the time has come to go public with the first video. I really do start from scratch, so the first video might be a little bit boring since I am only reviewing the Remote Procedure Calls of GWT.
A little note: the video is hosted on Wikipedia! I think it is important to spread knowledge under a creative commons licence, and the youtubes, vimeos,… of this world are rather trying to create a vendor lock-in. So if the embedded player does not work well for you, you can go directly to wikipedia for a fullscreen version or a direct download of the video.

Another note: I did not publish the source code! This has a pretty simple reason (and yes you can call me crazy): If you really want to learn something, copying and pasting code doesn’t help you to get the full understanding. Doing it step by step e.g. watching the screencasts and reproducing the steps is the way to go.
As always I am open to suggestions and feedback, but please keep in mind that the entire course of videos is already recorded.

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language

Rene — Tue, 06 Nov 2012 11:55:02 +0000

As I said yesterday, I have been busy over the last months producing content, so here you go. For related work we are most likely to use neo4j as our core data base. This makes sense since we are basically building some kind of social network. Most queries that we need to answer while offering the service or during data mining have a friend of a friend structure.
For some of the queries we are doing counting or aggregations, so I was wondering what the most efficient way of querying a neo4j data base is. So I did a benchmark, with quite surprising results.
Just a quick remark: we used a data base consisting of papers and authors extracted from arxiv.org, one of the biggest preprint sites available on the web. The data set is available for download and reproduction of the benchmark results at http://blog.related-work.net/data/
The data base as a neo4j file is 2GB (zipped). The schema looks pretty much like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find coauthors, which is basically a friend of a friend query following the author relationship (or a breadth first search of depth 2).
As we know, there are basically 3 ways of communicating with the neo4j data base:

Java Core API

Here you work on the node and relationship objects within Java. Once you have fixed an author node, formulating a query looks pretty much like this:

    // First hop: from the author to each of their papers.
    for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node paper = rel.getOtherNode(author);
        // Second hop: from the paper to all of its authors.
        for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
            Node coAuthor = coAuthorRel.getOtherNode(paper);
            // Skip the author we started from.
            if (coAuthor.getId() == author.getId()) continue;
            resCnt++;
        }
    }

We see that the code can easily become very confusing (once queries get more complicated). On the other hand one can easily combine several similar traversals into one big query, making readability worse but increasing performance.

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps to keep the code readable.

    Traversal t = new Traversal();
    // Breadth first over AUTHOROF relationships, keeping only paths of depth 2.
    for (Path p : t.description().breadthFirst()
            .relationships(RelationshipTypes.AUTHOROF)
            .evaluator(Evaluators.atDepth(2))
            .uniqueness(Uniqueness.NONE)
            .traverse(author)) {
        Node coAuthor = p.endNode();
        resCnt++;
    }

Especially if you have a lot of similar queries, or queries that are refinements of other queries, you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stackoverflow it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

    ExecutionEngine engine = new ExecutionEngine(graphDB);
    // START at the fixed author node and follow two AUTHOROF hops.
    String query = "START author=node(" + author.getId()
            + ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name()
            + "]-()-[:" + RelationshipTypes.AUTHOROF.name()
            + "]- coAuthor RETURN coAuthor";
    ExecutionResult result = engine.execute(query);
    scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
    while (it.hasNext()) {
        Node coAuthor = it.next();
        resCnt++;
    }

I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out that Cypher is quite expressive.

What were the results?

Every query was executed 11 times; the first run was thrown away since it only warms up the neo4j caches. The values are averages over the other 10 executions.

The Core API is able to answer about 2000 friend of a friend queries per second (I have to admit, on a very sparse network).
The Traverser Framework is about 25% slower than the Core API.
Worst is Cypher, which is slower by at least one order of magnitude and only able to answer about 100 FOAF-like queries per second.
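The measurement procedure described above (one discarded warm-up run, then the average over the remaining runs) can be sketched in a few lines of plain Java. This is not the benchmark code linked at the end of the post; the class WarmupBenchmark and its method names are made up for illustration, and the query is passed in as an arbitrary function that returns its result count.

```java
import java.util.function.IntSupplier;

class WarmupBenchmark {
    // Executes the query (runs + 1) times, discards the first (warm-up) run
    // and returns the mean duration of the remaining runs in milliseconds.
    static double averageMillis(IntSupplier query, int runs) {
        long totalNanos = 0;
        for (int i = 0; i <= runs; i++) {
            long start = System.nanoTime();
            query.getAsInt();
            long elapsed = System.nanoTime() - start;
            if (i > 0) { // i == 0 is the warm-up run and is not counted
                totalNanos += elapsed;
            }
        }
        return totalNanos / (double) runs / 1_000_000.0;
    }

    // Converts the average duration of a batch of queries into queries per second.
    static double queriesPerSecond(int queriesPerRun, double averageMillis) {
        return queriesPerRun / (averageMillis / 1000.0);
    }
}
```

Calling averageMillis(query, 10) reproduces the 11 executions mentioned above. Discarding the first run matters because the caches are cold at that point and the timing would otherwise be dominated by disk access.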

I was shocked, so I talked with Andres Taylor from neo4j, who is mainly working on Cypher. He asked me which neo4j version I used and I said it was 1.7. He told me I should check out 1.9, since Cypher has become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9. Unfortunately, Cypher became slower in the newer neo4j releases.

One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

Quotes from Andres Taylor:

Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on. The focus in this first phase has been to explore the language, and learn about how our users use the query language, and to expand the feature set to a reasonable level.

I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up.

Conclusion:

So far I have only used the Java Core API when working with neo4j and I will continue to do so.
If you are in a high speed scenario (I believe every web application is one) you should really think about switching to the neo4j Java Core API for writing your queries. It might not be as nice looking as Cypher or the Traverser Framework, but the gain in speed pays off.
Also I personally like the amount of control that you have when traversing the graph with the Core API yourself.
Additionally I will soon post an article on why scripting languages like PHP, Python or Ruby aren’t suitable for building web applications anyway. So changing to the Core API makes sense for several reasons.
The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
The detailed results can be found in this spreadsheet.

Typology Oberseminar talk and Speed up of retrieval by a factor of 1000

Rene — Thu, 16 Aug 2012 11:39:25 +0000

Almost 2 months ago I talked in our Oberseminar about Typology. Update: Download slides. Most readers of my blog will already know the project, which was initially implemented by my students Till and Paul. I am just about to share some slides with you. On the one hand they explain how the system works and on the other hand they give an overview of the related work.
As you can see from the slides, we are planning to submit our results to the SIGIR conference. So one year after my first blog post on graphity, which developed into a full paper for socialcom2012 (graphity blog post and blog post for source code), there is the yet informal typology blog post with the slides of the Typology Oberseminar talk, and 3 months are left until our SIGIR submission. I expect that this time the submission will not be such a hassle as with graphity, since I should have learnt some lessons and I also have a good student who is helping me with the implementation of all the tests.
Additionally I have finally uploaded some source code to github that makes the typology retrieval algorithm pretty fast. There are still some issues with this code since it lowers the quality of predictions a little bit. Also, the index has to be built first. Last but not least, the original SuggestTree code did not save the weights of the items to be suggested. I need those weights in the aggregation phase. Since I did not want to extend the original code, I placed the weights at the end of the suggested items. This is a little inefficient.
The main reason why retrieval speeds up with the new algorithm is that typology needs to sort all outgoing edges of a node. This is rather slow, especially if one only needs the top k elements. Since neo4j as a graph data base does not provide indices for this kind of data, I was forced to look for another way to presort the data. Additionally, if a prefix is known one does not have to look at all outgoing edges. I found the SuggestTree class by Nicolai Diethelm, which solved the problem in a very good way and led to such a great speed. The index is not persistent yet and it also needs quite some memory. On the other hand, a suggest tree is built for every node. This means that the index can be distributed very easily over several machines, allowing for horizontal scaling!
Anyway, the old algorithm was only able to handle about 20 requests per second and now we have something like 14k requests per second, and as I mentioned there is still a little room for more (:
I hope indices like this will be standard in neo4j soon. This would open up the range of applications that could make good use of neo4j.
As always I am happy about any suggestions, and I am looking forward to doing the complete evaluation and paper writing for typology.
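The effect of presorting can be illustrated with a much simpler structure than the actual SuggestTree. The sketch below is not Nicolai Diethelm's implementation and not the typology code; it is a hypothetical PrefixTopK class that stores, for every prefix of every term, the k best terms by weight, so that a lookup needs no sorting at query time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixTopK {
    // For every prefix: the (at most k) heaviest terms starting with it, best first.
    private final Map<String, List<String>> topByPrefix = new HashMap<>();

    PrefixTopK(Map<String, Integer> weights, int k) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(weights.entrySet());
        // Highest weight first; ties are broken alphabetically to stay deterministic.
        sorted.sort((a, b) -> {
            int byWeight = Integer.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        // Terms arrive in descending weight order, so the first k terms
        // registered for a prefix are exactly its top k.
        for (Map.Entry<String, Integer> entry : sorted) {
            String term = entry.getKey();
            for (int len = 1; len <= term.length(); len++) {
                List<String> best = topByPrefix.computeIfAbsent(
                        term.substring(0, len), p -> new ArrayList<>());
                if (best.size() < k) {
                    best.add(term);
                }
            }
        }
    }

    // Returns the presorted top-k list for the prefix without any sorting.
    List<String> suggest(String prefix) {
        return topByPrefix.getOrDefault(prefix, Collections.emptyList());
    }
}
```

Like the index described above, this trades memory for speed: all the sorting work happens once at build time, and one such structure per node can live on a different machine.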

Neo4j Graph Database vs MySQL

Rene — Thu, 05 May 2011 21:36:32 +0000

For my social news stream application I am thinking hard about the right software to support my backend. After I had designed a database model in MySQL I talked to Jonas again and he suggested searching for a better suited technology. A little bit of research brought me to a graph database called Neo4j.
After I downloaded the open source Java libs and got them running with Eclipse and Maven, I did some testing with my Metalcon data set. I have been very satisfied and the whole project looks very promising to me. I exported 4 relations from my MySQL database.

UserUserFriend containing all the friendship requests
UserProfileVisit containing the profiles a user visited
UserMessage containing the messages between users
UserComment containing the profile comments between users

These relations obviously form a graph on my data set. Reading several hundred thousand lines of data, putting them into the graph data structure and building a search index on the nodes took only several seconds of runtime. But I was even more impressed by the speed with which it was possible to traverse the graph!
Retrieving a shortest path of length 4 between two users took only 150 milliseconds. Doing a full breadth-first search on a different, heavily connected graph with 290’000 edges took only 2.7 seconds, which means that neo4j is capable of traversing about 100’000 edges per second.
Now I will have to look more carefully at my use case. Obviously I want to have edges that are labeled with timestamps and retrieve them as ordered lists. Adding key-value pairs to the edges and including an index is possible, which makes me optimistic that I will be able to solve a lot of my queries of interest in an efficient manner.
Unfortunately I am still battling with Google Web Toolkit, Eclipse and Neo4j, which I want to combine for the new Metalcon version. I even asked the neo4j mailing list a very embarrassing question, and the guys from Neo Technology have been very kind and helpful (even though I still couldn’t manage to get it running). I will post an article here as soon as I know how to set everything up.
In general I am still a huge fan of relational databases, but for the use case of social networks I see why graph data bases seem to be the more sophisticated technology. I am pretty sure that I could not have performed so well using MySQL.
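For readers unfamiliar with the term: a full breadth-first search visits every reachable node level by level and looks at each outgoing edge once. The following is a minimal in-memory sketch in plain Java, not the neo4j API; the class BfsSketch and the adjacency-list representation are assumptions made for illustration. The hop distance it computes for a node is the length of a shortest path to it.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

class BfsSketch {
    // Number of edges inspected by the most recent call to distances().
    static long edgesTraversed = 0;

    // Returns the hop distance from start to every node, or -1 if unreachable.
    static int[] distances(List<List<Integer>> adjacency, int start) {
        int[] dist = new int[adjacency.size()];
        Arrays.fill(dist, -1);
        dist[start] = 0;
        edgesTraversed = 0;
        Queue<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (int neighbour : adjacency.get(node)) {
                edgesTraversed++;
                if (dist[neighbour] == -1) { // first visit is via a shortest path
                    dist[neighbour] = dist[node] + 1;
                    queue.add(neighbour);
                }
            }
        }
        return dist;
    }
}
```

As a sanity check on the numbers above: 290’000 edges in 2.7 seconds is roughly 107’000 edges per second, which matches the "about 100’000" figure.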
What is your experience with graph data bases and especially neo4j?