Over the last weeks I did some more work on neo4j. And I am ready to present some more results on the speed (In my use case neo4j outperformed MySQL by a factor of 377 ! That is more than two magnitudes). As known one part of my PhD thesis is to create a social newsstream application around my social networking site metalcon.de. It is very obvious that a graph structure for social newsstreams are very natural: You go to a user. Travers to all his friends or objects of interest and then traverse one step deeper to the newly created content items. A problem with this kind of application is the sorting by Time or relvance of the content items. But before I discuss those problems I just want to present another comparission between MySQL and neo4j.
Setting:
I took a fairly small dataset with the following properties:
- 14986 content items (entries in a forum of a band)
- 1391 Bands
- 854 user having at leas one of those bands in their list of fav bands
- a bipartit graph of fan relations between users and bands
For every User I wanted to select all content items from his favourite bands. I know this is far away from the social newsstream application I am about to build but I used it as a first test to see weather graph databases really are the more natural setting for my questions.
Results:
In MySQL this would look something like this:
for all (User_ID in Users){
SELECT ce.Text
FROM Entry e
JOIN Topic t ON t.ID = e.Topic_ID
JOIN UserBand ub ON ct.Band_ID = ub.Band_ID
WHERE ub.User_ID = User_ID
}
Even though we have all relevant colums indexed those joins are expensive. Especially because the Entry Table is much bigger than 14986 Items.
Using MySQL It took 152 Seconds = 2:32 Minutes to create the interesting newsstreams for all 854 users or 177 ms per user
Let’s look at neo4j:
Not using any traverser but just some hacked in code I have something like the following
for all (user in index){
for (Relationship rel: user.getRelationships(DynamicRelationshipType.withName(“FAN”), Direction.BOTH)){
Node n1 = rel.getOtherNode(user);
Even thogh we only have 854 users and 1391 bands we end up with 1368270 relations that we have traversed to retrieve all content items for all favourite bands of each user.
Using neo4j it took 3,4 Seconds in doing this or 4 ms per user
That is about 44 times faster than MySQL
After warming the caches.
When repeating this experiment MySQL does not get faster. I know they have a query cache but I guess it gets emptied before the first queries are run again. In neo4j though this result gets even better. Each time I repeated this experiment the runtime decreased until I came to something like 0,4 Seconds for the job which now is 377 times faster than MySQL. The best part of this is scaling. When my Social graph grows the search in neo4j stays a local search. more users and more discussions does not mean more edges to traverse from the user to his favourite bands. in MySQl though the cost of these heavy joins will just explode.
Yes I know in MySQL I would denormalize my tables to create such an application in order to gain speed. But denormalizing means more redundancy and again a graph is a much more natural structure for a social application.
Open Questions! Feel free to discuss them (-:
There is an very important open question though and that is sorting. A social newsstream of course should be sorted by time. in MySQL it is easy to create an Index over a colum with timestamps on the Contentitems. Sorting my resultset by time will in this case not increase the runtime.
In a graph database with this particular schema of my graph I am not quite sure how to retrieve the results in a sorted way. (Anyone has some ideas?) There needs to be further testing to see if sorting after retrieving still makes my solution faster than MySQL or (my prefered solution) if there is some way of designing the graph in a way that for any user with any set of favourite bands there is a quick way of traversing through the content items and receiving them in an ordered way. I even guess this is already possible in neo4j using traversers with bredth first search and tell them in wich order to traverse relations. Just have too look deeper into this and i will keep you updated.
I am happy and open for comments and suggestions! Oh and could anyone suggest a nice syntax highlightning plugin for wordpress?