graph query language – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language

Rene — Tue, 06 Nov 2012 11:55:02 +0000

As I said yesterday I have been busy over the last months producing content so here you go. For related work we are most likely to use neo4j as core data base. This makes sense since we are basically building some kind of a social network. Most queries that we need to answer while offering the service or during data mining carry a friend of a friend structure.
For some of the queries we are doing counting or aggregations so I was wondering what is the most efficient way of querying against a neo4j data base. So I did a Benchmark with quite surprising results.
Just a quick remark, we used a data base consisting of papers and authors extracted from arxiv.org one of the biggest pre print sites available on the web. The data set is available for download and reproduction of the benchmark results at http://blog.related-work.net/data/
The data base as a neo4j file is 2GB (zipped) the schema looks pretty much like that:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we where trying to find coauthors which is basically a friend of a friend query following the author relationship (or breadth first search (depth 2))
As we know there are basically 3 ways of communicating with the neo4j Database:

Java Core API

Here you work on the nodes and relationship objects within java. Formulating a query once you have fixed an author node looks pretty much like this.
for (Relationship rel: author.getRelationships(RelationshipTypes.AUTHOROF)){ Node paper = rel.getOtherNode(author); for (Relationship coAuthorRel: paper.getRelationships(RelationshipTypes.AUTHOROF)){ Node coAuthor = coAuthorRel.getOtherNode(paper); if (coAuthor.getId()==author.getId())continue; resCnt++; } }
We see that the code can easily look very confusing (if queries are getting more complicated). On the other hand one can easy combine several similar traversals into one big query making readability worse but increasing performance.

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea of it. I think it is really easy to undestand the meaning of a query and in my opinion it really helps to create a good readability of the code.
Traversal t = new Traversal(); for (Path p:t.description().breadthFirst(). relationships(RelationshipTypes.AUTHOROF).evaluator(Evaluators.atDepth(2)). uniqueness(Uniqueness.NONE).traverse(author)){ Node coAuthor = p.endNode(); resCnt++; }
Especially if you have a lot of similar queries or queries that are refinements of other queries you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is Cypher Query language. An interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (Looking at Stackoverflow it is actually frightening how many people are trying to answer Foaf queries using MySQL) but still emphasizes on the graph like structure.
ExecutionEngine engine = new ExecutionEngine( graphDB ); String query = "START author=node("+author.getId()+ ") MATCH author-[:"+RelationshipTypes.AUTHOROF.name()+ "]-()-[:"+RelationshipTypes.AUTHOROF.name()+ "]- coAuthor RETURN coAuthor"; ExecutionResult result = engine.execute( query); scala.collection.Iterator it = result.columnAs("coAuthor"); while (it.hasNext()){ Node coAuthor = it.next(); resCnt++; } I was always wondering about the performance of this Query language. Writing a Query language is a very complex task and the more expressive the language is the harder it is to achieve good performance (same holds true for SPARQL in the semantic web) And lets just point out Cypher is quite expressive.

What where the results?

All queries have been executed 11 times where the first time was thrown away since it warms up neo4j caches. The values are average values over the other 10 executions.

The Core API is able to answer about 2000 friend of a friend queries (I have to admit on a very sparse network).
The Traverser framework is about 25% slower than the Core API
Worst is cypher which is slower at least one order of magnitude only able to answer about 100 FOAF like queries per second.

I was shocked so I talked with Andres Taylor from neo4j who is mainly working for cypher. He asked my which neo4j version I used and I said it was 1.7. He told me I should check out 1.9. since Cypher has become more performant. So I run the benchmarks over neo4j 1.8 and neo4j 1.9 unfortunately Cypher became slower in newer neo4j releases.

One can see That the Core API outperforms Cypher by an order of magnitute and the Traverser Framework by about 25%. In newer neo4j versions The core API became faster and cypher became slower

Quotes from Andres Taylor:

Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on the focus in this first phase has been to explore the language, and learn about how our users use the query language, and to expand the feature set to a reasonable level

I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. like every language ever created, in the beginning you can always do better than the compiler by writing by hand but eventually,the compiler catches up

Conclusion:

So far I was only using the Java Core API working with neo4j and I will continue to do so.
If you are in a high speed scenario (I believe every web application is one) you should really think about switching to the neo4j Java core API for writing your queries. It might not be as nice looking as Cypher or the traverser Framework but the gain in speed pays off.
Also I personally like the amount of control that you have when traversing over the core yourself.
Adittionally I will soon post an article why scripting languages like PHP, Python ore Ruby aren’t suitable for building web Applications anyway. So changing to the core API makes even sense for several reasons.
The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
The detailed results can be found in this spreadsheet.

PhD proposal on distributed graph data bases

Rene — Tue, 27 Mar 2012 10:19:22 +0000

Over the last week we had our off campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss our research ideas with a post doc and other PhD candidates for more than 6 hours. This lead to the structure, todos and time table of my PhD proposal. This has to be finalized over the next couple days but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology

[TODO] 0. Find a template for the PhD proposal

That is straight forward. The task is just to look at other students PhD proposals also at some major conferences and see what kind of structure they use. A very common structure for papers is Jennifer Widom’s structure for writing a good research paper. This or a similar template will help to make the proposal readable in a good way. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph data base. These could be

indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
storing web log files and click streams of users
doing information retrieval (recommender systems) in the above scenarios

There could also be very other use cases like graphs from

biology
finance
regular graphs
geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph data base. It is also important to set onself apart from existing frameworks like graph processing. Here I will name the at least the related work in the following fields:

graph processing (Signal Collect, Pregel,…)
graph theory (especially data structures and algorithms)
(dynamic/adaptive) graph partitioning
distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
redundancy vs fault tolerance
network programming (protocols, latency vs bandwidth)
data bases (ACID, multiple user access, …)
graph data base query languages (SPARQL, Gremlin, Cypher,…)
Social Network and graph analysis and modelling.

3. Formalize the problem of distributed graph data bases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: There needs to be notation for distributed graph data bases fixed. This has to respect two things:
a) the real – so far unknown – problems that will be solved during PhD. In this way fixing the notation and formalizing the (unknown) problem will be kind of hard.
b) The use cases: For the web use case this will probably translate to scale free small world network graphs with a very small diameter. Probably in order to respect other use cases than the web it will make sense to cite different graph models e.g. mathematical models to generate graphs with certain properties from the related work.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case still so general that all special cases and boarder line cases are included. Especially the use case should be a real extension to graph processing which should of course be possible with a distributed graph data base.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph data bases are not really general purpose data bases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information of a local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language the harder to have a very efficient solution. The model of a relational algebra was a very successful concept in relational data bases. I guess a similar graph algebra is needed as a mathmatical concept for distributed graph data bases as a foundation of their query languages.
Remark that this chapter has nothing much to do with distributed graph data bases but with graph data bases in general.
The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (ether as an answer from the related work or by own research) I will be able to run my first experiments in a distributed environment.

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected the graph algebra will consist of some atomic CRUD operations those operations have to be tested against all different data structures one can think of in the different known distributed environments over several different real world data sets. This task will be rather straight forward. It will be possible to know the theoretical results of most implementations. The reason for this experiment is to collect experimental experiences in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of graphity I realized that there is a huge gap between theoretical predictions and the real results. In this way I am convinced that this experiment is a good step forward and the deep understanding of actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life where I am running such an experiment without any new ideas coming up to tweak and tune. So I am expecting to have learnt a lot from the first experiment in order to have some creative ideas how to combine several data structures and distribution techniques in order to make a better (especially bigger scaling) distributed graph data base technology.

7. Analysis of multiple user access and ACID

One important fact of a distributed graph data base that was not in the focus of my research so far is the part that actually makes it a data base and sets it apart from some graph processing frame work. Even after finding a good data structure and distributed model there are new limitations coming once multiple user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more reasearch questions and ideas how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension to the 5. paragraph.

8. Again creative input for multiple user access and ACID

After heaving learnt what the best data structures for basic query operations in a distributed setting are and also what the best methods to achieve ACID are it is time for more creative input. This will have the goal to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease for ACID. Once this done everything is straight forward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for a realization of distributed graph data base. This will be done with respect to a good runtime of some basic graph queries (CRUD) respecting a standardized graph query algebra as well as muli user access and the paradigms of ACID.

11 Timetable and mile stones

This is a rough schedual fixing some of the major mile stones.

2012 / 04: hand in PhD proposal
2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
2012 / 10: experiments of basic CRUD operations done
2013 / 02: paper with results from basic CRUD operations done
2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
2013 /08: min 3 month research internship in a company benchmarking my system on real data
end of 2013: publishing the results
2014: 9 months of writing my dissertation

For anyone who has input, knows of papers or can point me to similar research I am more than happy if you could contact me or start the discussion!
Thank you very much for reading so far!