Yesterday's meeting of the reading club was quite nice. We all agreed that the papers were of good quality and that we gained some nice insights. The only drawback of the papers was that they did not directly tell us how to achieve our goal of a real-time distributed graph database technology. For the readings for the next meeting (which will take place Wednesday March 7th 2pm CET) we tried to choose papers that don’t discuss these distributed graph / data processing techniques but focus more on speed or point out the general challenges in parallel graph processing.
Reading list for next meeting (Wednesday March 7th 2pm CET)
- memcached paper: to understand how distributed shared memory works, which could significantly speed up approaches like Signal Collect.
- Beehive: to see a p2p approach to graph distribution.
- Challenges in parallel graph processing: for obvious reasons, since it lays out the big picture.
- http://www.boost.org/doc/libs/1_48_0/libs/graph_parallel/doc/html/index.html This Boost library is a general parallel graph processing framework. In any case it is interesting and good to understand what is going on there.
- Topology partitioning applied to SPARQL, HADOOP and TripleStores: shows how a speedup of 1000x can be achieved through smart partitioning of a graph.
Again, while reading and preparing, feel free to add more reading wishes in the comments of this blog post or drop me a mail!
Summary of yesterday's meeting
As written in the introduction, we agreed that the papers were interesting but not heading in our direction. Claudio pointed out that everyone should consider the following set of questions.
- Do we want the graph to be mutable (writable), or is it supposed to be read-only?
- Writing makes sense. If it is read-only, it is called batch processing.
- Writing is hard: you have to care about locking and consistency.
- Do we want to answer queries (Cypher/Gremlin/whatever)?
- Do we want to provide an API for processing?
- How big is the data set we want to support?
- Many people do it in memory.
- If you go to disk, you open up a whole new set of topics.
- One approach would be to solve the problem in memory first.
I am very confident that it was a good idea to start with graph processing, and that we are now taking the right steps toward real distributed graph database systems. I think there are some more questions and high-level assumptions that one has to fix, which I will post on this blog in a few days. Sorry, I am in a hurry today and for the rest of the week.
Infrastructure
Schegi just suggested creating a mailing list for the reading club or switching to Google Groups. He pointed out that a private blog is kind of a weird medium to be so central. What is your opinion on that? Do we need some other / more formal infrastructure?