From Graph (batch) processing towards a distributed graph data base

Yesterday’s meeting of the reading club was quite nice. We all agreed that the papers were of good quality and we gained some nice insights. The only drawback was that the papers did not directly tell us how to achieve our goal of a real-time distributed graph data base technology. For the next meeting (which will take place Wednesday March 7th 2pm CET) we tried to choose papers that don’t discuss these distributed graph / data processing techniques but focus more on speed or point out the general challenges in parallel graph processing.

Reading list for next meeting (Wednesday March 7th 2pm CET)

memcached paper: to understand how distributed shared memory works, which could essentially speed up approaches like Signal Collect.
Beehive: to see a p2p approach to graph distribution.
Challenges in parallel graph processing: for obvious reasons, since it points out the big picture.
http://www.boost.org/doc/libs/1_48_0/libs/graph_parallel/doc/html/index.html The Parallel Boost Graph Library is a general framework for parallel graph processing. In any case it is interesting and good to understand what is going on there.
Topology partitioning applied to SPARQL, HADOOP and TripleStores: shows how a speedup of 1000x can be achieved through smart partitioning of a graph.
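
To make the partitioning item a bit more concrete, here is a minimal Python sketch. It is not taken from the paper; the toy graph, the function names and the two-partition setup are all assumptions chosen for illustration. It only counts how many edges end up crossing partition boundaries under a naive modulo placement versus a placement that keeps densely connected nodes together.

```python
from typing import Callable, Iterable, Tuple

Edge = Tuple[int, int]

def edge_cut(edges: Iterable[Edge], assign: Callable[[int], int]) -> int:
    """Count edges whose two endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if assign(u) != assign(v))

# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by one bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

def modulo_partition(node: int) -> int:
    """Naive placement (assumption): node id modulo the number of partitions."""
    return node % 2

def community_partition(node: int) -> int:
    """Hand-picked placement (assumption): each triangle stays on one partition."""
    return 0 if node <= 2 else 1

if __name__ == "__main__":
    print("modulo cut:", edge_cut(EDGES, modulo_partition))
    print("community cut:", edge_cut(EDGES, community_partition))
```

Every cut edge is a potential network hop when a traversal follows it, so a placement with fewer cut edges should need less cross-machine communication. The sketch says nothing about the speedup figure reported in the paper.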

Again, while reading and preparing, feel free to add more reading wishes to the comments of this blog post or drop me a mail!

Summary of yesterday’s meeting

As written in the introduction, we agreed that the papers were interesting but not heading in our direction. Claudio pointed out that everyone should consider the following set of questions.

Do we want the graph to be mutable (writable), or is it supposed to be read-only?
- Writing makes sense. If it is read-only, it is called batch processing.
- Writing is hard: you have to care about locking and consistency.

Do we want to answer queries (Cypher/gremlin/whatever)?
Do we want to provide an API for processing?
How big is the data set we want to support?
- Many people work in memory.
- If you go to disk, you open up a whole new set of topics.
- One approach would be to solve the problem in memory first.
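
As a rough illustration of the "mutable, in memory first" option and of why writes bring locking into play, here is a minimal single-machine Python sketch. The class and method names are hypothetical, and one coarse lock is just the simplest possible choice, not a design we agreed on in the meeting.

```python
import threading
from typing import Dict, FrozenSet, Hashable, Set

class InMemoryGraph:
    """Directed graph kept entirely in memory, guarded by one coarse lock (assumption)."""

    def __init__(self) -> None:
        self._adj: Dict[Hashable, Set[Hashable]] = {}
        self._lock = threading.Lock()

    def add_edge(self, u: Hashable, v: Hashable) -> None:
        # Writers take the lock so concurrent updates cannot corrupt the adjacency sets.
        with self._lock:
            self._adj.setdefault(u, set()).add(v)
            self._adj.setdefault(v, set())  # make sure v exists as a node

    def remove_edge(self, u: Hashable, v: Hashable) -> None:
        with self._lock:
            self._adj.get(u, set()).discard(v)

    def neighbors(self, u: Hashable) -> FrozenSet[Hashable]:
        # Readers get an immutable snapshot, so later writes do not change it.
        with self._lock:
            return frozenset(self._adj.get(u, ()))

if __name__ == "__main__":
    g = InMemoryGraph()
    g.add_edge("a", "b")
    print(g.neighbors("a"))
```

A single global lock serializes all access, which is easy to reason about but does not scale. A distributed setting would need finer-grained locking or another consistency mechanism, which is exactly the hard part mentioned above.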

I am very confident that it was a good idea to start with graph processing and that we are now taking the right steps in the direction of real distributed graph data base systems. I think there are some more questions and high-level assumptions that one has to fix, which I will post on this blog in a few days. Sorry, I am in a hurry today and for the rest of the week.

Infrastructure

Schegi just suggested creating a mailing list for the reading club or switching to Google Groups. He pointed out that a private blog is kind of a weird medium to be so central. What is your opinion on that? Do we need some other / more formal infrastructure?

8 Comments

Claudio says:

February 23, 2012 at 2:40 pm

One thing: i don’t believe that bach processing read-only. Pregel/Giraph supports modifying the graph during computation and writing it at the end of it.
Second thing: i agree on the necessity of a different form of medium, i propose google groups+google docs.

Claudio says:
February 23, 2012 at 2:41 pm

Mhm, it ate some of my sentence:
It was originally: i don’t believe that batch processing means read-only and vice versa.

Patrick Durusau says:

February 24, 2012 at 1:10 am

As far as mailing list, etc., discussion on a variety of graph topics might be easier. But then your blog hasn’t been over-run with comments. 😉
Whatever is the easiest for you to maintain. If the group grows and continues, can always add more facilities later.
Hope you are having a great week!
Patrick

Rene says:

February 24, 2012 at 1:40 pm

Peter from Neotechnology at the very beginning of the reading club suggested to use http://www.meetup.com/ I just stumbled upon this on another website. Anyone has some thoughts on this?

Wishlist of features for a distributed graph data base technology says:

February 24, 2012 at 1:54 pm

[…] already found the time to look over our courrent reading assignments. Especially the VLDB paper (Topology partitioning applied to SPARQL, HADOOP and TripleStores) and […]

From Graph (batch) processing towards a distributed graph data base « Another Word For It says:

February 24, 2012 at 10:54 pm

[…] From Graph (batch) processing towards a distributed graph data base by René Pickhardt. […]

Jonas Kunze says:

February 26, 2012 at 5:02 pm

I just found a Master thesis about partitioning Graph Databases which was financed by neo4j. So this could be interesting for you!
http://dl.dropbox.com/u/1552664/Master%20Thesis%20-%20Partitioning%20Graph%20Databases%20-%202010%20-%20Alex%20Averbuch%20%26%20Martin%20Neumann.pdf
http://alexaverbuch.blogspot.com/

Related work of the Reading club on distributed graph data bases (Beehive, Scalable SPARQL Querying of Large RDF Graphs, memcached) says:

March 7, 2012 at 5:35 pm

[…] Today we finally had our reading club and discussed several papers from last week’s asignments. […]
