Today we finally had our reading club and discussed several papers from last week’s assignments.
Before I give my usual summary I want to introduce our new infrastructure for the reading club. Go to:
http://related-work.rene-pickhardt.de/
There you can find a question and answer system which we will use to discuss questions and answers about the papers. Thanks to the built-in voting system we think this is much more convenient than a closed, unstructured mailing list. I hope this is of help to the entire community, and I can only invite everyone to read and discuss with us on http://related-work.rene-pickhardt.de/
Reading list for next meeting Wed March 14th 2 pm CET
- Boost Graph Library documentation as well as the original paper.
- Papers about the Gossip protocol: Gossip based aggregation in large dynamic networks, Ordered slicing of very large-scale overlay networks and a Gossip based monitoring system
- Parallel graph distribution algorithms: A distributed heuristic for clustering a virtual p2p supercomputer
- Message Passing Interface original paper – can anyone provide a link for download? I cannot find it!
- Partitioned global address space model (PGAS) – I can’t find the original papers, only the Wikipedia article
- Many Core Key Value Stores (Facebook paper)
We first discussed the memcached paper:
One of the first topics we discussed was how the dynamic hashing is done. We also wondered how DHTs handle overloading in general; the memcached paper does not discuss this in much detail. Schegi knows a good paper that explains the dynamics behind DHTs and will provide the link soon.
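For context, memcached clients typically map keys to servers with consistent hashing, which also limits how many keys move when a node joins or leaves. A minimal sketch of that idea (class and server names are made up for illustration, not taken from the paper):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal consistent-hash ring: each server gets several virtual
    points on the ring so keys spread evenly and rebalance smoothly
    when servers join or leave. Illustrative sketch only."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, server) pairs
        for s in servers:
            self.add(s)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        # place `replicas` virtual nodes for this server on the ring
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def lookup(self, key):
        # walk clockwise to the first virtual node at or after the key
        h = self._hash(key)
        hashes = [point[0] for point in self.ring]
        idx = bisect_right(hashes, h) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
server = ring.lookup("user:42")  # the node responsible for this key
```

When a node is added, only the keys falling into its new ring segments move, which is exactly the property a dynamically growing DHT needs.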
Afterwards we discussed what would happen if a distributed key-value store like memcached were used to implement a graph store. Obviously creating a graph store on the key-value model is possible, and memcached is very fast in its lookups. One could add another persistence layer to memcached that would enable disk writes.
We think the main counter arguments are:
- In this setting the graph is distributed to worker nodes randomly.
- No performance gain from graph partitioning is possible.
We realized that we should really read up on distributed graph distribution.
When using memcached you can store much more than an adjacency list in the value of one key, in this way reducing the number of lookups needed.
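To make the key-value-as-graph-store idea concrete, here is a minimal sketch of a graph store on top of a memcached-style get/set API. A plain dict stands in for the actual memcached client, and the class name is hypothetical:

```python
import json

class KVGraphStore:
    """Sketch of a graph store on a key-value model: each key holds a
    whole adjacency list as its value, so one lookup returns a full
    neighborhood. A dict stands in for a real memcached client."""

    def __init__(self):
        self.kv = {}  # replace with a memcached client in practice

    def add_edge(self, u, v):
        # undirected edge: update the adjacency list on both sides
        for a, b in ((u, v), (v, u)):
            neighbors = json.loads(self.kv.get(a, "[]"))
            if b not in neighbors:
                neighbors.append(b)
            self.kv[a] = json.dumps(neighbors)

    def neighbors(self, u):
        # one get() fetches the entire neighborhood of u
        return json.loads(self.kv.get(u, "[]"))
```

The sketch also shows the counter-argument from the list above: the key hashing decides where each adjacency list lives, so neighboring nodes end up on random workers.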
Again I pointed out that separating the data model from the data storage could help substantially. I will soon write an entire blog article about this idea in the setting of relational / graph models and relational database management systems.
Personally I am still convinced that memcached could be used to improve asynchronous message passing in distributed systems like Signal/Collect.
Scalable SPARQL Query of Large RDF graphs:
We agreed that one of the core principles in this paper is that they remove supernodes (everything connected via RDF type) in order to obtain a much sparser graph, and only then do the partitioning (which speeds up computation a lot). Afterwards they add the supernodes redundantly to all workers where they could be needed. This methodology could generalize pretty well to arbitrary graphs: look at the node degrees, remove the x% of nodes with the highest degree from the graph, run a clustering algorithm, and then add the supernodes redundantly to the workers.
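The generalization we discussed can be sketched in a few lines. The round-robin partitioner below is just a placeholder for a real clustering algorithm, and all names are made up:

```python
def partition_with_supernodes(adj, top_fraction=0.01, n_workers=4):
    """Sketch: strip the top_fraction highest-degree nodes
    (the supernodes), partition the remaining sparser graph,
    then replicate the supernodes to every worker.
    `adj` maps node -> list of neighbors."""
    by_degree = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    k = max(1, int(len(by_degree) * top_fraction))
    supernodes = set(by_degree[:k])
    rest = by_degree[k:]

    # placeholder partitioner: round-robin instead of real clustering
    workers = [set() for _ in range(n_workers)]
    for i, node in enumerate(rest):
        workers[i % n_workers].add(node)

    # redundancy: every worker gets a copy of the supernodes
    for w in workers:
        w.update(supernodes)
    return supernodes, workers
```

The point of the sketch is the order of operations: partitioning runs on the sparse remainder, and the supernodes are duplicated afterwards rather than partitioned.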
Thomas pointed out a drawback of this paper: the clustering itself was not done with a distributed algorithm, even though they otherwise relied on a framework like MapReduce.
Beehive:
We all agreed that the Beehive paper solves its problem with a really great methodology: first looking at the query distribution and then using proactive caching strategies. The interesting point is that they create an analytical model which they can solve in closed form. The p2p protocols are enhanced by gossip messages that distribute the parameters of the protocol. In this way an adaptive system is created which adjusts its caching strategy once the queries change.
We thought that the Beehive approach could be generalized to various settings. In particular, it might be possible to analyze not only Zipf distributions but also other query distributions and derive various analytical models, which could even coexist in such a system.
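To get a feeling for why Zipf-distributed queries make proactive caching so effective, here is a toy calculation (not Beehive's actual closed-form model) that finds how many of the most popular objects must be replicated to reach a given hit rate:

```python
def zipf_hot_set(n_objects, alpha, target_hit_rate):
    """Illustrative brute-force version of the trade-off Beehive
    solves analytically: under a Zipf(alpha) query distribution,
    return the smallest prefix of objects (by popularity rank) whose
    cumulative query mass reaches target_hit_rate."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_objects + 1)]
    total = sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights, start=1):
        cumulative += w
        if cumulative / total >= target_hit_rate:
            return i  # replicating the i most popular objects suffices
    return n_objects
```

For a heavy-tailed distribution, replicating a small fraction of the objects already answers a large share of the queries, which is the lever the proactive caching strategy pulls; other query distributions would simply yield different `weights`.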
You can find our questions and thoughts and join our discussion about Beehive online!
Challenges in parallel Graph processing:
Unfortunately we did not really have the time to discuss this – in my opinion – great paper. I created a discussion in our new question board, so feel free to discuss this paper at: http://related-work.rene-pickhardt.de/questions/13/challenges-in-parallel-graph-processing-what-do-you-think