Reading Club Management of Big Data

Rene — Thu, 05 Sep 2013 10:01:28 +0000

Even though the reading club on distributed graph data bases stopped I never really lost interest in management of big data and graph data. Due to the development of research grants and some new workers in our group I decided to create a new reading club. (The next and first meeting will be Thursday September 12th 15:30 central European time.) The reading club won’t be on a weekly basis but rather something like once a month. Tell me if you want to join via hangout or something similar! But I would like to be clear: If you didn’t carefully prepare the reading assignments by bringing questions and points for discussion to the meeting then don’t join the meeting. I don’t consider skimming a paper as a careful preparation.
The road map for the reading club on big data is quite clear: We will read again some papers that we read before but we will also look deeper and check out some existing technologies. So the reading will not only consist of scientific work (though this will build up the basis) but it will also consist of hand on and practical sessions which we obtain from reading blogs, tutorials, documentation and hand books.
Here will be the preliminary structure and road map for the reading club on big data which of course could easily vary over time!

google file system (implemented in hadoop file system)
map reduce (implemented in hadoop)
google big table (implemented in hbase)
google pregel (implemented in giraph)
amazon dynamo (imiplemented in cassandra)
On the way we will probably dig into some basics like: Message passing interface or the gossip protocol or the CAP-Theorem

Along these lines we want to understand

Why do these technologies scale?
How do they handle concurrent traffic (especially write requests)?
How performance can be increased if there is another way of building up such highly scalable systems?
What kind of applications (like titan or mahout) are build on top of these systems?

At some point I would also love to do some side reading on distributed algorithms and distributed and parallel algorithm and data structure design.

As stated above the reading club will be much more hand on in future than before I expect us to also deliver tutorials like that one on getting Nutch running on top of HBase and Solr.
Even though we want to get hands on in current technologies the goal is rather to understand the principles behind them and find ways of improving them instead of just applying them to various problems.
I am considering to start a wikipage on wikiversity to create something like a course on big data management but I would only do this if I find a couple of people who would actively help to contribute to such a course. So please contact me if you are interested!
So to sum up the reading assignment for the first meeting are the Google file system and the map reduce paper.

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons

Rene — Sun, 19 Feb 2012 17:05:49 +0000

One of the reading club assignments was to read the paper about Google Pregel and Signal Collect, compare them and point out pros and cons of both approaches.
So after I read both papers as well as Claudios overview on Pregel clones and took some notes here are my thoughts but first a short summary of both papers.

Summary of Google Pregel

The methodology is heavily based on Bulk Sychronous Parallel Model (BSP) and also has some similarties to MapReduce (with just one superstep). The main idea is to spread the data over several machines and introduce some supersteps. For each superstep every vertex of the graph calculates a certain function that is given by the programmer.
This enables one to process large graphs which are distributed over several machines. The paper describes how to use Checkpoints to increase fault tolerance and also how to make good use of the Google File System in order to partition the graph data on the workers. The authors mention that smarter hashing functions could help to distribute the vertices not randomly but rather in a way they are connected on the graph which could potentially increase performance.
Overall the goal of Google Pregel seems to enable one to process large graph data and gain knowledge from it. The focus does not seem to increase the usage of the calculation power of the distributed system efficiently. In stead it rather seems to create a system that makes distribution of data – that will not fit into one machine – possible at a decent speed and without much effort for the programmer by introducing methods for increasing fault tolerance.

Summary of Signal Collect

Signal Collect as a system is pretty similar to Google Pregel. The main difference is that the authors introduce a threshold score which is used to decide weather a node should collect its signals or weather it should send signals. Using this score the processing of algorithms can be accelerated in a way that for every super step only signals and collects are performed if a certain threashhold is hit.
From here the authors say that one can get rid of the superstep model and make the entire calculation asynchronous. This is done by introducing randomization on the set of vertices on which signal and collect computations have to be computed (as long as the threasholdscores are overcome)
The entire system is implemented on a single machine but the vertices of the compute graph are processed by different workers (in this setting Threads). All Threads are able to share the main memory of the system which makes message passing of Signal and Collect computations unnecessary. The authors show how the increasing number of workers actually antiproportionally lower the runtime of the algorithm in the asynchronous setting. They also give evidence that different scheduling strategies seem to fit the needs for different graphs or algorithms.

Discussion of Pros and Cons

From the text above it seems very obvious that Signal Collect with its Asynchronous Programming model seems superior. But – in opposite to the authors – I have hidden to mention the drawbacks of one small but important detail. The fact that all the workers share a common knowledge which they can access due random access in main memory of the machine allows their model to be so fast while being asynchronous. It is not clear how to maintain this speed with a real distributed system. So in this way Signal Collect only give a proof of concept that an abstract programming model for graph processing exists and it enables fast distribution in theory.
Pregel actually is a real frame work that can really achieve distribution of large data to clusters of several thousand machines which for sure is a huge pro.
Signal Collect proposes to be more general than Pregel since Pregel can only respect one vertex type and edges are stored implicitly. Whereas Signal Collect is able to store RDF Graphs. I personally understand that Signal Collect can only send signals from one vertex to another if and edge exists and is also not able to add or remove edges or vertices. In this sense I still think that Pregel is the more general system. But I guess one can still argue on my point of view.
Pregel’s big drawbacks in my opinion are that the system is not optimized for speed. As already discussed in the last meeting of the reading club Map Reduce – with its one Superstep attitude – is able to start Backup tasks towards the end of the computation in order to fight stragglers. Pregel has to wait for those stragglers in every superstep in order to make synchronous Barriers possible.
Another point that is unique with Pregel is the deep integration with Google File System (btw. I am almost through the google file system paper and even if you already know of the idea it is absolutely worthwhile reading it and understanding the arguments for the design decisions of the google file system). So far I am not sure weather this integration is a strong or a weak point. This is due to the fact that I can’t see all the implications. However it gives strenght to my argument that for a distributed system some things like network protocols and file systems should be considered since they seem to have a strong impact on the entire system.
Both systems in my opinion fail to consider partitioning of the graph and a different network protocol as an important task. Especially for Pregel I do not understand this since it already has so much network traffic. Partitioning the graph might increase start up Traffic on the one hand but could increase overall traffic on the long term.

Outlooks and personal thoughts:

I am considering to invite the authors of both papers to next weeks reading club. It would be even more interesting to discuss these and other questions directly with the guys who built that stuff.
Also I like Schegi’s idea to see what happens if one actually runs several neo4j servers on different machines and just use a model similar to Signal Collect or Pregel to perform some computations. In this way a programming model could be given and research on the core distribution framework – relying on good technologies for the workers – could be done.
For the development of the first version of metalcon we used memcached. I read a lot that memcached scales perfectly horizontal over several machines. I wonder how an integration of memcached to Signal Collect would work in order to make the asynchronous computation possible in a distributed fashion. Since random access memory is a bottleneck in any application I suggest to put the original memcached paper on our reading list.
One last point to mention is that both systems still don’t seem to be useful as a technology to built a distributed graph data base which enables online query processing.

Pregel – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany