Even though the reading club on distributed graph databases stopped, I never really lost interest in the management of big data and graph data. Thanks to new research grants and some new members in our group, I decided to start a new reading club. (The first meeting will be Thursday, September 12th, at 15:30 Central European Time.) The reading club won't meet weekly but rather roughly once a month. Let me know if you want to join via Hangout or something similar! But I want to be clear: if you haven't carefully prepared the reading assignments by bringing questions and points for discussion to the meeting, then please don't join. I don't consider skimming a paper careful preparation.
The roadmap for the reading club on big data is quite clear: we will revisit some papers we have read before, but we will also dig deeper and check out some existing technologies. So the reading will not only consist of scientific work (though this will form the basis) but also of hands-on, practical sessions drawn from blogs, tutorials, documentation, and handbooks.
Here is the preliminary structure and roadmap for the reading club on big data, which of course may vary over time:
- Google File System (implemented in the Hadoop Distributed File System)
- Google MapReduce (implemented in Hadoop)
- Google BigTable (implemented in HBase)
- Google Pregel (implemented in Giraph)
- Amazon Dynamo (implemented in Cassandra)
- Along the way we will probably dig into some basics such as the Message Passing Interface, the gossip protocol, and the CAP theorem
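Since the first reading assignment covers the MapReduce paper, here is a minimal sketch of its programming model in plain Python. This is only a toy illustration: the function names `map_fn` and `reduce_fn` and the in-memory shuffle are my own, not Hadoop's actual API, and a real cluster would partition keys across machines.

```python
from collections import defaultdict

# Word count, the canonical example from the MapReduce paper.
# The user supplies a map function and a reduce function; the
# framework groups intermediate values by key ("shuffle") in between.

def map_fn(document):
    """Emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Sum all partial counts for one word."""
    return (word, sum(counts))

def mapreduce(documents):
    # Map phase: run map_fn over every input record.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)  # shuffle: group values by key
    # Reduce phase: run reduce_fn once per distinct key.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

docs = ["the gossip protocol", "the cap theorem"]
print(mapreduce(docs))
# {'the': 2, 'gossip': 1, 'protocol': 1, 'cap': 1, 'theorem': 1}
```

The point of the model is that both phases are embarrassingly parallel: map calls are independent of each other, and so are reduce calls for different keys, which is exactly why it scales out across machines.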
Along these lines, we want to understand:
- Why do these technologies scale?
- How do they handle concurrent traffic (especially write requests)?
- Can performance be increased, or is there another way of building such highly scalable systems?
- What kinds of applications (like Titan or Mahout) are built on top of these systems?
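As a taste of the question about concurrent write requests, here is a toy sketch of Dynamo-style quorum replication. The parameters N, R, and W are from the Dynamo paper; the in-memory replica list and helper functions are my own illustration, not Cassandra's API.

```python
# Toy model of Dynamo-style quorum reads and writes.
# With N replicas, a write must succeed on W of them and a read
# consults R. If R + W > N, every read quorum overlaps every write
# quorum, so a read always sees at least one up-to-date replica
# (version numbers then resolve which copy is newest).

N, R, W = 3, 2, 2
assert R + W > N  # the overlap condition from the Dynamo paper

replicas = [{"value": None, "version": 0} for _ in range(N)]

def write(value, version, nodes):
    """Apply a write to a write quorum of W replicas."""
    for i in nodes[:W]:
        replicas[i] = {"value": value, "version": version}

def read(nodes):
    """Read from a read quorum of R replicas, keep the newest version."""
    return max((replicas[i] for i in nodes[:R]), key=lambda r: r["version"])

# The write lands on replicas 0 and 1; the read consults 1 and 2.
# The quorums intersect at replica 1, so the read sees the new value.
write("v1", version=1, nodes=[0, 1, 2])
print(read(nodes=[1, 2])["value"])  # v1
```

Tuning R and W against N is exactly the knob systems like Cassandra expose to trade write throughput against read consistency, which is one concrete answer to how these systems handle concurrent write traffic.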
As stated above, the reading club will be much more hands-on than before; I expect us to also produce tutorials like the one on getting Nutch running on top of HBase and Solr.
Even though we want to get hands-on experience with current technologies, the goal is to understand the principles behind them and find ways of improving them, rather than just applying them to various problems.
I am considering starting a wiki page on Wikiversity to create something like a course on big data management, but I would only do this if I can find a couple of people who would actively help contribute to such a course. So please contact me if you are interested!
So to sum up: the reading assignments for the first meeting are the Google File System paper and the MapReduce paper.