To be clear: I am just dreaming. Such a system does not exist yet, and this list needs to be refined at a later stage.
- Fast traversals:
- Jumping from one vertex of the graph to an adjacent one should be possible in O(1)
- Online processing:
- “Standard queries” (whatever that means exactly) should complete within milliseconds.
- As an example: local recommendations, e.g. finding similar users in a bipartite “User – Band” graph, should be possible to process online in less than a second.
- Query language:
- A programming model that supports pattern matching and traversals starting from one (or possibly several) start nodes (a toy sketch of what I have in mind follows after this list).
- No SPARQL support is needed (it is too general for a reasonable graph application).
- Support for reading and writing new data (to disk)!
- Distribution effort:
- The programmer should not have to care about the distribution techniques.
- They should just be able to use the technology.
- Fault tolerance:
- The system has to keep running stably when machines are added or removed.
- Probably by introducing redundancy in some way [1]
- Persistence:
- Transactions and persistence are important for any database service.
It is very clear that this wish list is very high level. But I think these are reasonable assumptions from which we can break the problem down and discuss the pros and cons of all the techniques needed to build such a system.
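To make the traversal and query-language points a bit more concrete, here is a minimal single-machine sketch. Everything in it (the Graph class, the similar_users function, all vertex names) is made up by me for illustration and is not any existing system: an adjacency-list graph where following an edge is an O(1) dictionary operation, plus the “similar users” recommendation expressed as a purely local two-hop traversal on the bipartite User – Band graph.

```python
# Toy, single-machine sketch of the programming model I have in mind.
# Nothing here is an existing system; all names and the API are invented.
from collections import Counter, defaultdict


class Graph:
    def __init__(self):
        # vertex -> set of neighbours; following an edge is an (amortised)
        # O(1) dictionary/set operation ("index-free adjacency")
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def neighbours(self, v):
        return self.adj[v]


def similar_users(graph, user, top_k=5):
    """Rank other users by the number of bands they share with `user`.

    This is a purely local two-hop traversal (user -> bands -> users), so
    its cost depends on the degrees of the start vertex and its neighbours,
    not on the size of the whole graph; that is why such a query should be
    answerable online.
    """
    scores = Counter()
    for band in graph.neighbours(user):        # hop 1: user -> band
        for other in graph.neighbours(band):   # hop 2: band -> other user
            if other != user:
                scores[other] += 1
    return scores.most_common(top_k)


if __name__ == "__main__":
    g = Graph()
    g.add_edge("user:alice", "band:radiohead")
    g.add_edge("user:alice", "band:portishead")
    g.add_edge("user:bob", "band:radiohead")
    g.add_edge("user:carol", "band:radiohead")
    g.add_edge("user:carol", "band:portishead")
    print(similar_users(g, "user:alice"))
    # -> [('user:carol', 2), ('user:bob', 1)]
```

In a distributed version the interesting question is of course what happens when user and band vertices live on different machines; that is exactly where the redundancy discussion below comes in.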
[1] on the Redundancy discussion:
Depending on the techniques used, introducing redundancy probably has a positive effect on:
- Fast traversals
- Fault tolerance
On the other hand it has a deep impact on
- Persistence (which is hard to achieve in a distributed setting anyway, and even harder once redundancy is included).
It is not clear whether we really need redundancy. Maybe there are other techniques that would let us reach our goals, but I personally have the feeling that a good model for redundancy will “solve” the problem.
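To illustrate what I mean by a “model for redundancy”, here is a rough sketch of one possible technique, replicating boundary vertices; the names, the hashing and the replication rule are all invented for illustration. Every vertex has a primary partition, and in addition it is copied onto the partition of every neighbour across a cut edge, so a traversal over that edge can continue on a local copy, and a single failed machine does not necessarily make a vertex unreachable. The obvious price is that every write now has to reach all copies, which is exactly the persistence problem mentioned above.

```python
# Rough sketch of "ghost" replication across partition boundaries.
# Everything here (names, hashing, replication rule) is invented for
# illustration and is not the design of any existing system.
import hashlib
from collections import defaultdict


def primary_partition(vertex: str, num_partitions: int) -> int:
    # Deterministic hash so the example is reproducible across runs.
    return hashlib.md5(vertex.encode()).digest()[0] % num_partitions


def place_with_ghosts(edges, num_partitions):
    """Return partition -> set of vertices, replicating cut vertices.

    For every edge (u, v), both endpoints are stored on both endpoints'
    primary partitions, so a traversal over that edge never needs a
    remote lookup.
    """
    placement = defaultdict(set)
    for u, v in edges:
        pu = primary_partition(u, num_partitions)
        pv = primary_partition(v, num_partitions)
        placement[pu].update((u, v))   # u's partition also keeps a copy of v
        placement[pv].update((u, v))   # v's partition also keeps a copy of u
    return placement


if __name__ == "__main__":
    edges = [
        ("user:alice", "band:radiohead"),
        ("user:bob", "band:radiohead"),
        ("user:alice", "band:portishead"),
    ]
    for part, vertices in sorted(place_with_ghosts(edges, 2).items()):
        print(part, sorted(vertices))
```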
Relation to the reading club
I have already found the time to look over our current reading assignments. Especially the VLDB paper (topology partitioning applied to SPARQL, HADOOP and TripleStores) and the paper on challenges in parallel graph processing strengthen my confidence that the approach described above is very reasonable.
What is your opinion?
Do you think I am missing some features, or should I keep the focus on one particular feature? What about methods to achieve these goals? I am happy to discuss your thoughts!