To be clear: I am just dreaming. Such a system does not exist yet, and this list needs to be refined at a later stage.
- Fast traversals:
- Jumping from one vertex of the graph to an adjacent one should be possible in O(1)
- Online processing:
- “Standard queries” (whatever that means exactly) should complete within milliseconds.
- As an example: local recommendations, e.g. finding similar users in a bipartite “User – Band” graph, should be possible to process online in less than a second.
- Query language:
- A programming model that supports pattern matching and traversals starting from one (or possibly several) start nodes (a toy sketch of what I have in mind follows after this list).
- No SPARQL support is needed (it is too general for a reasonable graph application).
- Support for reading and writing new data (to disk)!
- Distribution effort:
- The programmer should not have to care about the distribution techniques.
- They should just be able to use the technology.
- Fault tolerance:
- The system has to keep running stably when machines are added or removed.
- Probably by introducing redundancy in some way [1]
- Persistence:
- Transactions and persistence are important for any database service.
It is very clear that this wish list is very high level. But I think these are reasonable assumptions from which we can break the problem down and discuss the pros and cons of all the techniques needed to build such a system.
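To make the traversal and query-language points a bit more concrete, here is a minimal single-machine sketch. Everything in it (the Graph class, the similar_users function, all vertex names) is made up by me for illustration and is not any existing system: an adjacency-list graph where following an edge is an O(1) dictionary operation, plus the “similar users” recommendation expressed as a purely local two-hop traversal on the bipartite User – Band graph.

```python
# Toy, single-machine sketch of the programming model I have in mind.
# Nothing here is an existing system; all names and the API are invented.
from collections import Counter, defaultdict


class Graph:
    def __init__(self):
        # vertex -> set of neighbours; following an edge is an (amortised)
        # O(1) dictionary/set operation ("index-free adjacency")
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def neighbours(self, v):
        return self.adj[v]


def similar_users(graph, user, top_k=5):
    """Rank other users by the number of bands they share with `user`.

    This is a purely local two-hop traversal (user -> bands -> users), so
    its cost depends on the degrees of the start vertex and its neighbours,
    not on the size of the whole graph; that is why such a query should be
    answerable online.
    """
    scores = Counter()
    for band in graph.neighbours(user):        # hop 1: user -> band
        for other in graph.neighbours(band):   # hop 2: band -> other user
            if other != user:
                scores[other] += 1
    return scores.most_common(top_k)


if __name__ == "__main__":
    g = Graph()
    g.add_edge("user:alice", "band:radiohead")
    g.add_edge("user:alice", "band:portishead")
    g.add_edge("user:bob", "band:radiohead")
    g.add_edge("user:carol", "band:radiohead")
    g.add_edge("user:carol", "band:portishead")
    print(similar_users(g, "user:alice"))
    # -> [('user:carol', 2), ('user:bob', 1)]
```

In a distributed version the interesting question is of course what happens when user and band vertices live on different machines; that is exactly where the redundancy discussion below comes in.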
[1] on the Redundancy discussion:
Depending on the techniques used, introducing redundancy probably has a positive effect on:
- Fast traversals
- Fault tolerance
On the other hand it has a deep impact on
- Persistence (which is hard to achieve in a distributed setting anyway, and even harder once redundancy is included).
It is not clear whether we really need redundancy. Maybe there are other techniques that would let us reach our goals, but I personally have the feeling that a good model for redundancy will “solve” the problem.
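To illustrate what I mean by a “model for redundancy”, here is a rough sketch of one possible technique, replicating boundary vertices; the names, the hashing and the replication rule are all invented for illustration. Every vertex has a primary partition, and in addition it is copied onto the partition of every neighbour across a cut edge, so a traversal over that edge can continue on a local copy, and a single failed machine does not necessarily make a vertex unreachable. The obvious price is that every write now has to reach all copies, which is exactly the persistence problem mentioned above.

```python
# Rough sketch of "ghost" replication across partition boundaries.
# Everything here (names, hashing, replication rule) is invented for
# illustration and is not the design of any existing system.
import hashlib
from collections import defaultdict


def primary_partition(vertex: str, num_partitions: int) -> int:
    # Deterministic hash so the example is reproducible across runs.
    return hashlib.md5(vertex.encode()).digest()[0] % num_partitions


def place_with_ghosts(edges, num_partitions):
    """Return partition -> set of vertices, replicating cut vertices.

    For every edge (u, v), both endpoints are stored on both endpoints'
    primary partitions, so a traversal over that edge never needs a
    remote lookup.
    """
    placement = defaultdict(set)
    for u, v in edges:
        pu = primary_partition(u, num_partitions)
        pv = primary_partition(v, num_partitions)
        placement[pu].update((u, v))   # u's partition also keeps a copy of v
        placement[pv].update((u, v))   # v's partition also keeps a copy of u
    return placement


if __name__ == "__main__":
    edges = [
        ("user:alice", "band:radiohead"),
        ("user:bob", "band:radiohead"),
        ("user:alice", "band:portishead"),
    ]
    for part, vertices in sorted(place_with_ghosts(edges, 2).items()):
        print(part, sorted(vertices))
```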
Relation to the reading club
I have already found the time to look over our current reading assignments. Especially the VLDB paper (topology partitioning applied to SPARQL, HADOOP and TripleStores) and the paper on challenges in parallel graph processing strengthen my confidence that the approach described above is very reasonable.
What is your opinion?
Do you think I am missing some features, or should I keep the focus on one particular feature? What about methods to achieve these goals? I am happy to discuss your thoughts!