A Survey on Graph Databases
Graph databases were also discussed in my previous entry about NoSQL databases. Two other valuable surveys of graph databases are a post on ReadWriteWeb and a page on DBPedias. Those take a top-down view, from the conceptual side and the framework side respectively; here I start from the bottom, looking at how each product is manipulated and what functions it offers. This entry also covers more products than either of them.
In graph theory, a simple graph is a set of nodes and a set of edges. This definition is fundamental, but graph databases usually add types and attributes to both nodes and edges to make the model more descriptive and practical. At a minimum, a graph database is expected to support fast traversal. This is why we do not simply store all the edges in a tabular database such as HBase or Cassandra: each hop of a traversal would then become a join, and joins are expensive.
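The cost difference can be sketched in plain Java. This is only an illustration under my own assumptions: the class and method names are made up and belong to no product. One lookup scans a flat edge table on every hop, while the other follows an adjacency list that is built once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: contrasts an edge table with an adjacency list.
class EdgeLookupSketch {

    // One row of a flat edge table: (source, target).
    static final class EdgeRow {
        final String source;
        final String target;

        EdgeRow(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    // Tabular style: every hop scans all rows, so the cost grows with the edge count.
    static List<String> neighborsByScan(List<EdgeRow> table, String node) {
        List<String> result = new ArrayList<String>();
        for (EdgeRow row : table) {
            if (row.source.equals(node)) {
                result.add(row.target);
            }
        }
        return result;
    }

    // Graph style: build the adjacency list once ...
    static Map<String, List<String>> buildAdjacency(List<EdgeRow> table) {
        Map<String, List<String>> adjacency = new HashMap<String, List<String>>();
        for (EdgeRow row : table) {
            List<String> targets = adjacency.get(row.source);
            if (targets == null) {
                targets = new ArrayList<String>();
                adjacency.put(row.source, targets);
            }
            targets.add(row.target);
        }
        return adjacency;
    }

    // ... then each hop is a single hash lookup, whatever the total edge count.
    static List<String> neighborsByAdjacency(Map<String, List<String>> adjacency, String node) {
        List<String> targets = adjacency.get(node);
        return targets == null ? Collections.<String>emptyList() : targets;
    }
}
```

Both methods return the same neighbors; only the work per hop differs, and that difference multiplies along a multi-hop path.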
In the previous entry we said that graph databases are one of the four major categories of NoSQL databases, and listed seven products in the graph store category: Neo4J, Infinite Graph, DEX, InfoGrid, HyperGraphDB, Trinity and AllegroGraph. We discuss each of them in detail in this entry, mainly from the perspective of a Java programmer who wants to use them.
1. Neo4J (Neo Technology)
Neo4J may be the most popular graph database. As the name suggests, Neo4J is developed primarily for Java applications, but it also supports Python. Neo4J is an open source project: the Community edition is available under GPLv3, and the Advanced and Enterprise editions are available under both AGPLv3 and a commercial license.
The graph model in Neo4J is shown in Figure 1. In short:
- A property (key-value pair) can be added to both nodes and edges;
- Only edges can be associated with a type, e.g., “KNOWS”;
- Every edge is stored with a direction, but a traversal can follow it in either direction, so it can also be treated as undirected.
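The model above can be sketched in a few lines of plain Java. This is not Neo4J's API: the classes Node and Edge and the helper connect are hypothetical names chosen only to make the three points concrete.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a property graph. All names are made up for
// illustration; none of these classes belong to Neo4J.
class PropertyGraphSketch {

    static class Node {
        // Nodes carry key-value properties but no type.
        final Map<String, Object> properties = new HashMap<String, Object>();
        // Every edge that touches this node, in either direction.
        final List<Edge> edges = new ArrayList<Edge>();
    }

    static class Edge {
        final Node start;
        final Node end;
        final String type;  // only edges carry a type, e.g. "KNOWS"
        final Map<String, Object> properties = new HashMap<String, Object>();

        Edge(Node start, Node end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Creates an edge and registers it on both endpoints, so it can be
    // followed from either side when direction does not matter.
    static Edge connect(Node start, Node end, String type) {
        Edge edge = new Edge(start, end, type);
        start.edges.add(edge);
        end.edges.add(edge);
        return edge;
    }

    public static void main(String[] args) {
        Node alice = new Node();
        alice.properties.put("name", "Alice");
        Node bob = new Node();
        bob.properties.put("name", "Bob");

        Edge knows = connect(alice, bob, "KNOWS");
        knows.properties.put("since", 2011);

        System.out.println(alice.properties.get("name") + " -" + knows.type
                + "-> " + bob.properties.get("name"));
    }
}
```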
Given the name of a node, locating that node in the graph requires an index. Neo4J uses the following index mechanism: a super node, the referenceNode, is connected to all the nodes by a special edge type “REFERENCE”. This lets you create multiple indexes by distinguishing them with different edge types. The index structure is illustrated in Figure 2.
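The idea of a reference-node index can be sketched as follows. Again this is plain Java under my own assumptions, not Neo4J code: the methods index and find and the index name "REFERENCE_BY_CITY" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an index built from edges that leave a single reference node.
// All names are illustrative assumptions.
class ReferenceNodeIndexSketch {

    static class Node {
        final String name;
        final List<Edge> outgoing = new ArrayList<Edge>();

        Node(String name) {
            this.name = name;
        }
    }

    static class Edge {
        final String type;
        final Node target;

        Edge(String type, Node target) {
            this.type = type;
            this.target = target;
        }
    }

    // Registers a node in an index by linking it from the reference node.
    // The edge type names the index, so several indexes can coexist.
    static void index(Node referenceNode, String indexType, Node node) {
        referenceNode.outgoing.add(new Edge(indexType, node));
    }

    // Looks a node up by name, following only edges of the requested index type.
    // Returns null when no such node is registered in that index.
    static Node find(Node referenceNode, String indexType, String name) {
        for (Edge edge : referenceNode.outgoing) {
            if (edge.type.equals(indexType) && edge.target.name.equals(name)) {
                return edge.target;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node reference = new Node("referenceNode");
        Node alice = new Node("Alice");
        index(reference, "REFERENCE", alice);
        System.out.println(find(reference, "REFERENCE", "Alice") == alice);
    }
}
```

Note that the lookup in this sketch is linear in the number of indexed nodes; it shows the structure of the index, not how a real engine would make the lookup fast.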
Neo4J also provides functions such as getting the neighbors of a specific node or finding all the shortest paths between two nodes. Note that all of these “traverse” functions require you to specify the edge types allowed along the path, which is handy.
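To show what restricting a traversal by edge type means, here is a minimal breadth-first search in plain Java. It is a sketch under my own assumptions and does not use Neo4J: it returns one shortest path rather than all of them, follows edges only in their stored direction, and the names shortestPath and allowedTypes are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Sketch of a type-restricted traversal. Names are illustrative assumptions.
class TypedTraversalSketch {

    static class Edge {
        final String from;
        final String to;
        final String type;

        Edge(String from, String to, String type) {
            this.from = from;
            this.to = to;
            this.type = type;
        }
    }

    // Breadth-first search that only follows edges whose type is in allowedTypes.
    // Returns the node names along one shortest path, or an empty list if the
    // goal cannot be reached through the allowed edge types.
    static List<String> shortestPath(List<Edge> edges, Set<String> allowedTypes,
                                     String start, String goal) {
        Map<String, String> previous = new HashMap<String, String>(); // node -> predecessor
        Set<String> visited = new HashSet<String>();
        Queue<String> queue = new ArrayDeque<String>();
        visited.add(start);
        queue.add(start);

        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                break;
            }
            // Scanning the whole edge list keeps the sketch short; a real
            // engine would read the adjacency of the current node instead.
            for (Edge edge : edges) {
                if (edge.from.equals(current)
                        && allowedTypes.contains(edge.type)
                        && visited.add(edge.to)) {
                    previous.put(edge.to, current);
                    queue.add(edge.to);
                }
            }
        }

        if (!visited.contains(goal)) {
            return Collections.<String>emptyList();
        }
        // Walk the predecessor links back from the goal, then reverse.
        List<String> path = new ArrayList<String>();
        for (String node = goal; node != null; node = previous.get(node)) {
            path.add(node);
        }
        Collections.reverse(path);
        return path;
    }
}
```

With the same graph, changing the set of allowed edge types changes which path is found, which is exactly why it is useful to have it as a parameter.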
Neo4J does not need to be installed as a separate piece of software. We can simply import the JAR file to build an embedded graph database, which is persisted on disk as a directory. The documentation of Neo4J looks complete, and the free version places no limit on the number of nodes. Two things could be improved:
- Although we can manually add a property with the key “type” to annotate the type of a node, native support for node types in the API would make the graph model more general. The workaround also runs into trouble when a node has multiple types.
- Building an index by manually adding edges seems strange and inconvenient. It would be better to follow what relational databases do: the user says “create index on a group of nodes”, and it is done.
Here is another entry about how to get started with Neo4J in Java: http://jasperpeilee.wordpress.com/2011/11/22/neo4j-the-first-cup-of-tea/
2. Infinite Graph (Objectivity Inc.)
InfiniteGraph is a graph database from Objectivity, the company behind the object database of the same name. The free license supports only up to 1 million nodes and edges. InfiniteGraph has to be installed as a service, and behaves like a traditional database server such as MySQL. It borrows object-oriented concepts from Objectivity/DB, so every node and edge in InfiniteGraph is an object. Specifically,
- All node classes will extend the base class BaseVertex;
- All edge classes will extend the base class BaseEdge.
In the example at http://wiki.infinitegraph.com/w/index.php?title=Tutorial:_Hello_Graph!, Person is a node class and Meeting is an edge class. The following code creates two nodes and the edge object that will connect them:

    // Create two Person vertices and add them to the graph database.
    Person john = new Person("John", "Hello ");
    helloGraphDB.addVertex(john);
    Person dana = new Person("Dana", "Database!");
    helloGraphDB.addVertex(dana);
    // Create the Meeting edge object that will link john and dana.
    Meeting meeting1 = new Meeting("NY", "Graph");
InfiniteGraph also provides a visualization tool for viewing the data. The edge generated by the above code is visualized in Figure 3. Compared with the graph model of Neo4J in Figure 1, InfiniteGraph supports nodes of different types/classes. Note that the key-value properties in Neo4J correspond to member variables in InfiniteGraph classes. Two suggestions:
- Installing as a service is fine, but the configuration should be simple.
- Since nodes and edges can be user-defined objects, I suspect this flexibility will hurt performance on huge graphs. Remember that NoSQL databases must keep performance high to stay compelling.
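The correspondence between key-value properties and member variables can be made concrete with a small plain-Java sketch. Everything here is an assumption for illustration: SketchVertex is a stand-in base class, not InfiniteGraph's BaseVertex, and the field names name and greeting are my guess at what the two constructor arguments in the tutorial mean.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch contrasting the two modelling styles. All names are hypothetical.
class NodeStyleSketch {

    // Property style: one generic node class, attributes live in a map.
    static class PropertyNode {
        final Map<String, Object> properties = new HashMap<String, Object>();
    }

    // Object style: a stand-in base class that typed node classes extend.
    static abstract class SketchVertex {
    }

    // A typed node: the attributes are ordinary member variables.
    static class Person extends SketchVertex {
        private final String name;
        private final String greeting;

        Person(String name, String greeting) {
            this.name = name;
            this.greeting = greeting;
        }

        String getName() {
            return name;
        }

        String getGreeting() {
            return greeting;
        }
    }

    // Maps a typed node onto the property style: each member variable becomes
    // one key-value pair, and the class name is stored under the key "type".
    static PropertyNode toPropertyNode(Person person) {
        PropertyNode node = new PropertyNode();
        node.properties.put("type", "Person");
        node.properties.put("name", person.getName());
        node.properties.put("greeting", person.getGreeting());
        return node;
    }
}
```

The typed style gives compile-time checking of attribute names, while the map style accepts any key at run time; that trade-off is the flexibility discussed above.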
Note: My experience of getting started with InfiniteGraph on 64-bit Windows 7 was not smooth. The configuration described at http://wiki.infinitegraph.com/w/index.php?title=InfiniteGraph_Installation seems incomplete, and the Java programs kept throwing a “….dll: Can’t find dependent libraries” error. I then checked the dependencies of that DLL file with Dependency Walker; its error “Modules with different CPU types were found” suggests that InfiniteGraph probably does not support a 64-bit OS. Finally, I switched to 64-bit Ubuntu, only to find that InfiniteGraph provides Linux versions only for Red Hat/SUSE.
3. DEX (Sparsity Technologies)
DEX is said to be a high-performance and scalable graph database, which is attractive for NoSQL applications. The personal evaluation version supports up to 1 million nodes. The current version is 4.2, and it supports both Java and .NET programming. Note that the old version 4.1 supports only Java and is not compatible with the new version. As of today, Nov. 24, 2011, the documentation for the new version 4.2 is not yet complete, and it is very hard to find a starter example for the new version on the web. The migration file here is very helpful for writing programs based on examples for the old version.
Figure 4 shows the architecture of DEX, which explains why DEX can achieve high performance: the native C++ DEX Core is the key. On the events page, the team shows some exciting applications based on DEX:
- Bibliographic exploration: a use case of DEX by storing all DBLP data (demo);
- Twitter loaded into DEX: the 4.5 billion graph;
- Wikipedia Loaded into DEX and Query: obviously better than Neo4J.
DEX is also portable: you only need a JAR file to run it. Unlike Neo4J, DEX persists the database as a single file. The DEX Java API is easy to use, and the class Graph provides nearly all the operations you need. To make DEX stronger, the following weak points should be addressed:
- Raise the limit of the personal version to 1 billion nodes;
- Provide more complete documentation with good examples;
- Port the graph algorithms from the old version to the new version in the near future.
Here is a new entry about how to deploy your graph with DEX.
4. InfoGrid (Netmesh Inc.)
InfoGrid calls itself a “web graph database”, so some of its functions are oriented toward web applications. Figure 5 shows the overall framework of InfoGrid, in which the graph database does not seem to be the dominant component. InfoGrid has some applications in the OpenID project, which is supported by the same company. I suspect InfoGrid is used only internally at Netmesh, because of the following weaknesses:
- The newest Java API documentation here is incomplete and sometimes confusing;
- The tutorial here is not written in a clear and formal way.
The first step example at http://infogrid.org/wiki/Examples/FirstStep is not hard to read overall, but enums such as TAGLIBRARY, TAG, TAG_LABEL and TAGLIBRARY_COLLECTS_TAG are really confusing. These enums seem to be embedded in the model, and it is not clear why. It looks as if this example comes from an internal Netmesh project serving some specific application, but who knows.
5. HyperGraphDB (Kobrix Inc.)
HyperGraphDB is an open source data storage mechanism built on top of the BerkeleyDB database. Its graph model is known as a directed hypergraph. In mathematics, a hypergraph allows an edge to connect more than two nodes. HyperGraphDB extends this further by allowing edges to point to other edges, so it offers more generality than the other graph databases. Figure 6 shows a hypergraph example with four edges, distinguished by different colors.
The tutorial of HyperGraphDB looks complete. Every item stored in HyperGraphDB, whether node or edge, is called an atom, and operations like indexing and traversal are supported.
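The hypergraph idea, including edges that point to other edges, fits in a very small plain-Java sketch. This is not HyperGraphDB's API: in this sketch an Atom is any stored item and a Link is an atom that points to other atoms, and both class names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of a hypergraph data model. All names are illustrative assumptions.
class HypergraphSketch {

    // An atom is any item stored in the graph.
    static class Atom {
        final Object value;

        Atom(Object value) {
            this.value = value;
        }
    }

    // A link is an atom that points to any number of other atoms.
    // Because Link extends Atom, a link can also point to other links.
    static class Link extends Atom {
        final List<Atom> targets;

        Link(Object value, Atom... targets) {
            super(value);
            this.targets = Collections.unmodifiableList(
                    new ArrayList<Atom>(Arrays.asList(targets)));
        }

        // Number of atoms this link connects.
        int arity() {
            return targets.size();
        }
    }

    public static void main(String[] args) {
        Atom john = new Atom("John");
        Atom dana = new Atom("Dana");
        Atom room = new Atom("NY");

        // One edge connecting three nodes, which a simple graph cannot express.
        Link meeting = new Link("meeting", john, dana, room);

        // An edge whose first target is itself an edge.
        Link recordedBy = new Link("recordedBy", meeting, john);

        System.out.println(meeting.arity() + " " + recordedBy.arity());
    }
}
```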
Note: Although the tutorial is nicely written, the same “….dll: Can’t find dependent libraries” error occurs on Windows 7. After I switched to 64-bit Ubuntu, the sample program threw the exception “ELFCLASS32 (possible cause: architecture word width mismatch)”. That is probably because HyperGraphDB supports only 32-bit Linux.
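Errors like these usually mean that the native library and the JVM were built for different word widths, so a quick first step is to print what the running JVM reports. The sketch below uses only standard system properties; note that "sun.arch.data.model" is specific to HotSpot-style JVMs and may be missing elsewhere, which is why a default value is supplied. The class name is hypothetical.

```java
// Sketch: report the architecture of the running JVM, to compare against
// the architecture of a native library that fails to load.
class JvmArchitectureSketch {

    static String describe() {
        // Standard property; in practice it describes the JVM build.
        String arch = System.getProperty("os.arch");
        // HotSpot-specific property ("32" or "64"); absent on some JVMs.
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        return "os.arch=" + arch + ", data model=" + dataModel;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the JVM reports a 64-bit data model while the library is 32-bit (or the other way round), running the program on a JVM of the matching width is one thing to try before giving up on the platform.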
6. Trinity (Microsoft)
Microsoft joined the competition only recently, and the first release, V0.1, of Trinity is accessible only from the Microsoft intranet. According to the introduction, Trinity is a memory-based graph store with rich database features, including highly concurrent online query processing, ACI transaction support, etc. Trinity provides only C# APIs for graph processing.
Since the Trinity package is not available outside Microsoft, we cannot know many details about it. At least the key features of Trinity are listed below:
- Uses a hypergraph as its data model;
- Can be deployed in distributed mode.
The system architecture can be found here. Overall, it is currently hard to find any distinct advantage when we compare Trinity with the open source graph databases. However, since Trinity is still in its prototype stage, it is worth watching. In addition, Probase is an ongoing project that looks like an ontology/taxonomy knowledge base built on top of Trinity. Here is a link to a nice article about Probase and Trinity.
7. AllegroGraph (Franz Inc.)
AllegroGraph is a persistent graph database that purportedly scales to “billions of RDF triples while maintaining high performance”. Although an RDF triple can be viewed as an edge, AllegroGraph is intended for building RDF-centric semantic web applications, and supports SPARQL, RDFS++, and Prolog reasoning from client applications, including Java programs. A free version of AllegroGraph RDFStore supports up to 50 million triples.
Figure 7 shows an example of an RDF graph. AllegroGraph appends an additional slot called “named graph” to each triple, turning it into a quad (although they are still called triples for convenience). Here are some assertions from Figure 7.
| subject | predicate | object | graph |
|---|---|---|---|
| robbie | petOf | jans | jans's home page |
| petOf | inverseOf | hasPet | english grammar |
| Dog | subClassOf | Mammal | science |
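The quad layout can be sketched with a tiny in-memory store in plain Java. This is not AllegroGraph's API and says nothing about how it stores data: the class QuadStoreSketch and its methods add and match are hypothetical, and null is used here as a wildcard by my own convention.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a store of (subject, predicate, object, graph) quads.
// All names are illustrative assumptions.
class QuadStoreSketch {

    static final class Quad {
        final String subject;
        final String predicate;
        final String object;
        final String graph;  // the extra "named graph" slot

        Quad(String subject, String predicate, String object, String graph) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.graph = graph;
        }
    }

    private final List<Quad> quads = new ArrayList<Quad>();

    void add(String subject, String predicate, String object, String graph) {
        quads.add(new Quad(subject, predicate, object, graph));
    }

    // Returns every quad that matches the pattern; a null slot matches anything.
    List<Quad> match(String subject, String predicate, String object, String graph) {
        List<Quad> result = new ArrayList<Quad>();
        for (Quad quad : quads) {
            if (matches(subject, quad.subject)
                    && matches(predicate, quad.predicate)
                    && matches(object, quad.object)
                    && matches(graph, quad.graph)) {
                result.add(quad);
            }
        }
        return result;
    }

    private static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    public static void main(String[] args) {
        QuadStoreSketch store = new QuadStoreSketch();
        store.add("robbie", "petOf", "jans", "jans's home page");
        store.add("petOf", "inverseOf", "hasPet", "english grammar");
        store.add("Dog", "subClassOf", "Mammal", "science");
        System.out.println(store.match(null, "petOf", null, null).size());
    }
}
```

The fourth slot makes it possible to ask not only what is asserted but also where the assertion comes from, for example everything stated in the graph "science".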
To add a batch of triples to an RDF graph, AllegroGraph has facilities for bulk loading from both N-Triples and RDF/XML files. Overall, AllegroGraph is ideal for RDF storage, but not for general graphs. The documentation looks great. Find the introduction here, and for the Java API tutorial, the Sesame version here and the Jena version here.
The overall comparison is shown in the table below. High performance and distributed deployment are supposed to be supported by all products. “1M” means the corresponding graph database supports up to 1 million nodes for free. RDF graphs can be viewed as a special kind of property graph. Since the hypergraph is the most generic form of graph, a graph database supporting hypergraphs should in theory also support property graphs.
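| | Neo4J | InfiniteGraph | DEX | InfoGrid | HyperGraphDB | Trinity | AllegroGraph |
|---|---|---|---|---|---|---|---|
| Free? | Y | < 1M | < 1M | Y | Y | N | < 50M |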
Which one is the best? The answer is usually “it depends”. Ranking products with different characteristics is always controversial, but sometimes we need to make a hard decision. Here are some general rules based on my personal understanding:
- If you need to store RDF triples, go with AllegroGraph;
- For property graphs, make Neo4J and DEX your first choices;
- For hypergraphs, go with HyperGraphDB.