What is NoSQL?
NoSQL is a classification of database systems that do not conform to the relational database or SQL standard. Most often they are categorized according to the way they store the data and fall under categories such as key-value stores, BigTable implementations, document store databases, and graph databases. In general the term isn't well enough defined to reduce it to a single supporting JSR or technology. So the only way to find suitable integration technologies is to dig through every single category.
Key/Value Stores
Key/Value stores allow data storage in a schema-less way. A value can be stored as a datatype of a programming language or as an object. Because of this, there is no need for a fixed data model. This is obviously comparable to parts of JSR 338 (Java Persistence 2.1) and JSR 347 (Data Grids for the Java Platform) and also to what is done with JSR 107 (JCACHE - Java Temporary Caching API).
with native JPA2
Also primarily aimed at caching is the JPA L2 cache. The JPA Cache API is good for basic cache operations, while the L2 cache shares the state of an entity, which is accessed with the help of the entity manager factory, across various persistence contexts. The Level 2 cache underlies the persistence context and is highly transparent to the application. When the Level 2 cache is enabled, the persistence provider will look for the entities in the persistence context first. If it does not find them there, the persistence provider will look in the Level 2 cache next instead of sending a query to the database. The drawback here obviously is that as of today this only works with NoSQL as some kind of "Cache" and not as a replacement for the RDBMS data store. Given the scope of this spec it would be a good fit, but I strongly believe that JPA is designed to be an abstraction over an RDBMS and nothing else. If there has to be some kind of support for non-relational databases, we might end up having a more high-level abstraction layer in place with tons of different persistence modes and features (maybe something like Spring Data). Generally, mapping at the object level has many advantages, including the ability to think in objects and let the underlying engine drive the de-normalization if needed. So reducing JPA to the caching features probably is the wrong decision.
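To make the lookup order described above a bit more tangible, here is a minimal plain-Java sketch. It is not the JPA API: the class TwoLevelLookup, its fields and the databaseHits counter are names I made up for illustration, and entities are reduced to simple strings. It only models the order "persistence context first, then the shared Level 2 cache, then the database".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of the lookup order only; hypothetical names, not the JPA API. */
class TwoLevelLookup {
    // First level: private to one persistence context.
    private final Map<Long, String> persistenceContext = new HashMap<>();
    // Second level: shared by all contexts that come from the same factory.
    private final Map<Long, String> sharedL2Cache;
    // Stand-in for the query that would be sent to the database.
    private final Function<Long, String> database;
    // Counts how often this context had to fall through to the database.
    int databaseHits = 0;

    TwoLevelLookup(Map<Long, String> sharedL2Cache, Function<Long, String> database) {
        this.sharedL2Cache = sharedL2Cache;
        this.database = database;
    }

    String find(long id) {
        // 1. Look in the persistence context first.
        String entity = persistenceContext.get(id);
        if (entity != null) {
            return entity;
        }
        // 2. Next, look in the shared Level 2 cache.
        entity = sharedL2Cache.get(id);
        if (entity == null) {
            // 3. Only now go to the database.
            databaseHits++;
            entity = database.apply(id);
            if (entity == null) {
                return null; // nothing to cache
            }
            sharedL2Cache.put(id, entity);
        }
        persistenceContext.put(id, entity);
        return entity;
    }
}
```

If two TwoLevelLookup instances share one map (for example a ConcurrentHashMap), the second instance finds an entity loaded by the first without touching the database stand-in. That is the "cache only" role that NoSQL stores can play behind JPA today.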
with JCache
JCache has a CacheManager that holds and controls a collection of Caches, and every single Cache has its entries. The basic API can be thought of as map-like with additional features (compare Greg's blog). With JCache being designed as a "Cache", using it as a standardized interface against NoSQL data stores isn't a good fit at first glance. But given the nature of the use-cases for unstructured Key/Value based data with enterprise Java, this might be the right kind of integration. And the NoSQL concept also allows for the "Key-value cache in RAM" category, which is an exact fit for both JCache and DataGrids.
with DataGrids
This JSR proposes an API for interacting with in-memory and disk-based distributed data grids. The API aims to allow users to perform operations on the data grid (PUT, GET, REMOVE) in an asynchronous and non-blocking manner, returning java.util.concurrent.Future instances rather than the actual return values. The process here is not really visible at the moment (at least to me). So there aren't any examples or concepts for the integration of a NoSQL Key/Value store available as of today. Besides this, the same reservations as for the JCache API are in place.
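Since there is nothing official to show, here is a rough idea of what "PUT, GET, REMOVE returning Futures" could look like, using nothing but the JDK. The class AsyncKeyValueStore and its method names are purely my assumption and not taken from JSR 347. I return CompletableFuture, which implements java.util.concurrent.Future, and back it with a local ConcurrentHashMap instead of a real grid.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of an asynchronous key/value API; not the JSR 347 API. */
class AsyncKeyValueStore<K, V> {
    // Stand-in for the distributed grid: a local, thread-safe map.
    private final Map<K, V> data = new ConcurrentHashMap<>();

    /** Stores the value; the future completes with the previous value or null. */
    CompletableFuture<V> put(K key, V value) {
        return CompletableFuture.supplyAsync(() -> data.put(key, value));
    }

    /** Reads the value; the future completes with the value or null. */
    CompletableFuture<V> get(K key) {
        return CompletableFuture.supplyAsync(() -> data.get(key));
    }

    /** Removes the entry; the future completes with the removed value or null. */
    CompletableFuture<V> remove(K key) {
        return CompletableFuture.supplyAsync(() -> data.remove(key));
    }
}
```

The caller gets a handle back immediately and decides when to wait, for example with store.get("k").join(), or chains follow-up work with thenApply. Whether a real data grid API would expose exactly this shape is open.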
with EclipseLink
EclipseLink's NoSQL support is based on the previous EIS support offered since EclipseLink 1.0. EclipseLink's EIS support allowed persisting objects to legacy and non-relational databases. EclipseLink's EIS and NoSQL support uses the Java Connector Architecture (JCA) to access the data-source, similar to how EclipseLink's relational support uses JDBC. EclipseLink's NoSQL support is extendable to other NoSQL databases through the creation of an EclipseLink EISPlatform class and a JCA adapter. At the moment it supports MongoDB (Document Oriented) and Oracle NoSQL (BigData). It's interesting to see that Oracle doesn't address the Key/Value DBs first. This might be because of the possible confusion with the Cache features (e.g. Coherence).
Column based DBs
Reads and writes are done using columns rather than rows. The best known examples are Google's BigTable and the likes of HBase and Cassandra that were inspired by BigTable. The BigTable paper says that BigTable is a sparse, distributed, persistent, multidimensional sorted Map. GAE for example works only with BigTable. It offers a variety of APIs: from a "native" low-level API to "native" high-level ones (JDO and JPA). With the older Datanucleus version used by Google there seem to be a lot of limitations which could be removed (see comments) but are still in place.
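The "sparse, multidimensional sorted Map" part of that definition is easy to picture with nested sorted maps from the JDK. The following is only a toy illustration of the data model (row key, column key, timestamp mapping to a value). The class SparseSortedTable and its methods are hypothetical names of mine, and the "distributed" and "persistent" parts are left out completely.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of a sparse, multidimensional sorted map; hypothetical names. */
class SparseSortedTable {
    // row key -> column key -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell was never written. */
    String latest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    /** Row keys are kept sorted, so a range scan is just a sub-map view. */
    List<String> scanRows(String fromInclusive, String toExclusive) {
        return new ArrayList<>(rows.subMap(fromInclusive, toExclusive).keySet());
    }
}
```

Sparse here simply means that a row only carries the columns that were actually written; there is no fixed set of columns and no null padding. The sorted row keys are what make range scans cheap, which is one of the idioms you have to design your keys around.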
Document-oriented DBs
The Document-oriented DBs are most obviously best addressed by JSR 170 (Content Repository for Java) and JSR 283 (Content Repository for Java Technology API Version 2.0). With JackRabbit as a reference implementation, that is a strong sign for it :) The support for other NoSQL document stores is non-existent as of today. Even Apache's CouchDB doesn't provide a JSR 170/283 compliant way of accessing the documents. The only drawback is that both JSRs aren't sexy or bleeding edge. But for me this would be the right bucket to put support for document-oriented DBs. The flip side of the coin? The content repository API isn't exactly a natural model for an application. Does an app really want to deal with Nodes and attributes in Java? The notion of a domain model works nicely for many apps, and if there is no chance to use it, you probably would be better off going native and using the MongoDB driver directly.
Graph oriented DBs
This kind of database is intended for data whose relations are well represented in a graph style (elements interconnected with an undetermined number of relations between them). Aiming primarily at any kind of network topology, the recently rejected JSR 357 (Social Media API) would have been a good place to put support, at least from a use-case point of view. If those graph-oriented DBs are considered as a data store, there are a couple of options. If the Java EE persistence is steering in the direction of a more general data abstraction layer, JSR 338 or its successors would be the right place to put support. If you know a little bit about how Coherence works internally and what had to be done to put JPA on top of it, you also could consider JSR 347 a good fit for it, with all the drawbacks already mentioned. Another alternative would be to have a separate JSR for it. The most prominent representative of this category is Neo4J, which itself has an easy API available, so you can simply include everything you need directly in your project. There is additional stuff to consider if you need to control the Neo4J instance via the application server.
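To illustrate what "elements interconnected with an undetermined number of relations" means in code, here is a tiny in-memory sketch in plain Java. It has nothing to do with the Neo4J API: TinyGraph, Relationship, relate and reachable are hypothetical names, and a real graph database adds persistence, indexes and a query language on top.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy in-memory graph with typed relationships; hypothetical names. */
class TinyGraph {
    static final class Relationship {
        final String from;
        final String type;
        final String to;

        Relationship(String from, String type, String to) {
            this.from = from;
            this.type = type;
            this.to = to;
        }
    }

    // node -> outgoing relationships; a node may have any number of them
    private final Map<String, List<Relationship>> outgoing = new LinkedHashMap<>();

    void relate(String from, String type, String to) {
        outgoing.computeIfAbsent(from, n -> new ArrayList<>()).add(new Relationship(from, type, to));
        outgoing.computeIfAbsent(to, n -> new ArrayList<>()); // register the target node
    }

    /** Breadth-first traversal that follows only relationships of the given type. */
    Set<String> reachable(String start, String type) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            List<Relationship> rels = outgoing.getOrDefault(node, Collections.<Relationship>emptyList());
            for (Relationship r : rels) {
                // Skip other relationship types, the start node itself and visited nodes.
                if (r.type.equals(type) && !r.to.equals(start) && seen.add(r.to)) {
                    queue.add(r.to);
                }
            }
        }
        return seen;
    }
}
```

A question like "whom does alice know, directly or indirectly?" becomes a traversal instead of a chain of joins. That is exactly the idiom that is hard to squeeze into a persistence API designed around tables.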
Conclusion
To sum it up: We already have a lot in place for the so-called "NoSQL" DBs. And the groundwork for integrating this into new Java EE standards is promising. Control of embedded NoSQL instances should be done via JSR 322 (Java EE Connector Architecture), with this being the only allowed place to spawn threads and open files directly from a filesystem. I'm not a big supporter of having a more general data abstraction JSR for the platform comparable to what Spring is doing with Spring Data. To me the concepts of the different NoSQL categories are too different to allow a one-size-fits-all approach. The main pain point of NoSQL, besides the lack of a standard API, is that users are forced to denormalize and maintain the de-normalization by hand.
What I would like to see are some smaller changes to both the products to be more Java EE ready and also to the way the integration into the specs is done. Might be a good idea to simply define the different persistence types and generally define the JSRs which could be influenced by this and noSQLing those accordingly.
For users willing to facilitate a domain model (i.e. a higher level of abstraction compared to the raw NoSQL API), JPA might be the best vehicle for that at the moment. Feedback from both EclipseLink and Hibernate OGM users is needed to evaluate what is working and what is not. From a political point of view it might also make sense to pursue JSR 347, especially since the big players are present there already.
My painful experience with OR mappers, which try to present an object graph sitting on an RDBMS is: don't lure the developers into false data access idioms! When you have a relational DB, think relational! When you have a distributed big table type of DB, stick to the specific idioms!
JPA is already absurd. A pseudo object graph plus an almost-but-not-quite SQL query language plus an unreadable query API on top of a relational DB, whose properties a developer must also understand when things get serious. Reduction of complexity? You got to be kidding!
The super-scalable mass data DBs fail spectacularly when developers use the wrong data models (which is true for all DBs, actually) and apply mismatched data access and design idioms. With the Big Data class of DBs, we even must consider the physical layout of data, and much more.
Things which are inherently different must be used differently. There is no common denominator.
However, I would welcome an out of the box JSON mapper. There are many good ones around, but that is an area where the JEE platform can deliver value, like it did for XML parsing (after some time...)
Also, let the market and the products settle before jumping forward with a solution, whose lifetime will be shorter than a year. This doesn't fit enterprise innovation cycles. Nobody wants this kind of hectic activity.
Unfortunately, JPA is too complex. The API is enormous, and it is very hard to learn it. If you want to get some idea of how it works (so that you use it properly), you have to learn by heart some axioms, and to apply them in your mind when you write your code. But then it turns out that the JPA provider is not really following those axioms, but probably doing something slightly (and sometimes not even slightly) different. You end up with more exceptions, rather than rules.
And all this ocean of new and terribly complex (and therefore widely misused) concepts, like: persistence context, entity instances lifecycle (new/merged/detached/removed states), entity listeners, query handling ... is actually getting me quite fast to give up using JPA. Sometimes, when you just take a look at the API, you get the feeling that it doesn't feel right. It's too complex to do even very simple tasks. The JPA language is too magical and too witchy for me...
No matter how big and clunky JPA and JEE are, they have one advantage, which is standardization. It allows us to pick and choose whatever vendor can provide us with the best support or best development speed.
For example, I use Glassfish + EclipseLink + Derby for my local container test, and WebSphere + DB2 on integration to production with the same code base.
In terms of complexity, there's no need to go learn the whole thing. Since it is a standard, it won't change as quickly.
I'm working on a project that hasn't decided on SQL or NoSQL. Do you have any recommendation on a good abstraction that would allow us to change our minds in the future with minimal impact to clients? We are mostly looking at MySQL vs. CouchDB.
Our team has initially chosen JPA because of EclipseLink's EIS support. You made some comments above that indicated JPA is only for RDBMS, but later indicated that the JCA approach might be valid. Do you have any other insight?
My main concern with JPA is that it requires vendor-specific annotations and obviously doesn't map all JPQL to NoSQL queries. Use of native NoSQL queries negates the advantages of the abstraction. Beyond method signatures, my concern is that the JPA contract seems to assume transactional behavior and consistency that intentionally don't exist in most NoSQL solutions.
Any feedback you have on this subject would be appreciated ...
Thanks!