Polyglot Persistence

Posted on October 15, 2008 by Scott Leberknight

In late 2006 Neal Ford wrote about Polyglot Programming and predicted the wave of language choice we are now seeing in the industry. Instead of assuming a "default" language like Java or C# and then warring over the many available frameworks, polyglot programming is all about using the right language for the job, not just the right framework(s). For a while now I've thought that, paralleling Neal's description of polyglot programming, a relational database seems to be the accepted default choice for persistence. Sometimes this is because organizations have standardized on an RDBMS and there simply isn't any other choice. Other times it is just what we're used to doing, and we may not even consider alternatives. But with things like Amazon SimpleDB, Google Bigtable, Microsoft SQL Server Data Services (SSDS), CouchDB, and lots more, it seems like we're seeing the beginning of Polyglot Persistence in addition to polyglot programming.

Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand. For example, some co-workers of mine on one project are effectively using Lucene as their primary datastore, since the application they've built is mainly to do complex full-text searches very fast against huge datasets. Most people probably don't think of Lucene as a data store and just consider it as their full-text search engine. But for this particular application, which aggregates multiple disparate datasets, glues them together, and performs full-text search against the consolidated view of the data, it makes a good deal of sense. It also helped that in a bake-off against a very popular traditional RDBMS system's full-text add-on product, the Lucene search solution blew the doors off the traditional RDBMS in terms of performance, and that was even after a team of consultants from the vendor came in and tried to optimize the search performance. So, in this case a non-relational data store made more sense in terms of the problem context, which was data aggregation and fast full-text search.
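
To make the "search index as data store" idea concrete, here is a minimal sketch in Python. It is not Lucene and does not use Lucene's API; the class name TinySearchStore and its methods are made up for illustration, and a real engine adds analysis, scoring, and on-disk segments. The point is only that the original records live alongside an inverted index, so a query returns the stored data directly without a trip to a separate database.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class TinySearchStore:
    """Toy store (hypothetical, not Lucene): documents plus an inverted index."""

    def __init__(self) -> None:
        self._docs: Dict[int, Dict[str, str]] = {}               # doc id -> stored fields
        self._index: Dict[str, Set[int]] = defaultdict(set)      # token -> doc ids
        self._next_id = 0

    @staticmethod
    def _tokens(text: str) -> List[str]:
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, fields: Dict[str, str]) -> int:
        """Store a document and index every token of every field."""
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(fields)
        for value in fields.values():
            for token in self._tokens(value):
                self._index[token].add(doc_id)
        return doc_id

    def search(self, query: str) -> List[Dict[str, str]]:
        """Return stored documents that contain ALL query terms."""
        terms = self._tokens(query)
        if not terms:
            return []
        ids = set.intersection(*(self._index.get(t, set()) for t in terms))
        return [self._docs[i] for i in sorted(ids)]


# Aggregate records from two (made-up) source systems into one searchable view.
store = TinySearchStore()
store.add({"source": "crm", "text": "Acme Corp signed a renewal contract"})
store.add({"source": "tickets", "text": "Acme reported a login outage"})
hits = store.search("acme outage")
```

Because the records from the different source systems are flattened into one index, the consolidated view and the search structure are the same thing, which is roughly why a full-text engine can stand in as the primary store for this kind of application.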

Within the past few years we've started to see and hear about how companies like Amazon and Google are using non-traditional data stores such as SimpleDB and Bigtable for their own applications. Google App Engine in fact provides access to Bigtable, described as a "sparse, distributed multi-dimensional sorted map," as the sole persistent store for Google App Engine applications. Other organizations like the Apache Software Foundation have gotten into the non-relational data store market as well with things like CouchDB, which is described as "a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API." One of the common threads among all these non-relational stores is that they are distributed, designed for fault tolerance, embrace asynchronicity, and follow BASE (Basically Available, Soft state, Eventual consistency) principles, accepting the trade-offs described by the CAP (Consistency, Availability, Partition tolerance) theorem, as opposed to the ACID (Atomicity, Consistency, Isolation, Durability) properties found in traditional RDBMSs. In addition, they are almost all either "schemaless" or provide a flexible architecture that makes schema changes easier over time, again as opposed to the rigid schemas of traditional relational databases.
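
The "sparse, distributed multi-dimensional sorted map" description is easier to picture with a small sketch. The Python below is a toy, single-process model of that idea, assuming the usual reading of the data model as a map from (row key, column key, timestamp) to a value, with rows kept in sorted order. The class SparseSortedMap and its method names are my own invention, not Bigtable's or App Engine's API, and nothing here is distributed.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple


class SparseSortedMap:
    """Toy in-memory map keyed by (row, column, timestamp). Hypothetical names."""

    def __init__(self) -> None:
        # row -> column -> list of (timestamp, value), newest first.
        self._rows: Dict[str, Dict[str, List[Tuple[int, str]]]] = {}
        self._row_keys: List[str] = []  # kept sorted so range scans are cheap

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)  # newest version first

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[str]:
        """Newest value, or the newest value at or before `timestamp`."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None  # sparse: a missing cell costs nothing and returns nothing

    def scan(self, start_row: str, end_row: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield (row, latest cells) for start_row <= row < end_row, in key order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        for key in self._row_keys[lo:hi]:
            yield key, {col: vs[0][1] for col, vs in self._rows[key].items()}


m = SparseSortedMap()
m.put("com.example/index", "contents:html", 1, "<html>v1</html>")
m.put("com.example/index", "contents:html", 2, "<html>v2</html>")
m.put("com.example/about", "anchor:home", 1, "About")
latest = m.get("com.example/index", "contents:html")
```

Note there is no schema anywhere: each row carries only the columns that were actually written, which is the "sparse" and "schemaless" part, and old versions stay readable by timestamp.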

I don't think it's a coincidence that the companies creating and now offering these alternative data stores - free, commercial, or hybrid models like Google App Engine which is free up to a certain point - are all giants in distributed computing and deal with data on a massive scale. My guess is that perhaps they initially deployed some things on traditional RDBMS systems and outgrew them, or maybe they simply thought they could do it better for their own specific problems. But as a result, I think that over time organizations are going to start thinking more and more about the type of persistence they need for different problems, and that ultimately the RDBMS will be but one of the available persistence choices.

Cloud-Oriented Architecture (COA)

Posted on August 18, 2008 by Scott Leberknight

With all the hype this year about cloud computing and things like Amazon EC2/S3 as well as Google App Engine and Bigtable, you can feel it coming. Soon vendors will be peddling COA (Cloud-Oriented Architecture) solutions, probably combining them with their SOA solution and somehow probably getting their ESB solution into the mix as well. This past weekend at the Enterprise Architecture BOF at the Southern Ohio Software Symposium, we had a discussion about cloud computing among other things. Ted Neward even coined the term "Enterprise Service Cloud" and I came up with "Cloud Service Bus," surely the next Big Thing. Any vulture (I mean, venture) capitalists out there want to invest in my new Cloud Service Bus company? I have a pretty brochure ready to go!

The big difference I see with regard to cloud computing is the fact that, unlike your typical ESB/SOA peddling vendors, companies like Amazon and Google already have cloud or cloud-like solutions in place a la Amazon EC2/S3 and Google App Engine and Bigtable. Now all of those things just mentioned are not "the cloud" (whatever "the cloud" actually is defined to be), but it doesn't matter because the point is these companies have designed, implemented, and most importantly, run their critical business operations on these platforms. That in and of itself is more important than all the vaporware and marketing hype any other vendor comes up with. Rather than having to get customers to believe that a solution works via marketing and then force it down their IT staff's throats, Google and Amazon are basically saying "Hey, why not use stuff that we use and have proven can scale up to handle huge loads and huge amounts of data?"

To me as a developer this is a much more appealing approach for several reasons. First, it means there won't be (or shouldn't need to be) any "golf-course deals" where the vendor sales guys and customer CIOs/CTOs/CEOs meet up and decide on the technology stack independent of any real technical analysis, investigation, or input of the people who will be charged with implementing the vendor stack (and they better do it well else it's their job on the line to boot).

Second, I can base my decision to use a Google or Amazon service on their actual track record in delivering these services and eating their own dog food, since they are trying to monetize their existing investment in proven highly distributed and scalable infrastructures. Yes, there have been Amazon outages this year, and whenever one happens it is big news because it is right there out in the public, as opposed to a company whose IT operations are totally in-house and which isn't going to publicize its downtime statistics. I'd wager on Amazon's and Google's availability over that of most other companies. Of course I have no way to prove that last statement, but the mere fact that I can get objective statistics on their services helps in my decision making process and planning.

Last, I can decide how much or how little to outsource to the Amazon or Google infrastructure; for example some organizations might choose to keep their most sensitive data (e.g. customer information, credit card numbers, etc.) in-house but outsource everything else to, say, an Amazon EC2/S3 infrastructure. There is still some level of vendor lock-in here, but there is with anything else short of you implementing your own solution from scratch. And if, by leveraging proven solutions by companies like Amazon and Google, you are able to deliver real value to your customers faster and are able to scale up, out, and beyond without needing to build that infrastructure yourself, then I'd say that could potentially equate to a big win.

So when that vendor comes calling with their shiny new COA solution, be very afraid, and make sure you know your options and present them objectively. We as an industry have more buzzwords and hype (at least from my perspective) than almost any other, and this causes more money than I can possibly imagine to be wasted every year on solutions that don't (and never will) work as advertised. Developers often have a feeling that the VDD (vendor-driven development) solutions just won't work, but cannot convince their managers or their managers' managers of this fact, which is why communication skills are critical in today's world. I don't know about you, but I don't want to be the person who becomes responsible for implementing a solution I don't believe in.

"The Cloud" and cloud computing are definitely here to stay, and as Amazon and Google have proven, can add huge amounts of value to businesses. I am sure there will be other companies trying to implement similar strategies and monetize their investment in their own infrastructure, and that will mostly be a good thing: different options and competition will further push the Cloud Service Providers (CSPs) to continually improve their offerings. Get ready, because our toolboxes have just become a lot bigger.