Key-value stores have emerged in the last few years as an alternative to relational database management systems for some data storage tasks. SimpleDB, BigTable, and Tokyo are just three of the variants. So what are they? How are they different than relational databases? And when should you use them?
Applications, whether web apps, simple dynamic websites or command line apps, frequently need some sort of persistent data store. As a result, databases have become ubiquitous on modern systems, and because of this chicken and egg relationship, programmers will often habitually reach for a relational database when the project only calls for a way to persist data.
Generally, this is fine. Relational databases are a well known technology, with numerous features that make life easy for the people who use them. And while you can quibble about the nuances, relational databases typically do a decent job with most data storage and data query tasks.
Once in a while it behooves a programmer to reach for a different tool. Perhaps you need something that can query simple data faster than a relational system can. Maybe there are scaling requirements that are difficult to meet with a relational system. Or you want to avoid tying an application to a requirement for a relational system that may well have its own maintenance needs over time. Or maybe you just want something simpler than a relational database. In all of these cases, a key-value store may be just the tool for you.
In the simplest terms, a key-value store is exactly what the name suggests: it’s a storage system that stores values, indexed by a key. The implementations are diverse, but they’re usually built on either a hash data structure, or a tree data structure, like a b-tree. Hash based key-value stores have fast lookup and update speeds, and require that keys be unique. Lookup in a hash based tree is limited to key-only lookup.
Tree based stores generally allow multiple identical keys, and because tree based structures (like the b-tree) are ordered, they allow you to query individual keys, as well as ranges of keys. The drawback to tree based stores is that those structures are generally slower than simple hashes. For example:
A log aggregation system that stores messages keyed by timestamp, but which uses a hash based store, will be troublesome to query in a useful way, despite the speed of lookup, because you can only lookup specific timestamps. A system that uses a b-tree based store may be marginally slower for individual queries, but by permitting lookup based on a range of timestamps, can easily query the records of interest. This more than offsets the difference in basic query speeds.
Ruby comes with a simple hash based key-value store called PStore. If you’ve never used it, it’s definitely worth learning. For simple data storage, PStore can be very useful.
Let’s walk through a simple example: assume you get a daily feed of stock data, and want to load it into a simple PStore database so that the data can be displayed on a web site. A loader might look like this:
stock_loader.rb
require 'pstore' require 'csv' require 'date' store = PStore.new('fund_data') funds = {} store.transaction do store.roots.each { |root| funds[root] = store.root } CSV.read(ARGV.first).reject {|x| x.first.nil? }.each do |row| row[0] = row[0].to_s.strip record = { :ticker => row[0], :rate_date => Date.parse(row[2]), :price => row[3], :price_change => row[4], :percent_change => row[5] } store[row[0]] = record end end
A PStore data store acts just like a hash, with the exception that everything done with the store has to be done inside the context of a transaction. Creating a new store is simple:
store = PStore.new(FILENAME)
With that one line, a key-value store with hash-like behavior (which stores its data on the filesystem in a file named FILENAME) is created. Accessing a PStore key-value store is conceptually very similar to accessing a Hash:
store['footloose'] = 'dance like it's the 1980s!' hash['footloose'] = 'dance like it's the 1980s!'
Both lines of code have similar behaviors; they both store a value that is indexed by a key. Likewise:
line = store['footloose'] line = hash['footloose']
In both cases, a value is queried from the data store based on the key (footloose) that was provided. The most significant difference between a Hash and a PStore is seen a couple lines farther down in the code.
store.transaction do
PStore requires that all interaction with the store happens within the context of a transaction, while regular Hash interactions have no such restriction. This guarantees, in a simple manner, that two separate processes that are both accessing the same store will not have collisions with any modifications. Only a single transaction can be active at any one time, and all changes to the store are made persistent only when the transaction is closed (or committed). Changes are discarded if the transaction is rolled back. It’s simple, with convenient, hash-like access, and just a few useful database-like features.
Once a loader similar to the one above runs, the key-value store will be populated with data, and your website can pull this data just as trivially as it was stored:
/app/controllers/fund_table_controller.rb
class FundTableController < ActionController::Base before_filter :setup def index end private def setup @store = Pstore.new('fund_data') end end
/app/views/helpers/fund_table_helper.rb
module FundTableHelper def data @store.transaction { @store[@ticker] } end def tickers @store.transaction { @store.roots.sort } end end
/app/views/fund_table/index.html.haml
%table
%thead
%tr
%th ticker
%th price
%th change
%tbody
- tickers.each do |@ticker|
%tr
%td= data.ticker
%td= data.price
%td= data.price_changeThe web application connects to the store and then opens transactions and reads the desired data - it's all very simple.
While PStore lets you store anything that can be marshaled, it does have some notable performance restrictions. It's a simple piece of kit, so when a transaction is opened on a store, it doesn't do anything sophisticated; it reads the whole store into memory. This makes it useless for huge data sets that you would not or could not otherwise store in RAM, and it also makes opening and closing any large data store an expensive operation. Whether the expense is greater than that incurred by a traditional relational database will depend on the individual use cases, but for larger data sets it becomes completely untenable as a solution. In those cases, you need to look a relational database or more sophisticated key-value stores.
Syntax for different key-value stores will obviously differ, but they all function in a manner conceptually similar to PStore. They all allow storage of some arbitrary data, indexed for retrieval by a single key. For data which is amenable to this sort of storage and retrieval pattern (and a lot of data used in web applications and dynamic web sites is), key-value stores will generally be much faster than relational databases and can keep your code looking clean and simple.


Watch a Live Demo of Engine Yard AppCloud
The Engine Yard Newsletter
I've been working on a Datamapper adapter for Tokyo Cabinet's TDB datastore. TDB isn't quite key-value, but it's not quite relational either.
I'll drop by later this evening and post a comment with a link to the Github project once I implement a few more of the core features.
Thanks Scott — sounds interesting!
Cool, Scott. I look forward to seeing what you have been working on!
How similar is what you're working on to http://github.com/makoto/dm-tokyo-cabinet-adapter... (other than it's likely been updated more recently :)
Is TDB a separate beast entirely? I can't wait for the followup to this post.
Hi,Scott. I am an author of dm-tokyo-cabinet-adapter. I haven't worked on the adapter, and my implementation is quite old (implemented before Table Database API was supported by TC, so most of my code is redundant) and there is a bug which I haven't fixed yet, so go for it. Feel free to ping me if you have any questions on my implementations.
Hi Makoto,
Thanks for replying – I was looking for a way to get in touch with you!
I originally had an issue getting the BDB adapter you put together working with greater than 10 records – it was a really strange problem that looked to be caused by BDBCUR failing to increment or return the correct maximum ID…when I tried to store greater than 10 records (say, 100), the 10th would just keep getting overwritten rather than storing the new ones. It looked like this was an issue with the Tokyo Ruby library and had nothing to do with your adapter, but it was hard for me to pin down.
I decided to work on the TDB implementation because it seemed a bit more appropriate for a (semi-)relational concept including associations between different models, though I expect that it is less efficient overall than what you're working on.
I just posted a preview version of it last night – here's the blog post: http://bit.ly/tdbmap
Here's a quick sample app I built on top of it: http://bit.ly/stck
If you're interested, I'd love to hear your thoughts on this implementation and potentially build it up into something better. Drop me an e-mail at scott [at] phoreo [dot] com if you'd be interested in talking about some of this.
Thanks!
Scott
Thanks for covering this topic. Will you be discussing Redis in the series?
The followup article will focus on Tokyo Cabinet and using that from Ruby. We're talking now about whether or not to follow that one with articles about the other major key/value stores and how one uses them from Ruby (like Redis and CouchDB).
Regarding performance of PStore: I was using PStore for storage, and then switched to using gdbm instead (because Ruby came with bindings for it.) Even with marshalling the data like PStore does, gdbm was about 50 times faster.
thanks for the well reasoned article, kirk. i agree that there is a lot of diversity in persistence solutions, and rdbms solutions (even wee little sqlite3) are very often "too much medicine". like others, my current preferred alternative is tokyo for r/w scenarios where speed is the premium, but for most situations requiring lightweight "good enough" persistence, yaml is fantastic and since it stores in plain text, it allows other etl processes in other languages to get into the game, even if that "other etl process" is a non-technical human with a word processor. a great middle performer between yaml and tokyo is "marshal". any of your readers that are enlightened by the existence of PStore will probably dig yaml and marshal, too.
[...] Key-Value Stores in Ruby | Engine Yard Blogengineyard.com [...]
"…fast, handles large data sets well, …"
Just wondering, how large is "large"?
Well I am using pstore for storage purpose. It works good. But I use it from 1 year. Now I want to change it. So I thought which one will be better for storage. Then saw this post and got my solution. Thanks for sharing such nice information on key value with code example. I like it and I will sure use it. It is nice to post here. I like this site.
I don't anything about ruby a programming language but these key values really help me to understand the storage of these values. I am already know .net and java but this is a complete different type of language for me and believe me it's quite interesting.
Excellent post Kirk! You’ve brought together some processes which haven’t been brought in to the public eye as of yet. Keep it up!
Regarding performance of PStore: I was using PStore for storage, and then switched to using gdbm instead (because Ruby came with bindings for it.) Even with marshalling the data like PStore does, gdbm was about 50 times faster.
Fiat
You really shouldn’t be using Pstore for Rails. Apart from the fact that it doesn’t scale to a setup with more than 1 box, it is much slower than using the database for session storage, for example.
It's limitations are pretty apparent, and nowhere is anyone suggesting that one try to build Twitter on PStore. However, PStore can be perfectly appropriate. Not all apps are going to see heavy use. In fact, I'd wager that the majority of apps that actually get built and put into production are apps targeted at small, specific userbases, with no particular performance concerns, and no need to scale beyond a single box.
It's all a matter of knowing what you are doing, and selecting the tools to match, and there's no reason at all not to use PStore if it will satisfy the needs of the project. Conversely, if it won't, then there is no reason to use it. It is a tool with some significant limitations, for certain.
The followup article will focus on Tokyo Cabinet and using that from Ruby. We're talking now about whether or not to follow that one with articles about the other major key/value stores and how one uses them from Ruby (like Redis and CouchDB).
bank exams
Relational databases is something that every programmer wants to do. Well it is a lot of help too and most uses it.
This is nice information that Relational databases are a well known technology, with numerous features that make life easy for the people who use them. I hope you will provide us a lot of information like this.