These notes are just a tad out of date. I started working on them right away after the conference, but then never got back to them again. Unfortunately, I also no longer remember what goes in the gaps of my short-hand, so they get less explanatory and more list-y pretty quickly.
Pre-conference session at Code4lib2013
From the program:
Introduction to NoSQL Databases
1-470 Richard J. Daley Library, 9:00 am to 12:00 pm on Monday, February 11
Joshua Gomez, George Washington University, jngomez at gwu edu
Since Google published its paper on BigTable in 2006, alternatives to the traditional relational database model have been growing in both variety and popularity. These new databases (often referred to as NoSQL databases) excel at handling problems faced by modern information systems that the traditional relational model cannot. They are particularly popular among organizations tackling the so-called “Big Data” problems. However, there are always tradeoffs involved when making such dramatic changes. Understanding how these different kinds of databases are designed and what they can offer is essential to the decision making process. In this precon I will discuss some of the various types of new databases (key-value, columnar, document, graph) and walk through examples or exercises using some of their open source implementations like Riak, HBase, MongoDB or CouchDB, and Neo4j.
Notes:
No slides available as of 6/18/13
Gomez began his presentation with the caution that he was not an expert and had, in fact, signed up to give the presentation in order to make himself learn about noSQL databases. Gold star for him; I don’t think I could do that.
Outline of the presentation
What noSQL dbs are
How they are different from relational dbs
Review of ACID and CAP
The 4 types of noSQL dbs:
columnar db
- bigtable & hbase
- mapreduce & hadoop
key-value store db
- dynamo
- riak
document store
- mongodb
graph db
Introduction
Are noSQL dbs an innovation or just a fad? Everyone seems to be doing it and each has their own version.
- BigTable – Google
- Dynamo – Amazon
- Cassandra – Facebook
- Voldemort – Linkedin
NoSQL dbs are built for situations where relational dbs struggle
There are no joins, and data is not stored the way it is in a relational db
Schemaless data structures
Advantages:
- high availability
- massive horizontal scaling
Tradeoffs:
- no guarantees
- not perfect consistency
ACID
- atomicity – transaction must all run or all fail
- consistency – db is always valid, no violation of constraints
- isolation – separate transactions can’t interfere with each other
- durability – changes are permanent, even when there is a system failure
CAP Theory
distributed dbs can only provide for 2 out of 3 properties:
- consistency
- availability
- partition tolerance
latency is the tradeoff against availability and consistency
4 types of noSQL dbs
columnar
- a single very large table, column-based, 2-dimensional
- good for sparse data sets (lots of nulls) and aggregation
- easy to add columns
key-value
- fast distributed hash map
- no good for complex queries or data aggregation
document
- like a hash but allows for hierarchical data structures
- combo of simple look-up and nested data, flexible
graph
- records data about relationships
- excellent for insight and discovery through node traversal
Key-value store dbs are the least complex databases
Graph databases are the most complex databases
90% of use cases don’t even come close to needing noSQL-level scale
Relational dbs are not obsolete, they can do most or all of the same things that noSQL does
So why would someone use noSQL?
- might need a unitasker (a tool that does just one job well)
- might be cheaper
- for the schema flexibility
- to learn new things
Detailed look at the four types of databases
1. Columnar dbs
These dbs are column-based, not row-based
They are good for scans over large data sets
They allow for massive horizontal scaling
Implementations
- BigTable
- HBase
- Cassandra
- Hypertable
BigTable goals
- wide applicability
- scalability
- high performance
- high availability
Note that consistency is missing from the list!
BigTable data model
- sparse
- distributed
- persistent
- multi-dimensional
map (each cell) indexed by:
- row key: string
- column key: string
- timestamp: 64 bit integer
No old data is thrown away; each write just gets a new timestamp
Values are uninterpreted arrays of bytes (strings)
BigTable rows
- rows are dynamically partitioned (into tablets)
- short row ranges usually on one machine
- manipulation of row names ensures related keys land on the same machine
BigTable columns
- column keys grouped into families
- at most a few hundred families
- individual columns unbounded (millions expected)
- columns can be organized into locality groups to improve lookup efficiency
Bloom filters
- used for fast lookups
- array holds single bit values
- hash function maps inputs to set of cells in the array
- allows for quick negative response
- can give false positives, but no false negatives
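The quick-negative behavior is easy to sketch. Below is a toy Bloom filter, assuming a simple seeded string hash invented for illustration (BigTable's actual hash functions are different):

```javascript
// Toy Bloom filter: a bit array plus k seeded hash functions.
// The hash below is a simple string hash chosen for illustration only.
function hash(s, seed) {
  var h = seed;
  for (var i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

function BloomFilter(size, seeds) {
  this.bits = new Array(size).fill(false);
  this.size = size;
  this.seeds = seeds;
}

// Set one bit per hash function for the given key.
BloomFilter.prototype.add = function (key) {
  for (var seed of this.seeds) {
    this.bits[hash(key, seed) % this.size] = true;
  }
};

// If any bit is unset, the key was definitely never added
// (quick negative); if all are set, the key is *probably* present.
BloomFilter.prototype.mightContain = function (key) {
  var self = this;
  return this.seeds.every(function (seed) {
    return self.bits[hash(key, seed) % self.size];
  });
};

var bf = new BloomFilter(64, [17, 31, 73]);
bf.add("row-001");
bf.add("row-042");
```

Because different keys can set overlapping bits, a lookup can wrongly answer "maybe present" (false positive), but a key that was added always hits all of its bits, so there are no false negatives.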
Close grouping of similar values enables very high compression (10:1)
Identical data isn’t saved over and over again; later copies just point back to the existing file
Distributed processing
The traditional approach uses a single expensive, powerful computer, but this doesn’t scale well.
The divide and conquer approach leverages lots of commodity hardware to handle data in smaller chunks processed in parallel.
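A minimal in-process sketch of that divide-and-conquer pattern (plain word counting, not actual Hadoop — the "chunks" here stand in for data split across machines):

```javascript
// Minimal in-process map/shuffle/reduce for word counts.
// Real Hadoop distributes each phase across machines.
function map(chunk) {
  // Emit a (word, 1) pair for every word in the chunk.
  return chunk.split(/\s+/).filter(Boolean).map(function (w) {
    return [w, 1];
  });
}

function shuffle(pairs) {
  // Group all emitted values under their key.
  var groups = {};
  pairs.forEach(function (pair) {
    (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
  });
  return groups;
}

function reduce(values) {
  // Sum the 1s for one key.
  return values.reduce(function (a, b) { return a + b; }, 0);
}

function wordCount(chunks) {
  var shuffled = shuffle([].concat.apply([], chunks.map(map)));
  var result = {};
  Object.keys(shuffled).forEach(function (k) {
    result[k] = reduce(shuffled[k]);
  });
  return result;
}

var counts = wordCount(["big data big", "data data"]);
```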
HBase (hadoop)
- apache
- open source bigtable
- built in Java
- shell is a JRuby interpreter
- other interfaces
- Jython, Groovy, Scala, REST, ???
HBase CRUD
–quick demo, didn’t work—
Operations can be run from the command line or from a REST interface (REST interface requires values in base64)
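For example, storing the string "hello" through the REST interface means base64-encoding it first; in Node that's just a Buffer round trip (a sketch of the encoding step only, not a full HBase client):

```javascript
// HBase's REST interface expects and returns cell values as
// base64 strings; Node's Buffer class handles the conversion.
var value = "hello";
var encoded = Buffer.from(value, "utf8").toString("base64");
var decoded = Buffer.from(encoded, "base64").toString("utf8");
```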
Intended for big jobs
Strong scaling capabilities
Scans over large sets fast
Complex ancillary systems – high learning curve
Documentation not the best
HBase possible library applications:
- full text indexing of an entire collection — probably wouldn’t make it work very hard
- web archiving
- cms backend? — probably not, only for masochists
2. Key value stores
Very simple data structure – one big table
Fast but can’t do complex queries
Implementations:
- Dynamo
Open source implementations
- redis
- riak
- memcached
- project voldemort
Dynamo (Amazon)
Its goal is to be reliable on a massive scale. This means:
- high availability
- low latency
The environment is an e-commerce platform, running hundreds of services (not just the Amazon store).
An RDBMS is not a good fit for this environment.
Most services retrieve by primary key, so there are no complex queries
An RDBMS, on the other hand, requires expensive hardware and expertise
Assumptions and Requirements
The query model:
- read/write by key
- no operations across multiple items (no relations)
acid properties
- weak consistency
efficient
- built on commodity hardware
- low latency
other
- internal, non-hostile environment – no internal access controls needed
- scale of hundreds of hosts
Why does it need to be so fast?
In an ecosystem built on service tiers, each level of services has to be even faster than the one above it, right down to the data stores.
Design considerations
The compromise is on consistency, with a goal of “eventual consistency”
Conflicts are resolved at read time so that the database is always writeable, resulting in high availability
Conflicts are resolved by the application; the db itself can only take the last write, so data could potentially be lost
Gomez used the example of Amazon’s shopping cart: access it from multiple computers at multiple times and make changes to it, and there is the potential for something to get out of sync between the various computers.
Incremental scale, scale out one node at a time
Symmetry – all nodes are peers and equal, no master nodes
Decentralization – handles outages better
Heterogeneity – hosts not created equally, some hardware will be better than other hardware
System architecture
interface, has only two operations
- get key
- put key, context, object
dynamic data partitioning
- consistent hashing (in a ring structure)
- each node is assigned a position in a ring
- keys are hashed to determine node
- nodes are virtual and hashed to multiple points
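A toy version of that ring, with a made-up hash function and an illustrative 360-point ring (real implementations use much larger hash spaces and better hashes):

```javascript
// Toy consistent-hash ring: each node is hashed to several
// virtual points; a key goes to the first node clockwise from
// its own hash position.
function hash(s) {
  var h = 2166136261;
  for (var i = 0; i < s.length; i++) {
    h = ((h ^ s.charCodeAt(i)) * 16777619) >>> 0;
  }
  return h % 360; // positions on a 360-point ring
}

function Ring(nodes, vnodes) {
  this.points = [];
  var self = this;
  nodes.forEach(function (node) {
    for (var v = 0; v < vnodes; v++) {
      self.points.push({ pos: hash(node + "#" + v), node: node });
    }
  });
  this.points.sort(function (a, b) { return a.pos - b.pos; });
}

// Walk clockwise from the key's position to the next virtual node.
Ring.prototype.nodeFor = function (key) {
  var pos = hash(key);
  for (var i = 0; i < this.points.length; i++) {
    if (this.points[i].pos >= pos) return this.points[i].node;
  }
  return this.points[0].node; // wrap around the ring
};

var ring = new Ring(["A", "B", "C"], 3);
```

Because each physical node appears at several virtual points, adding or removing a node only remaps the keys near its points instead of rehashing everything.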
replication
- data replicated to N hosts
- coordinator node replicates to N-1 successors + others in the preference list
versioning
- eventual consistency
- vector clocks
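Vector clocks are what let Dynamo tell a stale write from a genuinely concurrent one. A toy sketch (the node names sx, sy, sz are made up for the example):

```javascript
// Toy vector clock: one counter per node. Two clocks conflict
// ("concurrent") when neither descends from the other, which is
// what forces Dynamo to hand the conflict to the application.
function tick(clock, node) {
  var next = Object.assign({}, clock);
  next[node] = (next[node] || 0) + 1;
  return next;
}

// True if clock a has seen everything recorded in clock b.
function descends(a, b) {
  return Object.keys(b).every(function (n) {
    return (a[n] || 0) >= b[n];
  });
}

function concurrent(a, b) {
  return !descends(a, b) && !descends(b, a);
}

var base = tick({}, "sx");   // first write handled by node sx
var left = tick(base, "sy"); // replica sy updates...
var right = tick(base, "sz"); // ...while sz updates independently
```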
configurable quorum
- define levels at which read and write happen
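The arithmetic behind those levels: with N replicas, a read quorum R and a write quorum W are guaranteed to overlap whenever R + W > N, so every read touches at least one replica that saw the latest write. A minimal check:

```javascript
// Reads of R nodes and writes to W nodes out of N replicas must
// share at least one node whenever R + W > N.
function quorumOverlaps(n, r, w) {
  return r + w > n;
}

var strict = quorumOverlaps(3, 2, 2); // a common Dynamo-style setting
var loose = quorumOverlaps(3, 1, 1);  // fast, but reads can miss writes
```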
Amazon’s white paper on Dynamo
Riak
- open source key-value store
- developed by Basho
- written in Erlang
Riak is based on dynamo
Added features:
- it links data together
- data can be anything – text, images, video
- can use a REST interface
- curl – get/put/delete/post
–Riak demo–
link walking
- bucket, tag, keep
ad hoc read/write specification
- you can specify w, r and n within a query
custom server side validation
- pre-commit and post-commit hooks
plugins
- indexing – put into headers
- search – plugin builds an inverted index with pre-commit hooks
- has an HTTP Solr interface
Summary of key-value dbs
- distributed replicated
- high availability
- schemaless
- [something else on the slide that I missed]
Possible library applications
- large inventory/repository backend
3. Document-oriented DBs
These dbs use key-based lookup like the key-value stores
data stored in json “documents”
allows for hierarchical data
schemaless, change data structure on the fly
open source options
- mongodb
- couchdb
couchdb
- apache
- Erlang
easy to use
- big or small projects
- easy to install
- nice web UI (Futon)
–couch db demo–
data not overwritten, just added with a new revision number
couchdb – mapreduce
views to query data
- consist of map() and optional reduce()
- map has 1 argument: the document
- map function emits key-value pairs via emit(key, value)
- reduce has 3 arguments: keys, values, rereduce
mapreduce – resource intensive
- don’t run jobs on a production environment
- save temp views as design doc
- couchdb stores results and watches for changes
queries from the demo:
function(doc) {
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      var key = album.title;
      var value = {by: album.artist, tracks: album.tracks};
      emit(key, value);
    });
  }
}

function(doc) {
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      if ('tracks' in album) {
        album.tracks.forEach(function(track) {
          emit(doc.user, 1);
        });
      }
    });
  }
}
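The second map function in the demo emits a 1 per track keyed on user, so a matching reduce just sums the values; on a re-reduce pass the values are partial sums, so the same summation still works. A sketch of what that reduce would look like (CouchDB also ships a built-in _sum reduce for exactly this):

```javascript
// Sums the 1s emitted per track, giving a track count per user.
// On a re-reduce pass, `values` holds partial sums, so summing
// them still produces the correct total.
function reduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}
```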
couchdb — stepping through the map reduce example
docs passed to map
map emits values from each doc
emitted values are sorted by keys
chunks of rows with same key passed to reduce
if chunks are too big, re-reduce
repeat until no duplicate keys
couch vs mongo
- couch can be small, mongo not
- mongo shards, couch replicates
- mongo enables ad hoc queries – couch requires views
- couch is made for web – rest is afterthought in mongo
Summary
- schemaless json stores
- powerful map reduce
- scalable
Possible library applications
- multiple domain metadata repository
4. Graph dbs
interconnected data
nodes and edges (edges are the relationships)
both can have metadata
queries traverse data
good for networks, object-oriented data
open source
- neo4j
- orientdb
- hypergraphdb
neo4j
- built by Neo Tech
- written in java
interacting with the db
- gremlin/groovy command line console
- cypher query language
- rest api
- web UI (includes gremlin)
adding data using gremlin
chain commands
out == outE.inV
in == inE.outV
Demo
* gremlin> g.v(7).outE.inV.title
* ==> Time Bandits
* ==> Twelve Monkeys
* ==> Jabberwocky
* ==> Monty Python and the Holy Grail
* ==> null
looping – social network, find the friends of a friend, loops out over and over again
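That looping traversal is easy to picture with a plain adjacency list (the names and data are made up for the example; this is JavaScript standing in for the gremlin pipeline, not Neo4j itself):

```javascript
// Toy adjacency-list graph with a friends-of-a-friend lookup:
// go out one hop, then out again, excluding the start node and
// direct friends -- the kind of looping traversal gremlin does.
var friends = {
  alice: ["bob", "carol"],
  bob: ["alice", "dave"],
  carol: ["alice", "erin"],
  dave: ["bob"],
  erin: ["carol"]
};

function friendsOfFriends(graph, person) {
  var direct = graph[person] || [];
  var result = [];
  direct.forEach(function (friend) {
    (graph[friend] || []).forEach(function (fof) {
      if (fof !== person && direct.indexOf(fof) === -1 &&
          result.indexOf(fof) === -1) {
        result.push(fof);
      }
    });
  });
  return result;
}

var fofs = friendsOfFriends(friends, "alice");
```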
rest interface
getting a path via REST: give it a starting and an ending node, and it returns the path between them
gremlin via rest
indexing
- custom indexes
summary
- model anything
- pretty big
- ACID compliant
- does not scale well
Possible library applications
- no suggestions offered
Talk summary
- nosql dbs are fun to learn
- new capabilities
- trade-offs
- don’t abandon your RDBMS just yet