Code4lib 2013 NoSQL Pre-Conference Session

These notes are just a tad out of date. I started working on them right away after the conference, but then never got back to them. Unfortunately, I also no longer remember what goes in the gaps of my shorthand, so they get less explanatory and more list-y pretty quickly.

 

Pre-conference session at Code4lib2013

From the program:

Introduction to NoSQL Databases

1-470 Richard J. Daley Library, 9:00 am to 12:00 pm on Monday, February 11

Joshua Gomez, George Washington University, jngomez at gwu edu

Since Google published its paper on BigTable in 2006, alternatives to the traditional relational database model have been growing in both variety and popularity. These new databases (often referred to as NoSQL databases) excel at handling problems faced by modern information systems that the traditional relational model cannot. They are particularly popular among organizations tackling the so-called “Big Data” problems. However, there are always tradeoffs involved when making such dramatic changes. Understanding how these different kinds of databases are designed and what they can offer is essential to the decision making process. In this precon I will discuss some of the various types of new databases (key-value, columnar, document, graph) and walk through examples or exercises using some of their open source implementations like Riak, HBase, MongoDB or CouchDB, and Neo4j.

 Notes:

No slides available as of 6/18/13

Gomez began his presentation with the caution that he was not an expert and had, in fact, signed up to give the presentation in order to make himself learn about noSQL databases. Gold star for him; I don’t think I could do that.

 Outline of the presentation

What noSQL dbs are
How they are different from relational dbs
Review of ACID and CAP
The 4 types of noSQL dbs:
columnar db

  • bigtable & hbase
  • mapreduce & hadoop

key-value store db

  • dynamo
  • riak

document store

  • mongodb

graph db

Introduction

Are noSQL dbs an innovation or just a fad? Everyone seems to be doing it and each has their own version.

  • BigTable – Google
  • Dynamo – Amazon
  • Cassandra – Facebook
  • Voldemort – Linkedin

NoSQL dbs are built for situations where relational dbs struggle
There are no joins, and the data is not stored the way it is in a relational db
Schemaless data structures

Advantages:

  • high availability
  • massive horizontal scaling

Tradeoffs:

  • no guarantees
  • less than perfect consistency

ACID

  • atomicity – transaction must all run or all fail
  • consistency – db is always valid, no violation of constraints
  • isolation – separate transactions can’t interfere with each other
  • durability – changes are permanent, even when there is a system failure

CAP Theorem

distributed dbs can only guarantee 2 of these 3 properties:

  • consistency
  • availability
  • partition tolerance

latency is the tradeoff against availability and consistency

4 types of noSQL dbs
columnar

  • a single very large table, column-based, 2-dimensional
  • good for sparse data sets (lots of nulls) and aggregation
  • easy to add columns

key-value

  • fast distributed hash map
  • not good for complex queries or data aggregation

document

  • like a hash but allows for hierarchical data structures
  • combo of simple look-up and nested data, flexible

graph

  • records data about relationships
  • excellent for insight and discovery through node traversal

Key-value store dbs are the least complex databases
Graph databases are the most complex databases
90% of use cases don’t even come close to needing noSQL-level scale
Relational dbs are not obsolete; they can do most or all of the same things that noSQL does

So why would someone use noSQL?

  • might need a unitasker
  • might be cheaper
  • for the schema flexibility
  • to learn new things

Detailed look at the four types of databases

1. Columnar dbs

These dbs are column-based, not row-based
They are good for scans over large data sets
They allow for massive horizontal scaling

Implementations

  • BigTable
  • HBase
  • Cassandra
  • Hypertable

BigTable goals

  • wide applicability
  • scalability
  • high performance
  • high availability

Note that consistency is missing from the list!

BigTable data model

  • sparse
  • distributed
  • persistent
  • multi-dimensional

map (each cell) indexed by:

  • row key: string
  • column key: string
  • timestamp: 64 bit integer

Old data is never thrown away; a new timestamped version is added instead
Values are uninterpreted arrays of bytes (strings)
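
A toy illustration of that model (my own sketch, not Google’s API): each cell is addressed by (row key, column key, timestamp), and a write adds a new timestamped version instead of overwriting.

// Toy BigTable-style cell map: (row, column, timestamp) -> value.
var table = {};

function put(row, column, value) {
  var key = row + '/' + column;
  table[key] = table[key] || [];
  // Never overwrite: push a new timestamped version.
  table[key].push({ ts: Date.now(), value: value });
}

function get(row, column) {
  var versions = table[row + '/' + column] || [];
  // Return the most recent version.
  return versions.length ? versions[versions.length - 1].value : null;
}

put('com.cnn.www', 'contents:', '<html>...</html>');
get('com.cnn.www', 'contents:');  // newest version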

BigTable rows

  • rows are dynamically partitioned into tablets
  • short row ranges usually live on one machine
  • choosing row names carefully ensures related keys end up on the same machine

BigTable columns

  • column keys grouped into families
  • at most a few hundred families
  • individual columns unbounded (millions expected)
  • columns can be organized into locality groups to improve lookup efficiency

Bloom filters

  • used for fast lookups
  • array holds single bit values
  • hash functions map inputs to a set of cells in the array
  • allows for quick negative response
  • can give false positives, but no false negatives
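
A toy Bloom filter to make those bullets concrete (my own sketch, with deliberately simple hash functions):

// Toy Bloom filter: a bit array plus two hash functions.
var SIZE = 64;
var bits = new Array(SIZE).fill(0);

// Two cheap string hashes (illustrative only).
function h1(s) {
  var h = 0;
  for (var i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) % SIZE;
  return h;
}
function h2(s) {
  var h = 7;
  for (var i = 0; i < s.length; i++) h = (h * 17 + s.charCodeAt(i)) % SIZE;
  return h;
}

function add(key) { bits[h1(key)] = 1; bits[h2(key)] = 1; }

// If any bit is 0, the key is definitely absent (the quick negative);
// if all bits are 1, it is only *probably* present (false positives possible).
function mightContain(key) { return bits[h1(key)] === 1 && bits[h2(key)] === 1; }

add('row-42');
mightContain('row-42');   // true
mightContain('row-999');  // almost certainly false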

Close grouping of similar values enables very high compression (10-to-1)
It doesn’t keep saving the same identical data over and over again; it just points back to the existing copy

Distributed processing
The traditional approach uses a single expensive, powerful computer, but this doesn’t scale well.
The divide and conquer approach leverages lots of commodity hardware to handle data in smaller chunks processed in parallel.
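
The canonical illustration of divide and conquer here is MapReduce-style word counting: map each chunk to (word, 1) pairs, then reduce the pairs by key. A single-process JavaScript sketch of the idea (a real Hadoop job would spread the chunks across machines):

// MapReduce in miniature: map each chunk to (word, 1) pairs, then reduce by key.
function map(chunk) {
  return chunk.split(/\s+/).filter(Boolean).map(function(w) { return [w, 1]; });
}

function reduce(pairs) {
  var counts = {};
  pairs.forEach(function(p) { counts[p[0]] = (counts[p[0]] || 0) + p[1]; });
  return counts;
}

var chunks = ['big data big', 'data big table'];      // pretend each chunk lives on a different machine
var allPairs = [].concat.apply([], chunks.map(map));  // the "shuffle": gather all emitted pairs
reduce(allPairs);                                     // { big: 3, data: 2, table: 1 }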

HBase (hadoop)

  • apache
  • open source bigtable
  • built in Java
  • the shell is a JRuby interpreter
  • other interfaces: Jython, Groovy, Scala, REST, ???

HBase CRUD

–quick demo, didn’t work—

Operations can be run from the command line or from a REST interface (REST interface requires values in base64)
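
The demo didn’t cooperate, but for the record, HBase shell CRUD normally looks roughly like this (a sketch from memory; the table and column names are made up):

create 'books', 'info'                           # new table with one column family
put 'books', 'row1', 'info:title', 'Moby Dick'   # write a cell (also how you update)
get 'books', 'row1'                              # read a single row
scan 'books'                                     # scan the whole table
delete 'books', 'row1', 'info:title'             # remove a cell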
Intended for big jobs
Strong scaling capabilities
Scans over large sets fast
Complex ancillary systems – high learning curve
Documentation not the best

HBase possible library applications:

  • full text indexing of an entire collection — probably wouldn’t make it work very hard
  • web archiving
  • cms backend? — probably not, only for masochists

2. Key value stores

Very simple data structure – one big table
Fast but can’t do complex queries

Implementations:

  • Dynamo

Open source implementations

  • redis
  • riak
  • memcached
  • project voldemort

Dynamo (Amazon)
Its goal is to be reliable at massive scale. This means:

  • high availability
  • low latency

The environment is an e-commerce platform running hundreds of services (not just the Amazon store).
An RDBMS is not a good fit for this environment.
Most services retrieve by primary key, so there are no complex queries.
An RDBMS, on the other hand, requires expensive hardware and expertise.

Assumptions and Requirements
The query model:

  • read/write by key
  • no operations span multiple items (no relations)

acid properties

  • weak consistency

efficient

  • built on commodity hardware
  • low latency

other

  • internal, non-hostile environment – no internal access controls
  • scale of hundreds of hosts

Why does it need to be so fast?
In an ecosystem built on service tiers, each level of services has to be even faster than the one above it, right down to the data stores.

Design considerations
The compromise is on consistency, with a goal of “eventual consistency”
Conflicts are resolved at read time so that the database is always writeable, resulting in high availability
Conflicts are resolved by the application; on its own the db can only take the last write, so data could potentially be lost

Gomez used the example of Amazon’s shopping cart: accessing it from multiple computers at multiple times and making changes to it. There is the potential that something could get out of sync between the various computers.

Incremental scale, scale out one node at a time
Symmetry – all nodes are peers and equal, no master nodes
Decentralization – handles outages better
Heterogeneity – hosts not created equally, some hardware will be better than other hardware

System architecture
interface, has only two operations

  • get(key)
  • put(key, context, object)

dynamic data partitioning

  • consistent hashing (in a ring structure)
  • each node is assigned a position in a ring
  • keys are hashed to determine node
  • nodes are virtual and hashed to multiple points
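
A stripped-down sketch of consistent hashing with virtual nodes (my own toy code, not Dynamo’s): each node hashes to several positions on a ring, and a key belongs to the first node clockwise from its own hash.

// Toy consistent-hash ring over positions 0..359.
var HASH_SPACE = 360;

function hash(s) {
  var h = 0;
  for (var i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) % HASH_SPACE;
  return h;
}

// Each physical node is hashed to several virtual points on the ring.
var ring = [];
['nodeA', 'nodeB', 'nodeC'].forEach(function(node) {
  for (var v = 0; v < 3; v++) ring.push({ pos: hash(node + '#' + v), node: node });
});
ring.sort(function(a, b) { return a.pos - b.pos; });

// A key is owned by the first virtual node at or after its hash, wrapping around.
function ownerOf(key) {
  var h = hash(key);
  for (var i = 0; i < ring.length; i++) if (ring[i].pos >= h) return ring[i].node;
  return ring[0].node; // wrapped past the last position
}

ownerOf('user:1234'); // some node; stable as long as the ring doesn't change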

replication

  • data replicated to N hosts
  • coordinator node replicates to the N-1 successors plus others on the preference list

versioning

  • eventual consistency
  • vector clocks
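
To unpack the vector clock bullet: each replica keeps a counter per node; a version supersedes another when its counters are >= in every slot, and otherwise the two are concurrent and the application has to reconcile them. A toy comparison (my own sketch, with clocks as plain {node: counter} objects):

// true if clock a descends from (supersedes) clock b
function descends(a, b) {
  return Object.keys(b).every(function(node) {
    return (a[node] || 0) >= b[node];
  });
}

var v1 = { nodeA: 2, nodeB: 1 };
var v2 = { nodeA: 1, nodeB: 1 };
var v3 = { nodeA: 1, nodeB: 2 };

descends(v1, v2);                      // true: v1 supersedes v2
descends(v1, v3) || descends(v3, v1);  // false: concurrent versions, app must merge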

configurable quorum

  • define how many nodes (R, W) must take part in a successful read or write; R + W > N gives a quorum

Amazon’s white paper on Dynamo

Riak

  • open source key-value store
  • developed by Basho
  • written in Erlang

Riak is based on Dynamo

Added features:

  • it links data together
  • data can be anything – text, images, video
  • can use a REST interface
  • curl – get/put/delete/post

–Riak demo–
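
My notes don’t capture the demo itself, but the curl calls were presumably of this shape (a reconstruction assuming Riak’s default HTTP port 8098 and the /riak/<bucket>/<key> URL scheme of that era):

# store a JSON value under bucket "test", key "doc1"
curl -X PUT http://localhost:8098/riak/test/doc1 \
     -H "Content-Type: application/json" \
     -d '{"title": "example"}'

# fetch it back
curl http://localhost:8098/riak/test/doc1

# delete it
curl -X DELETE http://localhost:8098/riak/test/doc1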

link walking

  • bucket, tag, keep

ad hoc read/write specification

  • you can specify w, r and n within a query

custom server side validation

  • pre-commit and post-commit hooks

plugins

  • indexing – put into headers
  • search – plugin builds an inverted index with pre-commit hooks
  • has an HTTP Solr interface

Summary of key-value dbs

  • distributed, replicated
  • high availability
  • schemaless
  • [something else on the slide that I missed]

Possible library applications

  • large inventory/repository backend

3. Document-oriented DBs
These dbs use a table of keys, like the key-value stores
data stored in json “documents”
allows for hierarchical data
schemaless, change data structure on the fly

open source options

  • mongodb
  • couchdb

couchdb

  • apache
  • Erlang

easy to use

  • big or small projects
  • easy to install
  • nice web UI (Futon)

–couch db demo–

data not overwritten, just added with a new revision number
couchdb – mapreduce
views to query data

  • views consist of a map() and an optional reduce()
  • the map function takes a doc and emits key-value pairs via emit(key, value)
  • reduce has 3 arguments: keys, values, rereduce

mapreduce – resource intensive

  • don’t run jobs on a production environment
  • save temporary views as design docs
  • couchdb stores results and watches for changes
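
As context for the design doc bullet above: a design doc is just a JSON document whose _id starts with _design/. A hedged sketch of the shape (the db, view, and field names here are made up):

{
  "_id": "_design/albums",
  "language": "javascript",
  "views": {
    "tracks_per_user": {
      "map": "function(doc) { ... }",
      "reduce": "function(keys, values, rereduce) { return sum(values); }"
    }
  }
}

Once saved, the view is queried with GET /dbname/_design/albums/_view/tracks_per_user.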

queries from the demo:
function(doc) {
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      var key = album.title;
      var value = {by: album.artist, tracks: album.tracks};
      emit(key, value);
    });
  }
}

function(doc) {
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      if ('tracks' in album) {
        album.tracks.forEach(function(track) {
          emit(doc.user, 1);
        });
      }
    });
  }
}
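
Both of those are map functions. The second one emits (user, 1) for every track, so the natural companion reduce just sums the counts, giving tracks per user (CouchDB’s built-in _sum reduce does the same thing):

function(keys, values, rereduce) {
  // values holds the emitted 1s (or partial sums when rereduce is true)
  return sum(values);
}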

couchdb – stepping through the mapreduce example

  • docs are passed to map
  • map emits values from each doc
  • emitted values are sorted by key
  • chunks of rows with the same key are passed to reduce
  • if the chunks are too big, a rereduce is run on the partial results
  • repeat until there are no duplicate keys

couch vs mongo

  • couch can be small, mongo not
  • mongo shards, couch replicates
  • mongo enables ad hoc queries – couch requires views
  • couch is made for web – rest is afterthought in mongo

Summary

  • schemaless json stores
  • powerful map reduce
  • scalable

Possible library applications

  • multiple domain metadata repository

4. Graph dbs

interconnected data
nodes and edges (edges are the relationships)
both can have metadata
queries traverse data
good for networks, object-oriented data

open source

  • neo4j
  • orientdb
  • hypergraphdb

neo4j

  • built by Neo Technology
  • written in java

interacting with the db

  • gremlin/groovy command line console
  • cypher query language
  • rest api
  • web UI (includes gremlin)

adding data using gremlin
commands can be chained
out == outE.inV
in == inE.outV
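
My notes don’t include the add-data commands themselves; they were roughly of this shape (Gremlin 1.x-era console syntax from memory, with made-up movie data to match the demo below):

gremlin> v1 = g.addVertex(null)
gremlin> v1.title = 'Brazil'
gremlin> v2 = g.addVertex(null)
gremlin> v2.name = 'Terry Gilliam'
gremlin> g.addEdge(null, v2, v1, 'directed')
gremlin> v2.out('directed').title
==> Brazil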

Demo
gremlin> g.v(7).outE.inV.title
==> Time Bandits
==> Twelve Monkeys
==> Jabberwocky
==> Monty Python and the Holy Grail
==> null

looping – in a social network, finding the friends of friends means looping outward over and over again

rest interface

getting a path via REST: give it a starting and an ending node, and it will give you the path between them?

gremlin via rest

indexing

  • custom indexes

summary

  • model anything
  • pretty big
  • ACID compliant
  • does not scale well

Possible library applications

  • no suggestions offered

Talk summary

  • nosql dbs are fun to learn
  • new capabilities
  • trade-offs
  • don’t abandon your RDBMS just yet