These notes are just a tad out of date. I started working on them right away after the conference, but then never got back to them again. Unfortunately, I also no longer remember what goes in the gaps of my short-hand, so they get less explanatory and more list-y pretty quickly.
Pre-conference session at Code4lib2013
From the program:
Introduction to NoSQL Databases
1-470 Richard J. Daley Library, 9:00 am to 12:00 pm on Monday, February 11
Joshua Gomez, George Washington University, jngomez at gwu edu
Since Google published its paper on BigTable in 2006, alternatives to the traditional relational database model have been growing in both variety and popularity. These new databases (often referred to as NoSQL databases) excel at handling problems faced by modern information systems that the traditional relational model cannot. They are particularly popular among organizations tackling the so-called “Big Data” problems. However, there are always tradeoffs involved when making such dramatic changes. Understanding how these different kinds of databases are designed and what they can offer is essential to the decision making process. In this precon I will discuss some of the various types of new databases (key-value, columnar, document, graph) and walk through examples or exercises using some of their open source implementations like Riak, HBase, MongoDB or CouchDB, and Neo4j.
Notes:
No slides available as of 6/18/13
Gomez began his presentation with the caution that he was not an expert and had, in fact, signed up to give the presentation in order to make himself learn about noSQL databases. Gold star for him; I don’t think I could do that.
Outline of the presentation
What noSQL dbs are
How they are different from relational dbs
Review of ACID and CAP
The 4 types of noSQL dbs:
columnar db
- bigtable & hbase
- mapreduce & hadoop
key-value store db
- dynamo
- riak
document store
- mongodb
graph db
Introduction
Are noSQL dbs an innovation or just a fad? Everyone seems to be doing it and each has their own version.
- BigTable – Google
- Dynamo – Amazon
- Cassandra – Facebook
- Voldemort – Linkedin
NoSQL dbs are built for situations where relational dbs struggle
There are no joins, and data is not stored the way it is in a relational db
Schemaless data structures
Advantages:
- high availability
- massive horizontal scaling
Tradeoffs:
- no guarantees
- not perfect consistency
ACID
- atomicity – transaction must all run or all fail
- consistency – db is always valid, no violation of constraints
- isolation – separate transactions can’t interfere with each other
- durability – changes are permanent, even when there is a system failure
CAP Theory
distributed dbs can only provide for 2 out of 3 properties:
- consistency
- availability
- partition tolerance
latency is the tradeoff against availability and consistency
4 types of noSQL dbs
columnar
- a single very large table, column-based, 2-dimensional
- good for sparse data sets (lots of nulls) and aggregation
- easy to add columns
key-value
- fast distributed hash map
- no good for complex queries or data aggregation
document
- like a hash but allows for hierarchical data structures
- combo of simple look-up and nested data, flexible
graph
- records data about relationships
- excellent for insight and discovery through node traversal
Key-value store dbs are the least complex databases
Graph databases are the most complex databases
90% of use cases don’t even come close to needing noSQL-level scale
Relational dbs are not obsolete, they can do most or all of the same things that noSQL does
So why would someone use noSQL?
- might need a unitasker (a tool that does just one job well)
- might be cheaper
- for the schema flexibility
- to learn new things
Detailed look at the four types of databases
1. Columnar dbs
These dbs are column-based, not row-based
They are good for scans over large data sets
They allow for massive horizontal scaling
Implementations
- BigTable
- HBase
- Cassandra
- Hypertable
BigTable goals
- wide applicability
- scalability
- high performance
- high availability
Note that consistency is missing from the list!
BigTable data model
- sparse
- distributed
- persistent
- multi-dimensional
map (each cell) indexed by:
- row key: string
- column key: string
- timestamp: 64 bit integer
No old data is thrown away; each write just gets a new timestamp
Values are uninterpreted arrays of bytes (strings)
BigTable rows
- rows are dynamically partitioned (into tablets)
- short row ranges usually on one machine
- manipulation of row names ensures related keys land on the same machine
BigTable columns
- column keys grouped into families
- at most a few hundred families
- individual columns unbounded (millions expected)
- columns can be organized into locality groups to improve lookup efficiency
Bloom filters
- used for fast lookups
- array holds single bit values
- hash function maps inputs to set of cells in the array
- allows for quick negative response
- can give false positives, but no false negatives
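The quick-negative behavior is easy to sketch. Below is a toy Bloom filter, assuming a simple seeded string hash invented for illustration (BigTable's actual hash functions are different):

```javascript
// Toy Bloom filter: a bit array plus k seeded hash functions.
// The hash below is a simple string hash chosen for illustration only.
function hash(s, seed) {
  var h = seed;
  for (var i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

function BloomFilter(size, seeds) {
  this.bits = new Array(size).fill(false);
  this.size = size;
  this.seeds = seeds;
}

// Set one bit per hash function for the given key.
BloomFilter.prototype.add = function (key) {
  for (var seed of this.seeds) {
    this.bits[hash(key, seed) % this.size] = true;
  }
};

// If any bit is unset, the key was definitely never added
// (quick negative); if all are set, the key is *probably* present.
BloomFilter.prototype.mightContain = function (key) {
  var self = this;
  return this.seeds.every(function (seed) {
    return self.bits[hash(key, seed) % self.size];
  });
};

var bf = new BloomFilter(64, [17, 31, 73]);
bf.add("row-001");
bf.add("row-042");
```

Because different keys can set overlapping bits, a lookup can wrongly answer "maybe present" (false positive), but a key that was added always hits all of its bits, so there are no false negatives.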
Close grouping of similar values enables very high compression (10:1)
Identical data isn’t saved over and over again; later copies just point back to the existing file
Distributed processing
The traditional approach uses a single expensive, powerful computer, but this doesn’t scale well.
The divide and conquer approach leverages lots of commodity hardware to handle data in smaller chunks processed in parallel.
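A minimal in-process sketch of that divide-and-conquer pattern (plain word counting, not actual Hadoop — the "chunks" here stand in for data split across machines):

```javascript
// Minimal in-process map/shuffle/reduce for word counts.
// Real Hadoop distributes each phase across machines.
function map(chunk) {
  // Emit a (word, 1) pair for every word in the chunk.
  return chunk.split(/\s+/).filter(Boolean).map(function (w) {
    return [w, 1];
  });
}

function shuffle(pairs) {
  // Group all emitted values under their key.
  var groups = {};
  pairs.forEach(function (pair) {
    (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
  });
  return groups;
}

function reduce(values) {
  // Sum the 1s for one key.
  return values.reduce(function (a, b) { return a + b; }, 0);
}

function wordCount(chunks) {
  var shuffled = shuffle([].concat.apply([], chunks.map(map)));
  var result = {};
  Object.keys(shuffled).forEach(function (k) {
    result[k] = reduce(shuffled[k]);
  });
  return result;
}

var counts = wordCount(["big data big", "data data"]);
```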
HBase (hadoop)
- apache
- open source bigtable
- built in Java
- shell is a JRuby interpreter
- other interfaces
- Jython, Groovy, Scala, REST, ???
HBase CRUD
–quick demo, didn’t work—
Operations can be run from the command line or from a REST interface (REST interface requires values in base64)
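For example, storing the string "hello" through the REST interface means base64-encoding it first; in Node that's just a Buffer round trip (a sketch of the encoding step only, not a full HBase client):

```javascript
// HBase's REST interface expects and returns cell values as
// base64 strings; Node's Buffer class handles the conversion.
var value = "hello";
var encoded = Buffer.from(value, "utf8").toString("base64");
var decoded = Buffer.from(encoded, "base64").toString("utf8");
```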
Intended for big jobs
Strong scaling capabilities
Scans over large sets fast
Complex ancillary systems – high learning curve
Documentation not the best
HBase possible library applications:
- full text indexing of an entire collection — probably wouldn’t make it work very hard
- web archiving
- cms backend? — probably not, only for masochists
2. Key value stores
Very simple data structure – one big table
Fast but can’t do complex queries
Implementations:
- Dynamo
Open source implementations
- redis
- riak
- memcached
- project voldemort
Dynamo (Amazon)
Its goal is to be reliable on a massive scale. This means:
- high availability
- low latency
The environment is an e-commerce platform, running hundreds of services (not just the Amazon store).
An RDBMS is not a good fit for this environment.
Most services retrieve by primary key, so there are no complex queries
An RDBMS, on the other hand, requires expensive hardware and expertise
Assumptions and Requirements
The query model:
- read/write by key
- no operations across multiple items (no relations)
acid properties
- weak consistency
efficient
- built on commodity hardware
- low latency
other
- internal, non-hostile environment – no internal access controls needed
- scale of hundreds of hosts
Why does it need to be so fast?
In an ecosystem built on service tiers, each level of services has to be even faster than the one above it, right down to the data stores.
Design considerations
The compromise is on consistency, with a goal of “eventual consistency”
Conflicts are resolved at read time so that the database is always writeable, resulting in high availability
Conflicts are resolved by the application; the db itself can only take the last write, so data could potentially be lost
Gomez used the example of Amazon’s shopping cart: access it from multiple computers at multiple times and make changes to it, and there is the potential for something to get out of sync between the various computers.
Incremental scale, scale out one node at a time
Symmetry – all nodes are peers and equal, no master nodes
Decentralization – handles outages better
Heterogeneity – hosts not created equally, some hardware will be better than other hardware
System architecture
interface, has only two operations
- get key
- put key, context, object
dynamic data partitioning
- consistent hashing (in a ring structure)
- each node is assigned a position in a ring
- keys are hashed to determine node
- nodes are virtual and hashed to multiple points
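A toy version of that ring, with a made-up hash function and an illustrative 360-point ring (real implementations use much larger hash spaces and better hashes):

```javascript
// Toy consistent-hash ring: each node is hashed to several
// virtual points; a key goes to the first node clockwise from
// its own hash position.
function hash(s) {
  var h = 2166136261;
  for (var i = 0; i < s.length; i++) {
    h = ((h ^ s.charCodeAt(i)) * 16777619) >>> 0;
  }
  return h % 360; // positions on a 360-point ring
}

function Ring(nodes, vnodes) {
  this.points = [];
  var self = this;
  nodes.forEach(function (node) {
    for (var v = 0; v < vnodes; v++) {
      self.points.push({ pos: hash(node + "#" + v), node: node });
    }
  });
  this.points.sort(function (a, b) { return a.pos - b.pos; });
}

// Walk clockwise from the key's position to the next virtual node.
Ring.prototype.nodeFor = function (key) {
  var pos = hash(key);
  for (var i = 0; i < this.points.length; i++) {
    if (this.points[i].pos >= pos) return this.points[i].node;
  }
  return this.points[0].node; // wrap around the ring
};

var ring = new Ring(["A", "B", "C"], 3);
```

Because each physical node appears at several virtual points, adding or removing a node only remaps the keys near its points instead of rehashing everything.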
replication
- data replicated to N hosts
- coordinator node replicates to N-1 successors + others in the preference list
versioning
- eventual consistency
- vector clocks
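Vector clocks are what let Dynamo tell a stale write from a genuinely concurrent one. A toy sketch (the node names sx, sy, sz are made up for the example):

```javascript
// Toy vector clock: one counter per node. Two clocks conflict
// ("concurrent") when neither descends from the other, which is
// what forces Dynamo to hand the conflict to the application.
function tick(clock, node) {
  var next = Object.assign({}, clock);
  next[node] = (next[node] || 0) + 1;
  return next;
}

// True if clock a has seen everything recorded in clock b.
function descends(a, b) {
  return Object.keys(b).every(function (n) {
    return (a[n] || 0) >= b[n];
  });
}

function concurrent(a, b) {
  return !descends(a, b) && !descends(b, a);
}

var base = tick({}, "sx");   // first write handled by node sx
var left = tick(base, "sy"); // replica sy updates...
var right = tick(base, "sz"); // ...while sz updates independently
```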
configurable quorum
- define levels at which read and write happen
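The arithmetic behind those levels: with N replicas, a read quorum R and a write quorum W are guaranteed to overlap whenever R + W > N, so every read touches at least one replica that saw the latest write. A minimal check:

```javascript
// Reads of R nodes and writes to W nodes out of N replicas must
// share at least one node whenever R + W > N.
function quorumOverlaps(n, r, w) {
  return r + w > n;
}

var strict = quorumOverlaps(3, 2, 2); // a common Dynamo-style setting
var loose = quorumOverlaps(3, 1, 1);  // fast, but reads can miss writes
```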
Amazon’s white paper on Dynamo
Riak
- open source key-value store
- developed by Basho
- written in Erlang
Riak is based on dynamo
Added features:
- it links data together
- data can be anything – text, images, video
- can use a REST interface
- curl – get/put/delete/post
–Riak demo–
link walking
- bucket, tag, keep
ad hoc read/write specification
- you can specify w, r and n within a query
custom server side validation
- pre-commit and post-commit hooks
plugins
- indexing – put into headers
- search – plugin builds an inverted index with pre-commit hooks
- has an HTTP Solr interface
Summary of key-value dbs
- distributed replicated
- high availability
- schemaless
- [something else on the slide that I missed]
Possible library applications
- large inventory/repository backend
3. Document-oriented DBs
These dbs use key-based lookup like the key-value stores
data stored in json “documents”
allows for hierarchical data
schemaless, change data structure on the fly
open source options
- mongodb
- couchdb
couchdb
- apache
- Erlang
easy to use
- big or small projects
- easy to install
- nice web UI (Futon)
–couch db demo–
data not overwritten, just added with a new revision number
couchdb – mapreduce
views to query data
- consist of map() and optional reduce()
- map has 1 argument: the document
- map function emits key-value pairs via emit(key, value)
- reduce has 3 arguments: keys, values, rereduce
mapreduce – resource intensive
- don’t run jobs on a production environment
- save temp views as design doc
- couchdb stores results and watches for changes
queries from the demo:
function(doc) {
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      var key = album.title;
      var value = {by: album.artist, tracks: album.tracks};
      emit(key, value);
    });
  }
}

function(doc) {
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      if ('tracks' in album) {
        album.tracks.forEach(function(track) {
          emit(doc.user, 1);
        });
      }
    });
  }
}
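The second map function in the demo emits a 1 per track keyed on user, so a matching reduce just sums the values; on a re-reduce pass the values are partial sums, so the same summation still works. A sketch of what that reduce would look like (CouchDB also ships a built-in _sum reduce for exactly this):

```javascript
// Sums the 1s emitted per track, giving a track count per user.
// On a re-reduce pass, `values` holds partial sums, so summing
// them still produces the correct total.
function reduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}
```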
couchdb — stepping through the map reduce example
docs passed to map
map emits values from each doc
emitted values are sorted by keys
chunks of rows with same key passed to reduce
if chunks are too big, re-reduce
repeat until no duplicate keys
couch vs mongo
- couch can be small, mongo not
- mongo shards, couch replicates
- mongo enables ad hoc queries – couch requires views
- couch is made for web – rest is afterthought in mongo
Summary
- schemaless json stores
- powerful map reduce
- scalable
Possible library applications
- multiple domain metadata repository
4. Graph dbs
interconnected data
nodes and edges (edges are the relationships)
both can have metadata
queries traverse data
good for networks, object-oriented data
open source
- neo4j
- orientdb
- hypergraphdb
neo4j
- built by Neo Tech
- written in java
interacting with the db
- gremlin/groovy command line console
- cypher query language
- rest api
- web UI (includes gremlin)
adding data using gremlin
chain commands
out == outE.inV
in == inE.outV
Demo
* gremlin> g.v(7).outE.inV.title
* ==> Time Bandits
* ==> Twelve Monkeys
* ==> Jabberwocky
* ==> Monty Python and the Holy Grail
* ==> null
looping – social network, find the friends of a friend, loops out over and over again
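That looping traversal is easy to picture with a plain adjacency list (the names and data are made up for the example; this is JavaScript standing in for the gremlin pipeline, not Neo4j itself):

```javascript
// Toy adjacency-list graph with a friends-of-a-friend lookup:
// go out one hop, then out again, excluding the start node and
// direct friends -- the kind of looping traversal gremlin does.
var friends = {
  alice: ["bob", "carol"],
  bob: ["alice", "dave"],
  carol: ["alice", "erin"],
  dave: ["bob"],
  erin: ["carol"]
};

function friendsOfFriends(graph, person) {
  var direct = graph[person] || [];
  var result = [];
  direct.forEach(function (friend) {
    (graph[friend] || []).forEach(function (fof) {
      if (fof !== person && direct.indexOf(fof) === -1 &&
          result.indexOf(fof) === -1) {
        result.push(fof);
      }
    });
  });
  return result;
}

var fofs = friendsOfFriends(friends, "alice");
```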
rest interface
getting a path via REST: give it a starting and an ending node, and it returns the path between them
gremlin via rest
indexing
- custom indexes
summary
- model anything
- pretty big
- ACID compliant
- does not scale well
Possible library applications
- no suggestions offered
Talk summary
- nosql dbs are fun to learn
- new capabilities
- trade-offs
- don’t abandon your RDBMS just yet