Code4lib 2013 NoSQL Pre-Conference Session

These notes are just a tad out of date. I started working on them right away after the conference, but then never got back to them again. Unfortunately, I also no longer remember what goes in the gaps of my shorthand, so they get less explanatory and more list-y pretty quickly.

 

Pre-conference session at Code4lib2013

From the program:

Introduction to NoSQL Databases

1-470 Richard J. Daley Library, 9:00 am to 12:00 pm on Monday, February 11

Joshua Gomez, George Washington University, jngomez at gwu edu

Since Google published its paper on BigTable in 2006, alternatives to the traditional relational database model have been growing in both variety and popularity. These new databases (often referred to as NoSQL databases) excel at handling problems faced by modern information systems that the traditional relational model cannot. They are particularly popular among organizations tackling the so-called “Big Data” problems. However, there are always tradeoffs involved when making such dramatic changes. Understanding how these different kinds of databases are designed and what they can offer is essential to the decision making process. In this precon I will discuss some of the various types of new databases (key-value, columnar, document, graph) and walk through examples or exercises using some of their open source implementations like Riak, HBase, MongoDB or CouchDB, and Neo4j.

 Notes:

No slides available as of 6/18/13

Gomez began his presentation with the caution that he was not an expert and had, in fact, signed up to give the presentation in order to make himself learn about noSQL databases. Gold star for him; I don’t think I could do that.

 Outline of the presentation

What noSQL dbs are
How they are different from relational dbs
Review of ACID and CAP
The 4 types of noSQL dbs:
columnar db

  • bigtable & hbase
  • mapreduce & hadoop

key-value store db

  • dynamo
  • riak

document store

  • mongodb

graph db

Introduction

Are noSQL dbs an innovation or just a fad? Everyone seems to be doing it and each has their own version.

  • BigTable – Google
  • Dynamo – Amazon
  • Cassandra – Facebook
  • Voldemort – LinkedIn

NoSQL dbs are built for situations where relational dbs struggle
There are no joins; the data is not stored the way it is in a relational db
Schemaless data structures

Advantages:

  • high availability
  • massive horizontal scaling

Tradeoffs:

  • no guarantees
  • consistency is not perfect

ACID

  • atomicity – a transaction must either run completely or fail completely
  • consistency – db is always valid, no violation of constraints
  • isolation – separate transactions can’t interfere with each other
  • durability – changes are permanent, even when there is a system failure

CAP Theorem

distributed dbs can only provide for 2 out of 3 properties:

  • consistency
  • availability
  • partition tolerance

latency is the tradeoff against availability and consistency

4 types of noSQL dbs
columnar

  • a single very large table; column-based, two-dimensional
  • good for sparse data sets (lots of nulls) and aggregation
  • easy to add columns

key-value

  • fast distributed hash map
  • no good for complex queries or data aggregation

document

  • like a hash but allows for hierarchical data structures
  • combo of simple look-up and nested data, flexible

graph

  • records data about relationships
  • excellent for insight and discovery through node traversal

Key-value store dbs are the least complex databases
Graph databases are the most complex databases
90% of use cases don’t even come close to needing noSQL-level scale
Relational dbs are not obsolete; they can do most or all of the same things that noSQL does

So why would someone use noSQL?

  • might need a unitasker
  • might be cheaper
  • for the schema flexibility
  • to learn new things

Detailed look at the four types of databases

1. Columnar dbs

These dbs are column-based, not row-based
They are good for scans over large data sets
They allow for massive horizontal scaling

Implementations

  • BigTable
  • HBase
  • Cassandra
  • Hypertable

BigTable goals

  • wide applicability
  • scalability
  • high performance
  • high availability

Note that consistency is missing from the list!

BigTable data model

  • sparse
  • distributed
  • persistent
  • multi-dimensional

map (each cell) indexed by:

  • row key: string
  • column key: string
  • timestamp: 64 bit integer

Old data is not thrown away; a new timestamp is simply added
Values are uninterpreted arrays of bytes (strings)
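
To make that concrete, here is a rough sketch of the data model as a nested JavaScript object. This is not BigTable or HBase code, and the row key, column names, and values are all made up; the point is just that each cell is addressed by (row key, column key, timestamp).

// Not BigTable/HBase code -- a nested-object sketch of the data model.
// Each cell is indexed by (row key, column key, timestamp).
var table = {
  "com.example.www": {                                  // row key
    "contents:html": {                                  // column key (family:qualifier)
      1360540800000: "<html>...newer version...</html>",
      1360454400000: "<html>...older version...</html>" // old data kept, new stamps added
    },
    "anchor:example.org": {
      1360540800000: "Example link text"
    }
  }
};

// a read supplies all three parts of the index:
var cell = table["com.example.www"]["contents:html"][1360540800000];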

BigTable rows

  • rows are dynamically partitioned into row ranges (tablets)
  • short row ranges usually on one machine
  • manipulation of row names can ensure related keys end up on the same machine

BigTable columns

  • column keys grouped into families
  • few hundreds of families
  • the number of individual columns is unbounded (millions expected)
  • columns can be organized into locality groups to improve lookup efficiency

Bloom filters

  • used for fast lookups
  • array holds single bit values
  • hash function maps inputs to set of cells in the array
  • allows for quick negative response
  • can give false positives, but no false negatives
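
A toy version of the idea in JavaScript (this is not HBase’s implementation, just a sketch of how the bit array and hash functions fit together):

// Toy Bloom filter: a bit array plus a few hash functions.
// "Might contain" can be a false positive, "definitely not present" is always
// correct -- which is what makes the quick negative response possible.
function BloomFilter(size, numHashes) {
  this.bits = new Array(size).fill(false);
  this.size = size;
  this.numHashes = numHashes;
}

// Simple seeded string hash (djb2 variant) -- just for illustration.
BloomFilter.prototype.hash = function (value, seed) {
  var h = 5381 + seed;
  for (var i = 0; i < value.length; i++) {
    h = ((h * 33) + value.charCodeAt(i)) >>> 0;
  }
  return h % this.size;
};

BloomFilter.prototype.add = function (value) {
  for (var k = 0; k < this.numHashes; k++) {
    this.bits[this.hash(value, k)] = true;    // set every cell the hashes map to
  }
};

BloomFilter.prototype.mightContain = function (value) {
  for (var k = 0; k < this.numHashes; k++) {
    if (!this.bits[this.hash(value, k)]) {
      return false;                           // any unset bit = definite miss
    }
  }
  return true;                                // all bits set = maybe present
};

var filter = new BloomFilter(1024, 3);
filter.add("row-00042");
filter.mightContain("row-00042");  // true
filter.mightContain("row-99999");  // almost certainly false -> skip the lookup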

Close grouping of similar values enables very high compression (10:1)
Doesn’t keep saving the same identical data over and over again, just points back to the existing file

Distributed processing
The traditional approach uses a single expensive powerful computer, but this doesn’t scale well.
The divide and conquer approach leverages lots of commodity hardware to handle data in smaller chunks processed in parallel.
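
The classic illustration of the divide-and-conquer approach is a word count. The sketch below is plain JavaScript rather than real Hadoop code: the map step runs independently over each chunk (on a real cluster those runs happen in parallel on different machines), and the reduce step merges the partial results.

// Divide-and-conquer sketch (not Hadoop code): map over chunks, then reduce.
var chunks = [
  "to be or not to be",
  "that is the question"
];

// map: each chunk independently produces (word, 1) pairs
function mapChunk(chunk) {
  return chunk.split(" ").map(function (word) {
    return [word, 1];
  });
}

// reduce: merge all the partial counts by key
function reduceCounts(pairsList) {
  var totals = {};
  pairsList.forEach(function (pairs) {
    pairs.forEach(function (pair) {
      totals[pair[0]] = (totals[pair[0]] || 0) + pair[1];
    });
  });
  return totals;
}

var wordCounts = reduceCounts(chunks.map(mapChunk));
// { to: 2, be: 2, or: 1, not: 1, that: 1, is: 1, the: 1, question: 1 }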

HBase (hadoop)

  • apache
  • open source bigtable
  • built in Java
  • the shell is a JRuby interpreter
  • other interfaces: Jython, Groovy, Scala, REST, ???

HBase CRUD

–quick demo, didn’t work—

Operations can be run from the command line or from a REST interface (REST interface requires values in base64)
Intended for big jobs
Strong scaling capabilities
Scans over large sets fast
Complex ancillary systems – high learning curve
Documentation not the best

HBase possible library applications:

  • full text indexing of an entire collection — probably wouldn’t make it work very hard
  • web archiving
  • cms backend? — probably not, only for masochists

2. Key value stores

Very simple data structure – one big table
Fast but can’t do complex queries

Implementations:

  • Dynamo

Open source implementations

  • redis
  • riak
  • memcached
  • project voldemort

Dynamo (Amazon)
Its goal is to be reliable on a massive scale. This means:

  • high availability
  • low latency

The environment is an e-commerce platform, running hundreds of services (not just the Amazon store).
An RDBMS is not a good fit for this environment.
Most services retrieve by primary key, so there are no complex queries
An RDBMS, on the other hand, requires expensive hardware and expertise

Assumptions and Requirements
The query model:

  • read/write by key
  • no operations across multiple items (no relations)

ACID properties

  • weak consistency

efficient

  • built on commodity hardware
  • low latency

other

  • internal, non-hostile environment – no internal access controls needed
  • scale of hundreds of hosts

Why does it need to be so fast?
In an ecosystem built on service tiers, each level of services has to be even faster than the one above it, right down to the data stores.

Design considerations
The compromise is on consistency, with a goal of “eventual consistency”
Conflicts are resolved at read time so that the database is always writeable, resulting in high availability
Conflicts are resolved by the application; on its own, the db can only take the last change, so data could potentially be lost

Gomez used the example of Amazon’s shopping cart: accessing it from multiple computers at multiple times and making changes to it. There is the potential that something could get out of sync between the various computers.

Incremental scale, scale out one node at a time
Symmetry – all nodes are peers and equal, no master nodes
Decentralization – handles outages better
Heterogeneity – hosts not created equally, some hardware will be better than other hardware

System architecture
interface – has only two operations

  • get (key)
  • put (key, context, object)

dynamic data partitioning

  • consistent hashing (in a ring structure)
  • each node is assigned a position in a ring
  • keys are hashed to determine node
  • nodes are virtual and hashed to multiple points
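
A toy version of the ring (not Dynamo or Riak code; the node names and hash function are made up): nodes are hashed to several virtual positions, and a key lives on the first node clockwise from the key’s hash.

// Toy consistent-hashing ring.
function hashToRing(value) {
  var h = 0;
  for (var i = 0; i < value.length; i++) {
    h = ((h * 31) + value.charCodeAt(i)) >>> 0;
  }
  return h % 360;   // pretend the ring has 360 positions
}

var nodes = ["node-a", "node-b", "node-c"];
var ring = [];
nodes.forEach(function (node) {
  // each physical node gets several virtual positions on the ring
  for (var v = 0; v < 3; v++) {
    ring.push({ position: hashToRing(node + "#" + v), node: node });
  }
});
ring.sort(function (a, b) { return a.position - b.position; });

function nodeForKey(key) {
  var p = hashToRing(key);
  for (var i = 0; i < ring.length; i++) {
    if (ring[i].position >= p) { return ring[i].node; }   // first node clockwise
  }
  return ring[0].node;   // wrap around the ring
}

nodeForKey("user:1234");   // -> whichever node owns that slice of the ring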

replication

  • data replicated to N hosts
  • the coordinator node replicates to the N-1 successor nodes + others in the preference list

versioning

  • eventual consistency
  • vector clocks
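
Vector clocks are simple enough to sketch (again, an illustration, not Dynamo’s code): each version carries a per-node counter, one version supersedes another if all of its counters are at least as large, and otherwise the versions are concurrent and have to be reconciled.

// Toy vector clock sketch.
function increment(clock, node) {
  var next = Object.assign({}, clock);
  next[node] = (next[node] || 0) + 1;
  return next;
}

function descends(a, b) {
  // true if version a already includes everything in version b
  return Object.keys(b).every(function (node) {
    return (a[node] || 0) >= b[node];
  });
}

var v1 = increment({}, "nodeA");   // {nodeA: 1}
var v2 = increment(v1, "nodeB");   // {nodeA: 1, nodeB: 1}
var v3 = increment(v1, "nodeC");   // {nodeA: 1, nodeC: 1}

descends(v2, v1);   // true  -- v2 supersedes v1
descends(v2, v3);   // false -- concurrent writes; the application must reconcile
descends(v3, v2);   // false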

configurable quorum

  • you can define how many replicas must respond for a read (R) or write (W) to succeed; with N=3 replicas, for example, setting R=2 and W=2 means every read overlaps the latest successful write

Amazon’s white paper on Dynamo

Riak

  • open source key-value store
  • developed by Basho
  • written in Erlang

Riak is based on Dynamo

Added features:

  • it links data together
  • data can be anything – text, images, video
  • can use a REST interface
  • curl – get/put/delete/post

–Riak demo–

link walking

  • links are followed by specifying bucket, tag, and keep

ad hoc read/write specification

  • you can specify w, r and n within a query

custom server side validation

  • pre-commit and post-commit hooks

plugins

  • indexing – index values are put into headers
  • search – the plugin builds an inverted index with pre-commit hooks
  • has an HTTP Solr interface

Summary of key-value dbs

  • distributed, replicated
  • high availability
  • schemaless
  • [something else on the slide that I missed]

Possible library applications

  • large inventory/repository backend

3. Document-oriented DBs
These dbs use a table of keys for lookup, like the key-value stores
data stored in json “documents”
allows for hierarchical data
schemaless, change data structure on the fly

open source options

  • mongodb
  • couchdb

couchdb

  • apache
  • Erlang

easy to use

  • big or small projects
  • easy to install
  • nice web UI (Futon)

–couch db demo–
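
My notes didn’t capture the demo data, but judging from the fields used in the view functions further down (user, albums, title, artist, tracks), the documents must have looked something like this; the values here are made up:

// A made-up document with the shape the demo views below expect.
{
  "_id": "a1b2c3d4",
  "user": "jdoe",
  "albums": [
    {
      "title": "Kind of Blue",
      "artist": "Miles Davis",
      "tracks": ["So What", "Freddie Freeloader", "Blue in Green"]
    },
    {
      "title": "A Love Supreme",
      "artist": "John Coltrane",
      "tracks": ["Acknowledgement", "Resolution"]
    }
  ]
}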

data is not overwritten, just added with a new revision
couchdb – mapreduce
views to query data

  • consist of map() and an optional reduce()
  • map takes the doc as its argument and emits key-value pairs via emit(key, value)
  • reduce has 3 arguments: keys, values, rereduce

mapreduce – resource intensive

  • don’t run ad hoc jobs on a production environment
  • save temporary views as design docs
  • couchdb stores results and watches for changes

queries from the demo:
function(doc) {
  // emit one row per album, keyed by album title
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      var key = album.title;
      var value = {by: album.artist, tracks: album.tracks};
      emit(key, value);
    });
  }
}

function(doc) {
  // emit (user, 1) once per track, so a reduce can count tracks per user
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      if ('tracks' in album) {
        album.tracks.forEach(function(track) {
          emit(doc.user, 1);
        });
      }
    });
  }
}
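
The reduce side of that second view isn’t in my notes, but a matching reduce would presumably just sum the 1s emitted per user, something like this (CouchDB’s built-in _sum reduce would also do the job):

function(keys, values, rereduce) {
  // sum the 1s emitted for each user; on a rereduce pass the values
  // are partial sums, so summing still gives the right answer
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}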

couchdb — stepping through the map reduce example
docs are passed to map
map emits key-value pairs from each doc
emitted rows are sorted by key
chunks of rows with the same key are passed to reduce
if a chunk is too big, the reduce output is re-reduced
repeat until there are no duplicate keys

couch vs mongo

  • couch can be small, mongo not
  • mongo shards, couch replicates
  • mongo enables ad hoc queries – couch requires views
  • couch is made for the web – REST is an afterthought in mongo

Summary

  • schemaless json stores
  • powerful map reduce
  • scalable

Possible library applications

  • multiple domain metadata repository

4. Graph dbs

interconnected data
nodes and edges (edges are the relationships)
both can have metadata
queries traverse data
good for networks, object-oriented data

open source

  • neo4j
  • orientdb
  • hypergraphdb

neo4j

  • built by Neo Tech
  • written in java

interacting with the db

  • gremlin/groovy command line console
  • cypher query language
  • rest api
  • web UI (includes gremlin)

adding data using gremlin
commands can be chained
out == outE.inV
in == inE.outV

Demo
* gremlin> g.v(7).outE.inV.title
* ==> Time Bandits
* ==> Twelve Monkeys
* ==> Jabberwocky
* ==> Monty Python and the Holy Grail
* ==> null

looping – social network, find the friends of a friend, loops out over and over again

rest interface

getting a path via REST – give it a starting and an ending node and it will give you the path between them?

gremlin via rest

indexing

  • custom indexes

summary

  • model anything
  • pretty big
  • ACID compliant
  • does not scale well

Possible library applications

  • no suggestions offered

Talk summary

  • nosql dbs are fun to learn
  • new capabilities
  • trade-offs
  • don’t abandon your RDBMS just yet

ALA 2012: FRBR Presentation Four

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

Presentation 4 of 4

“FRBR and XC: Participatory Design”
Jennifer Bowen

No slides available for this presentation.

Bowen began her presentation with a brief introduction to the eXtensible Catalog (XC).

She then noted that user studies were built into the development of XC. The participatory design included observations of users working, surveys, and interviews. They asked users what they wanted.

The findings of that research were not FRBR-specific but what users wanted basically matched the FRBR model:

  • Users have preferred material and format types
  • Users want to know why items are on a result list
  • Users want to choose between versions of a resource and see the relationships between resources

For ever-changing future needs, XC has a customizable user interface. Browsing of a collection of resources can be customized based on some common attribute or relationship within the collection.

Finally, Bowen concluded that their research showed that aspects of FRBR do address what users need to do.

Related links:
The results of the research are available in the book Scholarly Practice, Participatory Design and the eXtensible Catalog

ALA 2012: FRBR Presentation Three

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

Presentation 3 of 4

“Research, Development and Evaluation of a FRBR-based Catalog Prototype”
Yin Zhang & Athena Salaba

Presentation Slides

Presentation outline:

  • background of the project
  • research and development of the project
  • user evaluation of the project
  • conclusion/next steps

Background

Zhang began the presentation by discussing the background of the project. While FRBR has the potential to help libraries develop better and more effective catalogs and discovery tools, there is not much in the way of guidance on how to implement it. User studies are still few and far between. KSU received IMLS funding to develop and research FRBR-based systems. As part of that project, KSU conducted a series of user studies.

Methodology

1. Run user evaluation studies on FRBR-based catalogs already in existence
2. Put together a FRBR-ized data set
3. Develop an initial set of displays
4. User feedback on the developed prototypes

Step One

The first step was to evaluate existing FRBR-based catalogs for user experiences and support for the FRBR tasks. They looked at three: OCLC WorldCat.org, FictionFinder, and Libraries Australia. The results of the evaluation served as the basis for their own FRBR prototype catalog.

Step Two

The next step was to extract Library of Congress bib records and authority records from WorldCat. They used OCLC’s Workset algorithm to identify works, but applied their own algorithm to identify expressions and manifestations. The results of this were used to develop FRBR-based displays.

Step Three

In the third step, they developed the layouts for the FRBR-based displays based on:

  • works from an author search
  • works from a subject search
  • works from a title search
  • expressions from a language/form search
  • manifestation (slide 7)

Step Four

Finally, they sought user feedback on the interface design. The study participants were interviewed using printed display layouts as prompts and asked about data elements and functions. The feedback was incorporated into the final prototype catalog programming.

Here I have appended a screenshot of the prototype catalog search results taken from presentation slide 10 and a screenshot taken of LC’s current catalog search results where I tried to run approximately the same search.

FRBR prototype catalog


Traditional catalog


Instead of the gazillion search results all strung out over many pages as seen in the traditional catalog (is this another record for the same thing that I already looked at three pages ago?), in the prototype, the records are gathered together under the author/title work sets and then by form and language. The resulting display seems cleaner and more compact, while still presenting plenty of information. It seems so obvious to me that catalogs should have always worked this way.

Study Design

Next Salaba discussed the study design for having users actually evaluate the FRBR prototype. They used a comparative approach: with the same set of records, they had users search using both the traditional catalog and the FRBR prototype catalog. The study group contained 34 participants and data was collected via observations, interviews, audio recordings and screen captures.

The participants were given two kinds of search strategies to pursue. The first set of searches was predefined and users were asked to evaluate the resulting displays. In the second set, participants were given criteria and allowed to use their own search strategies.

Findings

Overall, most users (85%) preferred the FRBR prototype for all of the searches they did. The table on slide 14 breaks down the findings into the categories of language or type of materials, author, title, title and publication information, entertainment, research, and a general topic. The biggest difference in searching the two catalogs was that the FRBR prototype allowed users to find expressions. Since the current catalog only provides access at the manifestation level and does not group by language or format, this cannot really be a surprise.

Features that the participants found “helpful”:

  • Grouping of results by work and expression (65%)
  • Refining results (24%)
  • Alphabetical order of results display (15%)
  • Interface appearance (24%)

Features that participants thought needed improvement:

  • More detail before manifestation level display (15%)
  • Prefer individual manifestation level results (9%)
  • Listing a resource under each language of a multi-language resource (3%)

88% of participants thought that clustering the resources by work/expression/manifestation made it easier to find things. 91% thought that the navigation made sense and was helpful in performing searches. One participant found the FRBR prototype less helpful for searching for a specific title, but helpful when searching for a specific topic.

Conclusions

Salaba noted the importance of user input into the design and implementation of FRBR-based catalogs. The study showed that users can successfully complete searching tasks using the FRBR-based catalog and that users do understand and can navigate the FRBR-based displays.

Finally, Salaba stated that more research is needed into other FRBR implementations, with more studies comparing those implementations. She noted that other issues include:

  • FRBRization algorithms
  • Existing MARC records
  • Attributes and relationships
  • FRBR-based catalogs that support user tasks
  • Displays

Additionally, it is unknown at this point how RDA and Linked Data will work into the whole equation.

Related links:

Article (2007): Critical Issues and Challenges Facing FRBR Research and Practice

Article (2007): From a Conceptual Model to Application and System Development

Poster (2007): User Research and Testing of FRBR Prototype Systems

Article (2009): User Interface for FRBR User Tasks in Online Catalogs

Article (2009): What is Next for Functional Requirements for Bibliographic Records? A Delphi Study

Book (2009): Implementing FRBR in Libraries: Key Issues and Future Directions

Presentation for the ALA 2010 Annual Conference: FRBRizing MARC Records Based on FRBR User Tasks

Presentation for ASIST 2010 Annual Conference: FRBR User Research and a User Study on Evaluating FRBR Based Catalogs

An abstract for a presentation at a panel discussion at ASIST 2010: FRBR Implementation and User Research

An abstract for a presentation at a panel discussion of FRBR at ASIST 2011: Developing FRBR-Based Library Catalogs for Users

ALA 2012: FRBR Presentation Two

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the second of four presentations given at this session.

“FRBR at OCLC”
Thomas Hickey

No slides found online for this presentation.

Hickey spoke about the use of FRBR at OCLC.

OCLC manages 275 million bibliographic records at the work, expression and manifestation levels. The bib records are already clustered by work level. OCLC has now started the process of clustering “content” which is roughly equivalent to expression and manifestation.

Clustering is done by creating normalized keys from a combination of the author and title. The advantage is that the process is straight-forward and efficient. The disadvantage is that the algorithm misses cataloging variations.
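
As a rough illustration only (this is not OCLC’s actual algorithm), building a normalized author/title key might look something like this:

// Toy normalized author/title key: lowercase, strip punctuation, collapse spaces.
// Not OCLC's algorithm -- just an illustration of the general idea.
function normalizedKey(author, title) {
  function normalize(s) {
    return s.toLowerCase()
            .replace(/[^a-z0-9 ]/g, "")   // drop punctuation
            .replace(/\s+/g, " ")
            .trim();
  }
  return normalize(author) + "/" + normalize(title);
}

normalizedKey("Twain, Mark, 1835-1910.", "Adventures of Huckleberry Finn /");
// -> "twain mark 18351910/adventures of huckleberry finn"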

OCLC is now working with the GLIMIR (Global Library Manifestation Identifier) project. This project will cluster records at the manifestation level and assign an identifier. The algorithms for this project go beyond the author title keys used for workset creation into even the note fields. [NOTE: Examples given in the code4lib article include: publishing information and pagination.]

Using the GLIMIR algorithms they have discovered the same manifestation hiding in different worksets. They tried pushing changes back up to the workset level but it didn’t work very well. [NOTE: the code4lib article gives several examples of ways the GLIMIR has improved their de-duplication efforts. Was it a computational/technical problem?] They are moving to Hadoop and HBase now [?to improve their ability to handle copious amounts of data?].

The goal is to pull together all of the keys, group them and then separate them into coherent work[?] clusters. One problem is the friend of a friend issue. This is used to cluster similar items, but if A links to B and B links to C, are A and C the same thing?

In sum:

  • the new algorithms are much more forgiving of variations in the records
  • the iterations can be controlled
  • the records are much easier to re-cluster (the processing takes hours rather than months)
  • the work cluster assignments can happen in real time

Worldcat contains:

  • 1.8 billion holdings
  • 275 million worksets
  • 20% non-singletons
  • 80% of holdings [have 42 per workset?]
  • Top 30 worksets — 3-10 thousand records
  • 30-100 holdings
  • largest group 3.3 million
  • 2.7 million keys
  • GLIMIR content set — 483 records

Music is problematic for clustering.

VIAF and FRBR
VIAF contains 1.5 million uniform title records
links to and from expressions
link to author from author/title

OCLC can also do clustering in multiple alphabets, using many, many cross-references.

ALA 2012: FRBR Presentation One

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the first of four presentations given at this session.

“FRBRizing Mark Twain”
Erik Mitchell & Carolyn McCallum

The presentation slides are available on Slideshare or view them in the embedded slideshow below.

Erik Mitchell and Carolyn McCallum discussed their project to apply the FRBR model to a group of records relating to Mark Twain. McCallum organized the data manually while Mitchell created a program to do it in an automated fashion. They then compared the results. This presentation covered:

  • Metadata issues that arose from applying FRBR
  • Issues in migration
  • Comparison of the automated technique to an expert’s manual analysis

Carolyn McCallum spoke first about the manual processing portion of the project.

For this project, they focused on the Group 1 entities (work, expression, manifestation and item). They extracted 848 records from the Z. Smith Reynolds Library catalog at Wake Forest University for publications that were either by Mark Twain or about him. Using Mark Twain ensured that the data set had enough complexity to reveal any problems. The expert cataloger then grouped the metadata into worksets using titles and the OCLC FRBR key.

In the cataloger’s assessment, there were 410 records that grouped into 147 total worksets (each one having 2 or more expressions). The other 420 records sorted out into worksets with only one expression each. The largest worksets were for Huckleberry Finn (26 records) and Tom Sawyer (14 records). The most useful metadata was title, author, and a combination of title and author.

A couple of problems were identified in the process: whole-to-part and expression-to-manifestation relationships were not expressed consistently across the records, and determining the boundaries between entities was difficult. The line where one work changes enough to become another expression or even a completely different work can be open to interpretation. McCallum suggested that the entity classification should be guided by the needs of the local collection.

Mitchell then spoke about the automated version of the processing.

Comparison keys made up of the OCLC FRBR keys (author & title) were again used to cluster records into worksets. The results were not as good as the manual expert process but were acceptable and comparable to OCLC’s results. To improve on this, they built a Python script to extract normalized FRBR keys from the MARC data and compared those keys, which did improve the results.

In conclusion, Mitchell noted that metadata quality is not so much a problem as the intellectual content: the complex relationships between the various works/expressions/manifestations are simply not described by the metadata. Both methods, manual and automated, are time- and resource-consuming. Finally, new data models, like Linked Data, “are changing our view of MARC metadata” (slide 21).

Question from the audience about problems [with the modeling process?]
Answer: Process could not deal well with multiple authors.

Other related links:
McCallum’s summary of their presentation (about halfway through the post).
A poster from the ASIS&T Annual Meeting in 2011

Moving to new Quarters…

Old quarters:

B105 – Manual file and two student workstations
B115 – Lateral files, table, supply cabinets
B117 – Microfilm cabinets
B114 – The front office workstation, copier, fridge and table
B118 – Shelving!
B109 – Three full workstations and four bookcases

New quarters:

M1006, view 1
M1006, view 2
M1010
 

Some of the furniture is still missing from the new place too… Anyone got a shoehorn?