Code4lib 2013 NoSQL Pre-Conference Session

These notes are just a tad out of date. I started working on them right away after the conference, but then never got back to them again. Unfortunately, I also no longer remember what goes in the gaps of my shorthand, so they get less explanatory and more list-y pretty quickly.

 

Pre-conference session at Code4lib2013

From the program:

Introduction to NoSQL Databases

1-470 Richard J. Daley Library, 9:00 am to 12:00 pm on Monday, February 11

Joshua Gomez, George Washington University, jngomez at gwu edu

Since Google published its paper on BigTable in 2006, alternatives to the traditional relational database model have been growing in both variety and popularity. These new databases (often referred to as NoSQL databases) excel at handling problems faced by modern information systems that the traditional relational model cannot. They are particularly popular among organizations tackling the so-called “Big Data” problems. However, there are always tradeoffs involved when making such dramatic changes. Understanding how these different kinds of databases are designed and what they can offer is essential to the decision making process. In this precon I will discuss some of the various types of new databases (key-value, columnar, document, graph) and walk through examples or exercises using some of their open source implementations like Riak, HBase, MongoDB or CouchDB, and Neo4j.

 Notes:

No slides available as of 6/18/13

Gomez began his presentation with the caution that he was not an expert and had, in fact, signed up to give the presentation in order to make himself learn about noSQL databases. Gold star for him; I don’t think I could do that.

 Outline of the presentation

What noSQL dbs are
How they are different from relational dbs
Review of ACID and CAP
The 4 types of noSQL dbs:
columnar db

  • bigtable & hbase
  • mapreduce & hadoop

key-value store db

  • dynamo
  • riak

document store

  • mongodb

graph db

Introduction

Are noSQL dbs an innovation or just a fad? Everyone seems to be doing it and each has their own version.

  • BigTable – Google
  • Dynamo – Amazon
  • Cassandra – Facebook
  • Voldemort – LinkedIn

NoSQL dbs are built for situations where relational dbs struggle
There are no joins; the data is not stored the way it is in a relational db
Schemaless data structures

Advantages:

  • high availability
  • massive horizontal scaling

Tradeoffs:

  • no guarantees
  • consistency is not perfect

ACID

  • atomicity – a transaction must either run completely or fail completely
  • consistency – db is always valid, no violation of constraints
  • isolation – separate transactions can’t interfere with each other
  • durability – changes are permanent, even when there is a system failure

CAP Theorem

distributed dbs can only provide for 2 out of 3 properties:

  • consistency
  • availability
  • partition tolerance

latency is the tradeoff against availability and consistency

4 types of noSQL dbs
columnar

  • a single very large table; column-based, two-dimensional
  • good for sparse data sets (lots of nulls) and aggregation
  • easy to add columns

key-value

  • fast distributed hash map
  • no good for complex queries or data aggregation

document

  • like a hash but allows for hierarchical data structures
  • combo of simple look-up and nested data, flexible

graph

  • records data about relationships
  • excellent for insight and discovery through node traversal

Key-value store dbs are the least complex databases
Graph databases are the most complex databases
90% of use cases don’t even come close to needing noSQL-level scale
Relational dbs are not obsolete; they can do most or all of the same things that noSQL does

So why would someone use noSQL?

  • might need a unitasker
  • might be cheaper
  • for the schema flexibility
  • to learn new things

Detailed look at the four types of databases

1. Columnar dbs

These dbs are column-based, not row-based
They are good for scans over large data sets
They allow for massive horizontal scaling

Implementations

  • BigTable
  • HBase
  • Cassandra
  • Hypertable

BigTable goals

  • wide applicability
  • scalability
  • high performance
  • high availability

Note that consistency is missing from the list!

BigTable data model

  • sparse
  • distributed
  • persistent
  • multi-dimensional

map (each cell) indexed by:

  • row key: string
  • column key: string
  • timestamp: 64 bit integer

Old data is not thrown away; a new timestamp is simply added
Values are uninterpreted arrays of bytes (strings)
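
To make that concrete, here is a rough sketch of the data model as a nested JavaScript object. This is not BigTable or HBase code, and the row key, column names, and values are all made up; the point is just that each cell is addressed by (row key, column key, timestamp).

// Not BigTable/HBase code -- a nested-object sketch of the data model.
// Each cell is indexed by (row key, column key, timestamp).
var table = {
  "com.example.www": {                                  // row key
    "contents:html": {                                  // column key (family:qualifier)
      1360540800000: "<html>...newer version...</html>",
      1360454400000: "<html>...older version...</html>" // old data kept, new stamps added
    },
    "anchor:example.org": {
      1360540800000: "Example link text"
    }
  }
};

// a read supplies all three parts of the index:
var cell = table["com.example.www"]["contents:html"][1360540800000];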

BigTable rows

  • rows are dynamically partitioned into row ranges (tablets)
  • short row ranges usually on one machine
  • manipulation of row names can ensure related keys end up on the same machine

BigTable columns

  • column keys grouped into families
  • few hundreds of families
  • the number of individual columns is unbounded (millions expected)
  • columns can be organized into locality groups to improve lookup efficiency

Bloom filters

  • used for fast lookups
  • array holds single bit values
  • hash function maps inputs to set of cells in the array
  • allows for quick negative response
  • can give false positives, but no false negatives
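
A toy version of the idea in JavaScript (this is not HBase’s implementation, just a sketch of how the bit array and hash functions fit together):

// Toy Bloom filter: a bit array plus a few hash functions.
// "Might contain" can be a false positive, "definitely not present" is always
// correct -- which is what makes the quick negative response possible.
function BloomFilter(size, numHashes) {
  this.bits = new Array(size).fill(false);
  this.size = size;
  this.numHashes = numHashes;
}

// Simple seeded string hash (djb2 variant) -- just for illustration.
BloomFilter.prototype.hash = function (value, seed) {
  var h = 5381 + seed;
  for (var i = 0; i < value.length; i++) {
    h = ((h * 33) + value.charCodeAt(i)) >>> 0;
  }
  return h % this.size;
};

BloomFilter.prototype.add = function (value) {
  for (var k = 0; k < this.numHashes; k++) {
    this.bits[this.hash(value, k)] = true;    // set every cell the hashes map to
  }
};

BloomFilter.prototype.mightContain = function (value) {
  for (var k = 0; k < this.numHashes; k++) {
    if (!this.bits[this.hash(value, k)]) {
      return false;                           // any unset bit = definite miss
    }
  }
  return true;                                // all bits set = maybe present
};

var filter = new BloomFilter(1024, 3);
filter.add("row-00042");
filter.mightContain("row-00042");  // true
filter.mightContain("row-99999");  // almost certainly false -> skip the lookup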

Close grouping of similar values enables very high compression (10:1)
Doesn’t keep saving the same identical data over and over again, just points back to the existing file

Distributed processing
The traditional approach uses a single expensive powerful computer, but this doesn’t scale well.
The divide and conquer approach leverages lots of commodity hardware to handle data in smaller chunks processed in parallel.
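
The classic illustration of the divide-and-conquer approach is a word count. The sketch below is plain JavaScript rather than real Hadoop code: the map step runs independently over each chunk (on a real cluster those runs happen in parallel on different machines), and the reduce step merges the partial results.

// Divide-and-conquer sketch (not Hadoop code): map over chunks, then reduce.
var chunks = [
  "to be or not to be",
  "that is the question"
];

// map: each chunk independently produces (word, 1) pairs
function mapChunk(chunk) {
  return chunk.split(" ").map(function (word) {
    return [word, 1];
  });
}

// reduce: merge all the partial counts by key
function reduceCounts(pairsList) {
  var totals = {};
  pairsList.forEach(function (pairs) {
    pairs.forEach(function (pair) {
      totals[pair[0]] = (totals[pair[0]] || 0) + pair[1];
    });
  });
  return totals;
}

var wordCounts = reduceCounts(chunks.map(mapChunk));
// { to: 2, be: 2, or: 1, not: 1, that: 1, is: 1, the: 1, question: 1 }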

HBase (hadoop)

  • apache
  • open source bigtable
  • built in Java
  • the shell is a JRuby interpreter
  • other interfaces: Jython, Groovy, Scala, REST, ???

HBase CRUD

–quick demo, didn’t work—

Operations can be run from the command line or from a REST interface (REST interface requires values in base64)
Intended for big jobs
Strong scaling capabilities
Scans over large sets fast
Complex ancillary systems – high learning curve
Documentation not the best

HBase possible library applications:

  • full text indexing of an entire collection — probably wouldn’t make it work very hard
  • web archiving
  • cms backend? — probably not, only for masochists

2. Key value stores

Very simple data structure – one big table
Fast but can’t do complex queries

Implementations:

  • Dynamo

Open source implementations

  • redis
  • riak
  • memcached
  • project voldemort

Dynamo (Amazon)
Its goal is to be reliable on a massive scale. This means:

  • high availability
  • low latency

The environment is an e-commerce platform, running hundreds of services (not just the Amazon store).
An RDBMS is not a good fit for this environment.
Most services retrieve by primary key, so there are no complex queries
An RDBMS, on the other hand, requires expensive hardware and expertise

Assumptions and Requirements
The query model:

  • read/write by key
  • no operations across multiple items (no relations)

ACID properties

  • weak consistency

efficient

  • built on commodity hardware
  • low latency

other

  • internal, non-hostile environment – no internal access controls needed
  • scale of hundreds of hosts

Why does it need to be so fast?
In an ecosystem built on service tiers, each level of services has to be even faster than the one above it, right down to the data stores.

Design considerations
The compromise is on consistency, with a goal of “eventual consistency”
Conflicts are resolved at read time so that the database is always writeable, resulting in high availability
Conflicts are resolved by the application; on its own, the db can only take the last change, so data could potentially be lost

Gomez used the example of Amazon’s shopping cart: accessing it from multiple computers at multiple times and making changes to it. There is the potential that something could get out of sync between the various computers.

Incremental scale, scale out one node at a time
Symmetry – all nodes are peers and equal, no master nodes
Decentralization – handles outages better
Heterogeneity – hosts not created equally, some hardware will be better than other hardware

System architecture
interface – has only two operations

  • get (key)
  • put (key, context, object)

dynamic data partitioning

  • consistent hashing (in a ring structure)
  • each node is assigned a position in a ring
  • keys are hashed to determine node
  • nodes are virtual and hashed to multiple points
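
A toy version of the ring (not Dynamo or Riak code; the node names and hash function are made up): nodes are hashed to several virtual positions, and a key lives on the first node clockwise from the key’s hash.

// Toy consistent-hashing ring.
function hashToRing(value) {
  var h = 0;
  for (var i = 0; i < value.length; i++) {
    h = ((h * 31) + value.charCodeAt(i)) >>> 0;
  }
  return h % 360;   // pretend the ring has 360 positions
}

var nodes = ["node-a", "node-b", "node-c"];
var ring = [];
nodes.forEach(function (node) {
  // each physical node gets several virtual positions on the ring
  for (var v = 0; v < 3; v++) {
    ring.push({ position: hashToRing(node + "#" + v), node: node });
  }
});
ring.sort(function (a, b) { return a.position - b.position; });

function nodeForKey(key) {
  var p = hashToRing(key);
  for (var i = 0; i < ring.length; i++) {
    if (ring[i].position >= p) { return ring[i].node; }   // first node clockwise
  }
  return ring[0].node;   // wrap around the ring
}

nodeForKey("user:1234");   // -> whichever node owns that slice of the ring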

replication

  • data replicated to N hosts
  • the coordinator node replicates to the N-1 successor nodes + others in the preference list

versioning

  • eventual consistency
  • vector clocks
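
Vector clocks are simple enough to sketch (again, an illustration, not Dynamo’s code): each version carries a per-node counter, one version supersedes another if all of its counters are at least as large, and otherwise the versions are concurrent and have to be reconciled.

// Toy vector clock sketch.
function increment(clock, node) {
  var next = Object.assign({}, clock);
  next[node] = (next[node] || 0) + 1;
  return next;
}

function descends(a, b) {
  // true if version a already includes everything in version b
  return Object.keys(b).every(function (node) {
    return (a[node] || 0) >= b[node];
  });
}

var v1 = increment({}, "nodeA");   // {nodeA: 1}
var v2 = increment(v1, "nodeB");   // {nodeA: 1, nodeB: 1}
var v3 = increment(v1, "nodeC");   // {nodeA: 1, nodeC: 1}

descends(v2, v1);   // true  -- v2 supersedes v1
descends(v2, v3);   // false -- concurrent writes; the application must reconcile
descends(v3, v2);   // false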

configurable quorum

  • you can define how many replicas must respond for a read (R) or write (W) to succeed; with N=3 replicas, for example, setting R=2 and W=2 means every read overlaps the latest successful write

Amazon’s white paper on Dynamo

Riak

  • open source key-value store
  • developed by Basho
  • written in Erlang

Riak is based on Dynamo

Added features:

  • it links data together
  • data can be anything – text, images, video
  • can use a REST interface
  • curl – get/put/delete/post

–Riak demo–

link walking

  • links are followed by specifying bucket, tag, and keep

ad hoc read/write specification

  • you can specify w, r and n within a query

custom server side validation

  • pre-commit and post-commit hooks

plugins

  • indexing – index values are put into headers
  • search – the plugin builds an inverted index with pre-commit hooks
  • has an HTTP Solr interface

Summary of key-value dbs

  • distributed, replicated
  • high availability
  • schemaless
  • [something else on the slide that I missed]

Possible library applications

  • large inventory/repository backend

3. Document-oriented DBs
These dbs use a table of keys for lookup, like the key-value stores
data stored in json “documents”
allows for hierarchical data
schemaless, change data structure on the fly

open source options

  • mongodb
  • couchdb

couchdb

  • apache
  • Erlang

easy to use

  • big or small projects
  • easy to install
  • nice web UI (Futon)

–couch db demo–
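
My notes didn’t capture the demo data, but judging from the fields used in the view functions further down (user, albums, title, artist, tracks), the documents must have looked something like this; the values here are made up:

// A made-up document with the shape the demo views below expect.
{
  "_id": "a1b2c3d4",
  "user": "jdoe",
  "albums": [
    {
      "title": "Kind of Blue",
      "artist": "Miles Davis",
      "tracks": ["So What", "Freddie Freeloader", "Blue in Green"]
    },
    {
      "title": "A Love Supreme",
      "artist": "John Coltrane",
      "tracks": ["Acknowledgement", "Resolution"]
    }
  ]
}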

data is not overwritten, just added with a new revision
couchdb – mapreduce
views to query data

  • consist of map() and an optional reduce()
  • map takes the doc as its argument and emits key-value pairs via emit(key, value)
  • reduce has 3 arguments: keys, values, rereduce

mapreduce – resource intensive

  • don’t run ad hoc jobs on a production environment
  • save temporary views as design docs
  • couchdb stores results and watches for changes

queries from the demo:
function(doc) {
  // emit one row per album, keyed by album title
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      var key = album.title;
      var value = {by: album.artist, tracks: album.tracks};
      emit(key, value);
    });
  }
}

function(doc) {
  // emit (user, 1) once per track, so a reduce can count tracks per user
  if ('user' in doc && 'albums' in doc) {
    doc.albums.forEach(function(album) {
      if ('tracks' in album) {
        album.tracks.forEach(function(track) {
          emit(doc.user, 1);
        });
      }
    });
  }
}
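
The reduce side of that second view isn’t in my notes, but a matching reduce would presumably just sum the 1s emitted per user, something like this (CouchDB’s built-in _sum reduce would also do the job):

function(keys, values, rereduce) {
  // sum the 1s emitted for each user; on a rereduce pass the values
  // are partial sums, so summing still gives the right answer
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}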

couchdb — stepping through the map reduce example
docs are passed to map
map emits key-value pairs from each doc
emitted rows are sorted by key
chunks of rows with the same key are passed to reduce
if a chunk is too big, the reduce output is re-reduced
repeat until there are no duplicate keys

couch vs mongo

  • couch can be small, mongo not
  • mongo shards, couch replicates
  • mongo enables ad hoc queries – couch requires views
  • couch is made for the web – REST is an afterthought in mongo

Summary

  • schemaless json stores
  • powerful map reduce
  • scalable

Possible library applications

  • multiple domain metadata repository

4. Graph dbs

interconnected data
nodes and edges (edges are the relationships)
both can have metadata
queries traverse data
good for networks, object-oriented data

open source

  • neo4j
  • orientdb
  • hypergraphdb

neo4j

  • built by Neo Tech
  • written in java

interacting with the db

  • gremlin/groovy command line console
  • cypher query language
  • rest api
  • web UI (includes gremlin)

adding data using gremlin
commands can be chained
out == outE.inV
in == inE.outV

Demo
* gremlin> g.v(7).outE.inV.title
* ==> Time Bandits
* ==> Twelve Monkeys
* ==> Jabberwocky
* ==> Monty Python and the Holy Grail
* ==> null

looping – social network, find the friends of a friend, loops out over and over again

rest interface

getting a path via REST – give it a starting and an ending node and it will give you the path between them?

gremlin via rest

indexing

  • custom indexes

summary

  • model anything
  • pretty big
  • ACID compliant
  • does not scale well

Possible library applications

  • no suggestions offered

Talk summary

  • nosql dbs are fun to learn
  • new capabilities
  • trade-offs
  • don’t abandon your RDBMS just yet

ALA 2012: FRBR Presentation Four

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

Presentation 4 of 4

“FRBR and XC: Participatory Design”
Jennifer Bowen

No slides available for this presentation.

Bowen began her presentation with a brief introduction to the eXtensible Catalog (XC).

She then noted that user studies were built into the development of XC. The participatory design included observations of users working, surveys, and interviews. They asked users what they wanted.

The findings of that research were not FRBR-specific but what users wanted basically matched the FRBR model:

  • Users have preferred material and format types
  • Users want to know why items are on a result list
  • Users want to choose between versions of a resource and see the relationships between resources

For ever-changing future needs, XC has a customizable user interface. Browsing of a collection of resources can be customized based on some common attribute or relationship within the collection.

Finally, Bowen concluded that their research showed that aspects of FRBR do address what users need to do.

Related links:
The results of the research are available in the book Scholarly Practice, Participatory Design and the eXtensible Catalog

ALA 2012: FRBR Presentation Three

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

Presentation 3 of 4

“Research, Development and Evaluation of a FRBR-based Catalog Prototype”
Yin Zhang & Athena Salaba

Presentation Slides

Presentation outline:

  • background of the project
  • research and development of the project
  • user evaluation of the project
  • conclusion/next steps

Background

Zhang began the presentation by discussing the background of the project. While FRBR has the potential to help libraries develop better and more effective catalogs and discovery tools, there is not much in the way of guidance on how to implement it. User studies are still few and far between. KSU received IMLS funding to develop and research FRBR-based systems. As part of that project, KSU conducted a series of user studies.

Methodology

1. Run user evaluation studies on FRBR-based catalogs already in existence
2. Put together a FRBR-ized data set
3. Develop an initial set of displays
4. User feedback on the developed prototypes

Step One

The first step was to evaluate existing FRBR-based catalogs for user experiences and support for the FRBR tasks. They looked at three: OCLC WorldCat.org, FictionFinder, and Libraries Australia. The results of the evaluation served as the basis for their own FRBR prototype catalog.

Step Two

The next step was to extract Library of Congress bib records and authority records from WorldCat. They used OCLC’s Workset algorithm to identify works, but applied their own algorithm to identify expressions and manifestations. The results of this were used to develop FRBR-based displays.

Step Three

In the third step, they developed the layouts for the FRBR-based displays based on:

  • works from an author search
  • works from a subject search
  • works from a title search
  • expressions from a language/form search
  • manifestation (slide 7)

Step Four

Finally, they sought user feedback on the interface design. The study participants were interviewed using printed display layouts as prompts and asked about data elements and functions. The feedback was incorporated into the final prototype catalog programming.

Here I have appended a screenshot of the prototype catalog search results taken from presentation slide 10 and a screenshot taken of LC’s current catalog search results where I tried to run approximately the same search.

FRBR prototype catalog


Traditional catalog


Instead of the gazillion search results all strung out over many pages as seen in the traditional catalog (is this another record for the same thing that I already looked at three pages ago?), in the prototype, the records are gathered together under the author/title work sets and then by form and language. The resulting display seems cleaner and more compact, while still presenting plenty of information. It seems so obvious to me that catalogs should have always worked this way.

Study Design

Next Salaba discussed the study design for having users actually evaluate the FRBR prototype. They used a comparative approach: with the same set of records, they had users search using both the traditional catalog and the FRBR prototype catalog. The study group contained 34 participants and data was collected via observations, interviews, audio recordings and screen captures.

The participants were given two kinds of search strategies to pursue. The first set of searches was predefined and users were asked to evaluate the resulting displays. In the second set, participants were given criteria and allowed to use their own search strategies.

Findings

Overall, most users (85%) preferred the FRBR prototype for all of the searches they did. The table on slide 14 breaks down the findings into the categories of language or type of materials, author, title, title and publication information, entertainment, research, and a general topic. The biggest difference in searching the two catalogs was that the FRBR prototype allowed users to find expressions. Since the current catalog only provides access at the manifestation level and does not group by language or format, this cannot really be a surprise.

Features that the participants found “helpful”:

  • Grouping of results by work and expression (65%)
  • Refining results (24%)
  • Alphabetical order of results display (15%)
  • Interface appearance (24%)

Features that participants thought needed improvement:

  • More detail before manifestation level display (15%)
  • Prefer individual manifestation level results (9%)
  • Listing a resource under each language of a multi-language resource (3%)

88% of participants thought that clustering the resources by work/expression/manifestation made it easier to find things. 91% thought that the navigation made sense and was helpful in performing searches. One participant found the FRBR prototype less helpful for searching for a specific title, but helpful when searching for a specific topic.

Conclusions

Salaba noted the importance of user input into the design and implementation of FRBR-based catalogs. The study showed that users can successfully complete searching tasks using the FRBR-based catalog and that users do understand and can navigate the FRBR-based displays.

Finally, Salaba stated that more research is needed into other FRBR implementations, with more studies comparing those implementations. She noted that other issues include:

  • FRBRization algorithms
  • Existing MARC records
  • Attributes and relationships
  • FRBR-based catalogs that support user tasks
  • Displays

Additionally, it is unknown at this point how RDA and Linked Data will work into the whole equation.

Related links:

Article (2007): Critical Issues and Challenges Facing FRBR Research and Practice

Article (2007): From a Conceptual Model to Application and System Development

Poster (2007): User Research and Testing of FRBR Prototype Systems

Article (2009): User Interface for FRBR User Tasks in Online Catalogs

Article (2009): What is Next for Functional Requirements for Bibliographic Records? A Delphi Study

Book (2009): Implementing FRBR in Libraries: Key Issues and Future Directions

Presentation for the ALA 2010 Annual Conference: FRBRizing MARC Records Based on FRBR User Tasks

Presentation for ASIST 2010 Annual Conference: FRBR User Research and a User Study on Evaluating FRBR Based Catalogs

An abstract for a presentation at a panel discussion at ASIST 2010: FRBR Implementation and User Research

An abstract for a presentation at a panel discussion of FRBR at ASIST 2011: Developing FRBR-Based Library Catalogs for Users

ALA 2012: FRBR Presentation Two

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the second of four presentations given at this session.

“FRBR at OCLC”
Thomas Hickey

No slides found online for this presentation.

Hickey spoke about the use of FRBR at OCLC.

OCLC manages 275 million bibliographic records at the work, expression and manifestation levels. The bib records are already clustered by work level. OCLC has now started the process of clustering “content” which is roughly equivalent to expression and manifestation.

Clustering is done by creating normalized keys from a combination of the author and title. The advantage is that the process is straight-forward and efficient. The disadvantage is that the algorithm misses cataloging variations.
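
As a rough illustration only (this is not OCLC’s actual algorithm), building a normalized author/title key might look something like this:

// Toy normalized author/title key: lowercase, strip punctuation, collapse spaces.
// Not OCLC's algorithm -- just an illustration of the general idea.
function normalizedKey(author, title) {
  function normalize(s) {
    return s.toLowerCase()
            .replace(/[^a-z0-9 ]/g, "")   // drop punctuation
            .replace(/\s+/g, " ")
            .trim();
  }
  return normalize(author) + "/" + normalize(title);
}

normalizedKey("Twain, Mark, 1835-1910.", "Adventures of Huckleberry Finn /");
// -> "twain mark 18351910/adventures of huckleberry finn"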

OCLC is now working with the GLIMIR (Global Library Manifestation Identifier) project. This project will cluster records at the manifestation level and assign an identifier. The algorithms for this project go beyond the author title keys used for workset creation into even the note fields. [NOTE: Examples given in the code4lib article include: publishing information and pagination.]

Using the GLIMIR algorithms they have discovered the same manifestation hiding in different worksets. They tried pushing changes back up to the workset level but it didn’t work very well. [NOTE: the code4lib article gives several examples of ways the GLIMIR has improved their de-duplication efforts. Was it a computational/technical problem?] They are moving to Hadoop and HBase now [?to improve their ability to handle copious amounts of data?].

The goal is to pull together all of the keys, group them and then separate them into coherent work[?] clusters. One problem is the friend of a friend issue. This is used to cluster similar items, but if A links to B and B links to C, are A and C the same thing?

In sum:

  • the new algorithms are much more forgiving of variations in the records
  • the iterations can be controlled
  • the records are much easier to re-cluster (the processing takes hours rather than months)
  • the work cluster assignments can happen in real time

Worldcat contains:

  • 1.8 billion holdings
  • 275 million worksets
  • 20% non-singletons
  • 80% of holdings [have 42 per workset?]
  • Top 30 worksets — 3-10 thousand records
  • 30-100 holdings
  • largest group 3.3 million
  • 2.7 million keys
  • GLIMIR content set — 483 records

Music is problematic for clustering.

VIAF and FRBR
VIAF contains 1.5 million uniform title records
links to and from expressions
link to author from author/title

OCLC can also do clustering in multiple alphabets, using many, many cross-references.

ALA 2012: FRBR Presentation One

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the first of four presentations given at this session.

“FRBRizing Mark Twain”
Erik Mitchell & Carolyn McCallum

The presentation slides are available on Slideshare or view them in the embedded slideshow below.

Erik Mitchell and Carolyn McCallum discussed their project to apply the FRBR model to a group of records relating to Mark Twain. McCallum organized the data manually while Mitchell created a program to do it in an automated fashion. They then compared the results. This presentation covered:

  • Metadata issues that arose from applying FRBR
  • Issues in migration
  • Comparison of the automated technique to an expert’s manual analysis

Carolyn McCallum spoke first about the manual processing portion of the project.

For this project, they focused on the Group 1 entities (work, expression, manifestation and item). They extracted 848 records from the Z. Smith Reynolds Library catalog at Wake Forest University for publications that were either by Mark Twain or about him. Using Mark Twain ensured that the data set had enough complexity to reveal any problems. The expert cataloger then grouped the metadata into worksets using titles and the OCLC FRBR key.

In the cataloger’s assessment, there were 410 records that grouped into 147 total worksets (each one having 2 or more expressions). The other 420 records sorted out into worksets with only one expression each. The largest worksets were for Huckleberry Finn (26 records) and Tom Sawyer (14 records). The most useful metadata was title, author, and a combination of title and author.

A couple of problems were identified in the process: whole-to-part and expression-to-manifestation relationships were not expressed consistently across the records, and determining the boundaries between entities was difficult. The line where one work changes enough to become another expression or even a completely different work can be open to interpretation. McCallum suggested that the entity classification should be guided by the needs of the local collection.

Mitchell then spoke about the automated version of the processing.

Comparison keys made up of the OCLC FRBR keys (author & title) were again used to cluster records into worksets. The results were not as good as the manual expert process but were acceptable and comparable to OCLC’s results. To improve on this, they built a Python script to extract normalized FRBR keys from the MARC data and compared those keys, which did improve the results.

In conclusion, Mitchell noted that metadata quality is not so much a problem as the intellectual content: the complex relationships between the various works/expressions/manifestations are simply not described by the metadata. Both methods, manual and automated, are time- and resource-consuming. Finally, new data models, like Linked Data, “are changing our view of MARC metadata” (slide 21).

Question from the audience about problems [with the modeling process?]
Answer: Process could not deal well with multiple authors.

Other related links:
McCallum’s summary of their presentation (about halfway through the post).
A poster from the ASIS&T Annual Meeting in 2011

Moving to new Quarters…

Old quarters:

B105 – Manual file and two student workstations
B115 – Lateral files, table, supply cabinets
B117 – Microfilm cabinets
B114 – The front office workstation, copier, fridge and table
B118 – Shelving!
B109 – Three full workstations and four bookcases

New quarters:

M1006, view 1
M1006, view 2
M1010
 

Some of the furniture is still missing from the new place too… Anyone got a shoehorn?