For the last month or two the Digg engineering team has spent quite a bit of time looking into, playing with and finally deploying Cassandra in production. It’s been a super fun project to take on – but even before the fun began we had to spend quite a bit of time figuring out Cassandra’s data model… the phrase “WTF is a ‘super column’” was uttered quite a few times. :)

If you’re coming from an RDBMS background (which is almost everyone) you’ll probably trip over some of the naming conventions while learning about Cassandra’s data model. It took me and my team members at Digg a couple days of talking things out before we “got it”. In recent weeks a bikeshed went down in the dev mailing list proposing a completely new naming scheme to alleviate some of the confusion. Throughout this discussion I kept thinking: “maybe if there were some decent examples out there people wouldn’t get so confused by the naming.” So, this is my stab at explaining Cassandra’s data model; It’s intended to help you get your feet wet & doesn’t go into every single detail but, hopefully, it helps clarify a few things.

BTW: this is long. If you’d rather have a PDF version of this you can download it here.

The Pieces

Let’s first go thru the building blocks before we see how they can all be stuck together:

Column

The column is the lowest/smallest increment of data. It’s a tuple (triplet) that contains a name, a value and a timestamp.

Here’s a column represented in JSON-ish notation:

{  // this is a column
    name: "emailAddress",
    value: "arin@example.com",
    timestamp: 123456789
}

That’s all it is. For simplicity sake let’s ignore the timestamp. Just think of it as a name/value pair.

Also, it’s worth noting is that the name and value are both binary (technically byte[]) and can be of any length.

SuperColumn

A SuperColumn is a tuple w/ a binary name & a value which is a map containing an unbounded number of Columns – keyed by the Column‘s name. Keeping with the JSON-ish notation we get:

{   // this is a SuperColumn
    name: "homeAddress",
    // with an infinite list of Columns
    value: {
        // note the keys is the name of the Column
        street: {name: "street", value: "1234 x street", timestamp: 123456789},
        city: {name: "city", value: "san francisco", timestamp: 123456789},
        zip: {name: "zip", value: "94107", timestamp: 123456789},
    }
}

Column vs SuperColumn

Columns and SuperColumns are both a tuples w/ a name & value. The key difference is that a standard Column‘s value is a “string” and in a SuperColumn the value is a Map of Columns. That’s the main difference… their values contain different types of data. Another minor difference is that SuperColumn‘s don’t have a timestamp component to them.

Before We Get Rolling

Before I move on I wanna simplify our notation a couple ways: 1) ditch the timestamps from Columns & 2) pull the Columns’ & SuperColumns’ names component out so that it looks like a key/value pair. So we’re gonna go from:

{ // this is a super column
    name: "homeAddress",
    // with an infinite list of columns
    value: {
        street: {name: "street", value: "1234 x street", timestamp: 123456789},
        city: {name: "city", value: "san francisco", timestamp: 123456789},
        zip: {name: "zip", value: "94107", timestamp: 123456789},
    }
}

to

homeAddress: {
    street: "1234 x street",
    city: "san francisco",
    zip: "94107",
}

Grouping ‘Em

There’s a single structure used to group both the Columns and SuperColumns…this structure is called a ColumnFamily and comes in 2 varieties Standard & Super.

ColumnFamily

A ColumnFamily is a structure that contains an infinite number of Rows. Huh, did you say Rows? Ya – rows :) To make it sit easier in your head just think of it as a table in an RDBMS.

OK – each Row has a client supplied (that means you) key & contains a map of Columns. Again, the keys in the map are the names of the Columns and the values are the Columns themselves:

UserProfile = { // this is a ColumnFamily
    phatduckk: {   // this is the key to this Row inside the CF
        // now we have an infinite # of columns in this row
        username: "phatduckk",
        email: "phatduckk@example.com",
        phone: "(900) 976-6666"
    }, // end row
    ieure: {   // this is the key to another row in the CF
        // now we have another infinite # of columns in this row
        username: "ieure",
        email: "ieure@example.com",
        phone: "(888) 555-1212"
        age: "66",
        gender: "undecided"
    },
}

Remember: for simplicity we're only showing the value of the Column but in reality the values in the
map are the entire Column.

You can think of it as a HashMap/dictionary or associative array. If you start thinking that way then you’re are the right track.

One thing I want to point out is that there’s no schema enforced at this level. The Rows do not have a predefined list of Columns that they contain. In our example above you see that the row with the key “ieure” has Columns with names “age” and “gender” whereas the row identified by the key “phatduckk” doesn’t. It’s 100% flexible: one Row may have 1,989 Columns whereas the other has 2. One Row may have a Column called “foo” whereas none of the rest do. This is the schemaless aspect of Cassandra.

A ColumnFamily Can Be Super Too

Now, a ColumnFamily can be of type Standard or Super.

What we just went over was an example of the Standard type. What makes it Standard is the fact that all the Rows contains a map of normal (aka not-Super) Columns… there’s no SuperColumns scattered about.

When a ColumnFamily is of type Super we have the opposite: each Row contains a map of SuperColumns. The map is keyed with the name of each SuperColumn and the value is the SuperColumn itself. And, just to be clear, since this ColumnFamily is of type Super, there are no Standard ColumnFamily‘s in there. Here’s an example:

AddressBook = { // this is a ColumnFamily of type Super
    phatduckk: {    // this is the key to this row inside the Super CF
        // the key here is the name of the owner of the address book

        // now we have an infinite # of super columns in this row
        // the keys inside the row are the names for the SuperColumns
        // each of these SuperColumns is an address book entry
        friend1: {street: "8th street", zip: "90210", city: "Beverley Hills", state: "CA"},

        // this is the address book entry for John in phatduckk's address book
        John: {street: "Howard street", zip: "94404", city: "FC", state: "CA"},
        Kim: {street: "X street", zip: "87876", city: "Balls", state: "VA"},
        Tod: {street: "Jerry street", zip: "54556", city: "Cartoon", state: "CO"},
        Bob: {street: "Q Blvd", zip: "24252", city: "Nowhere", state: "MN"},
        ...
        // we can have an infinite # of ScuperColumns (aka address book entries)
    }, // end row
    ieure: {     // this is the key to another row in the Super CF
        // all the address book entries for ieure
        joey: {street: "A ave", zip: "55485", city: "Hell", state: "NV"},
        William: {street: "Armpit Dr", zip: "93301", city: "Bakersfield", state: "CA"},
    },
}

Keyspace

A Keyspace is the outer most grouping of your data. All your ColumnFamily‘s go inside a Keyspace. Your Keyspace will probably named after your application.

Now, a Keyspace can have multiple ColumnFamily‘s but that doesn’t mean there’s an imposed relationship between them. For example: they’re not like tables in MySQL… you can’t join them. Also, just because ColumnFamily_1 has a Row with key “phatduckk” that doesn’t mean ColumnFamily_2 has one too.

Sorting

OK – we’ve gone through what the various data containers are about but another key component of the data model is how the data is sorted. Cassandra is not queryable like SQL – you do not specify how you want the data sorted when you’re fetching it (among other differences). The data is sorted as soon as you put it into the cluster and it always remains sorted! This is a tremendous performance boost for reads but in exchange for that benefit you’re going to have to make sure to plan your data model in a such a way that you’re able satisfy your access patterns.

Columns are always sorted within their Row by the Column‘s name. This is important so i’ll say it again: Columns are always sorted by their name! How the names are compared depends on the ColumnFamilys CompareWith option. Out of the box you have the following options: BytesType, UTF8Type, LexicalUUIDType, TimeUUIDType, AsciiType, and LongType. Each of these options treats the Columns’ name as a different data type giving you quite a bit of felxibility. For example: Using LongType will treat your Columns’ names as a 64bit Longs. Let’s try and clear this up by taking a look at some data before and after it’s sorted:


    // Here's a view of all the Columns from a particular Row in random order
    // Cassandra would "never" store data in random order. This is just an example
    // Also, ignore the values - they don't matter for sorting at all
    {name: 123, value: "hello there"},
    {name: 832416, value: "kjjkbcjkcbbd"},
    {name: 3, value: "101010101010"},
    {name: 976, value: "kjjkbcjkcbbd"}

So, given the fact that we’re using the LongType option, these Columns will look like this when they’re sorted:


    <!--
    ColumnFamily definition from storage-conf.xml
    -->
    <ColumnFamily CompareWith="LongType" Name="CF_NAME_HERE"/>

    // See, each Column's name is treated as a 64bit long
    // in effect, numerically ordering our Columns' by name
    {name: 3, value: "101010101010"},
    {name: 123, value: "hello there"},
    {name: 976, value: "kjjkbcjkcbbd"},
    {name: 832416, value: "kjjkbcjkcbbd"}

As you can see the Columns’ names were compared as if they were 64bit Longs (aka: numbers that can get pretty big). Now, if we’d used another CompareWith option we’d end up with a different result. If we’d set CompareWith to UTF8Type our sorted Columns’ names would be treated as a UTF8 encoded strings yielding a sort order like this:


    <!--
    ColumnFamily definition from storage-conf.xml
    -->
    <ColumnFamily CompareWith="UTF8Type" Name="CF_NAME_HERE"/>

    // Each Column name is treated as a UTF8 string
    {name: 123, value: "hello there"},
    {name: 3, value: "101010101010"},
    {name: 832416, value: "kjjkbcjkcbbd"},
    {name: 976, value: "kjjkbcjkcbbd"}

The result is completely different!

This sorting principle applies to SuperColumns as well but we get an extra dimension to deal with: not only do we determine how the SuperColumns are sorted in a Row but we also determine how the Columns within each SuperColumn are sorted. The sort of the Columns within each SuperColumn is determined by the value of CompareSubcolumnsWith. Here’s an example:


    // Here's a view of a Row that has 2 SuperColumns in it.
    // currently they're in some random order

    { // first SuperColumn from a Row
        name: "workAddress",
        // and the columns within it
        value: {
            street: {name: "street", value: "1234 x street"},
            city: {name: "city", value: "san francisco"},
            zip: {name: "zip", value: "94107"}
        }
    },
    { // another SuperColumn from same Row
        name: "homeAddress",
        // and the columns within it
        value: {
            street: {name: "street", value: "1234 x street"},
            city: {name: "city", value: "san francisco"},
            zip: {name: "zip", value: "94107"}
        }
    }

Now if we decided to set both CompareSubcolumnsWith & CompareWith to UTF8Type we’d have the following end result:


    // Now they're sorted

    {
        // this one's first b/c when treated as UTF8 strings
        { // another SuperColumn from same Row

            // This Row comes first b/c "homeAddress" is before "workAddress"
            name: "homeAddress",

            // the columns within this SC are also sorted by their names too
            value: {
                // see, these are sorted by Column name too
                city: {name: "city", value: "san francisco"},
                street: {name: "street", value: "1234 x street"},
                zip: {name: "zip", value: "94107"}
            }
        },
        name: "workAddress",
        value: {
            // the columns within this SC are also sorted by their names too
            city: {name: "city", value: "san francisco"},
            street: {name: "street", value: "1234 x street"},
            zip: {name: "zip", value: "94107"}
        }
    }

I want to note that in the last example CompareSubcolumnsWith & CompareWith were set to UTF8Type but this doesn’t have to be the case. You can mix and match the values of CompareSubcolumnsWith & CompareWith as necessary.

The last bit about sorting I want to mention is that you can write a custom class to perform the sorting. The sorting mechanism is pluggable… you can set CompareSubcolumnsWith and/or CompareWith to any fully-qualified class name as long as that class implements org.apache.cassandra.db.marshal.IType (aka you can write custom comparators).

Example Schema

Alrighty – Now we’ve got all the pieces of the puzzle so let’s finally put ‘em all together and model a simple blog application. We’re going to model a simple app with the following specs:

  • support a single blog
  • we can have multiple authors
  • entries contain title, body, slug & publish date
  • entries can be associated with any # of tags
  • people can leave comments but cant register: they enter profile info each time (just keeping it simple)
  • comments have text, time submitted, commenter’s name & commenter’s name
  • must be able to show all posts in reverse chronological order (newest first)
  • must be able to show all posts within a given tag in reverse chronological order

Each of the following sections will describe a ColumnFamily that we’re going to define in our app’s Keyspace, show the xml definition, talk about why we picked the particular sort option(s) as well as display the data in the ColumnFamily w/ our JSON-ish notation.

Authors ColumnFamily

Modeling the Authors ColumnFamily is going to be pretty basic; we’re not going to do anything fancy here. We’re going to give each Author their own Row & key it by the Author‘s full name. Inside the Rows each Column is going to represent a single “profile” attribute for the Author.

This is an example of using each Row to represent an object… in this case an Author object. With this approach each Column will serve as an attribute. Super simple. I want to point out that since there’s no “definition” of what Columns must be present within a Row we kinda sorta have a schemaless design.

We’ll be accessing the Rows in this ColumnFamily via key lookup & will grab every Column with each get (ex: we won’t ever be fetching the first 3 columns from the Row with key ‘foo’). This means that we don’t care how the Columns are sorted so we’ll use BytesType sort options because it doesn’t require any validation of the Columns’ names.


<!--
    ColumnFamily: Authors
    We'll store all the author data here.

    Row Key => Author's name (implies names must be unique)
    Column Name: an attribute for the entry (title, body, etc)
    Column Value: value of the associated attribute

    Access: get author by name (aka grab all columns from a specific Row)

    Authors : { // CF
        Arin Sarkissian : { // row key
            // and the columns as "profile" attributes
            numPosts: 11,
            twitter: phatduckk,
            email: arin@example.com,
            bio: "bla bla bla"
        },
        // and the other authors
        Author 2 {
            ...
        }
    }
-->
<ColumnFamily CompareWith="BytesType" Name="Authors"/>

BlogEntries ColumnFamily

Again, this ColumnFamily is going to act as a simple key/value lookup. We’ll be storing 1 entry per Row. Within that Row the Columns will just serve as attributes of the entry: title, body, etc (just like the previous example). As a small optimization we’ll denormalize the tags into a Column as a comma separated string. Upon display we’ll just split that Column‘s value to get a list of tags.

The key to each Row will be the entries slug. So whenever we want to grab a single entry we can simply look it up by its key (slug).


<!--
    ColumnFamily: BlogEntries
    This is where all the blog entries will go:

    Row Key +> post's slug (the seo friendly portion of the uri)
    Column Name: an attribute for the entry (title, body, etc)
    Column Value: value of the associated attribute

    Access: grab an entry by slug (always fetch all Columns for Row)

    fyi: tags is a denormalization... its a comma separated list of tags.
    im not using json in order to not interfere with our
    notation but obviously you could use anything as long as your app
    knows how to deal w/ it

    BlogEntries : { // CF
        i-got-a-new-guitar : { // row key - the unique "slug" of the entry.
            title: This is a blog entry about my new, awesome guitar,
            body: this is a cool entry. etc etc yada yada
            author: Arin Sarkissian  // a row key into the Authors CF
            tags: life,guitar,music  // comma sep list of tags (basic denormalization)
            pubDate: 1250558004      // unixtime for publish date
            slug: i-got-a-new-guitar
        },
        // all other entries
        another-cool-guitar : {
            ...
            tags: guitar,
            slug: another-cool-guitar
        },
        scream-is-the-best-movie-ever : {
            ...
            tags: movie,horror,
            slug: scream-is-the-best-movie-ever
        }
    }
-->
<ColumnFamily CompareWith="BytesType" Name="BlogEntries"/>

TaggedPosts ColumnFamily

Alright – here’s where things get a bit interesting. This ColumnFamily is going to do some heavy lifting for us. It’s going to be responsible for keeping our tag/entry associations. Not only is it going to store the associations but it’s going to allow us to fetch all BlogEntrys for a certain tag in pre-sorted order (remember all that sorting jazz we went thru?).

A design point I want to point out is that we’re going have our app logic tag every BlogEntry with the tag “__notag__” (a tag I just made up). Tagging every BlogEntry with “__notag__” will allow us to use this ColumnFamily to also store a list of all BlogEntrys in pre-sorted order. We’re kinda cheating but it allows us to use a single ColumnFamily to serve “show me all recent posts” and “show me all recent posts tagged ‘foo’”.

Given this data model if an entry has 3 tags it will have a corresponding Column in 4 Rows… 1 for each tag and one for the “__notag__” tag.

Since we’re going to want to display lists of entries in chronological order we’ll make sure each Columns name is a time UUID and set the ColumnFamilys CompareWith to TimeUUIDType. This will sort the Columns by time satisfying our “chronological order” requirement :) So doing stuff like “get the latest 10 entries tagged ‘foo’” is going to be a super efficient operation.

Now when we want display the 10 most recent entries (on the front page, for example) we would:

  1. grab the last 10 Columns in the Row w/ key “__notag__” (our “all posts” tag)
  2. loop thru that set of Columns
  3. while looping, we know the value of each Column is the key to a Row in the BlogEntries ColumnFamily
  4. so we go ahead and use that to grab the Row for this entry from the BlogEntries ColumNFamily. this gives us all the data for this entry
  5. one of the Columns from the BlogEntries Row we just grabbed is named “author” and the value is the key into the Authors ColumnFamily we need to use to grab that author’s profile data.
  6. at this point we’ve got the entry data and the author data on hand
  7. next we’ll split the “tags” Columns value to get a list tags
  8. now we have everything we need to display this post (no comments yet – this aint the permalink page)

We can go through the same procedure above using any tag… so it works for “all entries” and “entries tagged ‘foo’”. Kinda nice.


<!--
    ColumnFamily: TaggedPosts
    A secondary index to determine which BlogEntries are associated with a tag

    Row Key => tag
    Column Names: a TimeUUIDType
    Column Value: row key into BlogEntries CF

    Access: get a slice of entries tagged 'foo'

    We're gonna use this CF to determine which blog entries to show for a tag page.
    We'll be a bit ghetto and use the string __notag__ to mean
    "don't restrict by tag". Each entry will get a column in here...
     this means we'll have to have #tags + 1 columns for each post.

    TaggedPosts : { // CF
        // blog entries tagged "guitar"
        guitar : {  // Row key is the tag name
            // column names are TimeUUIDType, value is the row key into BlogEntries
            timeuuid_1 : i-got-a-new-guitar,
            timeuuid_2 : another-cool-guitar,
        },
        // here's all blog entries
        __notag__ : {
            timeuuid_1b : i-got-a-new-guitar,

            // notice this is in the guitar Row as well
            timeuuid_2b : another-cool-guitar,

            // and this is in the movie Row as well
            timeuuid_2b : scream-is-the-best-movie-ever,
        },
        // blog entries tagged "movie"
        movie: {
            timeuuid_1c: scream-is-the-best-movie-ever
        }
    }
-->
<ColumnFamily CompareWith="TimeUUIDType" Name="TaggedPosts"/>

Comments ColumnFamily

The last thing we need to do is figure out how to model the comments. Here we’ll get to bust out some SuperColumns.

We’ll have 1 Row per entry. The key to the Row will be the entries slug. Within each Row we’ll have a SuperColumn for each comment. The name of the SuperColumns will be a UUID that we’ll be applying the TimeUUIDType to. This will ensure that all our comments for an entry are sorted in chronological order. The Columns within each SuperColumn will be the various attributes of the comment (commenter’s name, comment time etc).

So, this is pretty simple as well… nothing fancy.


<!--
    ColumnFamily: Comments
    We store all comments here

    Row key => row key of the BlogEntry
    SuperColumn name: TimeUUIDType

    Access: get all comments for an entry

    Comments : {
        // comments for scream-is-the-best-movie-ever
        scream-is-the-best-movie-ever : { // row key = row key of BlogEntry
            // oldest comment first
            timeuuid_1 : { // SC Name
                // all Columns in the SC are attribute of the comment
                commenter: Joe Blow,
                email: joeb@example.com,
                comment: you're a dumb douche, the godfather is the best movie ever
                commentTime: 1250438004
            },

            ... more comments for scream-is-the-best-movie-ever

            // newest comment last
            timeuuid_2 : {
                commenter: Some Dude,
                email: sd@example.com,
                comment: be nice Joe Blow this isnt youtube
                commentTime: 1250557004
            },
        },

        // comments for i-got-a-new-guitar
        i-got-a-new-guitar : {
            timeuuid_1 : { // SC Name
                // all Columns in the SC are attribute of the comment
                commenter: Johnny Guitar,
                email: guitardude@example.com,
                comment: nice axe dawg...
                commentTime: 1250438004
            },
        }

        ..
        // then more Super CF's for the other entries
    }
-->
<ColumnFamily CompareWith="TimeUUIDType" ColumnType="Super"
    CompareSubcolumnsWith="BytesType" Name="Comments"/>

Woot!

That’s it. Out little blog app is all modeled and ready to go. It’s quite a bit to digest but in the end you end up with a pretty small chunk of XML you’ve gotta store in the storage-conf.xml:


    <Keyspace Name="BloggyAppy">
        <!-- other keyspace config stuff -->

        <!-- CF definitions -->
        <ColumnFamily CompareWith="BytesType" Name="Authors"/>
        <ColumnFamily CompareWith="BytesType" Name="BlogEntries"/>
        <ColumnFamily CompareWith="TimeUUIDType" Name="TaggedPosts"/>
        <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"
            CompareSubcolumnsWith="BytesType" ColumnType="Super"/>
    </Keyspace>

Now all you need to do is figure out how to get the data in and out of Cassandra ;) . That’s all accomplished via the Thrift Interface. The API wiki page does a decent job at explaining what the various endpoints do so I won’t go into all those details. But, in general, you just compile the cassandra.thrift file and use the generated code to access the various endpoints. Alternatively you can take advantage of this Ruby client or this Python client.

Alrighty… hopefully all that made sense & you finally understand WTF a SuperColumn is and can start building some awesome apps.

  • Digg
  • del.icio.us
  • Facebook
  • Reddit
  • Twitter
Tagged: , ,

88 comments

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model — Arin Sarkissianarin.me [...]

Matt Corgan on Sep 09, 2009 at 8:46 pm

Thanks for taking the time to do that writeup. I still have a few questions as I’m trying to differentiate Cassandra’s data mechanics from the new .20 release of HBase.

What is the difference between a super-column like:

homeAddress: {
street: “1234 x street”,
city: “san francisco”,
zip: “94107″,
}

and the BigTable or HBase style of concatenating nested keys together into something like:

homeAddress/street:”1234 x street”,
homeAddress/city: “san francisco”,
homeAddress/zip: “94017″

Wouldn’t they be sorted the same way on disk and be just as efficient for range queries?

Is it that you avoid storing the string “homeAddress” redundantly?… is that the only savings? Maybe that really adds up if you’re doing inbox search and storing billions of doc ids.

Can supercolumns only add one level of nesting beyond normal columns? That seems limiting considerng BigTable and HBase can append an arbitrary number of nested keys together.

Does Cassandra somehow take care of creating the supercolumns for you like a database index is automatically updated when table rows are modified?

Can you update the columns in the row of a supercolumn without overwriting the whole row? For example, if a facebook user sends his 10,000th message with the word Steelers in it, does that mean 10,000 columns need to be overwritten (something like 100KB), or can a single column be sqeezed into the front of a supercolumn?

I think my confusion about Cassandra stems from both the naming scheme and the difficulty I’ve had in finding documentation about the exact layout of the underlying data files. Understanding the file layout of a datastore is crucial to designing an efficient application around it.

Thanks again,
Matt

Vitaly Kushner on Sep 10, 2009 at 4:05 am

As I understand another option for sorting is by-date. The timestamp that you eliminated from the data model in the sake of clarity can be used to sort the data on disk and retrieve it later.

Thijs (Shenzhen) on Sep 10, 2009 at 7:01 am

Great introduction to Cassandra. I’ve read other articles about Cassandra, but yours is by far the easiest to understand. Thanks

Matt Nowack on Sep 10, 2009 at 9:22 am

I think the examples would be easier to read if you removed the comments from them. I understand why they are there, to elucidate the point of the previous paragraph, but it makes it harder to see the structure of the data model.

Otherwise very nice read.

Weekly Web Nuggets #76 : Code Monkey Labs on Sep 14, 2009 at 7:31 am

[...] solution – Cassandra – into production. Want even more details? His co-worker Arin Sarkissian has a follow-up post with more in-depth details about [...]

links for 2009-09-18 « dstelow notes… on Sep 18, 2009 at 3:06 pm

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model — Arin Sarkissian For the last month or two the Digg engineering team has spent quite a bit of time looking into, playing with and finally deploying Cassandra in production. It’s been a super fun project to take on – but even before the fun began we had to spend quite a bit of time figuring out Cassandra’s data model… the phrase “WTF is a ’super column’” was uttered quite a few times. (tags: dev database toread top10 dataaccess architecture) [...]

Cassandra DB « Jetfar.com on Sep 21, 2009 at 4:47 am

[...] 3. Column Querying: BigTable really popularised (and proved) the concept of the column database for large scale applications. Cassandra really is similar to BigTable from this perspective, but introduces the SuperColumn. [...]

William Chan on Sep 21, 2009 at 5:55 pm

Great tutorial on Cassandra data modeling!! It is really helpful.

Dan Larsen on Sep 30, 2009 at 4:05 am

Very nice explanation!

It’s pretty simple stuff, but hard to understand when your brain is cluttered by RDBMS thoughts…

What strikes me, is that this way of modelling data, actually fits pretty nicely into the thoughts I had to get out of my brain, way back when I was trying grasp those RDBMS models :-P

It’s the way many data should’ve been structured and queried all along…

Again… Thank you for this excellent explanation!

Best regards
Dan Larsen

Michiel van Vlaardingen on Oct 24, 2009 at 7:02 am

Thanks for writing this. It really helps me understanding the basic approach and associated limitations/difficulties.

Patricio Echague on Oct 29, 2009 at 10:51 am

Awesome article. The comments on the code were really useful. It helped me a lot to understand Cassandra model.

Best

[...] AppEngine, por exemplo, não aceita chaves compostas porque usa o BigTable por baixo. Existem pequenas diferenças entre implementações orientadas a [...]

Cassandra – Getting Started on Nov 05, 2009 at 5:58 am

[...] data model has been described more than once. I will try to add a more javaish view to the set which I find easiest and most powerful to [...]

[...] by Google’s Bigtable paper, section 2.  (Cassandra drops historical versions, and adds supercolumns.) In both systems, you have rows and columns like you are used to seeing, but the rows are sparse: [...]

optimus prime on Nov 17, 2009 at 3:11 pm

How efficient is getting list of posts per tag in this example? first there is one query to get post ids, and then 10 separate queries to get each post, and then for each post a bunch of other queries to get author etc…

In an RDBMS this can all be done with one complicated sql with complex joins. Doesn’t sound very efficient to me.

Arin on Nov 17, 2009 at 4:27 pm

yo optimus prime – you’d actually do a multiget – not 10 single gets.

i concentrated this entry on the data model not API usage – but ya, doing 10 gets instead of 1 multiget would be wasteful.

ps: next time use a real email address

riffraff on Nov 19, 2009 at 11:51 pm

nice article, I’m maybe finally getting the cassandra data model, thank you :)

Although, I am still not clear on one thing, is it possible to query on different attributes or only on the key ?

E.g. if I want to index something like the classical “spam/ham” example, meaning I want rows like I’d like to be able to query on both user, item, or user+evaluation+time (give me all her evaluations of kind of ham in last week) but it is my feeling this would require quite a bit of contortion.

[...] is not not as powerful as the thrift interface. You can for instance not execute get queries in Super Columns, those queries will create a java [...]

zznate on Nov 27, 2009 at 3:02 pm

Thanks for a clear and realistic explanation! FYI – your storage-conf.xml does not work verbatim on Cassandra 0.4.2. The type attribute of your super column definition should be ColumnType.

Otherwise, this was quite helpful for me in understanding how a real world example could work.

Cheers,
Nate

Miguel Cobá on Nov 29, 2009 at 8:50 pm

Great article. Finally I can see a real, complete example about the way to model a known problem in a schema-less, non-RDBMS domain.
A couple of questions.
The insert of a new post with 2 tags will implies 4 writes to the Cassandra db. One in the BlogEntries CF, and 3 on the TaggedPosts CF (one for __notag__ and two for each of the 2 tags for the post), right? This will not be atomic, AFAICT from the cassandra docs (the inserts are atomic only for single CF). Am I wrong?
Next, what about editing a post and changing the tags associated with it. Suppose a post has initially 1 tag: tag1 and after editing I delete the tag1 and add now a tag2.
In a RDBMS world I would search for the tags that contain my post id and delete the id from all of the tags it is part of. Then I will insert the id in the new tags.
In cassandra I suppose that before deleting the old tags (and this imply that my app must compare the “before” tags with the “after” tags in order to determine which tags were deleted) I must delete the column that contains in its value, the post slug (how can I do that?) for the row key matching the old, deleted tag. Then I will add a new column in the row for the new tag (or maybe add a brand new row if the tag isn’t present before). Kind of complex, but maybe also a consecuense of a RDBMS way of thinking about the problem.
Other questions, how can I tell if a row already contains a given column. For example, what if I put the tags: tag1,tag2,tag3,tag1 to a given post. Will this add two columns for the post in the tag1 row of the TaggedPosts CF?
Anyway, your explanation is very good and has vanished a lot of doubts I had with the cassandra data model.
I must recognize that even the cassandra’s terminology is very confusing at the moment (and the resistance to change it, as seen in the mailing lists discussions, isn’t very encouraging) a good example can be very helpful in explain the data model of cassandra and, at least for the moment, help ignore the confusing cassandra terminology.
Thanks,
Miguel

Ive Comu on Jan 06, 2010 at 9:27 am

Arin,

Nice explanation – thank you for your time.

If you could also include some code (in java / php …) to do gets/ multigets / puts , that would clarify the “group’em” part even more.

Thanks again

buckmin on Jan 27, 2010 at 6:01 am

Nice post.

I have experimented with some schemaless databases i think i am starting to learn how to think non-RDBMS. Still, there are some concepts that bug me. Like ones Miguel mentioned and more importantly just a little more complex queries like ‘posts with guitar and movie tag’ or ‘posts with guitar tag but not with movie tag’. I read in somewhere that we need to intersect or union the ‘grab last posts with guitar tag’ and ‘grab last posts with movie tag’ results manually. Coming from a RDBMS realm this seems too odd for me. Is there a more elegant way of doing this, apart from creating another CF for posts with two tags? I don’t know, how far can we go denormalizing everything?

josh on Feb 05, 2010 at 9:13 am

This is really helpful, thanks for spending the time.

NOSQL introduction | Thomas Koeppen Blog on Feb 08, 2010 at 2:09 pm

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model Sep. 2009 Author: [...]

xxb on Feb 24, 2010 at 4:09 am

nice article, this helps me smooth out confusions of cassandra data model.

Yonatan on Feb 25, 2010 at 7:06 am

Any answer’s for Miguel Cobá’s question ?

Yonatan on Feb 25, 2010 at 7:08 am

and another one – I understand that I need to work hard cause the data is not normalized – so when data is changed I need manually to update all different CF and Columns.
but what about the joins ? is still fast enough when I do the joins manually ?
manually I mean run a query – get list of data extract values considered as keys for other CF and run a query on another CF (and sometimes on a third CF) ?

– Yonatan

Marcin on Mar 02, 2010 at 4:21 am

I’m just trying to get my head around Cassandra, and I _think_ I understand how the data model works. There is however one simple use case I can’t work out how you’d represnent: for the AddressBook example, say you had two friends with the same real full name – how would you key the super-columns then? Obviously you’d need a unique identifier (say a UUID), but what if you wanted to then get back a list of your friends in alphabetical order of real name?

Pablo Rosales on Mar 04, 2010 at 11:40 am

Hey, great post, its very useful! At first Cassandra is confusing but so far I love it.

Brian Mansell on Mar 04, 2010 at 7:21 pm

First of all, I must say that this is an excellent article!

As this appears as one of the first Google results when searching for Cassandra, I’d recommend posting an update which reflects the storage configuration changes (e.g. type -> ColumnType) made in recent versions of Cassandra.

thanks for all your efforts with this tutorial!

[...] WTF Is A Super Column?!? Cassandra reading list Facebook’s overview of Cassandra [...]

[...] WTF Is A Super Column?!? Cassandra reading list Facebook’s overview of Cassandra [...]

Muhammad Nasrullah on Mar 14, 2010 at 8:51 am

Outstanding article, thank you for putting this up!

Mondain on Mar 14, 2010 at 10:50 am

Nice post, I’ll be using this info for my first “experiment” with Cassandra. I would like to point out a possible typo on this line:
comments have text, time submitted, commenter’s name & commenter’s name
Is the duplication of commenters name intended?

ElangovanS on Mar 14, 2010 at 10:38 pm

Hi… this is pretty neat article and so the client software. Are there any examples that explains how to setup the storage configuration in the code?

Enlaces rápidos (15-03-2010) | Sentido Web on Mar 15, 2010 at 4:33 am

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model [...]

Mark Brady on Mar 15, 2010 at 10:37 am

First,

It seems like a super column is essentially what Oracle calls a Nested table. A column in which each “cell” can contain a table.

A Column family is a table. If you want it stored sorted, it’s a Index-Organized Table.

It seems the “Schema-less” databases are finding that schemas are actually pretty useful after all.

The only difference is that there no predefined column/attribute list. That concept makes a fair bit of sense to me in the context of a doctors visit, where there are thousands of symptoms and important missing symptoms and thousands of mostly null columns make little sense. In a lot of other cases it seems that developers just don’t want to bother with requirements… why plan an object’s attributes, I’ll add a new one any time. What stops developers from adding multiple attributes which all mean the same thing?

Optimal Male on Mar 15, 2010 at 3:30 pm

Thanks Arin – very useful explanation…
But i also have the same questions as Miguel posted earlier. Specifically, the major one I can’t get my head around is that in your example, several column families and rows are updated when a new blog is posted – but there is no atomicity. Wouldn’t this lead to all sorts of inconsistencies in your data? I can’t understand how, even for something as simple as this blog example, your application could live with that? Do you have any thoughts on how that is meant to be handled?

[...] Shared WTF is a SuperColumn? An Intro to the Cassandra Data Model — Arin Sarkissian. [...]

Mike v. Dyke on Mar 22, 2010 at 4:04 am

Let’s map Cassandras terminologie to common used terms:
1) a “Column” is a key-value pair plus timestamp (=attribute)
2) a “Super Column” is a map of attributes (=row)
3) a “Standard Column Family” is a map of rows (=table)
4) a “Super Column Family” is a map of tables (=table of nested tables)
5) a “Keyspace” is a map of “Column Families” (=data base)
Why make things more difficult than they are? It’s quite simple though!

PS: Please replace my previous comment by this one. Thanks!

Soluzioni NoSQL, Apache Cassandra | Webbare on Apr 05, 2010 at 9:00 am

[...] Un esempio di schema in Cassandra: http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model [...]

[...] will let Arin Sarkissian describe why you should care about TimeUUIDType: Since we’re going to want to display lists of entries in chronological order we’ll make sure [...]

[...] Cassandra for Dummies Oh yeah, NoSQL is hot this days. [...]

C.T. on Apr 07, 2010 at 9:21 pm

Great post, very clear, thank you very much.

In RDBMS, most time I fetch data with ONE get and multi-join, in Cassandra, it seems to use multi-get and no-join, very interesting.

Kenneth Haugland on Apr 08, 2010 at 2:31 am

Great article! Bookmarked this for later use!
Regards
Kenneth

andy.edmonds.be › links for 2010-04-10 on Apr 10, 2010 at 4:03 pm

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model — Arin Sarkissian (tags: cassandra nosql database tutorial programming) This was written by andy. Posted on Sunday, April 11, 2010, at 1:03 am. Filed under Delicious. Bookmark the permalink. Follow comments here with the RSS feed. Post a comment or leave a trackback. [...]

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model [...]

Eric Twilegar on Apr 14, 2010 at 5:38 am

At the end of the day this is a object oriented distributed db. You are storing something that has more than two dimensions in many cases. For instance their is a row per tag…but it has a list of blogs attached to it. I could see a normal DB table called Tags with one line for each tag and just a comma delimited list of blog_id’s for each entry attached. You’d have to a use a list because most RDBMS don’t have a column type of “table”…I wish they did….some how XML types though.

My big concern is documentation. How do I know that I used bday versus birthday in as the name in some super column. Is the developer just supposed to grab a random record and dump it? Do you have to spend hours writing up HTML docs…did you write a tool that grabs random records and generate it?

I think it’s cool the db has forgotten about the schema, but it seems like the API (or a layer up ) needs to be developed to maintain some basic schema. So your code would be insertUser(bday=’1/1/2000′ ) and it’d take care of all the API calls. If you look at the API calls there is a ton of boilerplate code to just get something inserted.

Also, can someone write this from the top down. I mean it would nice to see how you would replace this from start to finish

select * from users where last_login < sixteydaysago()

That is how would you find out what users haven't logged in in a while so you could send them an email reminding them to. I think the real problem people can't get these concepts is because it's like starting with a "CREATE TABLE" statement. Lets start with an actual real world query and go backwards. How would you implement JOINs. I know you don't do JOINs, but at the end of the day you have to get at the related data somehow.

I think Casandra will be great…but the docs leave a lot to be desired at this point.

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model: a very helpful look at the cassandra data model View full post on Jeremy Zawodny’s linkblog [...]

[...] was very helpful in getting me up to speed on the data structure used in Cassandra. You can checkout his post, download the pdf or view the scribd doc [...]

Napolux on Apr 22, 2010 at 5:28 am

Cool article. I really like it, it’s a superb guide for “nosql beginners”

Földi Tamás on Apr 24, 2010 at 2:27 am

Great stuff, I recommended to some of my colleagues

[...] Cassandra has been proven with large data sets, with indexing done by pre-defining “supercolumns” that provide the mapping between indexes and their corresponding data values. CouchDB [...]

Slobodan Utvic on Apr 26, 2010 at 4:23 am

One thing I find very hard to wrap my mind around is how to handle on delete garbage collection. If I were to delete BlogPost is there any way to delete reference from TaggedPosts without using brute force approach? In general question would be how to build efficient garbage collector for Cassanda :)

[...] Cassandra is a structured store. It has a rich data model that goes beyond a simple key-value model. This is a good thing as some of the complexity can be pushed to Cassandra, leading to simpler and more efficient applications. There is problem though: the guys who designed Cassandra have given complex names to the (rather simple) data model elements. This post tries to explain  (again) WTF is a SuperColum! [...]

Andrew Babkin on May 05, 2010 at 12:39 am

Can storage-conf.xml be changed after some data has been loaded to cassandra? I understand there may be troubles if you change existing elements, but can you add new element without affecting system stability?

[...] Github page is pretty spartan. Let’s flesh it out but first you might want to read WTF is a SuperColumn and Up and running with [...]

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model [...]

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model  [...]

[...] Since there is a lack of advanced beginner documentation, i have made an little example based on the “wtf is a supercolumn datamodel” [...]

[...] price of scaling is that Cassandra provides poor support for ad-hoc queries, emphasizing denormalization instead.  For analytics, the upcoming 0.6 release (in beta now) offers Hadoop map/reduce integration, but [...]

Sagar on May 14, 2010 at 9:37 pm

I have just started with NoSQL for some work and your article has been extremely helpful! Thank you.

jesus scissorhands on May 19, 2010 at 2:06 am

People, don’t bother asking questions. clearly nobody is listening. Sucks, man. there’s a lot of important questions here. would be awesome if Mike v. Dyke’s statement was validated/refuted:

Let’s map Cassandras terminologie to common used terms:
1) a “Column” is a key-value pair plus timestamp (=attribute)
2) a “Super Column” is a map of attributes (=row)
3) a “Standard Column Family” is a map of rows (=table)
4) a “Super Column Family” is a map of tables (=table of nested tables)
5) a “Keyspace” is a map of “Column Families” (=data base)

or Eric Twilegar’s simple question,

SELECT * FROM users WHERE last_login < sixteydaysago()

would this use of KeyRange, KeySlice, SliceRange, TokenRange in Thrift? does a separate denormalized CF keyed on last_login need to be created to support this view?

Follow-up article would be awesome!

how would that view be achieved? a

joe on Jun 04, 2010 at 3:12 am

quick question about the blog example app:

since cassandra is non relational and document based (?) why can you not store all the information for a blog entry in a single row?

e.g. (pseudo-json):

blog-post
-author
-author name
-# posts
-etc…
-body

surely then you only need to make a single query to get all the details of the post?

i realise there would be data duplication since you would still want an authors table with the same data in it, and any modifications to an author would need to be replicated.

but could you not store data in this way?

Arun Manivannan on Jun 05, 2010 at 5:58 am

This is by far the best source of explanation for Column and Super families. thanks a ton.

[...] Voilà les principaux éléments définis par Cassandra (Cet article ainsi que la documentation Cassandra fournissent un point de vue plus complet sur les différents [...]

[...] lost? Here are a the main elements Cassandra defined (this article or the Cassandra documentation provide a more in depth view of the different types of structures [...]

Ahmad Hashem on Jun 18, 2010 at 8:05 pm

you rocks :) This is the best article found so far describing what is “Cassandra”.
It took me couple of hours in vain surfing their Apache site. but your article gave it to me in 20 minutes.
Thanks

Diogo on Jun 21, 2010 at 7:35 pm

What´s a command to get all BlogEntries from Autors in “cassandra-cli”?

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model [...]

squirrel on Jul 02, 2010 at 10:25 pm

What is the difference between SuperColumn and Row? Is Row a SuperColumn? Is it stored in ColumnFamilly like SuperColumn? Without JSON notation, like this:

CF = {
‘row_key’ : {
name: ‘row_key’
value: {
// columns
// …
}
}
}

or like this:

CF = {
‘row_key’ : {
// columns
// …
}
}

?

steve on Jul 03, 2010 at 2:45 pm

took me a while to get my head around this but once i did it was very useful, great blog, thanks a million

[...] quite a bit of time learning it. There are great articles on the Web that explain the model. Read WTF is a SuperColumn? An Intro to the Cassandra Data Model and my favorite one – Installing and using Apache Cassandra With Java. This blog post is my [...]

大话Cassandra数据模型 | Jim的blog on Jul 12, 2010 at 10:35 pm

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model [...]

Rafael Forte on Jul 14, 2010 at 6:24 pm

Arin,

Thanks a lot for the article. It is a very good introduction to Cassandra’s data model.

Regards,

[...] Cassandra by example (Twitter alike one) http://github.com/ericflo/twissandra WTF is a supercolumn Cassandra model Cassandra NoSQL Database Apache Cassandra NoSQL Live Dynamo derivatives: Cassandra [...]

Loreto Parisi on Jul 21, 2010 at 7:32 am

When using Keyspace

and a SCF key name like string ‘scream-is-the-best-movie-ever’ and a Time UUID (UUID v1) for sub-column key name, such as

scream-is-the-best-movie-ever : { // row key = row key of BlogEntry
// oldest comment first

ffffffff-94db-11df-a8a0-63617373636c : { // SC Name
// all Columns in the SC are attribute of the comment
commenter: Joe Blow,
email: joeb@example.com,
comment: you’re a dumb douche, the godfather is the best movie ever
commentTime: 1250438004
},

I have got the error:

“UUIDs must be exactly 16 bytes”

This means that also the SCF column name (the SCF row key ‘ scream-is-the-best-movie-ever’) should be a Time UUID v1. Or what?

[...] статьи: WTF is a SuperColumn? An Intro to the Cassandra Data Model // Автор Арин Саркесян Related postsАрхитектура больших [...]

huckleberry finn at borders on Jul 30, 2010 at 12:06 am

The post is written in very a good manner and it entails many useful information for me. I am happy to find your distinguished way of writing the post. Now you make it easy for me to understand and implement the concept. Thank you for the post.

[...] on the data model is here and [...]

Sankar on Aug 05, 2010 at 2:52 am

Good Article

Mallikarjun on Aug 12, 2010 at 9:46 pm

This is a Great Job…

Nice explanation…….

Mallikarjun on Aug 17, 2010 at 11:46 pm

Hi,

I have a requirement of using Cassandra in my application. In my application there is one table with lot of data and most of my application uses that table. Due to lot of data,performance of the application is decreasing when i use that table is in Oracle.

So, I have decided to use the Cassandra database for that one table and all other tables in oracle. Lot of business logic is dependent on that table.

No my question is, Can I use the Cassandra for a table which has lot of business logic.

I am unable to implement lot of where clauses for Cassandra database.

Is there any supporting tool to use Cassandra in an efficient way?

Please let me know…
i am in urgency..

Thanks in advance

By Mallik

[...] if you haven’t already you must read: WTF is a SuperColumn? An Intro to the Cassandra Data Model from Arin Sarkissian over at [...]

[...] WTF is a SuperColumn? An Intro to the Cassandra Data Model Cassandraデータモデル入門 – Arin [...]

Signup to Leave a Comment

Or Login via Facebook

Lots of responses on my laptop question. Thanks for the feedback guys 2 hrs ago

Search This Blog