More permanent stuff at http://www.dusbabek.org/~garyd

## 26 March 2012

### Calculating Long Running Averages

Something I've been working on has me calculating averages over time on data arriving into a system.  I don't know beforehand how many pieces of data there will be, which makes the calculation difficult.  This operation needs to be done for both floating point and integers.

The floating point solution turned out to be fairly simple:

```java
double average = 0d;
int count = 0;

// when a number comes in, do this:
average += (new_number - average) / ++count;
```

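Here's how it works:  Realize that when a new number comes in, it is going to pull the average slightly higher or lower.  The size of that pull is what this calculation adds: if `average` covers the first `count - 1` numbers, the new average is `(average * (count - 1) + new_number) / count`, which rearranges to `average + (new_number - average) / count`.  Since `count` is always growing, this means that numbers arriving later have far less influence on the average unless they are fairly large (as it should be).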
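
As a quick sanity check, here is a minimal sketch that runs the incremental update next to the usual sum-then-divide calculation.  The class name and sample values are made up for illustration; the two results should agree up to floating point rounding.

```java
// Sketch: incremental mean vs. the sum-then-divide mean (class name is illustrative).
class RunningAverageDemo {
    private double average = 0d;
    private int count = 0;

    // Move the average toward the new value by 1/count of the gap between them.
    void add(double newNumber) {
        average += (newNumber - average) / ++count;
    }

    double get() {
        return average;
    }

    public static void main(String[] args) {
        double[] samples = {4, 8, 15, 16, 23, 42};
        RunningAverageDemo running = new RunningAverageDemo();
        double sum = 0;
        for (double s : samples) {
            running.add(s);
            sum += s;
        }
        // The two values should agree up to floating point rounding.
        System.out.println("incremental = " + running.get());
        System.out.println("sum / count = " + (sum / samples.length));
    }
}
```

The point of the incremental form is that it never needs the full sum, so it keeps working when the number of inputs is unknown and unbounded.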

The integer solution looks different even though it is essentially the same thing.  What complicates the problem is that I want to avoid using floating point math if at all possible.  There are several different computations that would work, but here is the one that seemed to follow the true average as closely as possible:

```java
long average = 0, remainder = 0;
int count = 0;

// when a number comes in, do this:
count++;
// distance from the current average, plus what was left over last time
long delta = new_number - average + remainder;
average += delta / count;
remainder = delta % count;
```

Here is how it works:  since I cannot add incremental pieces of a whole number as they arrive (as I did in the floating point example), those incremental chunks are tracked in `remainder`.  Eventually, if the inputs are fairly uniform, enough delta builds up in `remainder` to add or subtract a number from the average, or it happens right away if the new number is significantly higher or lower than the average.
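
The bookkeeping can be checked with a small sketch.  It assumes the update is measured relative to the current average (that is, the amount divided by `count` is `new_number - average + remainder`), which keeps `average * count + remainder` equal to the running sum at every step.  The class name, accessors, and sample values are illustrative only.

```java
// Sketch of the integer update with a check of the bookkeeping (names are illustrative).
class IntegerRunningAverage {
    private long average = 0, remainder = 0;
    private int count = 0;

    void add(long newNumber) {
        count++;
        // Gap between the new value and the current average, plus whatever
        // was left over from earlier updates.
        long delta = newNumber - average + remainder;
        average += delta / count;   // whole part moves the average
        remainder = delta % count;  // leftover is carried to the next update
    }

    long average() { return average; }
    long remainder() { return remainder; }
    int count() { return count; }

    public static void main(String[] args) {
        long[] samples = {10, 10, 11, 3, 250, 7, 7};
        IntegerRunningAverage avg = new IntegerRunningAverage();
        long sum = 0;
        for (long s : samples) {
            avg.add(s);
            sum += s;
            // Nothing is lost: average * count + remainder equals the true sum.
            if (avg.average() * avg.count() + avg.remainder() != sum) {
                throw new IllegalStateException("bookkeeping drifted");
            }
            System.out.println("after " + s + ": average=" + avg.average()
                    + " remainder=" + avg.remainder());
        }
    }
}
```

Because Java's `/` truncates toward zero and `%` takes the sign of the dividend, `remainder` can go negative; the average then sits just above the true value instead of just below it, but it stays within one of it.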

I wanted to verify correctness of the algorithm, so I wrote a simulator.  I found that early on the averages vary from the true average a bit, but this variance goes away as `count` increases.  To compensate for this, I keep track of the sum of all `new_number` values and calculate the average the normal way until either `sum` or `count` exceeds a predetermined threshold.

## 19 April 2011

### Thoughts on Rdio

Clearly, each of the services gets some things right.  I could go into detail about why I like Rdio and chose it over its competitors (Last.fm, Pandora and Slacker), but this post is mainly about why Rdio is right for me and what would make it better.

#### The Good

##### Playlists
I like Rdio mainly because it gives me flexible listening options:

• It lets me have playlists with cherry-picked songs.
• It can generate playlists based on an artist.
• I can queue up entire albums.
• I can listen to songs offline using my Android phone.

##### API
Although they didn't have it when I started my subscription, Rdio have added a public API, which is cool.  This means that developers will create interesting applications that use Rdio (unlike, say, Pandora).  I've played around with it a bit and it is pretty natural.

##### Discovery
I absolutely *love* that I can listen to entire albums at once.  Remember that band you heard that one song from once and you meant to go back and check them out?  Only, you had a terrible time finding a decent way to sample all their music and you weren't ready to make the commitment of actually paying for it.  (We have all made that mistake before.)  Rdio solves that problem: queue up their catalog and listen to everything for a day or two.  Then go buy, or not.

#### Room For Improvement

##### Offline Content
• I wish the desktop application had offline mode like the mobile app.
• Managing offline media from the mobile application is cumbersome, especially if you have synced a lot of songs.
• Managing media from the desktop application is even more difficult.
• It would be awesome if I could create a station based on an artist and then sync those songs (in much the same way I can create a playlist and sync those songs with one action).

Most of my grief with offline songs would be eliminated if there were a way to expire the offline content so I didn't have to manage it myself (like a DVR).  E.g.: keep this song for a [week, month, etc.], delete it after I play it [once, twice, three times, etc.]

##### Playlists
Rdio does not have "genius mode."  I would like to generate an awesome playlist based on a single song of my choosing (like iTunes).  They have artist-based playlists, but it isn't the same.  Theoretically, an enterprising programmer should be able to create software that exports an iTunes genius playlist (or any iTunes playlist) to an Rdio playlist.  Sounds like a fun weekend project.

I wish I could create multiple queues (e.g.: a "work" queue, a "home" queue, and a "quiet Sunday" queue).  I could use playlists for this, but I cannot add entire albums to a playlist unless I add the songs one at a time.

The Rdio machine learning algorithms are not that good.  They could take a lesson from Pandora or Last.fm.

If you use queues and then go listen to another song, your currently playing queue item (an album) is lost.  This used to bite me a lot and annoyed me until I eventually changed my habits.  This could be resolved if Rdio just remembered where I was in the queue when I play something ad hoc.

##### Streaming
It takes too long to go from pushing the play button on my mobile player to actually hearing something.  If this is a problem of pushing bits, maybe Rdio could push some low-quality data (which doesn't require as much bandwidth) at first to get some data on the user's player quickly, then come in later with the full-quality audio.  If this is a latency problem, my bad.

##### Rating, Scrobbling and Data
It would be cool to scrobble love/hate from Rdio.  Currently I have to go to the Last.fm website to do it.  Also, it would be nice to rate songs (like iTunes) or tag them as I listen to them.  This is helpful for music discovery at times when I do not really want to be distracted by the process of copying down a song or artist name.

##### Something to Show For it All
This isn't a deal-breaker (none of it is, after all--I am paying for the service).  It saddens me that if I give Rdio \$10 a month for the next 10 years and then stop, I will have nothing to show for it but the memories.  It would be great if I could keep some of the music permanently.

I realize that it is the same value proposition as cable TV and that I want to have my cake and eat it too.  But folks--this is music.  I am used to keeping what I pay for.

#### Summary
I like Rdio.  Out of all the players so far, it works best for me.  It has the most flexible listening options of any streaming music service, but I think there is room for improvement.  I hope they keep up the good work.

## 01 December 2010

### Introducing Casserole

If you've done much work with Cassandra clusters, you've probably gotten very familiar with bin/nodetool, which is the command line utility for poking Cassandra nodes.  If you are like most people, you have probably developed a love/hate relationship with it (your -h and -p fingers get sore quickly).

Well, here is something else you can love and hate.  Casserole is a GUI tool that encapsulates some of the functionality of nodetool.  Right now, it primarily monitors clusters, but the groundwork is in for performing operations as well.

As the readme says: this tool currently sucks.  It hasn't been tested much and I've only worked on it at odd times over the last few weeks.  If you find bugs, please report them!  Even better, fix them and send me pull requests.

I've got branches that target 0.7-beta3 and 0.7-rc1 (yes, things are still changing too much IMO).  `ant run` should get you off the ground quickly.  Sorry: no 0.6 support at the moment.  My plan is to maintain Casserole as long as there is interest, or place it in cassandra/contrib if there is interest for that.

## 29 September 2010

### RESTful Cassandra

A lot of people, when first learning about Cassandra, wonder why there isn't an easier (say, RESTful) way to perform operations.  I did.  It didn't take someone very long to point out that it mainly has to do with performance.  Cassandra spends a significant amount of resources marshaling data and Thrift currently does that very efficiently.

So I put away my RESTlessness.

I've heard more people lately clamoring for the feature, so I gave some thought to how I'd go about it.  One approach would be to wrap Thrift.  That would be nice from a coupling standpoint, but I think performance would be pretty crappy.  After all, it is just adding another layer of marshaling that needs to be done; nobody needs that.

I eventually arrived at the decision that an HTTP Cassandra Daemon/Server pair similar to the existing Avro and Thrift versions would do the trick.  It would basically be a straight port, with a few minor caveats.  One big thing is that HTTP uses a new connection for each request, so storing Cassandra session information in threadlocals goes out the window.  This means that authentication needs to be abandoned, done with every request, or we need to use HTTP sessions.  Punt.

Today, while half-listening to some lectures at ICOODB, I decided to see how hard it would be to throw something together.  I ended up with two classes containing stripped down implementations of get and set.  I pushed the whole thing to GitHub if anybody is interested.  I used the built-in Sun HTTP service because I didn't want any extra dependencies and building services on top of it is pretty straightforward.  Results are returned in JSON format that should match what you would see if you used sstable2json to export data.

This is clearly a proof of concept, but I think it demonstrates that the idea is sound and could be implemented fairly quickly.  Maintenance would be another story.  One problem with maintaining the Cassandra Avro bindings is that they regularly get out of sync with what is possible using Thrift.  An HTTP Cassandra wrapper would suffer the same fate without an active champion.  I'm interested, but I'm not *that* interested.

Anyway, have fun.

-- DETAILS --
The following URI formats are expected:
/get/keyspace/column_family/row_id/super_column/column_start/column_end/consistency_level.  If you don't want to pass a component in, leave it blank.  Empty strings are interpreted as null when appropriate.

Here is an example from my tests:  http://127.0.0.1:9160/get/Keyspace1/Standard1/1//10/11/ONE

The main thing to deduce here is that the super column is empty (see the double slash?).  If you haven't realized by now, I've gone ahead with the assumption that your keys and column names are strings.  This isn't good enough.  All the details we need to become type-aware are available in the comparator for the column family.  As a shortcut for now, you can append "?asString" to the end of the URI to have all byte[] values converted to strings.  Without it they are displayed as hex.
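
To make the positional format concrete, here is a minimal sketch of splitting a /get path into its components and turning empty segments into null.  `GetUriParser` is a hypothetical helper written for this illustration; it is not the code from the proof of concept, and it assumes the path carries exactly the seven components listed above.

```java
// Sketch of splitting a /get URI into its positional parts (hypothetical helper,
// not the code from the proof of concept).
class GetUriParser {
    // Order of the path components after the leading "/get".
    static final String[] FIELDS = {
        "keyspace", "column_family", "row_id", "super_column",
        "column_start", "column_end", "consistency_level"
    };

    // Returns one entry per field; empty path segments become null.
    static String[] parse(String path) {
        // The -1 limit keeps trailing empty segments instead of dropping them.
        String[] parts = path.split("/", -1);
        // parts[0] is "" (before the first slash) and parts[1] is the operation.
        if (parts.length != FIELDS.length + 2 || !"get".equals(parts[1])) {
            throw new IllegalArgumentException("unexpected URI: " + path);
        }
        String[] values = new String[FIELDS.length];
        for (int i = 0; i < FIELDS.length; i++) {
            String part = parts[i + 2];
            values[i] = part.isEmpty() ? null : part;
        }
        return values;
    }

    public static void main(String[] args) {
        String[] values = parse("/get/Keyspace1/Standard1/1//10/11/ONE");
        for (int i = 0; i < FIELDS.length; i++) {
            System.out.println(FIELDS[i] + " = " + values[i]);
        }
    }
}
```

With the example path, the double slash yields an empty segment in the super_column position, which the sketch maps to null.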

Updating works the same way: /set/keyspace/column_family/row_key/super_column/column/value/consistency_level

e.g.: http://127.0.0.1:9160/set/Keyspace1/Standard1/1//11/aaa/ONE

UPDATE: I went ahead and created two additional implementations that use jetty (bare and with servlets).  This generated a bit more code, but opens up the way to getting sophisticated with sessions.

## 11 March 2010

### Running Multiple Cassandra Nodes on a Single Host

One of the first Cassandra tickets I worked on had me reviewing some code that visualized the node ring.  Properly testing the code required that I run a cluster.

But I didn't have access to a cluster. Neither did I feel like creating a virtual cluster by building a VM and cloning it several times.  What I wanted was to run several instances of Cassandra on a single machine with multiple interfaces, all pointed at the same compiled code (without multiple svn checkouts).

The Cassandra wiki explains how to tweak Cassandra settings by editing cassandra.in.sh, but doesn't explain what needs to be done to run concurrent instances.

It turned out not to be too difficult.  I figured it might be daunting enough to Cassandra noobs (of whom we're seeing more lately due to some great exposure) that a blog post might be helpful.

This tutorial assumes that you'll want to run multiple instances of Cassandra on code built by ant and not a standalone jar.  I am also assuming that you are a) just playing around, or b) intend to do some development.  This is not a tutorial explaining how Cassandra should be run in production.

Note: I apologize for the way this looks.  Blogger is not a friend of ordered lists.

1. Make sure you've got aliases to localhost (e.g.: 127.0.0.2, 127.0.0.3, etc.).  Mac OS X doesn't have this enabled by default, so you'll have to manually create aliases:

   `sudo ifconfig lo0 alias 127.0.0.2 up`
   `sudo ifconfig lo0 alias 127.0.0.3 up`
2. Decide where you're going to keep things.  You can keep them with your code, but that just isn't neat.  Pick a directory somewhere, call it \$cass_stuff.
3. Then, for each node in your little cluster, do this:

   1. From your svn checkout, copy the conf directory into \$cass_stuff.  You can rename it to something like conf0 (or conf1, etc.).  I'll assume \$conf from here on out.
   2. Copy bin/cassandra.in.sh to \$cass_stuff.  Give it a name that helps you associate it with the conf directory you just created (node0.in.sh or whatever).
   3. Open node0.in.sh in an editor and make the following changes:

      1. Hardcode cassandra_home to the location of your trunk.  This will give you the flexibility to run Cassandra from anywhere.
      2. Set CASSANDRA_CONF to the conf directory you just created.
      3. In the JVM_OPTS change the jdwp address= setting.  The default is 8888, but you should include the unique IP you chose for this node along with the port, e.g.: 127.0.0.2:8888.  Not specifying a host causes the debugger to bind to 0.0.0.0:8888 and you'll have port binding problems when you bring up more than one node.
      4. Pick a unique port for com.sun.management.jmxremote.port, but make sure you have at least one node listening on 8080 since all the Cassandra tools assume JMX is listening there.  Unfortunately, you can't pick the JMX host; 0.0.0.0 is assumed.  I was under the impression this could be changed by specifying java.rmi.server.hostname, but had no luck going down that road.  (Please leave a comment if you figure out a way for this to work, but I think it might be hopeless.)
   4. Open \$cass_stuff/\$conf/storage-conf.xml in an editor and make the following changes:

      1. Specify unique locations for CommitLogDirectory and DataFileDirectory.  Don't bother with CalloutLocation or StagingFileDirectory.

To run you may wish to use another script for each node:

#!/bin/sh
CASSANDRA_INCLUDE=\$cass_stuff/
export CASSANDRA_INCLUDE
cd
bin/cassandra -f
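
Before starting the nodes, it can be worth confirming that each loopback alias from step 1 is actually usable, since a missing alias shows up later as a bind failure.  The following is a minimal sketch of such a check; `LoopbackCheck` is a made-up helper, and the addresses are just the examples used in this post.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Sketch: check that a loopback alias can be bound before pointing a node at it.
class LoopbackCheck {
    // Tries to open a listening socket on host:port; port 0 lets the OS pick a free one.
    static boolean canBind(String host, int port) {
        try (ServerSocket socket = new ServerSocket(port, 1, InetAddress.getByName(host))) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String host : new String[] {"127.0.0.1", "127.0.0.2", "127.0.0.3"}) {
            System.out.println(host + " bindable: " + canBind(host, 0));
        }
    }
}
```

Passing a specific port instead of 0 (for example the debugger or JMX port you picked for a node) also tells you whether something else is already listening there.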

One downside to this approach is that if you're tracking trunk, it is your responsibility to make sure you notice changes to the default storage-conf.xml and cassandra.in.sh and apply them to your environments.

Cassandra is supported by an active and welcoming community.  If you'd like to learn more about the project, check out our wiki, mailing list or hop on #cassandra on freenode.

## 15 December 2009

### Dear Entrepreneurs, this is something I would pay for...

Dear Entrepreneurs,

This is something I would gladly pay \$20 a month for...

A device that, according to my tastes, downloads new music from the Internet whenever it connects.  I would be able to listen to music without restriction while I am disconnected from the network.  I wouldn't own the music, except for roughly 20 tracks a month that I select, which would then become mine as MP3s (or FLAC or whatever DRM-less technology makes sense).  I could then load them into iTunes, give them to my brother, or (if I'm feeling sinister) make them available on a P2P network.

The music could come from anywhere: iTunes, Amazon, The Labels, or artists themselves.

The content sources exist.  The recommendation engines exist.  Devices exist.

I suspect the audience/market exists.  (At least, I hope so.  If not, and nobody is willing to pay for music, we're going to need to find another model.  And it will still necessarily involve a money exchange between producers and consumers and/or advertisers.)

Is there such a system already?

## 13 December 2009

### Christmas Mix 2009

I've been making Christmas mixes for my family the last couple of years.  It's not your typical Bing Crosby stuff, and requires some digging on my part.  I finally started blogging about it last year and think I'm going to make it a tradition.  So here goes... Christmas with an indie slant.  And I did a better job checking on the lyrics this year for family appropriateness.

The links this year are coming at you from Lala by way of Google.  Message me if things stop working.  (This blog post has turned out much like my Christmas shopping: it gets sloppy towards the end.)

1.  "Holiday Road" by Matt Pond PA.  This is the only repeat from last year's list.  I love this song because the vacation movies still connect with me at a level I am entirely uncomfortable with.

2.  "Blue Christmas" by Dread Zeppelin.  Believe it or not, there is a nice smattering of Christmas to choose from with these guys. Where else can you get Elvis, Led Zeppelin, Reggae and Christmas in one track?

3.  "Christmas is Going to the Dogs" by Eels.  Hard choice between this and "Everything's Gonna Be Cool This Christmas".

4.  "I Wish It Was Christmas Today"  by Julian Casablancas.  This one is for the kids.  I wish that I could still feel the way I did when I was a young boy after Thanksgiving.  Christmas, although only four weeks away, seemed like it sat on the other side of eternity.  As an adult, it comes and goes so fast I barely have time to enjoy it.  Message to kids: enjoy it while you can.  Responsibility steals the fun from Christmas!

5.  "Christmas Time is Here Again (Bring Out the Joy!)" by My Morning Jacket.  Peaceful.  I'll let you google for this one.  It's a live take from a radio broadcast.

6.  "Listening to Otis Redding at Home During Christmas" by Okkervil River.  Not a traditional Christmas tune, but a good one to follow MMJ, if only for the indie vibe.  This song reminds me of "New Slang" by The Shins, but with less jade and desperation.  Slightly more hopeful. :)

7.  "X-Mas Card" by MU330.  Not my normal thing, but the instrumental intro with the horns is fun.

8.  "Yule Shoot Your Eye Out" by Fall Out Boy.  If you haven't checked out "Can You See Santa From the Southside," now is the time to skedaddle over to Amazon and do so.

9.  "Baby, It's Cold Outside (Mulato Beat Remix)" by Louis Armstrong and Velma Middleton.  Shopko gave away a Christmas sampler in 2004 and this was on it.  This is, by far, the Christmas album that gets the most play in our house (not the one this song links to).  It comes on while we're preparing meals and we find ourselves breaking frequently to get our grooves on.  No kidding.  Six people from 2 to 35 shaking a leg in the kitchen.

10. "O Come All Ye Faithful" by Weezer.  Traditional Christmas tune done right by a modern band.

And some bonus songs from last year's mix:

Bonus 1:  "Fairytale of New York" by the Pogues.  This one is definitely not for the kids and is a guilty pleasure of mine.  Who can resist: "You're a bum, you're a punk / You're an old slut on junk."  Ahh, the holidays.

Bonus 2:  "Frosty the Snowman" by the Cocteau Twins.  Year after year, my favorite Frosty rendition.