More permanent stuff at http://www.dusbabek.org/~garyd

24 August 2013

Book Review: JavaScript: The Definitive Guide, 6th Ed.

TL;DR: This book is an excellent reference.

I decided to pick this book up a few years ago when I found out I'd be writing significant pieces of software using Node.js.  I had used JavaScript in the past for client-side work in the browser, so I was familiar with it, but didn't have the deep understanding I figured I'd need to be productive on a large project.

I have the PDF version of the book.

Part I contains chapters that go into a lot of detail about how JavaScript works.  This is where you can learn about types, JavaScript's functional nature and how to build classes.  This is excellent material if your goal is to understand language internals.

There is a chapter that isn't there though: "Crap to Watch Out For."  This would include information about IEEE 754 floating point, null vs undefined, truthy/falsy and so on.  Instead, that information is sprinkled around the book.  (More often than not, you figure it out on your own through sad experience.)

The chapter on server-side JavaScript was more or less pointless.  It was not a good primer for either Rhino or Node.js.

Part III contains an exhaustive reference of core JavaScript APIs.  I found this section indispensable.  Every method of every key class is documented.  Some include examples.  I found the cross-references (linked in the PDF) extremely helpful.

Parts II and IV were not useful to me since I was not doing client-side programming.

Overall, I'd rate this book positively.  It helped me figure things out and continues to serve as a reference.

Full Disclosure: This book was given to me by O'Reilly with the hope that I'd publish a review of it.  Other than the book, I have received no consideration from O'Reilly.

26 March 2012

Calculating Long Running Averages

Something I've been working on has me calculating averages over time on data arriving into a system.  I don't know beforehand how many pieces of data there will be, which makes the calculation difficult.  This operation needs to be done for both floating-point numbers and integers.

The floating point solution turned out to be fairly simple:

double average = 0d;
int count = 0;

// when a number comes in, do this:
average += (new_number - average) / ++count;

Here's how it works:  when a new number comes in, it pulls the average slightly higher or lower.  The size of that pull is the difference between the new number and the current average, divided by the new count.  Since count is always growing, numbers arriving later have far less influence on the average unless they are far away from it (as it should be).
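To make the update concrete, here is a minimal Java sketch that wraps the same one-line calculation in a small class.  The class and method names (RunningAverage, add, get) are made up for illustration; only the update expression comes from the snippet above.

```java
/** Hypothetical wrapper around the incremental floating-point average. */
final class RunningAverage {
    private double average = 0d;
    private long count = 0;

    /** Pulls the average toward newNumber by 1/count of the gap between them. */
    void add(double newNumber) {
        average += (newNumber - average) / ++count;
    }

    double get() {
        return average;
    }

    public static void main(String[] args) {
        RunningAverage avg = new RunningAverage();
        for (double x : new double[] {2, 4, 6, 8}) {
            avg.add(x);
        }
        System.out.println(avg.get()); // prints 5.0
    }
}
```

Nothing but the current average and the count is stored, which is the point: the individual values never need to be kept around.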

The integer solution looks different even though it is essentially the same thing.  What complicates the problem is that I want to avoid using floating point math if at all possible.  There are several different computations that would work, but here is the one that seemed to follow the true average as much as possible:

long average = 0, remainder = 0;
int count = 0;

// when a number comes in, do this:
count++;
long delta = new_number - average + remainder;  // gap from the current average, plus what was carried over
average += delta / count;
remainder = delta % count;

Here is how it works:  since I cannot add fractional pieces of a whole number as they arrive (as I did in the floating point example), those incremental chunks are tracked in remainder.  Eventually, if the inputs are fairly uniform, enough delta builds up in remainder to add or subtract a number from the average, or it happens right away if the new number is significantly higher or lower than the average.
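Here is a hedged Java sketch of an integer-only running average along these lines.  It assumes the adjustment is measured relative to the current average (new number minus average, plus the carried remainder); the class name IntegerRunningAverage and its accessors are hypothetical.  With that assumption, average * count + remainder always equals the exact sum of the inputs, which is easy to check.

```java
/** Hypothetical integer-only running average that carries a remainder. */
final class IntegerRunningAverage {
    private long average = 0;
    private long remainder = 0;
    private int count = 0;

    /** Folds one value into the average without any floating-point math. */
    void add(long newNumber) {
        count++;
        // Gap between the new value and the current average, plus whatever
        // earlier updates could not divide evenly.
        long delta = newNumber - average + remainder;
        average += delta / count;   // the whole part moves the average
        remainder = delta % count;  // the leftover waits for the next update
    }

    long average() { return average; }
    long remainder() { return remainder; }
    int count() { return count; }

    public static void main(String[] args) {
        IntegerRunningAverage avg = new IntegerRunningAverage();
        long sum = 0;
        for (long x : new long[] {10, 7, 9, 8, 12}) {
            avg.add(x);
            sum += x;
            // average * count + remainder reconstructs the exact sum
            System.out.println("average=" + avg.average()
                    + " remainder=" + avg.remainder() + " sum=" + sum);
        }
    }
}
```

Java's / truncates toward zero and % takes the sign of the dividend, so the remainder can go negative and the reported average can sit up to one unit away from the true average.  The sketch does not guard against overflow in delta for extreme inputs.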

I wanted to verify correctness of the algorithm, so I wrote a simulator.  I found that early on, the averages vary from the true average a bit, but this variance goes away as count increases.  To compensate for this, I keep track of the sum of all new_numbers and calculate average the normal way until either sum or count exceeds a predetermined threshold.
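One way the warm-up idea could look in Java is sketched below.  Everything here is an assumption for illustration: the class name WarmupAverage, the two threshold parameters, and the decision to seed the remainder from sum % count at the moment of switching.  The incremental branch measures the adjustment relative to the current average.

```java
/** Hypothetical hybrid: plain sum/count at first, incremental updates later. */
final class WarmupAverage {
    private final long sumThreshold;   // assumed cutoff on |sum|
    private final int countThreshold;  // assumed cutoff on count
    private long sum = 0;
    private long average = 0;
    private long remainder = 0;
    private int count = 0;
    private boolean incremental = false;

    WarmupAverage(long sumThreshold, int countThreshold) {
        this.sumThreshold = sumThreshold;
        this.countThreshold = countThreshold;
    }

    void add(long newNumber) {
        count++;
        if (!incremental) {
            // Warm-up phase: compute the average the normal way.
            sum += newNumber;
            average = sum / count;
            remainder = sum % count;  // seed for the incremental phase
            incremental = Math.abs(sum) >= sumThreshold || count >= countThreshold;
        } else {
            // Steady state: the sum is no longer tracked.
            long delta = newNumber - average + remainder;
            average += delta / count;
            remainder = delta % count;
        }
    }

    long average() { return average; }
    boolean isIncremental() { return incremental; }

    public static void main(String[] args) {
        WarmupAverage avg = new WarmupAverage(1_000_000L, 3);
        for (long x : new long[] {10, 20, 30, 40}) {
            avg.add(x);
            System.out.println(avg.average() + " incremental=" + avg.isIncremental());
        }
    }
}
```

The thresholds trade accuracy early on against the risk of sum growing without bound; good values depend on the data.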

19 April 2011

Thoughts on Rdio

Clearly, each of the services gets some things right.  I could go into detail about why I like Rdio and chose it over its competitors (Last.fm, Pandora and Slacker), but this post is mainly about why Rdio is right for me and what would make it better.

The Good
Playlists
I like Rdio mainly because it gives me flexible listening options:

  • It lets me have playlists with cherry-picked songs.
  • It can generate playlists based on an artist.
  • I can queue up entire albums.
  • I can listen to songs offline using my Android phone.
API
Although they didn't have it when I started my subscription, Rdio have added a public API, which is cool.  This means that developers will create interesting applications that use Rdio (unlike say, Pandora). I've played around with it a bit and it is pretty natural.

Discovery
I absolutely *love* that I can listen to entire albums at once.  Remember that band you heard that one song from once and you meant to go back and check them out?  Only, you had a terrible time finding a decent way to sample all their music and you weren't ready to make the commitment of actually paying for it.  (We have all made that mistake before.)  Rdio solves that problem: queue up their catalog and listen to everything for a day or two.  Then go buy, or not.

Room For Improvement
Offline Content
  • I wish the desktop application had offline mode like the mobile app.  
  • Managing offline media from the mobile application is cumbersome, especially if you have synced a lot of songs.  
  • Managing media from the desktop application is even more difficult.
  • It would be awesome if I could create a station based on an artist and then sync those songs (in much the same way I can create a playlist and sync those songs with one action).

Most of my grief with offline songs would be eliminated if there were a way to expire the offline content so I didn't have to manage it myself (like a DVR).  E.g.: keep this song for a [week, month, etc.], delete it after I play it [once, twice, three times, etc.]

Playlists
Rdio does not have "genius mode."  I would like to generate an awesome playlist based on a single song of my choosing (like iTunes).  They have artist-based playlists, but it isn't the same.  Theoretically, an enterprising programmer should be able to create software that exports an iTunes genius playlist (or any iTunes playlist) to an Rdio playlist.  Sounds like a fun weekend project.

I wish I could create multiple queues (e.g.: a "work" queue, a "home" queue, and a "quiet Sunday" queue).  I could use playlists for this but I cannot add entire albums to a playlist unless I add the songs one at a time.

The Rdio machine learning algorithms could take a lesson from Pandora or Last.fm.  They are not that good.

If you use queues and then go listen to another song, your currently playing queue item (an album) is lost.  This used to bite me a lot and annoyed me until I eventually changed my habits.  This could be resolved if Rdio just remembered where I was in the queue when I play something ad hoc.

Streaming
It takes too long to go from pushing the play button on my mobile player to actually hearing something.  If this is a problem of pushing bits, maybe Rdio could push some low-quality data (which doesn't require as much bandwidth) at first to get some data on the user's player quickly, then come in later with the full-quality audio.  If this is a latency problem, my bad.

Rating, Scrobbling and Data
It would be cool to scrobble love/hate from Rdio.  Currently I have to go to the Last.fm website to do it.  Also, it would be nice to rate songs (like iTunes) or tag them as I listen to them.  This is helpful for music discovery at times when I do not really want to be distracted by the process of copying down a song or artist name.

Something to Show For it All
This isn't a deal-breaker (none of it is, after all--I am paying for the service).  It saddens me that if I give Rdio $10 a month for the next 10 years and then stop, I will have nothing to show for it but the memories.  It would be great if I could keep some of the music permanently.

I realize that it is the same value proposition as cable TV and that I want to have my cake and eat it too.  But folks--this is music.  I am used to keeping what I pay for.

Summary
I like Rdio.  Out of all the players so far, it works best for me.  It has the most flexible listening options of any streaming music service, but I think there is room for improvement.  I hope they keep up the good work.

01 December 2010

Introducing Casserole

If you've done much work with Cassandra clusters, you've probably gotten very familiar with bin/nodetool, which is the command line utility for poking Cassandra nodes.  If you are like most people, you have probably developed a love/hate relationship with it (your -h and -p fingers get sore quickly).

Well, here is something else you can love and hate.  Casserole is a GUI tool that encapsulates some of the functionality of nodetool.  Right now, it primarily monitors clusters, but the groundwork is in for performing operations as well.

As the readme says: this tool currently sucks.  It hasn't been tested much and I've only worked on it at odd times over the last few weeks.  If you find bugs, please report them!  Even better, fix them and send me pull requests.

I've got branches that target 0.7-beta3 and 0.7-rc1 (yes, things are still changing too much IMO).  `ant run` should get you off the ground quickly.  Sorry: no 0.6 support at the moment.  My plan is to maintain Casserole as long as there is interest, or place it in cassandra/contrib if there is interest for that.

29 September 2010

RESTful Cassandra

A lot of people, when first learning about Cassandra, wonder why there isn't an easier (say, RESTful) way to perform operations.  I did.  It didn't take someone very long to point out that it mainly has to do with performance.  Cassandra spends a significant amount of resources marshaling data and Thrift currently does that very efficiently.

So I put away my RESTlessness.

I've heard more people lately clamoring for the feature, so I gave some thought to how I'd go about it.  One approach would be to wrap Thrift.  That would be nice from a coupling standpoint, but I think performance would be pretty crappy.  After all, it is just adding another layer of marshaling that needs to be done; nobody needs that.

I eventually arrived at the decision that an HTTP Cassandra Daemon/Server pair similar to the existing Avro and Thrift versions would do the trick.  It would basically be a straight port, with a few minor caveats.  One big thing is that HTTP is stateless and may use a new connection for each request, so storing Cassandra session information in threadlocals goes out the window.  This means that authentication needs to be abandoned, done with every request, or we need to use HTTP sessions.  Punt.

Today, while half-listening to some lectures at ICOODB I decided to see how hard it would be to throw something together.  I ended up with two classes containing stripped down implementations of get and set.  I pushed the whole thing to GitHub if anybody is interested.  I used the built-in Sun HTTP service because I didn't want any extra dependencies and building services on top of it is pretty straightforward.  Results are returned in JSON format that should match what you would see if you used sstable2json to export data.
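
For readers who have not used the JDK's built-in HTTP server (com.sun.net.httpserver), this is roughly what a minimal service on top of it looks like.  It is only a sketch: the class name TinyHttpDaemon is made up, and the handler just echoes the request path back as JSON.  It does not talk to Cassandra and is not the code that was pushed to GitHub.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for an HTTP daemon; echoes the path as JSON. */
final class TinyHttpDaemon {

    /** Starts a server on 127.0.0.1; pass port 0 to let the OS choose one. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/get", exchange -> {
            String path = exchange.getRequestURI().getPath();
            // Minimal escaping so the path cannot break out of the JSON string.
            String safe = path.replace("\\", "\\\\").replace("\"", "\\\"");
            byte[] body = ("{\"path\": \"" + safe + "\"}").getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(9160);
        System.out.println("Listening on " + server.getAddress());
    }
}
```

A real handler would replace the echo with a lookup and would need to settle the session and authentication questions raised above.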

This is clearly a proof of concept, but I think it demonstrates that the idea is sound and could be implemented fairly quickly.  Maintenance would be another story.  One problem with maintaining the Cassandra Avro bindings is that they regularly get out of sync with what is possible using Thrift.  An HTTP Cassandra wrapper would suffer the same fate without an active champion.  I'm interested, but I'm not *that* interested.

Anyway, have fun.

-- DETAILS --
The following URI formats are expected:
/get/keyspace/column_family/row_id/super_column/column_start/column_end/consistency_level.  If you don't want to pass a component in, leave it blank.  Empty strings are interpreted as null when appropriate.

Here is an example from my tests:  http://127.0.0.1:9160/get/Keyspace1/Standard1/1//10/11/ONE

The main thing to deduce here is that the super column is empty (see the double slash?).  If you haven't realized by now, I've gone ahead with the assumption that your keys and column names are strings.  This isn't good enough.  All the details we need to become type-aware are available in the comparator for the column family.  As a shortcut for now, you can append "?asString" to the end of the URI to have all byte[] values converted to strings.  Without it they are displayed as hex.

Updating works the same way: /set/keyspace/column_family/row_key/super_column/column/value/consistency_level

e.g.: http://127.0.0.1:9160/set/Keyspace1/Standard1/1//11/aaa/ONE
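
To show how a path in the /get format can be taken apart, here is a small Java sketch.  The class GetRequest and its field names are hypothetical, and treating a blank super_column, column_start or column_end as null is an assumption about what "when appropriate" means.

```java
/** Hypothetical holder for the components of a /get path. */
final class GetRequest {
    final String keyspace;
    final String columnFamily;
    final String rowId;
    final String superColumn;       // null when left blank
    final String columnStart;       // null when left blank
    final String columnEnd;         // null when left blank
    final String consistencyLevel;

    private GetRequest(String[] p) {
        keyspace = p[2];
        columnFamily = p[3];
        rowId = p[4];
        superColumn = emptyToNull(p[5]);
        columnStart = emptyToNull(p[6]);
        columnEnd = emptyToNull(p[7]);
        consistencyLevel = p[8];
    }

    /** Parses the path portion of the URI (no scheme, host or query string). */
    static GetRequest parse(String path) {
        // A limit of -1 keeps empty segments, so "//" yields an empty string.
        String[] parts = path.split("/", -1);
        // parts[0] is the empty string before the leading slash.
        if (parts.length != 9 || !parts[0].isEmpty() || !parts[1].equals("get")) {
            throw new IllegalArgumentException("unexpected path: " + path);
        }
        return new GetRequest(parts);
    }

    private static String emptyToNull(String s) {
        return s.isEmpty() ? null : s;
    }

    public static void main(String[] args) {
        GetRequest r = parse("/get/Keyspace1/Standard1/1//10/11/ONE");
        System.out.println(r.keyspace + " " + r.columnFamily + " row=" + r.rowId
                + " super=" + r.superColumn + " " + r.columnStart + ".." + r.columnEnd
                + " " + r.consistencyLevel);
    }
}
```

A /set path could be handled the same way with its own list of components.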


UPDATE: I went ahead and created two additional implementations that use Jetty (bare and with servlets).  This generated a bit more code, but opens up the way to getting sophisticated with sessions.

11 March 2010

Running Multiple Cassandra Nodes on a Single Host

One of the first Cassandra tickets I worked on had me reviewing some code that visualized the node ring.  Properly testing the code required that I run a cluster. 

But I didn't have access to a cluster. Neither did I feel like creating a virtual cluster by building a VM and cloning it several times.  What I wanted was to run several instances of Cassandra on a single machine with multiple interfaces, all pointed at the same compiled code (without multiple svn checkouts).

The Cassandra wiki explains how to tweak Cassandra settings by editing cassandra.in.sh, but doesn't explain what needs to be done to run concurrent instances.

It turned out not to be too difficult.  I figured it might be daunting enough to Cassandra noobs (of whom we're seeing more lately due to some great exposure), that a blog post might be helpful.

This tutorial assumes that you'll want to run multiple instances of Cassandra on code built by ant and not a standalone jar.  I am also assuming that you are a) just playing around, or b) intend to do some development.  This is not a tutorial explaining how Cassandra should be run in production.

Note: I apologize for the way this looks.  Blogger is not a friend of ordered lists.

  1. Make sure you've got aliases to localhost (e.g.: 127.0.0.2, 127.0.0.3, etc.).  Mac OS X doesn't have this enabled by default, so you'll have to manually create aliases:

    sudo ifconfig lo0 alias 127.0.0.2 up
    sudo ifconfig lo0 alias 127.0.0.3 up
  2. Decide where you're going to keep things.  You can keep them with your code, but that just isn't neat.  Pick a directory somewhere, call it $cass_stuff.
  3. Then, for each node in your little cluster, do this:

    1. From your svn checkout, copy the conf directory into $cass_stuff.  You can rename it to something like conf0 (or conf1, etc.).  I'll assume $conf from here on out.
    2. Copy bin/cassandra.in.sh to $cass_stuff.  Give it a name that helps you associate it with the conf directory you just created (node0.in.sh or whatever).
    3. Open node0.in.sh in an editor and make the following changes:

      1. Hardcode cassandra_home to the location of your trunk.  This will give you the flexibility to run Cassandra from anywhere.
      2. Set CASSANDRA_CONF to the conf directory you just created.
      3. In the JVM_OPTS change the jdwp address= setting.  The default is 8888, but you should include the unique IP you chose for this node along with the port, e.g.: 127.0.0.2:8888.  Not specifying a host causes the debugger to bind to 0.0.0.0:8888 and you'll have port binding problems when you bring up more than one node.
      4. Pick a unique port for com.sun.management.jmxremote.port, but make sure you have at least one node listening on 8080 since all the Cassandra tools assume JMX is listening there.  Unfortunately, you can't pick the JMX host; 0.0.0.0 is assumed.  I was under the impression this could be changed by specifying java.rmi.server.hostname, but had no luck going down that road.  (Please leave a comment if you figure out a way for this to work, but I think it might be hopeless.)
    4. Open $cass_stuff/$conf/storage-conf.xml in an editor and make the following changes:

      1. Specify unique locations for CommitLogDirectory and DataFileDirectory.  Don't bother with CalloutLocation or StagingFileDirectory.
      2. Replace ListenAddress with the IP you chose for this node (e.g.: 127.0.0.2).
      3. Replace RPCAddress with the same IP.

To run, you may wish to use another script for each node:

#!/bin/sh
CASSANDRA_INCLUDE=$cass_stuff/
export CASSANDRA_INCLUDE
cd
bin/cassandra -f

One downside to this approach is that if you're tracking trunk, it is your responsibility to make sure you notice changes to the default storage-conf.xml and cassandra.in.sh and apply them to your environments.


Cassandra is supported by an active and welcoming community.  If you'd like to learn more about the project, check out our wiki, mailing list or hop on #cassandra on freenode.

15 December 2009

Dear Entrepreneurs, this is something I would pay for...

Dear Entrepreneurs,

This is something I would gladly pay $20 a month for...

A device that, according to my tastes, downloads new music from the Internet whenever it connects.  I would be able to listen to music without restriction while I am disconnected from the network.  I wouldn't own the music, except for roughly 20 tracks a month that I select which would then become mine as MP3s (or FLAC or whatever DRM-less technology makes sense).  I could then load them into iTunes, give them to my brother, or (if I'm feeling sinister) make them available on a P2P network.

The music could come from anywhere: iTunes, Amazon, The Labels, or artists themselves.

The content sources exist.  The recommendation engines exist.  Devices exist. 

I suspect the audience/market exists.  (At least, I hope so.  If not, and nobody is willing to pay for music, we're going to need to find another model.  And it will still necessarily involve a money exchange between producers and consumers and/or advertisers.)

Is there such a system already?