Planning

Sometimes I wonder why I bother…

Bayesian Classification on Rails

A project I've been working on watches Twitter for some search keywords, with the goal of finding new customers, jobs, items for sale, etc. For example, a computer repair shop might want to watch for the keywords "laptop" and "broken", and then reply to tweets where they think they can help.

But as anyone who uses Twitter can attest, even with some very specific search terms, language filtering and geocoding, there is going to be a lot of white noise. I decided to take this one step further.

Bayesian classification (your garden-variety spam filter) is quite easy in Ruby, thanks to ruby-stemmer and the excellent classifier gem. The canonical example:

require 'classifier'
b = Classifier::Bayes.new :categories => ['Interesting', 'Uninteresting']
b.train_interesting "here are some good words. I hope you love them"
b.train_uninteresting "here are some bad words, I hate you"
b.classify "I hate bad words and you" # returns 'Uninteresting'
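
To make it clearer what happens inside a classifier like this, here is a toy naive Bayes sketch in plain Ruby. TinyBayes is a hypothetical name and is not how the classifier gem is implemented; it assumes equal priors for every category, uses add-one smoothing, and does no stemming.

```ruby
# Toy naive Bayes classifier (hypothetical; not the classifier gem's code).
class TinyBayes
  def initialize(*categories)
    # Word counts per category, e.g. { "Interesting" => { "good" => 1 } }
    @counts = Hash[categories.map { |c| [c, Hash.new(0)] }]
  end

  def train(category, text)
    tokenize(text).each { |word| @counts[category][word] += 1 }
  end

  # Pick the category with the highest summed log-probability.
  def classify(text)
    vocabulary = @counts.values.map(&:keys).flatten.uniq.size
    scores = @counts.map do |category, words|
      total = words.values.inject(0) { |sum, n| sum + n }
      score = tokenize(text).inject(0.0) do |sum, word|
        # Add-one smoothing keeps unseen words from zeroing out a category
        sum + Math.log((words[word] + 1).to_f / (total + vocabulary))
      end
      [category, score]
    end
    scores.max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

b = TinyBayes.new('Interesting', 'Uninteresting')
b.train 'Interesting', "here are some good words. I hope you love them"
b.train 'Uninteresting', "here are some bad words, I hate you"
puts b.classify("I hate bad words and you") # prints Uninteresting
```

The real gem also stems words and skips common ones, so its scores will differ from this sketch.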

Of course, if you're implementing this in a Rails application, chances are you want the classifier to learn from real data over time. In my case, I want it to learn that a tweet is uninteresting when I delete it, and I want it to learn that a tweet is interesting when I visit the Tweet#show action.

It seems the usual method is to marshal the classifier object with madeleine, which creates a new snapshot file each time you train it. This is both easy and fast, but we're going to end up with thousands or millions of snapshot files in no time flat. Additionally, all bets are off if we have a few users who are really into cheap viagra. We need to give each User his own classifier and let him train it over time.

First, let's set up our environment. Grab the latest ruby-stemmer and classifier gems from GitHub, and build them from source. I recommend this because the gem versions I got on my first try were way out of date and quite broken, and because you'll need a classifier fork with my remove_stemmer method to marshal your classifiers using ActiveRecord.

$ git clone git://github.com/aurelian/ruby-stemmer.git
$ cd ruby-stemmer
$ rake compile
$ sudo rake install
$ cd ..

$ git clone git://github.com/logankoester/classifier.git
$ cd classifier
$ sudo rake install

$ sudo gem install twitter

Generate a fresh rails app if you want to follow along.

$ rails classifier_rails_example
$ cd classifier_rails_example
$ script/generate resource user id:integer classifier:text
$ script/generate resource keyword id:integer user_id:integer text:string
$ script/generate resource tweet id:integer keyword_id:integer user_id:integer text:string read:boolean interesting:boolean
$ script/generate migration ChangeClassifierDefaults

Alternatively, you can clone the code from this tutorial with:

$ git clone git@github.com:logankoester/classifier_rails_example.git

Now edit the migration you just created to look like this:

class ChangeClassifierDefaults < ActiveRecord::Migration
  def self.up
    change_column :tweets, :interesting, :boolean, :default => false
    change_column :tweets, :read, :boolean, :default => false
    change_column :keywords, :text, :string, :default => ""
  end

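  def self.down
    # change_column requires the column type; restore the original nil defaults
    change_column :keywords, :text, :string, :default => nil
    change_column :tweets, :read, :boolean, :default => nil
    change_column :tweets, :interesting, :boolean, :default => nil
  end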
end

…and run it

$ rake db:migrate

Open your config/environment.rb file, and add the following gems to the Initializer block.

config.gem 'ruby-stemmer', :lib => 'lingua/stemmer'
config.gem 'luisparravicini-classifier', :lib => 'classifier'
config.gem 'twitter'

Now we can use ActiveRecord's built-in YAML serialization to store the classifier.

class User < ActiveRecord::Base
  has_many :tweets
  has_many :keywords

  serialize     :classifier, Classifier::Bayes
  before_create :initialize_classifier
  before_update :remove_stemmer

private

  def initialize_classifier
    self.classifier = Classifier::Bayes.new(
      :categories => ['Interesting', 'Uninteresting']
    )
    remove_stemmer
  end

  def remove_stemmer
    self.classifier.remove_stemmer
  end
end

The remove_stemmer method requires a little explanation. When a Classifier is initialized, it also creates a Stemmer object to use, which ordinarily gets marshalled along with its Classifier. But when demarshalled later, the Stemmer object (which is really just a C extension) will get caught with its shorts down, and either throw an error like "Stemmer is not initialized" or, in older versions, simply segfault your Rails environment!

The solution is simple; my fork implements a remove_stemmer method on Classifier::Base, which will force the stemmer to be reinitialized the next time it is needed. Call this method before you marshal your classifier, and your troubles will melt away.
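
Here is a rough analogy in plain Ruby for why the stemmer has to go before marshalling. SketchClassifier is hypothetical, and a lambda stands in for the C-extension stemmer. The failure is different (Marshal refuses to dump a lambda at all, rather than restoring a broken object), but the cure is the same: drop the helper before dumping and rebuild it lazily when it is next needed.

```ruby
# Hypothetical stand-in for a classifier that owns an unmarshallable helper.
class SketchClassifier
  def initialize
    @counts  = {}
    @stemmer = nil
  end

  def train(text)
    text.split.each do |word|
      root = stemmer.call(word)
      @counts[root] = @counts.fetch(root, 0) + 1
    end
  end

  def count(word)
    @counts.fetch(stemmer.call(word), 0)
  end

  # Forget the helper so only plain data is left to serialize
  def remove_stemmer
    @stemmer = nil
  end

  private

  # Rebuilt on demand; a crude suffix-stripper playing the part of the real stemmer
  def stemmer
    @stemmer ||= lambda { |word| word.downcase.sub(/(ing|s)\z/, '') }
  end
end

c = SketchClassifier.new
c.train("robots building robots")
c.remove_stemmer # without this, Marshal.dump raises TypeError
restored = Marshal.load(Marshal.dump(c))
puts restored.count("robot") # prints 2
```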

Moving on to the Tweet model, we want to classify each tweet when it is created.

class Tweet < ActiveRecord::Base
  belongs_to :user
  belongs_to :keyword

  before_save :classify
 
  def classify
    # Escape the keyword so any regex metacharacters in it match literally
    text = self.text.gsub(/#{Regexp.escape(self.keyword.text)}/, '')
    if self.user.classifier.classify(text) == 'Interesting'
      self.interesting = true
    end
  end
end

Of course, we don't want to throw off the results by including a word which is going to occur in every tweet, so we remove the search term from the text prior to classification.
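
The keyword stripping can be tried on its own outside Rails. This is a minimal sketch, assuming a hypothetical helper named strip_keyword; unlike the model code it also ignores case and tidies up the leftover whitespace.

```ruby
# Hypothetical helper: remove every occurrence of the search keyword from a tweet.
def strip_keyword(text, keyword)
  # Regexp.escape makes characters like "+" or "." match literally
  pattern = /#{Regexp.escape(keyword)}/i
  text.gsub(pattern, '').squeeze(' ').strip
end

puts strip_keyword("My laptop is broken, any Laptop repair shops?", "laptop")
# prints: My is broken, any repair shops?
```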

Add a little method to your Keyword model to grab new tweets from the Twitter Search API:

class Keyword < ActiveRecord::Base
  belongs_to :user
  has_many   :tweets, :dependent => :destroy

  after_save :search

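  def search
    search = Twitter::Search.new(self.text).fetch
    search.results.each do |r|
      # create already saves the record, so no separate save call is needed
      Tweet.create(
        :keyword => self,
        :user    => self.user, # assign the association, not a User object to user_id
        :text    => r.text
      )
    end
  end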
end

Almost done! Now we need to train our sweet new classifier. I've opted to do this entirely from the controller, so that messing around in the console won't inadvertently have an impact on the machine's learning. We also want to mark the tweet in question as already read, so that the lesson is only learned once.

class TweetsController < ApplicationController

  def show
    @tweet = Tweet.find(params[:id])
    unless @tweet.read?
      current_user.classifier.train_interesting(
        @tweet.text.gsub(/#{Regexp.escape(@tweet.keyword.text)}/, '')
      )
      current_user.save
      @tweet.read = true
      @tweet.save
    end
  end

  def destroy
    if @tweet = Tweet.find(params[:id])
      if @tweet.destroy
        current_user.classifier.train_uninteresting(
          @tweet.text.gsub(/#{Regexp.escape(@tweet.keyword.text)}/, '')
        )
        current_user.save
      end
    end
  end

end

And there you have it… a simple machine learning solution for extracting awesome tweets. Let's try it out!

Fire up a script/console session.

Loading development environment (Rails 2.3.5)
>> u = User.create
=> #<User id: 1, classifier: #<Classifier::Bayes:0xb64f2354 @categories={:Uninteresting=>{}, :Interesting=>{}}, @total_words=0, @stemmer=nil, @options={:encoding=>"UTF_8", :categories=>["Interesting", "Uninteresting"], :language=>"en"}>, created_at: "2010-01-26 22:17:19", updated_at: "2010-01-26 22:17:19">

As you can see, our new user has a Bayesian Classifier waiting around to learn what kind of tweets he likes.

>> u.keywords.create(:text => "robots")
=> #<Keyword id: 1, user_id: 1, text: "robots", created_at: "2010-01-26 22:20:55", updated_at: "2010-01-26 22:20:55">
>> Tweet.all.size
=> 15

You can use the following one-liners from script/console to play around with the training:

Keyword.all.each {|k| k.search} # Rerun all searches to grab and classify more results
Tweet.all.each {|t| u.classifier.train_interesting(t.text) if t.text.downcase.include? "cyborgs" } # Any tweet with the word "cyborgs" is interesting
Tweet.all.each {|t| u.classifier.train_uninteresting(t.text) if t.text.downcase.include? "discount" } # Any tweet with the word "discount" is uninteresting
Tweet.find_all_by_interesting(true).each { |t| pp t.text }.size # Print the interesting tweets and count them
Tweet.all.each {|t| t.classify } # Rerun the classification on every tweet

Of course, this technique can be applied to sorting pretty much any kind of text. Interesting/uninteresting tweets are just one example from my life. Start hacking!

A Simple Auto-Follow script for Twitter

This morning I came across a forum post containing a fairly large list of people I wanted to follow from one of my Twitter accounts. There are a lot of auto-follow tools out there, but most of them are spammy "viral marketing" nonsense that want to store my Twitter password on their servers, and I don't trust them. Here is a simple alternative written in Ruby.

Fun with ion3

Ion™ is a tiling tabbed window manager designed with keyboard users in mind.

In recent years I've been a GNOME / Compiz guy, but while I've enjoyed its tight integration with Ubuntu and flashy effects, I've always missed the simplicity of so-called minimalist window managers, mainly fvwm. These days, however, practically everything I do happens inside Firefox, gvim, or gnome-terminal.

I want keyboard-driven. I want scriptable. And I don't want windows hiding behind other windows. Ever.

Enter ion3. I've only been using it for the last 24 hours, and even though I haven't memorized all of the keymaps, or learned how to code in Lua (yet!), I already love it. So far the only problem I've not been able to overcome is a bug in the latest Adobe Flash that breaks fullscreen video. This isn't specific to ion3 - it's a problem with any focus-follows-mouse system. I hear there is a workaround, but it didn't seem to work for me. I consider it a microscopic trade-off for such an efficient window manager. Many of my previously sluggish applications now run incredibly fast, and with a couple of days of practice I'll be working faster too.

Installation

$ sudo apt-get install ion3 ion3-dev ion3-scripts ion3-doc

Now just log out, choose ion3 and start a new session. The first time you log in you'll be greeted with the man page, which I highly suggest reading. If you try not to "cheat" by using the mouse, you'll pick up almost everything in a couple of hours, and from there you'll find yourself navigating faster and faster until you don't have to think about it at all. Just like vim.

Ion is both simple and well-documented, so it would be pointless for me to write an introductory tutorial. Instead, here are a couple of tricks I've discovered.

Modifying your configuration

One of the first things you're going to want to do when you're done messing around is change a few settings. For the most part, this is done in a file called cfg_ion.lua. Copy the system-wide file (I found mine at /etc/X11/ion3/cfg_ion.lua) to ~/.ion3/cfg_ion.lua and open it with a text editor.

$ mkdir -p ~/.ion3
$ cp `locate cfg_ion.lua | head -1` ~/.ion3/cfg_ion.lua
$ gvim ~/.ion3/cfg_ion.lua

You'll need to restart Ion for your changes to take effect. Don't worry, all your applications will stay open; only the window manager needs to be restarted. Hit F12 and type session/restart.

I messed this file up a few times experimenting, and I'll probably mess it up a few more. If you screw up this file like I did, your F12 shortcut can disappear, and you'll need another way to restart Ion after you've fixed it. Keep a terminal open whenever you're editing, because you may not be able to launch one. The trick to restart Ion from the console is simple:

$ ps -e | grep ion3 # 21108 ?        00:00:16 ion3
$ kill -USR1 21108

Remapping Mod1

The Mod1 key is used to initiate most interactions with Ion. On most systems, this is Alt. This is usually a very bad choice, because a lot of other applications need the Alt key for other things. I tried the Flying Window key, but it turns out it's in a very uncomfortable place on the keyboard. The number keys are used a lot. Try reaching Win+6, and you'll see what I mean. CapsLock has been working great for me, and as an added bonus, makes it much more work to shout on IRC.

Check your keymaps with xmodmap -pm. On my system, Mod3 was unused, so I remapped CapsLock to that.

Edit (or create) ~/.Xmodmaprc and insert these lines at the bottom…

remove Lock = Caps_Lock
add Mod3 = Caps_Lock

Then run it…

/usr/bin/X11/xmodmap ~/.Xmodmaprc

Also add this line to ~/.Xsession so it is run automatically whenever you start X.

If your xmodmap -pm now reads…

mod3        Caps_Lock (0x42)

then you're in luck! Now you just need to edit the META variable near the top of your cfg_ion.lua to reflect the change

META="Mod3+"

and restart.

All done! I hope you enjoy learning and using Ion3 as much as I have. I don't think I'll be switching again any time soon.

Twitter account hacked

First off, an apology to those who were annoyed all day today by Twitter spam: I'm sorry, I was asleep (I write better code at night) and didn't know what was going on until I woke up this evening. Not my favourite thing to wake up to!

I do a lot of freelance work with the Twitter API, and someone asked me for a quote on a clone of a Ponzi-style follower train site. Not the kind of app I build, but it looked slightly more legit than usual, and I was foolish enough to log in to have a look around. That's where the trouble started. The site in question used the pre-OAuth login (as unfortunately most sites still do), which requires giving up your Twitter screen name and password.

As soon as they had it I was locked out of my account due to "too many incorrect password attempts", and didn't recover the account until tonight. The password has been changed and the spam in question deleted. I've learned my lesson, and will only use OAuth credentials in the future - you won't receive any more spam from this account.

Dream

I had the strangest dream this morning.

So I'm walking through a grocery store with an umbrella, and randomly I'm approached by a drunk Julia Deakin, who is yelling and slurring at me incomprehensibly. I look down to see what she's pointing at, and five black cats jump out of my umbrella. The topic drifts abruptly from my cats to the Palm Pre, and she offers me a job on the register. Which is nice of her, but she clearly doesn't work in the store.

If you can figure out the symbolism behind this one, the men in white coats should be there any minute.

Human Readable Text Compression

As a Web Service

TweetShrink, a web service from TRNSFR, uses a database of common instant / text messaging abbreviations to reduce the number of characters in a tweet. It's essentially a human-readable compression algorithm. For example, "Some text to shrink" becomes "sum text 2 shrnk" when passed through their API.

But it doesn't enforce Twitter's 140 character limit, which means it can be used beyond Twitter for whatever you like. Back in March I released the tweetshrink gem for Ruby, and today I've updated it to 0.2 which includes a command line interface.
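
To give a feel for the idea, here is a minimal local sketch of dictionary-based shrinking in plain Ruby. It does not call the TweetShrink API or use the tweetshrink gem; the ABBREVIATIONS table and the shrink method are hypothetical, with three entries chosen only to reproduce the example above.

```ruby
# Hypothetical abbreviation table; the real service has its own, much larger database.
ABBREVIATIONS = {
  "some"   => "sum",
  "to"     => "2",
  "shrink" => "shrnk"
}

# Replace each known word with its abbreviation and leave everything else alone
def shrink(text)
  text.gsub(/\w+/) { |word| ABBREVIATIONS.fetch(word.downcase, word) }
end

puts shrink("Some text to shrink") # prints: sum text 2 shrnk
```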

From the command line

First, make sure you have Ruby and Rubygems installed. On Debian-based operating systems (such as Ubuntu), this goes a little something like

$ sudo apt-get install ruby rubygems

Now install the gem from its GitHub repository:

$ sudo gem sources -a http://gems.github.com # (only need to do this once)
$ sudo gem install logankoester-tweetshrink

You can use it from the command line like this:

$ echo "Some text to shrink" | tweetshrink
# Or with a file…
$ tweetshrink ./file_to_shrink.txt

From vim

Or, you can integrate it with vim for ultimate text shrinking convenience. Just add the following to your .vimrc:

""""""""""""""""""""""""""""""""""
" Tweetshrink text filter (:tws) "
""""""""""""""""""""""""""""""""""
autocmd BufEnter * vmap ,tws !tweetshrink
autocmd BufEnter * nmap ,tws !!tweetshrink

Now you can shrink a single line by hitting ,tws in Normal mode, or shrink a whole visual block by selecting it and hitting ,tws in Visual mode.

Of course, this is just as easy to integrate with your favorite text editor; I just happen to use vim.

On the Web

When I integrated this feature with my blog & tweet scheduler PingLater.fm, I realized TweetShrink didn't have a favicon. I needed an icon to use for the button, so I created these - feel free to use them for whatever.

PingLater.fm

I've spent the last couple of days working hard on a new app for managing your web presence.

There is a service called Ping.fm for broadcasting updates to your blogs, Twitter, Facebook, and other social networks all at once, from one place.

This can save you a lot of time if you're trying to manage a brand or keep up with different groups of friends, and it makes it easier to prevent your presence on these sites from becoming stale.

The application I'm calling PingLater.fm takes it one step further. Now you can set up pings to be sent at a specified time in the future. You could schedule a product highlight for each day of the month, release new blog posts while you're off on vacation, or whatever else you want to use a service like this for.

It's free for now while I gather feedback and optimize the code, but free users will eventually be limited to 3 pings scheduled at a time.

I have a number of premium features in mind (RSS posting, image/video, iPhone…) to make it a really indispensable tool for pro bloggers and internet marketing people.

But we'll get to that. For now, I just want to hear from you. Let me know what I can do to make this useful for you!

Click here to try it out!

Amazon EC2 Cheatsheet

I use Amazon EC2 every day and yet I always forget how to use their command-line tools. Here are a few common scenarios I run into, and their solutions.

Okay, just kidding, there's only one. I'm planning on editing this post over time :-)

Bundling an AMI from a running instance

  1. Use scp to copy your private key (pk-*.pem) and certificate (cert-*.pem) to root@yourami:/mnt
  2. Log in as root and bundle the volume
    $ ec2-bundle-vol -d /mnt -k /mnt/pk-*.pem --cert /mnt/cert-*.pem -u YOUR_AWS_ACCOUNT_ID -s 10240

    Now you have several minutes to kill.

  3. Upload the image to Amazon S3. You may want to do this inside of screen; I had my ssh session time out on me while it was working a couple of times.
    $ s3cmd ls # List all S3 buckets
    $ ec2-upload-bundle -b YOUR_S3_BUCKET -m /mnt/image.manifest.xml -a YOUR_ACCESS_KEY -s YOUR_SECRET_ACCESS_KEY
  4. Register the AMI. This is something you need to do even when updating an image that has already been registered.
    $ ec2-register YOUR_S3_BUCKET/image.manifest.xml

    This will return an AMI identifier that can be used to run a new instance.

    $ ec2-run-instances YOUR_AMI_IDENTIFIER

More information on Creating an Image at Amazon


Deploying Sinatra to a sub-URI using Passenger

It's not hard, but it turns out there's a trick to it. I've run into this problem twice now, so I figure it should be documented. This is the solution if your "/" route is resulting in "Not Found" or an Apache directory listing.

You can read more about the problem at Ardekantur's "Phusion, Rack, Sinatra, and sub-domains", but here's my quick solution:

  1. Disable mod_autoindex if it is enabled.
  2. Make sure your RackBaseURI does not have a trailing slash.
  3. Add this before filter to your Sinatra app:
    before do
      request.env['PATH_INFO'] = '/' if request.env['PATH_INFO'].empty?
    end

I suppose an alternative solution would be feasible using Rack middleware, but this is what I'm using. Thanks to Ryan Funduk for helping me figure this stuff out.
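
For the curious, the Rack middleware alternative might look something like this. It is an untested-in-production sketch, and EmptyPathFix is a hypothetical name; it applies the same PATH_INFO fix before the request ever reaches Sinatra.

```ruby
# Hypothetical Rack middleware: treat an empty PATH_INFO as the root path.
class EmptyPathFix
  def initialize(app)
    @app = app
  end

  def call(env)
    env['PATH_INFO'] = '/' if env['PATH_INFO'].to_s.empty?
    @app.call(env)
  end
end

# Exercise it with a stand-in Rack app that echoes the path it was given
echo = lambda { |env| [200, { 'Content-Type' => 'text/plain' }, [env['PATH_INFO']]] }
status, headers, body = EmptyPathFix.new(echo).call('PATH_INFO' => '')
puts body.first # prints /
```

In a Sinatra app this would be enabled with a `use EmptyPathFix` line, in place of the before filter.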