Bayesian Classification on Rails

A project I've been working on watches Twitter for some search keywords, with the goal of finding new customers, jobs, items for sale, etc. For example, a computer repair shop might want to watch for the keywords "laptop" and "broken", and then reply to tweets where they think they can help.

But as anyone who uses Twitter can attest, even with very specific search terms, language filtering and geocoding, there is going to be a lot of noise. I decided to take this one step further and have the application learn which tweets are actually worth reading.

Bayesian classification (the technique behind your garden-variety spam filter) in Ruby is quite easy, thanks to ruby-stemmer and the excellent classifier gem. The canonical example:

require 'classifier'
b = Classifier::Bayes.new :categories => ['Interesting', 'Uninteresting']
b.train_interesting "here are some good words. I hope you love them"
b.train_uninteresting "here are some bad words, I hate you"
b.classify "I hate bad words and you" # returns 'Uninteresting'
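
To give a feel for what happens behind those three calls, here is a minimal sketch of a naive Bayes text classifier in plain Ruby. It is not the classifier gem's implementation: TinyBayes and its methods are names made up for this illustration, it does no stemming, and it assumes equal priors for both categories with add-one smoothing.

```ruby
# A toy naive Bayes text classifier, for illustration only.
# TinyBayes is a hypothetical name, not part of the classifier gem.
class TinyBayes
  def initialize(*categories)
    # category name => { word => count }
    @counts = {}
    categories.each { |c| @counts[c] = Hash.new(0) }
  end

  # Record every word of +text+ as evidence for +category+.
  def train(category, text)
    tokenize(text).each { |word| @counts[category][word] += 1 }
  end

  # Return the category with the highest log-probability score.
  def classify(text)
    words = tokenize(text)
    @counts.keys.max_by { |category| score(category, words) }
  end

  private

  # Lowercase the text and keep runs of letters and apostrophes.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end

  # Sum of log P(word | category) with add-one smoothing, so unseen
  # words lower the score instead of zeroing it out.
  def score(category, words)
    total = @counts[category].values.inject(0) { |sum, n| sum + n }
    vocab = @counts.values.map(&:keys).flatten.uniq.size
    words.inject(0.0) do |sum, word|
      sum + Math.log((@counts[category][word] + 1.0) / (total + vocab))
    end
  end
end

b = TinyBayes.new('Interesting', 'Uninteresting')
b.train('Interesting',   'here are some good words. I hope you love them')
b.train('Uninteresting', 'here are some bad words, I hate you')
puts b.classify('I hate bad words and you') # prints "Uninteresting"
```

The gem adds stemming and stop-word handling on top of this basic idea, which is why ruby-stemmer is needed.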

Of course, if you're implementing this in a Rails application, chances are you want the classifier to learn from real data over time. In my case, I want it to learn that a tweet is uninteresting when I delete it, and I want it to learn that a tweet is interesting when I visit the Tweet#show action.

It seems the usual method is to marshal the classifier object with madeleine, which creates a new snapshot file each time you train it. This is both easy and fast, but we're going to end up with thousands or millions of snapshot files in no time flat. Additionally, all bets are off if we have a few users who are really into cheap viagra. We need to give each User his own classifier and let him train it over time.

First, let's set up our environment. Grab the latest ruby-stemmer and classifier gems from GitHub and build them from source. I recommend this because the gem versions I got on my first try were badly out of date and quite broken, and because you'll need a classifier fork with my remove_stemmer method to marshal your classifiers using ActiveRecord. Building ruby-stemmer also requires rake-compiler, which it does not seem to list as a dependency, so install that gem first if you don't already have it.

$ git clone git://github.com/aurelian/ruby-stemmer.git
$ cd ruby-stemmer
$ rake compile
$ sudo rake install
$ cd ..

$ git clone git://github.com/logankoester/classifier.git
$ cd classifier
$ sudo rake install

$ sudo gem install twitter

Generate a fresh rails app if you want to follow along.

$ rails classifier_rails_example
$ cd classifier_rails_example
$ script/generate resource user classifier:text
$ script/generate resource keyword user_id:integer text:string
$ script/generate resource tweet keyword_id:integer user_id:integer text:string read:boolean interesting:boolean
$ script/generate migration ChangeClassifierDefaults

Alternatively, you can clone the code from this tutorial with:

$ git clone git://github.com/logankoester/classifier_rails_example.git

Now edit the migration you just created to look like this:

class ChangeClassifierDefaults < ActiveRecord::Migration
  def self.up
    change_column :tweets, :interesting, :boolean, :default => false
    change_column :tweets, :read, :boolean, :default => false
    change_column :keywords, :text, :string, :default => ""
  end

  def self.down
    # Drop the defaults again; nil means "no default".
    change_column_default :keywords, :text, nil
    change_column_default :tweets, :read, nil
    change_column_default :tweets, :interesting, nil
  end
end

…and run it

$ rake db:migrate

Open your config/environment.rb file, and add the following gems to the Initializer block.

config.gem 'ruby-stemmer', :lib => 'lingua/stemmer'
config.gem 'luisparravicini-classifier', :lib => 'classifier'
config.gem 'twitter'

Now we can use ActiveRecord's built-in YAML serialization to store the classifier.

class User < ActiveRecord::Base
  has_many :tweets
  has_many :keywords

  serialize     :classifier, Classifier::Bayes
  before_create :initialize_classifier
  before_update :remove_stemmer

private

  def initialize_classifier
    self.classifier = Classifier::Bayes.new(
      :categories => ['Interesting', 'Uninteresting']
    )
    remove_stemmer
  end

  def remove_stemmer
    self.classifier.remove_stemmer
  end
end

The remove_stemmer method requires a little explanation. When a Classifier is initialized, it also creates a Stemmer object to use, which ordinarily gets marshalled along with its Classifier. But when it is unmarshalled later, the Stemmer object (which is really just a C extension) gets caught with its shorts down, and either throws an error like "Stemmer is not initialized" or, in older versions, simply segfaults your Rails environment!

The solution is simple; my fork implements a remove_stemmer method on Classifier::Base, which will force the stemmer to be reinitialized the next time it is needed. Call this method before you marshal your classifier, and your troubles will melt away.
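
The same pattern can be sketched in plain Ruby. This is not the fork's code: LazyStemmed is a hypothetical class, a lambda stands in for the C-extension stemmer (Ruby cannot dump a Proc either), and Marshal stands in for whatever serializer ends up storing the object.

```ruby
# Illustration only: LazyStemmed is a hypothetical class, and a lambda
# plays the role of the unserializable C-extension stemmer.
class LazyStemmed
  attr_reader :words

  def initialize
    @words = []
    @stemmer = nil
  end

  # Build the helper on first use, and again after it has been removed.
  def stemmer
    @stemmer ||= lambda { |word| word.downcase.sub(/(ing|s)\z/, '') }
  end

  def train(text)
    text.split.each { |word| @words << stemmer.call(word) }
  end

  # Drop the unserializable helper so only plain data gets stored.
  def remove_stemmer
    @stemmer = nil
  end
end

c = LazyStemmed.new
c.train('Robots fixing laptops')

begin
  Marshal.dump(c)            # fails: the lambda cannot be dumped
rescue TypeError => e
  puts "dump failed: #{e.class}"
end

c.remove_stemmer
restored = Marshal.load(Marshal.dump(c))
restored.train('cyborgs')    # the stemmer is rebuilt lazily here
p restored.words
```

Because the helper is rebuilt on demand, callers never need to know whether the object has just come back from storage.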

Moving on to the Tweet model, we want to classify each tweet when it is created.

class Tweet < ActiveRecord::Base
  belongs_to :user
  belongs_to :keyword

  before_save :classify
 
  def classify
    text = self.text.gsub(/#{self.keyword.text}/, '')
    if self.user.classifier.classify(text) == 'Interesting'
      self.interesting = true
    end
  end
end

Of course, we don't want to throw off the results by including a word which is going to occur in every tweet, so we remove the search term from the text prior to classification.
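
If your keywords might contain punctuation or show up with different capitalization, a slightly more defensive version of that substitution could look like the sketch below. strip_keyword is a hypothetical helper, not something from the example app; it assumes you want case-insensitive matching and tidy whitespace afterwards.

```ruby
# Hypothetical helper, not part of the original code: strips every
# occurrence of the search keyword before the text is classified.
def strip_keyword(text, keyword)
  return text if keyword.nil? || keyword.empty?
  # Regexp.escape keeps keywords like "c++" from being read as regex
  # syntax; the i flag also catches "Robots" and "ROBOTS".
  text.gsub(/#{Regexp.escape(keyword)}/i, '').squeeze(' ').strip
end

puts strip_keyword('My Robots love other robots', 'robots') # prints "My love other"
puts strip_keyword('learning c++ today', 'c++')             # prints "learning today"
```

A helper like this could also replace the repeated gsub calls wherever the tweet text is cleaned before training.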

Add a little method to your Keyword model to grab new tweets from the Twitter Search API

class Keyword < ActiveRecord::Base
  belongs_to :user
  has_many   :tweets, :dependent => :destroy

  after_save :search

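  # Fetch matching tweets from the Twitter Search API and store them.
  def search
    response = Twitter::Search.new(self.text).fetch
    response.results.each do |r|
      # Tweet.create already saves the record, so no extra save is needed.
      Tweet.create(
        :keyword => self,
        :user    => self.user,
        :text    => r.text
      )
    end
  end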
end

Almost done! Now we need to train our sweet new classifier. I've opted to do this entirely from the controller, so that messing around in the console won't inadvertently have an impact on the machine's learning. We also want to mark the tweet in question as already read, so that the lesson is only learned once.

class TweetsController < ApplicationController

  def show
    @tweet = Tweet.find(params[:id])
    unless @tweet.read?
      current_user.classifier.train_interesting(
        @tweet.text.gsub(/#{@tweet.keyword.text}/, '')
      )
      current_user.save
      @tweet.read = true
      @tweet.save
    end
  end

  def destroy
    if @tweet = Tweet.find(params[:id])
      if @tweet.destroy
        current_user.classifier.train_uninteresting(
          @tweet.text.gsub(/#{@tweet.keyword.text}/, '')
        )
        current_user.save
      end
    end
  end

end
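
The "learn each lesson only once" guard in the show action can be tried outside Rails too. The sketch below uses plain-Ruby stand-ins: FakeTweet, FakeClassifier and learn_once are hypothetical names that only mimic the pieces of the models used by the controller.

```ruby
# Plain-Ruby stand-ins for the Rails models; all names are hypothetical.
FakeTweet = Struct.new(:text, :read)

class FakeClassifier
  attr_reader :lessons

  def initialize
    @lessons = []
  end

  def train_interesting(text)
    @lessons << text
  end
end

# Train on a tweet the first time it is viewed, then flag it as read
# so later views do not train on it again.
def learn_once(classifier, tweet)
  return false if tweet.read
  classifier.train_interesting(tweet.text)
  tweet.read = true
end

classifier = FakeClassifier.new
tweet = FakeTweet.new('my laptop is broken', false)
learn_once(classifier, tweet)
learn_once(classifier, tweet) # second view: no extra training
puts classifier.lessons.size  # prints 1
```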

And there you have it… a simple machine learning solution for extracting awesome tweets. Let's try it out!

Fire up a script/console session.

Loading development environment (Rails 2.3.5)
>> u = User.create
=> #<User id: 1, classifier: #<Classifier::Bayes:0xb64f2354 @categories={:Uninteresting=>{}, :Interesting=>{}}, @total_words=0, @stemmer=nil, @options={:encoding=>"UTF_8", :categories=>["Interesting", "Uninteresting"], :language=>"en"}>, created_at: "2010-01-26 22:17:19", updated_at: "2010-01-26 22:17:19">

As you can see, our new user has a Bayesian Classifier waiting around to learn what kind of tweets he likes.

>> u.keywords.create(:text => "robots")
=> #<Keyword id: 1, user_id: 1, text: "robots", created_at: "2010-01-26 22:20:55", updated_at: "2010-01-26 22:20:55">
>> Tweet.all.size
=> 15

You can use the following one-liners from script/console to play around with the training:

Keyword.all.each {|k| k.search} # Rerun all searches to grab and classify more results
Tweet.all.each {|t| u.classifier.train_interesting(t.text) if t.text.downcase.include? "cyborgs" } # Any tweet with the word "cyborgs" is interesting
Tweet.all.each {|t| u.classifier.train_uninteresting(t.text) if t.text.downcase.include? "discount" } # Any tweet with the word "discount" is uninteresting
Tweet.find_all_by_interesting(true).each { |t| pp t.text }.size # Print the interesting tweets and count them
Tweet.all.each {|t| t.classify } # Rerun the classification on every tweet

Of course, this technique can be applied to sorting pretty much any kind of text. Interesting/uninteresting tweets are just one example from my life. Start hacking!

73 Comments

  1. AnonymousCritic says:

    Nice article!

  2. Rajesh says:

    This is pretty cool, I was thinking of something on these lines but I guess you beat me ;)

  3. [...] 27th, 2010 Bayesian Classification on Rails | Logan Koester (tags: ruby rails statistics bayes datamining bayesian [...]

  4. I don't usually comment on blog posts but I had to stop in and say thanks for writing this, I totally agree and with a little luck other people will understand where you are comin from.

  5. Do you actually think this was the best way to make a point?

  6. I was intending to do something similar to this a bit ago, but I couldn't to complete. It's great learning about your experience.

  7. Alright, just read this post, I have been doing research, and this blog has helped. Thanks.

  8. Don't you think it'd be smart to think twice about this? That's not to imply you're incorrect, but when you write things like this, it will upset some folks. And I ponder if you have given thought to the opposite side of this statement.

  9. Oh my god you will not belief this. This stupid kitten just farted on my leg!? I mean what's the matter with this!? I nourish that thing and I end up with that in exchange. I even now will not belief this. Anyways, you have quite a few important information there in your post. I knew Yahoo could take me to some useful stuff today :). Ok should search for that pet now! Have a nice evening you all!

  10. TSwain says:

    Super-Duper site! I am loving it!! Will come back again - taking your feeds too now, Thanks.

  11. Hey, I found your blog while searching on Google your post looks very interesting for me. I will add a backlink and bookmark your site. Keep up the good work!

    -Robert Shumake Fifth Third

  12. Steve Byrne says:

    You might want to mention that rake-compiler needs to be installed, and that ruby-stemmer does not seem to have it listed as a dependency.

  13. Steve Byrne says:

    git clone git@github.com:logankoester/classifier_rails_example.git did not work — basically the error I got was "ERROR: Permission to logankoester/classifier_rails_example denied to "

  14. Steve Byrne says:

    Maybe I don't understand Ruby quite as well as I think, but the article has several lines like the following:

    [
    text = self.text.gsub /#{self.keyword.text}/, "
    ]

    ending with a trailing double quote. Not sure what the intent here is, but it seems that most of the lines involving gsub are suffering from this malady.

