Entrepreneurship, Linux, and Ruby
Posts tagged ruby
Does it take you a while to get started in the morning?
May 18th
It does for me. When I actually am ready to sit down and do some Rails hacking, I still have to fire up a couple irb consoles, open my editor, make sure Redis is running, start a resque worker, autotest, log tails, etc etc etc… I'd really rather get right to it, wouldn't you?
Ahhh, that's much better!
namespace :day do
  task :begin => [:gvim, :console, :logs, :watchr, :server, :resque]

  task :console do
    puts "Opening IRB console…"
    `gnome-terminal --window-with-profile=railsconsole -x script/console`
  end

  task :logs do
    puts "Opening log files…"
    `gnome-terminal --window-with-profile=rails -t "Rails Logs" -x tail -f log/*`
  end

  task :watchr do
    puts "Starting test watchr…"
    `gnome-terminal --window-with-profile=rails -t "Test Watchr" -x rake watchr:test`
  end

  task :gvim do
    puts "Starting gvim…"
    sh 'gvim'
  end

  task :server do
    puts "Starting application server…"
    `gnome-terminal --window-with-profile=rails -t "Application Server" -x script/server`
  end

  task :resque do
    puts "Starting resque web… (http://localhost:5678)"
    `resque-web 2> /dev/null`
    puts "Starting resque worker…"
    `gnome-terminal --window-with-profile=rails -t "Resque Worker" -x rake day:quick_resque_worker`
  end

  task :quick_resque_worker do
    sh "QUEUE=* rake resque:work"
  end
end

namespace :watchr do
  task :test do
    sh "watchr test/test.watchr"
  end
end
Into my Rakefile it goes! And if you like that, be sure to check out git-pivotal – grab the next thing to do and give it its own branch with just one more command. Anyone feeling clever enough to implement a rake day:end as well?
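The gnome-terminal calls in those tasks repeat the same flags, so here is one way they could be trimmed down. This is only a sketch: terminal_command is a hypothetical helper name of mine, and it assumes the same "rails" and "railsconsole" terminal profiles used above.

```ruby
require 'shellwords'

# Hypothetical helper: builds the gnome-terminal command line used by the tasks.
# The flags (--window-with-profile, -t, -x) are the ones already used above.
def terminal_command(command, title = nil, profile = 'rails')
  parts = ['gnome-terminal', "--window-with-profile=#{profile}"]
  # Escape the title so spaces survive the trip through the shell
  parts += ['-t', Shellwords.escape(title)] if title
  (parts + ['-x', command]).join(' ')
end

console_cmd = terminal_command('script/console', nil, 'railsconsole')
logs_cmd    = terminal_command('tail -f log/*', 'Rails Logs')
puts console_cmd
puts logs_cmd
```

Inside a task you would then write something like sh terminal_command('script/server', 'Application Server') instead of spelling out the whole command each time.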
Bayesian Classification on Rails
Jan 26th
A project I've been working on watches Twitter for some search keywords, with the goal of finding new customers, jobs, items for sale, etc. For example, a computer repair shop might want to watch for the keywords "laptop" and "broken", and then reply to tweets where they think they can help.
But as anyone who uses Twitter can attest, even with some very specific search terms, language filtering and geocoding, there is going to be a lot of white noise. I decided to take this one step further.
Bayesian classification (your garden-variety spam filter) in Ruby is quite easy, thanks to ruby-stemmer and the excellent classifier gem. The canonical example:
require 'rubygems'
require 'classifier'

b = Classifier::Bayes.new :categories => ['Interesting', 'Uninteresting']
b.train_interesting "here are some good words. I hope you love them"
b.train_uninteresting "here are some bad words, I hate you"
b.classify "I hate bad words and you" # returns 'Uninteresting'
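If you are curious what is going on under the hood, here is a toy naive Bayes classifier in plain Ruby. To be clear, this is not the classifier gem's code: TinyBayes and its methods are names I made up for illustration, there is no stemming, and it assumes equal priors with add-one smoothing.

```ruby
# A toy naive Bayes text classifier (illustrative only, not the classifier gem).
class TinyBayes
  def initialize(*categories)
    # category => { word => number of times seen in training }
    @counts = {}
    categories.each { |c| @counts[c] = Hash.new(0) }
  end

  def train(category, text)
    tokenize(text).each { |word| @counts[category][word] += 1 }
  end

  # Sum of log-probabilities per category, with add-one smoothing so that
  # unseen words do not zero out the whole score.
  def scores(text)
    vocab_size = @counts.values.flat_map(&:keys).uniq.size
    result = {}
    @counts.each do |category, words|
      total = words.values.inject(0) { |a, b| a + b }
      result[category] = tokenize(text).inject(0.0) do |sum, word|
        sum + Math.log((words[word] + 1.0) / (total + vocab_size))
      end
    end
    result
  end

  # The category with the highest score wins
  def classify(text)
    scores(text).max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

toy = TinyBayes.new('Interesting', 'Uninteresting')
toy.train('Interesting', "here are some good words. I hope you love them")
toy.train('Uninteresting', "here are some bad words, I hate you")
verdict = toy.classify("I hate bad words and you")
puts verdict
```

The real gem adds stemming and a few other refinements on top, but the idea is the same: count words per category, then pick the category that makes the new text most likely.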
Of course, if you're implementing this in a Rails application, chances are you want the classifier to learn from real data over time. In my case, I want it to learn that a tweet is uninteresting when I delete it, and I want it to learn that a tweet is interesting when I visit the Tweet#show action.
It seems the usual method is to marshal the classifier object with madeleine, which creates a new snapshot file each time you train it. This is both easy and fast, but we're going to end up with thousands or millions of snapshot files in no time flat. Additionally, all bets are off if we have a few users who are really into cheap viagra. We need to give each User his own classifier and let him train it over time.
First, let's set up our environment. Grab the latest ruby-stemmer and classifier gems from GitHub, and build them from source. I recommend this because the gem versions I got on my first try were way out of date and quite broken, and because you'll need a classifier fork with my remove_stemmer method to marshal your classifiers using ActiveRecord.
$ cd ruby-stemmer
$ rake compile
$ sudo rake install
$ cd ..
$ git clone git://github.com/logankoester/classifier.git
$ cd classifier
$ sudo rake install
$ sudo gem install twitter
Generate a fresh rails app if you want to follow along.
$ script/generate resource user id:integer classifier:text
$ script/generate resource keyword id:integer user_id:integer text:string
$ script/generate resource tweet id:integer keyword_id:integer user_id:integer text:string read:boolean interesting:boolean
$ script/generate migration ChangeClassifierDefaults
Alternatively, you can clone the code from this tutorial with:
Now edit the migration you just created to look like this:
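class ChangeClassifierDefaults < ActiveRecord::Migration
  def self.up
    change_column :tweets, :interesting, :boolean, :default => false
    change_column :tweets, :read, :boolean, :default => false
    change_column :keywords, :text, :string, :default => ""
  end

  # change_column needs the column type, so restate it and drop the default
  def self.down
    change_column :keywords, :text, :string, :default => nil
    change_column :tweets, :read, :boolean, :default => nil
    change_column :tweets, :interesting, :boolean, :default => nil
  end
end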
…and run it with rake db:migrate.
Open your config/environment.rb file, and add the following gems to the Initializer block.
config.gem 'luisparravicini-classifier', :lib => 'classifier'
config.gem 'twitter'
Now we can use ActiveRecord's built-in YAML serialization to store the classifier.
class User < ActiveRecord::Base
  has_many :tweets
  has_many :keywords

  # Store the classifier in the text column as YAML
  serialize :classifier, Classifier::Bayes

  before_create :initialize_classifier
  before_update :remove_stemmer

  private

  def initialize_classifier
    self.classifier = Classifier::Bayes.new(
      :categories => ['Interesting', 'Uninteresting']
    )
    remove_stemmer
  end

  def remove_stemmer
    self.classifier.remove_stemmer
  end
end
The remove_stemmer method requires a little explanation. When a Classifier is initialized, it also creates a Stemmer object to use, which ordinarily gets marshalled along with its Classifier. But when demarshalled later, the Stemmer object (which is really just a C extension) will get caught with its shorts down, and either throw an error like "Stemmer is not initialized", or in older versions, simply segfault your Rails environment!
The solution is simple; my fork implements a remove_stemmer method on Classifier::Base, which will force the stemmer to be reinitialized the next time it is needed. Call this method before you marshal your classifier, and your troubles will melt away.
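To make the pattern concrete, here is a small self-contained sketch of the same idea: drop the unserializable helper before marshalling and rebuild it lazily afterwards. FakeStemmer and LazyClassifier are stand-ins I invented for this example, not the fork's actual code, and the "stemming" is deliberately crude.

```ruby
# Stand-in for a C-extension object that cannot survive serialization.
class FakeStemmer
  def stem(word)
    word.downcase.sub(/(ing|s)\z/, '')
  end

  # Simulate an object that refuses to be marshalled
  def marshal_dump
    raise TypeError, "FakeStemmer cannot be marshalled"
  end
end

class LazyClassifier
  def initialize
    @words = Hash.new(0)
    @stemmer = FakeStemmer.new
  end

  def train(text)
    text.split.each { |w| @words[stemmer.stem(w)] += 1 }
  end

  def count(word)
    @words[stemmer.stem(word)]
  end

  # Forget the stemmer so the object can be serialized safely
  def remove_stemmer
    @stemmer = nil
  end

  private

  # Rebuilt on demand the next time it is needed
  def stemmer
    @stemmer ||= FakeStemmer.new
  end
end

c = LazyClassifier.new
c.train("robots building robots")
c.remove_stemmer
restored = Marshal.load(Marshal.dump(c))
robot_count = restored.count("robot")
puts robot_count
```

Without the call to remove_stemmer, Marshal.dump raises here, which is the polite version of what the real Stemmer does.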
Moving on to the Tweet model, we want to classify each tweet when it is created.
class Tweet < ActiveRecord::Base
  belongs_to :user
  belongs_to :keyword

  before_save :classify

  def classify
    # Strip the search keyword before classifying
    text = self.text.gsub(/#{self.keyword.text}/, '')
    if self.user.classifier.classify(text) == 'Interesting'
      self.interesting = true
    end
  end
end
Of course, we don't want to throw off the results by including a word which is going to occur in every tweet, so we remove the search term from the text prior to classification.
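Here is a slightly more defensive take on that keyword stripping, as a sketch. It is a variation rather than what the model above does: strip_keyword is a hypothetical helper, and I'm assuming you want case-insensitive matching and want keywords containing regex characters (like "c++") treated literally.

```ruby
# Hypothetical helper: remove the search keyword from a tweet's text.
def strip_keyword(text, keyword)
  # Regexp.escape keeps characters like + or . from being read as regex syntax
  pattern = /#{Regexp.escape(keyword)}/i
  # Collapse the double space left behind and trim the ends
  text.gsub(pattern, '').squeeze(' ').strip
end

cleaned = strip_keyword("My laptop is broken", "laptop")
puts cleaned
```

Whether to match case-insensitively is a judgment call; the search results don't necessarily match the case you typed, so I would lean towards it.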
Add a little method to your Keyword model to grab new tweets from the Twitter Search API:
class Keyword < ActiveRecord::Base
  belongs_to :user
  has_many :tweets, :dependent => :destroy

  after_save :search

  def search
    search = Twitter::Search.new(self.text).fetch
    search.results.each do |r|
      # create saves the record, so no separate save call is needed
      Tweet.create(
        :keyword => self,
        :user => self.user,
        :text => r.text
      )
    end
  end
end
Almost done! Now we need to train our sweet new classifier. I've opted to do this entirely from the controller, so that messing around in the console won't inadvertently have an impact on the machine's learning. We also want to mark the tweet in question as already read, so that the lesson is only learned once.
class TweetsController < ApplicationController
  def show
    @tweet = Tweet.find(params[:id])
    unless @tweet.read?
      # Viewing a tweet teaches the classifier that it was interesting
      current_user.classifier.train_interesting(
        @tweet.text.gsub(/#{@tweet.keyword.text}/, '')
      )
      current_user.save
      @tweet.read = true
      @tweet.save
    end
  end

  def destroy
    if @tweet = Tweet.find(params[:id])
      if @tweet.destroy
        # Deleting a tweet teaches the classifier that it was uninteresting
        current_user.classifier.train_uninteresting(
          @tweet.text.gsub(/#{@tweet.keyword.text}/, '')
        )
        current_user.save
      end
    end
  end
end
And there you have it… a simple machine learning solution for extracting awesome tweets. Let's try it out!
Fire up a script/console session.
>> u = User.create
=> #<User id: 1, classifier: #<Classifier::Bayes:0xb64f2354 @categories={:Uninteresting=>{}, :Interesting=>{}}, @total_words=0, @stemmer=nil, @options={:encoding=>"UTF_8", :categories=>["Interesting", "Uninteresting"], :language=>"en"}>, created_at: "2010-01-26 22:17:19", updated_at: "2010-01-26 22:17:19">
As you can see, our new user has a Bayesian Classifier waiting around to learn what kind of tweets he likes.
=> #<Keyword id: 1, user_id: 1, text: "robots", created_at: "2010-01-26 22:20:55", updated_at: "2010-01-26 22:20:55">
>> Tweet.all.size
=> 15
You can use the following one-liners from script/console to play around with the training:
Tweet.all.each {|t| u.classifier.train_interesting(t.text) if t.text.downcase.include? "cyborgs" } # Any tweet with the word "cyborgs" is interesting
Tweet.all.each {|t| u.classifier.train_uninteresting(t.text) if t.text.downcase.include? "discount" } # Any tweet with the word "discount" is uninteresting
Tweet.find_all_by_interesting(true).each { |t| pp t.text }.size # Print the interesting tweets and count them
Tweet.all.each {|t| t.classify } # Rerun the classification on every tweet
Of course, this technique can be applied to sorting pretty much any kind of text. Interesting/uninteresting tweets are just one example from my life. Start hacking!
A Simple Auto-Follow script for Twitter
Jan 15th
This morning I came across a forum post containing a fairly large list of people I wanted to follow from one of my Twitter accounts. There are a lot of auto-follow tools out there, but most of them are spammy "viral marketing" nonsense that want to store my Twitter password on their servers, and I don't trust them. Here is a simple alternative written in Ruby.
Human Readable Text Compression
Jun 13th
As a Web Service
TweetShrink, a web service from TRNSFR, uses a database of common instant / text messaging abbreviations to reduce the number of characters in a tweet. It's essentially a human-readable compression algorithm. For example, "Some text to shrink" becomes "sum text 2 shrnk" when passed through their API.
But it doesn't enforce Twitter's 140-character limit, which means it can be used beyond Twitter for whatever you like. Back in March I released the tweetshrink gem for Ruby, and today I've updated it to 0.2, which includes a command-line interface.
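To give a feel for the idea, here is a tiny local sketch of dictionary-based shrinking. It is not how TweetShrink itself is implemented (that lives behind their API, and I don't know the details): the ABBREVIATIONS table is a made-up three-entry stand-in for their database, and local_shrink is a hypothetical name.

```ruby
# Made-up stand-in for TweetShrink's abbreviation database
ABBREVIATIONS = {
  'some'   => 'sum',
  'to'     => '2',
  'shrink' => 'shrnk'
}

# Replace each word that has a known abbreviation, leave the rest alone
def local_shrink(text)
  text.gsub(/\w+/) { |word| ABBREVIATIONS.fetch(word.downcase, word) }
end

original = "Some text to shrink"
shrunk   = local_shrink(original)
saved    = original.length - shrunk.length
puts "#{shrunk} (saved #{saved} characters)"
```

The output stays readable because every substitution is one a human would plausibly type anyway, which is the whole appeal.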
From the command line
First, make sure you have Ruby and Rubygems installed. On Debian-based operating systems (such as Ubuntu), this goes a little something like
Now install the gem from its GitHub repository:
$ sudo gem install logankoester-tweetshrink
You can use it from the command line like this:
# Or with a file…
$ tweetshrink ./file_to_shrink.txt
From vim
Or, you can integrate it with vim for ultimate text shrinking convenience. Just add the following to your .vimrc:
" Tweetshrink text filter (:tws) "
""""""""""""""""""""""""""""""""""
autocmd BufEnter * vmap ,tws !tweetshrink
autocmd BufEnter * nmap ,tws !!tweetshrink
Now you can shrink a single line by hitting ,tws in Normal mode, or shrink a whole visual block.
Of course, this is just as easy to integrate with your favorite text editor; I just happen to use vim.
On the Web
When I integrated this feature with my blog & tweet scheduler PingLater.fm, I realized TweetShrink didn't have a favicon. I needed an icon to use for the button, so I created these – feel free to use them for whatever.
Deploying Sinatra to a sub-URI using Passenger
Apr 8th
It's not hard, but it turns out there's a trick to it. I've run into this problem twice now, so I figure it should be documented. This is the solution if your "/" route is resulting in "Not Found" or an Apache directory listing.
You can read more about the problem at Ardekantur's "Phusion, Rack, Sinatra, and sub-domains", but here's my quick solution:
- Disable mod_autoindex if it is enabled.
- Make sure your RackBaseURI does not have a trailing slash.
- Add this before_filter to your Sinatra app:
before do
  request.env['PATH_INFO'] = '/' if request.env['PATH_INFO'].empty?
end
I suppose an alternative solution would be feasible using Rack middleware, but this is what I'm using. Thanks to Ryan Funduk for helping me figure this stuff out.
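For the curious, the middleware version might look something like the sketch below. I haven't deployed this one; EmptyPathFix is a name I made up, and the lambda at the bottom is just a stand-in app so the example runs without Sinatra or the rack gem.

```ruby
# Hypothetical Rack middleware: treat an empty PATH_INFO as "/"
class EmptyPathFix
  def initialize(app)
    @app = app
  end

  def call(env)
    env['PATH_INFO'] = '/' if env['PATH_INFO'].to_s.empty?
    @app.call(env)
  end
end

# Stand-in app that echoes the path it was asked for
echo_app = lambda do |env|
  [200, { 'Content-Type' => 'text/plain' }, [env['PATH_INFO']]]
end

status, headers, body = EmptyPathFix.new(echo_app).call('PATH_INFO' => '')
puts body.first
```

In a real deployment you would add use EmptyPathFix to your config.ru ahead of the app, which keeps the workaround out of the Sinatra code itself.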
TweetShrink gem for Ruby
Mar 6th
TweetShrink's API is so simple that this gem barely adds anything on top of HTTParty, but here it is.
require 'rubygems'
require 'tweetshrink'

t = TweetShrink.shrink "One wonders why"
# t['difference'] => 4
# t['text'] => "1 wonders y"
# t['original_text'] => "One wonders why"
You can get it from my GitHub account, here, or via RubyGems like
$ sudo gem install logankoester-tweetshrink
Enjoy.


