By Robert Dempsey of Atlantic Dominion Solutions, LLC
Each Amazon Web Services (AWS) product is, in and of itself, very powerful. By combining them, you can produce true enterprise-class web applications. This article guides you through building one of these applications. We will build a web-spidering solution using Ruby, Ruby on Rails, and Lucene. The application will run on Amazon Elastic Compute Cloud Beta (Amazon EC2™), will use Amazon Simple Queue Service (Amazon SQS) for job management, and will store any persistent data (in addition to backups) on Amazon Simple Storage Service (Amazon S3).
Because of the complex nature of the solution presented here, this article will require an intermediate-to-high level of understanding of Ruby and Ruby on Rails, and an intermediate understanding of using AWS.
The solution will use Rails 2.0.1.
Background Reading
If you are new to using Ruby on Rails with AWS, please review the following articles to gain a foundation:
Solution Overview
The first step in planning your application is determining the AWS tool you will use for each portion of the application. Here's the breakdown:
- Amazon EC2 - for production, you would use two Amazon EC2 instances: one for the Rails application and MySQL (or any other supported) database, and another for the Lucene (Solr) server; for development, this article assumes a single server for the entire solution
- Amazon SQS - for managing your spidering jobs
- Amazon S3 - for persistent storage and backups
Ruby has a number of libraries that make interaction with AWS a lot simpler. The Ruby gems (also listed in the README of the sample code) you will use for this application are:
- amazon-ec2 - for using Amazon EC2
- aws-s3 - for using Amazon S3
- SQS - for using Amazon SQS
- hpricot - for parsing HTML pages
The application itself is fairly simple, and consists of three models:
- Site
- Job
- Page
Now that you know what you need for the application, let's take a look at the steps you and the application will take to produce searchable results.
- First, get set up by installing all necessary Ruby gems, creating your skeleton application, and installing your plug-ins.
- Next, create an interface to your database using Ruby on Rails that allows you to type a URL of a Site you want to spider. When a Site is saved into the database, it will automatically create a Job. When a Job is created, an Amazon SQS message is created, which contains the ID of the Job. A Site will have many Pages, each with its own attributes, including title, description, and content.
- When you have your Site and Job models, you'll create a Page model, update it to use the acts_as_solr plug-in, and then add search capability.
- When that is ready, you will perform some manual QA to test that everything works as it should.
- After that, you will add code that will handle Jobs using an Amazon SQS queue. A Job is defined as spidering a single site. Each time a Site is created, a Job is created in the database and a message containing its ID is added to the Amazon SQS queue.
- Finally, you will add the spidering code. The spidering code will be held inside the Job model and perform the following actions:
  - Check the queue for Jobs. If one is found, it continues.
  - Next, it will go into your database, retrieve the Job, and then grab the associated Site.
  - After updating the status of the Job, the spider creates a Page using the URL from the Site.
  - When it is at the Site URL, it retrieves the HTML content, parses it, stores the title, description, and content in the database, and then creates a Page for each link that the Page contains.
  - When you have completed parsing the initial URL, you will pull all other links for the initial Site Page and parse each, again storing the title, description, and page content in the database.
  - Upon Job completion, the Amazon SQS message is removed from the queue and the Site's Pages are ready to search.
Still with me? It sounds rather complex; however, using Ruby and Rails, once you know how to go about making the magic happen, it is simpler than you think. Let's get started!
Making the Magic Happen
Getting Set Up
To get started, install your Ruby gems:
Musashi:/ $ sudo gem install amazon-ec2 aws-s3 hpricot SQS
Note: If you are using RubyGems 0.9.5, by default it will install any dependencies these gems might have.
Next, create your skeleton Rails application:
Musashi:/ $ rails ec2_spider
Now, create the "ec_2_spider_development" database in MySQL and update config/database.yml as necessary.
Now, get ready to install a great plug-in: acts_as_solr is a Rails plug-in that adds full-text search, along with other capabilities, from Apache's Solr, an open-source enterprise search server based on the Lucene Java search library. Solr features XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It is also wicked fast and easy to use with the acts_as_solr plug-in.
The current version of acts_as_solr requires Java Runtime Environment (JRE) 1.5 or later. If you are using a Mac with Leopard installed, chances are you already have it. If not, you can download it from Sun.
Install the acts_as_solr plug-in into your Rails application using the following command:
Musashi:ec2_spider $ script/plugin install svn://svn.railsfreaks.com/projects/acts_as_solr/trunk
Note: The acts_as_solr plug-in comes with a copy of Solr ready to go.
When you have production ready and are using a separate Amazon EC2 instance for Solr, update config/solr.yml with the URL of the production server.
Update Environment.rb
To take advantage of all the Ruby goodness that you've just installed, and also to connect to AWS from your application, add the following to the bottom of environment.rb:
require 'rubygems'
require 'sqs'
require 'hpricot'
require 'open-uri'
SQS.access_key_id = 'YOUR_ACCESS_KEY_ID'
SQS.secret_access_key = 'YOUR_SECRET_ACCESS_KEY'
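Hard-coding keys in environment.rb is fine for a walkthrough, but you may prefer to keep them out of source control. The following is a minimal sketch of one way to do that, assuming the keys are exported as environment variables. The helper name aws_credentials and the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumptions for illustration, not something the sample code defines.

```ruby
# Hypothetical helper: read the AWS keys from environment variables.
# The variable names are an assumption; use whatever your deployment exports.
def aws_credentials(env = ENV)
  {
    :access_key_id     => env.fetch('AWS_ACCESS_KEY_ID'),
    :secret_access_key => env.fetch('AWS_SECRET_ACCESS_KEY')
  }
end

# In environment.rb this could replace the hard-coded strings:
#   creds = aws_credentials
#   SQS.access_key_id     = creds[:access_key_id]
#   SQS.secret_access_key = creds[:secret_access_key]
```

Because fetch raises an error when a key is missing, a misconfigured server fails at boot rather than on the first queue call.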
Create Scaffold and Run Migrations
Now you are set up. Next, you are going to create scaffolding for your main objects: Site, Page, and Job. Continuing with the idea of keeping it simple, use the built-in scaffold generator that Rails offers. As of Rails 2.0, all generated scaffolding is RESTful. In addition, your migrations will automatically add the code for the timestamp columns. Three commands get the job done:
Musashi:ec2_spider $ script/generate scaffold Site name:string url:string
Musashi:ec2_spider $ script/generate scaffold Page site_id:integer url:string title:string description:string content:text completed:boolean
Musashi:ec2_spider $ script/generate scaffold Job site_id:integer in_progress:boolean completed:boolean
When your models are created, double-check the migrations to ensure that all of the columns you want in your tables are there. Update migrations 002 and 003 so that "in_progress" and "completed" default to false. Save them, and then run the migrations.
class CreatePages < ActiveRecord::Migration
  def self.up
    create_table :pages do |t|
      t.integer :site_id
      t.string :url
      t.string :title
      t.string :description
      t.text :content
      t.boolean :completed, :default => false
      t.timestamps
    end
  end

  def self.down
    drop_table :pages
  end
end

class CreateJobs < ActiveRecord::Migration
  def self.up
    create_table :jobs do |t|
      t.integer :site_id
      t.boolean :in_progress, :default => false
      t.boolean :completed, :default => false
      t.timestamps
    end
  end

  def self.down
    drop_table :jobs
  end
end
Musashi:/ $ rake db:migrate
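With both boolean columns defaulting to false, a Job effectively moves through three states: waiting, in progress, and completed. As a rough illustration of how the two flags combine, here is a small sketch. The job_status helper is hypothetical and is not part of the sample application.

```ruby
# Hypothetical helper: collapse the two boolean columns on Job
# into a single readable status.
def job_status(in_progress, completed)
  return :completed   if completed
  return :in_progress if in_progress
  :pending
end

puts job_status(false, false)
puts job_status(true, false)
puts job_status(false, true)
```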
Update Pages to Use Solr for Searches
In line with the acts_as_solr documentation, update the Page model to take advantage of it. You can add your relationships, while you are at it. First, the Page model.
class Page < ActiveRecord::Base
  acts_as_solr :fields => [:title, :description, :content]
  belongs_to :site
end
Next, you set up the other side of the relationship in your Site model, and complete your Job model.
class Site < ActiveRecord::Base
  has_many :pages
  has_many :jobs
end

class Job < ActiveRecord::Base
  belongs_to :site
end
Next, create a search action in the Pages controller and a corresponding view (search.html.erb). When you work with Solr search results, loop through @pages.results rather than through @pages itself. The search view will handle that for you. Here is the code for the search action:
def search
  @pages = Page.find_by_solr params[:query], :scores => true
  respond_to do |format|
    format.html # search.html.erb
    format.xml { render :xml => @pages }
  end
end
Now all you need is a way to search. Create the search form and handle user queries:
CODE FOR app/views/pages/search.html.erb GOES HERE
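The view code itself is not reproduced in this article. As a rough stand-in, the sketch below shows the general shape such a view could take: a query form followed by a loop over @pages.results. It is not the author's view. It renders a hypothetical template with Ruby's standard ERB library outside Rails, uses plain HTML instead of Rails form helpers, and substitutes simple structs (PageStub, ResultsStub) for the objects the controller would supply. The /pages/search path is also an assumption and would need a matching route in your application.

```ruby
require 'erb'

# Stand-ins for the controller's objects (assumptions for illustration only;
# in the application, @pages comes from Page.find_by_solr).
PageStub    = Struct.new(:title, :url)
ResultsStub = Struct.new(:results, :total)

# Hypothetical template: a plain HTML form plus a loop over @pages.results.
SEARCH_TEMPLATE = <<-HTML
<form action="/pages/search" method="get">
  <input type="text" name="query" />
  <input type="submit" value="Search" />
</form>
<% if @pages %>
  <p><%= @pages.total %> result(s)</p>
  <ul>
  <% @pages.results.each do |page| %>
    <li><a href="<%= ERB::Util.html_escape(page.url) %>"><%= ERB::Util.html_escape(page.title) %></a></li>
  <% end %>
  </ul>
<% end %>
HTML

# Render the template outside Rails so the loop can be checked in isolation.
def render_search(pages)
  @pages = pages
  ERB.new(SEARCH_TEMPLATE).result(binding)
end

html = render_search(ResultsStub.new([PageStub.new("ADS", "http://www.techcfl.com")], 1))
puts html
```

Inside Rails you would place only the template portion in app/views/pages/search.html.erb and let the controller set @pages.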
Excellent. You are about halfway through. Let's make sure that what you have so far works.
Manual Testing
Check to see whether you can perform basic CRUD functionality for a site. In the root of your application, run script/server. Then, go to http://localhost:3000/sites. If everything has gone well, you should see an index page. It's rather "plain Jane," but it's functional.
Let's add a site to the database. Click New site, and type a name and URL for a site you want to parse. Be sure to type a complete URL, such as http://www.techcfl.com.
So far so good. Next, try out both adding a Page and searching for one. This time, use the console to create your record. Open two terminal windows. In the first, start the console by typing script/console at the command line. In the second, start Solr by going to the root directory of your application and running rake solr:start. When everything is ready to go, run the following in the console:
>> my_site = Site.find(:first)
>> my_page = Page.new(:site => my_site, :url => "http://www.techcfl.com", :title => "ADS", :description => "Home page of ADS", :content => "There is lot's of great content here but no links")
>> my_page.save
>> my_site.reload
>> my_site.pages
>> Page.find_by_solr("ADS")
=> #<ActsAsSolr::SearchResults:0x218b5bc @solr_data={:docs=>[#<Page id: 1, site_id: 1, url: "http://www.techcfl.com", title: "ADS", description: "Home page of ADS", content: "There is lot's of great content here but no links", created_at: "2007-12-17 16:50:01", updated_at: "2007-12-17 16:50:01">], :max_score=>0.26492345, :total=>1}>
>> Page.find_by_solr("content")
=> #<ActsAsSolr::SearchResults:0x217706c @solr_data={:docs=>[#<Page id: 1, site_id: 1, url: "http://www.techcfl.com", title: "ADS", description: "Home page of ADS", content: "There is lot's of great content here but no links", created_at: "2007-12-17 16:50:01", updated_at: "2007-12-17 16:50:01">], :max_score=>0.25088048, :total=>1}>
If you watch the Solr terminal window as the logs fly by, you will see that Solr is working hard to return search results wicked fast. Add a second page and see how it handles two:
>> my_site = Site.find(:first)
>> my_page = Page.new(:site => my_site, :url => "http://www.aws.amazon.com", :title => "Amazon.com: Homepage: Amazon Web Services", :description => "Amazon Web Services provides open APIs for developers that are robust, reliable, scalable, inexpensive, and easy-to-integrate. Developer Forum and Developer Tools.", :content => "Amazon Web Services has announced its newest service, Amazon SimpleDB, which will be made available as a limited beta sometime in the next few weeks. Amazon SimpleDB is a web service for running queries on structured data in real time. Plus, there is lot's of great content here and some links")
>> my_page.save
>> my_site.reload
>> Page.find_by_solr("content")
=> #<ActsAsSolr::SearchResults:0x20a2b28 @solr_data={:docs=>[#<Page id: 1, site_id: 1, url: "http://www.techcfl.com", title: "ADS", description: "Home page of ADS", content: "There is lot's of great content here but no links", created_at: "2007-12-17 16:50:01", updated_at: "2007-12-17 16:50:01">, #<Page id: 2, site_id: 1, url: "http://www.aws.amazon.com", title: "Amazon.com: Homepage: Amazon Web Services", description: "Amazon Web Services provides open APIs for develope...", content: "Amazon Web Services has announced its newest service...", created_at: "2007-12-17 17:14:18", updated_at: "2007-12-17 17:14:18">], :max_score=>0.48608708, :total=>2}>
Great stuff. You can add a site, and after a site is in the database, along with its content, you can search for it. Now, kill off Solr using rake solr:stop and go back to your code to add some automation.
Updating Models for Job Creation
As mentioned earlier, when a Site is created you want a Job to be created, and when a Job is created, you want a message to be added to the Amazon SQS queue. You will keep that code in your models. Let's take a look:
class Site < ActiveRecord::Base
  has_many :pages
  has_many :jobs

  def after_create
    # Once a site is created, create a Job
    j = Job.new(:site => self)
    j.save
  end
end
You use the after_create callback in Rails so that a Job is created automatically the first time a Site is saved. An after_save callback would also fire on every later update to the record, which is not what you want here. You will do the same thing with your Job model, so that a message containing the Job ID is added to your Amazon SQS queue once, when the Job is created, and not again each time the spider updates the Job's status.
class Job < ActiveRecord::Base
  belongs_to :site

  def after_create
    # Once a Job is created, add it to the queue for processing
    add_job_to_queue()
  end

  def add_job_to_queue
    # Get the SQS queue
    q = SQS.get_queue "ec2_spider_queue"
    # Send a message made up of the job id
    q.send_message "#{self.id}"
  end
end
Adding the Spidering Code
The heart of your application is the spidering code, all of which you will place in your Job model. Let's take a quick look at the highly commented code and then go through it step by step.
def self.process_queue
  # Check the queue for jobs
  logger.info "*** Checking the queue ***"
  queue = SQS.get_queue "ec2_spider_queue"
  # If there is a job in the queue, grab it
  logger.info "*** Grabbing the job from the queue ***"
  message = queue.receive_message
  # Nothing to do without a message (assumes receive_message returns nil on an empty queue)
  return if message.nil?
  # Find the job in the database and put it to in progress, using the message body
  logger.info "*** Creating a Job object and updating the progress ***"
  job = Job.find(message.body)
  job.in_progress = true
  job.save
  # Create a page in the database using the top level URL we got from the site
  logger.info "*** Creating a new Page from the site ***"
  page = Page.new(:site => job.site, :url => job.site.url)
  # Parse the page and store the contents in the Page object
  doc = Hpricot(open(page.url))
  page.title = doc.at("title").inner_html unless doc.at("title").nil?
  description = doc.at("/html/head/meta[@name='description']")
  page.description = description[:content] unless description.nil?
  page.content = doc.to_html
  page.completed = true
  # Save the page in the database
  page.save
  # Grab a list of links from the current Page and create a Page for each
  logger.info "*** Creating pages from links ***"
  doc.search("/html/body//a").each do |a|
    url = a[:href]
    # Skip anchors that have no href attribute
    next if url.nil?
    # Unless the url is a mailto, is "/", or includes "feedburner", keep going
    unless url.include?("mailto") || (url == "/") || url.include?("feedburner")
      # Check to see if we have a complete URL
      unless url.include?(job.site.url) || url.include?("http://")
        url = job.site.url + url
      end
      # Create a new Page
      p = Page.new(:site => job.site, :url => url)
      # Save the Page
      p.save
    end
  end
  # Grab a list of all pages for the current site that have yet to be parsed
  logger.info "*** Parsing all pages created for this site from the first page ***"
  pages = Page.find(:all, :conditions => ["site_id = ? AND completed = ?", job.site.id, false])
  for p in pages
    doc = Hpricot(open(p.url))
    p.title = doc.at("title").inner_html unless doc.at("title").nil?
    description = doc.at("/html/head/meta[@name='description']")
    p.description = description[:content] unless description.nil?
    p.content = doc.to_html
    p.completed = true
    p.save
  end
  # Once we are finished, mark the job as complete
  job.in_progress = false
  job.completed = true
  job.save
  # Delete the job from the queue
  message.delete
end
Taking It One Step at a Time
Let's go step-by-step through the code. First, you create a queue object so you can work with your Amazon SQS queue.
queue = SQS.get_queue "ec2_spider_queue"
Next, you create a message object and retrieve a message from your queue.
message = queue.receive_message
Now, you find the Job that corresponds to the body of the message, set it to in progress, and save it back to the database.
job = Job.find(message.body)
job.in_progress = true
job.save
After that, you create a Page, set up the relationship by linking it to the Site, and set its URL to the same URL as the Site.
page = Page.new(:site => job.site, :url => job.site.url)
This is where it gets fun. Using hpricot, you create a document by opening the URL for the Page. When you have that, you parse through the returned HTML, grabbing the title, description, and finally all of the content, putting each into the attributes of the Page. When you have all of that, the Page is considered complete. You set completed to true and save it. The Page is now searchable via Solr.
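doc = Hpricot(open(page.url))
# Guard against pages that lack a title or a description meta tag
page.title = doc.at("title").inner_html unless doc.at("title").nil?
description = doc.at("/html/head/meta[@name='description']")
page.description = description[:content] unless description.nil?
page.content = doc.to_html
page.completed = true
page.save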
Now that you have your first page, parse through it, grabbing all of the links on the page. Check out the comments in the code for further explanation.
# Parse through the document and retrieve all of the links
doc.search("/html/body//a").each do |a|
  # Create a url object to work with
  url = a[:href]
  # Skip anchors that have no href attribute
  next if url.nil?
  # Unless the url is a mailto, is "/", or includes "feedburner", keep going
  unless url.include?("mailto") || (url == "/") || url.include?("feedburner")
    # Check to see if we have a complete URL.
    # If the link includes "http://" it may be an external link, or a complete URL
    # If the link includes the full Site URL we are good to go
    # If the link is merely something such as "/pages/2" then we need to add on the complete Site URL in order to parse it later
    unless url.include?(job.site.url) || url.include?("http://")
      url = job.site.url + url
    end
    # Create a new Page from the URL
    p = Page.new(:site => job.site, :url => url)
    # Save the Page in the database
    p.save
  end
end
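The string checks above work for links such as "/pages/2", but they would glue a relative link like "pages/2" directly onto the Site URL without a separator. As an alternative sketch, Ruby's standard URI library can resolve an href against the Site URL. The normalize_link helper below is hypothetical and not part of the sample code. It applies the same skip rules as the loop above and returns nil for links that should not be spidered.

```ruby
require 'uri'

# Hypothetical helper: turn an href found on a page into an absolute URL,
# or return nil when the link should be skipped.
def normalize_link(site_url, href)
  return nil if href.nil? || href.strip.empty?
  # Same skip rules as the spider: mailto links, "/", and feedburner links
  return nil if href == "/" || href.include?("mailto") || href.include?("feedburner")
  # URI.join resolves relative references against the base URL
  URI.join(site_url, href.strip).to_s
rescue URI::Error
  nil # skip hrefs that cannot be parsed as URIs
end

puts normalize_link("http://www.techcfl.com", "/pages/2")
puts normalize_link("http://www.techcfl.com", "http://aws.amazon.com/")
```

Inside the spider, you could call this helper in place of the two unless checks and create a Page only when it returns a URL.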
At this point, you have a Page record for each link on the page at the URL you were parsing. Next, you retrieve every Page for the Site that has not yet been parsed and process each one the same way, skipping the title or description when a page does not have one.
pages = Page.find(:all, :conditions => ["site_id = ? AND completed = ?", job.site.id, false])
for p in pages
  doc = Hpricot(open(p.url))
  p.title = doc.at("title").inner_html unless doc.at("title").nil?
  description = doc.at("/html/head/meta[@name='description']")
  p.description = description[:content] unless description.nil?
  p.content = doc.to_html
  p.completed = true
  p.save
end
The job is now considered "done," so we mark it as such.
job.in_progress = false
job.completed = true
job.save
And finally, you delete the message from the queue.
message.delete
That's it!
Making it Run
You might have noticed that the process_queue method has a self. prefix. This makes it a class method and able to be called without having an instance of a Job to work with. If you want to see it work, run the following in the root of your application:
script/runner Job.process_queue
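Each run of the command above handles one message. If you want a single run to work through everything that is waiting, the general pattern is a loop that keeps receiving until the queue comes back empty. The sketch below illustrates that pattern with an in-memory stand-in. FakeQueue and drain_queue are hypothetical names, and the sketch assumes that receive_message returns nil when no message is available.

```ruby
# In-memory stand-in for an SQS queue (hypothetical; for illustration only).
class FakeQueue
  def initialize(bodies)
    @bodies = bodies.dup
  end

  # Returns the next message body, or nil when the queue is empty.
  def receive_message
    @bodies.shift
  end
end

# Keep pulling messages until the queue reports it is empty.
# The block stands in for the per-job work done by the spider.
def drain_queue(queue)
  processed = 0
  while (message = queue.receive_message)
    yield message
    processed += 1
  end
  processed
end

seen = []
count = drain_queue(FakeQueue.new(["1", "2", "3"])) { |body| seen << body }
puts "processed #{count} job(s): #{seen.inspect}"
```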
Let's Make Sure it Works
You now have a number of Pages in your database ready to be searched using Solr. Let's test it.
- Launch both Solr (rake solr:start) and the application (script/server).
- Go to http://localhost:3000/sites.
- Add a site and save it.
- Run the script: script/runner Job.process_queue.
- When the script is complete, type http://localhost:3000/pages in your address bar and search for something.
- Sit back and enjoy.
Where To Go from Here
That's it. After determining the methodology, thanks to freely available Ruby gems and the ease of interacting with AWS, it is a somewhat simple matter to code the application. Now that you have a foundation, you can take the application and run with it. For this article, we kept the code simple and ran everything on one server. In production we would do the following:
- Back up both the MySQL and the Solr databases to Amazon S3.
- Have a three-server setup: one front end, one database server, and one Solr server.
- Add additional code to keep the job going and track when we have parsed all pages for a given site.
- Create a job to parse all links on a Site that link to external pages so we can spider those sites, too.
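For the last item, the first step would be telling a Site's own pages apart from links that lead elsewhere. One possible approach, sketched here under the assumption that every link has already been expanded to an absolute URL, is to compare host names with Ruby's standard URI library. The external_link? helper is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: true when link_url points at a different host than site_url.
# Assumes both arguments are absolute URLs.
def external_link?(site_url, link_url)
  URI.parse(link_url).host != URI.parse(site_url).host
end

links = ["http://www.techcfl.com/about", "http://aws.amazon.com/"]
internal, external = links.partition { |link| !external_link?("http://www.techcfl.com", link) }
puts "internal: #{internal.inspect}"
puts "external: #{external.inspect}"
```

Note that this comparison is strict: www.techcfl.com and techcfl.com count as different hosts, so you may want to normalize host names before comparing.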
Learning More About AWS
This article highlights a few aspects of working with AWS. Here are a few more resources available to Ruby and Rails developers to help you learn more.
Ruby and Rails
Apache Solr
Common Resources on AWS
- AWS web site - Learn more about each web service on the AWS web site
- Developer Connection web site - The community web site for AWS developers includes forums on AWS, a Solutions Catalog for examples of what your peers have built, and more
- Resource Center - Part of the Developer Connection web site, the Resource Center has links to tutorials, code samples, technical documentation, and other resources for building your application on AWS
About the Author
After eight years as an MCSE and project manager, Robert Dempsey jumped from IT management and PHP/Visual Basic.NET development to Ruby on Rails. He is the project director of Atlantic Dominion Solutions, a Ruby on Rails development firm, and has recently launched Rails For All, a not-for-profit organization dedicated to promoting the use of Ruby on Rails. In addition, Robert presents on a regular basis at the Orlando Ruby Users Group, and has begun giving talks to Java user groups on topics including JRuby and Ruby on Rails.