
Using Amazon S3, EC2, SQS, Lucene, and Ruby for Web Spidering


Robert Dempsey guides you through the steps of building a web spidering application using Amazon EC2, Amazon SQS, and Amazon S3.

AWS Products Used: Amazon SQS, Amazon EC2, Amazon S3
Language(s): Ruby
Date Published: 2008-01-25

By Robert Dempsey of Atlantic Dominion Solutions, LLC

Each Amazon Web Services (AWS) product is, in and of itself, very powerful. By combining them, you can produce true enterprise-class web applications. This article guides you through building one of these applications. We will build a web-spidering solution using Ruby, Ruby on Rails, and Lucene. The application will run on Amazon Elastic Compute Cloud Beta (Amazon EC2™), will use Amazon Simple Queue Service (Amazon SQS) for job management, and will store any persistent data (in addition to backups) on Amazon Simple Storage Service (Amazon S3).

Because of the complex nature of the solution presented here, this article will require an intermediate-to-high level of understanding of Ruby and Ruby on Rails, and an intermediate understanding of using AWS. The solution will use Rails 2.0.1.

Background Reading

If you are new to using Ruby on Rails with AWS, please review the following articles to gain a foundation:

Solution Overview

The first step in planning your application is determining the AWS tool you will use for each portion of the application. Here's the breakdown:

  • Amazon EC2 - for production, you would use two Amazon EC2 instances: one for the Rails application and MySQL (or any other supported) database, and another for the Lucene (Solr) server; for development, you will assume that you are using a single server for the entire solution
  • Amazon SQS - for managing your spidering jobs
  • Amazon S3 - for persistent storage and backups

Ruby has a number of libraries that make interaction with AWS a lot simpler. The Ruby gems (also listed in the README of the sample code) you will use for this application are:

  • amazon-ec2 - for using Amazon EC2
  • aws-s3 - for using Amazon S3
  • SQS - for using Amazon SQS
  • hpricot - for parsing HTML pages

The application itself is somewhat simplistic, and consists of three models:

  1. Site
  2. Job
  3. Page

Now that you know what you need for the application, let's take a look at the steps you and the application will take to produce searchable results.

  • First, get set up by installing all necessary Ruby gems, creating your skeleton application, and installing your plug-ins.
  • Next, create an interface to your database using Ruby on Rails that allows you to type a URL of a Site you want to spider. When a Site is saved into the database, it will automatically create a Job. When a Job is created, an Amazon SQS message is created, which contains the ID of the Job. A Site will have many Pages, each with its own attributes, including title, description, and content.
  • When you have your Site and Job models, you'll create a Page model, update it to use the acts_as_solr plug-in, and then add search capability.
  • When that is ready, you will perform some manual QA to test that everything works as it should.
  • After that, you will add code that will handle Jobs using an Amazon SQS queue. A Job is defined as spidering a single site. Each time a Site is created, a Job record is created and a message carrying its ID is sent to Amazon SQS.
  • Finally, you will add the spidering code. The spidering code will be held inside the Job model and perform the following actions:
    1. Check the queue for Jobs. If one is found, it continues.
    2. Next, it will go into your database, retrieve the Job, and then grab the associated Site.
    3. After updating the status of the Job, the spider creates a Page using the URL from the Site.
    4. When it is at the Site URL, it retrieves the HTML content, parses it, stores the title, description, and content in the database, and then creates a Page for each link that the Page contains.
    5. When you have completed parsing the initial URL, you will pull all other links for the initial Site Page and parse each, again storing the title, description, and page content in the database.
    6. Upon Job completion, the Amazon SQS message is removed from the queue and the Site's pages are ready to search.
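
The flow above can be sketched in plain Ruby without Rails or AWS. This is an illustration only: InMemoryQueue is a hypothetical stand-in for an Amazon SQS queue, process_one is a hypothetical helper, and the hash stands in for the jobs table. None of these names come from the sample code.

```ruby
# Hypothetical in-memory stand-in for an Amazon SQS queue (not the SQS gem).
class InMemoryQueue
  def initialize
    @messages = []
  end

  # Add a message body to the end of the queue.
  def send_message(body)
    @messages << body
  end

  # Return the oldest message without removing it, or nil if the queue is empty.
  def receive_message
    @messages.first
  end

  # Remove a message once the work it describes is finished.
  def delete_message(body)
    @messages.delete(body)
  end

  def size
    @messages.size
  end
end

# Process at most one job: look it up by the ID in the message, mark it
# in progress, do the work, mark it completed, then delete the message.
def process_one(queue, jobs)
  body = queue.receive_message
  return nil if body.nil?            # nothing waiting in the queue

  job = jobs.fetch(body.to_i)
  job[:in_progress] = true
  yield job if block_given?          # the spidering would happen here
  job[:in_progress] = false
  job[:completed] = true
  queue.delete_message(body)
  job
end

queue = InMemoryQueue.new
jobs  = { 1 => { :site_url => "http://www.example.com", :in_progress => false, :completed => false } }

queue.send_message("1")              # creating a Site creates a Job and enqueues its ID
finished = process_one(queue, jobs) { |job| puts "Spidering #{job[:site_url]}" }
puts "Completed: #{finished[:completed]}, messages left: #{queue.size}"
```

The message is deleted only after the job is marked completed, so a worker that fails partway through leaves the message in place for another attempt.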

Still with me? It sounds rather complex; however, using Ruby and Rails, once you know how to go about making the magic happen, it is simpler than you think. Let's get started!

Making the Magic Happen

Getting Set Up

To get started, install your Ruby gems:

Musashi:/ $ sudo gem install amazon-ec2 aws-s3 hpricot SQS

Note: If you are using RubyGems 0.9.5 or later, by default it will install any dependencies these gems might have.

Next, create your skeleton Rails application:

Musashi:/ $ rails ec2_spider

Now, create the "ec_2_spider_development" database in MySQL and update config/database.yml as necessary.

Now, get ready to install a great plug-in: acts_as_solr is a Rails plug-in that adds full-text search, along with other capabilities, from Apache's Solr, an open-source enterprise search server based on the Lucene Java search library. Solr features XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It is also wicked fast and easy to use with the acts_as_solr plug-in.

The current version of acts_as_solr requires Java Runtime Environment (JRE) 1.5 or later. If you are using a Mac with Leopard installed, chances are you already have it. If not, you can download it from Sun. Install the acts_as_solr plug-in into your Rails application using the following command:

Musashi:ec2_spider $ script/plugin install svn://svn.railsfreaks.com/projects/acts_as_solr/trunk

Note: The acts_as_solr plug-in comes with a copy of Solr ready to go. When you have production ready and are using a separate Amazon EC2 instance for Solr, update config/solr.yml with the URL of the production server.

Update Environment.rb

To take advantage of all the Ruby goodness that you've just installed, and also to connect to AWS from your application, add the following to the bottom of environment.rb:

require 'rubygems'
require 'sqs'
require 'hpricot'
require 'open-uri'

SQS.access_key_id = 'YOUR_ACCESS_KEY_ID'
SQS.secret_access_key = 'YOUR_SECRET_ACCESS_KEY'
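
Hard-coding keys in environment.rb is fine for a walkthrough, but you may prefer to keep them out of version control. One possible approach, shown here only as a sketch, reads them from environment variables. The variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and the helper aws_credentials are assumptions made for illustration; the sample code does not define them.

```ruby
# Hypothetical helper: read AWS credentials from environment variables
# instead of hard-coding them. The variable names are assumptions.
def aws_credentials(env = ENV)
  access_key = env["AWS_ACCESS_KEY_ID"]
  secret_key = env["AWS_SECRET_ACCESS_KEY"]
  if access_key.to_s.empty? || secret_key.to_s.empty?
    raise ArgumentError, "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first"
  end
  { :access_key_id => access_key, :secret_access_key => secret_key }
end

# In environment.rb you could then write (assuming the SQS gem is loaded):
#   creds = aws_credentials
#   SQS.access_key_id     = creds[:access_key_id]
#   SQS.secret_access_key = creds[:secret_access_key]
```

Failing loudly when a key is missing makes a misconfigured server obvious at boot rather than at the first queue request.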

Create Scaffold and Run Migrations

Now you are set up. Next, you are going to create scaffolds for your main objects: Site, Page, and Job. Continuing with the idea of keeping it simple, use the built-in scaffold generator that Rails offers. As of Rails 2.0, all generated scaffolds are RESTful. In addition, your migrations will automatically add the code for the timestamp columns. Three commands get the job done:

Musashi:ec2_spider $ script/generate scaffold Site name:string url:string
Musashi:ec2_spider $ script/generate scaffold Page site_id:integer url:string title:string description:string content:text completed:boolean
Musashi:ec2_spider $ script/generate scaffold Job site_id:integer in_progress:boolean completed:boolean

When your models are created, double-check the migrations to ensure that all of the columns you want in your tables are there. Update migrations 002 and 003 so that "in_progress" and "completed" default to false. Save them, and then run the migrations.

class CreatePages < ActiveRecord::Migration
  def self.up
    create_table :pages do |t|
      t.integer :site_id
      t.string :url
      t.string :title
      t.string :description
      t.text :content
      t.boolean :completed, :default => false
      t.timestamps
    end
  end

  def self.down
    drop_table :pages
  end
end
	
class CreateJobs < ActiveRecord::Migration
  def self.up
    create_table :jobs do |t|
      t.integer :site_id
      t.boolean :in_progress, :default => false
      t.boolean :completed, :default => false
      t.timestamps
    end
  end

  def self.down
    drop_table :jobs
  end
end

Musashi:/ $ rake db:migrate

Update Pages to Use Solr for Searches

In line with the acts_as_solr documentation, update the Page model to take advantage of it. You can add your relationships, while you are at it. First, the Page model.

class Page < ActiveRecord::Base
  acts_as_solr :fields => [:title, :description, :content]
  belongs_to :site
end

Next, you set up the other side of the relationship in your Site model, and complete your Job model.

class Site < ActiveRecord::Base
  has_many :pages
  has_many :jobs
end

class Job < ActiveRecord::Base
  belongs_to :site
end

Next, create a search action in the Pages controller and a corresponding view (search.html.erb). When you are displaying Solr search results, rather than looping through @pages, you need to loop through @pages.results. The new search form will handle that for you. Here is the code for the search action:

def search
  @pages = Page.find_by_solr params[:query], :scores => true
    
  respond_to do |format|
    format.html # search.html.erb
    format.xml  { render :xml => @pages }
  end   
end

Now all you need is a way to search. Create the search form and handle user queries:

CODE FOR app/views/pages/search.html.erb GOES HERE
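
The original listing for the view is not reproduced above. As a rough idea of what such a view could contain, here is a sketch (not the author's template) that renders a search form and a result list with Ruby's standard ERB library so that it runs outside Rails. The form action /pages/search, the PageStub struct, and the render_search helper are assumptions for illustration. In the real view you would iterate over @pages.results and would need a route for the search action.

```ruby
require "erb"

# Sketch of what search.html.erb could contain. Plain HTML is used for the
# form so the template does not depend on Rails helpers.
SEARCH_TEMPLATE = <<-'HTML'
<h1>Search pages</h1>
<form action="/pages/search" method="get">
  <input type="text" name="query" value="<%= ERB::Util.html_escape(query) %>" />
  <input type="submit" value="Search" />
</form>
<% unless results.empty? %>
<ul>
<% results.each do |page| %>
  <li><a href="<%= ERB::Util.html_escape(page.url) %>"><%= ERB::Util.html_escape(page.title) %></a></li>
<% end %>
</ul>
<% end %>
HTML

# Stand-in for a Page record so the template can be rendered outside Rails.
PageStub = Struct.new(:url, :title)

# Render the template with the given query string and list of results.
def render_search(query, results)
  ERB.new(SEARCH_TEMPLATE).result(binding)
end

puts render_search("ADS", [PageStub.new("http://www.techcfl.com", "ADS")])
```

Escaping the query and the page titles matters here because both come from outside the application: one from the user, the other from spidered sites.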

Excellent. You are about halfway through. Let's make sure that what you have so far works.

Manual Testing

Check to see whether you can perform basic CRUD functionality for a site. In the root of your application, run script/server. Then, go to http://localhost:3000/sites. If everything has gone well, you should see an index page. It's rather "plain Jane," but it's functional.

Let's add a site to the database. Click New site, and type a name and URL for a site you want to parse. Be sure to type a complete URL, such as http://www.techcfl.com. So far so good.

You'll want to try out both adding a Page and searching for one. This time, create a page using the console to create your record. Open two terminal windows. In the first, start the console by typing script/console at the command line. In the second window, start Solr by going to the root directory of your application and running rake solr:start. When everything is ready to go, run the following in the console:

>> my_site = Site.find(:first)
>> my_page = Page.new(:site => my_site, :url => "http://www.techcfl.com", :title => "ADS", :description => "Home page of ADS", :content => "There is lot's of great content here but no links")
>> my_page.save
>> my_site.reload
>> my_site.pages
>> Page.find_by_solr("ADS")
=> #<ActsAsSolr::SearchResults:0x218b5bc @solr_data={:docs=>[#<Page id: 1, site_id: 1, url: "http://www.techcfl.com", title: "ADS", description: "Home page of ADS", content: "There is lot's of great content here but no links", created_at: "2007-12-17 16:50:01", updated_at: "2007-12-17 16:50:01">], :max_score=>0.26492345, :total=>1}>
>> Page.find_by_solr("content")
=> #<ActsAsSolr::SearchResults:0x217706c @solr_data={:docs=>[#<Page id: 1, site_id: 1, url: "http://www.techcfl.com", title: "ADS", description: "Home page of ADS", content: "There is lot's of great content here but no links", created_at: "2007-12-17 16:50:01", updated_at: "2007-12-17 16:50:01">], :max_score=>0.25088048, :total=>1}>

If you watch the Solr terminal window as the logs fly by, you will see that Solr is working hard to return search results wicked fast. Add a second page and see how it handles that:

>> my_site = Site.find(:first)
>> my_page = Page.new(:site => my_site, :url => "http://www.aws.amazon.com", :title => "Amazon.com: Homepage: Amazon Web Services", :description => "Amazon Web Services provides open APIs for developers that are robust, reliable, scalable, inexpensive, and easy-to-integrate. Developer Forum and Developer Tools.", :content => "Amazon Web Services has announced its newest service, Amazon SimpleDB, which will be made available as a limited beta sometime in the next few weeks. Amazon SimpleDB is a web service for running queries on structured data in real time. Plus, there is lot's of great content here and some links")
>> my_page.save
>> my_site.reload
>> Page.find_by_solr("content")
=> #<ActsAsSolr::SearchResults:0x20a2b28 @solr_data={:docs=>[#<Page id: 1, site_id: 1, url: "http://www.techcfl.com", title: "ADS", description: "Home page of ADS", content: "There is lot's of great content here but no links", created_at: "2007-12-17 16:50:01", updated_at: "2007-12-17 16:50:01">, #<Page id: 2, site_id: 1, url: "http://www.aws.amazon.com", title: "Amazon.com: Homepage: Amazon Web Services", description: "Amazon Web Services provides open APIs for develope...", content: "Amazon Web Services has announced its newest service...", created_at: "2007-12-17 17:14:18", updated_at: "2007-12-17 17:14:18">], :max_score=>0.48608708, :total=>2}>

Great stuff. You can add a site, and after a site is in the database, along with its content, you can search for it. Now, kill off Solr using rake solr:stop and go back to your code to add some automation.

Updating Models for Job Creation

As mentioned earlier, when a Site is created you want a Job to be created, and when a Job is created, you want a message to be added to the Amazon SQS queue. You will keep that code in your models. Let's take a look:

class Site < ActiveRecord::Base
  has_many :pages
  has_many :jobs

  def after_create
    # Once a site is created, create a Job
    j = Job.new(:site => self)
    j.save
  end

end

You use the after_create callback in Rails so that a Job is automatically created the first time a Site is saved. You will do the same thing with your Job model, so that a message, containing the Job ID, is added to your Amazon SQS queue. Using after_create rather than after_save matters here: the spidering code saves the Job again each time it updates the status, and after_save would send a new message on every one of those saves.

class Job < ActiveRecord::Base
  belongs_to :site

  def after_create
    # Once a Job is created, add it to the queue for processing
    add_job_to_queue
  end

  def add_job_to_queue
    # Get the SQS queue
    q = SQS.get_queue "ec2_spider_queue"
    # Send a message made up of the job id
    q.send_message id.to_s
  end
end
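
The difference between a callback that runs on every save and one that runs only on creation can be seen in a small stand-alone sketch. MiniRecord below is a hypothetical stand-in that mimics only the order in which such callbacks fire. It is not ActiveRecord, and the class names are invented for the example.

```ruby
# Tiny stand-in for an ActiveRecord-style model, only to show when each
# callback fires. It is not ActiveRecord; all names are illustrative.
class MiniRecord
  attr_reader :id

  @@last_id = 0

  # The first save assigns an id and fires after_create; every save
  # (first or later) fires after_save.
  def save
    if @id.nil?
      @id = (@@last_id += 1)
      after_create
    end
    after_save
    true
  end

  def after_create; end
  def after_save; end
end

# Records a "message" on every save, including status updates.
class JobEnqueuedOnSave < MiniRecord
  attr_reader :enqueued

  def initialize
    @enqueued = []
  end

  def after_save
    @enqueued << id
  end
end

# Records a "message" only when the record is first created.
class JobEnqueuedOnCreate < MiniRecord
  attr_reader :enqueued

  def initialize
    @enqueued = []
  end

  def after_create
    @enqueued << id
  end
end

on_save = JobEnqueuedOnSave.new
3.times { on_save.save }      # create, then two status updates
on_create = JobEnqueuedOnCreate.new
3.times { on_create.save }

puts "Enqueued on every save: #{on_save.enqueued.size}"
puts "Enqueued on create only: #{on_create.enqueued.size}"
```

With three saves, the first variant records three messages and the second records one, which is the behavior you want when each Job should be spidered once.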

Adding the Spidering Code

The heart of your application is the spidering code, all of which you will place in your Job model. Let's take a quick look at the highly commented code and then go through it step by step.

def self.process_queue
  # Check the queue for jobs
  logger.info "*** Checking the queue ***"
  queue = SQS.get_queue "ec2_spider_queue"
    
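  # If there is a job in the queue, grab it; otherwise stop here
  # (this assumes receive_message returns nil when the queue is empty)
  logger.info "*** Grabbing the job from the queue ***"
  message = queue.receive_message
  return if message.nil?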
    
  # Find the job in the database and put it to in progress, using the message body
  logger.info "***  Creating a Job object and updating the progress ***"
  job = Job.find(message.body)
  job.in_progress = true
  job.save
    
  # Create a page in the database using the top level URL we got from the site
  logger.info "*** Creating a new Page from the site ***"
  page = Page.new(:site => job.site, :url => job.site.url)
  # Parse the page and store the contents in the Page object
  doc = Hpricot(open(page.url))
  title = doc.at("title")
  page.title = title.inner_html unless title.nil?
  meta = doc.at("/html/head/meta[@name='description']")
  page.description = meta[:content] unless meta.nil?
  page.content = doc.to_html
  page.completed = true
  # Save the page in the database
  page.save
    
  # Grab a list of links from the current Page and create a Page for each
  logger.info "*** Creating pages from links ***"
  doc.search("/html/body//a").each do |a|
    url = a[:href]
    # Skip anchors that have no href attribute
    next if url.nil?
    # Unless the url is a mailto, is "/", or includes "feedburner", keep going
    unless url.include?("mailto") || (url == "/") || url.include?("feedburner")
      # Check to see if we have a complete URL
      unless url.include?(job.site.url) || url.include?("http://")
        url = job.site.url + url
      end
      # Create a new Page
      p = Page.new(:site => job.site, :url => url)
      # Save the Page
      p.save
    end
  end
    
  # Grab a list of all pages for the current site that have yet to be parsed
  logger.info "*** Parsing all pages created for this site from the first page ***"
  pages = Page.find(:all, :conditions => ["site_id = ? AND completed = ?", job.site.id, false])
  for p in pages
    doc = Hpricot(open(p.url))
    title = doc.at("title")
    p.title = title.inner_html unless title.nil?
    # Check the meta element itself for nil before reading its content attribute
    meta = doc.at("/html/head/meta[@name='description']")
    p.description = meta[:content] unless meta.nil?
    p.content = doc.to_html
    p.completed = true
    p.save
  end
    
  # Once we are finished, mark the job as complete
  job.in_progress = false
  job.completed = true
  job.save
   
  # Delete the job from the queue
  message.delete
end

Taking It One Step at a Time

Let's go step-by-step through the code. First, you create a queue object so you can work with your Amazon SQS queue.

queue = SQS.get_queue "ec2_spider_queue"

Next, you create a message object and retrieve a message from your queue.

message = queue.receive_message

Now, you find the Job that corresponds to the body of the message, set it to in progress, and save it back to the database.

job = Job.find(message.body)
job.in_progress = true
job.save

After that, you create a Page, set up the relationship by linking it to the Site, and set its URL to the same URL as the Site.

page = Page.new(:site => job.site, :url => job.site.url)	

This is where it gets fun. Using hpricot, you create a document by opening the URL for the Page. When you have that, you parse through the returned HTML, grabbing the title, description, and finally all of the content, putting each into the attributes of the Page. When you have all of that, the Page is considered complete. You set completed to true and save it. The Page is now searchable via Solr.

doc = Hpricot(open(page.url))
title = doc.at("title")
page.title = title.inner_html unless title.nil?
meta = doc.at("/html/head/meta[@name='description']")
page.description = meta[:content] unless meta.nil?
page.content = doc.to_html
page.completed = true
page.save

Now that you have your first page, parse through it, grabbing all of the links on the page. Check out the comments in the code for further explanation.

# Parse through the document and retrieve all of the links
doc.search("/html/body//a").each do |a|
  # Create a url object to work with
  url = a[:href]
  # Skip anchors that have no href attribute
  next if url.nil?
  # Unless the url is a mailto, is "/", or includes "feedburner", keep going
  unless url.include?("mailto") || (url == "/") || url.include?("feedburner")
    # Check to see if we have a complete URL.
    # If the link includes "http://" it may be an external link, or a complete URL
    # If the link includes the full Site URL we are good to go
    # If the link is merely something such as "/pages/2" then we need to add on the complete Site URL in order to parse it later
    unless url.include?(job.site.url) || url.include?("http://")
      url = job.site.url + url
    end
    # Create a new Page from the URL
    p = Page.new(:site => job.site, :url => url)
    # Save the Page in the database
    p.save
  end
end
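
Joining the Site URL and the link with string concatenation works for links such as "/pages/2", but it can produce malformed URLs for other relative forms (for example, a link without a leading slash). As an alternative sketch, Ruby's standard URI library can resolve a link against a base URL. The helper name resolve_link is hypothetical, and this version does not replicate the "feedburner" filter from the code above.

```ruby
require "uri"

# Resolve an href found on a page against the site's base URL.
# Returns an absolute URL string, or nil for links the spider should skip.
def resolve_link(base_url, href)
  return nil if href.nil? || href.strip.empty?
  return nil if href =~ /\A(mailto|javascript):/i

  absolute = URI.join(base_url, href.strip)
  # Only keep links that can be fetched over HTTP or HTTPS
  return nil unless %w[http https].include?(absolute.scheme)

  absolute.to_s
rescue URI::Error
  # Malformed links are skipped rather than stopping the whole job
  nil
end

puts resolve_link("http://www.example.com", "/pages/2")
puts resolve_link("http://www.example.com/blog/", "post-1")
puts resolve_link("http://www.example.com", "mailto:someone@example.com").inspect
```

URI.join follows the standard rules for relative references, so absolute links pass through unchanged and relative ones are resolved against the page they were found on.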

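At this point, you have a Page record for each link on the page at the URL you were parsing. Next, you retrieve every Page for the Site that has not yet been parsed, then fetch and parse each one in the same way as the first page.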

pages = Page.find(:all, :conditions => ["site_id = ? AND completed = ?", job.site.id, false])
for p in pages
  doc = Hpricot(open(p.url))
  title = doc.at("title")
  p.title = title.inner_html unless title.nil?
  # Check the meta element itself for nil before reading its content attribute
  meta = doc.at("/html/head/meta[@name='description']")
  p.description = meta[:content] unless meta.nil?
  p.content = doc.to_html
  p.completed = true
  p.save
end
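
Pages without a title or a description meta tag are common, so each lookup needs a nil check before its value is read. The sketch below shows that guard pattern in isolation. To stay dependency-free it uses simple regular expressions as a stand-in for hpricot. The helper extract_metadata is hypothetical, and regular expressions are not a robust way to parse real-world HTML.

```ruby
# Hypothetical helper that pulls the title and meta description out of an
# HTML string, returning nil for whichever piece is missing.
# Only handles a meta tag written as name="description" content="...".
def extract_metadata(html)
  title_match = html.match(/<title[^>]*>(.*?)<\/title>/im)
  meta_match  = html.match(/<meta\s+name=["']description["']\s+content=["'](.*?)["']/im)
  {
    :title       => title_match ? title_match[1].strip : nil,
    :description => meta_match ? meta_match[1].strip : nil
  }
end

with_meta = '<html><head><title> Home </title><meta name="description" content="Welcome" /></head></html>'
without   = "<html><head></head><body>No metadata here</body></html>"

p extract_metadata(with_meta)
p extract_metadata(without)
```

Whichever parser you use, the key point is to test the element for nil before asking it for an attribute or its inner HTML.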

The job is now considered "done," so you mark it as such.

job.in_progress = false
job.completed = true
job.save	

And finally, you delete the message from the queue.

message.delete	

That's it!

Making it Run

You might have noticed that the process_queue method has a self. prefix. This makes it a class method and able to be called without having an instance of a Job to work with. If you want to see it work, run the following in the root of your application:

script/runner Job.process_queue
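
To illustrate the difference the self. prefix makes, here is a small stand-alone sketch. The Worker class and its methods are invented for the example and are not part of the application.

```ruby
class Worker
  # Class method: called on the class itself, no instance needed.
  def self.process_queue
    "processing via class method"
  end

  # Instance method: needs an object first.
  def process
    "processing via instance method"
  end
end

puts Worker.process_queue             # works directly on the class
puts Worker.new.process               # requires Worker.new first
puts Worker.respond_to?(:process)     # the class itself has no such method
```

This is why script/runner can call Job.process_queue directly: there is no Job record to instantiate until a message has been read from the queue.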

Let's Make Sure it Works

You now have a number of Pages in your database ready to be searched using Solr. Let's test it.

  • Launch both Solr (rake solr:start) and the application (script/server).
  • Go to http://localhost:3000/sites.
  • Add a site and save it.
  • Run script/runner Job.process_queue.
  • When the script is complete, type http://localhost:3000/pages in your address bar and search for something.
  • Sit back and enjoy.

Where To Go from Here

That's it. After determining the methodology, thanks to freely available Ruby gems and the ease of interacting with AWS, it is a somewhat simple matter to code the application. Now that you have a foundation, you can take the application and run with it. For this article, we kept the code simple and ran everything on one server. In production we would do the following:

  • Back up both the MySQL and the Solr databases to Amazon S3.
  • Have a three-server setup: one front end, one database server, and one Solr server.
  • Add additional code to keep the job going and track when we have parsed all pages for a given site.
  • Create a job to parse all links on a Site that link to external pages so we can spider those sites, too.
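
As one way to approach tracking when all pages for a site have been parsed, the sketch below crawls breadth-first and remembers every URL it has already visited. It is an assumption-laden illustration, not part of the sample code. The crawl helper, the max_pages limit, and the FAKE_SITE hash (which replaces real HTTP requests so the example runs offline) are all hypothetical.

```ruby
require "set"

# Crawl every page reachable from start_url, visiting each URL once.
# fetch_links is any callable that returns the links found at a URL.
def crawl(start_url, fetch_links, max_pages = 100)
  visited = Set.new
  pending = [start_url]

  until pending.empty? || visited.size >= max_pages
    url = pending.shift
    next if visited.include?(url)    # already parsed, skip it

    visited << url
    fetch_links.call(url).each do |link|
      pending << link unless visited.include?(link)
    end
  end

  visited.to_a
end

# Hypothetical link graph standing in for a real site.
FAKE_SITE = {
  "http://www.example.com/"  => ["http://www.example.com/a", "http://www.example.com/b"],
  "http://www.example.com/a" => ["http://www.example.com/", "http://www.example.com/b"],
  "http://www.example.com/b" => ["http://www.example.com/c"],
  "http://www.example.com/c" => []
}

links_for = lambda { |url| FAKE_SITE.fetch(url, []) }
puts crawl("http://www.example.com/", links_for).inspect
```

The job is finished when the pending list is empty. The visited set prevents pages that link to each other from being parsed repeatedly, and max_pages bounds the work for very large sites.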

Learning More About AWS

This article highlights a few aspects of working with AWS. Here are a few more resources available to Ruby and Rails developers to help you learn more.

Ruby and Rails

Apache Solr

Common Resources on AWS

  • AWS web site - Learn more about each web service on the AWS web site
  • Developer Connection web site - The community web site for AWS developers includes forums on AWS, a Solutions Catalog for examples of what your peers have built, and more
  • Resource Center - Part of the Developer Connection web site, the Resource Center has links to tutorials, code samples, technical documentation, and other resources for building your application on AWS

About the Author

After eight years as an MCSE and project manager, Robert Dempsey jumped from IT management and PHP/Visual Basic.NET development to Ruby on Rails. He is the project director of Atlantic Dominion Solutions, a Ruby on Rails development firm, and has recently launched Rails For All, a not-for-profit organization dedicated to promoting the use of Ruby on Rails. In addition, Robert presents on a regular basis at the Orlando Ruby Users Group, and has begun giving talks to Java user groups on topics including JRuby and Ruby on Rails.



Related Documents
Type: Sample Code Code Sample for "Using S3, EC2, SQS, Lucene and Ruby for Web Spidering"

Discussion


"dbreaker"
Posts: 7
Registered: 2/22/08
Using Amazon S3, EC2, SQS, Lucene, and Ruby for Web Spidering - solr image?
Posted: Feb 22, 2008 1:29 PM PST
 

Anyone know if there is an existing recommended ec2 solr image to use for this example?


Robert Dempsey
RealName(TM)
Posts: 1
Registered: 7/8/07
Re: Using Amazon S3, EC2, SQS, Lucene, and Ruby for Web Spidering - solr image?
Posted: Jun 21, 2008 7:49 AM PDT   in response to: "dbreaker"
 

Not that I have found, yet.



Reviews

how to create a queue, Apr 6, 2008 8:54 PM
Reviewer: zawoad
In the example you get a queue by SQS.get_queue method. I think i have to create a queue first. How can i create a queue? what is the method? thanks

where is the cloud?, Nov 26, 2008 12:35 PM
Reviewer: bookschmarty
a nice example for rails with solr, but s3 and ec2 are not really (usefully) applied here. would make sense with few additional steps though: if it were shown how to make images of this application and run it in the cloud! (that is what i expected from the title)