By David Kavanagh
Using an architecture similar to the Monster Muck article, this article focuses on building automated pool management and services in a Java world. The software example is largely based on the typica code sample and an open source project called "lifeguard." The article also highlights the new Amazon SQS feature for getting an approximate number of messages in the queue.
Background
Amazon has given us some compelling web services that can be used to build scalable systems to process large amounts of data. One such type of system uses the big three: Amazon S3, Amazon EC2 and Amazon SQS. By storing data in Amazon S3, sending work requests to Amazon SQS and processing those requests with servers in Amazon EC2, we can handle large volumes of data in a way that doesn't require a huge capital investment. To be more specific, imagine one type of server that can perform OCR on TIFF images and produce text output. Another type of server analyzes the text output and extracts specific data for a customer. This means we have two different AMIs that represent those two types of servers. Each of those servers looks at a specific work queue for units of work to be processed. So, we have one Amazon SQS work queue feeding some number of servers that are ready to process the work from that queue. This is replicated for each type of service.
In this scenario, it would be very helpful to have something monitoring the running servers and work queues, then starting or terminating servers to meet demand. By architecting our application in this way and automating management of the servers, we could scale from zero servers when there is no work to many, many servers when the work load gets heavy. It should be easy to imagine how costs could be kept down by only running servers as needed, instead of full time (as in a traditional data center). In this article, we'll discuss such a system, an open source project called "lifeguard". It is a second generation pool manager that utilizes the new Amazon SQS feature to detect the approximate number of messages in the queue. See the links at the bottom of the article for downloads.
Project Scope
The first phase of this project will focus on core pool management and workflow features. This article will discuss the implementation of this first phase and all that goes into it. Here's an enumeration of the requirements.
- Multiple server pools can be managed at once
- Service instances will report busy/idle status to allow for better assessment of server utilization
- Pool idle capacity and approximate number of messages in the work queue will be used to adjust pool size
- Messages passed will be in XML format
- The pool manager will be deployable either as a standalone application or in a .war file
Nuts and Bolts
To describe the inner workings, it helps to have a concrete example to reference. The figure below illustrates an example where the workflow consists of OCR and data extraction. As the figure indicates, an Ingestor takes data (from an undetermined source), stores it in Amazon S3, then sends a work request to the first service. Like each service, the Ingestor also sends a work status message to the Status Logger. Those status messages are persisted and can be used for tracking progress and running reports. Each service (OCR and Extractor) gets a work request from a work queue (Amazon SQS), then pulls its input data from Amazon S3. The service performs some work and writes the results back to Amazon S3, then reports status and sends a work request to the next service.
The pool managers are not shown in the picture. With an OCR and an Extractor service, you'd have two pool managers: one responsible for monitoring the OCR work queue and its running instances, and the other monitoring the Extractor work queue and its running instances.
There are a few areas I'd like to cover in some detail. The message passing interface is perhaps one of the most central features since it connects service instances, the pool manager, the logger and the client (during the ingestion process).
Message Types and Formats
Service instances need to report their state to the pool manager. This is pretty simple: either the instance is busy or idle. We'll need the instance ID so we can keep track of which servers are idle (in case we need to shrink the pool). We might as well throw in the timestamp too.
<InstanceStatus>
  <InstanceId>i-1234567</InstanceId>
  <State>busy</State>
  <LastInterval>PT35S</LastInterval>
  <Timestamp>2007-06-19T12:25:04</Timestamp>
</InstanceStatus>
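As a rough illustration of how a service instance might produce the message above, here is a minimal sketch. The class and method names are hypothetical and not taken from lifeguard, and it assumes LastInterval is the length of the state the instance just left, written as an ISO-8601 duration. It also assumes the values need no XML escaping.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of lifeguard): renders an InstanceStatus message.
final class InstanceStatusMessage {
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss");

    // Assumption: lastInterval is how long the previous state lasted.
    // Duration.toString() yields ISO-8601 text such as PT35S.
    static String toXml(String instanceId, boolean busy,
                        Duration lastInterval, LocalDateTime timestamp) {
        return "<InstanceStatus>\n"
                + "  <InstanceId>" + instanceId + "</InstanceId>\n"
                + "  <State>" + (busy ? "busy" : "idle") + "</State>\n"
                + "  <LastInterval>" + lastInterval + "</LastInterval>\n"
                + "  <Timestamp>" + TIMESTAMP.format(timestamp) + "</Timestamp>\n"
                + "</InstanceStatus>";
    }

    public static void main(String[] args) {
        System.out.println(toXml("i-1234567", true, Duration.ofSeconds(35),
                LocalDateTime.of(2007, 6, 19, 12, 25, 4)));
    }
}
```

Building the string by hand keeps the sketch short; a real service would more likely use an XML binding or DOM API so that escaping is handled for it.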
To get work started, a special process called the ingestor is run. The ingestor needs to put data files into Amazon S3 and send a message that describes the work to be performed. This first message is sent to the first work queue. For example, let's say we want to OCR a scanned document (tiff file) and run it through a data extraction process to produce the required data fields. We'd have a simple workflow that consists of "OCR" and "Extraction" services. The ingestor would put the tiff file into a bucket in Amazon S3 and send a message to the OCR service that says where that file is (and some other stuff we like to call "overhead"). Here's an example work request message:
<WorkRequest>
  <Project>DirectThought - OCR Test</Project>
  <Batch>10001</Batch>
  <ServiceName>ingest</ServiceName>
  <InputBucket>testbucket</InputBucket>
  <OutputBucket>testbucket</OutputBucket>
  <Input>
    <Key>ebed04f3d51ca6d260bf083a38a31cdf</Key>
    <Type>image/tiff</Type>
    <Location>S3</Location>
  </Input>
</WorkRequest>
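On the receiving side, a service has to pull the bucket and key out of a message like this before it can fetch its input. The following is a minimal sketch using the JDK's built-in DOM parser. The class name WorkRequestReader and its fields are made up for illustration and are not lifeguard's own types.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical reader for the WorkRequest format shown above.
final class WorkRequestReader {
    final String serviceName;
    final String inputBucket;
    final String outputBucket;
    final String inputKey;
    final String inputType;

    private WorkRequestReader(Element root) {
        serviceName = text(root, "ServiceName");
        inputBucket = text(root, "InputBucket");
        outputBucket = text(root, "OutputBucket");
        // Key and Type live inside the nested <Input> element.
        Element input = (Element) root.getElementsByTagName("Input").item(0);
        inputKey = text(input, "Key");
        inputType = text(input, "Type");
    }

    // Parses the message; malformed XML is reported as an unchecked exception.
    static WorkRequestReader parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return new WorkRequestReader(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a valid work request", e);
        }
    }

    // Returns the trimmed text of the first descendant element with this tag name.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```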
Once the OCR service has finished, it can report status. In most cases, it will report success as you see below. The message will be a record of what happened, from the input and output files, to the start and end times and instance ID of the server that ran the service. All of this allows us to track the data through the system.
<WorkStatus>
  <Project>DirectThought - OCR Test</Project>
  <Batch>10001</Batch>
  <ServiceName>OCR</ServiceName>
  <InputBucket>testbucket</InputBucket>
  <OutputBucket>testbucket</OutputBucket>
  <Input>
    <Key>ebed04f3d51ca6d260bf083a38a31cdf</Key>
    <Type>image/tiff</Type>
    <Location>S3</Location>
  </Input>
  <Output>
    <Key>ebed04f3d51ca6d260bf083a38a31cdf</Key>
    <Type>text/plain</Type>
    <Location>S3</Location>
  </Output>
  <StartTime>2007-06-19T12:30:30</StartTime>
  <EndTime>2007-06-19T12:33:48</EndTime>
  <InstanceId>i-1234567</InstanceId>
</WorkStatus>
Base Service
Each service needs to be able to monitor a work queue, parse the request, pull data from Amazon S3, process the data, put the result back into Amazon S3, report status, and construct and send a new work request to the next service in the workflow. On top of that, the service must report busy/idle status to the pool manager. Of all of those things, only the processing of the data is unique to each service. A base class with the following interface has been created to ease the burden for service implementors.
public abstract class AbstractBaseService {
    public abstract List<MetaFile> executeService(File inputFile, WorkRequest request);

    public class MetaFile {
        public File file;
        public String mimeType;
    }
}
The basic information a service requires is the input file and the work request, which may contain additional parameters. Each service produces output (one or more files), so it returns a list of objects that each contain a file and a MIME type.
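To give a feel for what a service implementor would write, here is a minimal sketch of a subclass. The AbstractBaseService and WorkRequest declarations are stand-ins copied or stubbed so the example compiles on its own, and UpperCaseService is a made-up service that takes the place of real work such as OCR.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stub standing in for lifeguard's work request type.
class WorkRequest { }

// Stand-in copy of the interface shown above, so the sketch is self-contained.
abstract class AbstractBaseService {
    public abstract List<MetaFile> executeService(File inputFile, WorkRequest request);

    public class MetaFile {
        public File file;
        public String mimeType;
    }
}

// Hypothetical service: upper-cases a text file and returns one output file.
class UpperCaseService extends AbstractBaseService {
    @Override
    public List<MetaFile> executeService(File inputFile, WorkRequest request) {
        try {
            String text = new String(Files.readAllBytes(inputFile.toPath()),
                    StandardCharsets.UTF_8);
            File output = File.createTempFile("uppercase", ".txt");
            Files.write(output.toPath(),
                    text.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));

            // Describe the result so the base class can store it and report status.
            MetaFile meta = new MetaFile();
            meta.file = output;
            meta.mimeType = "text/plain";

            List<MetaFile> results = new ArrayList<>();
            results.add(meta);
            return results;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Everything else (queue polling, Amazon S3 transfers, status reporting) would stay in the base class, which is the point of the design.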
Management Strategy
The central feature of this project is the ability to size the server pool appropriately. This can be a little tricky to get right. There are several pieces of information that can help, but putting those together in a meaningful way takes a bit of work. Let's cover the types of information we have.
Queue Size
The last API update made to Amazon SQS added a feature a lot of users have been asking for since the early Amazon SQS days. Amazon gave us the ability to ask for the approximate number of messages in the queue. Given the distributed nature of the service, the count was supposed to be hard to pin down. It turns out that even an estimate is very helpful! Knowing that there are messages in the queue is especially useful when trying to manage a pool that wants 0 instances at idle. (More on this later.)
Idle/Busy Count
The pool manager tracks the busy and idle status of the instances. Each instance starts in an idle list and, as it reports busy status, moves to a busy list. Instances migrate back and forth as status is reported. In theory, looking at an empty idle list means that the pool is fully busy and we ought to think about ramping it up. In practice, Amazon SQS doesn't deliver messages in order, especially when there are a lot of messages in the queue in question. Furthermore, some messages might linger in the queue for longer than others before being delivered. So, the busy/idle state of any given server could be old information.
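One way to soften the out-of-order problem is to compare the timestamp in each status message against the newest one already seen for that instance and drop anything older. The sketch below shows that idea. It is an assumption about how such tracking could be done, not a description of lifeguard's actual code, and PoolStatusTracker is a hypothetical name.

```java
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical tracker: keeps idle and busy lists and ignores stale reports.
final class PoolStatusTracker {
    private final Set<String> idle = new HashSet<>();
    private final Set<String> busy = new HashSet<>();
    private final Map<String, LocalDateTime> lastSeen = new HashMap<>();

    // New instances start out in the idle list.
    void instanceStarted(String instanceId) {
        idle.add(instanceId);
    }

    // Applies a status report; returns false if it was older than one already applied.
    boolean report(String instanceId, boolean isBusy, LocalDateTime timestamp) {
        LocalDateTime previous = lastSeen.get(instanceId);
        if (previous != null && !timestamp.isAfter(previous)) {
            return false; // delivered late or out of order, so the state is old news
        }
        lastSeen.put(instanceId, timestamp);
        if (isBusy) {
            idle.remove(instanceId);
            busy.add(instanceId);
        } else {
            busy.remove(instanceId);
            idle.add(instanceId);
        }
        return true;
    }

    int idleCount() { return idle.size(); }

    int busyCount() { return busy.size(); }

    // An empty idle list with at least one busy instance suggests ramping up.
    boolean fullyBusy() { return idle.isEmpty() && !busy.isEmpty(); }
}
```

This only filters reordered messages; it cannot help with a message that is simply late and still the newest one seen, so the state can remain somewhat out of date.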
Idle/Busy Interval
Assuming we have a good indication of the pool being busy or idle, it is trivial to track how long the pool has been in that state. Knowing how long the pool has been in a given state allows us to configure how fast we grow or shrink the pool. Not only can we grow or shrink the pool by some number of servers, but we can do it only after a configured interval has passed. The grow interval should (in theory) be configured based on some knowledge of the length of time the service takes to run. For a long-running service, I'd wait longer before deciding the pool is too busy and needs more help, but add more servers at a time. For faster-running services, I'd likely keep the ramp-up interval smaller and add fewer servers each time (simply because we'll be able to adjust the size more often).
Server and Pool Load
The service instances report a value that can help us determine their "busy duty cycle". Duty cycle is simply the ratio of busy time to the sum of busy and idle time over one busy/idle cycle. So, if it takes a service 4 seconds to run, and it is idle 2 seconds before picking up more work, we'd say that instance has a busy duty cycle of (4 / (4 + 2)) * 100, or about 67%. The pool manager keeps track of the most recent busy interval and idle interval for each server. Since Amazon SQS might not return the messages in order, or on time, these values are our best guess of service instance load. Because an idle instance doesn't report any status, the pool manager also needs to periodically adjust the idle interval to keep the load value accurate for that instance.
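The duty cycle arithmetic is small enough to sketch directly. The class and method names here are hypothetical. The second method shows one assumed way to handle the "idle instance doesn't report" case, by measuring the idle interval from the moment the instance went idle up to now.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical helper for the busy duty cycle calculation described above.
final class DutyCycle {
    // Busy time as a percentage of one busy/idle cycle; 0 when no time has elapsed.
    static double percent(Duration busy, Duration idle) {
        long busyMillis = Math.max(0, busy.toMillis());
        long idleMillis = Math.max(0, idle.toMillis());
        long total = busyMillis + idleMillis;
        if (total == 0) {
            return 0.0;
        }
        return 100.0 * busyMillis / total;
    }

    // Assumption: while an instance stays quiet, its idle interval keeps growing,
    // so the manager recomputes it from the time the instance went idle.
    static double percentWhileIdle(Duration lastBusy, LocalDateTime idleSince,
                                   LocalDateTime now) {
        return percent(lastBusy, Duration.between(idleSince, now));
    }

    public static void main(String[] args) {
        // The 4 seconds busy, 2 seconds idle example from the text.
        System.out.println(percent(Duration.ofSeconds(4), Duration.ofSeconds(2)));
    }
}
```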
The pool load value is simply an average taken from the load (busy duty cycle) values from all instances. So, we can say the pool is "busy enough" if the pool load value is over a certain threshold. The threshold value is another number I think should be tuned based on service execution time since fast running services will have more "idle" overhead time which contributes to a lower duty cycle for a service instance that might be running full throttle.
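A minimal sketch of the pool load check follows, assuming each instance's load is already available as a duty cycle percentage. The names are hypothetical, and the threshold is whatever value you have tuned for the service.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper: averages per-instance duty cycles into a pool load value.
final class PoolLoad {
    // Mean of the instance load values; an empty pool has a load of 0.
    static double average(Collection<Double> instanceLoads) {
        if (instanceLoads.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double load : instanceLoads) {
            sum += load;
        }
        return sum / instanceLoads.size();
    }

    // The pool counts as "busy enough" when its load is over the threshold.
    static boolean busyEnough(Collection<Double> instanceLoads, double threshold) {
        return average(instanceLoads) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(busyEnough(Arrays.asList(67.0, 100.0, 33.0), 60.0));
    }
}
```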
How Should We Manage the Pool?
There are several scenarios that we need to consider when formulating a management strategy. To keep this article from becoming a book, I'll simply summarize the scenarios and what data will be used in each case.
Empty pool, new work arrives
When the pool is empty, no servers will be reporting status. The only indication of needing to ramp up will be messages in the work queue. As soon as one message appears in the work queue, one (or more) instances need to be fired up. The number is based on a configurable parameter (RampUpInterval).
Busy pool, ramping up
As the pool maintains busy status over time (RampUpDelay), the pool must be increased. New servers will be started based on RampUpInterval and the number of messages in the queue. A configurable parameter (QueueSizeFactor) will be used to adjust the RampUpInterval to ramp up more quickly if there is a lot of work to be done.
Idle pool, ramping down
If the pool has been idle (below the load threshold) for more than the RampDownDelay, servers will be terminated. The RampDownInterval determines how many servers will be shut down. If there is a minimum pool size configured, that many servers will be left running at all times.
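The three scenarios can be pulled together into one decision function, sketched below. The parameter names mirror the configuration values mentioned above, but the class itself is hypothetical. In particular, the way QueueSizeFactor scales the ramp-up (one extra server per QueueSizeFactor queued messages) is purely an assumption for illustration, since the exact formula isn't spelled out here.

```java
import java.time.Duration;

// Hypothetical sizing logic covering the empty, busy and idle pool scenarios.
final class PoolSizer {
    private final int minSize;
    private final int rampUpInterval;
    private final int rampDownInterval;
    private final int queueSizeFactor;
    private final Duration rampUpDelay;
    private final Duration rampDownDelay;

    PoolSizer(int minSize, int rampUpInterval, int rampDownInterval,
              int queueSizeFactor, Duration rampUpDelay, Duration rampDownDelay) {
        this.minSize = minSize;
        this.rampUpInterval = rampUpInterval;
        this.rampDownInterval = rampDownInterval;
        this.queueSizeFactor = queueSizeFactor;
        this.rampUpDelay = rampUpDelay;
        this.rampDownDelay = rampDownDelay;
    }

    // Returns how many servers to start (positive) or terminate (negative).
    int adjustment(int poolSize, int approxQueueSize, boolean busy, Duration timeInState) {
        // Empty pool: queued messages are the only signal we have.
        if (poolSize == 0) {
            return approxQueueSize > 0 ? rampUpInterval : 0;
        }
        // Busy pool: grow once it has stayed busy for the ramp-up delay.
        if (busy && timeInState.compareTo(rampUpDelay) >= 0) {
            // Assumed formula: one extra server per queueSizeFactor queued messages.
            int extra = queueSizeFactor > 0 ? approxQueueSize / queueSizeFactor : 0;
            return rampUpInterval + extra;
        }
        // Idle pool: shrink after the ramp-down delay, never below the minimum size.
        if (!busy && timeInState.compareTo(rampDownDelay) > 0) {
            int removable = Math.max(0, poolSize - minSize);
            return -Math.min(rampDownInterval, removable);
        }
        return 0;
    }
}
```

A pool manager loop would call adjustment on each pass with the current pool size, the approximate queue size from Amazon SQS, and the busy/idle state and how long the pool has been in it.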
OK, Let's Fire it Up
At this point, lifeguard can be run as a standalone application. In the near future, it will be deployable in a war file. To configure your copy, edit the aws.properties file in the conf directory. It contains the AWS accessId and secretKey as well as a queue prefix that is meant to help partition several lifeguard apps running in one AWS account. It does this by prepending the queue prefix to each Amazon SQS queue name, so that queue names from different apps don't collide.
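The queue prefix idea is simple enough to show in a few lines. This is only a sketch: the property key "queuePrefix" and the queue name used below are placeholders, so check the aws.properties file that ships with lifeguard for the real key names.

```java
import java.util.Properties;

// Hypothetical helper: applies the configured prefix to a queue name.
final class QueueNames {
    // "queuePrefix" is an assumed key name, not necessarily the one lifeguard uses.
    static String prefixed(Properties awsProperties, String queueName) {
        String prefix = awsProperties.getProperty("queuePrefix", "");
        return prefix + queueName;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("queuePrefix", "test-");
        System.out.println(prefixed(props, "ocr-work"));
    }
}
```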
The poolconfig.xml file is used to configure one or more pools to be managed. You'd need to know the service name, AMI ID, pool status queue and work queue. The other parameters can be tweaked to alter pool behavior.
Spring is used for dependency injection. It helps connect bits of the system together. For instance, it sets the values from the aws.properties file on the beans that need them. It also configures beans that monitor pool state. In the future, pool state will be tracked by a web application, either to provide a web UI or a query-able web service.
A script is provided to run lifeguard. It invokes com.directthought.lifeguard.RunManager, passing in the path to the poolconfig.xml file.
Realization of a Dream
OK, well, maybe not so lofty, but consider that you might have work coming in from various sources during the day. In some cases, there could be a totally automated way that customers submit work to you. Up till now, your choice would have been to recognize that there is work to be done and launch some number of instances of a specific AMI to handle the work. Then, when the work was completed, you'd need to terminate those instances. When this happens with any regularity, you'll need to hire someone just to manage the servers! With the solution here, you can have lifeguard running (either on your own server, or on Amazon EC2) and managing your servers for you. I think it is safe to say that anytime you can automate your work process, you save money and monotony. The graphic below illustrates how a pool manager could ramp up the number of servers as it detects work and terminate servers as the work is completed. Lifeguard lets you tune how fast the pool scales up and down.
Next Steps
After building a reliable core for the pool manager, there is a laundry list of things to add on. Here are a few:
- Status Logger - to capture messages from the status queue and persist them (likely to a relational database)
- Data Model for Projects and Workflows - tools to manage the data and provide workflow XML to the ingestor
- Pool Monitoring Tools - create a PoolMonitor bean to track and display pool statistics
- Other Ingestors - web based, client, sftp ingestor tools to extend the core ingestion code
- Reporting - the ability to monitor work in progress and generate reports on work completed
- Data Extraction - a way to pull data out of Amazon S3 by project, workflow, batch, etc.
Additional Resources
David Kavanagh is a software consultant for Direct Thought in Upstate New York. He has been designing and developing software for 15 years. For the past year, his focus has been on leveraging Amazon Web Services. He is the author of the open source typica library which provides a Java interface for an expanding set of Amazon Web Services and has been developing AWS-based applications for a variety of customers.