|
by Mitch Garnaat
SimpleDB? Tell Me More...
Amazon recently announced the availability of a new web service,
SimpleDB.
Like all of the Amazon Web Services, SimpleDB is designed to be highly available,
and highly scalable. Also, like other AWS offerings it has a pay-as-you-go
pricing model that is very attractive. The SimpleDB service, as the name implies,
offers database capabilities. When most developers hear the word database
they think relational database but there are other types of databases
in the world and SimpleDB is definitely one of those other types.
The first
difference you will notice in SimpleDB is that it is schemaless. In
relational databases you have to define your data model up front and changing
that model later can be painful. Not so, in SimpleDB. In SimpleDB you have
items (think rows) which can have up to 256 attributes (think columns)
where each attribute can have up to 256 values, all of which must be strings.
You can add and remove items and attributes easily and there's not a schema in sight.
The second main difference you will notice is that you won't need that handy
little pocket reference to SQL with SimpleDB. SimpleDB provides it's own simple
query language. Rather than the SQL reference book, you might want to brush up
on your set theory because SimpleDB and it's query language are very set oriented;
think unions and intersections and you'll be in the right ballpark.
Let's Get Active
Active Record
is a design pattern described by
Martin Fowler in his book,
Patterns of Enterprise
Application Architecture.
Mr. Fowler defines the pattern as:
An object carries both data and behavior. Much of this
data is persistent and needs to be stored in a
database. Active Record uses the most obvious approach,
putting data access logic in the domain object. This way
all people know how to read and write their data to and
from the database.
Wikipedia defines Active Record as:
...an approach to accessing data in a database. A
database table or view is wrapped into a class, thus an
object instance is tied to a single row in the
table. After creation of an object, a new row is added to
the table upon save. Any object loaded gets its
information from the database; when an object is updated,
the corresponding row in the table is also updated. The
wrapper class implements accessor methods or properties
for each column in the table or view.
Probably the best known implementation of the Active Record design pattern is the
ActiveRecord library used in Ruby on Rails
however examples of the pattern can be seen in virtually all of the modern web
frameworks such as Django,
TurboGears and others.
The goal of this article is to describe a simple persistence package that implements the
Active Record pattern on top of SimpleDB rather than a relational database such as MySQL.
This package is Python-based and builds on the basic SimpleDB interface provided by
the boto library. It
allows you to wrap SimpleDB persistence into Python classes and to hide
most of the details of how that persistence works. This software is
still very much a work in progress and will undoubtedly continue to change and evolve
based on user feedback but the code described here is currently available in the boto library
as the module boto.sdb.persist.
Properties
The first thing we are going to look at in our persistence
layer is Properties. If we were mapping to a relational
database, the Properties would be used to define the
columns within a table in our relational model. The
analogy to columns in SimpleDb is an attribute so in
the boto.sdb.persist
module we use Properties to define what SimpleDB
attributes will be associated with our Python classes.
Because all attribute values in SimpleDB are strings, we
could just define a single Property type called
StringProperty and be done. But in the real world, we
encounter data types other than strings so part of the
responsibility of the Property classes is to map common
data types (e.g. integers, booleans, etc.) to the
underlying string data type provided by SimpleDB. In
mapping these data types to strings, we want to keep in
mind that all comparisons done in queries in SimpleDB are
done lexicographically. So, if we store two integer
values, say 42 and 9 in SimpleDB and we want to be able to
query later about which value is greater or less than the
other, we need to do some extra work when we convert those
values to strings in SimpleDB. The Property classes take
care of all of that conversion for you and allow your
Python classes to deal with the attributes as native data
types.
Currently, the persistence layer provides the following types of
properties or fields that can be included within a Python class
definition:
StringProperty
Accepts any string as a value. The field is limited to
1024 characters by SimpleDB but can be further constrained using
the maxlength
parameter to the constructor. For example, this:
short = StringProperty(maxlength=10)
Would limit the size of the property
short to 10 characters.
SmallPositiveIntegerProperty
Accepts a positive integer value in the range of 0-65535. This value is
converted to a string before storing in SimpleDB. To convert, a zero-padded
string is constructed. For SmallPositiveInteger, the size of that string
is 5 characters. So, storing a value of 42 would result in:
00042
being stored in SimpleDB.
SmallIntegerProperty
Accepts an integer value in the range of -32768 - +32767. This value is
converted to a string before storing in SimpleDB. To convert, an offset value of
32768 is added to the value of the property. The resulting positive integer
value is then zero-padded in a manner similar to that described above.
So, storing a value of -42 would result in:
32726
being stored in SimpleDB.
PositiveIntegerProperty
Identical to the SmallPositiveInteger property except the range of value
is 0 - 4294967295 and the string stored in SimpleDB is padded to a length
of 10 characters.
IntegerProperty
Identical to the SmallInteger property except the range of value
is -21474836480 - +2147483647 and the string stored in SimpleDB is padded to a length
of 10 characters.
LargePositiveIntegerProperty
Identical to the SmallPositiveInteger property except the range of value
is 0 - 18446744073709551615 and the string stored in SimpleDB is padded to a length
of 20 characters.
LargeIntegerProperty
Identical to the SmallInteger property except the range of value
is -9223372036854775808 - +9223372036854775807 and the string stored in SimpleDB
is padded to a length of 10 characters.
BooleanProperty
Accepts a boolean True or False value. True values are stored as the string
"true" while False values are stored in SimpleDB as the string "false".
DateTimeProperty
Accepts a Python datetime object as a value. Values are converted to strings by
converting the datetime object to it's ISO8601 representation.
ObjectProperty
This kind of property stores a reference to another
Python object. The referenced Python object must be a
subclass of the SDBObject defined in the
persist module. It can be further constrained
using the `ref_class` parameter to the constructor.
The value stored in SimpleDB will be a string
consisting of information about the module the
object's class is contained in as well as the name of
the class and name of the object. This module and
class information must be loadable by Python at the
time this property is referenced.
S3KeyProperty
This kind of property stores a reference to an object
stored in S3. The size of attribute values in SimpleDB
is limited to 1024 bytes so it's obviously not going to
be possible to store the equivalent of BLOB's directly
in SimpleDB. Fortunately, AWS provides the S3 service
which does a fantastic job of storing as much data as
you want to throw at it. This property allows you to
take advantage of that. The value of this property
is a Boto Key object which represents one object in S3.
If you already have an instance of a Key object you can
just store it here. Or you can just store a string
of the form bucketname/keyname
and the S3KeyProperty will make the reference to that
existing S3 object. We'll see some examples of how
this works a bit later on.
One of the really nice features of SimpleDB is that you can store multiple
values with any attribute. To do that in a relational database you would
have to create some sort of mapping table but in SimpleDB it just happens.
So, to support his cool feature the boto.sdb.persist
offers list versions of all of the Properties described above. These list
versions do the exact same value conversions and validation as the scalar
Properties, they just allow multiple values to be stored. The list Properties
are:
- StringListProperty - a list of StringProperty values
- SmallIntegerListProperty - a list of SmallIntegerProperty values
- SmallPostiveIntegerListProperty - a list of SmallPositiveIntegerProperty values
- IntegerListProperty - a list of IntegerProperty values
- PostiveIntegerListProperty - a list of PositiveIntegerProperty values
- LargeIntegerListProperty - a list of LargeIntegerProperty values
- LargePostiveIntegerListProperty - a list of LargePositiveIntegerProperty values
- BooleanListProperty - a list of BooleanProperty values
- ObjectListProperty - a list of ObjectProperty values
The Example
To give a flavor of what the library is able to do, I have chosen to
take an existing MySQL database that contains information about my
music library and attempt to convert that data and load it into
SimpleDB. The original MySQL schema consists of three tables; disc,
genre, and track. The original MySQL definitions of the tables are
shown below.
CREATE TABLE `genre` (
`genre_pk` int(11) NOT NULL,
`genre_name` varchar(32) NOT NULL,
PRIMARY KEY (`genre_pk`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8_bin;
CREATE TABLE `disc` (
`disc_pk` int(11) NOT NULL auto_increment,
`disc_id` varchar(255) character set latin1 NOT NULL,
`disc_num_tracks` int(11) default NULL,
`disc_title` longtext character set latin1 NOT NULL,
`disc_genre_fk` int(11) NOT NULL,
`disc_year` varchar(10) character set latin1 NOT NULL,
`disc_seconds` int(11) default NULL,
`disc_access_cnt` int(10) unsigned NOT NULL,
`disc_access_ts` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
PRIMARY KEY (`disc_pk`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE `track` (
`track_pk` int(11) NOT NULL auto_increment,
`track_num` int(11) NOT NULL,
`track_frames` int(11) default NULL,
`track_title` longtext NOT NULL,
`track_genre_fk` int(11) NOT NULL,
`track_extra` varchar(128) default NULL,
`track_rating` int(11) NOT NULL,
`track_disc_fk` int(11) NOT NULL,
`track_access_cnt` int(11) NOT NULL,
`track_access_ts` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
PRIMARY KEY (`track_pk`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8_bin;
This is a pretty simple schema but there are foreign key relationships
between disc and genre, track and genre, and track and disc. To
simplify this example, I'm going to ignore some of the columns like
access counts, access timestamps and CD-related information like the
disc_id and frames count. So, to convert this database into SimpleDB,
I need to do the following:
- Dump MySQL data out to Comma-Separated Value files, one for each table
- Define the Python classes that will represent each of these entities
- Write a utility to read the CSV files and use the data to create instances of the Python objects
Because the Python classes will include persistant properties that
automatically store values in SimpleDB, by instantiating
instances of the classes and populating them with the data I
will automatically be storing the data in SimpleDB.2
For step 1 above, I just used standard MySQL commands to dump the data from each table
into a comma-separated value file. The three resulting files are:
For step 2, we need to define Python classes to represent the data in the
relational tables. Our classes will rely heavily on the Properties we defined above.
The Python Classes
Given those brief descriptions of the types of properties that are
available, let's define the Python classes we will need to capture this
music database in SimpleDB. The basic process will be to define a Python
class for each of the tables in our relational data model.
While we are at it, I'm going to address a couple of
shortcomings in the relational module right up front: The
relational model had no explict representation for
Artists. The Artist and Disc Title were combined into a
single field separated by the / character. I'm going to
fix that in the new SimpleDB model and create an explicit
Artist class.
So, let's look at the code!
from boto.sdb.persist.object import SDBObject
from boto.sdb.persist.property import *
class Genre(SDBObject):
number = PositiveIntegerProperty()
title = StringProperty()
class Artist(SDBObject):
name = StringProperty()
class Disc(SDBObject):
title = StringProperty()
num_tracks = SmallPositiveIntegerProperty()
year = StringProperty(max_length=4)
genre = ObjectProperty(ref_class=Genre)
artist = ObjectProperty(ref_class=Artist)
class Track(SDBObject):
number = SmallPositiveIntegerProperty()
title = StringProperty()
genre = ObjectProperty(ref_class=Genre)
disc = ObjectProperty(ref_class=Disc)
mp3_file = S3KeyProperty()
flac_file = S3KeyProperty()
Okay, that's pretty short and sweet. First, let's look at the
Genre class. We can see that it is a subclass of SDBObject.
The fields in genre table in the relational model consists of a
primary key which is really a widely accepted numeric equivalent of
the genre title. For example, the ID tags written in MP3 files will
reference the genre number, not the name. The other field, of course,
is the name or title of the genre. So, for our Genre class we have added
two properties to hold those values. We can, of course, add
other things to this class. We can have regular attributes that are
not persisted in SimpleDB as well as methods and other things that
Python allows in a class. The only rules are that it must subclass
SDBObject and that any data that should be stored in SimpleDB must be
represented as a Property within the class definition.
The Artist class is next and is even easier than the Genre class. All we
have is a StringProperty to hold the name of the Artist. Easy peasy.
Now let's try the Disc class. We have a few more Properties in this
class but some are things we have already seen, strings, integers, etc.
One interesting new property type here is the ObjectProperty. The
original MySQL data model used a foreign key relationship between
the Disc table and the Genre table so each row in the Disc table could
reference the primary key of the Genre that it was associated with. We
need a similar capability here for both Genre and Artist. That's where
the ObjectProperty comes in. An ObjectProperty is a reference to another
object persisted in SDB. By default, the reference can be to any valid
subclass of SDBObject but you can further limit the type of objects that can be referenced
using the ref_class parameter.
In our example, the genre property
must be an instance of the Genre class and the artist
property must be an instance of the Artist class.
Finally, we have the Track class. Most of the properties are similar to
ones we have already seen but this class does include two properties of
type S3KeyProperty. These two properties will be used to store the two
music file that can be associated with each Track in our system; a lossless
FLAC format file and a lossey MP3 format file. We'll see how those properties work
in the next section.
And a One, And a Two...
To kick the tires on our new persistent classes, I wrote a simple little import
utility that reads the CSV files dumped out of the relational database and creates
the corresponding objects. In the process, of course, all of the data gets stored
in SimpleDB. The import tool is here.
Having run the import script, let's try our objects out in Python. I'll intersperse
the Python commands with comments to explain what's going on.
Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from boto.sdb.persist import set_domain
>>> set_domain('music')
SimpleDB let's me create up to 100 domains (currently) and each of those domains
can hold up to 10 GB of data. And, because SimpleDB is schemaless, you can store
all sorts of data in a single Domain. The set_domain
command simply tells the boto.sdb.persist module
which Domain we are using at the moment. The name I'm using here is the same one I
passed as the -d option to the import tool.
>>> from music import *
This just imports my music classes defined above.
>>> rs = Artist.list()
>>> a = rs.next()
>>> a.name
u'Beck'
>>> a.id
u'c708ef11-fabc-4ea1-a445-67a0e4399396'
The first command says Return an iterator containing all instances of the Artist class.
The second command assigns the first object in that iterator to the variable
a and then we print the value of the
name attribute of a.
Also, if we access the id attribute of the artist
object we can see the UUID that is used as the item name in SimpleDB. All of the attributes
of the persisted objects are stored as regular attribute name/values in SimpleDB so you can
access them outside of the boto.sdb.persist module if you
want to.
So, what if we wanted to find all of the Discs associated with our Artist? Simple!
>>> discs = a.get_related_objects('artist')
>>> for disc in discs:
... print disc.title, disc.year
...
Mutations 1998
Sea Change 2002
Guero 2005
>>>
The get_related_objects method is available for
all persisted objects and allows you to find any objects that reference this object.
The argument passed in is the name of the attribute that actually references the
object. In our case, we called the ObjectProperty
attribute artist. Similarly, we can find all of
the Tracks related to one of the Discs.
>>> disc.title
u'Guero'
>>> disc.genre
Genre<fd77b7d1-700c-4713-9c16-6fc927f765a3>
>>> disc.genre.title
u'Rock'
>>> for track in disc.get_related_objects('disc'):
... print track.title, track.number
...
Broken Drum 8
Missing 4
Black Tambourine 5
Rental Car 12
Emergency Exit 13
Girl 3
Go It Alone 10
Earthquake Weather 6
Scarecrow 9
Farewell Ride 11
E-Pro 1
Hell Yes 7
>>>
Finally, let's assume that the FLAC and MP3 files for the Track Black Tambourine were stored in the
in the files /opt/music/bt.flac and
/opt/music/bt.flac, respectively. If
we wanted to associate those files with the Track object, t, which represents that
Track, we could do the following.
>>> t.title
u'Black Tambourine'
>>> import boto
>>> s3 = boto.connect_s3()
>>> b = s3.get_bucket('my_music')
>>> k = b.new_key('bt.flac')
>>> k.set_contents_from_filename('/opt/music/bt.flac')
>>> t.flac_file = k
Or, we could do it this way.
>>> t.mp3_file = 'my_music/bt.mp3'
>>> k = t.mp3_file
>>> k.set_contents_from_filename('/opt/music/bt.mp3')
Hopefully this has given you a flavor for one approach to object-level persistence
with SimpleDB. The boto.sdb.persist module
is still in the early stages and there are lots of other things that can be done
with it such as:
- Automatically add attributes to find related objects rather than rely
on the get_related_objects method.
- Full-text search on all or some Properties within objects
- Better integration with S3 and, potentially, SQS
- A richer set of Properties
Even as it stands, I find it very useful for persisting Python objects and the
fact that the underlying database is a scalable, highly-available service
in the cloud is really music to my ear!
Additional Resources
Mitch Garnaat is a software guy living in upstate New York.
|