Thread: Anyone using GlusterFS?
Replies: 46
Last Post: Jan 23, 2009 9:13 AM by Jason Villalta

Anyone using GlusterFS?
Posted: Jan 29, 2007 1:32 PM PST
In my efforts to research and optimize muckFS, I stumbled onto GlusterFS (http://www.gluster.org/docs/index.php/GlusterFS). Has anyone tried it on EC2?
It is FUSE-based, so there's no need for custom kernels as with Lustre, etc. It looks fairly mature and robust, and it also looks like it might be possible to write a GlusterFS "translator" to have it mirror content back to S3.
I'm not a big fan of needlessly reinventing the wheel, so if this works as well as they say, I may drop muckFS.
I'll take a stab at building and testing GlusterFS on EC2, and then see how hard it would be to make a "replicate-to-S3" GlusterFS "translator".

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 7:14 AM PST
in response to: dwmike
Never used it before, but it seems pretty cool.
If you try it, let us know how it goes.

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 7:18 AM PST
in response to: againnickname
I got it up and going last night -- I'm playing around with tuning, replication, etc., but my first reaction is: fast! :)

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 8:56 AM PST
in response to: dwmike
Nice. Did you get it running within EC2?
The pertinent questions would be:
Does GlusterFS aggregate storage across all connected nodes (i.e. Machine1 has 1GB, Machine2 has 1GB, Machine3 has 1GB, does the total volume that can be mounted equal 3GB)?
Can volumes/bricks be added "on the fly"? For example, if you bring a new node/brick online, can the volume be dynamically expanded, and can it be reduced when a brick is deleted?
Does GlusterFS support any type of parity or RAID-style architecture, in the event a node/brick goes down?
I spent some time on their site yesterday; the documentation and technical descriptions are outstanding, but there doesn't seem to be a lot of developer activity going on. At least, the mailing list archives don't reflect widespread adoption.
I initially looked into Red Hat GFS for a distributed filesystem. The main problem with GFS is that you can't dynamically add or remove nodes/storage, which is sort of the point of EC2: whatever storage you initially build, you are stuck with, although you might be able to get a dynamic filesystem with some sort of LVM/CLVM mojo. Given the non-persistent nature of EC2 and its dynamic IP addressing scheme, GFS is currently not an option for EC2 deployment (nor, for that matter, is any other distributed filesystem I've reviewed). GFS is also a major roadblock to a secure architecture: because of the way the filesystem is distributed, all nodes have root-level access to the entire spanned volume. If one node is compromised, the entire cluster can be subverted (e.g., mknod can be used to create kmem devices anywhere on the filesystem).
Let us know your findings, GlusterFS is very interesting.

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 9:10 AM PST
in response to: turbochicken
I've spent the morning on IRC with the glusterFS developers -- I'll have an update soon... very encouraging...

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 9:21 AM PST
in response to: dwmike
I also started reading the Gluster documentation, and it seems interesting. My first thought was that, with a fine-tuned S3 transporter as the caching strategy, Gluster would first read from and dirty-write to local storage, then to remote storage, and only lazily write to S3 (because of RAID across the EC2 Gluster nodes), reading from S3 only when there is no hit on the EC2 Gluster nodes. dwmike, have you given the S3 transporter any more thought?
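A minimal sketch of the tiered read/lazy-write idea described above, assuming each tier (local disk, remote EC2 Gluster nodes, S3) can be modeled as a plain dict. `TieredStore`, `read`, `write` and `flush` are hypothetical names for illustration, not a Gluster API; a real version would replace the dicts with disk I/O, network calls and S3 requests.

```python
class TieredStore:
    """Hypothetical three-tier store: local cache, remote EC2 peers, then S3."""

    def __init__(self, local, remote, s3):
        self.local = local    # fastest tier: this instance's disk
        self.remote = remote  # other Gluster nodes inside EC2
        self.s3 = s3          # durable but slow tier
        self.dirty = set()    # keys written locally but not yet pushed to S3

    def read(self, key):
        # Try the cheap tiers first and only fall through to S3 on a miss.
        for tier in (self.local, self.remote):
            if key in tier:
                value = tier[key]
                self.local[key] = value  # promote into the local cache
                return value
        value = self.s3[key]  # miss on every EC2 node: fetch from S3
        self.local[key] = value
        return value

    def write(self, key, value):
        # Dirty write: update local and remote copies now, S3 later.
        self.local[key] = value
        self.remote[key] = value
        self.dirty.add(key)

    def flush(self):
        # Lazily push every dirty entry to S3, then forget it is dirty.
        for key in sorted(self.dirty):
            self.s3[key] = self.local[key]
        self.dirty.clear()
```

Keeping writes in a `dirty` set until `flush` runs is what makes the S3 write lazy: readers are served from the EC2 tiers in the meantime.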
I am particularly interested in large, fast and reliable storage for a database. bonobo2000, are you lurking? What do you think: would Gluster make sense in this context?
regards
Roberto

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 9:43 AM PST
in response to: rsaccon
Ok - I just spent the last 2 hours on IRC with the GlusterFS dev team -- they are definitely around and active. They have been thinking about and working on GlusterFS for about a year and a half, and the first active commit was back in September. It is very active. These guys are very responsive and very intrigued by S3 and EC2. Some of them come from the old VA Linux days, and they say they have a good relationship with the Amazon folks -- anyone from Amazon want to jump in??
The whole glusterFS framework is designed to be "pluggable" through "translators". Their line was "why write a filesystem when you can write a translator". So by writing a storage/s3 translator, the rest of glusterFS just "works" and you/me don't have to reinvent the wheel for all the rest of the filesystem details. That is what we are working through now. (They took a lunch break hehe).
Let me do my best to answer the questions I've seen so far:
Q: Did you get it running within EC2?
A: Yes -- I use the CentOS 4.4 AMI as my base. After adding fuse.ko and the FUSE RPMs, I was able to build and install GlusterFS quite easily on some storage "brick" instances and on a client GlusterFS instance. Performance was way faster than my muckFS. They also suggested I build a custom fuse.ko from their code instead of using the stock FUSE kernel module, for even better performance.
Q: Does GlusterFS aggregate storage across all connected nodes (i.e. Machine1 has 1GB, Machine2 has 1GB, Machine3 has 1GB, does the total volume that can be mounted equal 3GB)?
A: Yes -- in my test config of 1 client and 2 server bricks, my client sees the aggregate of the 2 server bricks as:
$ df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        4.0G  2.0G  1.8G  54% /
/dev/sda2        147G  206M  140G   1% /mnt
glusterfs:28845  294G  520M  279G   1% /mnt/glusterfs
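The aggregation can be checked against the df output above with a bit of arithmetic. This sketch parses the posted output and assumes, since the post doesn't say, that each of the 2 server bricks exports a 147G /mnt partition like the client's own /dev/sda2; under that assumption the GlusterFS mount is exactly the sum of the bricks.

```python
# df -h output as posted (columns re-spaced).
DF_OUTPUT = """\
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        4.0G  2.0G  1.8G  54% /
/dev/sda2        147G  206M  140G   1% /mnt
glusterfs:28845  294G  520M  279G   1% /mnt/glusterfs
"""


def to_gib(size):
    """Convert a df -h size such as '147G' or '206M' to GiB (powers of 1024)."""
    units = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1, "T": 1024}
    return float(size[:-1]) * units[size[-1]]


def parse_df(text):
    """Map each mount point to its size and available space in GiB."""
    rows = {}
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        rows[fields[5]] = {"size": to_gib(fields[1]), "avail": to_gib(fields[3])}
    return rows


rows = parse_df(DF_OUTPUT)
bricks = 2  # assumption: each brick exports a 147G /mnt like this instance
print(rows["/mnt"]["size"] * bricks, rows["/mnt/glusterfs"]["size"])  # 294.0 294.0
```

The Avail column (279G vs. 2 x 140G) differs by a gigabyte, which is consistent with df -h rounding, so only the Size column is compared.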
Q: Can volumes/bricks be added "on the fly", for example if you bring a new node/brick online can the volume be dynamically expanded or reduced based upon adding or deleting bricks, respectively?
A: At this point, it sounds like you would need to do a remount, although they discussed the idea of on-the-fly changes in the future. This is a large part of the reason for my muckOS project: to manage the distribution of config files and the restarting of services when adding nodes to clusters for things like GlusterFS or MySQL.
Q: Does GlusterFS support any type of parity or RAID-style architecture, in the event a node/brick goes down?
A: They are releasing the Automatic File Replication "translator" in 15 days for RAID 1-style mirroring. You can configure how many times a file should be replicated, even by file type. Also, the distribution of files to "bricks" has 4 different "translator" plugin options: random, round robin, NUFA (i.e. NUMA for files) and ALU, a least-usage algorithm.
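To make the four distribution options concrete, here is a simplified sketch of how each policy might pick a brick for a new file. The `Brick` class and scheduler functions are hypothetical, not GlusterFS code; NUFA is reduced to "prefer the brick on this instance", and since the post only describes ALU as a least-usage algorithm, it is collapsed here to a single criterion (most free space).

```python
import itertools
import random


class Brick:
    def __init__(self, name, free_gb, local=False):
        self.name = name
        self.free_gb = free_gb  # free space reported by the brick
        self.local = local      # True if the brick runs on this instance


def random_scheduler(bricks):
    # Pick any brick uniformly at random.
    return lambda: random.choice(bricks)


def round_robin_scheduler(bricks):
    # Cycle through the bricks in a fixed order.
    cycle = itertools.cycle(bricks)
    return lambda: next(cycle)


def nufa_scheduler(bricks):
    # Prefer the local brick; otherwise round robin over the remote ones.
    local = [b for b in bricks if b.local]
    remote = round_robin_scheduler([b for b in bricks if not b.local] or bricks)
    return lambda: local[0] if local else remote()


def least_usage_scheduler(bricks):
    # Simplified stand-in for ALU: the brick with the most free space wins.
    return lambda: max(bricks, key=lambda b: b.free_gb)
```

Usage: `pick = round_robin_scheduler(bricks)` returns a function, and each call to `pick()` chooses the brick for the next new file.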
Q: How active is the project?
A: While the mailing list may not be active, the IRC channel sure is -- a good chunk of their development team spent the morning with me.
I know there are more questions, but those are the ones I have answers for so far.
These guys seem very interested in writing a storage/s3 "translator" for glusterFS. Since glusterFS is already async in its writes, dealing with S3 as async persistent storage is not a problem.
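A rough sketch of why async writes make S3 a workable backing store: the write is acknowledged as soon as it is queued, and a background thread does the slow upload. `WriteBehind` is a hypothetical name and `upload` stands in for an S3 PUT; neither is a GlusterFS or S3 API, and a real version would need retries instead of just recording failures.

```python
import queue
import threading


class WriteBehind:
    """Acknowledge writes immediately; upload them to slow storage in the background."""

    def __init__(self, upload):
        self.upload = upload  # stand-in for an S3 PUT: upload(key, data)
        self.pending = queue.Queue()
        self.failed = []      # (key, exception) pairs; real code would retry
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def write(self, key, data):
        # Returns without waiting for the upload to happen.
        self.pending.put((key, data))

    def _drain(self):
        while True:
            key, data = self.pending.get()
            try:
                self.upload(key, data)
            except Exception as exc:  # keep the worker alive on errors
                self.failed.append((key, exc))
            finally:
                self.pending.task_done()

    def flush(self):
        # Block until every queued write has been attempted.
        self.pending.join()
```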
More info to come as I learn it... :)

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 9:56 AM PST
in response to: dwmike
GlusterFS sounds extremely interesting. There will still have to be some workaround for EC2's dynamic addressing. What would be nice is if the configuration mechanism were distributed via some form of PKI-enabled peer-to-peer method, so that nodes could join and drop off without having to rebuild the entire configuration from scratch and restart everything. As long as a new node knows at least 1 IP address in the spanned filesystem, it could rejoin on bootup and then sync with the other member peers. Jabber would be nice for passing that configuration information between peers, as well as for performing heartbeat functions in case an EC2 instance dies unexpectedly.
When adding a new node/brick, I wonder if there is any support for a rolling restart instead of having to cycle all nodes in the system at the same time?
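The join-via-one-known-peer and heartbeat ideas above could look roughly like this membership table. It is a hypothetical sketch of one node's view only: no networking, PKI or Jabber, and the addresses and 30-second timeout are illustrative. A joining node merges the peer list it got from its contact, and any peer silent for longer than `timeout` seconds is dropped.

```python
import time


class Membership:
    """Hypothetical peer list kept alive by heartbeats (one node's view)."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # address -> time of last heartbeat

    def join(self, address, known_peers=()):
        # Merge the peers a contact node already knows, then announce ourselves.
        for peer in known_peers:
            self.last_seen.setdefault(peer, self.clock())
        self.heartbeat(address)

    def heartbeat(self, address):
        self.last_seen[address] = self.clock()

    def live_peers(self):
        # Drop peers whose last heartbeat is older than the timeout.
        now = self.clock()
        dead = [a for a, t in self.last_seen.items() if now - t > self.timeout]
        for address in dead:
            del self.last_seen[address]
        return sorted(self.last_seen)
```

Passing a fake `clock` makes the timeout logic easy to exercise without waiting.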

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 10:08 AM PST
in response to: dwmike
>Also, the distribution of files to "bricks" has 4 different "translator" plugin options - random, round robin, NUFA (ie NUMA for files) and ALU - a least usage algorithm.
Fascinating. They need a "Least Likely EC2 Instance to Fail and Not Come Back" option ;)

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 11:08 AM PST
in response to: turbochicken
Alright - the GlusterFS developers are planning on adding a "storage/s3" drop-in replacement for their existing "storage/posix" translator. So from a client's perspective, it will look no different than dealing with multiple "storage/posix" bricks. The storage brick will interact with S3 to put/get files and cache the results on the local POSIX filesystem. If your S3 filesystem is bigger than the 160GB of ephemeral storage, GlusterFS will purge old cache entries to make room for newer requests. You should also be able to have multiple S3 storage bricks for increased bandwidth to client nodes. And you can run both an S3 storage brick and a mounted filesystem on the same instance, so you don't *have* to have a separate instance as a file server.
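The "purge old cache to make room" behavior can be sketched as a size-bounded LRU cache in front of S3. This covers the read path only and is an illustration under stated assumptions, not the planned storage/s3 translator: `S3BackedCache` is a hypothetical name, `s3_get` stands in for an S3 GET, and local disk is modeled as an `OrderedDict`.

```python
from collections import OrderedDict


class S3BackedCache:
    """Cache S3 objects locally, evicting least recently used ones past a size cap."""

    def __init__(self, s3_get, capacity_bytes):
        self.s3_get = s3_get        # stand-in for an S3 GET: s3_get(key) -> bytes
        self.capacity = capacity_bytes
        self.cache = OrderedDict()  # key -> bytes, least recently used first
        self.used = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        data = self.s3_get(key)          # cache miss: fetch from S3
        self.cache[key] = data
        self.used += len(data)
        self._evict()
        return data

    def _evict(self):
        # Purge the oldest entries until we fit, always keeping the newest one.
        while self.used > self.capacity and len(self.cache) > 1:
            _, old = self.cache.popitem(last=False)
            self.used -= len(old)
```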
The initial translator will be basic but functional; additional features, like seek() on read using ranges, will come later. Things like on-the-fly scaling would then be a general feature enhancement for GlusterFS rather than something S3-specific -- see the GlusterFS roadmap for their plans:
http://www.gluster.org/docs/index.php/GlusterFS_Roadmap
I'll keep you posted as I learn more...

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 11:53 AM PST
in response to: dwmike
That's awesome!

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 1:12 PM PST
in response to: dwmike
Mike, a great feature for an S3-backed translator would be replacement of a brick node with a new instance which recovers state from S3 automatically (on-demand or whatever). I guess that's pretty obvious.
Then, S3 changes the playing field a little with respect to bricks and server nodes. For high availability you ideally want more than one server to be able to serve the same file, something like RAID mirroring. But if the server nodes are nothing more than caches over S3, there's not much point in different servers saving the same data to more than one S3 location - S3 is already a safe (redundant) storage system.
Also, I suspect you'll be able to get away with a cache of much less than 160GB in most cases, even if your file system is significantly larger. I suspect there is some locality of reference with file access most of the time. It would be interesting to work out what a 'typical' effective cache size is before you start seriously thrashing to S3 all the time.
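One way to find that "typical" effective cache size would be to replay an access log through a simulated cache at several sizes and look for the point where the hit ratio stops improving. This is a sketch under two assumptions: the trace is a list of file names from some access log (hypothetical here), and the cache uses LRU eviction and is measured in files rather than bytes.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity):
    """Replay file accesses through an LRU cache holding `capacity` files
    and return the fraction of accesses served from the cache."""
    cache = OrderedDict()
    hits = 0
    for name in trace:
        if name in cache:
            hits += 1
            cache.move_to_end(name)        # most recently used goes last
        else:
            cache[name] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)


trace = ["a", "b", "a", "b", "c", "a"]
for size in (1, 2, 3):
    print(size, lru_hit_ratio(trace, size))
```

With real logs, a workload with strong locality of reference should show the hit ratio flattening out at a cache far smaller than the whole filesystem.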
Something that intrigued me about the GlusterFS docs was that the client seems to specify exactly which servers will club together in a virtual FS. The servers themselves don't seem to need to know about each other. I don't quite understand how this works when there are multiple clients accessing the virtual file system. How does each client know where to find a file? What happens when a client is set up to use only a subset of the servers?
But GlusterFS looks like a great starting point for development of an EC2 distributed file system.
Regards
Roland

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 1:15 PM PST
in response to: dwmike
Great news, thanks for sharing the info.

Re: Anyone using GlusterFS?
Posted: Jan 30, 2007 1:33 PM PST
in response to: RolandPJ@AWS
Thanks Roland - some feedback interspersed:
>Mike, a great feature for an S3-backed translator would be replacement of a brick node with a new instance which recovers state from S3 automatically (on-demand or whatever). I guess that's pretty obvious.
Yep - the idea is that on start, an S3 brick would initialize itself from S3. The metadata as well as the file contents will be stored on S3, so a new brick can bring itself up from scratch using just S3.
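A sketch of what "bring itself up from scratch using just S3" might involve, assuming a hypothetical bucket layout where each file has a JSON metadata object at `meta/<path>` and its contents at `data/<path>`. The layout, `bootstrap_brick`, and the `list_keys`/`get_object` callables (stand-ins for S3 LIST and GET) are illustrations, not the planned translator's design. Only metadata is loaded at startup; contents can be fetched on first read.

```python
import json


def bootstrap_brick(list_keys, get_object):
    """Rebuild a brick's file index from S3 alone.

    list_keys(prefix) -> iterable of keys, get_object(key) -> str (JSON).
    """
    index = {}
    for key in list_keys("meta/"):
        path = key[len("meta/"):]
        index[path] = json.loads(get_object(key))
    # File contents under data/ are not downloaded here; fetch them lazily.
    return index
```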
>Then, S3 changes the playing field a little w.r.t bricks and server nodes. For high availability you ideally want more than one server to be able to serve the same file. Something like raid mirroring. But if the server nodes are nothing more than caches over S3, there's not much point in different servers saving the same data to S3 in more than one S3 location - S3 is already a safe (redundant) storage system.
Right - so different configurations for different people: have one or two bricks for redundancy and bandwidth, or, for that matter, put a brick on each instance that has a mount. Take a look at the performance charts:
http://www.gluster.org/docs/index.php/GlusterFS_Benchmarks
Having the data cached on a GlusterFS brick lets the mounted filesystems get pretty amazing throughput, while the bricks do lazy, async interaction with S3.
>Also, I suspect you'll be able to get away with a cache of much less than 160GB in most cases, even if your file system is significantly larger. I suspect there is some locality of reference with file access most of the time. It would be interesting to work out what a 'typical' effective cache size is before you start seriously thrashing to S3 all the time.
Right - I just picked 160GB arbitrarily. I suspect the max cache size will be a configuration item.
>Something that intrigued me about the GlusterFS docs was that the client seems to specify exactly which servers will club together in a virtual FS. The servers themselves don't seem to need to know about each other. I don't quite understand how this works when there are multiple clients accessing the virtual file system. How does each client know where to find a file? What happens when a client is set up to use only a subset of the servers?
I *think* this is the answer (I'm still kinda new to this too!):
http://www.gluster.org/docs/index.php/GlusterFS_FAQ#When_I_ask_for_an_.22ls.22.2C_does_it_query_all_the_fileservers_in_parallel_to_deliver_the_logical_summation_of_the_name-space.3F
Basically, stat info is via parallel queries to all bricks, and the client aggregates the responses back into the listing.
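The parallel-query-and-merge behavior described in that FAQ answer can be sketched as follows. Each brick is modeled as a callable returning the names it holds under a path (in a real client this would be a network request), and `list_directory` is a hypothetical name. The merge uses a set on the assumption that the same name (for example a directory) can come back from several bricks.

```python
from concurrent.futures import ThreadPoolExecutor


def list_directory(bricks, path):
    """Query every brick for `path` in parallel and merge the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(bricks))) as pool:
        results = pool.map(lambda brick: brick(path), bricks)
        merged = set()
        for names in results:
            merged.update(names)  # de-duplicate names reported by several bricks
    return sorted(merged)
```

This also shows why client configs should match: a client that lists only a subset of the bricks would simply never see the files stored on the others.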
I think you want your client and server configs to match to ensure all clients are using the same aggregation of bricks for their filesystem.
>But GlusterFS looks like a great starting point for development of an EC2 distributed file system.
Thanks again for the feedback. I learned a lot by writing muckFS, but I would far rather collaborate with an existing project like this.