Jimmy Tang

Yet more random mutterings

Crowbar for Deploying Systems


I've been eyeing crowbar recently; it looks pretty useful and interesting for deploying servers and applications. I haven't seen much, if any, documentation out there suggesting that people in the digital preservation and archiving fields are deploying systems at scale. I'm under the impression that most sites are building their systems up one piece at a time without much automation.

It seems to use chef in the backend for all the automation. I've been relearning puppet recently so that I can have reproducible environments with Vagrant.

There might be an advantage in learning chef and porting to it all the existing puppet modules that I have already created and configured. If I did move to chef for the automation in my vagrant environments, then a few years from now when we go to full production we might be able to deploy the whole system from bare metal relatively quickly and repeatably.

Automating the deployments will mean that we will have documentation on the infrastructure itself. Either way, there is still a need to automate the fedora-commons, SOLR, mysql and postgres deployments at some point.

After all this thinking and pondering, I'm still using puppet. There's still the likes of ansible, cfengine, bcfg2 and juju. There is a never ending supply of these tools.

Ceph V0.53 RELEASED


There's a new release of Ceph. I hope that they release a stable version soon so we can do further evaluations of the Ceph storage system. A few of my work colleagues are going to the Ceph workshop next week.

I'm wondering if anyone has taken the CRUSH algorithm and used it in other domains.

Hydracamp 2012 - Penn State


What do you do when you need a crash course on RoR, Hydra and frameworks for digital preservation and archiving? You go to Hydracamp!

The syllabus was

  • Day 1 - Rails, CRUD, TDD and Git
  • Day 2 - Collaborative development with Stories, Tickets, TDD and Git
  • Day 3 - Hydra, Fedora, XML and RDF (ActiveFedora and OM)
  • Day 4 - SOLR and Blacklight
  • Day 5 - Hydra-head, Hydra Access Controls

Most of the training sessions were hands on from day 1, which was refreshing; because it was hands on I got the most out of the training. It would have been better if I had known more ruby, so that I could have moved through some of the exercises more effectively.

To give an overview of what we had done (between ~30 people), we created a ruby on rails application titled “Twitter for Zombies”. With this small application everybody was frantically committing, pulling, merging and pushing code. It was highly informative and a good learning experience to see how fast things could move.

The training session also included a crash course in what Fedora and SOLR do and how Hydra interacts with these components. The third and fourth days were the most interesting, as they showed how someone might convert a typical RoR application into an application which uses Fedora as the persistence layer. The last day was really just a wrap up and Q&A session.

You could take a look at the github account for Project Hydra and have a peek at the hydracamp repo.

Digital Preservation and Archiving Is a HPC Problem?


I shall be going to SC2012 next month. I plan on hitting a few of the storage vendors for possible collaborations and flagging to them that we're on the lookout for storage systems. One of the first questions the reader will ask is “where is the link between HPC and Digital Preservation and Archiving?” It's probably not obvious to most people, but one of the big problems in the area of preservation and archiving is the amount of data involved and the varied types of data. That is before taking the issues with data access patterns into account.

Given that a preservation and archiving project will want to provide a trusted system, the system will at some point want to read out every single byte that was put in to verify that the data is correct (usually with some form of hashing).

Reading data out and checking that it's correct serially probably isn't the smartest solution. Nor is copying the data into 2-3 locations (where each site is maintaining 2-3 copies for backups and redundancy). The current and seemingly most popular solution is to dump the data to a few offsite locations (such as S3 or SWIFT compatible storage systems), then just hope for the best: that if any one of the sites is down or corrupted, that site can be restored from the other sites or from a backup. I need to delve deeper into the storage and data-distribution strategies that some of the bigger projects are taking. There has to be a smarter way of storing and preserving data without having to make copies of things.
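As a rough illustration of the read-everything-back check described above, here is a minimal sketch in bash. It assumes GNU coreutils and findutils (sha256sum, sort -z, xargs -r); the function names make_manifest and verify_manifest and the manifest layout are made up for this example and are not taken from any particular preservation system. The manifest should be stored outside the directory being checked.

```shell
#!/usr/bin/env bash
# Sketch of a fixity check (assumed helper names, GNU tools assumed):
# record a SHA-256 digest per file at ingest, then later re-read every
# byte and compare against the recorded digests.

# Write "digest  ./relative/path" lines for every regular file under a directory.
make_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Re-read every file listed in the manifest and compare digests.
# Returns non-zero if any file is missing or no longer matches.
verify_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && sha256sum --quiet -c - ) < "$manifest"
}
```

Something like `make_manifest /archive /var/tmp/archive.sha256` at ingest and `verify_manifest /archive /var/tmp/archive.sha256` on a schedule would do the job (both paths are placeholders). Note that this walks the files one after another, so it is exactly the serial approach that stops scaling as the archive grows.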

I've often wondered how projects manage to copy/move data across storage providers in a reasonable amount of time without needing to wheel a few racks of disks around. It would also be interesting to see the error rates of these systems and how often errors are corrected. If they are corrected, what is the computational cost of doing so?

If you have a multi-terabyte archive the problem isn't too bad, but the more typical case these days might be in the order of the low hundreds of terabytes. I can only imagine what larger scale sites must deal with. I'm still not a fan of moving a problem from a local site to a remote site, as it often shows a lack of understanding of the problem. Storage in the preservation and archiving domain will probably turn into an IO and compute intensive operation at some point, especially if you want to do something with the data.

SLURM-Bank That Big Script for Banking in SLURM


A co-worker of mine (Paddy Doyle) had originally hacked at a perl script for reporting balances from SLURM's accounting system a year or two ago and he had figured out that it might be possible to do some minimalistic 'configuration' and scripting to get a system that's very basic but functional.

It was just one of those things: funding agencies wanted justification of how the system was being used, and GOLD was clunky, obtrusive and complicated for what we wanted. Most of all we liked SLURM, but not the GOLD and Maui combination which was needed to get full accounting and banking (most of the features weren't used).

Being good and lazy engineers, we got excited at the prospect of replacing SLURM, Maui and GOLD with just plain old SLURM. We set out to write down the workflows for what we wanted to do and what the users and funding agencies actually wanted. With those ideas in mind we set out to implement as much as we could and needed in just plain old sh/bash scripting with a splash of perl. Replacing two components with one meant that we would have less work to do in the long run ;)

After a whole year of running with these scripts and just putting them online, I've noticed that there may be a few sites out there that might be using our scripts and workflows. It would be nice to find out how many people are using our implementation of a banking system in SLURM, and whether it's driven by sysadmins looking to account for usage or by funding agencies looking for justification of the usage of a system.

I was going to be at the SLURM User Group Meeting 2012 to give a short talk on our experiences with the SLURM-Bank scripts and workflow, but sadly I have to be in the US during this meeting and my colleague Paddy Doyle will be there instead of me. I would have liked to go and chat with the developers of SLURM to push for more advanced banking/accounting facilities in SLURM itself. Visiting BSC again would have been fun.

Ceph V0.52 RELEASED


The latest development branch of Ceph is out with some rather nice looking features; probably the most useful are the RPM builds for those that run RHEL6 like systems.

Still no real sight of backported kernel modules :P Also some of the guys in work here just deployed a ~200tb Ceph installation, on which I have access to a 10tb RBD for doing backups.

A Poor Man's NAS Device With Ceph


Given that I have a number of old 64bit capable desktop machines and a collection of hard drives at home, I could have run Tahoe-LAFS like I do in work for backup purposes. In fact Tahoe works quite well for the technically capable user.

Recently I've decided that I need a more central location at home to store my photo collection (I love to take photos with my Canon DSLR and Panasonic LX5). Traditionally I would have just fired up git-annex to track the data and then setup a number of remotes to store the data, where one of them might be Tahoe-LAFS and the rest might be portable hard drives, remote machines etc…

I could have gone with any number of distributed storage solutions such as GlusterFS, iRODS, xrootd, Lustre or xtreemfs. I've worked with some of these systems in production and toyed with others. Since this is for a home system I can pick what I want and change it at will.

I probably have 2-3tb of data to archive and store. I also want easy access to my data, so NFS or CIFS exports are required. It would be feasible to acquire a few 2 or 3 terabyte drives for my old desktop machines, which would effectively provide me with a 2 or 3 terabyte replicated data store. Given the amount of toying around and learning about Ceph in my spare time, I would expect that Ceph would provide me with a pretty good “backend” system for storing my files and the option of “migrating my data from one machine to another machine” by adding and removing OSD's. The handiest feature for me will be the capability of expanding and shrinking the system as I need.

There probably aren't many people who would want to setup something like this for a home system, but it is an alternative to the usual RAID or LVM setup.

Here's my proposed setup which I'm going to setup in the next few spare weekends that I will have.

It would be great if Ceph offered some sort of parity/erasure coding instead of plain replication. I'm greedy and I want to make the most of the disks that I have. I wonder how low I can go on hardware with the Ceph software.

Ceph V0.48.2 ARGONAUT RELEASED


There's a new stable release of Ceph Argonaut, though I seem to be having better luck playing with the development releases of Ceph.

Oh how I wish that there was a backport of the kernel ceph and rbd drivers for RHEL6, I have a dodgy repo and some reverted commits that one of the guys in work told me about. It seems to run but it isn't great, it can be found at https://github.com/jcftang/ceph-client-standalone.

Going From Replicating Across OSD's to Replicating Across Hosts in a Ceph Cluster


Having learnt how to remove and add monitors, meta-data and data servers (mon's, mds's and osd's) for my small two node Ceph cluster, I want to say that it wasn't too hard to do; the ceph website does have documentation for this.

As the default CRUSH map replicates across OSD's I wanted to try replicating data across hosts just to see what would happen. In a real world scenario I would probably treat individual hosts in a rack as a failure unit and if I had more than one rack of storage, I would want to treat each rack as the minimum unit.

One of the coolest features of ceph is that it allows me to play with different mappings and configurations of where my data gets allocated. There aren't many (if any) storage systems that I know of which provide this type of capability.

So the steps that I went through to get to what I wanted…

First I had to dump the CRUSH map from my cluster of two nodes and three OSD's (very unbalanced, so I can play with the weights).

ceph osd getcrushmap -o /tmp/mycrushmap

The CRUSH map that is created is a binary file; it must be decoded to plain text before you can edit it.

crushtool -d /tmp/mycrushmap > /tmp/mycrushmap.txt

Here's the map that is decoded from the binary file

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host x.y.z.194 {
        id -2           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 1.000
        item osd.0 weight 1.000
}
host x.y.z.138 {
        id -4           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
rack rack-1 {
        id -3           # do not change unnecessarily
        # weight 3.000
        alg straw
        hash 0  # rjenkins1
        item x.y.z.194 weight 2.000
        item x.y.z.138 weight 1.000
}
pool default {
        id -1           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item rack-1 weight 2.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}

# end crush map

The relevant chunks of the config that I'm interested in are the rule NAME {} blocks. As I'm interested in making my data, meta-data and probably my rbd rule replicate across hosts, I naturally made the rule look like this

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step emit
}

The above change is apparently incorrect, as the last step before the step emit needs to select a device of some sort. I found this out after posting to the ceph-devel mailing list. There were two proposed solutions (thanks to Greg from inktank); the first proposed rule was

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step choose firstn 1 type osd
        step emit
}

This selects n hosts and then one OSD from each host, but it can't cope with a host whose OSD's have all failed. The second proposed rule was

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

The above rule will select n hosts and an OSD from each host. It's pretty obvious that the second rule is the one that I want. I would expect that if I had more machines in racks and rows I could probably just replace host with rack, row or even datacenter.

With the second proposed rule, I made the changes to mycrushmap.txt. Once the changes were made, I had to compile the map into the binary format that the ceph cluster understands; this can be done with
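The edit itself is small enough to script. The following is only a sketch: set_failure_domain is a hypothetical helper name, and it assumes the rules in the decoded map contain the exact line "step choose firstn 0 type osd", as the map shown earlier does. It rewrites every such line (so the data, metadata and rbd rules all change) and prints the result on stdout, leaving the input file untouched.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): rewrite the per-OSD choose step in a decoded
# CRUSH map so replicas are spread across a larger failure domain.
#   $1 = decoded (text) CRUSH map
#   $2 = bucket type to replicate across (host, rack, row, ...); default: host
set_failure_domain() {
    local map=$1 bucket_type=${2:-host}
    # Keep the original indentation (\1) and swap choose/osd for chooseleaf/<type>.
    sed -E "s/^([[:space:]]*)step choose firstn 0 type osd[[:space:]]*\$/\1step chooseleaf firstn 0 type ${bucket_type}/" "$map"
}
```

For example, `set_failure_domain /tmp/mycrushmap.txt host > /tmp/mycrushmap.host.txt` produces a map using the second rule; passing rack instead of host would be the equivalent change for a multi-rack setup. The output file name here is just a placeholder.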

crushtool -c /tmp/mycrushmap.txt -o /tmp/mycrushmap.new

Once the map is compiled it must then be applied to the cluster

ceph osd setcrushmap -i /tmp/mycrushmap.new

The above is documented on the ceph website. Once I applied the new CRUSH map I ran a ceph -w to see that the system had detected the changes, and it then started to move data around on its own. I'll need to play with pulling out the network cable or SATA cables to see how the system behaves when I cause catastrophic failures in the test system.

I'm pretty sure I took the long way around to making the changes, there must be a more dynamic way of changing the system.
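Until I find that more dynamic way, the round trip can at least be wrapped up in one place. This is a sketch rather than a tested tool: crush_round_trip, the run helper and the DRY_RUN switch are names I've made up for illustration, and only the ceph and crushtool invocations come from the steps above (with crushtool's -o option used for the decoded output instead of a shell redirect, so that every step can be printed in dry-run mode).

```shell
#!/usr/bin/env bash
# Sketch (assumed helper names): dump, decode, edit, compile and apply a CRUSH map.

# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 = working directory for the intermediate files (default: /tmp)
crush_round_trip() {
    local work=${1:-/tmp}
    run ceph osd getcrushmap -o "$work/mycrushmap" || return
    run crushtool -d "$work/mycrushmap" -o "$work/mycrushmap.txt" || return
    # Hand-edit the decoded map; stop here if the editor exits non-zero.
    run "${EDITOR:-vi}" "$work/mycrushmap.txt" || return
    run crushtool -c "$work/mycrushmap.txt" -o "$work/mycrushmap.new" || return
    run ceph osd setcrushmap -i "$work/mycrushmap.new"
}
```

Running `DRY_RUN=1 crush_round_trip /tmp` prints the five commands without touching the cluster, which is a cheap way to check the paths before doing it for real.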

To recap and review the above operation, it's again no harder than my reference system that I know, which is GPFS. GPFS doesn't allow me to do what ceph allows me to do. I would however like to see some more visible documentation relating to the CRUSH configuration parameters and tuneables.

So far this has been a distraction from my main day job, but this will certainly help with the project that I am working on in the long run.