The Infrastructure Group

I am currently Senior Technical Architect in The Infrastructure Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). In this role I manage the senior systems and network administration staff who run the lab's computational infrastructure, and I also play a hands-on role in deploying major new technology initiatives such as our OpenStack private cloud and secure enclaves for computing on sensitive data sets. The enclave work is not yet published, as its policy relies on a beta version of MIT's Written Information Security Policy (WISP), which I'm also participating in and which is likewise not yet officially published, but I'm promised "it will be soon".

Current Infrastructure

The Infrastructure Group website will probably stay more current than this page ever will on what is actually deployed now. Broadly, we provide a wide range of customization across research group systems, from fully managed to "there's the power and data plugs, go for it". Managing the middle ground, where semi-random graduate students have administrative access but we also agree to keep systems patched and running, is a unique challenge of this environment.

History

Since 2000 I've made many changes to our technology profile. I'd go so far as to say that identifying and timing these changes has been my key value to the lab over that time. Perhaps later I'll expand on the actual transitions we've been through over the years, but for now I'm going to stick with some lessons learned about the pitfalls of trying to pick which technology to jump on, and when.

The first pitfall I've seen in evaluating a technology is deciding something is "bad" when it is in fact "immature". This is pretty easy to remedy: if you find a technology that sounds good but doesn't live up to expectations, remember to check back in later.

The next issue is a bit stickier. If you have a need that can only be met by a new technology that works "well enough", a common problem is upgrades. This isn't just a software problem, or even a computer problem. Looking across MIT, there are a lot of somewhat outdated instruments that were brought in when they were state of the art, but while the art was young and moving fast. In software this is somewhat more tractable than in, say, wind tunnels. The solution is to make sure you budget staff time for the extra-hard upgrades during ramp-up. This is of course easier said than done, and is probably the cause of most of our tricky technical debt, since it's hard to say "no, you can't have a new GPU cluster until we upgrade this other thing that's working OK".