Recently I was asked what I would do differently if I were ramping up the QA group all over again. One of the points I mentioned concerned our hardware process. The following is a further explanation of those points.

  1. Get in bed with a hardware vendor: If you are at the point where you are ramping up a QA environment, you are also at the point where you can no longer really afford to just get some generic-brand machines from the computer shop around the corner. At this point you need to commit to a supplier of hardware. By this I mean IBM, Dell or HP for Intel/AMD based machines; they all have organizations facing small and medium sized businesses. Sure, the upfront costs might be slightly higher, but in the long run doing this now is much easier than doing it later. In addition to the Trust Me reason just given, the following also apply.
    • You get a level of guarantee that the same configuration will be available for a set length of time. Often at small stores, the available systems are based upon what components they have on hand at the time. With large organizations, models have definitive life spans.
    • Support contracts and SLAs will be much tighter with the large organizations, which have a full support infrastructure. Do you have someone at your corner computer supplier you can call at 2AM during a release crunch when your RAID controller has just gone *POP*?
    • You can get machines in much friendlier form factors from the big vendors than from consumer / small business oriented stores. 1U machines should be on your shopping list.
  2. Resist fun naming conventions: While it certainly is fun to name machines according to Star Wars, Star Trek, Disney, species of bears, natural disasters, species of birds, video games, etc., these schemes do not scale (how many bear species are there that people will actually recognize as such?), do not resonate with all members of the group and, due to their informality, are subject to change. We have used all of the listed examples in our QA lab over the last 6 years. Instead you should use un-fun but descriptive names. An example is one of my newer machines: sap2k301. Let's break it down into its components (there is a small scripted version of this scheme at the end of the post).
    • sa: the hardware pool that this machine belongs to (we have multiple products, each with its own hardware pool, largely for budgeting purposes)
    • p: the purpose pool, in this case performance testing
    • 2k3: the OS, in this case Windows 2003
    • 01: the machine identifier. This is the first machine with this OS, for this purpose, as relates to the product. If you are someone like Google you might want more digits in your identifier; this example maxes out at 99, which for our purposes is more than enough.
  3. Virtualize: Rather than buy lots of smaller machines and then run them at 10% utilization most of the time, buy as large a machine as your budget will allow and run virtual servers on it. Not only do you take up less space in your data center (which can get quite cramped and hot), but you make better use of the investment by running it at, say, 80% load. It just happens that that 80% is really 7 server instances. Also, if you have done your deployment correctly, you could have your application completely destroy a critical OS component (say the Windows registry) and your recovery is simply to boot from the un-screwed image (a sketch of this revert step is at the end of the post). Similarly, it lets you test on an install that has not had a dozen install/uninstall cycles run on it. We have overlooked some nasty problems in the past by recycling the same machine due to the time it would take to recreate it.
  4. Figure out an update strategy: You need to figure out a way to apply OS and application patches to whatever system you decide on for managing machine images. There is no greater time sink than having to go out and grab a year's worth of service packs and hot fixes and apply them to a newly brought-up image/install. This should be automated as much as possible, and all the large OS vendors provide products meant for enterprise IT management that are ideally suited for this purpose. This also ensures that the machines you are testing on actually have the appropriate patch levels across the board. Nothing is more annoying than completing a test cycle only to find that you had the wrong version of some library or dll and basically have to write off a week of your time (a small patch-audit sketch is at the end of the post).
  5. Automate OS installation/configuration: Every major OS has some way of installing itself unattended for large scale deployments. This should be implemented for your hardware to save the 40 minutes per machine it takes to install an OS. But it's only 40 minutes, you say. What if you had a skid of machines delivered… Having this automated also makes you less reliant on the availability (and whims) of local IT (a small templating sketch for this is at the end of the post).
  6. Use test machines ONLY for testing: This seems kinda obvious, but it is a trap that less mature companies often fall into. Only test on test-specific machines. This means that a tester will, at minimum, always be using two machines. The first is the machine on which they access their mail, the bug system, etc. The second is the machine the tests are being conducted on. The first machine is used to access the second using some remote terminal mechanism (Terminal Services on Windows, exporting DISPLAY on unix). Ideally, the access machine should be a laptop, if for no other reason than that laptops are portable.

Those are the ones I can think of off the top of my head, but if I think of any others, I'll add them as a new post.
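To make the naming scheme in point 2 concrete, here is a minimal sketch in Python that composes and decodes names like sap2k301. Only the sa / p / 2k3 / 01 components come from the example above; the other table entries are hypothetical placeholders.

```python
import re

# Component tables for the naming scheme. Only "sa", "p" and "2k3" come from
# the example in the post; the remaining entries are hypothetical placeholders.
PRODUCT_POOLS = {"sa": "product A hardware pool"}
PURPOSES = {"p": "performance testing", "f": "functional testing"}
OSES = {"2k3": "Windows 2003", "xp": "Windows XP"}

NAME_RE = re.compile(r"^(sa)(p|f)(2k3|xp)(\d{2})$")

def build_name(pool, purpose, os_code, index):
    """Compose a machine name such as sap2k301."""
    return f"{pool}{purpose}{os_code}{index:02d}"

def parse_name(name):
    """Decode a machine name back into its labelled components."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"unrecognized machine name: {name}")
    pool, purpose, os_code, index = m.groups()
    return {
        "pool": PRODUCT_POOLS[pool],
        "purpose": PURPOSES[purpose],
        "os": OSES[os_code],
        "number": int(index),
    }

print(build_name("sa", "p", "2k3", 1))  # -> sap2k301
print(parse_name("sap2k301"))
```

The value here is less the script itself than the fact that a script can decode the name at all; no bear-species scheme gives you that.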
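For the revert-to-a-clean-image step in point 3, the shape of the automation looks something like this. The post doesn't name a hypervisor, so this sketch assumes VirtualBox's VBoxManage command line purely as an example; the VM name and snapshot name are hypothetical.

```python
import subprocess

def revert_to_clean(vm_name, snapshot="clean-install"):
    """Throw away whatever the last test run did to the guest and
    boot it again from a known-good snapshot."""
    # Power the guest off; ignore the error if it is already off.
    subprocess.run(["VBoxManage", "controlvm", vm_name, "poweroff"], check=False)
    # Roll the virtual disks back to the pristine snapshot.
    subprocess.run(["VBoxManage", "snapshot", vm_name, "restore", snapshot], check=True)
    # Bring the guest back up without a console window.
    subprocess.run(["VBoxManage", "startvm", vm_name, "--type", "headless"], check=True)

revert_to_clean("sap2k301")
```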
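Point 4 is mostly about tooling, but even a small audit script pays for itself. Here is a sketch assuming a Windows test machine where the wmic utility is available; the baseline hotfix IDs are placeholders you would fill in from whatever your patch-management product says the image should carry.

```python
import subprocess

# Hypothetical baseline: the hotfixes every test image is expected to carry.
REQUIRED_HOTFIXES = {"KB000001", "KB000002"}

def installed_hotfixes():
    """Return the hotfix IDs Windows reports as installed (via `wmic qfe`)."""
    out = subprocess.run(
        ["wmic", "qfe", "get", "HotFixID"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip().startswith("KB")}

missing = REQUIRED_HOTFIXES - installed_hotfixes()
if missing:
    print("Image is behind the baseline, missing:", ", ".join(sorted(missing)))
else:
    print("Patch level matches the baseline.")
```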
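And for point 5, the unattended-install mechanism itself is OS-specific (Kickstart, Windows answer files, and so on), but the per-machine part is usually just templating. This sketch assumes a hypothetical unattend.template file containing a {HOSTNAME} placeholder; the real answer-file format depends on the OS and deployment tool you use.

```python
from pathlib import Path

def write_answer_files(template_path, names):
    """Stamp each machine name into its own copy of the answer-file template."""
    template = Path(template_path).read_text()
    for name in names:
        Path(f"unattend-{name}.txt").write_text(template.replace("{HOSTNAME}", name))

# Generate files for sap2k301 .. sap2k305 using the naming scheme from point 2.
machines = [f"sap2k3{i:02d}" for i in range(1, 6)]
write_answer_files("unattend.template", machines)
```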