Practice what you Preach
Last week I was in London, England for SeleniumConf where I gave a talk on test infrastructure. The feedback seemed to be good from the people I talked to, but I personally was uncomfortable on stage and felt it might have been my worst performance. I felt good about the content the morning of and had a few jokes, etc. planned, but when I got going the switching of windows (1xFirefox, 1xChrome, 1xKeynote, 1xTerminal, 1xSublime) completely threw me off my game and it was a downward spiral from there. It was likely too ambitious for a 40-minute slot and better suited to a hands-on, all-day workshop. Here is the sorta Commentary Track with extra links and such.
Most of my decks have a visual theme to them, but I couldn’t come up with anything. At one point I tried Rocky and Bullwinkle because I could say ‘Nothing up my sleeve’ and show Bullwinkle as a magician when showing an empty AWS account. But then I thought of ‘Practice what you Preach(er)’ and tried to shoehorn that in. But it’s /really/ hard to find appropriate things to include so I ripped most of it out. And most of the talk was supposed to be code and/or architecture diagrams, so really any visual theme was a stretch. In the end, I left a couple of Preacher things in, but it wasn’t an obvious thing and was lost on most, so I likely should have ripped them all out.
These really are the ‘rules’ of presenting. In general, your talk will be improved if you avoid these. The logic being: the audience knows who you are from the bio in the program, they want you to succeed so don’t sow the idea of failure into their minds, and things will inevitably go wrong (best to anticipate it and have a video of you doing a thing instead of doing it).
“So let’s break some rules.”
Fell flat. And the Preacher reference felt forced. (It was.)
Breaking Rule 1. I really don’t care about scripts anymore. There has been tragically little innovation in the actual script creation and maintenance space. But what people don’t talk about is the business risk around where those scripts run. I don’t have any data to substantiate this claim, but my gut is that too many people are just spinning up EC2 or ECS instances to run their scripts without knowledge of the tooling around it in order to run them securely and efficiently.
Breaking Rule 2. I had such big plans for this talk, but have been battling burnout for a year now. It’s been especially bad the last couple months which is exactly when I needed to be prepping things for success. Which didn’t help things as burnout feeds off of burnout.
Burnout isn’t technically a clinical diagnosis, but I like this definition.
Thankfully there are organizations now specifically chartered to help tech people deal with their traitor brains. Such as https://osmihelp.org.
This is a ‘simple’ view of what a modern, self-hosted Selenium infrastructure in AWS could look like. I’m likely missing a few things, but it really hasn’t changed in the last 5 or 6 years. Selenium Grid 4.0 could make some interesting changes at scale as the Hub can be broken into 4 different services. Oh, and I don’t include Docker in here because I don’t believe in running scripts in environments your customers are not using. You are of course more than welcome to if enough of your customers are using headless browsers or Linux desktops. I’m also not current on how to set up Sauce Labs in a hybrid scenario (or even if they support that configuration anymore) with their tunnel product adding their cloud as available browser nodes into the Hub, which I always thought was a sweet spot for them.
Here is the conclusion of the ‘Do not start with an apology’ rule and the origin of the name. In Austin I rambled (even for me) about infrastructure and just threw a tonne of ideas and topics at the audience. In Chicago I used the Well-Architected Framework from AWS (https://aws.amazon.com/architecture/well-architected/) to organize all those ideas. It is by AWS and uses AWS product offerings as examples, but it really is cloud neutral at its core. There was a tweet a month or so ago about a team that used it to cut their Azure bill by something like 40% by applying the principles in it to their cloud infrastructure. So the plan then for this was to open up a terminal, run terraform apply, do the rest of the talk, then have that full diagram in the previous slide created and run a test.
Yaaaaaa. About that. Remember burnout? Ends up I got maybe 1/5 of it done. And couldn’t run a test. So rather than the pitched ‘All code’ there was ‘Some code’.
Now we’re starting to get into Rule 3 territory about not running a live demo. And the tools I’m recommending these days to do it are the HashiCorp ones. I used to suggest using the cloud provider’s native ones (such as CloudFormation for AWS) but for the above reasons have switched. I of course reserve the right to change my mind again in the future.
Almost ready to build some infrastructure, live on stage, but first have to talk about pre-conditions.
The ‘Root Account’ is the all powerful account with every permission possible. It is also the most dangerous one. The only thing you should do with it is enable hardware MFA (https://aws.amazon.com/iam/features/mfa/?audit=2019q1 has links to purchase), create an ‘administrators’ group that has the ‘AdministratorAccess’ managed policy and create an ‘IAM User’ in that group.
The ‘IAM User’ will have a virtual MFA token and access keys.
This is where juggling windows went crazy. Escape out of Keynote presenter mode to Firefox which had the Root Account window, then to Safari which had the IAM User window, then to the terminal to start Terraform and watch it scroll a bit until it applies the ‘Everything must have MFA’ policy (as described at https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_iam_mfa-selfmanage.html), at which point things fail, then demo getting a token in the shell (by switching between the terminal and Sublime which had the commands I wanted) and finish running Terraform.
The network held, and things applied without a problem. But it was here that I realized window switching wasn’t going to work so had to adjust on the fly.
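For anyone who has not seen an ‘everything must have MFA’ policy, it boils down to one deny statement. Here is a minimal sketch in Python, loosely modelled on the AWS example linked above. The function name and the trimmed list of exempt actions are assumptions for illustration; this is not the policy from the talk.

```python
import json

# Actions a user still needs without MFA so they can enrol a device.
# Assumption: a trimmed subset; the linked AWS example lists more.
ALLOWED_WITHOUT_MFA = [
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:ListMFADevices",
    "iam:ListVirtualMFADevices",
    "iam:ResyncMFADevice",
    "sts:GetSessionToken",
]


def build_require_mfa_policy():
    """Return an IAM policy document that denies everything else without MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedIfNoMFA",
                "Effect": "Deny",
                "NotAction": ALLOWED_WITHOUT_MFA,
                "Resource": "*",
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_require_mfa_policy(), indent=2))
```

Once something like this is attached, long-lived access keys on their own stop being useful, which is why the shell needs temporary credentials from `aws sts get-session-token` (with `--serial-number` and `--token-code`) before Terraform can carry on.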
One of the first pieces of infrastructure to be created needs to be the networking layer itself. I strongly believe that AWS’ VPC Scenario 2 (https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenarios.html) is the right one in almost all Selenium deployments. Basically, everything of importance is in the ‘private’ (blue) subnet and is not reachable from the internet, and the only thing that can reach it is either a bastion host in the ‘public’ (green) subnet or a VPN server that sits in the public subnet. You still have to keep things in the private subnet patched, etc. but there is a level of risk mitigation achieved when the bad guys (and gals) cannot even access the machines.
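To make the public/private split concrete, here is a small sketch using Python's standard `ipaddress` module. The `10.0.0.0/16` block and the /24 subnet size are assumptions picked for illustration, not the values from the Terraform in the talk.

```python
import ipaddress


def split_vpc(vpc_cidr, new_prefix=24):
    """Carve one public and one private subnet out of a VPC CIDR block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)
    public = next(subnets)   # bastion host or VPN server lives here
    private = next(subnets)  # Hub, Nodes and everything else of importance
    return {"public": public, "private": private}


layout = split_vpc("10.0.0.0/16")
print(layout["public"], layout["private"])
```

The point is only that the two ranges never overlap; routing (an internet gateway for the public subnet, NAT for the private one) is what actually makes one reachable and the other not.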
I gave the bastion host an Elastic IP (static), and likely should have also registered it with Route 53 (AWS’ DNS service), then tried to SSH into it. But it didn’t work, on purpose. That is because I created an empty security group with Terraform and attached it to the bastion host. Using the AWS CLI I added just my IP as an ingress rule so now one of the instances I created was accessible from the internet, but only from a specific IP and certificate. It was also to be able to demonstrate in Terraform how to have it ignore certain changes to prevent it from rebuilding parts of your infrastructure and kicking you out mid change. (There’s a lesson learned the hard way…)
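The ‘just my IP’ rule is worth spelling out. This sketch builds a single SSH ingress rule in the shape boto3's `authorize_security_group_ingress` takes as one entry of `IpPermissions`. The helper name and the `203.0.113.10` documentation address are made up for the example, and nothing here talks to AWS.

```python
import ipaddress


def ssh_ingress_for(my_ip, description="conference demo"):
    """Build one ingress rule allowing SSH from a single address only."""
    address = ipaddress.ip_address(my_ip)
    # A /32 (or /128 for IPv6) matches exactly one host.
    cidr = f"{address}/{address.max_prefixlen}"
    rule = {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22}
    if address.version == 4:
        rule["IpRanges"] = [{"CidrIp": cidr, "Description": description}]
    else:
        rule["Ipv6Ranges"] = [{"CidrIpv6": cidr, "Description": description}]
    return rule


print(ssh_ingress_for("203.0.113.10"))
```

Because a rule like this is added outside Terraform, the security group resource needs a `lifecycle` block with `ignore_changes` covering its ingress rules, otherwise the next apply will happily remove it again.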
This slide talks about how the bastion (and other) instances were produced by another part of the HashiCorp suite: Packer. No demo, but the idea of ‘Cattle vs Pets’ for your instances was brought up again. My current Packer setup uses the puppet-masterless (https://www.packer.io/docs/provisioners/puppet-masterless.html) provisioner but I would consider switching to Ansible in the future as AWS just announced that Systems Manager can run Ansible playbooks directly from S3 or GitHub which is kinda game changing to me. puppet-masterless relies on SSH and ideally the last step of provisioning should be to remove all the keys from the box and deal with things strictly through Systems Manager. Again, if everything is in a private subnet that doesn’t allow for access into the boxes, it is another level of security.
I also suggested using something like Secrets Manager or Vault to store passwords and other secure things rather than putting them right in your Terraform manifests.
Which dovetailed to me copy-and-pasting a private key into the bastion host. And then showing a security group that allows access into the Hub that was brought up from only the bastion.
Since we’re in AWS (and a lot of others are as well) we have to talk about security. And one of the most important parts around that in AWS is its API logging tool CloudTrail. The Terraform scripts configured a single CloudTrail trail across all regions and stored the logs in S3. Be careful about doing this though if you are in multiple regions as you pay for traffic and this can silently add to your bill if you are not careful.
One trick AWS suggests is you have CloudTrail monitor itself and automatically re-enable itself if it is disabled. This is what is on this slide and is described in more detail on https://aws.amazon.com/blogs/mt/monitor-changes-and-auto-enable-logging-in-aws-cloudtrail/
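The self-healing trick is small enough to sketch. The function below assumes the event is a CloudTrail API call delivered through EventBridge, with the trail identifier under `detail.requestParameters.name`. The function name is hypothetical, and the `cloudtrail` argument is anything with a boto3-style `start_logging(Name=...)` method, so it can be tried without an AWS account.

```python
def reenable_trail(event, cloudtrail):
    """Turn logging back on when a StopLogging call is seen.

    Returns the trail that was restarted, or None if the event was
    something other than StopLogging.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "StopLogging":
        return None  # some other API call; nothing to do
    trail = detail["requestParameters"]["name"]
    cloudtrail.start_logging(Name=trail)
    return trail
```

In a real Lambda the second argument would be `boto3.client("cloudtrail")`, and the function's role would need permission to call `cloudtrail:StartLogging`.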
One thing anyone building out infrastructure needs to be aware of is how much their stuff is costing at any particular moment in time. And to be warned when something spirals out of control. This is where billing alerts come in, along with using tags on everything that supports them to be able to see where your money is going. AWS billing is a black box with entire consulting organizations existing to try and get a handle on it. Terraform created a billing alert for $40 CDN.
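As a rough picture of what a billing alert is underneath, this sketch builds the keyword arguments for CloudWatch's `put_metric_alarm` against the `EstimatedCharges` metric. The helper name, the alarm name and the SNS topic ARN are placeholders. As far as I know the metric is reported in USD and only in us-east-1, so a Canadian-dollar threshold would need converting first.

```python
def billing_alarm_kwargs(threshold, sns_topic_arn, currency="USD"):
    """Keyword arguments for a CloudWatch alarm on estimated charges."""
    return {
        "AlarmName": f"estimated-charges-over-{threshold}",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": currency}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; billing data only updates a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # who gets told when it trips
    }


kwargs = billing_alarm_kwargs(40, "arn:aws:sns:us-east-1:123456789012:billing")
print(kwargs["AlarmName"])
```

With boto3 this would be passed as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**kwargs)`.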
Out of the Terraform and into the theoretical.
I believe you should run your scripts in the environment your users will. This means Windows. Not headless or Linux. So using Packer you create a Windows-based AMI. I started with https://github.com/joefitzgerald/packer-windows for these demo purposes. In a proper grid, actual licenses will be required.
Your nodes should:
- Be in Auto Scaling Groups even if you are not doing the ‘auto’ part. This is useful as you can intentionally scale them to 0 if you know you never run scripts overnight. But also think of a scenario where the Hub notifies AWS that it has used 90% of its available nodes and to spin up another 2 or 3, and then remove capacity when it has more than x spare.
- Use ‘Spot Instances’, which are spare EC2 capacity that AWS sells at a discount on the understanding that it can reclaim the instances at short notice. You should never pay more for a Spot instance than you would were it On Demand.
- Have access to their Node instance restricted to only the Hub via a Security Group
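The ‘Hub tells AWS to add or remove nodes’ scenario from the first bullet can be sketched as a pure function. Everything here is an assumption for illustration: the name, the 90% high-water mark, the step of 3, and the spare-capacity limit standing in for ‘x’. A real version would feed the result into the Auto Scaling Group's desired capacity.

```python
def desired_node_count(current, busy, step=3, high_water_pct=90,
                       max_spare=4, minimum=0, maximum=20):
    """Suggest a new Auto Scaling Group size from Hub utilisation."""
    spare = current - busy
    # Integer comparison avoids floating point surprises at exactly 90%.
    if current > 0 and busy * 100 >= current * high_water_pct:
        return min(maximum, current + step)  # nearly full: add capacity
    if spare > max_spare:
        return max(minimum, busy + max_spare)  # too idle: shed capacity
    return current


print(desired_node_count(10, 9))  # 90% busy, so grow by the step
print(desired_node_count(10, 2))  # 8 spare, so shrink to busy + max_spare
```

Scaling to 0 overnight is then just a scheduled action, with no clever logic required.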
One best practice we had a decade ago that has been forgotten is always running your scripts through a scriptable proxy. This lets you blackhole unnecessary scripts which slow down your tests, intercept HTTP codes and control how much bandwidth is simulated. (Having spent almost a week in a hotel with pretty crap internet, it’s amazing how much of the internet assumes functioning bandwidth.)
Access into this proxy should only be from the Node instances and wherever your scripts are being run from (such as CodeBuild) to configure it.
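The blackholing decision itself is simple. Here is a sketch of the check a scriptable proxy could run per request, using only the standard library. The host names are placeholders and the function is not tied to any particular proxy product.

```python
from urllib.parse import urlsplit

# Hypothetical third-party hosts that only slow the tests down.
BLACKHOLED_HOSTS = {"analytics.example.com", "ads.example.net"}


def should_blackhole(url, blocked=BLACKHOLED_HOSTS):
    """True when the request targets a blocked host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked)


print(should_blackhole("https://cdn.ads.example.net/pixel.js"))
print(should_blackhole("https://app.example.org/login"))
```

A proxy hook would answer matching requests immediately with an empty response instead of forwarding them.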
Some of this functionality is starting to be built into the browsers with bridges to WebDriver through the JavaScript Executor and Chrome DevTools. This of course assumes you are only running scripts in Chrome. It’s a far better idea to just run a proxy to get greater functionality and cross-browser capability.
Another reason for Terraform over something like CloudFormation is you can run things in external cloud providers such as MacStadium which uses VMware as the base of their cloud. So using the same tool to configure your Linux Selenium Hub and Windows Selenium Nodes you can also create Mac Nodes.
Because it is external to your private subnet where everything is, and in fact external to your VPC, a Load Balancer needs to be created in the public subnet to allow communication from MacStadium into the Hub for registration.
Selenium 4.0 is coming. And it will change this diagram a bit. As mentioned above, the Hub itself can be broken into 4 separate services which can be independently configured and scaled. A ‘Hub’ comprised of 2 AWS Lambda functions, an AWS SQS queue and an AWS ElastiCache Redis instance is going to be the scalable model of the future I think.
But before that happens, there are a couple of things that need to happen.
Communication between all parts of the Selenium infrastructure needs to be securable. Currently everything is HTTP but it needs to be HTTPS (if not by default, then at least configurable.) If anyone wants to do that, patches are welcome and would save me the work of doing it.
Similarly, there needs to be some way of authorizing Nodes into the Hub. Right now, any Node can register itself with the Hub and start getting traffic. It’s an interesting attack vector to think about where you discover someone launching a Hub in a public subnet and you lighting up a Node and attaching to it and now seeing a company’s next version of their app because they are sending it to you. The vector gets even more interesting when taking into consideration there is work being done to allow communication back to the Hub from the Node. If I can overflow a buffer and run arbitrary commands on the shell somehow your network is now fully compromised. Again, feel free to submit a patch along the lines of https://www.elastic.co/guide/en/beats/filebeat/current/configuring-ssl-logstash.html so I don’t have to do it.
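To show the kind of thing the Filebeat link describes, here is what ‘only Nodes with a certificate we trust may connect’ looks like with Python's standard `ssl` module. This is a generic mutual-TLS sketch, not Selenium code and not a patch. The function name and file parameters are hypothetical.

```python
import ssl


def hub_tls_context(cert_file=None, key_file=None, node_ca_file=None):
    """Server-side TLS context that refuses Nodes without a trusted certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.verify_mode = ssl.CERT_REQUIRED  # the Node must present a cert
    if cert_file:
        # The Hub's own identity, so Nodes can verify who they register with.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if node_ca_file:
        # Only Node certificates signed by this CA are accepted.
        context.load_verify_locations(cafile=node_ca_file)
    return context


context = hub_tls_context()
print(context.verify_mode)
```

With a context like this wrapped around the registration endpoint, a random Node lit up by a stranger would fail the handshake before it ever received a test.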
And that was the talk. Next steps with it are unknown. I’m seriously considering turning it into a video series and maybe offering it as a workshop at future SeleniumConfs.