Imagine this Production Environment Scenario:
Your production environment is running fairly standard network hardware: switches, routers, firewalls, and so on. Everything has been functioning fine, so there has never been any need to update the software on these devices. Then one day, the manufacturer of your edge firewall announces a critical vulnerability in the software it is running.
Now, this device serves several major functions: aside from controlling access in and out of your network, it also acts as the termination point for site-to-site VPNs to multiple branch offices. The vulnerability, as described by the manufacturer, allows an attacker, with some care, to bypass the firewall's normal security measures and gain access to your network.
The manufacturer has released a software update that fixes the issue, and given the severity of the vulnerability, it was decided that the update be applied. However, it is a significant update, jumping a full major release from what was running on the device at the time. With the update in place, the necessary reload of the firewall is completed, and it's now running the new software, albeit with one problem: the VPNs to the branch offices aren't coming back up.
With the VPNs down, the branch offices can't connect to the main database servers to carry out day-to-day operations, and little work can be completed. The configuration looks fine; there's no reason the VPNs shouldn't be online. With the branch offices effectively offline, productivity drops, business can't be completed, and ultimately, money is being lost.
Your network team is on top of the issue, reviewing the network configuration multiple times and combing through the change logs for the firewall software, until a change is identified in the steps required to configure a VPN under the new software. Armed with this information, the network configuration is updated and the VPNs are brought back online, but at the cost of lost business and a great deal of time. How could this have been prevented?
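One simple aid in this kind of troubleshooting, and in lab testing before an upgrade, is mechanically diffing configuration snapshots exported before and after the change, so that altered stanzas jump out instead of being hunted down by eye. Here's a minimal sketch using Python's standard `difflib`; the VPN keywords shown (`transform-set`, `ikev2-profile`) are hypothetical examples of the kind of syntax change a new major release might introduce, not the syntax of any particular vendor:

```python
import difflib

def config_diff(before: str, after: str) -> list[str]:
    """Return unified-diff lines between two plain-text config snapshots."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="pre-upgrade", tofile="post-upgrade", lineterm=""))

# Hypothetical VPN stanzas: the new release renames a keyword, which is
# exactly the kind of change a diff surfaces immediately.
old = "crypto map BRANCH 10\n set peer 203.0.113.10\n set transform-set TS1"
new = "crypto map BRANCH 10\n set peer 203.0.113.10\n set ikev2-profile PROF1"

for line in config_diff(old, new):
    print(line)
```

Run against snapshots taken in the lab before and after a trial upgrade, the removed and added lines point straight at the configuration steps that changed.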
Let’s Take Another Production Environment Situation:
Your software development team is working on a new product with multiple components that run on multiple servers located throughout your data center. For the application to function properly, certain features, say multicast forwarding, need to be enabled. As the servers are in different locations, the multicast forwarding needs to be enabled all the way through your core switch.
Thinking there won't be an issue, the change is put through, and at first everything is great. Just after the start of the day, as business picks up, several employees have trouble connecting to critical servers to get their work done. Other employees have issues just getting out to the internet, and when they can get out, it's slow. It doesn't take long to track the problem back to the changes made on the core switch: with multicast forwarding enabled, the switch had to perform far more processing than normal to handle both the regular traffic and the multicast traffic generated by the software team's work. Further, it seemed that some employees were able to reach network servers and devices they normally should not have access to. Soon after the changes were reverted, everything returned to normal. Again, was there any way to avoid these issues?
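Beyond lab testing, a before/after health baseline would have flagged both symptoms within minutes of the change. The sketch below shows only the comparison logic; the check names and pass/fail values are hypothetical placeholders for real probes (ping, TCP connects, latency thresholds) you would run against your own critical services:

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Names of health checks that passed before the change but fail after it."""
    return sorted(name for name, ok in before.items()
                  if ok and not after.get(name, False))

# Hypothetical results gathered before and after enabling multicast
# forwarding; in reality these would come from automated probes.
before = {"db-server-reachable": True, "internet-latency-ok": True}
after  = {"db-server-reachable": False, "internet-latency-ok": False}

print(find_regressions(before, after))  # → ['db-server-reachable', 'internet-latency-ok']
```

The same baseline run in a lab that mirrors production would have surfaced the performance hit before the change ever touched the core switch.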
How to Prevent Production Environment Errors:
While the examples above may be a bit extreme, they remain perfectly plausible, and they serve as solid reminders of why we need a proper IT lab setup. In both situations, had a lab environment been set up that mimicked production, all the changes could have been tested prior to implementation. That testing would have made it easy to see that the major software update would break the VPNs, and that enabling multicast forwarding on a core switch would cause a performance hit and other unexpected outcomes. The issues could have been found before they affected the production environment, with solutions worked out in advance, reducing the impact of the changes and ultimately saving money in the long term.
One thing to consider with such a lab is the cost of building the IT testing environment. Of course, an ideal lab would be identical to the hardware in production, but this isn't always feasible. Say the edge firewall is a very high-end, powerful model; it is certain to be rather expensive, and purchasing a second one just for testing wouldn't necessarily seem like a worthwhile expense, especially when the same functionality can be achieved with a lower-end model. At the same time, choosing hardware that's functionally identical to production but less powerful isn't always the right option either: while it may carry out the same job, changes won't necessarily have the same kind of performance impact they would on the production hardware. So how can we go about getting a proper lab in place?
As pointed out previously, cost is one consideration. How much can be put towards a lab environment? Will it outweigh the potential losses should a configuration mishap occur? Another consideration, and probably the more important one here, is being able to reproduce the key elements of the production environment, and knowing where less powerful, functionally identical devices can be used. In the case of the VPN outage, an update to the edge firewall changed how VPNs needed to be configured. In the lab, would it really be necessary to have a firewall capable of handling several hundred thousand connections, or would a less powerful model running the same code and the same configuration suffice? In the case of the core switch issues, could we get by with a similar model that doesn't support hundreds of switch ports? The lab doesn't have to be 100% identical, but it should be close enough to get real results out of testing.
The net result of all this should be a lab environment that looks and functions like the production environment. It should be an environment where changes and upgrades can be tested. We want the lab to allow us to identify potential issues and to come up with procedures to work around them. We want the lab to be an environment where those procedures can be properly developed before we end up with an outage that costs more than the lab would have to build.