
The problem with automated provisioning (III of III)

This is the third of my articles on the macro-level issues with (automated) provisioning. It builds on the previous articles, specifically the comparison of Enterprise versus “web scale” deployments described in “The problem with automated provisioning (II of III)”, and on the levels of complexity, in terms of automated provisioning, set-up and configuration, that each requires.

As I’ve said before in this series of articles, provisioning a thousand machines which all have the same OS, stack and code base, with updated configuration information, is easier to set up than provisioning a thousand machines which use a mixture of four or five Operating Systems, all with differing patch schedules, patch methods and code release schedules, and with a diverse infrastructure and application software stack and multiple code bases. To express this I’ve postulated the equation “(Automated) Provisioning Complexity = No. of Instances x Freq. of Change”.

I’d now like to move the focus to runtime stability, and to the ability of a given system to support increasingly greater levels of complexity.

I find that it is important to recognise the place of observation and direct experience as well as theory and supposition (in research I find it’s useful to identify patterns and then try to understand them).

Another trend that I have witnessed with regard to system complexity, including the requirement to provision a given system, is that the simpler and more succinct a given architectural layer, the more robust that layer is, and the more able it is to support layers above it which have higher levels of complexity.

Often architectural layers are constrained by the preceding layer in the stack in terms of their ability to support (and absorb) high numbers of differing components and high rates of change. In other words, the simpler the lowest levels of the stack, the more stable they will be, and thus the more able to support diverse ecosystems with reasonable rates of change in the layers above them.

Conversely, the more complex the layer below, the less stable it is likely to be (given that the number of components, the instances thereof and the rate of update all significantly drive up the level of complexity of the system).

This phenomenon is found in the differing compute environments I’ve been speaking about in these short articles, and again it affects the ability of a given system to be provisioned in any succinct and efficient manner.

More accurate Enterprise

Typically Enterprise IT ecosystems are woefully complex, due to a mixture of longevity (sweating those assets and risk aversion) and large numbers of functional systems (functional as in functional requirements) and non-functional components (i.e. heterogeneous infrastructure, with lots of exceptions, one off instances, etc.).

Consequently they suffer from the issue that I’ve identified above: as the lower levels are already complex, they are constrained in the amount of complexity that can be supported at the level above. The accompanying diagram demonstrates the point.

the-problem-with-provisioning-0.1-real-enterprise

More accurate Web Scale

Web Scale class systems, by contrast, often exhibit almost the opposite behaviour. Given they often use a radically simplified infrastructure architecture anyway (i.e. lots of similar and easily replaceable common and often ‘commodity’ components) in a ‘platform’ approach, there isn’t often the high level of heterogeneity that you see in a typical Enterprise IT ecosystem; the environment is largely homogeneous. This approach is often carried into the application and logical layers above the infrastructure too, i.e. a high level of commonality of software environment, used as an application platform to support a variety of functionality, services, code and code bases.

Consequently, because the low-level layers of the architecture are simple, they are much more robust and capable of withstanding change (introducing change into a complex ecosystem often leads to something, somewhere breaking, even with exceptional planning). This stability and robustness ensures that the overall architecture is better equipped to cope with change and with the frequency of change, and that layers with high levels of complexity can be supported above.

the-problem-with-provisioning-0.1-real-web

And so that concludes my articles on provisioning, and the problems with it, for the time being, although I might edit them a little, or at least revisit them, when I have more time.

The problem with automated provisioning (II of III)

This is the second of my articles on the macro-level issues with (automated) provisioning. It focuses again on the theme of complexity as a result of “No. of Instances” x “Freq. of Change”, described in the previous article “The problem with automated provisioning (I of III)”, but this time compares an Enterprise Data Centre build-out with a typical “Web Scale” Data Centre build-out.

Having built out both kinds of environment, I find the comparison below useful when describing some of the issues around automated provisioning, and occasionally why there are misconceptions about it among those who typically deliver systems from one of these ‘camps’ and not the other.

Enterprise

the-problem-with-provisioning-0.1-enterprise

Basically the number of systems within a typical Enterprise Data Centre (and within that Enterprise itself) is larger than that in a Web Scale (or HPC) Data Centre (or supported by that organisation), and the number of differing components that support those systems is higher too. For instance, at the last Data Centre build-out I led there were around eight different Operating Systems being implemented alone. This base level of complexity, which is then exacerbated by the frequency of having to patch and update it (as demonstrated by the “Automated Provisioning Complexity = No. of Instances x Freq. of Change” equation), significantly impacts any adoption of automated provisioning (it makes defining operational procedures more complex too).

Web Scale

the-problem-with-provisioning-0.1-web

Frankly a Web Scale build-out is much more likely to use a greater level of standardisation, to be able to drive the level of scale and scaling required to service the user requests and to maintain the system as a whole (here’s a quote from Jeff Dean, Google Fellow: “If you’re running 10,000 machines, something is going to die every day.”). This is not to say that there is not a high level of complexity inherent in these types of system. It’s just that, in order to cope with the engineering effort required to ensure that the system can scale to service many hundreds of millions of requests, it may well require a level of component standardisation well beyond what you’d typically see in an Enterprise type deployment (where functionality and maintenance of business process is paramount). Any complexity is more likely to be in the architecture needed to cope with said scaling, for instance distributed computational proximity algorithms (e.g. which server is nearest to me physically, so as to reduce latency, versus which servers are under least load, so as to process the request as optimally as possible), or in the distributed configuration needed to maintain said system as components become available and are de-commissioned (for whatever reason).

Automated Provisioning Complexity = No. of Instances x Freq. of Change

At the most basic level, provisioning a thousand machines which all have the same OS, stack and code base, with updated configuration, is easier to set up than provisioning a thousand machines which use a mixture of four or five Operating Systems, all with differing patch schedules and patch methods, and with a diverse infrastructure and application software stack and multiple code bases. I suspect that upon reading this article you may think that this is an overly obvious statement to make, but it is the fundamentals that I keep seeing people trip up on over and over again, which infuriates me no end, and so, yes, expect another upcoming article on the “top” architectural issues that I encounter too.
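To make the thousand-machine comparison concrete, here is a minimal Python sketch that counts the distinct components across two estates of identical size. The function name, the dictionary keys and the component labels (`os-a`, `stack-0` and so on) are all hypothetical, invented purely for illustration; only the shape of the comparison comes from the paragraph above.

```python
from itertools import cycle, islice


def distinct_components(machines):
    """Count the distinct OSes, stacks and code bases across an estate."""
    oses = {m["os"] for m in machines}
    stacks = {m["stack"] for m in machines}
    code_bases = {m["code_base"] for m in machines}
    return len(oses) + len(stacks) + len(code_bases)


# Hypothetical uniform estate: a thousand identical machines.
uniform = [{"os": "os-a", "stack": "stack-a", "code_base": "app-a"}] * 1000

# Hypothetical mixed estate: five variants spread evenly over a thousand machines.
variants = [
    {"os": f"os-{i}", "stack": f"stack-{i}", "code_base": f"app-{i}"}
    for i in range(5)
]
mixed = list(islice(cycle(variants), 1000))

print(len(uniform), distinct_components(uniform))  # 1000 machines, 3 distinct components
print(len(mixed), distinct_components(mixed))      # 1000 machines, 15 distinct components
```

Both estates contain the same number of machines, yet the mixed one has five times as many distinct things to patch, test and release, and that is before each one's own change schedule is taken into account.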

Build-outs for HPC, or High Performance Computing, the third major branch of computing, usually follow the “web scale” model above. I have an upcoming article comparing the three major branches of computing usage, Enterprise, Web Scale and HPC, in much greater detail; however, for the time being the comparison above is adequate to demonstrate the point I am drawing to your attention: that complexity of environment exacerbates the difficulty of implementing an automated provisioning system. Hope you enjoyed this article; it is soon to be followed by a reappraisal and revised look at Enterprise and Web Scale provisioning.

The problem with automated provisioning (I of III)

Referring back to my previous article, “The problem with automated provisioning – an introduction”: once you get past these all-too-human issues and into the ‘technical’ problem of provisioning, I’d have been much nearer the mark in my initial assessment, because it is indeed an issue of complexity. The risks, costs and likely success of setting up and maintaining an automated provisioning capability are integrally linked to the complexity of the environment to be provisioned.

There are a number of contributing factors, including the number of devices, virtual instances, etc., and their location and distribution relative to the command and control point, but the two main ones in my mind are “Number of Instances” and “Frequency of Change”.

And so ‘Complexity’, in terms of automated provisioning, at a macro level, can be calculated as “Number of Instances” multiplied by “Frequency of Change”.

No. of Instances x Freq. of Change

By “Number of Instances” I mean number of differing operating systems in use, number of differing infrastructure applications, number of differing application runtime environments and application frameworks, number of differing code bases, number of content versions being hosted, etc.

By “Frequency of Change” I am drawing attention to patches, code fixes, version iterations, code releases, etc., and how often they are delivered.
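As a rough illustration of the relationship, the following Python sketch multiplies the two factors for a single layer. The function name, the choice of "changes per year" as the unit, and the figures are my own assumptions for illustration, not measurements.

```python
def provisioning_complexity(num_instances: int, changes_per_year: float) -> float:
    """Macro-level complexity score: No. of Instances x Freq. of Change."""
    if num_instances < 0 or changes_per_year < 0:
        raise ValueError("both factors must be non-negative")
    return num_instances * changes_per_year


# Hypothetical figures: four differing operating systems, each patched quarterly.
os_layer_score = provisioning_complexity(num_instances=4, changes_per_year=4)
print(os_layer_score)
```

The score has no meaningful unit; it is only useful for comparing one environment, or one layer, against another.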

The following diagram demonstrates what I frequently call ‘The Problem with Provisioning’; as you can see I’ve delineated against three major architectural “levels”, from the lowest and nearest to the hardware, the OS layer which also contains ‘infrastructure software’, the Application layer, containing the application platform and runtime environment, and the “CCC” layer containing Code, Configuration and Content.

the-problem-with-provisioning-0.1-overview

In a major data-centre build-out it is not atypical to see three, four or even more different operating systems being deployed, each of which is likely to require three or six monthly patches, as well as interim high value patches (bug fixes that affect the functionality of the system, and security patches). Furthermore it’s likely the number of ISV applications, COTS products and application runtime environments will be much higher than the number of OS instances, and that the number of “CCC” instances will be even higher.
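The way complexity climbs through the three layers can be sketched in Python as follows. Every figure here is hypothetical, chosen only to show the shape of the problem, and the `Layer` class and its field names are illustrative assumptions rather than anything taken from a real provisioning product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    distinct_instances: int   # differing components at this layer
    changes_per_year: float   # patches, fixes or releases per component

    @property
    def complexity(self) -> float:
        # No. of Instances x Freq. of Change for this layer
        return self.distinct_instances * self.changes_per_year


# Hypothetical figures, ordered from nearest the hardware upwards.
stack = [
    Layer("OS and infrastructure software", 4, 4),
    Layer("Application platform and runtime", 15, 6),
    Layer("Code, Configuration and Content", 60, 24),
]

for layer in stack:
    print(f"{layer.name}: {layer.complexity:g}")

total = sum(layer.complexity for layer in stack)
print(f"Whole stack: {total:g}")
```

With these made-up numbers the “CCC” layer dominates the total, which is one way of seeing why a tool suited to the OS layer is unlikely to suit the whole stack.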

I find it important to separate the system being provisioned into these three groupings because they typically require differing approaches (and technologies) for their provisioning. This is something I mentioned in the previous article: organisations mistakenly believe that the provisioning technology that they have procured will scale the entire stack, from just above ‘bare metal’ to “CCC” changes (I’ve seen this issue more than once, even from a Sun team who should have known better, albeit it was around three years ago).

This model brings to the fore the increasing level of complexity, both of components at each layer, and the frequency of changes that then occur, and although the model above is a trifle simplistic, it is useful when describing the issues that one can encounter with implementing automated provisioning systems, especially to those with little knowledge or awareness of the topic.

The problem with automated provisioning – an introduction

I was going to start this short series of articles with the statement that the problem with provisioning is one of complexity, and I’d have been wrong: the predominant issues with provisioning, and specifically automated provisioning, are awareness and expectation.

Awareness and Expectations

The level of awareness of what can actually be done, and often, more importantly, what cannot be done, with automated provisioning, or even what automated provisioning actually “is” is a significant barrier, followed by the expectations set, both by end users with a hope for IT “silver bullets”, who may well have been oversold, and Systems Integrators, product vendors and ISVs who sadly promise a little too much to be true or are a trifle unaware of the full extent of their own abilities (positivity and confidence aside).

For instance, I was once asked to take over and ‘rescue’ the build-out of a data centre on behalf of a customer and their outsourcer (£30M+ to build out, estimated £180M total to build and run for the first five years).

Personally I would say that this data-centre build out was of medium complexity, being made up of more than five hundred Wintel servers, circa three hundred UNIX devices, and around two hundred ancillary pieces of hardware including network components, firewalls, switches, bridges, intelligent KVMs and their ilk, storage components, such as SAN fabric, high end disk systems, such as Hitachi 9900 range, high end tape storage, etc., and other components.

One of the biggest problems in this instance was that the contract between client and vendor stipulated using automated provisioning technologies. That was not a problem in itself; however, an assumption had been made, by both parties, that the entire build-out would be done via the provisioning system, without a great deal of thought being given to following this through to its logical conclusion.

Best to say here that they weren’t using Sun’s provisioning technology, but the then ‘market leader’; however, the issues were not to do with the technology, nor with the functionality and capabilities of the provisioning product. It is just as likely that similar problems would have been encountered even if they had been using Sun’s.

This particular vendor had never before implemented automated provisioning technologies on a brand new “green-field” site. They had always implemented them in existing “brown-field” sites, where, of course, there was an existing and working implementation to encapsulate in the provisioning technology.

As some of the systems were being re-hosted from other data-centres (in part, savings were to be made through a wider data-centre consolidation), another assumption had been made that this was not a fresh “green-field” implementation but a legacy “brown-field” one. However, this was a completely new data-centre, moving to upgraded hardware and infrastructure, never mind later revisions of application runtime environments, new code releases, and functionality that was partly enhanced and partly wholly new. In other words, this was not what we typically call a “lift and shift”, where a system is ‘simply’ relocated from one location to another (and even then ‘simply’ is contextual). Another major misconception, and example of incorrectly set expectation, was that the provisioning technology in question would scale the entire stack, from just above ‘bare metal’ to ‘Code, Configuration and Content’ (CCC) changes, something that was, and still is, extremely unlikely.

Sadly, because of these misconceptions and lack of forethought, predominantly on the part of the outsourcer, no one had allowed for the effort either to build out the data-centre in its entirety and then encapsulate it within the provisioning technology (a model they had experience of, and which was finally adopted), or to build the entire data-centre as ‘system images’ within the provisioning technology and then use it to implement the entire data-centre (which would have taken a great deal longer, not least because testing a system held only as system images would have been impossible: the images would have to be loaded onto the hardware before any testing could be done, whether testing of the provisioning system itself or real-world UAT, system, non-functional and performance testing).

Unsurprisingly, one of the first things I had to do when I arrived was raise awareness that this was an issue, as it had not been fully identified, before getting agreement from all parties on a way forward. Effort, cost, resources and people were all required to develop the provisioning and automated provisioning system into a workable solution. As you can guess, there had been no budget put aside for any of this, so the outsourcer ended up absorbing the costs directly, leading to increased resentment of the contract that they had entered into and straining the relationship with the client. However, this had been their own fault, because of lack of experience and naivety when it came to building out new data-centres (this had been their first, so they did a lot of on-the-job learning and gained a tremendous amount of experience, even if much of it was in how not to build out a data centre).

This is why I stand by the statement that the major issues facing those adopting automated provisioning are awareness of the technology and what it can do, and expectations of the level of transformation and business enablement it will facilitate, as well as of how easy it is to do. The other articles in this series will focus a little more on the technical aspects of the “problem with provisioning”.