Mission-critical systems are not software like any other. Most software may fail. Some systems even incorporate failure or inaccuracy as a standard mode of operation, such as the overbooking systems of transport companies. For years, TGV passengers have lived with the vagaries of overbooking without being too bothered by it. It is common for an online sales or music site to be unavailable for a few seconds or even minutes without bothering anyone. If a payroll application makes late transfers, who will be sorry about it, except the unfortunate employees?

The situation is quite different for mission-critical systems. They cannot be unavailable, or even provide a degraded service. Any failure will be detrimental to the company that provides the service, to its customers and to the operator's supplier. Any failure will cause damage in terms of image, business, market share or sometimes much more. For mission-critical systems you must always think in terms of the worst-case scenario, not the best-case scenario.

This is why high availability is at the heart of the design and implementation of these systems. Achieving it requires not only solid skills from software providers but also a culture shared with operators, whether in the banking industry or elsewhere. Rather than dealing with countless individual cases, it is useful to ask what principles a high-availability system should be based on, what their foundations are, and why they are effective. Working from such a list makes it possible to discriminate simply between architectures that can provide high availability and those that must be eliminated, and to quickly reject baroque or mannerist architectural creations by neo-experts. I see seven of them, and I think that with these simple principles we can work seriously.

The first principle is to have no SPOF. A SPOF, a Single Point Of Failure, is a single piece of software or hardware architecture whose failure will lead to the failure of the entire system.
Under Murphy's law, every SPOF will eventually fail! The mission-critical system vendor must therefore ruthlessly eliminate all possible SPOFs. There is no need to look very far for an example: a single database, even one managed by a large Cloud specialist, even opaquely distributed over multiple machines, even guaranteed by a cryptic service level agreement, is a SPOF and, to put it bluntly, a nest of troubles.

The second principle is symmetry. An object is said to be symmetrical if it can be superimposed on itself by applying a transformation of space other than the identity. Behind every symmetry is a property of invariance. In our field, this means distributing service requests over subsystems that can be substituted for one another, the idea being that the service will be provided in the same way regardless of which subsystem carries it out. If one disappears, the quality of service is unaffected. Technically speaking, the N subsystems need the same information to make a decision, which means they need to inform each other about what they have done.

Symmetry can be mirror (normal/dual, Active/Active), trial (normal/dual/trial, Active/Active/Active), quadral, etc. Theoretically there is no limit, except common sense. An Active/Passive system, however, is not a symmetrical system. There will always be a time when the passive becomes active, and this transition will not be without "friction", the friction often being an untested procedure or an individual, which will endanger the quality of service. Here again, let us remember good old Murphy and his law.

Of course, symmetry, the substitutability of any subsystem by another, is also much simpler than non-symmetry. This choice of simplicity rests on another general principle that serves the same goal: KISS, as in Keep It Simple and Stupid. Contrary to what some people may believe, Active/Active is much simpler than Active/Passive.
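To make the idea of symmetry a little more concrete, here is a minimal Python sketch, not a production design. The `Node` class, the `dispatch` function and the account name are hypothetical, invented purely for illustration. The sketch assumes in-process calls between nodes and ignores ordering, network failures and the catch-up of a node that comes back, all of which a real system would have to handle.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One of N interchangeable subsystems (hypothetical model)."""
    name: str
    alive: bool = True
    # Local copy of the information needed to make a decision.
    balances: dict = field(default_factory=dict)
    # Peers are excluded from repr/compare to avoid cyclic recursion.
    peers: list = field(default_factory=list, repr=False, compare=False)

    def apply(self, account: str, amount: int) -> None:
        # Update the local state only.
        self.balances[account] = self.balances.get(account, 0) + amount

    def handle(self, account: str, amount: int) -> int:
        # Serve the request locally, then tell every live peer what was done.
        self.apply(account, amount)
        for peer in self.peers:
            if peer.alive:
                peer.apply(account, amount)
        return self.balances[account]


def dispatch(nodes, account, amount):
    # Any live node may serve the request: no primary, no failover procedure.
    for node in nodes:
        if node.alive:
            return node.handle(account, amount)
    raise RuntimeError("no node available")


a, b = Node("A"), Node("B")
a.peers, b.peers = [b], [a]

dispatch([a, b], "acct-1", 100)           # served by A, B is notified
a.alive = False                           # A disappears
result = dispatch([a, b], "acct-1", -30)  # B serves with the same information
```

The point of the sketch is that nothing special happens when `A` disappears: `B` was already active and already held the same information, so there is no passive-to-active transition to go wrong.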
The third principle is the invariance of operation over time: the system will always work in the same way, in the same mode. It will work like a perpetual calendar watch. There will not be 30 days of operation in mode A, then two hours in mode B, and back again. For the same reason as before, going from mode A to mode B and back will not be without friction, which will inevitably lead to quality-of-service problems. The invariance of operation over time is obviously a type of symmetry (translational symmetry in time).

The fourth principle is performance predictability: the ability to predict the performance of a system with a margin of error small enough not to lead to adverse consequences. It is essential for supervising systems, for detecting degradation, and for anticipating the actions to be taken before failures occur. Here again, it is a problem of invariance and therefore of symmetry. Using a multi-AZ architecture (where services are spread over several different geographical areas) means that the path of a transaction will not necessarily be the same from one time to the next: fast at one time, slow at another, the speed of light being non-negotiable. For the same reasons, shared resources must be avoided, even for something as simple as a local network connection. Mission-critical systems cannot share resources; otherwise the response time is not predictable. Losing the predictability of performance means losing the basis of the SLA and an essential means of detecting potential failures.

The fifth principle is that operation must be as simple as possible. A nuclear submarine carries a miniaturized nuclear power plant that provides the electricity that runs its engines and allows it to produce its oxygen. It is a very complex technology, developed by remarkable engineers. However, it is operated by normal people, with the help of well-thought-out computer systems and procedure books.
If you had to take high-level engineers on board to operate nuclear submarines, the submarines would all have been at the bottom of the sea for a long time. No engineer of great talent will agree to spend six months underwater in a confined space; at the very least, there would not be enough candidates. Mission-critical systems are the same: they must be designed and set up by people of a certain talent but operated by normal people, without the need to call on the former, who, according to Murphy's law, will never be there when they are needed. Implementing this principle is far from straightforward, but it is an essential guide. The Build phase should produce an artifact that makes the Run phase simple.

The sixth principle is to use the fewest possible different software or technology providers. Behind the management of a mission-critical system, there is a management of responsibility. If the system is produced by a chain of different technologies, there will be not only the problem of coordination and of the strength of the chain, as defined by its weakest link, but also the problem of responsibility. This principle is less strong than the previous ones; it is a pragmatic application aimed at eliminating unnecessary links and keeping the technical architecture and the structure of responsibilities as simple as possible. It is also an application of the KISS principle explained above. It is by virtue of this principle that we prefer cross-notification of information to the synchronization of the databases themselves (and no, it is not the same thing), which moreover makes it easy to build multipolar systems (dual, trial, "quadral", etc.).

The seventh principle is security. Securing systems is now also a topic for high availability, as the number of attacks by hackers keeps increasing. Although the main objective of these attacks is not related to service availability, they nevertheless present a threat to it.
Therefore, all connections and access must be properly secured by certificates, two-factor authentication and so on, bearing in mind that while a security check can protect a system against attack, it can also be a threat to its availability. Changing a certificate, for instance, must be properly planned and coordinated; otherwise it will cause an outage.

As can be seen, however independent they may be, these seven principles interact with and reinforce each other. Analyzing an architecture through them saves time and, above all, avoids disappointments.

Now, after the "On-demand" fashion and its underlying religious conviction that it was no longer necessary for developers to worry about the performance of their software, after the micro-service and the belief in the harmlessness of virtualization, which ignored the fact that all micro-services always run somewhere and that the speed of light remains constant, it is the Cloud idol that combines the two with an irrational overconfidence in infrastructure and brings a heap of nonsense rarely equaled. In these turbulent waters, having a few principles validated by long and successful experience is not useless.
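As a footnote to the security principle, the remark that a certificate change must be planned can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `rotation_status` function, the 30-day lead time and the link names in `inventory` are hypothetical, and the expiry dates are made up rather than read from real certificates.

```python
from datetime import datetime, timedelta, timezone


def rotation_status(not_after: datetime, now: datetime,
                    lead_time: timedelta = timedelta(days=30)) -> str:
    """Classify a certificate by how close it is to its expiry date."""
    if now >= not_after:
        return "expired"        # the outage has already happened
    if not_after - now <= lead_time:
        return "rotate-now"     # inside the planning window
    return "ok"


# Hypothetical inventory: connection name -> certificate expiry date.
now = datetime(2025, 12, 1, tzinfo=timezone.utc)
inventory = {
    "acquirer-link": datetime(2026, 6, 1, tzinfo=timezone.utc),
    "issuer-link": datetime(2025, 12, 20, tzinfo=timezone.utc),
}

report = {name: rotation_status(expiry, now) for name, expiry in inventory.items()}
# report == {"acquirer-link": "ok", "issuer-link": "rotate-now"}
```

A check of this kind, run routinely, turns a certificate renewal into a scheduled operation carried out with the counterpart, rather than a surprise discovered during an outage.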
Lusis News: the latest company and industry news from Lusis Payments.