Comments by "" (@ContinuousDelivery) on "Software Engineering F&*K Up Behind The Passport E-gate Failure" video.

This is a design problem to be tackled though, not a reason to build a badly designed system because otherwise "it's too hard". I can think of lots of ways to secure the distributed cached version of the information, and which made sense would depend on the detail of the requirements from a security perspective - solving that problem is what design is for. Deciding that this is impossible is simply wrong, because there are lots of systems that do more complex versions of the same thing, and discarding resilience because "we can't be bothered to think of a better design" is a dereliction of duty. My guess is that this is really a more systemic failure, and less about the requirements: more likely the need for resilience fell between the cracks of a poorly structured, poorly organised engineering process.
8
That assumes that a good design costs more. I am not so sure that is true, particularly when the kinds of orgs that get government contracts are usually massively overstaffed, with teams of hundreds and sometimes thousands of people for a project that could, and should, be built by 20 or 30 people. Sure, the procurement approach for such projects is broken and a problem, but I think that goes a lot further than the "cheapest bidder" problem.
4
FREE 'How To Evolve Your Software Architecture' Guide: How to work in ways that keep stuff easy to change which gives you the freedom to make mistakes and experiment and how to work in small steps that allow you to determine their fit for your present understanding of the problem... continuously. All explained in this FREE compact guide. Download HERE ➡ https://www.subscribepage.com/evolve-your-architecture
3
I am definitely an incrementalist, and don't believe in creationism for complex systems.
3
The story about the WiFi being the cause was from the Telegraph. I agree it sounds insane for a system like this.
2
That assumes that doing a better job would cost more. That's not been my experience. Poor organisation and poor development practices result in poor systems, but not lower costs.
2
Sure, you could build it that way, but I can't imagine why you'd prefer that, if you could avoid it. Keeping the compute close to the point of need is a pretty useful strategy in distributed systems if you value resilience.
2
I am sure that that is the calculation that people make, but I don't buy it. I have worked on well designed distributed systems with good people and it didn't take longer, it was quicker, and it didn't cost 10x because dev salaries don't work like that. Actually for systems like these developed by government, they are so overstaffed that I am sure that the kind of teams I worked on would have been 10x cheaper, as well as better.
2
An easy problem to solve, in a way that is at least as secure as a central DB.
2
There is no eventual consistency problem when things are working: changes would happen so fast that it would be, effectively, the same as the supposedly synchronous version.
2
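To illustrate the point about eventual consistency, here is a minimal sketch of a local replica that applies an ordered stream of change events. The names `Change`, `Replica`, `apply` and `catch_up`, and the idea of a sequence-numbered change log, are assumptions made for illustration, not details of the real e-gate system. While the link is up, each change is applied as it arrives, so the replica is effectively in step with the source; after an outage it replays the log from where it left off.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Change:
    """One entry in a hypothetical, sequence-numbered change log."""
    seq: int
    op: str  # "add" or "remove"
    passport_id: str


class Replica:
    """Hypothetical local copy of a list, kept current by applying changes in order."""

    def __init__(self) -> None:
        self.entries: Set[str] = set()
        self.last_seq = 0  # sequence number of the last change applied

    def apply(self, change: Change) -> bool:
        """Apply one change. Returns False if it was already applied."""
        if change.seq <= self.last_seq:
            return False  # duplicate delivery, safe to ignore
        if change.seq != self.last_seq + 1:
            raise ValueError(f"gap in log: expected {self.last_seq + 1}, got {change.seq}")
        if change.op == "add":
            self.entries.add(change.passport_id)
        elif change.op == "remove":
            self.entries.discard(change.passport_id)
        else:
            raise ValueError(f"unknown op: {change.op}")
        self.last_seq = change.seq
        return True

    def catch_up(self, log: List[Change]) -> int:
        """Replay a log sorted by seq, skipping what is already applied.

        Returns the number of changes newly applied.
        """
        return sum(1 for change in log if self.apply(change))
```

A replica that has seen only the first change and then loses its connection can later call `catch_up` with the full log and end in the same state as the source. Rejecting gaps rather than guessing is a deliberate choice in this sketch: a replica that knows it is behind can say so.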
I don't think it requires much in the way of hindsight. Which really is my whole point. "Let's build a system where every eGate in the country checks in with a single, non-clustered DB to filter out bad people" - What could possibly go wrong? The huge mistake here is to assume that because we can never imagine every possible failure scenario, we can't build resilient systems that can cope with most of them, even the unexpected ones. It is unlikely that ANYONE would have predicted that overloading a WiFi service would break access to the DB, but that the DB is a "single point of failure" is obvious from even the most superficial overview of this system, and so that case should have been considered in the design.
2
My approach isn't at all formal, we simply think about and explore, usually in small groups, all of the ways that we imagine bad things could happen. And we monitor what goes on in production to see if we are hit by any bad things that we didn't imagine and each time we find one of those, we find ways to make sure that the system won't fail in the same way again, and repeat.
2
I am sure that that was the thinking, but that is the design challenge here: how can you achieve all of those requirements while still achieving resilience? I can think of several different approaches that may work, and I am sure that we could find one that did once we had examined all of the constraints. But that is the job! Not simply making a crap system because some parts of it are difficult to achieve otherwise. Our job, for a system like this, should be to solve the hard problems, not just find the naively simple solutions!
1
Glad you enjoyed it
1
As I said in the video, it would be interesting to know how the emergency response actually worked. Either they had another route to the bad-person-list or they had copies of the data. If they had an alternate route to access the list, why not automate the switch-over to that, rather than rely on staff carrying laptops running through airports? If they had a copy somewhere then this has all of the same problems/trade-offs as the caching solution I recommend. Caching is a well understood problem and there are lots of patterns and approaches to make it work, even for distributed caches. None of this is simple, but it is what is required to build distributed systems.
1
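As a rough illustration of the automated switch-over and caching idea in the comment above, here is a minimal sketch of a gate-side cache that keeps answering from its last good copy when the central source is unreachable. Everything in it is hypothetical: the class `CachedWatchlist`, the `fetch` callable standing in for the central DB, the `max_age_seconds` limit and the "manual" fallback are assumptions for illustration, not a description of how the real system works or should be secured.

```python
import time
from typing import Callable, Optional, Set


class CachedWatchlist:
    """Hypothetical gate-side cache of a centrally held list."""

    def __init__(
        self,
        fetch: Callable[[], Set[str]],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch              # stand-in for a call to the central DB
        self._max_age = max_age_seconds  # how stale a copy we are willing to trust
        self._clock = clock              # injectable so the behaviour can be tested
        self._entries: Set[str] = set()
        self._fetched_at: Optional[float] = None

    def refresh(self) -> bool:
        """Try to pull a fresh copy. On failure, keep the last good copy."""
        try:
            entries = set(self._fetch())
        except Exception:
            return False
        self._entries = entries
        self._fetched_at = self._clock()
        return True

    def check(self, passport_id: str) -> str:
        """Return "deny", "allow", or "manual" (no copy fresh enough to trust)."""
        self.refresh()
        if self._fetched_at is None:
            return "manual"
        if self._clock() - self._fetched_at > self._max_age:
            return "manual"
        return "deny" if passport_id in self._entries else "allow"
```

The switch from the live source to the cached copy happens inside `check`, with no human involved, and the gate only hands off to manual processing once its copy is older than the agreed limit. Where that limit sits is exactly the kind of security trade-off that would depend on the real requirements.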
Yes, and all completely solvable problems with a truly distributed design.
1
...and that is the whole point of this video, it is nothing to do with perfect systems, it is about designing for resilience so that when things do go wrong, the system can cope without total failure.
1
Yup! Marvin the Paranoid Android.
1
My point is that dropping resilience because it is harder than fragility isn't a good answer. Yes, event-based systems suffer all of the same failure points, but a well-designed event-based system makes it MUCH easier to design in resilience. It is, after all, how databases and virtually all serious financial systems work, oh, and telecoms as well. These problems are well understood, and have been for a very long time. That some people don't know these answers, and so choose overly naive solutions, seems like the real problem to me. Somebody earlier in the comments called it "Dunning Kruger Architecture" and I think that is a good description. This design was not fit for purpose for a system like this. I disagree with your characterisation that the kind of distributed design I suggest in this video is only suitable for gym membership. Quite the contrary, the idea of synchronising all these things is the less stable, more open to attack and more prone to failure solution.
1
I like the name 🤣🤣
1