Comments by "" (@grokitall) on "Delta Air Lines CEO on CrowdStrike outage: Cost us half a billion dollars in five days" video.
-
@mikkpal the independent after-action report is in, and it confirmed what most developers suspected must have happened for this to escape into the wild.
the crowdstrike software consists of three parts: the core kernel module, the template file which is loaded into kernel space, and the signature file which uses the parameters collected by the template file to identify threats.
as per the company policy mentioned by the ceo, the channel file updates get run against a verifier program and are then shipped directly to every customer, which is not what the companies using it were led to believe.
even worse, the template file does not get tested at all, because it is replaced by a mock. so when the signature file was updated to check specific values in the last parameter, it triggered a bug where the template file did not collect any data for that field. because the template was mocked out, this was not detected before release. the broken template file was bundled with the signature file to produce the channel file, which was then shipped to everyone.
by skipping integration testing with the main kernel module, the channel file was never actually run, so not only did the tests fail to detect this bug, they also failed to spot that the kernel module could not recover from a bad update. by failing to do a canary release, they missed the last chance to catch these bugs before they were shipped to customers.
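to make the mocking point concrete, here is a minimal python sketch (every name in it is invented for illustration, this is not crowdstrike's actual code) of how unit testing a new signature check only against a mock hides a missing field that a single integration test against the real collector would have caught:

```python
# hypothetical sketch: a mocked collector hides a missing-field bug
# that an integration test against the real collector exposes.

# stand-in for the real template file: it only populates 20 parameters,
# so "param20" is never present (this models the bug).
def real_template_collect():
    return {f"param{i}": "data" for i in range(20)}

# stand-in for the mock used during verification: always fully populated.
def mock_template_collect():
    return {f"param{i}": "data" for i in range(21)}

# stand-in for the updated signature check, which now reads the last parameter.
def signature_check(params):
    return params["param20"]

# unit test with the mock: passes, so the verifier signs off on the update.
signature_check(mock_template_collect())

# integration test against the real collector: raises KeyError, the analogue
# of the bad read that crashed the kernel driver. this is the test never run.
try:
    signature_check(real_template_collect())
except KeyError:
    print("integration test would have caught the bad channel file")
```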
2
-
@lordgarion514 but lots of the people who were hit ran release n-1 on their production systems and n-2 on their backups. unfortunately, the crowdstrike rapid release process gives the impression that this setting applies to the rapid updates as well, when it does not. what customers expected was a release process engineered to quickly stop a problem and recover: a bad update would be spotted on the customers' test machines, so it would never reach the n-1 systems in production, and even if it somehow took long enough to hit them, it would still not reach the n-2 backup machines.
what actually happened is that this flag only applied to updates of the core kernel module, not to the two files in the channel update, so when the channel update was shipped untested it kept shipping for 90 minutes, taking down the test machines and the production machines at the same time, and when operations shifted to the backups it killed them too.
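a rough python sketch (purely illustrative, with made-up names and version numbers, nothing to do with crowdstrike's real updater) of the gap between the policy customers thought they had set and what was actually enforced:

```python
# illustrative only: the n-1 / n-2 pin gates full sensor (kernel driver)
# releases, while channel content updates bypass the pin entirely.

LATEST_SENSOR = 100  # made-up version number

def should_install(update_kind, update_version, pin_behind):
    if update_kind == "sensor":
        # the n-1 / n-2 policy is honoured here: a host pinned one or two
        # releases back refuses the newest kernel driver.
        return update_version <= LATEST_SENSOR - pin_behind
    # channel content updates ignore the pin and go to every host at once,
    # which is how test, production and backup machines all received the
    # bad file inside the same 90 minute window.
    return True

print(should_install("sensor", 100, pin_behind=1))   # False: n-1 host skips it
print(should_install("channel", 100, pin_behind=1))  # True: pin not consulted
print(should_install("channel", 100, pin_behind=2))  # True: backups hit too
```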
the only thing crowdstrike got right in all of this was saying "yes, it was us".
2
-
@christianamaral-c3l the guy from delta pointed out that they had some of the highest exposure, with 40,000 machines affected.
if you ignore travel time, assume there are enough techs on site, and assume a repair time of ten minutes per machine working 24 hours per day, you are still talking about 277 man-days to get them all back up, even before you add in the time to fix machines that were not designed to cope with being crashed.
when you add in the restore time for broken machines, travel time, and not being able to get techs in the right place to fix machines, those numbers can go up very fast.
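the arithmetic is easy to sanity check (python, using the assumptions above):

```python
# back-of-the-envelope check of the recovery estimate above.
machines = 40_000            # affected machines, per the delta figure
minutes_per_machine = 10     # assumed hands-on repair time
minutes_per_day = 24 * 60    # techs working around the clock

man_days = machines * minutes_per_machine / minutes_per_day
print(f"{man_days:.1f} man-days")   # 277.8 man-days of pure hands-on work
```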
1
-
on your point about live updates, every customer understood from crowdstrike that the n-1 and n-2 settings in the software applied to the live patching as well, that the live patches received comprehensive testing before shipping, and that the kernel driver would catch a bad live update and automatically roll it back.
none of this was true. when the test machines at these companies got hit by the buggy patch they crashed, but so did the live production machines running n-1 driver versions. the load then failed over to the n-2 machines, which also applied the live update and went down, and the crash hit a microsoft kernel boot loop bug that has been known and unfixed for a decade, leaving every machine needing manual intervention to restore.
when your supplier gives you the wrong information, and then actively subverts your resiliency planning, most companies cannot cope.
1