Comments by "" (@grokitall) on "Delta Air Lines CEO on CrowdStrike outage: Cost us half a billion dollars in five days" video.
-
@mikkpal the independent after-action report is in, and it confirmed what most developers suspected must have happened for this to escape into the wild.
the crowdstrike software consists of three parts: the core kernel module, the template file which is loaded into kernel space, and the signature file which uses the parameters collected by the template file to identify threats.
as per the company policy mentioned by the ceo, the channel file updates get run against a verifier program and are then shipped directly to every customer, which is not what the companies using it were led to believe.
even worse, the template file does not get tested at all, because it is replaced by a mock. so when the signature file was updated to check specific values in the last parameter, it triggered a bug where the template file did not collect any data for that field. because the template was mocked out, this was not detected before release. the broken template file was bundled with the signature file to produce the channel file, which was then shipped to everyone.
by skipping integration testing with the main kernel module, the channel file was never actually run, so not only did the tests fail to detect this bug, they also failed to spot that the kernel module could not recover from a bad update. by failing to do a canary release, they missed the last chance to catch these bugs before they were shipped to customers.
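to make the mocking point concrete, here is a minimal python sketch (every name in it is invented for illustration, this is not crowdstrike's actual code) of how unit testing a new signature check only against a mock hides a missing field that a single integration test against the real collector would have caught:

```python
# hypothetical sketch: a mocked collector hides a missing-field bug
# that an integration test against the real collector exposes.

# stand-in for the real template file: it only populates 20 parameters,
# so "param20" is never present (this models the bug).
def real_template_collect():
    return {f"param{i}": "data" for i in range(20)}

# stand-in for the mock used during verification: always fully populated.
def mock_template_collect():
    return {f"param{i}": "data" for i in range(21)}

# stand-in for the updated signature check, which now reads the last parameter.
def signature_check(params):
    return params["param20"]

# unit test with the mock: passes, so the verifier signs off on the update.
signature_check(mock_template_collect())

# integration test against the real collector: raises KeyError, the analogue
# of the bad read that crashed the kernel driver. this is the test never run.
try:
    signature_check(real_template_collect())
except KeyError:
    print("integration test would have caught the bad channel file")
```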
2
-
@lordgarion514 but lots of the people who were hit ran release n-1 on their production systems and n-2 on their backups. unfortunately, the crowdstrike rapid release process gives the impression that this setting applies to the rapid updates as well, when it does not. what customers expected was a release process engineered to quickly stop a problem and recover: a bad update would be spotted on the customers' test machines, so it would never reach the n-1 systems in production, and even if it somehow took long enough to hit them, it would still not reach the n-2 backup machines.
what actually happened is that this flag only applied to updates of the core kernel module, not to the two files in the channel update, so when the channel update was shipped untested it kept shipping for 90 minutes, taking down the test machines and the production machines at the same time, and when operations shifted to the backups it killed them too.
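a rough python sketch (purely illustrative, with made-up names and version numbers, nothing to do with crowdstrike's real updater) of the gap between the policy customers thought they had set and what was actually enforced:

```python
# illustrative only: the n-1 / n-2 pin gates full sensor (kernel driver)
# releases, while channel content updates bypass the pin entirely.

LATEST_SENSOR = 100  # made-up version number

def should_install(update_kind, update_version, pin_behind):
    if update_kind == "sensor":
        # the n-1 / n-2 policy is honoured here: a host pinned one or two
        # releases back refuses the newest kernel driver.
        return update_version <= LATEST_SENSOR - pin_behind
    # channel content updates ignore the pin and go to every host at once,
    # which is how test, production and backup machines all received the
    # bad file inside the same 90 minute window.
    return True

print(should_install("sensor", 100, pin_behind=1))   # False: n-1 host skips it
print(should_install("channel", 100, pin_behind=1))  # True: pin not consulted
print(should_install("channel", 100, pin_behind=2))  # True: backups hit too
```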
the only thing crowdstrike got right in all of this was saying "yes, it was us".
2
-
@christianamaral-c3l the guy from delta pointed out that they had some of the highest exposure, with 40,000 machines affected.
if you ignore travel time, assume there are enough techs on site, and assume a repair time of ten minutes per machine working 24 hours per day, you are still talking about 277 man-days to get them all back up, even before you add in the time to fix machines that were not designed to cope with being crashed.
when you add in the restore time for broken machines, travel time, and not being able to get techs in the right place to fix machines, those numbers can go up very fast.
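the arithmetic is easy to sanity check (python, using the assumptions above):

```python
# back-of-the-envelope check of the recovery estimate above.
machines = 40_000            # affected machines, per the delta figure
minutes_per_machine = 10     # assumed hands-on repair time
minutes_per_day = 24 * 60    # techs working around the clock

man_days = machines * minutes_per_machine / minutes_per_day
print(f"{man_days:.1f} man-days")   # 277.8 man-days of pure hands-on work
```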
1
-
on your point about live updates, every customer understood from crowdstrike that the n-1 and n-2 settings in the software applied to the live patching as well, that the live patches received comprehensive testing before shipping, and that the kernel driver would catch a bad live update and automatically roll it back.
none of this was true. when the test machines at these companies got hit by the buggy patch they crashed, but so did the live production machines running n-1 driver versions. the load then failed over to the n-2 machines, which also applied the live update and went down, and the crash hit a microsoft kernel boot loop bug that has been known and unfixed for a decade, leaving every machine needing manual intervention to restore.
when your supplier gives you the wrong information, and then actively subverts your resiliency planning, most companies cannot cope.
1