Comments by "" (@grokitall) on "Software's HUGE Impact On The World | Crowdstrike Global IT Outage" video.

  16. @lapis.lazuli From what little information has leaked out, a number of things can be said about what went wrong. First, the file seems to have had a large block replaced with zeros. If that block was in the driver, testing on the first machine you tried it on would have found it, because lots of tests would simply have failed, which should have blocked the deployment. If it was a config file or a signature file, lots of very good programmers will write a checker so that a broken file is not even allowed to be checked in to version control (a sketch of that kind of gate follows this comment). In either case, basic good-practice testing should have caught it and stopped it before it went out of the door. Since that did not happen, we can tell that their testing regime was not good.

      It also tells us they were not running this thing in house: if they had been, the release would have been blocked almost immediately. They did not do canary releasing either, and the software did not include smoke tests to check that the system even got to the point of booting. If it had, the agent could have disabled itself when the machine rebooted a second time without a simple "yes, it worked" flag having been set, and it could then have phoned home, flagging the problem and blocking the deployment. According to some reports, this particular update also ignored customer upgrade policies; if so, they deserve everything thrown at them. Some reports go as far as saying that a manager specifically said to ship without bothering to do any tests. In either case, a mandatory automatic update policy for anything, let alone a kernel module, is really stupid.
    1
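
Below is a minimal sketch, in Python, of the kind of pre-check-in gate described in the comment above. The zero-run threshold, the command-line wiring, and the idea of calling it from a pre-commit hook or CI job are all assumptions for illustration; the only point is that a cheap validator can refuse a content file whose payload has been zeroed out before it ever reaches version control or a release.

"""Minimal sketch of a pre-check-in validator for a content/signature file.

Everything here is hypothetical: the threshold and the invocation are made up.
The point is only that a cheap check run before commit (or before release)
rejects a file whose payload has been replaced with a large block of zeros.
"""
import sys

ZERO_RUN_LIMIT = 512  # hypothetical: reject any run of 512 or more zero bytes


def longest_zero_run(data: bytes) -> int:
    """Return the length of the longest run of 0x00 bytes in the data."""
    longest = current = 0
    for byte in data:
        current = current + 1 if byte == 0 else 0
        longest = max(longest, current)
    return longest


def validate_content_file(path: str) -> bool:
    """Basic sanity checks: file is non-empty and has no large zeroed block."""
    with open(path, "rb") as fh:
        data = fh.read()
    if not data:
        print(f"{path}: empty file", file=sys.stderr)
        return False
    run = longest_zero_run(data)
    if run >= ZERO_RUN_LIMIT:
        print(f"{path}: {run}-byte block of zeros, refusing to accept", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    # Exit non-zero so a pre-commit hook or CI job can block the change.
    ok = all(validate_content_file(p) for p in sys.argv[1:])
    sys.exit(0 if ok else 1)

Invoked from a pre-commit hook or a pipeline step, the non-zero exit status is what lets the surrounding tooling block the broken file from being checked in or shipped.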
  25. @ansismaleckis1296 The problem with branching is that when you take more than a day between merges, it becomes very hard to keep the build green, and it pushes you towards merge hell. The problem with code review via pull requests is that when you issue the pull request and then have to wait for review before the merge, it slows everything down. That in turn makes it more likely that patches will get bigger, which makes them take longer to review, making the process slower and harder, and thus more likely to miss your one-day merge window.

      The whole problem comes back to the question of what version control is for: originally, a continuous backup of every change. That soon turned out to be of less use than expected, because most of those backups were in a broken state, sometimes going months between releasable builds, which made most of them of very little value. The solution turned out to be smaller patches merged more often, but pre-merge manual review was found not to scale, so a different solution was needed: automated regression tests against the public API, which guard against the next change breaking existing code. That is what continuous integration is: running all those tests to make sure nothing broke.

      The best time to write a test is before you write the code, because then you have checked that the test can both fail and pass, which tells us the code does what the developer intended it to do. TDD adds refactoring into the cycle, which further checks that the test does not depend on the implementation (a sketch of the test-first cycle follows this comment). The problem with not merging often enough is that it breaks refactoring: either you cannot do it, or the merge of the huge patch has to manually re-apply the refactoring to the unmerged code.

      Continuous delivery takes the output of continuous integration, which is all the deployable items, and runs every other kind of test against it, trying to prove it unfit for release. If it fails to find any issues, the build can be deployed. The deployment can then be done with canary releasing, with chaos engineering used to test the resilience of the system and a rollback performed if needed. It looks too good to be true, but it is what is actually done by most of the top companies in the DORA State of DevOps report.
    1
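
Below is a minimal sketch, using Python's standard unittest module, of the test-first cycle the comment describes. The function discounted_total and its behaviour are invented for illustration; the point is the order of work: the test is written first and seen to fail, the smallest code that passes it is then added, and the suite keeps running on every small merge as a regression guard over the public API.

"""Minimal sketch of the test-first (red/green) cycle.

The names here are hypothetical; only the workflow is the point.
"""
import unittest


# Step 2 (written after the tests below had been seen to fail): the production
# code. It sits behind the public API that later changes must not break.
def discounted_total(prices: list[float], discount: float) -> float:
    """Sum the prices and apply a fractional discount (0.1 means 10% off)."""
    return sum(prices) * (1.0 - discount)


class DiscountedTotalRegressionTest(unittest.TestCase):
    # Step 1: these tests existed before discounted_total did, so the first
    # run failed, proving the test is able to fail as well as pass.
    def test_ten_percent_off(self):
        self.assertAlmostEqual(discounted_total([10.0, 20.0], 0.10), 27.0)

    def test_no_discount_is_plain_sum(self):
        self.assertAlmostEqual(discounted_total([1.5, 2.5], 0.0), 4.0)


if __name__ == "__main__":
    # In CI this suite runs on every merge, so the next small patch cannot
    # silently break the existing public behaviour.
    unittest.main()

Because the tests exercise only the public behaviour (inputs and return value), a later refactoring of the implementation does not require touching them, which is the property the comment is pointing at.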