Comments by "" (@grokitall) on "Software's HUGE Impact On The World | Crowdstrike Global IT Outage" video.

  16. @lapis.lazuli From what little information has leaked out, a number of things can be said about what went wrong. First, the file seems to have had a large block replaced with zeros. If that block was in the driver, testing on the first machine you tried it on would have found it, because lots of tests would simply have failed, which should have blocked the deployment. If it was a config file or a signature file, lots of very good programmers will write a checker so that a broken file is not even allowed to be checked in to version control (a sketch of that kind of gate follows this comment). In either case, basic good-practice testing should have caught it and stopped it before it went out of the door. Since that did not happen, we can tell that their testing regime was not good.

      It also tells us they were not running this thing in house: if they had been, the release would have been blocked almost immediately. They did not do canary releasing either, and the software did not include smoke tests to check that the system even got to the point of booting. If it had, the agent could have disabled itself when the machine rebooted a second time without a simple "yes, it worked" flag having been set, and it could then have phoned home, flagging the problem and blocking the deployment. According to some reports, this particular update also ignored customer upgrade policies; if so, they deserve everything thrown at them. Some reports go as far as saying that a manager specifically said to ship without bothering to do any tests. In either case, a mandatory automatic update policy for anything, let alone a kernel module, is really stupid.
    1
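
Below is a minimal sketch, in Python, of the kind of pre-check-in gate described in the comment above. The zero-run threshold, the command-line wiring, and the idea of calling it from a pre-commit hook or CI job are all assumptions for illustration; the only point is that a cheap validator can refuse a content file whose payload has been zeroed out before it ever reaches version control or a release.

"""Minimal sketch of a pre-check-in validator for a content/signature file.

Everything here is hypothetical: the threshold and the invocation are made up.
The point is only that a cheap check run before commit (or before release)
rejects a file whose payload has been replaced with a large block of zeros.
"""
import sys

ZERO_RUN_LIMIT = 512  # hypothetical: reject any run of 512 or more zero bytes


def longest_zero_run(data: bytes) -> int:
    """Return the length of the longest run of 0x00 bytes in the data."""
    longest = current = 0
    for byte in data:
        current = current + 1 if byte == 0 else 0
        longest = max(longest, current)
    return longest


def validate_content_file(path: str) -> bool:
    """Basic sanity checks: file is non-empty and has no large zeroed block."""
    with open(path, "rb") as fh:
        data = fh.read()
    if not data:
        print(f"{path}: empty file", file=sys.stderr)
        return False
    run = longest_zero_run(data)
    if run >= ZERO_RUN_LIMIT:
        print(f"{path}: {run}-byte block of zeros, refusing to accept", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    # Exit non-zero so a pre-commit hook or CI job can block the change.
    ok = all(validate_content_file(p) for p in sys.argv[1:])
    sys.exit(0 if ok else 1)

Invoked from a pre-commit hook or a pipeline step, the non-zero exit status is what lets the surrounding tooling block the broken file from being checked in or shipped.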
  25. @ansismaleckis1296 The problem with branching is that when you take more than a day between merges, it becomes very hard to keep the build green, and it pushes you towards merge hell. The problem with code review via pull requests is that when you issue the pull request and then have to wait for review before the merge, it slows everything down. That in turn makes it more likely that patches will get bigger, which makes them take longer to review, making the process slower and harder, and thus more likely to miss your one-day merge window.

      The whole problem comes back to the question of what version control is for: originally, a continuous backup of every change. That soon turned out to be of less use than expected, because most of those backups were in a broken state, sometimes going months between releasable builds, which made most of them of very little value. The solution turned out to be smaller patches merged more often, but pre-merge manual review was found not to scale, so a different solution was needed: automated regression tests against the public API, which guard against the next change breaking existing code. That is what continuous integration is: running all those tests to make sure nothing broke.

      The best time to write a test is before you write the code, because then you have checked that the test can both fail and pass, which tells us the code does what the developer intended it to do. TDD adds refactoring into the cycle, which further checks that the test does not depend on the implementation (a sketch of the test-first cycle follows this comment). The problem with not merging often enough is that it breaks refactoring: either you cannot do it, or the merge of the huge patch has to manually re-apply the refactoring to the unmerged code.

      Continuous delivery takes the output of continuous integration, which is all the deployable items, and runs every other kind of test against it, trying to prove it unfit for release. If it fails to find any issues, the build can be deployed. The deployment can then be done with canary releasing, with chaos engineering used to test the resilience of the system and a rollback performed if needed. It looks too good to be true, but it is what is actually done by most of the top companies in the DORA State of DevOps report.
    1
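
Below is a minimal sketch, using Python's standard unittest module, of the test-first cycle the comment describes. The function discounted_total and its behaviour are invented for illustration; the point is the order of work: the test is written first and seen to fail, the smallest code that passes it is then added, and the suite keeps running on every small merge as a regression guard over the public API.

"""Minimal sketch of the test-first (red/green) cycle.

The names here are hypothetical; only the workflow is the point.
"""
import unittest


# Step 2 (written after the tests below had been seen to fail): the production
# code. It sits behind the public API that later changes must not break.
def discounted_total(prices: list[float], discount: float) -> float:
    """Sum the prices and apply a fractional discount (0.1 means 10% off)."""
    return sum(prices) * (1.0 - discount)


class DiscountedTotalRegressionTest(unittest.TestCase):
    # Step 1: these tests existed before discounted_total did, so the first
    # run failed, proving the test is able to fail as well as pass.
    def test_ten_percent_off(self):
        self.assertAlmostEqual(discounted_total([10.0, 20.0], 0.10), 27.0)

    def test_no_discount_is_plain_sum(self):
        self.assertAlmostEqual(discounted_total([1.5, 2.5], 0.0), 4.0)


if __name__ == "__main__":
    # In CI this suite runs on every merge, so the next small patch cannot
    # silently break the existing public behaviour.
    unittest.main()

Because the tests exercise only the public behaviour (inputs and return value), a later refactoring of the implementation does not require touching them, which is the property the comment is pointing at.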