Comments by @grokitall on the "ThePrimeTime" channel.

  3. it is not business ethics which requires the shift in your company policy, but the resiliency lessons learned after 9/11. many businesses with what were thought to be good enough plans had them fail dramatically when faced with the loss of data centres duplicated between the twin towers, the loss of the main telephone exchange covering a large part of the city, and being locked out of their buildings until the area was safe while their backup diesel generators failed because the dust clogged their air intake filters. for the businesses it did not kill, recovery times were often on the order of weeks to regain access to their equipment and months to get back to the levels they were at previously. this led directly to the rise of chaos engineering, which identifies and tests systems for single points of failure, graceful degradation and recovery, as seen with the simian army of tools at netflix.

load balancing across multiple suppliers in multiple areas is just a mitigation strategy against single points of failure, and in this case the bad actors at cloudflare were clearly a single point of failure. with a good domain name registrar you can not only add new nameservers, which i would have done as part of looking for new providers, you can also shorten the time that other people looking up your domain cache the name server entries to under an hour, which i would have done as soon as potential new hosting was being explored and trialed. as long as your domain registrar is trustworthy and you practice resiliency, the mitigation could have been really fast. changing the name server ordering could have been done as soon as they received the 24 hour ransom demand, giving time for the caches to update and making the move invisible to most people. not only did they not do that, or have any obvious resiliency policy, they also built critical infrastructure around products from external suppliers without any plan for what to do if there was a problem. clearly cloudflare's behaviour was dodgy, but the casino shares some of the blame for being an online business with insufficient plans for how to stay online.
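
the name server ttl point above can be checked from the outside. a minimal sketch, assuming the third-party dnspython package is installed and using example.com as a stand-in for the real domain:

```python
# minimal sketch: see how long resolvers may cache a domain's NS records.
# assumes the third-party dnspython package (pip install dnspython);
# example.com is a stand-in for the real domain.
import dns.resolver

def ns_cache_window(domain: str) -> int:
    """Return the TTL (in seconds) currently advertised for the domain's NS records."""
    answer = dns.resolver.resolve(domain, "NS")
    return answer.rrset.ttl

if __name__ == "__main__":
    ttl = ns_cache_window("example.com")
    print(f"NS records may be cached for up to {ttl} seconds")
    if ttl > 3600:
        print("over an hour: lower the ttl at the registrar before any planned move")
```
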
  10. we now know what should have happened and what actually happened, and they acted like amateurs. first, they generated the file, which went wrong. then they did the right thing and ran a home built validator against it, but not as part of ci. then, after it passed validation, they built the deliverable. then they shipped it out to 8.5 million mission critical systems with no further testing whatsoever, which is a level of stupid that has to be seen to be believed. this then triggered some really poor code in the driver, crashing windows, and because they had marked the driver as boot critical the whole thing went into a boot loop.

this all could have been stopped before it even left the building. after validating the file, you should continue on with the rest of the testing, just as if you had changed the code. this would have caught it. having done some tests and created the deployment script, you could have installed it on test machines. this also would have caught it. finally, you start a canary release process, beginning with the machines in your own company. this also would have caught it. if any of these steps had been done it would never have got out the door, and they would have learned a few things: 1, their driver was rubbish and boot looped if certain things went wrong, which could then have been fixed so it would never boot loop again. 2, their validator was broken, which could then have been fixed. 3, whatever created the file was broken, which could also have been fixed.

instead they learned different lessons: 1, they are a bunch of unprofessional amateurs. 2, their release methodology stinks. 3, shipping without testing is really bad, and causes huge reputational damage. 4, that damage makes the share price drop off a cliff. 5, it harms a lot of your customers, some with very big legal departments and a will to sue, and some lawsuits are already announced as pending. 6, lawsuits hurt profits, and we just don't know how badly yet. 7, hurting profits makes the share price drop even further. not a good day to be crowdstrike. some of those lawsuits could also target microsoft for letting the boot loop disaster happen, as this has happened before and they still have not fixed it.
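
a minimal sketch of the kind of staged release gate described above: validate, install on a test machine, then canary rings starting with your own company. every name in it (validate_content, install_on, the ring lists) is hypothetical, and it only illustrates the ordering of the stages, not any vendor's actual pipeline:

```python
# minimal sketch of a staged release gate: validate, test install, then canary.
# validate_content, install_on and the ring names are hypothetical; the point
# is only that each stage must pass before the next, wider one starts.
from typing import Callable

CANARY_RINGS: list[list[str]] = [
    ["internal-machines"],        # your own company first
    ["one-percent-of-customers"],
    ["ten-percent-of-customers"],
    ["everyone-else"],
]

def release(content_file: str,
            validate_content: Callable[[str], bool],
            install_on: Callable[[str, str], bool]) -> bool:
    """Ship a content file only if every earlier stage passes."""
    if not validate_content(content_file):
        return False                      # a broken file or broken validator stops here
    if not install_on(content_file, "test-machine"):
        return False                      # a real install on a test box catches a boot loop
    for ring in CANARY_RINGS:
        if not all(install_on(content_file, host) for host in ring):
            return False                  # halt the rollout at the first failing ring
    return True
```
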
  17. the main problem here is that prime and his followers are responding to the wrong video. this video is aimed at people who already understand 10+ textbooks worth of material, with lots of agreed upon terminology, and it is explaining to them why the tdd haters don't get it. most of that comes down to the fact that the multiple fields involved build on top of each other, and the haters don't actually share the same definitions for many of the terms, or for the processes involved. in fact, in a lot of cases, especially within this thread, the definitions the commentators use directly contradict the standard usage within the field.

in the field of testing, testing is split into lots of different types, including unit testing, integration testing, acceptance testing, regression testing, exploratory testing, and lots of others. if you read any textbook on testing, a unit test is very small, blindingly fast, does not usually include io in any form, and does not usually include state across calls or long involved setup and teardown stages. typically a unit test will only address one line of code, and will be a single assert that, given a particular input, the code responds with the same output every time. everything else is usually an integration test. you will then have a set of unit tests that provide complete coverage for a function. this set of unit tests is then used as regression tests to determine if the latest change to the codebase has broken the function, by asserting as a group that the change has not changed the behaviour of the function. pretty much all of the available research says that the only way to scale this is to automate it.

tdd uses this understanding by asserting that the regression test for the next line of code should be written before you write that line of code, and because the tests are very simple and very fast, you can run them against the file at every change and still work fast. because you keep them around, and they are fast, you can quickly determine if a change in behaviour in one place broke behaviour somewhere else, as soon as you make the change. this makes debugging trivial, as you know exactly what you just changed, and because you gave your tests meaningful names, you know exactly what that broke.

continuous integration reruns the tests on every change, running both unit tests and integration tests to show that the code continues to do what it did before, nothing more. this is designed to run fast, and fail faster. when all the tests pass, the build is described as being green. when you add the new test, but not the code, you now have a failing test, and the entire build fails, showing that the system as a whole is not ready to release, nothing more. the build is then described as being red. this is where the red-green terminology comes from, and it is used to show that a green build is ready to check in to version control, which is an integral part of continuous integration. this combination of unit and integration tests is used to show that the system does what the programmer believes the code should do. if this is all you do, you still accumulate technical debt, so tdd adds the refactoring step to manage and reduce technical debt.

refactoring is defined as changing the code in such a way that the functional requirements do not change, and this is tested by rerunning the regression tests to demonstrate that the changes have improved the structure without changing the functional behaviour of the code. this can be deleting dead code, merging duplicate code so you only need to maintain it in one place, or one of hundreds of other behaviour preserving changes which improve the code. during the refactoring step, no functional changes to the code are allowed. adding a test for a bug, or to make the code do something more, happens at the start of the next cycle.

continuous delivery then builds on top of this by adding acceptance tests, which confirm that the code does what the customer thinks it should be doing. continuous deployment builds on top of continuous delivery to make it so that the whole system can be deployed with a single push of a button, and this is what is used by netflix for software, hp for printer development, tesla and spacex for their assembly lines, and lots of other companies for lots of things.

the people in this thread have conflated unit tests, integration tests and acceptance tests all under the heading of unit tests, which is not how the wider testing community uses the term. they have also advocated for the deletion of all regression tests based on unit tests. a lot of the talk about needing to know the requirements in advance is based on this idea that a unit test is a massive, slow, complex thing with large setup and teardown, but that is not how it is used in tdd. there you are only required to understand the next line of code well enough to write a unit test for it that will act as a regression test. this appears to be where a lot of the confusion is coming from.

in short, in tdd you have three steps: 1, understand the needs of the next line of code well enough that you can write a regression test for it, write the test, and confirm that it fails. 2, write enough of that line to make the test pass. 3, use behaviour preserving refactorings to improve the organisation of the codebase. then go around the loop again. if during steps 2 and 3 you think of any other changes to make to the code, add them to a todo list, and then pick one to do on the next cycle. this expanding todo list is what causes the tests to drive the design. you do something extra for flaky tests, but that is outside the scope of tdd, and is part of continuous integration. it should be pointed out that both android and chromeos use the ideas of continuous integration with extremely high levels of unit testing. tdd fits naturally into this process, which is why so many companies using ci also use tdd, and why so many users of tdd do not want to go back to the old methods.
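
a minimal sketch of one red-green-refactor cycle using python's built-in unittest module; the add function and its test are invented purely for illustration:

```python
# one red-green-refactor cycle, sketched with python's built-in unittest.
# the function and the test are invented purely for illustration.
import unittest

# step 2 (green): just enough code to make the test below pass.
# during step 1 (red) this function did not exist yet, so the build was red.
def add(a: int, b: int) -> int:
    return a + b

class TestAdd(unittest.TestCase):
    # step 1 (red): written before the code, with a meaningful name so a
    # later failure says exactly which behaviour broke.
    def test_add_returns_sum_of_two_positive_integers(self):
        self.assertEqual(add(2, 3), 5)

if __name__ == "__main__":
    # step 3 (refactor): rerun this after every behaviour preserving change.
    unittest.main()
```
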
  23. Every branch is essentially a fork of the entire codebase of the project, with all of the negative connotations implied by that statement. In distributed version control systems this fork, which is implicit in centralized version control, is made explicit. When two forks exist (for simplicity call them upstream and branch), there are only two ways to avoid them becoming permanently incompatible. Either you slow everything down and make it so that nothing moves from the branch to upstream until it is perfect, which results in long lived branches with big patches, or you speed things up by merging every change as soon as it does something useful, which leads to continuous integration.

When taking the fast approach, you need a way to show that you have not broken anything with your new small patch. The way this is done is with small fast unit tests which act as regression tests against the new code; you write them before you commit the code for the new patch and commit them at the same time, which is why people using continuous integration end up with a codebase with extremely high levels of code coverage. What happens next is that you run all the tests, and when they pass you know it is safe to commit the change. The change can then be rebased and pushed upstream, which runs all the new tests against any new changes, and you end up producing a testing candidate which could be deployed, and it becomes the new master. When you want to make the next change, as you have already rebased before pushing upstream, you can trivially rebase again before you start and make new changes. This makes the cycle very fast, ensures that everyone stays in sync, and works even at the scale of the Linux kernel, which has new changes upstreamed every 30 seconds.

In contrast, the slow version works not by having small changes guarded by tests, but by having nothing moved to upstream until it is both complete and as perfect as can be detected. As it is not guarded by tests, it is not designed with testing in mind, which makes any testing slow and fragile, further discouraging testing, and is why followers of the slow method dislike testing. It also leads to merge hell, as features without tests get delivered as a big code dump all in one go, which may then cause problems for those on other branches which have incompatible changes. You then have to spend a lot of time finding which part of this large patch with no tests broke your branch. This is avoided with the fast approach, as all of the changes are small. Even worse, all of the code in all of the long lived branches is invisible to anyone taking upstream and trying to do refactoring to reduce technical debt, adding another source of breaking your branch with the next rebase. Pull requests with peer review add yet another source of delay, as you cannot submit your change upstream until someone else approves it, which can take tens to hundreds of minutes depending on the size of your patch. The fast approach replaces manual peer review with comprehensive automated regression testing, which is both faster and more reliable. In return you get to spend a lot less time bug hunting. The unit tests and integration tests in continuous integration get you to a point where you have a release candidate which does all of the functions the programmer understood were wanted.

This does not require all of the features to be enabled by default, only that the code is in the main codebase. This is usually done by replacing the long lived feature branch with short lived branches (short in the sense of the time between code merges) whose code is shipped but hidden behind feature flags, which also allows the people on other branches to reuse the code from your branch rather than having to duplicate it in their own. Continuous delivery goes one step further: it takes the release candidate output from continuous integration, runs all of the non-functional tests to demonstrate a lack of regressions in performance, memory usage, etc., and then adds a set of acceptance tests to confirm that what the programmer understood matches what the user wanted. The output from this is a deployable set of code which has already been packaged and deployed to testing, and can thus be deployed to production. Continuous deployment goes one step further still and automatically deploys it to your oldest load sharing server, then uses the ideas of chaos engineering and canary deployments to gradually increase the load taken by this server while reducing the load on the next oldest, until either all of the load has moved from the oldest to the newest, or a previously unspotted problem is observed and the rollout is reversed. Basically, though, all of this starts with replacing slow long lived feature branches with short lived branches, which is what lets the continuous integration build keep lots of regression tests passing at all times, something which by definition cannot be done against code hidden away on a long lived feature branch that does not get committed until the entire feature is finished.
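
A minimal sketch of the feature flag idea above; the flag store, flag name, and discount rule are all invented for illustration:

```python
# Minimal sketch of shipping code on the main branch while keeping it dormant
# behind a feature flag. The flag store, flag name and discount rule are invented.
import os

def flag_enabled(name: str) -> bool:
    """Read a flag from the environment; a real system would use a flag service."""
    return os.environ.get(f"FLAG_{name.upper()}", "off") == "on"

def checkout_total(items: list[float]) -> float:
    total = sum(items)
    # The new code is merged and visible to every other branch, so it can be
    # reused and refactored, but it does nothing until the flag is turned on.
    if flag_enabled("bulk_discount") and len(items) >= 10:
        total *= 0.95
    return total

if __name__ == "__main__":
    print(checkout_total([10.0] * 12))   # 120.0 with the flag off, 114.0 with it on
```
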
  25. it clearly stated that the first email said there was a problem affecting the network, and when they turned up it was a meeting with a completely different department, sales, and there was no problem. there was also no mention of the enterprise offering being mandatory. at that point i would return to my company and start putting resiliency measures in place, with the intent to minimise exposure to cloudflare and prepare to migrate, but with the option to stay if they were not complete dicks. the second contact was about potential issues with multiple national domains, with a clear response that this was due to differing national regulations requiring them. the only other issue mentioned was a potential tos violation which they refused to name, and an immediate attempt to force a contract with a 120k price tag, with only 24 hours notice and a threat to kill your websites if you did not comply. at that point i would have immediately triggered the move.

on the legal view, they are obviously trying to force a contract, which others have said is illegal in the us, where cloudflare has its hardware based; it is thus subject to those laws. by only giving 24 hours from the time they were informed it was mandatory, they are clearly guilty of trying to force the contract, and thus the casino is likely to win. if they can win on that, then the threat to pull the plug on their business on short notice in pursuit of an illegal act also probably makes cloudflare guilty of tortious interference, for which they would definitely get actual damages covering loss of business earnings, probably reputational damages, probably all the costs of having to migrate to new providers, and legal costs. if i sued them, i would go after not only cloudflare but the entire board individually, seeking to make them jointly and severally liable, so that when they tried to delay payment you could go after them personally.

the lesson is clear: for resiliency, always have a second supplier in the wings which you can move to on short notice, and have that move be a simple yes or no decision that can be acted upon immediately. likewise, don't get so reliant on external tools that the business cannot keep working while it mitigates the disaster if it happens. also keep onsite backups of any business critical information, and most importantly, make sure you test the backups. at least one major business i know of did everything right, including testing the backup recovery process, but kept the only copy of the recovery key file on the desktop of one machine in one office, with the only backup of this key being inside the encrypted backups. this killed the business.
  37.  @noblebearaw  it used all the points in all the images to come up with a set of weighted values which together enabled a curve to be drawn, with all the images in one set on one side of the curve and all the images in the other set on the other side. that is the nature of statistical ai: it does not care about why it comes to the answer, only that the answer fits the training data. the problem with this approach is that you are creating a problem space with as many dimensions as you have free variables and then trying to draw a curve in that space, but there are many curves that fit the historical data, and you only find out which is the right one when you provide additional data which varies from the training data.

symbolic ai works in a completely different way. because it is a white box system, it can still use the same statistical techniques to determine the category which the image falls into, but this acts as the starting point. you then use this classification as a basis to start looking for why it is in that category, wrapping the statistical ai inside another process which takes the images fed into it, uses humans to spot where it got it wrong, and looks for patterns of wrong answers which help identify features within that multi dimensional problem space which are likely to match one side of the line or the other. this builds up a knowledge graph analogous to the structure of the statistical ai, but as each feature is recognised, named, and added to the model, it adds new data points to the model, with the difference being that you can drill down from the result to query which features are important, and why. this also provides chances for extra feedback loops not found in statistical ai.

if we look at compiled computer programs as an example, using c and makefiles to keep it simple, you would start off by feeding the statistical ai the code and makefile, plus the result of the ci / cd pipeline, so it can try to determine whether the change just made was releasable or not. eventually it might get good at predicting the answer, but you would not know why. the code contains additional data, implicit within it, which provides more useful answers. each step in the process gives usable additional data which can be queried later. was it a change in the makefile which stopped it building correctly? did it build ok, but segfault when it was run? how good is the code coverage of the tests on the code which was changed? does some test fail, and is it well enough named that it tells you why it failed? and so on. a lot of these failures will also give you line numbers and positions within specific files as part of the error message. if you are using version control, you also know what the code was before and after the change, and if the error report is not good enough, you can feed the difference into a tool to improve the tests so that it can identify not only where the error is, but how to spot it next time. basically, you are using a human to encode information from the tools into an explicit knowledge graph, which ends up detecting that the code got it wrong because the change on line 75 of query.c makes a specific function return the wrong answer when passed specific data, because a branch which should have been taken to return the right answer was not taken, because the test on that line had one less = sign than was needed at position 12, making it an assignment rather than a test, so the test could never pass.

it could then also suggest replacing the = with == in the new code, thus fixing the problem. none of that information could be got from the statistical ai, as any features in the code used to find the problem are implicit in its internal model, and it contains none of the feedback loops needed to do more than identify that there is a problem. going back to the tank example, the symbolic ai would not only be able to identify that there was a camouflaged tank, but point out where it was hiding, using the fact that trees don't have straight edges, and then push the identified parts of the tank through a classification system to try and recognise the make and model, thus providing you with the capabilities and limitations of the identified vehicle as well as its presence and location. often when it gets stuck, it resorts to the fallback option of presenting the data to the human and saying "what do you know in this case which i don't", adding that information explicitly into the knowledge graph, and trying again to see if it altered the result.
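
a minimal sketch of encoding one such explicit, queryable rule: flag an assignment used inside a c if condition and suggest ==. the regex and the sample line are invented for illustration:

```python
# minimal sketch of one explicit, queryable rule of the kind described above:
# flag an assignment inside a c 'if' condition and suggest '=='.
# the regex and the sample line are invented for illustration only.
import re

# matches 'if (x = y)' but not 'if (x == y)', 'if (x <= y)' or 'if (x != y)'.
ASSIGNMENT_IN_CONDITION = re.compile(r"if\s*\(\s*\w+\s*=(?!=)")

def check_added_line(filename: str, lineno: int, line: str) -> str | None:
    """Return a human readable finding for one added diff line, or None."""
    match = ASSIGNMENT_IN_CONDITION.search(line)
    if match is None:
        return None
    col = line.index("=", match.start()) + 1
    return (f"{filename}:{lineno}:{col}: assignment inside a condition, "
            f"did you mean '=='?")

if __name__ == "__main__":
    print(check_added_line("query.c", 75, "    if (status = QUERY_OK) {"))
```
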
  38. There is some confusion about branches. Every branch is essentially a fork of the entire codebase from upstream. In centralized version control, upstream is the main branch, and everyone working on different features has their own branch which eventually merges back into the main branch. In decentralized version control, which copy is the main branch is a matter of convention rather than a feature of the tool, but the process works the same. When you clone upstream, you still get a copy of the entire codebase, but you do not have to bother creating a name for your branch, so people work in the local copy of master. They then write their next small commit, add tests, run them, rebase, and, assuming the tests pass, push to an online copy of their local repository and generate a pull request. If the merge succeeds, the next time they rebase the local copy will match upstream, which will have all of their completed work in it. At this point you have no unsynchronized code in your branch, so you can delete the named branch, or if distributed, the entire local copy, and you don't have to worry about it. If you later need to make new changes you can either respawn the branch from main / upstream, or clone from upstream, and you are ready to go with every upstream change. If you leave the branch inactive for a while, you have to remember to rebase before you start your new work to get to the same position. It is having lots of unsynchronized code living for a long time in the branch which causes all of the problems, because by definition anything living in a branch is not integrated and so does not enjoy the benefits granted by being merged. Those benefits include not having multiple branches make incompatible changes, and not finding out that things broke because someone did a refactoring while your code was invisible to them, leaving you to fix the breakage.
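
A minimal sketch of that cycle, driving git from Python; the remote and branch names are illustrative, and it assumes it is run inside a clone whose tests are discoverable by unittest:

```python
# Minimal sketch of the short-lived branch cycle described above, driving git
# from Python. Remote and branch names are illustrative; run inside a clone.
import subprocess

def git(*args: str) -> None:
    """Run one git command and stop the cycle if it fails."""
    subprocess.run(["git", *args], check=True)

def one_cycle(message: str) -> None:
    git("fetch", "origin")
    git("rebase", "origin/main")           # start from the current upstream state
    # ... make one small change plus the unit test that guards it ...
    git("add", "--all")
    # the fast local test run; upstream CI reruns everything after the push
    subprocess.run(["python", "-m", "unittest", "discover"], check=True)
    git("commit", "-m", message)
    git("push", "origin", "HEAD")          # publish; the pull request is opened from this push
```
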