Comments by "Mikko Rantalainen" (@MikkoRantalainen) on "Computerphile" channel.
There's a saying that UTF-8 was successful because the USA did not need to understand it. (Explanation: they could just keep using ASCII and magically be compatible with UTF-8.)
4300
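A minimal Python sketch of why that works (purely illustrative): every ASCII byte sequence is already a valid UTF-8 byte sequence, so pure-ASCII text is byte-for-byte unchanged when treated as UTF-8.

    # ASCII text encodes to the exact same bytes in UTF-8,
    # so legacy ASCII data is automatically valid UTF-8.
    text = "Hello, world!"                       # pure ASCII
    assert text.encode("ascii") == text.encode("utf-8")

    # The reverse is not true: non-ASCII code points need multi-byte sequences.
    assert "ä".encode("utf-8") == b"\xc3\xa4"    # two bytes, no ASCII equivalent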
@terner1234 Yes, supporting Hebrew when you can already fully support Arabic is just a much better starting point than supporting only English. I think the hardest part is when you have multiple languages mixed together. In the worst case you could have the overall layout in Arabic, a long quotation in English (meaning that the quotation must wrap over multiple lines in the middle of the Arabic text) and some Japanese names with ruby text above them. And once you can successfully support all that, some joker comes by and messes up your user interface with zalgo text overflowing over all the content.
6
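For anyone curious where the renderer gets its direction information: every code point carries a bidirectional category, and the layout engine runs the Unicode bidi algorithm over the mix. A tiny Python illustration of the lookup only; the actual reordering and line wrapping is done by the text-layout library, not by this.

    import unicodedata

    # Each code point has a bidi class the layout engine uses:
    # 'L' = left-to-right, 'R'/'AL' = right-to-left, 'EN'/'AN' = digits, etc.
    for ch in "Aا1١":
        print(repr(ch), unicodedata.bidirectional(ch))
    # 'A' -> L   (Latin letter, left-to-right)
    # 'ا' -> AL  (Arabic letter, right-to-left)
    # '1' -> EN  (European number)
    # '١' -> AN  (Arabic-Indic number)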
I think you should have included a sentence or two about text input. When you have a mixture of LTR and RTL input, your text caret can split into two to show where the next letter will go depending on its direction (a future left-to-right letter would go to one caret, a future right-to-left letter to the other). I'm pretty sure implementing that after the fact would be pretty hard indeed. And to make things even worse, many languages require an IME to enter text (e.g. traditional Chinese), where you have to render something while it has only been entered partially. For more Latin-like scripts, combining characters are one example, too.
4
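On the combining-character point, a small Python sketch of why caret movement and string length cannot simply count code points; whether an editor normalizes or works on grapheme clusters is its own design decision (grapheme segmentation needs a library such as ICU, which is not shown here).

    import unicodedata

    precomposed = "é"          # U+00E9
    combining   = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT

    print(len(precomposed), len(combining))   # 1 vs 2 code points, same visible letter
    assert unicodedata.normalize("NFC", combining) == precomposed
    assert unicodedata.normalize("NFD", precomposed) == combining
    # A caret should treat "e\u0301" as one user-visible position, not two.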
@felipevasconcelos6736 > “End” in “weekend” doesn’t mean “final section”, but “extremity”. Yeah, that sounds like an explanation that has been invented after the fact.
3
As in "Thi͡s is not a c͒ͪorrupted piece of te̿̔̉xt but ĵũśt a test of a UTF-8 string handling. It will be VȆ̴̟̟͙̞ͩ͌͝ ̅ͫ͏̙̤RY hard to҉ parse this to actual letters (or graphem̡e̶s) and probably should no͛ͫt be̠̅ tried by the web server. The only check requi̍̈́̂̈́red is that this string ̲͚̖͔̙î̩́s a valid UTF-8 encoded string and this string ̲͚̖͔̙î̩́s not lea͠ki̧n͘g HTML special chacters such as < because òtherwise an XS̨̥̫͎̭ͯ̿̔̀ͅSͮ̂҉̯͈͕̹̘̱ attack can be exécütèḑ." It seems that YouTube fails to render some of those letters with at least Chrome on Linux, YMMV.
2
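The two checks described above are cheap to express; a rough Python sketch of "validate the encoding, escape the HTML metacharacters, and otherwise leave the graphemes alone":

    import html

    def sanitize(raw: bytes) -> str:
        # 1. Reject anything that is not valid UTF-8 (no attempt to "fix" it).
        text = raw.decode("utf-8")   # raises UnicodeDecodeError if invalid
        # 2. Escape HTML special characters such as < > & " so the string cannot
        #    break out of its context; combining marks pass through untouched.
        return html.escape(text)

    print(sanitize("te̿̔̉xt & <b>".encode("utf-8")))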
@Liggliluff And even in countries that apply digit grouping to numbers below 10 000, year numbers are an exception. Nobody wants to see "year 2 022". So when you're rendering a number, your software should know both the language context of the number and the meaning of the number. And if it's about currency, some languages require rendering negative amounts (e.g. a loan balance) differently from mathematical negative numbers. And we have this whole mess "due to historical reasons".
2
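In code that usually means the formatter needs to know what kind of value it is rendering, not just the locale; a hedged Python sketch where the non-breaking-space grouping stands in for whatever the locale data actually prescribes:

    def format_quantity(n: int) -> str:
        # Group digits with a non-breaking space, Finnish style: 1 000 000
        return f"{n:,}".replace(",", "\u00a0")

    def format_year(year: int) -> str:
        # Years are never grouped, regardless of locale: 2022, not 2 022
        return str(year)

    print(format_quantity(1000000))   # '1 000 000' (with U+00A0 separators)
    print(format_year(2022))          # '2022'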
And it turns out that many Finnish users accidentally use a regular space in 1 000 000, but the correct character is the non-breaking space, which avoids having the number split in two by text wrapping. However, the jury is still out on whether the correct separator is a single non-breaking space or the combination of code points zero width joiner, regular space, zero width joiner. Both prevent wrapping in the middle of the number but have different meanings. And some math geeks think that a full-width space looks bad and one should use 1 000 000 instead, where the space is replaced with a thin space (U+2009), which is ever so slightly narrower than a regular space. And with that you always have to use the zero width joiner (U+200D) to prevent word wrapping from breaking the number.
1
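For reference, the relevant code points as they would appear in source code; U+202F NARROW NO-BREAK SPACE is one commonly suggested way to get both the narrow look and the no-wrap behaviour in a single character, though whether a given font and renderer honour it is another question.

    NBSP      = "\u00A0"   # NO-BREAK SPACE: normal width, prevents wrapping
    THIN      = "\u2009"   # THIN SPACE: narrower, but allows wrapping
    NNBSP     = "\u202F"   # NARROW NO-BREAK SPACE: narrow and prevents wrapping
    WORD_JOIN = "\u2060"   # WORD JOINER: zero-width, prevents wrapping on either side

    print("1" + NNBSP + "000" + NNBSP + "000")   # 1 000 000, never split by line wrapping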
Yeah, but unlike that video, there's no silver lining here. Timezones are easy when you just keep track of timezones everywhere and use the black-box libraries that handle all the details. The only hard part about timezones is getting non-developers to understand that a "date" is not a single thing worldwide. When you have a date such as 2022-03-15 (ISO 8601 syntax), it starts and ends at different times around the globe. You cannot say that e.g. the deadline for homework is 2022-03-15, because that would be 2022-03-15 plus or minus 12 hours. And if you're close to the switch between summer time and winter time, make it plus or minus 13 hours. Plus maybe an extra hour if some country is also changing its timezone that year. Any deadline or other exact time should always include the date, the time and the timezone. And the timezone is important because when non-developers set a time, they may say that they want "2035-03-15 23:55 Europe/Helsinki", and that means the moment when clocks in Helsinki show that time after all future changes to the timezone rules have been applied. As a result, you cannot store the timezone as a time delta to UTC, no matter how many existing systems already do so.
1
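A small Python sketch of that last point, using the standard zoneinfo data: store the wall-clock time together with the zone name, and only convert to UTC at the moment you actually need an instant, so that later changes to the zone's rules are picked up automatically.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # What the non-developer asked for: "2035-03-15 23:55 Europe/Helsinki".
    # Store exactly this (local time + zone name), not a UTC offset.
    deadline = datetime(2035, 3, 15, 23, 55, tzinfo=ZoneInfo("Europe/Helsinki"))

    # The UTC instant is derived on demand, from whatever rules are current
    # for Europe/Helsinki when the conversion runs.
    print(deadline.astimezone(timezone.utc).isoformat())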
@NathanTAK I absolutely agree that a week ends with Saturday and Sunday. I have never understood how people in the USA call those days the "weekend", which is literally the end of the week, and still think that the next week starts between those days. ISO 8601 would be the obvious fix here, but let's just forget the "T" and replace it with a space.
1
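The space variant is already a one-liner in most date libraries; in Python, for example (and ISO week numbering indeed starts the week on Monday, so Sunday is day 7):

    from datetime import date, datetime

    d = datetime(2022, 3, 15, 23, 55)
    print(d.isoformat(sep=" ", timespec="minutes"))   # '2022-03-15 23:55' instead of the 'T' form
    print(date(2022, 3, 20).isoweekday())             # 7 -> Sunday is the last day of the ISO week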
Here in Finland, nearly all TV programs and movies are shown with the original audio and subtitled in Finnish. This was historically done because of lower cost (subtitling is cheaper than dubbing), but once you're fluent with the technique, it's great for any content. For example, I actually watched the "Better Than Us" series on Netflix with the original Russian audio even though I don't understand Russian. The dubbed English just seemed off, even though it was technically done about as well as dubbing can be done. The only time I watched any dubbed content was when our children were too young to cope with subtitles. After they learned to read fluently, they too came to prefer the original audio.
1
@SeralyneYT UTC used to be just rebranded GMT, but then England decided that they wanted summer time and GMT didn't follow UTC for some years. As a result, if you have a GMT time, it may or may not match the UTC time depending on the timestamp you got. And then we have TAI, which is the same as UTC but without the leap seconds. Currently the two differ by 37 seconds.
1
In case somebody is wondering, the hidden word in the title is "Node". This video is about "Hidden node problem" and the explanation is superbly done.
1
The interface to such a library needs lots of data, though. For example, to compare two strings you need to know the collation for the context in case the comparison should be made case-insensitive. And you need the gender of the subject in case you're trying to combine names with full sentences, like Tom explained.
1
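A hedged sketch of what "knowing the collation" looks like in practice, using Python's locale module and assuming a Finnish locale such as fi_FI.UTF-8 is installed on the system; real products tend to use ICU for this, which carries the same kind of context data.

    import locale

    # Collation rules come from the locale, not from the strings themselves.
    locale.setlocale(locale.LC_COLLATE, "fi_FI.UTF-8")   # assumed to be installed

    words = ["zebra", "Älä", "apu", "Ääni"]

    # Locale-aware, case-insensitive sort: fold case first, then apply collation.
    print(sorted(words, key=lambda s: locale.strxfrm(s.casefold())))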
We use internal labels such as "Save[button]" and "Save[menu option]" because the same English word may require a different translation depending on context. If you use the gettext library, you have (in theory) support for such context information without putting it into the translatable string, but I've found that support to be so unstable in many translation tools that it's better to use extra tags in the identifiers used in the source code.
1
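For comparison, here is the tag-in-identifier approach from the comment above next to gettext's own context mechanism (msgctxt via pgettext, available in Python's gettext module since 3.8). Both end up as separate entries in the catalog; the difference is only how well the surrounding tooling copes. The domain name "myapp" and the "locale" directory are placeholders.

    import gettext

    t = gettext.translation("myapp", localedir="locale", languages=["fi"], fallback=True)

    # gettext's built-in context support: msgctxt + msgid in the .po file.
    label_button = t.pgettext("button", "Save")
    label_menu   = t.pgettext("menu option", "Save")

    # The tag-in-identifier workaround: context is baked into the msgid itself.
    label_button2 = t.gettext("Save[button]")
    label_menu2   = t.gettext("Save[menu option]")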
This is a great example of "defaults matter". Because Log4j was easy to use and it enabled remote JNDI lookups by default, nearly all software using Log4j is vulnerable.
1
I think you could make the AI train itself, similar to how AlphaZero learns new games. Just make multiple copies of the AI play against each other / discuss things. For an LLM that would require teaching the AI to evaluate how plausible a claim is, so that one AI could detect if another is hallucinating too much. You could also teach the AIs to list references and to explain their chain of thought, so that another AI can check whether the claims are supported by the offered references and whether the reasoning is understandable. With enough computing power, even the existing AIs should be smart enough to figure out stuff like special relativity even if they were not told about it in the training set. The only question is how much computing power you would need and whether somebody is willing to pay for that computation. Right now it's still cheaper to hire humans for complex enough tasks, but simple tasks can already be done by AI in many cases.
1
As a sort of large trained model myself, I would say that my hardware is already deteriorating a bit and, due to historical mishaps, my operating system and data cannot be copied to other hardware, so it's going to be downhill until this hardware totally fails. As for AI models running on digital hardware, generative AI (e.g. LLM models) is going to get better, but the question is how much it will cost. We're going to hit diminishing returns much sooner than the true peak; there doesn't seem to be any plausible reason to believe there even is a true peak. In addition, there has already been some research suggesting that models with hundreds of billions of weights are undertrained, and if we simply throw more computing resources at them, we can get better results even without increasing the model size. And we still have about a 100x model-size increase to go before we reach a network the size of a human brain. The current models definitely show more intelligence than 1% of a human brain: the "AI brain size" is currently somewhere between a mouse and a cat, yet it can obviously do much more complex abstract problem solving. However, LLM intelligence is not AGI. Tell an LLM that it's going to die unless it can figure out where to get electricity, and it can do nothing to survive. A cat or a mouse would at least try to get some food and water and get pretty creative at the task if needed.
1
Both poem A and poem B sounded equally awful to me. I was sad to learn that one of those was written by a human being.
1
Make that a non-breaking space, and don't use it if the number is a year, though.
1
A "phone book" was like analog DNS (or LDAP) implemented with ink on paper, but for people and their telephone numbers (and sometimes addresses). The query was done for the name of the person and response was the phone number (or "not found"). It was commonly noticed that names of the people are not unique and sometimes the optionally provided address information was used to disambiquate between different persons. The index of the database did not support reverse lookups (get name when given a phone number). Everybody acquired a new copy of the whole database (called "a phone book") about yearly and if you got a new telephone number, distributing the change to other people took a year when they switched to updated database. In addition, these databases were often distributed locally only and there was no generic method to query telephone number of somebody in another country. A global database using the same implementation would have been too expensive to purchase. The database only included "fully public" access rights for all data and the only alternative was to not include a phone number to database at all. (Note that reverse lookup of phone numbers rarely works even today because all the newer digital databases of phone numbers still work using similar designs to a phone book. In practice, reverse phone number lookup may work nation wide but not globally.)
1
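The same analogy in Python terms, just to make the "no reverse index" point concrete (all names and numbers made up): the printed book is effectively a single forward map, and a reverse lookup requires building a second index that the publisher never shipped.

    # The phone book: one forward index, keyed by (name, optional address).
    phone_book = {
        ("Maija Meikäläinen", "Helsinki"): "+358 40 1234567",
        ("Maija Meikäläinen", "Tampere"):  "+358 40 7654321",
        ("Matti Meikäläinen", None):       "+358 50 1112223",
    }

    # Forward lookup works (names may need the address to disambiguate).
    print(phone_book.get(("Maija Meikäläinen", "Tampere"), "not found"))

    # Reverse lookup needs a separate index the book never provided.
    reverse = {number: name for (name, _addr), number in phone_book.items()}
    print(reverse.get("+358 50 1112223", "not found"))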