What a 2016 MacBook Pro Teaches Me About Statistical Language Models

The obvious flaws of my old MacBook Pro are a prime illustration of statistical language models.

I have a 2016 MacBook Pro (13-inch). When I got it I was fresh out of university and so glad to get my hands on a new laptop. But pretty soon reviews of its “features” came in, and they were not so good. Mostly people complained about the “butterfly” keyboard and the Touch Bar. I personally have no problem with the Touch Bar (although I find it unnecessary), but the keyboard felt wrong from the very first day. Soon the keys began to stick, but they kept working if I hit them hard. There are numerous stories out there, and people said “these keyboards are the worst products in Apple history”. A few weeks ago the first key broke. Then another one, and recently a third one just fell off. Whenever I open the lid they stick to the screen and I have to put them back in their proper places.

While all this is very unnerving, it got me thinking about statistical language models. Imagine counting each character in, for example, a book. You would end up with a distribution of how often each letter of the alphabet occurs in that book. Now repeat that process over a large corpus of texts for each of several languages. You would end up with something like the illustration below: a frequency distribution of the letters for each language.

Figure: Frequency distributions of the 26 most common Latin letters across some languages, taken from Wikipedia.
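
The counting step is simple enough to sketch in a few lines of Python. This is only a minimal illustration, not how the Wikipedia table was produced: the function name `letter_frequencies` and the sample sentence are my own, and it only looks at the plain letters a to z, so accented letters are ignored. A table like this is essentially the simplest statistical language model there is, a unigram model over characters.

```python
from collections import Counter


def letter_frequencies(text):
    """Return the relative frequency of each letter a-z in text.

    Frequencies are fractions that sum to 1, ordered from most to
    least common. Anything that is not a plain letter a-z is skipped.
    """
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: count / total for letter, count in counts.most_common()}


# A toy "corpus"; a real estimate needs far more text than one sentence.
sample = "The quick brown fox jumps over the lazy dog"
freqs = letter_frequencies(sample)

# Show the three most common letters with their share of all letters.
for letter, share in list(freqs.items())[:3]:
    print(f"{letter}: {share:.1%}")
```

With a single sentence the ranking is of course nothing like the real English one. Feed it a whole book and the familiar shape of the distribution starts to appear.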

Guess which key broke first? Of course, it was the letter “E”. According to Wikipedia the letter “E” is the most used letter, not only in English, but in many Western European languages. It has a relative frequency of about 12.7% in English, and it sits in a spot on the keyboard that is easy to reach and easy to hammer. Now guess which letter came off second? It was the letter “S”, which totally makes sense. It has a relatively high frequency in the English language, but I also use it all the time with cmd+s to save my files while coding. The third letter that broke is the “R”. With a frequency of 5.987% it is still pretty high in the frequency diagram, but I also hit it more than other letters in its frequency range, since it is needed to reload the browser. In that sense my own usage is skewed compared to the plain letter distribution. I am kind of curious which letter will break next. According to the frequency table it should be the letter “T”, and let me tell you, it already feels wobbly.
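
The “which key is next” guess can be written down as a tiny lookup. This is a rough sketch under a big assumption, namely that wear depends only on how often a letter occurs in English text. The frequencies are the English values from Wikipedia's letter frequency table for the nine most common letters, and the helper name `next_key_to_break` is made up for this post. It deliberately ignores shortcuts like cmd+s and cmd+r, which is exactly why it would not have predicted “S” and “R” as keys two and three.

```python
# Relative frequency (percent) of the nine most common English letters,
# as listed in Wikipedia's letter frequency table.
ENGLISH_FREQ = {
    "e": 12.702, "t": 9.056, "a": 8.167, "o": 7.507, "i": 6.966,
    "n": 6.749, "s": 6.327, "h": 6.094, "r": 5.987,
}


def next_key_to_break(freq, broken):
    """Return the most frequent letter that has not broken yet.

    Assumes wear is driven purely by letter frequency (no shortcuts).
    """
    remaining = {key: share for key, share in freq.items() if key not in broken}
    return max(remaining, key=remaining.get)


# Keys that have already fallen off my keyboard.
broken_keys = {"e", "s", "r"}
prediction = next_key_to_break(ENGLISH_FREQ, broken_keys)
print(prediction)  # t
```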

After five years Apple is now releasing new MacBooks. Fixing everything they got wrong in the previous models, the new Magic Keyboard features more travel (1 mm vs. 0.7 mm) and, most importantly, uses the tried and trusted scissor switches that won’t fail as soon as a grain of dust gets under them. Maybe it’s time to get a new one.