String Comparison Bug

At work, the microservice my team looks after aggregates data about SaaS members, so one of the things we send back is the member data. Each record contains an email, which is used to match members with already existing data, so it works as a key, plus the rest of the data we collected. The service depended on the emails in the response not being duplicated.

Then one day, for some reason, the backend started to reject the data for one specific case. We did not get back anything useful, except that MySQL did not like something and refused to insert the data. I know the message was not descriptive, because the first thing I tried was checking whether spaces or other non-letter characters were the problem. That did not help.

I was sure that we were not sending duplicate data, since we compared the emails converted to lower case and dropped the duplicates. We already had the equivalent of email.trim().toLowerCase(); in the code, since letter case had already been a problem earlier. We can discuss whether emails that differ only in upper and lower case are the same, but this was not, and still is not, the hill I am willing to die on.

In the end, the cause turned out to be in that comparison. What I realized after days of testing is that while JavaScript considered letters with diacritics and letters without them to be different, MySQL was treating them as the same. So ã == a, ñ == n, é == e, and so on. We would happily return both variants, and they would be rejected as duplicates.
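Here is a minimal sketch of the mismatch on the JavaScript side. The addresses and the names simpleKey and uniqueKeys are made up for illustration. The MySQL half is not reproduced here, since it depends on how the column is configured; the behavior I saw suggests an accent-insensitive comparison.

```javascript
// Hypothetical sample data; \u00e9 is the precomposed letter é.
const emails = ['jose@example.com', 'Jos\u00e9@example.com ', 'JOSE@example.com'];

// The original key: trim and lower-case only.
const simpleKey = (email) => email.trim().toLowerCase();

// A Set compares strings exactly, so 'é' and 'e' count as different.
const uniqueKeys = [...new Set(emails.map(simpleKey))];

// Both the plain and the accented address survive the deduplication.
console.log(uniqueKeys.length); // 2
console.log('\u00e9' === 'e');  // false
```

Two keys come out of this, and an accent-insensitive database would then see them as one.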

Internationalized domain names can now contain all kinds of characters, so there are surely going to be examples like this that are perfectly valid. But for me it is easier to change the code on my side than to convince people that we need to change the collation on some columns in the database. So in the end my solution was still to remove these diacritics before the comparison on our side.

So the end result looked something like this:

email.trim().normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();

The above solution was copied from a Stack Overflow answer [1]. The explanation of what is happening, pasted below, is from the same source:

normalize()ing to NFD Unicode normal form decomposes combined graphemes into the combination of simple ones. The è of Crème ends up expressed as e + ̀. Using a regex character class to match the U+0300 → U+036F range, it is now trivial to globally get rid of the diacritics, which the Unicode standard conveniently groups as the Combining Diacritical Marks Unicode block.
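To make the explanation concrete, here is a small sketch that prints the code points before and after normalization, and then uses the same chain as a deduplication key. The helper names (codePoints, stripMarks, emailKey, dedupeByEmail) and the sample members are my own illustration, not the code from our service. Keeping the first record for each key is an assumption as well.

```javascript
// List the code points of a string as U+XXXX labels.
const codePoints = (s) =>
  [...s].map((c) => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

const composed = '\u00e8';                    // è as a single code point
const decomposed = composed.normalize('NFD'); // 'e' followed by a combining grave accent

console.log(codePoints(composed));   // [ 'U+00E8' ]
console.log(codePoints(decomposed)); // [ 'U+0065', 'U+0300' ]

// Decompose, then drop everything in the Combining Diacritical Marks block.
const stripMarks = (s) => s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

console.log(stripMarks('Cr\u00e8me')); // Creme
// Letters with no canonical decomposition, such as ø, pass through unchanged.
console.log(stripMarks('\u00f8') === '\u00f8'); // true

// Comparison key: same steps as the one-liner above.
const emailKey = (email) => stripMarks(email.trim()).toLowerCase();

// Keep the first member seen for each key; later collisions are dropped.
function dedupeByEmail(members) {
  const seen = new Map();
  for (const member of members) {
    const key = emailKey(member.email);
    if (!seen.has(key)) seen.set(key, member);
  }
  return [...seen.values()];
}

const members = [
  { email: 'jose@example.com', name: 'A' },
  { email: ' Jos\u00e9@example.com', name: 'B' },
  { email: 'maria@example.com', name: 'C' },
];
console.log(dedupeByEmail(members).map((m) => m.name)); // [ 'A', 'C' ]
```

One thing the sketch shows is that this only handles characters that decompose into a base letter plus a combining mark. Whether the database treats the remaining letters as equal to something else is a separate question.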

I am grateful to the people working on the Unicode standard for making this easier for us. Having blocks that we can rely on has helped me both here, with the Combining Diacritical Marks block, and with Japanese language processing, where the kanji are separate from the kana. These are the guys who make our working life just a tiny bit easier.