Poor-quality master data increases the probability of missed hits. For instance, when customer names are matched against the names of politically exposed persons (PEPs), relevant hits may be missed because of character encoding problems.
Name Encoding
In banking systems, names are stored as sequences of binary digits (bits). The first name “Jim”, for instance, can be represented by three ASCII characters, each stored in one byte (eight bits):
Jim (ASCII)
"01001010" "01101001" "01101101".
Other Encodings
ASCII cannot represent ä, ö, ü and many other characters. Character encodings such as ISO-8859-1 and UTF-8 can, but they represent these characters differently.
For example, the name “Jürg Näf” is encoded differently in ISO-8859-1 and UTF-8.
Jürg (ISO-8859-1)
"01001010" "11111100" "01110010" "01100111"
Jürg (UTF-8)
"01001010" "11000011 10111100" "01110010" "01100111"
Näf (ISO-8859-1)
"01001110" "11100100" "01100110"
Näf (UTF-8)
"01001110" "11000011 10100100" "01100110"
Character encoding becomes compliance-relevant when a banking system uses different encodings at the same time, typically for legacy reasons. In this case, “Jürg Näf” may match perfectly in one encoding. However, if UTF-8 data is mistakenly interpreted as ISO-8859-1, the name is read as “JÃ¼rg NÃ¤f”, and 44% of the characters do not match (4 out of 9, not counting the space).
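The effect of assuming the wrong encoding can be reproduced as follows. This is a minimal sketch: the function name garbled_share and the counting rule (non-space characters of the misread name that do not occur in the original) are assumptions chosen to mirror the 4-out-of-9 figure, not a description of how any particular screening tool measures similarity.

```python
name = "Jürg Näf"

# Bytes written by a UTF-8 system, then read by a system assuming ISO-8859-1.
misread = name.encode("utf-8").decode("iso-8859-1")
print(misread)  # JÃ¼rg NÃ¤f


def garbled_share(original: str, misread_name: str) -> tuple[int, int]:
    """Count non-space characters of `misread_name` that do not occur in `original`.

    Hypothetical counting rule, used for illustration only.
    Returns (number of garbled characters, number of non-space characters).
    """
    chars = [ch for ch in misread_name if not ch.isspace()]
    garbled = [ch for ch in chars if ch not in original]
    return len(garbled), len(chars)


bad, total = garbled_share(name, misread)
print(f"{bad} of {total} characters do not match ({bad / total:.0%})")
# 4 of 9 characters do not match (44%)
```

Each two-byte UTF-8 sequence is read as two separate ISO-8859-1 characters, which is why the misread name is longer than the original and nearly half of its characters are wrong.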
Tolerating differences of 40% or more in the matching, so that such hits are still generated for verification, would produce a significant number of false positives.