Unicode Equivalence

Not all emoji are created equal

Max Claus Nunes
Published in Better Programming
3 min read · Apr 17, 2020


Unicode

Everyone in the world should be able to use their own language on phones and computers. — Unicode Consortium home page

The Unicode standard assigns a unique code point to every character. It was built on an encoding foundation large enough to support the writing systems of all the world’s languages.

Unicode Equivalence

Even within the same encoding standard, Unicode, a piece of text can be represented in different forms. For instance, do you think ã (U+00E3) is equal to ã (U+0061 U+0303)?

'ã' === 'ã' → false
'ã'.length → 1
'ã'.length → 2

It turns out that the same character, in this example “LATIN SMALL LETTER A WITH TILDE”, can be represented in different forms that end up as different sequences of code points. In most cases, however, we want to compare two strings independently of their underlying representation. That’s what normalization is for.

Normalization

Normalization makes it possible to determine whether two strings are equivalent. It combines all marks in a specific order and applies rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms:

  • Normalization Form D (NFD): Canonical Decomposition
  • Normalization Form C (NFC): Canonical Decomposition,
    followed by Canonical Composition
  • Normalization Form KD (NFKD): Compatibility Decomposition
  • Normalization Form KC (NFKC): Compatibility Decomposition,
    followed by Canonical Composition

So, in our example, the first character is in NFC, which composes everything into a single code point (U+00E3), while the second is in NFD, which decomposes the character into multiple code points (U+0061 U+0303).

'ã'.normalize('NFC') === 'ã'.normalize('NFC') → true
'ã'.normalize('NFC').length → 1
'ã'.normalize('NFC').length → 1
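
Going the other way, NFD makes the decomposition visible. A minimal sketch, assuming a modern JavaScript runtime with String.prototype.normalize (available since ES2015):

// Start from the composed form (U+00E3) and decompose it
const nfd = '\u00E3'.normalize('NFD')

console.log(nfd.length)
// → 2
console.log([...nfd].map(c => c.codePointAt(0).toString(16)))
// → [ '61', '303' ]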

To explain NFKC and NFKD, let’s use a different example, ẛ̣ (U+1E9B U+0323):
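
Here is a sketch in JavaScript of how each form transforms it; the codes helper is an inline utility written for this example, not a library function:

// ẛ̣ is U+1E9B (LATIN SMALL LETTER LONG S WITH DOT ABOVE)
// followed by U+0323 (COMBINING DOT BELOW)
const s = '\u1E9B\u0323'

// Inline helper: print each code point in hex
const codes = str =>
  [...str].map(c => c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')).join(' ')

console.log('NFC: ', codes(s.normalize('NFC')))
// → NFC:  1E9B 0323
console.log('NFD: ', codes(s.normalize('NFD')))
// → NFD:  017F 0323 0307
console.log('NFKC:', codes(s.normalize('NFKC')))
// → NFKC: 1E69
console.log('NFKD:', codes(s.normalize('NFKD')))
// → NFKD: 0073 0323 0307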

As the output shows, NFKC and NFKD may remove formatting distinctions and change the result: the compatibility forms replace the long s (ſ) with a plain s.

Which Normalization to Choose?

For web development, the W3C recommends NFC wherever possible:

Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages. —W3C Working Group

The other normalization forms are for specific scenarios that most of us will never run into. NFKC and NFKD, for example, can break a Roman numeral down into multiple characters, such as turning Ⅷ (U+2167) into VIII. And NFD can power a fuzzy search that ignores marks, e.g. searching for a should also return results containing ã. Both are sketched below.
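
A minimal JavaScript sketch of both ideas; the stripMarks helper is hypothetical, written for this example rather than taken from any library:

// Ⅷ (ROMAN NUMERAL EIGHT, U+2167) is a single code point;
// compatibility decomposition breaks it into four plain letters
console.log('\u2167'.normalize('NFKC'))
// → VIII

// Hypothetical mark-insensitive search helper:
// decompose with NFD, then strip the combining marks
const stripMarks = str => str.normalize('NFD').replace(/\p{M}/gu, '')

console.log(stripMarks('ã'))
// → a
console.log(stripMarks('São Paulo').includes('Sao'))
// → true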

JavaScript

Text normalization in JavaScript:

const value = "ã" // decomposed form: U+0061 U+0303

console.log('Unicode codepoint:',
  value.charCodeAt(0).toString(16),
  value.charCodeAt(1).toString(16))
// → Unicode codepoint: 61 303

console.log('Normalize:', value.normalize('NFC'))

JavaScript normalize method documentation

Golang

Text normalization in Go:

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	value := "ã" // decomposed form: U+0061 U+0303
	fmt.Printf("Unicode codepoint: %U\n", []rune(value))
	// → Unicode codepoint: [U+0061 U+0303]
	fmt.Println("Normalize:", string(norm.NFC.Bytes([]byte(value))))
}

Golang norm package documentation
