Unicode Equivalence

Not all emoji are created equal

Max Claus Nunes
Published in Better Programming
3 min read · Apr 17, 2020


Unicode

Everyone in the world should be able to use their own language on phones and computers. — Unicode Consortium home page

The Unicode standard assigns a unique code point to every character. It was built on an encoding foundation large enough to support the writing systems of all the world’s languages.

Unicode Equivalence

Even within the same encoding standard, Unicode, a piece of text can be represented in different forms. For instance, do you think ã (U+00E3) is equal to ã (U+0061 U+0303)?

'ã' === 'ã' → false
'ã'.length → 1
'ã'.length → 2

It turns out that the same character, in this example “LATIN SMALL LETTER A WITH TILDE”, can be represented in different forms that end up as different sequences of code points. In most cases, however, we want to compare two strings independently of their underlying representation. That’s what normalization is for.

Normalization

Normalization makes it possible to determine whether two strings are equivalent. It combines all marks in a specific order and applies rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms:

  • Normalization Form D (NFD): Canonical Decomposition
  • Normalization Form C (NFC): Canonical Decomposition,
    followed by Canonical Composition
  • Normalization Form KD (NFKD): Compatibility Decomposition
  • Normalization Form KC (NFKC): Compatibility Decomposition,
    followed by Canonical Composition

So, in our example, the first character is in NFC, which composes everything into a single code point (U+00E3), while the second is in NFD, which decomposes the character into multiple code points (U+0061 U+0303).

'ã'.normalize('NFC') === 'ã'.normalize('NFC') → true
'ã'.normalize('NFC').length → 1
'ã'.normalize('NFC').length → 1
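
Going the other way, NFD makes the decomposition visible. A minimal sketch, assuming a modern JavaScript runtime with String.prototype.normalize (available since ES2015):

// Start from the composed form (U+00E3) and decompose it
const nfd = '\u00E3'.normalize('NFD')

console.log(nfd.length)
// → 2
console.log([...nfd].map(c => c.codePointAt(0).toString(16)))
// → [ '61', '303' ]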

To explain NFKC and NFKD, let’s use a different example, ẛ̣ (U+1E9B U+0323):
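
Here is a sketch in JavaScript of how each form transforms it; the codes helper is an inline utility written for this example, not a library function:

// ẛ̣ is U+1E9B (LATIN SMALL LETTER LONG S WITH DOT ABOVE)
// followed by U+0323 (COMBINING DOT BELOW)
const s = '\u1E9B\u0323'

// Inline helper: print each code point in hex
const codes = str =>
  [...str].map(c => c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')).join(' ')

console.log('NFC: ', codes(s.normalize('NFC')))
// → NFC:  1E9B 0323
console.log('NFD: ', codes(s.normalize('NFD')))
// → NFD:  017F 0323 0307
console.log('NFKC:', codes(s.normalize('NFKC')))
// → NFKC: 1E69
console.log('NFKD:', codes(s.normalize('NFKD')))
// → NFKD: 0073 0323 0307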

As the output shows, NFKC and NFKD may remove formatting distinctions and change the result: the compatibility forms replace the long s (ſ) with a plain s.

Which Normalization to Choose?

For web development, the W3C recommends NFC wherever possible:

Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages. —W3C Working Group

The other normalization forms are for specific scenarios that most of us will never run into. NFKC and NFKD, for example, can break a Roman numeral down into multiple characters, such as turning Ⅷ (U+2167) into VIII. And NFD can power a fuzzy search that ignores marks, e.g. searching for a should also return results containing ã. Both are sketched below.
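
A minimal JavaScript sketch of both ideas; the stripMarks helper is hypothetical, written for this example rather than taken from any library:

// Ⅷ (ROMAN NUMERAL EIGHT, U+2167) is a single code point;
// compatibility decomposition breaks it into four plain letters
console.log('\u2167'.normalize('NFKC'))
// → VIII

// Hypothetical mark-insensitive search helper:
// decompose with NFD, then strip the combining marks
const stripMarks = str => str.normalize('NFD').replace(/\p{M}/gu, '')

console.log(stripMarks('ã'))
// → a
console.log(stripMarks('São Paulo').includes('Sao'))
// → true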

JavaScript

Text normalization in JavaScript:

const value = "ã" // decomposed form: U+0061 U+0303

console.log('Unicode codepoint:',
  value.charCodeAt(0).toString(16),
  value.charCodeAt(1).toString(16))
// → Unicode codepoint: 61 303

console.log('Normalize:', value.normalize('NFC'))

JavaScript normalize method documentation

Golang

Text normalization in Go:

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	value := "ã" // decomposed form: U+0061 U+0303
	fmt.Printf("Unicode codepoint: %U\n", []rune(value))
	// → Unicode codepoint: [U+0061 U+0303]
	fmt.Println("Normalize:", string(norm.NFC.Bytes([]byte(value))))
}

Golang norm package documentation
