deunicode-0.4.0 has been yanked.
deunicode
The deunicode library transliterates Unicode strings such as "Æneid" into pure
ASCII ones such as "AEneid".
It started as a Rust port of the Text::Unidecode Perl module, and was extended to support emoji.
Examples
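extern crate deunicode;
use deunicode::deunicode;

assert_eq!(deunicode("Æneid"), "AEneid");
assert_eq!(deunicode("étude"), "etude");
assert_eq!(deunicode("北亰"), "Bei Jing");
assert_eq!(deunicode("ᔕᓇᓇ"), "shanana");
assert_eq!(deunicode("げんまい茶"), "genmaiCha");
// Emoji are transliterated to ASCII as well.
assert!(deunicode("🦄☣").is_ascii());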
Guarantees and Warnings
Here are some guarantees you have when calling deunicode():
- The String returned will be valid ASCII; the decimal representation of every char in the string will be between 0 and 127, inclusive.
- Every ASCII character (0x0000 - 0x007F) is mapped to itself.
- All Unicode characters will translate to a string containing newlines ("\n") or ASCII characters in the range 0x0020 - 0x007E. So for example, no Unicode character will translate to \u{01}. The exception is if the ASCII character itself is passed in, in which case it will be mapped to itself. (So '\u{01}' will be mapped to "\u{01}".)
There are, however, some things you should keep in mind:
- As stated, some transliterations do produce \n characters.
- Some Unicode characters transliterate to an empty string, either on purpose or because deunicode does not know about the character.
- Some Unicode characters are unknown and transliterate to "[?]".
- Many Unicode characters transliterate to multi-character strings. For example, 北 is transliterated as "Bei ".
- Han characters are mapped to Mandarin, and will be mostly illegible to Japanese readers.
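The guarantees and caveats above can be turned into small checks and clean-up steps in calling code. The sketch below is one possible approach, not part of the crate: within_documented_range and tidy_transliteration are hypothetical helper names, and the sketch uses only the standard library, with literal strings standing in for the result of deunicode().

```rust
/// Hypothetical helper (not part of the crate): checks that `output` stays
/// within the documented range, given the `input` it was produced from.
fn within_documented_range(input: &str, output: &str) -> bool {
    output.chars().all(|c| {
        // Printable ASCII, 0x20 through 0x7E.
        let printable = ('\u{20}'..='\u{7E}').contains(&c);
        // Other ASCII characters may appear only when copied from the input.
        let passed_through = c.is_ascii() && input.contains(c);
        printable || c == '\n' || passed_through
    })
}

/// Hypothetical helper (not part of the crate): drops "[?]" placeholders and
/// collapses runs of whitespace, including "\n" and trailing spaces.
fn tidy_transliteration(raw: &str) -> String {
    raw.replace("[?]", "")
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Literal strings stand in for the result of `deunicode::deunicode(input)`.
    assert!(within_documented_range("Æneid", "AEneid"));
    assert!(within_documented_range("a\u{01}b", "a\u{01}b"));
    assert!(!within_documented_range("Æ", "\u{01}"));

    assert_eq!(tidy_transliteration("Bei "), "Bei");
    assert_eq!(tidy_transliteration("a\nb [?] c"), "a b c");
}
```

Note that tidy_transliteration also removes a literal "[?]" that was already present in the input, so it suits cases such as slugs or search keys where that loss is acceptable.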
Unicode data
- Text::Unidecode by Sean M. Burke
- Unicodey by Cal Henderson
For a detailed explanation of the rationale behind the original dataset, refer to this article written by Burke in 2001.