Everything you need to know about Punycode
Language revolution in the address bar
From its inception, the internet was designed to be a global network; however, there was one notable limitation: domain names could only use Latin characters. This alphabet barrier was not intentional; it was a consequence of technical design decisions made at a time when global standards for the digital representation of characters were still in development. The basic building block of the entire system, the Domain Name System (DNS), was conceived in 1983 and worked exclusively with a subset of ASCII, allowing only Latin letters (a–z, case-insensitive), digits (0–9), and the hyphen [^1].
That technical choice, which provided functionality and universality in the early phase of development, became a significant obstacle over time. Although the content of web pages and emails could be in any language, the domain name (part of the URL) still had to be written in the Latin alphabet. This "alphabet barrier" contributed to the so-called "digital divide", particularly in countries where languages written in the Latin alphabet, such as English, aren't widely used. For such users, it was often easier to memorize a string of numbers (an IP address) than a series of unfamiliar glyphs [^2]. Ironically, what was meant to be easy to remember for some became a cultural barrier for the rest of the world. Unicode, the essential international standard for non-Latin scripts, was first published eight years later, in 1991, which illustrates how design choices from an early phase of development can have unintended, global impacts [^3]. The development of Punycode began as an effort to bridge this divide.
A smart solution
Efforts toward the "internationalization of domain names" began to appear in the mid-1990s, but a standard solution emerged only after years of debate and many competing proposals [^4][^5]. In March 2003, the IETF (Internet Engineering Task Force) published RFC 3492, the standard that describes the Punycode algorithm [^6]. Its author, Adam Costello, designed it as a neat and efficient solution that can losslessly and reversibly transform any Unicode string into a plain ASCII subset.
Punycode is not a brand-new algorithm. Rather, it is a specific instance of a more generic algorithm named Bootstring, which represents any string from a larger character set (Unicode in our case) using a smaller set of characters (a subset of ASCII in our case) [^7]. Bootstring was designed to be universal and to work across most scripts, while adapting its encoding to the characters used in a particular string[^punycode-optimization].
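
To see the "lossless and reversible" property in practice, here is a minimal sketch that uses Python's built-in `punycode` codec, which implements the bare RFC 3492 encoding without any domain-name prefix. The helper name `round_trip` and the sample strings are our own choices for illustration, not part of any standard.

```python
import string

# The only characters Punycode output may contain for lowercase input.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def round_trip(text: str) -> tuple[str, str]:
    """Encode text with Punycode, decode it again, and return both results."""
    encoded = text.encode("punycode").decode("ascii")
    decoded = encoded.encode("ascii").decode("punycode")
    return encoded, decoded


for sample in ["čáslav", "münchen", "δοκιμή", "日本語"]:
    encoded, decoded = round_trip(sample)
    # The encoded form is plain ASCII, and decoding restores the original.
    print(sample, "->", encoded, "->", decoded)
```

Every encoded form stays within the small ASCII subset, and decoding returns exactly the string we started with.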
Origin of the name: How is it "puny"?
The name is a catchy play on words: it rhymes with Unicode and refers to three ways in which the algorithm is "puny":
- Small subset of characters: only letters, digits, and the hyphen are allowed [^1].
- Short encoded version: encoded strings aren't much longer than the original. This is not only elegant but also important, as DNS limits the length of a domain label to 63 characters [^8].
- Small implementation: the algorithm is simple and takes little code to implement.
The power of Punycode lies in its "puniness", that is, its simplicity. It achieves maximum reach (it covers all Unicode characters and is therefore applicable to all languages) with minimal requirements.
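
The 63-character label limit mentioned above is easy to check in code. The following is a minimal sketch, assuming the label is already lowercased and normalized (no Nameprep or UTS #46 mapping is applied here); the helper names `ace_label` and `fits_in_dns` are hypothetical.

```python
MAX_LABEL_LENGTH = 63  # per-label limit from RFC 1035


def ace_label(label: str) -> str:
    """Return the ASCII form of one label: unchanged if already ASCII,
    otherwise 'xn--' followed by the Punycode encoding."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def fits_in_dns(label: str) -> bool:
    """Check the encoded label (prefix included) against the DNS limit."""
    return len(ace_label(label)) <= MAX_LABEL_LENGTH


print(ace_label("münchen"), fits_in_dns("münchen"))
print(fits_in_dns("ü" * 100))  # a label this long cannot fit
```

Note that the limit applies to the encoded form, so the four characters of the prefix count as well.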
The algorithm: What's behind the xn--?
A Punycode-encoded domain label is marked by a special prefix, "xn--", so every encoded label starts with it. The prefix is called the ACE (ASCII-Compatible Encoding) prefix and is defined in the IDNA 2003 and IDNA 2008 standards [^4][^5].
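
As a small illustration of how software can use the prefix, the sketch below detects an ACE label and decodes it with Python's built-in `punycode` codec. The helper names are hypothetical, and the sketch skips the validation steps that a full IDNA implementation performs.

```python
ACE_PREFIX = "xn--"


def is_ace_label(label: str) -> bool:
    """The prefix is compared case-insensitively, so 'XN--' also counts."""
    return label.lower().startswith(ACE_PREFIX)


def label_to_unicode(label: str) -> str:
    """Decode an ACE label; labels without the prefix are returned as-is."""
    if not is_ace_label(label):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(label_to_unicode("xn--slav-4na7x"))  # čáslav
print(label_to_unicode("hello"))           # hello
```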
The encoding process has multiple phases:
- ASCII char separation: All ASCII characters (those that don't need to be encoded) from the input string are copied, in their original order, to the start of the output string.
- Add the hyphen (minus) separator: If there were any ASCII characters, the separator "-" is added after them (e.g., for "čáslav", the ASCII characters are followed by a trailing hyphen: "slav-") [^9]. Note that the hyphen itself is an ASCII character, so hyphens can be part of the input string; if present, they are copied to the output like any other ASCII character. That does not create any ambiguity, because the last hyphen in the output is always the separator that marks the end of the ASCII characters.
- Encoding non‑ASCII chars: The characters beyond ASCII are encoded using the Bootstring algorithm with the Punycode parameters, resulting in a sequence of the characters a‑z and 0‑9 [^9].
- Adding the ACE prefix xn--: In domain names, the Punycode‑encoded label is prefixed with xn-- to denote ACE (ASCII‑Compatible Encoding) [^10].
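
The phases above can be sketched in Python as follows. This is an illustrative transcription of the encoding procedure from RFC 3492 with its Punycode parameters, not a production implementation: it skips the overflow checks the RFC describes (Python integers do not overflow) and does no case mapping or normalization. The function names are our own.

```python
# Bootstring parameters for Punycode (RFC 3492, section 5).
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0..35


def adapt(delta: int, numpoints: int, first: bool) -> int:
    """Bias adaptation: tunes the digit thresholds to the deltas seen so far."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def punycode_encode(text: str) -> str:
    codepoints = [ord(ch) for ch in text]
    basic = [cp for cp in codepoints if cp < INITIAL_N]
    output = [chr(cp) for cp in basic]       # phase 1: copy ASCII characters
    if basic:
        output.append("-")                   # phase 2: add the separator
    handled = len(basic)
    n, delta, bias, first = INITIAL_N, 0, INITIAL_BIAS, True
    while handled < len(codepoints):         # phase 3: encode the rest
        m = min(cp for cp in codepoints if cp >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for cp in codepoints:
            if cp < n:
                delta += 1
            elif cp == n:
                # Emit delta as a variable-length integer in base 36.
                q, k = delta, BASE
                while True:
                    t = min(max(k - bias, TMIN), TMAX)
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, handled + 1, first)
                first = False
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)


def to_ace(label: str) -> str:
    return "xn--" + punycode_encode(label)   # phase 4: add the ACE prefix


print(to_ace("čáslav"))  # xn--slav-4na7x
```

A quick sanity check is to compare `punycode_encode(s)` with `s.encode("punycode")` from the standard library for a handful of inputs.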
So, for example, if we want to encode the string "čáslav" (a Czech town):
- ASCII char separation: From "čáslav", the ASCII chars "slav" are placed at the start of the output.
- Add the hyphen (minus) separator: As the input contains both ASCII and non-ASCII chars, a hyphen is then added, so the output is slav-
- Encoding non‑ASCII chars: The č and á chars are encoded using the Bootstring algorithm into 4na7x and appended to the end of the output, so the resulting Punycode output is slav-4na7x
- Adding the ACE prefix xn--: To mark the Punycode-encoded text in the domain name, we need to prepend the xn-- prefix, known as the ACE (ASCII-Compatible Encoding) prefix.
So the string we can use in our DNS (Domain Name System) setup is xn--slav-4na7x
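
The same walk-through can be reproduced with Python's built-in `punycode` codec. The sketch below also splits the result at the last hyphen to show where the ASCII part ends and the encoded tail begins; `split_punycode` is a hypothetical helper written for this illustration.

```python
def split_punycode(encoded: str) -> tuple[str, str]:
    """Split Punycode output at the LAST hyphen into (ASCII part, encoded tail).
    Without any hyphen, the whole string is the encoded tail."""
    basic, separator, tail = encoded.rpartition("-")
    return (basic, tail) if separator else ("", encoded)


encoded = "čáslav".encode("punycode").decode("ascii")
print(encoded)                  # slav-4na7x
print(split_punycode(encoded))  # ('slav', '4na7x')
print("xn--" + encoded)         # xn--slav-4na7x

# A hyphen in the input is copied like any other ASCII character;
# the last hyphen in the output is still the separator.
hyphenated = "čá-slav".encode("punycode").decode("ascii")
print(split_punycode(hyphenated)[0])  # -slav
```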
Examples
The table below shows how different types of input are transformed. The outputs were generated with the tr46 library, which applies UTS #46 processing and Punycode encoding.
Input | Nameprepped, Punycode encoded and ACE prefixed output | Description |
---|---|---|
hello | hello | Simple ASCII word |
world-test | world-test | ASCII word with hyphen |
café | xn--caf-dma | French word with é |
naïve | xn--nave-6pa | French word with ï |
résumé | xn--rsum-bpad | French word with é |
Zürich | xn--zrich-kva | German city with ü (uppercase char was lowercased by Nameprep) |
münchen | xn--mnchen-3ya | German city with ü |
español | xn--espaol-zwa | Spanish word with ñ |
português | xn--portugus-q1a | Portuguese word with ê |
français | xn--franais-xxa | French word with ç |
تست | xn--pgba0a | Arabic |
δοκιμή | xn--jxalpdlp | Greek |
פרובה | xn--5dbgb3dua | Hebrew |
गुजराती | xn--31bky1czdnc | The word "Gujarati" written in Devanagari script |
ไทย | xn--o3cw4h | Thai word |
中文 | xn--fiq228c | Chinese word |
日本語 | xn--wgv71a119e | Japanese word |
한국어 | xn--3e0bk47br7k | Korean word |
🌟.ws | xn--ch8h.ws | Emoji domain |
🎉.to | xn--dk8h.to | Emoji domain |
💻.fm | xn--3s8h.fm | Emoji domain |
école.fr | xn--cole-9oa.fr | French school domain with é |
bücher.de | xn--bcher-kva.de | German books domain with ü |
niño.ws | xn--nio-8ma.ws | Spanish word domain with ñ |
tølløse.dk | xn--tllse-vuac.dk | Danish place domain with ø |
العربية.ws | xn--mgbcd4a2b0d2b.ws | Arabic |
test中文 | xn--test-3f5fy05j | Mixed ASCII and Chinese |
café123 | xn--caf123-dva | Mixed letters and numbers with é |
á | xn--1ca | Single letter with acute accent |
ñ | xn--ida | Single letter with tilde |
ø | xn--pda | Single Scandinavian letter |
测试 | xn--0zwm56d | Chinese |
münchen.bayern | xn--mnchen-3ya.bayern | German domain with regional TLD |
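
A few of these rows can be reproduced with Python's standard library as a rough cross-check. Keep in mind the assumption here: Python's built-in `idna` codec implements IDNA 2003 (RFC 3490 with Nameprep), not the UTS #46 processing used for the table, so results can differ for some inputs. The rows selected below are ones where both approaches agree; `to_ascii` and `to_unicode` are hypothetical helper names.

```python
# Rows copied from the table: input -> expected ASCII form.
TABLE_ROWS = {
    "café": "xn--caf-dma",
    "Zürich": "xn--zrich-kva",
    "bücher.de": "xn--bcher-kva.de",
    "münchen.bayern": "xn--mnchen-3ya.bayern",
}


def to_ascii(domain: str) -> str:
    """Convert a domain name to its ASCII form, label by label."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


for name, expected in TABLE_ROWS.items():
    print(name, "->", to_ascii(name), "(matches table)" if to_ascii(name) == expected else "(differs)")

print(to_unicode("xn--bcher-kva.de"))  # bücher.de
```

The codec handles the dots between labels itself, so only the labels that contain non-ASCII characters receive the xn-- prefix.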
Footnotes
[^1]: RFC 1123: Requirements for Internet Hosts – Application and Support. Section 2.1. Retrieved August 30, 2024
[^2]: INFITT: A New Architecture for Multilingual Internet Domains. Retrieved August 30, 2024
[^3]: Unicode Consortium: The Unicode Standard. Retrieved August 30, 2024
[^4]: RFC 3490: Internationalizing Domain Names in Applications (IDNA). Retrieved August 30, 2024
[^5]: RFC 5890: Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework. Retrieved August 30, 2024
[^6]: IETF: Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). Retrieved August 30, 2024
[^7]: IETF: Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). Retrieved August 30, 2024
[^8]: RFC 1035: Domain Names – Implementation and Specification. Section 2.3.4. Retrieved August 30, 2024
[^9]: RFC 3492: Punycode: A Bootstring encoding of Unicode for IDNA. Retrieved August 30, 2024
[^10]: IANA (February 14, 2003): "Completion of IANA Selection of IDNA Prefix". See Wikipedia note [6]: https://en.wikipedia.org/wiki/Punycode#cite_note-6. Retrieved May 2, 2025