Punycode explained

Language revolution in the address bar

Screenshot of the Spanish website elpaís.es, which uses an IDN

From its inception, the internet was designed to be a global network; however, there was one notable limitation: domain names could only use Latin characters. This alphabet barrier was not intentional; it was a consequence of technical design decisions made at a time when global standards for the digital representation of characters were still in development. The basic building block of the entire system, the Domain Name System (DNS), was conceived in 1983 and worked exclusively with a subset of ASCII, allowing only Latin letters (a–z, compared case-insensitively), digits (0–9), and the hyphen¹.

That technical choice, which provided functionality and universality in the early phase of development, became a significant obstacle over time. Although the content of web pages and emails could be in any language, the domain name (part of the URL) still had to be written in the Latin alphabet. This "alphabet barrier" contributed to the so-called "digital divide", particularly in countries where languages written in the Latin alphabet, such as English, aren't widely used. For such users, it was often easier to memorize a chain of numbers (an IP address) than a series of unknown glyphs². Ironically, what was meant to be easy to remember for some became a cultural barrier for the rest of the world. Unicode, the essential international standard for non-Latin scripts, did not appear until 1991, eight years later, which illustrates how design choices from an early phase of development can have unintended, global impacts³. The development of Punycode began as an effort to bridge this divide.

A smart solution

Efforts toward internationalized domain names (IDN) began to appear in the mid-1990s, but a standard solution emerged only after years of debate and many competing proposals⁴⁵. In March 2003, the IETF (Internet Engineering Task Force) approved RFC 3492, a standard that describes the Punycode algorithm⁶. Its author, Adam Costello, designed it as a neat and efficient solution that can losslessly and reversibly transform any Unicode string into a plain ASCII subset.

Punycode is not a brand new algorithm. Rather, it is a specific instance of a more generic algorithm named Bootstring, which enables the representation of any string from a larger character set (Unicode in our case) using a smaller set of characters (a subset of ASCII in our case)⁷. The concept was designed to be universal and functional across most scripts, while adapting itself to the characters used in a particular string.

Origin of the name: How is it "puny"?

The name is a catchy play on words: it rhymes with Unicode and refers to three ways in which the algorithm is "puny":

Small subset of characters: only letters (a–z), digits (0–9), and the hyphen are allowed¹.
Short encoded version: encoded strings aren't much longer than the original. This is not only elegant but also important, as DNS limits the length of a domain label to 63 characters⁸.
Small implementation: the algorithm can be implemented in little code.
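
To make the "short encoded version" point concrete, the following minimal sketch compares the length of a label with the length of its encoded form and checks it against the 63-character DNS limit. It relies on Python's built-in `punycode` codec; the helper names `ace_length` and `fits_dns_limit` are made up for this illustration, and no normalization (such as lowercasing) is applied.

```python
MAX_LABEL_LENGTH = 63  # DNS limit for a single label

def ace_length(label: str) -> int:
    """Length of the label as it would be stored in DNS (hypothetical helper)."""
    if label.isascii():
        return len(label)  # plain ASCII labels are stored unchanged
    encoded = label.encode("punycode").decode("ascii")
    return len("xn--") + len(encoded)

def fits_dns_limit(label: str) -> bool:
    """True if the encoded label does not exceed the DNS label limit."""
    return ace_length(label) <= MAX_LABEL_LENGTH

if __name__ == "__main__":
    for label in ["münchen", "café", "hello"]:
        print(label, len(label), ace_length(label), fits_dns_limit(label))
```

Most of the overhead in these short examples comes from the fixed four-character prefix rather than from the encoded characters themselves.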

The power of Punycode lies in its "puniness", that is, its simplicity. It managed to achieve maximum significance (universal for all characters, therefore applicable to all languages) with minimal requirements.

The algorithm: What's behind the xn--?

Punycode algorithm

Punycode-encoded labels are marked by a special prefix, "xn--", so every encoded label in a domain name starts with it. The prefix identifies an ACE (ASCII-Compatible Encoding) label and is defined in the IDNA 2003 and IDNA 2008 standards⁴⁵.

The encoding process has multiple phases:

ASCII char separation: All ASCII characters (those that don't need to be encoded) from the input string are copied to the start of the output string.
Add the hyphen (minus) separator -:

If there were any ASCII characters, the separator "-" is added after them (e.g., for "čáslav", the ASCII characters are followed by a trailing hyphen: "slav-")⁹.

Keep in mind that the hyphen itself is an ASCII character. Hyphens can therefore be part of the input string, and if they are present, they are copied to the output like any other ASCII character. This does not create any ambiguity: the last hyphen in the encoded string is always the separator, because it marks the end of the ASCII characters.

Encoding non‑ASCII chars: The characters beyond ASCII are encoded using the Bootstring algorithm with parameters for Punycode, resulting in a sequence of a‑z and 0‑9⁹.
Adding the ACE prefix xn--: In domain names, the Punycode‑encoded label is prefixed with xn-- to denote ACE (ASCII‑Compatible Encoding)¹⁰.
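
The phases above can be sketched in Python as follows. This is a minimal, illustrative encoder and not a reference implementation: the constants are the Punycode parameters given in RFC 3492, while the function names (`adapt`, `threshold`, `punycode_encode`, `to_ace`) are chosen for this example only. It performs no normalization, overflow checks, or length validation.

```python
# Bootstring parameters for Punycode (RFC 3492, section 5)
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0..35

def adapt(delta: int, numpoints: int, first_time: bool) -> int:
    """Bias adaptation: tunes the digit thresholds to the deltas seen so far."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)

def threshold(k: int, bias: int) -> int:
    """Threshold for the digit at position k, clamped to TMIN..TMAX."""
    if k <= bias:
        return TMIN
    if k >= bias + TMAX:
        return TMAX
    return k - bias

def punycode_encode(text: str) -> str:
    # Phase 1: copy the ASCII characters to the output, in order.
    basic = [c for c in text if ord(c) < 128]
    output = list(basic)
    # Phase 2: add the separator if there were any ASCII characters.
    if basic:
        output.append("-")
    # Phase 3: encode each non-ASCII character as a variable-length integer.
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    handled = len(basic)
    while handled < len(text):
        m = min(ord(c) for c in text if ord(c) >= n)  # next code point to insert
        delta += (m - n) * (handled + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                q, k = delta, BASE
                while True:
                    t = threshold(k, bias)
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, handled + 1, handled == len(basic))
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)

def to_ace(label: str) -> str:
    # Phase 4: prepend the ACE prefix.
    return "xn--" + punycode_encode(label)

if __name__ == "__main__":
    print(to_ace("čáslav"))
```

The sketch processes the non-ASCII code points in ascending order and encodes, for each one, how far the decoder has to "walk" to reach the next insertion position, which is why similar characters from the same script tend to produce short outputs.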

So, for example, if we want to encode the string "čáslav" (a Czech town):

ASCII char separation:

From the "čáslav", the ASCII chars "slav" are placed at the start of the output.

Add the hyphen (minus) separator -:

As the input contains both ASCII and non-ASCII chars, a hyphen is then added, so the output is slav-.

Encoding non‑ASCII chars:

The č and á chars are encoded using the Bootstring algorithm into 4na7x and appended to the end of the output, so the resulting Punycode output is slav-4na7x.

Adding the ACE prefix xn--:

To mark the Punycode-encoded text in the domain name, we need to prepend the xn-- prefix, which is called the ACE (ASCII-Compatible Encoding) prefix.

So the string we can use in our DNS (Domain Name System) setup is xn--slav-4na7x.
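
The worked example can be reproduced, and reversed, with Python's built-in `punycode` codec. The sketch below is only an illustration of the round trip; the helper names `to_ace_label` and `from_ace_label` are hypothetical, and the functions assume a single label with at least one non-ASCII character.

```python
ACE_PREFIX = "xn--"

def to_ace_label(label: str) -> str:
    """Encode one label with Punycode and prepend the ACE prefix."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def from_ace_label(ace: str) -> str:
    """Strip the ACE prefix and decode the Punycode part back to Unicode."""
    if not ace.startswith(ACE_PREFIX):
        return ace  # not an encoded label, return unchanged
    return ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")

if __name__ == "__main__":
    ace = to_ace_label("čáslav")
    print(ace)                  # xn--slav-4na7x
    print(from_ace_label(ace))  # čáslav
```

Decoding restores the original string exactly, which is what makes the encoding lossless and reversible.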

Examples

The table below shows how different types of input are transformed. The outputs were generated with tr46, a library that implements UTS #46 processing.

Input	Nameprepped, Punycode encoded and ACE prefixed output	Description
`hello`	`hello`	Simple ASCII word
`world-test`	`world-test`	ASCII word with hyphen
`café`	`xn--caf-dma`	French word with `é`
`naïve`	`xn--nave-6pa`	French word with `ï`
`résumé`	`xn--rsum-bpad`	French word with `é`
`Zürich`	`xn--zrich-kva`	German city with `ü` (uppercase char was lowercased by Nameprep)
`münchen`	`xn--mnchen-3ya`	German city with `ü`
`español`	`xn--espaol-zwa`	Spanish word with `ñ`
`português`	`xn--portugus-q1a`	Portuguese word with `ê`
`français`	`xn--franais-xxa`	French word with `ç`
`تست`	`xn--pgba0a`	Arabic
`δοκιμή`	`xn--jxalpdlp`	Greek
`פרובה`	`xn--5dbgb3dua`	Hebrew
`गुजराती`	`xn--31bky1czdnc`	Hindi word for "Gujarati" (Devanagari script)
`ไทย`	`xn--o3cw4h`	Thai word
`中文`	`xn--fiq228c`	Chinese word
`日本語`	`xn--wgv71a119e`	Japanese word
`한국어`	`xn--3e0bk47br7k`	Korean word
`🌟.ws`	`xn--ch8h.ws`	Emoji domain
`🎉.to`	`xn--dk8h.to`	Emoji domain
`💻.fm`	`xn--3s8h.fm`	Emoji domain
`école.fr`	`xn--cole-9oa.fr`	French school domain with `è`
`bücher.de`	`xn--bcher-kva.de`	German books domain with `ü`
`niño.ws`	`xn--nio-8ma.ws`	Spanish word domain with `ñ`
`tølløse.dk`	`xn--tllse-vuac.dk`	Danish place domain with `ø`
`العربية.ws`	`xn--mgbcd4a2b0d2b.ws`	Arabic
`test中文`	`xn--test-3f5fy05j`	Mixed ASCII and Chinese
`café123`	`xn--caf123-dva`	Mixed letters and numbers with `é`
`á`	`xn--1ca`	Single letter with acute accent
`ñ`	`xn--ida`	Single letter with tilde
`ø`	`xn--pda`	Single Scandinavian letter
`测试`	`xn--0zwm56d`	Chinese
`münchen.bayern`	`xn--mnchen-3ya.bayern`	German domain with regional TLD
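
A few of the rows above can be spot-checked with Python's standard library. Note the assumption: Python's built-in `idna` codec implements IDNA 2003 (RFC 3490), not the UTS #46 processing used to generate the table, so the two can disagree for some inputs (emoji domains, for example, are not expected to work with this codec). The helper name `to_ascii_domain` is made up for this sketch.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert a whole domain name to its ASCII form, label by label."""
    return domain.encode("idna").decode("ascii")

def to_unicode_domain(domain: str) -> str:
    """Convert an ASCII (ACE) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")

if __name__ == "__main__":
    for name in ["café", "Zürich", "bücher.de", "münchen.bayern", "world-test"]:
        print(name, "->", to_ascii_domain(name))
    print(to_unicode_domain("xn--bcher-kva.de"))
```

The codec splits the name on dots, leaves ASCII labels such as the TLD untouched, and only prefixes the labels that actually needed encoding.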

The smart solution created a vulnerability: IDN homograph attack

Punycode allowed international character sets to be used in domain names and made the internet more accessible for users all over the world. The same technology, though, provided cybercriminals with a new attack vector. Attackers started to exploit the visual similarity of characters across different alphabets to create so-called IDN homograph attacks: a form of deception in which a user is shown a URL whose domain has one or more characters swapped for visually identical (or almost identical) but different characters¹¹¹².

The principle: How the homograph attack works

IDN homograph attacks exploit visually identical or very similar characters from different alphabets, so-called homoglyphs¹³. Unicode contains more than 136,000 characters from various alphabets, many of which look almost identical but have different code points¹¹.

Homoglyph examples:

Latin char	Homoglyph	Origin	Unicode	Attack example
a (U+0061)	а (U+0430)	Cyrillic	U+0430	аpple.com
o (U+006F)	о (U+043E)	Cyrillic	U+043E	gооgle.com
p (U+0070)	р (U+0440)	Cyrillic	U+0440	рaypal.com
e (U+0065)	е (U+0435)	Cyrillic	U+0435	еbay.com
c (U+0063)	с (U+0441)	Cyrillic	U+0441	сisco.com
x (U+0078)	х (U+0445)	Cyrillic	U+0445	eхpress.com
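
The difference between a genuine name and a homograph becomes visible once the label is encoded or its characters are inspected. The sketch below is a rough heuristic, not a complete defense: it guesses each letter's script from the first word of its Unicode character name (an assumption that works for the Latin and Cyrillic examples in the table, but is not a proper script lookup), and the function names are hypothetical.

```python
import unicodedata

def scripts_in_label(label: str) -> set:
    """Approximate the scripts used in a label via Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts

def looks_mixed_script(label: str) -> bool:
    """Flag labels that combine letters from more than one script."""
    return len(scripts_in_label(label)) > 1

if __name__ == "__main__":
    genuine = "apple"
    spoofed = "\u0430pple"  # first letter is Cyrillic U+0430, not Latin "a"
    print(looks_mixed_script(genuine), looks_mixed_script(spoofed))
    # The encoded form exposes the spoofed label immediately:
    print("xn--" + spoofed.encode("punycode").decode("ascii"))
```

A label written entirely in one non-Latin script would pass this check, so it only illustrates the mixed-script case shown in the table.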

1. RFC 1123: Requirements for Internet Hosts – Application and Support. Section 2.1. Retrieved August 30, 2024.
2. INFITT: A New Architecture for Multilingual Internet Domains. Retrieved August 30, 2024.
3. Unicode Consortium: The Unicode Standard. Retrieved August 30, 2024.
4. RFC 3490: Internationalizing Domain Names in Applications (IDNA). Retrieved August 30, 2024.
5. RFC 5890: Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework. Retrieved August 30, 2024.
6. IETF: Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). Retrieved August 30, 2024.
7. IETF: Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). Retrieved August 30, 2024.
8. RFC 1035: Domain Names – Implementation and Specification. Section 2.3.4. Retrieved August 30, 2024.
9. RFC 3492: Punycode: A Bootstring encoding of Unicode for IDNA. Retrieved August 30, 2024.
10. IANA (February 14, 2003): "Completion of IANA Selection of IDNA Prefix". See Wikipedia note [6]: https://en.wikipedia.org/wiki/Punycode#cite_note-6. Retrieved May 2, 2025.
11. Unicode Technical Report #36: Unicode Security Considerations. Retrieved August 30, 2024.
12. ICANN SSAC Advisory SAC037: Display and Usage of Internationalized Registration Data. Retrieved August 30, 2024.
13. Unicode Consortium: Unicode Security Mechanisms. Retrieved December 19, 2024.