Percent-encoding: how URLs handle characters they weren't built for

You've seen %20 in a URL. Probably %2F too, or that classic %3F when someone pastes a query string wrong. That % followed by two hex digits is percent-encoding -- the web's way of stuffing arbitrary bytes through a channel originally designed for a narrow slice of ASCII.

The mechanism is ancient by web standards. It dates back to 1994 and RFC 1738, the first standards-track URL specification¹. And it's still everywhere, running underneath every link you click.

How the %XX triplet works

Take a byte. Write it as % plus two hexadecimal digits. Space (0x20) becomes %20. Hash (0x23) becomes %23. That's the entire idea.

The hex digits can be uppercase or lowercase -- %2f and %2F both mean forward slash. RFC 3986 recommends producers use uppercase, but consumers must accept either².
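To make the triplet rule concrete, here's a minimal Python sketch. The helper name percent_encode_byte is my own invention for illustration; only unquote comes from the standard library.

```python
from urllib.parse import unquote


def percent_encode_byte(byte: int) -> str:
    """Return the %XX triplet for a single byte, using uppercase hex."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value (0-255)")
    return f"%{byte:02X}"


space = percent_encode_byte(0x20)      # "%20"
hash_sign = percent_encode_byte(0x23)  # "%23"

# Decoders accept either hex case: both spellings yield a forward slash.
lower = unquote("%2f")
upper = unquote("%2F")
print(space, hash_sign, lower == upper == "/")  # %20 %23 True
```

Emitting uppercase follows the RFC 3986 recommendation for producers; the decoder side stays case-insensitive.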

[Figure: Percent-encoding decision flowchart, showing the decision process for a given character]

Here's the part that catches people: percent-encoding operates on bytes, not characters. For ASCII, there's a clean 1:1 mapping -- @ is always byte 0x40, so it always becomes %40. But anything outside ASCII has to be converted to bytes first (using a character encoding, almost always UTF-8), and then each byte gets its own %XX triplet.

The accented letter é (U+00E9) in UTF-8 produces two bytes: 0xC3 and 0xA9. So it becomes %C3%A9. A Chinese character like 中 (U+4E2D) turns into three UTF-8 bytes and becomes %E4%B8%AD.
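The bytes-then-triplets step can be sketched in a few lines of Python. utf8_percent_encode is a hypothetical helper that encodes every byte, including ASCII ones that would not normally need it, purely to show the mapping.

```python
from urllib.parse import quote


def utf8_percent_encode(text: str) -> str:
    """Percent-encode every UTF-8 byte of text (illustrative helper)."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


e_acute = utf8_percent_encode("\u00e9")  # U+00E9, two UTF-8 bytes
zhong = utf8_percent_encode("\u4e2d")    # U+4E2D, three UTF-8 bytes
at_sign = utf8_percent_encode("@")       # ASCII, one byte
print(e_acute, zhong, at_sign)  # %C3%A9 %E4%B8%AD %40

# The standard library takes the same route for non-ASCII input.
stdlib_matches = quote("\u00e9\u4e2d") == e_acute + zhong
print(stdlib_matches)  # True
```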

[Figure: Unicode to percent-encoded bytes via UTF-8]

Character classes in RFC 3986

The current URI standard is RFC 3986, published in January 2005 by Berners-Lee, Fielding, and Masinter². It splits characters into categories that determine what needs encoding and where.

Unreserved characters never need encoding:

A-Z a-z 0-9 - . _ ~

Letters, digits, hyphen, period, underscore, tilde. If you percent-encode one of these, a conforming implementation should treat %41 identically to A².
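Here's a rough sketch of what that equivalence looks like as a normalization pass: decode any triplet that stands for an unreserved character and uppercase the hex digits of the rest. The function name normalize_escapes is an assumption of mine, and the sketch ignores the other normalization steps in RFC 3986 (case of scheme and host, dot segments).

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(uri: str) -> str:
    """Decode escaped unreserved characters; uppercase all other triplets."""
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                      # %41 -> A, %7E -> ~
        return "%" + match.group(1).upper()  # %2f -> %2F, left encoded
    return TRIPLET.sub(fix, uri)


normalized = normalize_escapes("/%7Euser/%41bc%2fdef")
print(normalized)  # /~user/Abc%2Fdef
```

Note that %2f stays encoded: a slash is reserved, so decoding it could change the meaning of the path.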

Reserved characters carry syntactic meaning:

Gen-delims: : / ? # [ ] @
Sub-delims: ! $ & ' ( ) * + , ; =

Whether a reserved character needs encoding depends on context. A / in a path is a delimiter -- don't encode it. A / inside a query parameter value is data and should be encoded as %2F. The same character, legal in one spot, must be escaped in another. This context-dependence is honestly the trickiest part of the whole system.

Everything else -- control characters, spaces, non-ASCII bytes, { } | \ ^ -- always needs encoding.
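One way to respect that context-dependence is to encode each piece of data on its own and add the delimiters afterwards. This is a minimal sketch, not a full URL builder; build_url and the example.com values are made up for illustration.

```python
from urllib.parse import quote


def build_url(base: str, segments: list[str], params: dict[str, str]) -> str:
    """Encode each segment and parameter separately, then join with delimiters."""
    # safe="" means reserved characters inside the data get escaped too.
    path = "/".join(quote(seg, safe="") for seg in segments)
    query = "&".join(
        f"{quote(key, safe='')}={quote(value, safe='')}"
        for key, value in params.items()
    )
    return f"{base}/{path}?{query}"


url = build_url(
    "https://example.com",
    ["files", "reports/2024 Q1"],  # the slash here is data, not a delimiter
    {"ref": "a/b&c=d"},
)
print(url)
# https://example.com/files/reports%2F2024%20Q1?ref=a%2Fb%26c%3Dd

# Characters outside the reserved and unreserved sets are always escaped.
leftovers = quote("{ }|\\^", safe="")
print(leftovers)  # %7B%20%7D%7C%5C%5E
```

The slashes and the ? and & that the function adds itself are syntax; everything that came in as data has been escaped.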

Standards evolution (briefly)

[Figure: Timeline of URI encoding standards from 1994 to present]

RFC 1738 (1994) introduced %HH escaping with vague categories of "safe" and "unsafe" characters¹. It listed ~ as unsafe, which caused years of problems with Unix home directory URLs. RFC 2396 (1998) cleaned things up: ~ moved to unreserved, terminology got sharper, and the spec shifted from "URL" to the broader "URI"³. RFC 3986 (2005) was the big overhaul -- reserved characters split into gen-delims and sub-delims, normalization rules formalized, and the unreserved set trimmed to just the 66 characters we use today².

A companion spec, RFC 3987, defined Internationalized Resource Identifiers (IRIs) -- URIs that can contain Unicode⁴. The conversion rule is simple: encode non-ASCII characters as UTF-8, then percent-encode the bytes. The spec is firm that UTF-8 is the only acceptable encoding for this. No Latin-1, no Shift_JIS. Browsers had already converged on this by 2005; the RFC just formalized it.
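As a rough illustration of that conversion rule, the sketch below leaves ASCII reserved characters and existing % signs alone and percent-encodes everything non-ASCII as UTF-8. iri_to_uri is a hypothetical helper, and it deliberately skips the host: a non-ASCII domain name would need separate handling.

```python
from urllib.parse import quote

# Reserved characters plus "%" are passed through untouched (an assumption
# that the input's ASCII structure and existing escapes are already valid).
KEEP = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 bytes (sketch only)."""
    return quote(iri, safe=KEEP, encoding="utf-8")


uri = iri_to_uri("https://example.com/caf\u00e9?q=\u4e2d")
print(uri)  # https://example.com/caf%C3%A9?q=%E4%B8%AD
```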

IRIs are what the browser's address bar displays. The wire protocol still sends percent-encoded URIs. Related: if you're working with internationalized domain names, those use a different encoding scheme called Punycode, which is part of the broader IDN infrastructure.

The form encoding oddity

HTML forms do their own thing. When a form submits with method="GET" (or POST with the default enctype), the browser uses application/x-www-form-urlencoded⁵. It looks like percent-encoding but has one well-known deviation: spaces become + signs.

name=John+Doe&city=New+York

This dates back to RFC 1866 -- the HTML 2.0 specification from 1995⁶. RFC 3986 knows nothing about + meaning space. In a URI path, + is a literal plus sign. But in form-encoded query strings, + and %20 both represent a space.

Form encoding is also more aggressive: it encodes everything except A-Z a-z 0-9 * - . _⁵. That means ~, which is unreserved in RFC 3986, gets percent-encoded in form data, and so does !, a sub-delim that is often left alone elsewhere.

I find this split genuinely annoying. You can't just write one encoding function and use it everywhere -- the rules differ depending on whether you're building a path, a query parameter, or form data.
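To see the form rules side by side with plain percent-encoding, here's a small sketch of a form-style encoder built from the description above: keep A-Z a-z 0-9 * - . _, turn spaces into +, escape everything else as UTF-8 bytes. The function names are mine, and this is an approximation of the behavior described here rather than a replacement for a real library (Python's own urlencode, for example, leaves ~ unescaped).

```python
import string

FORM_SAFE = frozenset((string.ascii_letters + string.digits + "*-._").encode("ascii"))


def form_encode_component(text: str) -> str:
    """Encode one name or value using the form-urlencoded rules sketched above."""
    out = []
    for byte in text.encode("utf-8"):
        if byte == 0x20:
            out.append("+")            # the well-known deviation
        elif byte in FORM_SAFE:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def form_encode(pairs: list[tuple[str, str]]) -> str:
    """Join encoded name=value pairs with &."""
    return "&".join(
        f"{form_encode_component(name)}={form_encode_component(value)}"
        for name, value in pairs
    )


body = form_encode([("name", "John Doe"), ("city", "New York")])
print(body)  # name=John+Doe&city=New+York

strict = form_encode_component("~!")
print(strict)  # %7E%21
```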

encodeURI vs encodeURIComponent

This is the question developers actually google.

[Figure: encodeURI vs encodeURIComponent comparison, showing which characters each function encodes differently]

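encodeURI() encodes a complete URI. It leaves structural characters alone -- : / ? # @ ! $ & ' ( ) * + , ; = -- because those are part of the URI's syntax⁷. Square brackets are the exception: [ and ] are reserved in RFC 3986, but encodeURI() still encodes them as %5B and %5D.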

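encodeURIComponent() encodes a single component (a query parameter value, a path segment). It encodes most of the structural characters too ( : / ? # [ ] @ $ & + , ; = ), because inside a component, a / or & is data, not syntax⁸. It still leaves ! ' ( ) * untouched, so escape those yourself if the consumer treats them specially.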

const url = "https://example.com/search?q=coffee & tea";

encodeURI(url);
// "https://example.com/search?q=coffee%20&%20tea"
// Bad: & is preserved, breaking the query string

"https://example.com/search?q=" + encodeURIComponent("coffee & tea");
// "https://example.com/search?q=coffee%20%26%20tea"
// Correct: & is encoded as %26

Rule of thumb: encodeURIComponent() for values, encodeURI() for whole URLs where the structure is already valid. In practice, I reach for encodeURIComponent() about 95% of the time.

Neither function produces form encoding -- both use %20 for spaces, never +. For that, use URLSearchParams:

const params = new URLSearchParams({ q: "coffee & tea" });
params.toString(); // "q=coffee+%26+tea"

Python has a similar split: urllib.parse.quote() uses %20 for spaces and keeps / safe by default (path encoding), while urllib.parse.quote_plus() uses + for spaces and encodes slashes (form encoding)⁹.

from urllib.parse import quote, quote_plus

quote("hello world")       # 'hello%20world'
quote_plus("hello world")  # 'hello+world'
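The safe parameter is where the path-versus-component distinction lives in Python. A short sketch, using a made-up value that contains both a slash and a space:

```python
from urllib.parse import quote, quote_plus

value = "a/b c"

path_style = quote(value)                # slash kept, space as %20
component_style = quote(value, safe="")  # slash escaped, space as %20
form_style = quote_plus(value)           # slash escaped, space as +

print(path_style)       # a/b%20c
print(component_style)  # a%2Fb%20c
print(form_style)       # a%2Fb+c
```

Passing safe="" to quote() is roughly the choice you make when the value is a single component rather than a whole path.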

Common pitfalls

Double encoding. If you encode a URL that's already percent-encoded, the % signs themselves get encoded: %20 becomes %2520. This happens when one layer of your stack encodes a value and another layer re-encodes it. RFC 3986 warns against this explicitly². ORMs and HTTP client libraries are frequent offenders.
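Double encoding is easy to reproduce. In this sketch the second quote() call stands in for whatever layer re-encodes the value:

```python
from urllib.parse import quote, unquote

once = quote("a b")   # first layer encodes the space
twice = quote(once)   # second layer encodes the % sign itself

print(once)   # a%20b
print(twice)  # a%2520b

# A single decode only peels off one layer.
decoded_once = unquote(twice)
decoded_twice = unquote(decoded_once)
print(decoded_once, "|", decoded_twice)  # a%20b | a b
```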

The plus sign trap. In a form-encoded query string, + means space. In a URI path, + is a literal plus. Decode a path segment with a form decoder and every + silently becomes a space. I've seen this bug in production more than once.
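The trap looks like this in Python, where unquote() follows the URI rules and unquote_plus() follows the form rules. The "C++" segment is just an example value:

```python
from urllib.parse import parse_qs, unquote, unquote_plus

segment = "C++"

as_path = unquote(segment)       # URI rules: plus stays a plus
as_form = unquote_plus(segment)  # form rules: each plus becomes a space

print(repr(as_path))  # 'C++'
print(repr(as_form))  # 'C  '

# In a form-encoded query string, a literal plus has to travel as %2B.
space_value = parse_qs("q=a+b")["q"][0]
plus_value = parse_qs("q=a%2Bb")["q"][0]
print(repr(space_value), repr(plus_value))  # 'a b' 'a+b'
```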

Wrong encoding context. Using encodeURI() on a query parameter value won't encode / or &, potentially breaking the URL structure. Using encodeURIComponent() on an entire URL will encode the :// and every / in the path, making it useless. The function choice depends on what you're encoding and where it goes.
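The same mistake exists outside JavaScript. Below is a Python sketch of the second failure (component-encoding an entire URL) next to one reasonable alternative: encode only the value and let urlunsplit() assemble the structure. The example.com URL is illustrative.

```python
from urllib.parse import quote, urlencode, urlunsplit

raw = "https://example.com/search?q=coffee & tea"

# Component-style encoding applied to the whole URL destroys its structure.
mangled = quote(raw, safe="")
print(mangled)
# https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dcoffee%20%26%20tea

# Encode only the data, then assemble scheme, host, path, and query.
query = urlencode({"q": "coffee & tea"}, quote_via=quote)
built = urlunsplit(("https", "example.com", "/search", query, ""))
print(built)
# https://example.com/search?q=coffee%20%26%20tea
```

quote_via=quote keeps spaces as %20; drop it if you want the form-style + instead.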

Non-UTF-8 legacy. Modern systems use UTF-8 everywhere, but older systems might encode as Latin-1 or Windows-1252. The character é is 0xE9 in Latin-1 (one byte, %E9) but 0xC3 0xA9 in UTF-8 (two bytes, %C3%A9). Mixing these up silently corrupts text. The WHATWG URL standard makes UTF-8 the default for URL encoding in browsers¹⁰, which has mostly eliminated this problem in web contexts -- but APIs and legacy backends can still surprise you.
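A quick Python sketch of the mismatch, in both directions. The encoding arguments are real parameters of quote() and unquote(); the scenario itself is contrived.

```python
from urllib.parse import quote, unquote

utf8_form = quote("\u00e9")                        # UTF-8 is the default
latin1_form = quote("\u00e9", encoding="latin-1")  # legacy producer

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9

# UTF-8 bytes decoded as Latin-1: two wrong characters instead of one right one.
mojibake = unquote(utf8_form, encoding="latin-1")
print(mojibake == "\u00c3\u00a9")  # True

# A Latin-1 byte decoded as UTF-8: invalid, so it becomes U+FFFD by default.
replaced = unquote(latin1_form)
print(replaced == "\ufffd")  # True
```

Neither direction raises an error by default, which is why this kind of corruption tends to go unnoticed.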

Citations

1. RFC 1738: Uniform Resource Locators (URL). Retrieved March 16, 2026
2. RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. Retrieved March 16, 2026
3. RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax. Retrieved March 16, 2026
4. RFC 3987: Internationalized Resource Identifiers (IRIs). Retrieved March 16, 2026
5. WHATWG: URL Living Standard -- application/x-www-form-urlencoded. Retrieved March 16, 2026
6. RFC 1866: Hypertext Markup Language - 2.0. Section 8.2.1, Form submission. Retrieved March 16, 2026
7. MDN: encodeURI(). Retrieved March 16, 2026
8. MDN: encodeURIComponent(). Retrieved March 16, 2026
9. Python Software Foundation: urllib.parse -- Parse URLs into components. Retrieved March 16, 2026
10. WHATWG: URL Living Standard. Retrieved March 16, 2026

Updated: March 16, 2026