About the UTF-8 Encoder / Decoder
The UTF-8 Encoder / Decoder converts text to UTF-8 byte sequences and decodes UTF-8 encoded bytes back to readable characters. UTF-8 is the dominant character encoding on the web — over 98% of all websites use it. This tool handles plain text, special characters, emoji, and multi-byte Unicode code points including CJK characters, Arabic script, and mathematical symbols.
How to Use
- To encode: Paste your text into the input field and click Encode. The tool outputs the UTF-8 byte sequence as percent-encoded hex values (e.g., %E2%9C%93) or raw hex bytes.
- To decode: Paste UTF-8 encoded bytes or a percent-encoded string into the input field and click Decode. The tool recovers the original Unicode text.
- Switch between encoding formats using the output mode selector: percent-encoding, hex bytes, or decimal byte values.
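The three output modes can be approximated in a few lines of Python. This is only a sketch of the idea, not the tool's actual implementation: the function names `encode_text` and `decode_text` and the mode labels are hypothetical, and the space-separated layout for hex and decimal output is an assumption.

```python
from urllib.parse import quote, unquote


def encode_text(text: str, mode: str = "percent") -> str:
    """Render the UTF-8 bytes of `text` in one of three output modes."""
    data = text.encode("utf-8")
    if mode == "percent":
        # safe="" also percent-encodes reserved ASCII such as "/" and "?"
        return quote(data, safe="")
    if mode == "hex":
        return " ".join(f"{b:02X}" for b in data)
    if mode == "decimal":
        return " ".join(str(b) for b in data)
    raise ValueError(f"unknown mode: {mode}")


def decode_text(encoded: str, mode: str = "percent") -> str:
    """Invert encode_text; raises UnicodeDecodeError on invalid UTF-8."""
    if mode == "percent":
        return unquote(encoded, encoding="utf-8", errors="strict")
    if mode == "hex":
        # bytes.fromhex skips the spaces between byte pairs
        return bytes.fromhex(encoded).decode("utf-8")
    if mode == "decimal":
        return bytes(int(part) for part in encoded.split()).decode("utf-8")
    raise ValueError(f"unknown mode: {mode}")


print(encode_text("✓"))             # %E2%9C%93
print(encode_text("✓", "hex"))      # E2 9C 93
print(encode_text("✓", "decimal"))  # 226 156 147
```

Decoding uses strict error handling here so that malformed input fails loudly instead of being silently replaced.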
How UTF-8 Encoding Works
UTF-8 is a variable-width encoding that represents every Unicode code point using one to four bytes:
- 1 byte (U+0000–U+007F) — ASCII characters. The byte value equals the code point. Example: A → 0x41.
- 2 bytes (U+0080–U+07FF) — Latin extended, Greek, Cyrillic, Hebrew, Arabic, and similar scripts. Example: é (U+00E9) → 0xC3 0xA9.
- 3 bytes (U+0800–U+FFFF) — CJK characters, most remaining BMP scripts. Example: 中 (U+4E2D) → 0xE4 0xB8 0xAD.
- 4 bytes (U+10000–U+10FFFF) — Supplementary planes including emoji, historic scripts, and mathematical notation. Example: 😀 (U+1F600) → 0xF0 0x9F 0x98 0x80.
The leading byte encodes both the total byte count and the high bits of the code point. Continuation bytes always begin with 10 in binary, allowing decoders to resynchronize after a transmission error.
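To make the bit layout concrete, here is a hand-written encoder for a single code point, together with a helper that skips continuation bytes to find the next character boundary. It is a teaching sketch; in practice Python's built-in `str.encode("utf-8")` does this work. The names `utf8_encode_code_point`, `is_continuation`, and `next_boundary` are made up for this example.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand using the UTF-8 bit layout."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are reserved for UTF-16")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def is_continuation(byte: int) -> bool:
    """Continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Resynchronize: advance from pos to the start of the next character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


print(utf8_encode_code_point(0x4E2D).hex(" "))   # e4 b8 ad
print(utf8_encode_code_point(0x1F600).hex(" "))  # f0 9f 98 80
```

Because a continuation byte can never be mistaken for a leading byte, `next_boundary` needs no knowledge of what came before `pos`.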
UTF-8 vs Other Unicode Encodings
- UTF-8 — Variable-width (1–4 bytes). ASCII-compatible. The default for HTML, JSON, XML, and most internet protocols. No byte-order mark needed.
- UTF-16 — Variable-width: 2 bytes for BMP characters, 4 bytes (a surrogate pair) for supplementary planes. Used internally by Windows, Java, and JavaScript strings. Needs a BOM (byte order mark) or an explicit UTF-16LE/UTF-16BE label to indicate endianness.
- UTF-32 — Fixed 4 bytes per code point. Simple indexing but memory-heavy. Used in some Unix/Linux internal processing.
- Latin-1 (ISO-8859-1) — Legacy 8-bit encoding covering 256 characters. Incompatible with non-Western scripts. Files mislabeled as Latin-1 that contain UTF-8 data are a common source of mojibake (garbled characters).
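The size trade-offs in this comparison are easy to measure. The sketch below encodes one sample character from each UTF-8 width class and reports the byte count per encoding. The helper name `encoded_sizes` is hypothetical, and the little-endian codec variants are used so that no BOM is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of `ch` per encoding; None if the encoding cannot represent it."""
    sizes = {}
    for name in ("utf-8", "utf-16-le", "utf-32-le", "latin-1"):
        try:
            sizes[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


for ch in ("A", "é", "中", "😀"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For `😀` this reports 4 bytes in UTF-8, UTF-16, and UTF-32 alike, and `None` for Latin-1, which has no representation for it.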
Common UTF-8 Encoding Issues
- Mojibake — Garbled characters caused by reading UTF-8 data with the wrong encoding. Common symptom: é appears as Ã© (the two UTF-8 bytes 0xC3 0xA9 interpreted as Latin-1). Fix: ensure the database connection, file read, and HTTP response all use UTF-8 consistently.
- Double encoding — A string already percent-encoded is encoded again, turning %20 into %2520. Happens when encoding is applied at multiple layers without decoding between them.
- BOM at file start — The UTF-8 BOM (0xEF 0xBB 0xBF) is optional and unnecessary in UTF-8. Some Windows tools add it; it can break JSON parsers, PHP scripts, and HTTP headers if not stripped.
- Incomplete multi-byte sequences — A 3-byte character with only 2 bytes transmitted is invalid UTF-8. Caused by string truncation that does not respect character boundaries. Use mb_substr() in PHP or equivalent multi-byte-aware string functions.
- Null byte handling — The NUL character (U+0000) encodes to a single 0x00 byte in UTF-8, which terminates C strings. Some libraries use Modified UTF-8 (MUTF-8) encoding NUL as 0xC0 0x80 to avoid this.
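Most of the issues listed above can be reproduced, and worked around, in a few lines. The helpers below are illustrative sketches with made-up names (`make_mojibake`, `repair_mojibake`, `strip_bom`, `truncate_utf8`). The repair function only works when the text was garbled exactly once through a Latin-1 misread, and `truncate_utf8` assumes its input is valid UTF-8.

```python
import codecs
from urllib.parse import quote


def make_mojibake(text: str) -> str:
    """Reproduce mojibake: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Undo a single UTF-8-read-as-Latin-1 round trip."""
    return garbled.encode("latin-1").decode("utf-8")


def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte (10xxxxxx),
    # the cut falls inside a character, so back up to its leading byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


print(make_mojibake("é"))                    # Ã©
print(repair_mojibake("Ã©"))                 # é
print(quote(quote("a b")))                   # a%2520b (double encoding)
print(truncate_utf8("a中b".encode("utf-8"), 3))  # b'a'
```

Decoding a file with the `utf-8-sig` codec is another way to drop a BOM in Python; it strips the marker when present and otherwise behaves like `utf-8`.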
Frequently Asked Questions
- What is the difference between UTF-8 encoding and URL percent-encoding?
- They are related but distinct. UTF-8 is a character encoding: it defines the byte representation of every Unicode character. URL percent-encoding (RFC 3986) is a transport encoding that takes those bytes and represents each byte that is not an unreserved ASCII character (every non-ASCII byte, plus characters such as spaces) as %XX, where XX is the byte's hex value. Together they allow any Unicode character to appear in a URL safely.
- How do I set UTF-8 encoding in my application?
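- In HTML: <meta charset="utf-8"> in the document head. In PHP: mb_internal_encoding('UTF-8') and header('Content-Type: text/html; charset=utf-8'). In MySQL: CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci (use utf8mb4, not utf8, to support the full 4-byte range including emoji). In Python 3: UTF-8 is the default for source files and strings are Unicode throughout; pass encoding='utf-8' explicitly to open(), because the default file encoding can depend on the platform locale.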
- Why does MySQL have both utf8 and utf8mb4?
- MySQL's utf8 charset is a misnomer: it supports only up to 3-byte sequences (BMP characters), so emoji and other 4-byte code points are rejected or truncated, depending on the SQL mode. utf8mb4 is the correct full UTF-8 implementation. Always use utf8mb4 for new tables; utf8 is a legacy alias (for utf8mb3) kept for compatibility.
- What does "invalid UTF-8 sequence" mean in my logs?
- The byte stream contains sequences that do not conform to the UTF-8 specification — either bytes that cannot appear in valid UTF-8, or multi-byte sequences with incorrect continuation bytes. Common causes: data received from a Latin-1 source, binary data mixed with text, or a truncated string at a multi-byte character boundary. Sanitize input with mb_convert_encoding($str, 'UTF-8', 'UTF-8') in PHP or equivalent to strip or replace invalid sequences.
- Can UTF-8 represent every Unicode character?
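- Yes. UTF-8 can encode every Unicode scalar value: the 1,114,112 code points from U+0000 to U+10FFFF, minus the 2,048 surrogate code points (U+D800–U+DFFF) that are reserved for UTF-16. That leaves 1,112,064 encodable values, covering every script, symbol, and emoji in the Unicode standard. It is the only widely used encoding that is simultaneously ASCII-compatible, self-synchronizing, and able to represent all of Unicode.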
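Several of the answers above can be checked with short Python snippets. The sketch below is illustrative only: `needs_utf8mb4`, `is_valid_utf8`, and `sanitize_utf8` are hypothetical helper names. `sanitize_utf8` replaces invalid sequences with U+FFFD, which is one reasonable policy among several and may differ from what the PHP call in the answer above produces.

```python
from urllib.parse import quote


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the BMP, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


def is_valid_utf8(data: bytes) -> bool:
    """Strict check: does the byte string decode as UTF-8 without errors?"""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sanitize_utf8(data: bytes) -> str:
    """Decode, replacing each invalid sequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# Percent-encoding operates on bytes, so the character encoding chosen
# first determines the result.
print(quote("é"))                      # %C3%A9 (UTF-8 bytes)
print(quote("é", encoding="latin-1"))  # %E9    (Latin-1 byte)

print(needs_utf8mb4("héllo"))          # False
print(needs_utf8mb4("hi 😀"))          # True

latin1_bytes = "café".encode("latin-1")  # b'caf\xe9', not valid UTF-8
print(is_valid_utf8(latin1_bytes))       # False
print(sanitize_utf8(latin1_bytes))       # caf followed by U+FFFD
```

A check like `needs_utf8mb4` can help when auditing existing data before migrating a table, although converting the table to utf8mb4 is the simpler long-term fix.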