About the UTF-32 Encoder / Decoder
The UTF-32 Encoder / Decoder converts text to UTF-32 byte sequences and decodes UTF-32 encoded data back to readable Unicode characters. UTF-32 is a fixed-width encoding that stores every Unicode code point as exactly four bytes — making it the simplest Unicode encoding for random access and character indexing, at the cost of being the most memory-intensive.
How to Use
- To encode: Paste your text into the input field and click Encode. Each character is output as a 4-byte (8 hex character) code unit.
- To decode: Paste a UTF-32 hex byte sequence and click Decode to recover the original text.
- Select byte order — UTF-32 LE (Little Endian) or UTF-32 BE (Big Endian) — to match your source data.
- Toggle BOM inclusion to match your file format requirements.
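The options above can be approximated in a few lines of Python. This is only a sketch of the behaviour described here, not the tool's actual implementation; the function names `utf32_encode_hex` and `utf32_decode_hex` and their parameters are hypothetical.

```python
import codecs

def utf32_encode_hex(text: str, little_endian: bool = True, bom: bool = False) -> str:
    """Return the UTF-32 bytes of `text` as space-separated hex pairs."""
    codec = "utf-32-le" if little_endian else "utf-32-be"
    data = text.encode(codec)  # the explicit-endian codecs never add a BOM
    if bom:
        prefix = codecs.BOM_UTF32_LE if little_endian else codecs.BOM_UTF32_BE
        data = prefix + data
    return " ".join(f"{b:02X}" for b in data)

def utf32_decode_hex(hex_text: str, little_endian: bool = True) -> str:
    """Decode space-separated hex pairs as UTF-32, dropping a leading BOM if present."""
    data = bytes.fromhex(hex_text)  # whitespace between pairs is ignored
    codec = "utf-32-le" if little_endian else "utf-32-be"
    text = data.decode(codec)
    # The explicit-endian codecs keep a BOM as U+FEFF, so remove it by hand.
    return text[1:] if text.startswith("\ufeff") else text

print(utf32_encode_hex("A", little_endian=False))            # 00 00 00 41
print(utf32_encode_hex("A", little_endian=True, bom=True))   # FF FE 00 00 41 00 00 00
print(utf32_decode_hex("00 01 F6 00", little_endian=False))  # 😀
```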
How UTF-32 Encoding Works
UTF-32 is the simplest of the Unicode encodings. Every code point is stored directly as a 32-bit (4-byte) integer:
- The character A (U+0041) → 00 00 00 41 (BE) or 41 00 00 00 (LE).
- The character 中 (U+4E2D) → 00 00 4E 2D (BE) or 2D 4E 00 00 (LE).
- The emoji 😀 (U+1F600) → 00 01 F6 00 (BE) or 00 F6 01 00 (LE).
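Because the code point value maps directly to the 32-bit integer (with no surrogate pairs or variable-length sequences), indexing is O(1): counting from zero and skipping any BOM, the nth code point starts at byte offset n × 4. Note that this indexes code points, not user-perceived characters; a letter followed by a combining accent still occupies two code units.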
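The direct mapping and the n × 4 offset rule can be sketched with plain integer-to-bytes conversion. The helper names `encode_code_point` and `code_point_at` are made up for this illustration, and the input is assumed to carry no BOM.

```python
def encode_code_point(cp: int, byteorder: str = "big") -> bytes:
    """Store one code point as a 4-byte integer ("big" = BE, "little" = LE)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} is not encodable in UTF-32")
    return cp.to_bytes(4, byteorder)

def code_point_at(data: bytes, n: int, byteorder: str = "big") -> int:
    """Return the nth (zero-based) code point of BOM-less UTF-32 data in O(1)."""
    offset = n * 4
    if n < 0 or offset + 4 > len(data):
        raise IndexError("code point index out of range")
    return int.from_bytes(data[offset:offset + 4], byteorder)

print(encode_code_point(0x4E2D, "big").hex(" "))      # 00 00 4e 2d
print(encode_code_point(0x1F600, "little").hex(" "))  # 00 f6 01 00

data = "A中😀".encode("utf-32-be")
print(hex(code_point_at(data, 2)))                    # 0x1f600
```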
UTF-32 vs UTF-8 vs UTF-16
- UTF-8 — Variable-width (1–4 bytes). Most compact for ASCII and Western text. The web standard. Cannot index characters in O(1) without scanning.
- UTF-16 — Variable-width (2 or 4 bytes). Used internally by Windows, Java, and JavaScript. More compact than UTF-32 for BMP characters. Requires surrogate pair handling for supplementary characters.
- UTF-32 — Fixed-width (4 bytes always). O(1) code point access, no surrogate pairs, no variable-length complexity. Requires 4× the storage of UTF-8 for ASCII text. Used inside CPython 3 (as UCS-4, for strings that contain supplementary characters) and by 4-byte wchar_t APIs on Unix/Linux.
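The size trade-offs in this comparison are easy to check by encoding a few samples. The snippet below is a small illustration; `encoded_sizes` is a hypothetical helper, and the explicit little-endian codecs are used so that no BOM inflates the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding, without a BOM."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for label, sample in [("ASCII", "hello"), ("CJK", "中文"), ("emoji", "😀")]:
    print(label, encoded_sizes(sample))
```

For these samples the ASCII text takes 5, 10, and 20 bytes, the two CJK characters take 6, 4, and 8 bytes, and the single emoji takes 4 bytes in all three encodings.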
When UTF-32 is Used
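- Python 3 (CPython) — Since Python 3.3 (PEP 393), CPython picks a fixed width per string: 1 byte per character if every character is at most U+00FF, 2 bytes if every character is at most U+FFFF, and 4 bytes (UCS-4, essentially UTF-32) as soon as the string contains any character above U+FFFF. Whichever width is chosen, indexing and slicing stay fast, and len(str) returns the code point count, not the byte count.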
- Linux/macOS wide characters — On Linux and macOS, wchar_t is 4 bytes and stores UTF-32 code points. On Windows, wchar_t is 2 bytes (UTF-16 code units).
- Internal text processing — Applications that need frequent random access or character-level slicing sometimes convert to UTF-32 internally to avoid the complexity of variable-width encodings, then convert back for storage or transmission.
- Research and interoperability testing — UTF-32 is useful for verifying Unicode codec implementations because its byte structure is trivially predictable.
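CPython's per-string storage width can be observed indirectly by measuring how `sys.getsizeof` grows as a string gets longer. This is a rough sketch that assumes CPython 3.3 or later; other Python implementations may store strings differently, and `bytes_per_char` is a hypothetical helper.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage per character from the size difference of two repeats."""
    small = sys.getsizeof(ch * n)
    large = sys.getsizeof(ch * (2 * n))
    return (large - small) // n  # the fixed object header cancels out

for ch in ("a", "é", "中", "😀"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

On CPython this should report 1 byte for "a" and "é", 2 bytes for "中", and 4 bytes for "😀", so only strings with supplementary characters pay the full UTF-32 cost.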
Common UTF-32 Issues
- Memory usage — UTF-32 uses 4 bytes per character regardless of the character. An ASCII string that is 1 KB in UTF-8 becomes 4 KB in UTF-32. For large text corpora or memory-constrained environments, this is significant.
- Byte order confusion — Like UTF-16, UTF-32 comes in LE and BE variants. The BOM (U+FEFF encoded as FF FE 00 00 for LE or 00 00 FE FF for BE) identifies the byte order. Without a BOM, misidentifying endianness produces completely wrong code points.
- Poor ecosystem support — UTF-32 is rarely used in file formats, network protocols, or databases. Most parsers, editors, and APIs expect UTF-8 or UTF-16. Conversion is required at almost every system boundary.
- Null bytes in ASCII — UTF-32 pads every ASCII character with three null bytes, trailing in LE and leading in BE (e.g., A → 41 00 00 00 in LE or 00 00 00 41 in BE), which breaks byte-oriented null-terminated string handling in C.
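The byte order issue can be illustrated with a simple BOM check and two deliberate misreads. This is a minimal sketch; `detect_utf32_byte_order` is a hypothetical helper, not part of any library.

```python
import codecs

def detect_utf32_byte_order(data: bytes):
    """Return the codec name implied by a UTF-32 BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF32_LE):  # FF FE 00 00
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):  # 00 00 FE FF
        return "utf-32-be"
    return None

print(detect_utf32_byte_order(bytes.fromhex("FFFE0000 41000000")))  # utf-32-le

# Misread 1: "A" in LE (41 00 00 00) read as BE is 0x41000000,
# which is above U+10FFFF, so the decoder rejects it.
try:
    "A".encode("utf-32-le").decode("utf-32-be")
    misread_error = None
except UnicodeDecodeError as exc:
    misread_error = exc

# Misread 2: U+10000 in BE (00 01 00 00) read as LE is 0x00000100,
# a valid but wrong code point (U+0100), so no error is raised.
wrong_char = "\U00010000".encode("utf-32-be").decode("utf-32-le")
print(f"U+{ord(wrong_char):04X}")  # U+0100
```

The second case is the more dangerous one, since the data decodes silently into the wrong text.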
Frequently Asked Questions
- Why would I choose UTF-32 over UTF-8?
- For applications that need O(1) random access to individual code points by index, without scanning from the start of the string, UTF-32 is simpler to implement. CPython takes a related approach internally, choosing a fixed width of 1, 2, or 4 bytes per string. For storage, transmission, or any external-facing interface, UTF-8 is almost always the better choice due to compactness and universal support.
- Does UTF-32 support all Unicode characters?
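- Yes. UTF-32 represents every Unicode character in a single 4-byte code unit, with no surrogate pairs required. The code space from U+0000 to U+10FFFF holds 1,114,112 code points; the 2,048 surrogate code points (U+D800 to U+DFFF) are reserved for UTF-16 and are not valid in UTF-32, which leaves 1,112,064 encodable values. This makes it the only fixed-width Unicode encoding form that covers the full range.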
- What is the difference between UTF-32 and UCS-4?
- UCS-4 is the older 4-byte encoding form from ISO/IEC 10646, originally defined over a 31-bit code space (up to 0x7FFFFFFF). UTF-32 is essentially the Unicode standardisation of the same concept, constrained to the valid Unicode range (up to U+10FFFF). In practice, UTF-32 and UCS-4 are interchangeable for modern Unicode text.
- How do I read a UTF-32 file in Python?
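- Use open(filename, 'r', encoding='utf-32'). Python's codec detects the BOM automatically and reads the correct byte order; if the file has no BOM, it assumes the platform's native byte order. To specify byte order explicitly: encoding='utf-32-le' or encoding='utf-32-be'. The explicit codecs do not strip a BOM, so a leading U+FEFF stays in the decoded text.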
- How do I convert UTF-32 to UTF-8?
- In Python: decode the bytes with data.decode('utf-32') to get a str, then call text.encode('utf-8') on it. In C: use iconv with UTF-32 as the input charset and UTF-8 as the output. In Java: read the bytes with new String(bytes, "UTF-32") then write with str.getBytes(StandardCharsets.UTF_8).
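The last two answers can be combined into a short Python sketch that reads a UTF-32 file and writes it back as UTF-8. The function name `utf32_file_to_utf8` and the file names are placeholders for illustration; the sample file is written with an explicit big-endian BOM so the result does not depend on the machine's byte order.

```python
import codecs
import os
import tempfile

def utf32_file_to_utf8(src_path: str, dst_path: str) -> None:
    """Read a UTF-32 file (byte order taken from its BOM) and rewrite it as UTF-8."""
    with open(src_path, "r", encoding="utf-32", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample-utf32.txt")
    dst = os.path.join(tmp, "sample-utf8.txt")

    # Build a big-endian UTF-32 file with a BOM by hand.
    with open(src, "wb") as f:
        f.write(codecs.BOM_UTF32_BE + "A中😀".encode("utf-32-be"))

    utf32_file_to_utf8(src, dst)
    with open(dst, "rb") as f:
        utf8_bytes = f.read()

    # With an explicit byte order the BOM is not stripped; it shows up as U+FEFF.
    with open(src, "r", encoding="utf-32-be", newline="") as f:
        explicit_text = f.read()

print(utf8_bytes.hex(" "))   # 41 e4 b8 ad f0 9f 98 80
print(repr(explicit_text))   # '\ufeffA中😀'
```

If the byte order is passed explicitly, a leading U+FEFF has to be removed by hand before further processing.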