Question 1

Why does the same text have a different byte count in UTF-8 vs UTF-16?

Accepted Answer

UTF-8 uses a variable number of bytes per character: 1 byte for ASCII characters, 2 bytes for most accented Latin, Greek, Cyrillic, Hebrew, and Arabic characters, 3 bytes for CJK and the rest of the Basic Multilingual Plane (BMP), and 4 bytes for emoji and other supplementary-plane characters. UTF-16 uses 2 bytes for every BMP character and 4 bytes for supplementary characters, plus an optional 2-byte Byte Order Mark (BOM) at the start when the encoder writes one. Pure ASCII text is therefore about half the size in UTF-8 (1 byte per character) compared with UTF-16 (2 bytes per character, plus any BOM). Text that is mostly CJK goes the other way: it takes 3 bytes per character in UTF-8 but only 2 in UTF-16, so UTF-16 is roughly a third smaller.
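To make the comparison concrete, here is a minimal sketch that measures the same strings in both encodings. The helper names `utf8ByteLength` and `utf16ByteLength` are made up for this illustration, and the `withBom` flag assumes you want to count the optional 2-byte BOM.

```javascript
// Hypothetical helpers for illustration only.

// UTF-8 size: let the platform encoder do the work.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// UTF-16 size: JavaScript strings are already UTF-16 code units,
// so each unit is 2 bytes; optionally add 2 bytes for a BOM.
function utf16ByteLength(str, withBom = true) {
  return str.length * 2 + (withBom ? 2 : 0);
}

for (const sample of ["hello", "\u65e5\u672c\u8a9e", "\u{1F600}"]) {
  console.log(sample, utf8ByteLength(sample), utf16ByteLength(sample));
}
// hello     -> 5 bytes in UTF-8, 12 in UTF-16 with BOM
// 日本語     -> 9 bytes in UTF-8,  8 in UTF-16 with BOM
// 😀        -> 4 bytes in UTF-8,  6 in UTF-16 with BOM
```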

Question 2

Why does my text show 'Not encodable' for ASCII?

Accepted Answer

ASCII can represent only 128 characters (code points 0–127): the unaccented English letters, digits, common punctuation, and control characters. Anything outside that range, including accented letters like é or ñ, emoji, Chinese characters, and Arabic script, has no ASCII representation. If your text contains even a single such character, the string as a whole cannot be encoded as ASCII.
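If you want to find out which characters are responsible, a check along these lines may help. It is only a sketch; `isAsciiEncodable` and `nonAsciiCharacters` are hypothetical names, not part of any library.

```javascript
// Hypothetical helpers for illustration only.

// True when every code unit is in the ASCII range 0-127.
function isAsciiEncodable(str) {
  return /^[\x00-\x7F]*$/.test(str);
}

// Lists the characters (by code point) that fall outside ASCII.
function nonAsciiCharacters(str) {
  return Array.from(str).filter((ch) => ch.codePointAt(0) > 0x7f);
}

console.log(isAsciiEncodable("hello"));               // true
console.log(isAsciiEncodable("caf\u00e9"));           // false
console.log(nonAsciiCharacters("caf\u00e9 \u{1F600}")); // [ 'é', '😀' ]
```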

Question 3

Why is the UTF-16 byte count always even and seemingly large?

Accepted Answer

UTF-16 encodes each character as one or two 16-bit code units. Each unit is 2 bytes wide, so the total is always a multiple of 2. In addition, many encoders prepend a 2-byte Byte Order Mark (BOM) that tells the reader whether the data is big-endian or little-endian. The BOM is optional (the UTF-16LE and UTF-16BE variants omit it), but when it is counted, even a single-character string occupies at least 4 bytes (2 for the BOM + 2 for the character), and an empty string occupies 2 bytes. For text that is mostly ASCII, UTF-16 uses roughly twice as many bytes as UTF-8.
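The byte layout is easier to see when written out. The sketch below builds little-endian UTF-16 by hand and prints it as hex; `encodeUtf16LE` and `toHex` are hypothetical helpers, and little-endian with a BOM is an assumption chosen for the example rather than the only valid form.

```javascript
// Hypothetical helpers for illustration only.

// Encodes a string as UTF-16 little-endian, optionally prefixed with
// the BOM (U+FEFF, which appears as bytes FF FE in little-endian order).
function encodeUtf16LE(str, withBom = true) {
  const offset = withBom ? 2 : 0;
  const bytes = new Uint8Array(offset + str.length * 2);
  if (withBom) {
    bytes[0] = 0xff;
    bytes[1] = 0xfe;
  }
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i); // one 16-bit code unit
    bytes[offset + i * 2] = unit & 0xff; // low byte first
    bytes[offset + i * 2 + 1] = unit >> 8; // then high byte
  }
  return bytes;
}

function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex(encodeUtf16LE("")));          // ff fe
console.log(toHex(encodeUtf16LE("A")));         // ff fe 41 00
console.log(toHex(encodeUtf16LE("\u{1F600}"))); // ff fe 3d d8 00 de
```

The last line shows a supplementary-plane character taking two code units (a surrogate pair), which is why it costs 4 bytes plus the BOM.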

Question 4

Does JavaScript's string .length property give me the byte count?

Accepted Answer

No. JavaScript stores strings as sequences of UTF-16 code units, so .length returns the number of 16-bit code units, not the number of Unicode characters and not the UTF-8 byte count. (Multiplying .length by 2 does give the UTF-16 byte count without a BOM, but that is rarely the number you need.) For plain ASCII text all three numbers coincide. They diverge as soon as the text contains accented letters or CJK (more than one UTF-8 byte each) or characters above U+FFFF such as emoji, which need two code units each, so .length returns 2 for a single code point. To get the UTF-8 byte count in a browser, use new TextEncoder().encode(str).length. To count Unicode code points, use Array.from(str).length. Note that some visible characters, such as emoji with skin-tone modifiers, are made of several code points.
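A quick side-by-side of the three counts, as a sketch (the function name `measureString` is invented for this example):

```javascript
// Hypothetical helper for illustration only.
function measureString(str) {
  return {
    codeUnits: str.length, // UTF-16 code units
    codePoints: Array.from(str).length, // Unicode code points
    utf8Bytes: new TextEncoder().encode(str).length, // bytes in UTF-8
  };
}

console.log(measureString("abc"));
// { codeUnits: 3, codePoints: 3, utf8Bytes: 3 }

console.log(measureString("a\u{1F600}")); // "a" followed by 😀
// { codeUnits: 3, codePoints: 2, utf8Bytes: 5 }
```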

Question 5

When does the difference between character count and byte count actually matter?

Accepted Answer

The difference matters most in these practical scenarios: database limits defined in bytes (for example, MySQL's 65,535-byte row size limit and its index key length limits are counted in bytes even though VARCHAR(n) declares n in characters, and Oracle's VARCHAR2 uses byte semantics by default), bcrypt password hashing (many implementations silently truncate input at 72 bytes, not 72 characters), HTTP header size limits enforced in bytes by servers and proxies, file name length limits on Linux (255 bytes per path component on most filesystems), API payload size restrictions defined in bytes, and network protocol fields with fixed byte widths. Relying on character count rather than byte count in any of these contexts can cause data truncation, silent errors, or security vulnerabilities.
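When a limit is expressed in bytes, cutting the string with .slice() or .substring() is risky because it counts code units and can split a character. One possible approach, sketched here under the assumption that the limit applies to UTF-8 bytes, is to add whole code points until the budget is used up. `truncateToUtf8Bytes` is a hypothetical helper name.

```javascript
// Hypothetical helper for illustration only.
// Keeps as many whole code points as fit in maxBytes of UTF-8.
function truncateToUtf8Bytes(str, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let result = "";
  for (const ch of str) {
    // for...of iterates by code point, so surrogate pairs stay together
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    result += ch;
  }
  return result;
}

console.log(truncateToUtf8Bytes("h\u00e9llo", 2));           // "h"  (é needs 2 more bytes)
console.log(truncateToUtf8Bytes("h\u00e9llo", 3));           // "hé"
console.log(truncateToUtf8Bytes("\u{1F600}\u{1F600}", 7));   // "😀" (second emoji does not fit)
```

This works at the code point level only. It does not keep multi-code-point sequences (such as an emoji plus a skin-tone modifier) together, so treat it as a starting point rather than a complete solution.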

Question 6

What is a Unicode code point and how is it different from a byte?

Accepted Answer

A Unicode code point is an integer assigned to each character in the Unicode standard, ranging from U+0000 to U+10FFFF (1,114,112 possible values). It is the abstract identity of a character, independent of any encoding. A byte is a unit of computer storage: 8 bits, with a value from 0 to 255. An encoding like UTF-8 or UTF-16 defines the rules for translating code points into bytes. How many bytes a code point needs depends on the encoding and on the code point's numeric value: 1 to 4 bytes in UTF-8, and 2 or 4 bytes in UTF-16. The character 'A' is code point U+0041, which encodes to 1 byte in both ASCII and UTF-8, but 2 bytes in UTF-16.
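To show how an encoding turns one integer into a sequence of bytes, here is a hand-written sketch of the UTF-8 rules. In practice you would use TextEncoder; the function name `encodeCodePointUtf8` is hypothetical and exists only to expose the bit layout.

```javascript
// Hypothetical helper for illustration only.
// Encodes a single Unicode code point as an array of UTF-8 bytes.
function encodeCodePointUtf8(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("not a Unicode code point");
  }
  if (cp >= 0xd800 && cp <= 0xdfff) {
    throw new RangeError("surrogates cannot be encoded in UTF-8");
  }
  if (cp <= 0x7f) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [
      0xe0 | (cp >> 12), // 1110xxxx
      0x80 | ((cp >> 6) & 0x3f), // 10xxxxxx
      0x80 | (cp & 0x3f), // 10xxxxxx
    ];
  }
  return [
    0xf0 | (cp >> 18), // 11110xxx
    0x80 | ((cp >> 12) & 0x3f), // 10xxxxxx
    0x80 | ((cp >> 6) & 0x3f), // 10xxxxxx
    0x80 | (cp & 0x3f), // 10xxxxxx
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");
console.log(hex(encodeCodePointUtf8(0x41)));    // 41           ('A')
console.log(hex(encodeCodePointUtf8(0xe9)));    // c3 a9        ('é')
console.log(hex(encodeCodePointUtf8(0x20ac)));  // e2 82 ac     ('€')
console.log(hex(encodeCodePointUtf8(0x1f600))); // f0 9f 98 80  ('😀')
```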

📝 Text & String Size Calculator

Why the Same Text Has Different Byte Counts in Different Encodings

ASCII: The Oldest Standard and Its Sharp Limits

UTF-8: Variable Width and Near-Universal Adoption

UTF-16: Fixed-ish Width, Java's Default, and the BOM

Where the Differences Become Consequential

The JavaScript String Length Trap

Practical Rules for Choosing an Encoding

FAQ