Text & String Size Calculator
Measure the exact byte size of any text across UTF-8, UTF-16, and ASCII encodings instantly.
UTF-16 sizes include a 2-byte Byte Order Mark (BOM). ASCII only supports characters with code points 0–127; any non-ASCII character makes the string unencodable in pure ASCII.
Why the Same Text Has Different Byte Counts in Different Encodings
When you copy a piece of text and ask "how big is this?", the answer depends entirely on which encoding you use to store it. A string of 100 characters might occupy 100 bytes in ASCII or UTF-8, 202 bytes in UTF-16 with a BOM, or as many as 400 bytes in UTF-8 if every character is an emoji. It might also not be representable at all, which is what happens when ASCII meets a character outside its range. This isn't a quirk or a bug. It reflects a genuine tradeoff at the heart of how computers represent human language.
Understanding why these numbers differ matters more than it might seem. It affects database column limits, HTTP headers, file I/O, API payload restrictions, and encryption buffers. Developers working across language boundaries (say, a Python backend and a Java frontend) often encounter mysterious size mismatches because the two runtimes measure string length in different units: Python counts Unicode code points, Java counts UTF-16 code units, and neither counts bytes.
ASCII: The Oldest Standard and Its Sharp Limits
ASCII (American Standard Code for Information Interchange) dates to the early 1960s. It encodes 128 characters: the 26 uppercase and 26 lowercase English letters, digits 0–9, punctuation, and a set of control characters. Each character maps to a single byte with values between 0 and 127, using only 7 of a byte's 8 bits.
For plain English text with no diacritics, ASCII is beautifully efficient. A 500-character paragraph occupies exactly 500 bytes. There's no overhead, no metadata, no ambiguity. But the moment you introduce an accented character like é, a Chinese ideograph, a Japanese kana, an Arabic letter, a mathematical symbol, or an emoji, ASCII simply cannot represent it. The string becomes unencodable. It's not that the character becomes distorted—it's that there is no mapping at all. ASCII is the right choice only when you have complete confidence your text will stay within those original 128 characters.
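The all-or-nothing behavior described above can be sketched in a few lines of Python. The helper name ascii_size is made up for this illustration; str.isascii() and the strict and "replace" error modes of str.encode() are standard library behavior.

```python
from typing import Optional

def ascii_size(text: str) -> Optional[int]:
    """Return the ASCII byte count, or None if any character is outside 0-127."""
    if not text.isascii():
        return None
    # Every ASCII character is exactly one byte, so bytes == characters.
    return len(text)

print(ascii_size("plain text"))  # 10
print(ascii_size("café"))        # None

# A strict encode refuses outright rather than distorting the character.
try:
    "café".encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.object[exc.start:exc.end])

# A lossy mode "succeeds" by silently substituting a question mark.
print("café".encode("ascii", errors="replace"))  # b'caf?'
```

The last line shows the less visible failure mode: when an encoder is told to replace rather than raise, the text comes back changed with no error to flag it.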
UTF-8: Variable Width and Near-Universal Adoption
UTF-8 was designed to solve ASCII's limitations while remaining backward-compatible with it. Any string that contains only ASCII characters occupies the same number of bytes in UTF-8 as it would in ASCII. That property alone drove UTF-8's adoption across the web.
For characters outside ASCII, UTF-8 uses a variable number of bytes. Characters from U+0080 to U+07FF occupy 2 bytes each: this range covers accented Latin letters like é, ñ, ü as well as Greek, Cyrillic, Hebrew, and Arabic. The rest of the Basic Multilingual Plane, including CJK (Chinese, Japanese, Korean) ideographs, Japanese kana, Korean Hangul, and most Indic scripts, takes 3 bytes. Most emoji and other characters in supplementary Unicode planes require 4 bytes.
This variable-width design has a practical consequence: you cannot determine a UTF-8 string's byte length from its character count alone. A 20-character Japanese sentence written entirely in kana and kanji occupies 60 bytes, and fewer if it mixes in ASCII digits or punctuation. A 20-character emoji sequence typically occupies 80 bytes. A 20-character ASCII string occupies exactly 20 bytes. That variability is why tools like this one exist: counting characters tells you nothing reliable about byte size unless you know both the encoding and which characters are present.
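As a rough illustration, the width of a single character in UTF-8 follows directly from its code point range. The function name utf8_width is hypothetical; the sketch checks its own answers against Python's built-in encoder.

```python
def utf8_width(ch: str) -> int:
    """Bytes needed to encode one code point in UTF-8, derived from its range."""
    cp = ord(ch)
    if cp < 0x80:       # ASCII
        return 1
    if cp < 0x800:      # accented Latin, Greek, Cyrillic, Hebrew, Arabic
        return 2
    if cp < 0x10000:    # rest of the Basic Multilingual Plane, e.g. CJK and kana
        return 3
    return 4            # supplementary planes, e.g. most emoji

for ch in ["A", "é", "я", "א", "日", "あ", "🎉"]:
    # Cross-check the range rule against the real encoder.
    assert utf8_width(ch) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_width(ch)} byte(s)")
```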
UTF-16: Fixed-ish Width, Java's Default, and the BOM
UTF-16 takes a different approach. It encodes most characters as exactly 2 bytes (one 16-bit code unit), which makes indexing and certain string operations faster in languages that use it internally. JavaScript, Java, and C# all expose strings as sequences of UTF-16 code units.
The complication arises with characters outside the Basic Multilingual Plane—roughly, emoji and rare historical scripts. These require a surrogate pair: two code units, meaning 4 bytes instead of 2. So UTF-16 isn't truly fixed-width either, though it behaves as fixed-width for the vast majority of everyday text.
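The surrogate pair arithmetic is small enough to show directly. This is a sketch of the standard UTF-16 calculation; the function name to_surrogate_pair is an assumption made for the example.

```python
from typing import Tuple

def to_surrogate_pair(ch: str) -> Tuple[int, int]:
    """Split a supplementary-plane code point into its two UTF-16 code units."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("BMP code points fit in one code unit")
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair("🎉")      # U+1F389
print(hex(high), hex(low))               # 0xd83c 0xdf89

# The pair matches what the built-in encoder produces (big-endian, no BOM).
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert "🎉".encode("utf-16-be") == expected
```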
There's also the Byte Order Mark (BOM): a 2-byte sequence at the start of a UTF-16 file or stream that declares whether the content is big-endian or little-endian. The BOM is optional, and it is omitted when the byte order is stated explicitly (UTF-16LE or UTF-16BE) or when the string only lives in memory, but when it is present it adds overhead that's easy to overlook. Even an empty string written out as UTF-16 with a BOM occupies 2 bytes. A 100-character ASCII string in UTF-16 with a BOM occupies 202 bytes, roughly double what UTF-8 would use for the same content.
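These figures are easy to reproduce. In Python, the "utf-16" codec prepends a BOM while "utf-16-le" and "utf-16-be" do not, so a quick sketch looks like this:

```python
text = "a" * 100

with_bom = text.encode("utf-16")         # BOM + 2 bytes per character
without_bom = text.encode("utf-16-le")   # explicit byte order, no BOM
as_utf8 = text.encode("utf-8")

print(len(with_bom), len(without_bom), len(as_utf8))  # 202 200 100

# An empty string still costs 2 bytes once the BOM is written.
print(len("".encode("utf-16")))  # 2

# The BOM is U+FEFF, serialized in whichever byte order the platform uses.
assert with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")
```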
Where the Differences Become Consequential
Database systems are perhaps the most common place where encoding confusion costs real engineering time. In current MySQL versions VARCHAR(255) declares 255 characters, not bytes. With the latin1 character set that is at most 255 bytes, but after a switch to utf8mb4 the same column can need up to 1,020 bytes, which can exceed byte-based limits such as the index key length (767 bytes in older InnoDB configurations) and counts against the 65,535-byte row size limit. PostgreSQL's TEXT type has no declared length and varchar(n) counts characters rather than bytes, but B-tree index entries are still capped in bytes, at roughly 2,700 bytes (about a third of an 8 KB page).
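The worst-case arithmetic for a character-counted column can be sketched as follows. The per-character maxima are those of MySQL's latin1, utf8mb3, and utf8mb4 character sets; the 767-byte figure is used here only as an example limit from older InnoDB configurations, and both function names are invented for the illustration.

```python
# Maximum bytes one character can occupy in each MySQL character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}

# Assumed example limit; check the actual limit of your storage engine.
EXAMPLE_KEY_LIMIT_BYTES = 767

def worst_case_bytes(declared_chars: int, charset: str) -> int:
    """Upper bound on stored bytes for a column declared in characters."""
    return declared_chars * MAX_BYTES_PER_CHAR[charset]

def max_chars_within(limit_bytes: int, charset: str) -> int:
    """Longest declared length whose worst case still fits the byte limit."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]

for charset in MAX_BYTES_PER_CHAR:
    print(
        charset,
        worst_case_bytes(255, charset),
        max_chars_within(EXAMPLE_KEY_LIMIT_BYTES, charset),
    )
# latin1 255 767
# utf8mb3 765 255
# utf8mb4 1020 191
```

Under that assumed limit, a 255-character column fits when each character needs at most 3 bytes but not when it can need 4.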
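HTTP headers raise the same character-versus-byte distinction. The HTTP/1.1 specification doesn't define a maximum header size, but many servers and load balancers impose one in bytes. A JWT in an Authorization header is itself plain ASCII, but its claims are serialized as UTF-8 and then base64url-encoded, so each non-ASCII character in a claim costs 2 to 4 bytes before the encoding adds another third on top. A token can breach a 4 KB limit even when the "character count" of its claims appears well within safe bounds.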
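To make the growth concrete, here is a sketch that measures only the payload segment of a token: JSON serialized as UTF-8, then base64url-encoded without padding. It is not a real JWT (there is no header or signature), and the function name and the "name" claim are placeholders.

```python
import base64
import json

def payload_segment_size(claims: dict) -> int:
    """Bytes taken by the base64url-encoded JSON payload, padding stripped."""
    raw = json.dumps(claims, ensure_ascii=False, separators=(",", ":")).encode("utf-8")
    return len(base64.urlsafe_b64encode(raw).rstrip(b"="))

ascii_claims = {"name": "a" * 30}   # 30 one-byte characters
cjk_claims = {"name": "日" * 30}    # 30 three-byte characters

print(payload_segment_size(ascii_claims))  # 55
print(payload_segment_size(cjk_claims))    # 135
```

Both claims hold 30 characters, yet the second segment is more than twice as long. Serializers that escape non-ASCII text as \uXXXX sequences instead produce different sizes again, so the measurement has to match whatever the token library actually emits.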
Password hashing is a subtler case. bcrypt, one of the most widely used password hashing algorithms, uses only the first 72 bytes of its input (not 72 characters), and many implementations truncate silently. If you allow users to create passwords with emoji or extended characters, the limit arrives sooner than the character count suggests: 18 four-byte emoji already fill all 72 bytes, so two long passwords that differ only after that point hash identically. The character count gives no warning; only the byte count reveals the truncation.
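The effect can be demonstrated without a bcrypt library by modeling only the truncation step. This sketch does no hashing at all; the names bcrypt_effective_input and exceeds_bcrypt_limit are hypothetical helpers, and the passwords are made-up examples.

```python
BCRYPT_MAX_BYTES = 72

def bcrypt_effective_input(password: str) -> bytes:
    """The bytes a truncating bcrypt implementation would actually hash."""
    return password.encode("utf-8")[:BCRYPT_MAX_BYTES]

def exceeds_bcrypt_limit(password: str) -> bool:
    """True when part of the password would be ignored."""
    return len(password.encode("utf-8")) > BCRYPT_MAX_BYTES

prefix = "🎉" * 18                 # 18 characters, 72 bytes in UTF-8
first = prefix + "correct-horse"
second = prefix + "something-else"

assert first != second
# Everything after the emoji prefix falls outside the 72-byte window.
assert bcrypt_effective_input(first) == bcrypt_effective_input(second)

print(len(first), exceeds_bcrypt_limit(first))  # 31 True
```

A 31-character password looks comfortably short, which is why a byte-based check such as exceeds_bcrypt_limit is the more reliable guard at registration time.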
File system limits on names also vary in what they count. Most Linux file systems allow 255 bytes per filename component, while macOS HFS+ limits names to 255 UTF-16 code units and stores them in a decomposed form, so an é typed as a single code point is stored as an e followed by a combining accent and its byte size changes. A filename that works on one platform can fail on another for reasons that have nothing to do with its visible length.
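The size change caused by normalization can be observed with Python's unicodedata module. Standard NFD is used here as an approximation of the decomposed form a file system might store, and the filename is an arbitrary example.

```python
import unicodedata

def filename_bytes(name: str, form: str) -> int:
    """UTF-8 byte length of a name after applying a Unicode normalization form."""
    return len(unicodedata.normalize(form, name).encode("utf-8"))

# "résumé.txt" written with precomposed é (U+00E9) to keep the literal unambiguous.
name = "r\u00e9sum\u00e9.txt"

print(filename_bytes(name, "NFC"))  # 12: each é is one 2-byte code point
print(filename_bytes(name, "NFD"))  # 14: each é becomes e + U+0301 (1 + 2 bytes)
```

The two forms render identically, so a name that sits just under a 255-byte limit in composed form can exceed it once decomposed.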
The JavaScript String Length Trap
JavaScript's String.prototype.length property returns the number of UTF-16 code units, not the number of Unicode characters and certainly not the number of bytes in any encoding. A single emoji like 🎉 has a JavaScript length of 2, because it requires a surrogate pair in UTF-16—but it represents exactly one Unicode code point and occupies 4 bytes in UTF-8.
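This means a check like if (username.length > 20) counts code units, not what the user sees. A username made of 11 emoji like 🎉 has a length of 22 and is rejected even though it shows only 11 visible characters, while a 20-character ASCII name passes. Neither outcome says anything about byte size: a name of 20 CJK characters passes the check and occupies 60 bytes in UTF-8. The correct approach is either to measure in Unicode code points (using Array.from(str).length or the spread operator) or to measure in bytes using the TextEncoder API. Code points are still not the same as visible characters: an emoji assembled from several code points, such as a family or flag emoji, counts as more than one.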
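The three competing measurements can be compared side by side. The sketch below is written in Python rather than JavaScript: js_length is a hypothetical helper that reproduces what String.prototype.length would report by counting UTF-16 code units, while Python's own len() counts code points.

```python
def js_length(text: str) -> int:
    """UTF-16 code units, the quantity JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2

def code_points(text: str) -> int:
    """Unicode code points, comparable to Array.from(str).length."""
    return len(text)

def utf8_bytes(text: str) -> int:
    """UTF-8 bytes, comparable to new TextEncoder().encode(str).length."""
    return len(text.encode("utf-8"))

for sample in ["🎉", "🎉" * 11, "a" * 20, "日" * 20]:
    print(js_length(sample), code_points(sample), utf8_bytes(sample))
# 2 1 4
# 22 11 44
# 20 20 20
# 20 20 60
```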
Practical Rules for Choosing an Encoding
UTF-8 is the default choice for nearly everything today. It handles all of Unicode, it's backward-compatible with ASCII, and it's the de facto standard for web content, JSON, XML, and most Unix systems. Its byte size for a given string is always between 1× and 4× the number of code points, with most Western European text staying close to 1×.
UTF-16 makes sense when you're working within a runtime that uses it internally (Java, JavaScript, .NET) and performing many indexed character operations. Transcoding between UTF-16 and UTF-8 at the boundary is usually the right architectural choice rather than passing UTF-16 across the wire.
ASCII belongs only in contexts where you can guarantee the input will never contain anything outside the original 128-character range—legacy system integration, certain communication protocols, or deliberate constraints on accepted input. Using it anywhere else means silent data corruption or runtime errors when non-ASCII characters appear.
Measuring byte size accurately—rather than character count—is the habit that prevents a category of bugs that are notoriously hard to reproduce because they only manifest with specific character combinations. A tool that shows you all three encoding sizes at once makes that measurement immediate.
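A minimal sketch of such a measurement, assuming the same conventions stated at the top of this page (UTF-16 counted with a 2-byte BOM, ASCII reported as unavailable when the text cannot be encoded). The function name text_sizes and the dictionary keys are illustrative and do not describe how this tool is actually implemented.

```python
from typing import Dict, Optional

BOM_BYTES = 2

def text_sizes(text: str) -> Dict[str, Optional[int]]:
    """Byte size of text in UTF-8, UTF-16 (with BOM), and ASCII."""
    try:
        ascii_size: Optional[int] = len(text.encode("ascii"))
    except UnicodeEncodeError:
        ascii_size = None  # at least one character is outside 0-127
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16_with_bom": len(text.encode("utf-16-le")) + BOM_BYTES,
        "ascii": ascii_size,
    }

print(text_sizes("hello"))
# {'utf8': 5, 'utf16_with_bom': 12, 'ascii': 5}
print(text_sizes("héllo 🎉"))
# {'utf8': 11, 'utf16_with_bom': 18, 'ascii': None}
```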