📝 Text & String Size Calculator

Last updated: June 4, 2026

Measure the exact byte size of any text across UTF-8, UTF-16, and ASCII encodings instantly.

UTF-16 includes a 2-byte Byte Order Mark (BOM). ASCII only supports characters with code points 0–127; non-ASCII characters make the string un-encodable in pure ASCII.

Why the Same Text Has Different Byte Counts in Different Encodings

When you copy a piece of text and ask "how big is this?", the answer depends entirely on which encoding you use to store it. A string of 100 plain English characters occupies 100 bytes in ASCII or UTF-8 and 202 bytes in UTF-16 with a BOM. Swap in accented letters, CJK characters, or emoji and the UTF-8 size climbs anywhere up to 400 bytes, while ASCII cannot represent the string at all. This isn't a quirk or a bug. It reflects a genuine tradeoff at the heart of how computers represent human language.

Understanding why these numbers differ matters more than it might seem. It affects database column limits, HTTP headers, file I/O, API payload restrictions, and encryption buffers. Developers working across language boundaries (say, a Python backend and a Java frontend) often encounter mysterious size mismatches precisely because the two runtimes measure string length in different units: Python's len() counts code points, Java's String.length() counts UTF-16 code units, and neither is a byte count.

ASCII: The Oldest Standard and Its Sharp Limits

ASCII (American Standard Code for Information Interchange) dates to the early 1960s. It encodes 128 characters: the 26 uppercase and 26 lowercase English letters, digits 0–9, punctuation, and a set of control characters. Each character maps to a single byte with values between 0 and 127, using only 7 of a byte's 8 bits.

For plain English text with no diacritics, ASCII is beautifully efficient. A 500-character paragraph occupies exactly 500 bytes. There's no overhead, no metadata, no ambiguity. But the moment you introduce an accented character like é, a Chinese ideograph, a Japanese kana, an Arabic letter, a mathematical symbol, or an emoji, ASCII simply cannot represent it. The string becomes unencodable. It's not that the character becomes distorted—it's that there is no mapping at all. ASCII is the right choice only when you have complete confidence your text will stay within those original 128 characters.
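The all-or-nothing behavior described above can be sketched as a small check. This is a minimal illustration, not the calculator's actual code; the function name asciiByteLength is hypothetical.

```javascript
// Returns the ASCII byte size of `text`, or null when any character
// falls outside code points 0-127 and the string is not encodable.
function asciiByteLength(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > 127) return null;
  }
  // Every character here is one code unit and therefore one byte.
  return text.length;
}

console.log(asciiByteLength("plain English")); // 13
console.log(asciiByteLength("caf\u00e9"));     // null (é has no ASCII mapping)
```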

UTF-8: Variable Width and Near-Universal Adoption

UTF-8 was designed to solve ASCII's limitations while remaining backward-compatible with it. Any string that contains only ASCII characters occupies the same number of bytes in UTF-8 as it would in ASCII. That property alone drove UTF-8's adoption across the web.

For characters outside ASCII, UTF-8 uses a variable number of bytes. Characters with code points from U+0080 to U+07FF occupy 2 bytes each: accented Latin letters like é, ñ, ü, along with the core Greek, Cyrillic, Hebrew, and Arabic alphabets. The rest of the Basic Multilingual Plane (U+0800 to U+FFFF) takes 3 bytes, which covers most CJK (Chinese, Japanese, Korean) characters as well as scripts such as Devanagari and Thai. Most emoji and other characters in supplementary Unicode planes require 4 bytes.

This variable-width design has a practical consequence: you cannot determine a UTF-8 string's byte length from its character count alone. A 20-character Japanese sentence written in kana and kanji occupies 60 bytes. A string of 20 supplementary-plane emoji occupies 80 bytes. A 20-character ASCII string occupies exactly 20 bytes. That variability is why tools like this one exist: counting characters tells you nothing reliable about byte size unless you know both the encoding and which characters are present.
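To see the variable width directly, the sketch below measures a few strings with the standard TextEncoder API, which always produces UTF-8. The helper name utf8ByteLength is an assumption made for this example.

```javascript
// UTF-8 byte size of a string, measured by actually encoding it.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength("A"));             // 1
console.log(utf8ByteLength("\u00e9"));        // 2 (é)
console.log(utf8ByteLength("日"));            // 3
console.log(utf8ByteLength("🎉"));            // 4

// Same character count, very different byte counts.
console.log(utf8ByteLength("a".repeat(20)));  // 20
console.log(utf8ByteLength("🎉".repeat(20))); // 80
```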

UTF-16: Fixed-ish Width, Java's Default, and the BOM

UTF-16 takes a different approach. It encodes most characters as exactly 2 bytes (one 16-bit code unit), which makes indexing and certain string operations faster in languages that use it internally. JavaScript, Java, and C# all store strings as sequences of UTF-16 code units.

The complication arises with characters outside the Basic Multilingual Plane—roughly, emoji and rare historical scripts. These require a surrogate pair: two code units, meaning 4 bytes instead of 2. So UTF-16 isn't truly fixed-width either, though it behaves as fixed-width for the vast majority of everyday text.
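The surrogate pair arithmetic is short enough to write out. The following is a sketch of the standard UTF-16 split for code points above U+FFFF; toSurrogatePair is a hypothetical name, and the function assumes its input really is a supplementary-plane code point.

```javascript
// Split a supplementary-plane code point (above U+FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair("🎉".codePointAt(0)); // U+1F389
console.log(high.toString(16), low.toString(16)); // d83c df89

// The result matches the two code units JavaScript stores for the emoji.
console.log("🎉".charCodeAt(0) === high, "🎉".charCodeAt(1) === low); // true true
```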

There's also the Byte Order Mark (BOM): an optional 2-byte sequence at the start of a UTF-16 file or stream that declares whether the data is big-endian or little-endian. When it is present, as in this calculator's UTF-16 figure, the BOM adds overhead that's easy to overlook. Even an empty string occupies 2 bytes. A 100-character ASCII string in UTF-16 with BOM occupies 202 bytes, roughly double the 100 bytes UTF-8 would use for the same content.
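Because JavaScript strings are already sequences of UTF-16 code units, the UTF-16 size follows from simple arithmetic. A minimal sketch, with utf16ByteLength as a hypothetical helper name:

```javascript
// Byte size of `text` in UTF-16, optionally counting a 2-byte BOM.
function utf16ByteLength(text, withBom = true) {
  // .length counts 16-bit code units, so a surrogate pair counts as 2.
  return text.length * 2 + (withBom ? 2 : 0);
}

console.log(utf16ByteLength(""));               // 2 (BOM only)
console.log(utf16ByteLength("a".repeat(100)));  // 202
console.log(utf16ByteLength("🎉"));             // 6 (4-byte surrogate pair + BOM)
console.log(utf16ByteLength("🎉", false));      // 4
```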

Where the Differences Become Consequential

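Database systems are perhaps the most common place where encoding confusion costs real engineering time. In current MySQL versions, VARCHAR(255) means 255 characters, not 255 bytes, so the storage it can require depends on the column's character set: up to 255 bytes in latin1, but up to 1,020 bytes in utf8mb4. That difference surfaces in MySQL's byte-based limits, such as the 65,535-byte maximum row size and the InnoDB index key length limit (767 or 3,072 bytes, depending on row format), so a column that indexed without complaint in latin1 can be rejected after a switch to utf8mb4. MySQL's legacy utf8 character set (utf8mb3) stores at most 3 bytes per character and cannot hold 4-byte characters such as most emoji at all. PostgreSQL's TEXT type has no declared length, and its varchar(n) counts characters rather than bytes, but a B-tree index entry is capped at roughly 2,700 bytes, which is a byte limit.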
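Where a storage or index limit is expressed in bytes, one defensive option is to trim the value to a byte budget before writing it. The sketch below cuts on code point boundaries so that no multi-byte character is split; truncateToUtf8Bytes is a hypothetical helper, not part of any database driver.

```javascript
// Longest prefix of `text` whose UTF-8 encoding fits in `maxBytes`,
// cut on a code point boundary so no character is split in half.
function truncateToUtf8Bytes(text, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let result = "";
  for (const ch of text) {                    // iterates by code point
    const size = encoder.encode(ch).length;   // 1 to 4 bytes
    if (used + size > maxBytes) break;
    used += size;
    result += ch;
  }
  return result;
}

console.log(truncateToUtf8Bytes("日本語", 7)); // "日本" (6 bytes; a third character would need 9)
console.log(truncateToUtf8Bytes("abc", 10));   // "abc"
```

Whether silent truncation is acceptable depends on the application; rejecting the input with a clear error is often the better choice.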

HTTP headers have a similar character-versus-byte distinction. The HTTP/1.1 specification doesn't define a maximum header size, but many servers and load balancers impose one in bytes. A JWT in an Authorization header is plain ASCII on the wire, but any non-ASCII text in its claims is first encoded as UTF-8 (2 to 4 bytes per character) and then base64url-encoded, which adds roughly another third. A token whose claims look short in characters can therefore breach a 4 KB limit.
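A rough way to estimate that growth is to compute the UTF-8 size of a claim value and then apply the base64url ratio of 4 output characters per 3 input bytes. This sketch models only the raw value, ignoring the surrounding JSON and the token's header and signature; both function names are hypothetical.

```javascript
// Length of unpadded base64url output for a given number of input bytes.
function base64UrlLength(byteCount) {
  return Math.ceil((byteCount * 4) / 3);
}

// Approximate on-the-wire size of one claim value inside a JWT payload.
function encodedClaimLength(value) {
  const utf8Bytes = new TextEncoder().encode(value).length;
  return base64UrlLength(utf8Bytes);
}

console.log(encodedClaimLength("a".repeat(30)));  // 40  (30 bytes of ASCII)
console.log(encodedClaimLength("日".repeat(30))); // 120 (90 bytes of UTF-8)
```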

Password hashing is a subtler case. bcrypt, one of the most widely used password hashing algorithms, silently truncates inputs at 72 bytes, not 72 characters. If you allow users to create passwords with emoji or extended characters, the limit arrives sooner than expected: 18 four-byte emoji already fill all 72 bytes. Any two passwords that share their first 72 bytes hash identically, however much they differ afterward. The character count gives no warning; only the byte count reveals the truncation.
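The effect can be demonstrated without a bcrypt library by comparing only the bytes a truncating implementation would see. This is a sketch that assumes truncation at exactly 72 UTF-8 bytes, as described above; the function and constant names are hypothetical.

```javascript
const BCRYPT_MAX_BYTES = 72;

// True when two passwords share their first 72 UTF-8 bytes, which is
// all that a truncating bcrypt implementation would hash.
function collideUnderBcryptTruncation(a, b) {
  const encoder = new TextEncoder();
  const x = encoder.encode(a).slice(0, BCRYPT_MAX_BYTES);
  const y = encoder.encode(b).slice(0, BCRYPT_MAX_BYTES);
  return x.length === y.length && x.every((byte, i) => byte === y[i]);
}

const base = "🎉".repeat(18); // 18 emoji x 4 bytes = 72 bytes
console.log(collideUnderBcryptTruncation(base + "cat", base + "dog")); // true
console.log(collideUnderBcryptTruncation("cat", "dog"));               // false
```

A common mitigation is to check the byte length at registration time and reject passwords longer than 72 bytes, rather than let them be cut short.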

File name limits are also byte limits in many places. Most Linux file systems allow 255 bytes per filename component, while macOS HFS+ allows 255 UTF-16 code units and stores names in a decomposed Unicode form, which changes their byte sizes. A filename that works on one platform can fail on another for reasons that have nothing to do with its visible length.
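Normalization alone can change a name's byte size, which the standard String.prototype.normalize method makes easy to observe. In this sketch, utf8Size and fitsLinuxFilename are hypothetical helpers, and the 255-byte figure is the Linux limit quoted above.

```javascript
function utf8Size(text) {
  return new TextEncoder().encode(text).length;
}

// The same visible character, composed and decomposed.
const composed = "\u00e9".normalize("NFC");   // é as one code point
const decomposed = "\u00e9".normalize("NFD"); // e + combining acute accent
console.log(utf8Size(composed));   // 2
console.log(utf8Size(decomposed)); // 3

// Check a name against a 255-byte component limit.
function fitsLinuxFilename(name) {
  return utf8Size(name) <= 255;
}

console.log(fitsLinuxFilename("日".repeat(85))); // true  (255 bytes)
console.log(fitsLinuxFilename("日".repeat(86))); // false (258 bytes)
```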

The JavaScript String Length Trap

JavaScript's String.prototype.length property returns the number of UTF-16 code units, not the number of Unicode characters and certainly not the number of bytes in any encoding. A single emoji like 🎉 has a JavaScript length of 2, because it requires a surrogate pair in UTF-16—but it represents exactly one Unicode code point and occupies 4 bytes in UTF-8.

This means a check like if (username.length > 20) counts code units, not what the user sees. A username made of 11 emoji such as 🎉 has a length of 22 and is rejected even though it shows only 11 characters, while a plain ASCII username is allowed all 20. The correct approach is either to measure in Unicode code points (using Array.from(str).length or the spread operator) or to measure in bytes using the TextEncoder API. Code points are still not quite the same as visible characters: an emoji with a skin-tone modifier, or a letter followed by a combining accent, is several code points that render as one.
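The three measurements can be put side by side in a few lines. This is a minimal sketch; measure and exceedsCodePointLimit are hypothetical names, and the limit of 20 mirrors the username example.

```javascript
// Report the three different "lengths" of one string.
function measure(text) {
  return {
    codeUnits: text.length,                           // UTF-16 code units
    codePoints: Array.from(text).length,              // Unicode code points
    utf8Bytes: new TextEncoder().encode(text).length, // bytes in UTF-8
  };
}

console.log(measure("🎉")); // { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }

// A length check that counts code points rather than code units.
function exceedsCodePointLimit(username, limit = 20) {
  return Array.from(username).length > limit;
}

const emojiName = "🎉".repeat(11);
console.log(emojiName.length > 20);            // true  (22 code units)
console.log(exceedsCodePointLimit(emojiName)); // false (11 code points)
```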

Practical Rules for Choosing an Encoding

UTF-8 is the default choice for nearly everything today. It handles all of Unicode, it's backward-compatible with ASCII, and it's the de facto standard for web content, JSON, XML, and most Unix systems. Its byte size for a given string is always between 1× and 4× the number of code points, with most Western European text staying close to 1×.

UTF-16 makes sense when you're working within a runtime that uses it internally (Java, JavaScript, .NET) and performing many indexed character operations. Transcoding between UTF-16 and UTF-8 at the boundary is usually the right architectural choice rather than passing UTF-16 across the wire.
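Transcoding at the boundary might look like the following sketch, which uses the standard TextDecoder and TextEncoder APIs. The little-endian, no-BOM input format is an assumption made for this example, and encodeUtf16le exists only to produce sample input.

```javascript
// Produce UTF-16LE bytes (no BOM) from a string, to act as sample input.
function encodeUtf16le(text) {
  const bytes = new Uint8Array(text.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < text.length; i++) {
    view.setUint16(i * 2, text.charCodeAt(i), true); // true = little-endian
  }
  return bytes;
}

// Convert UTF-16LE bytes to UTF-8 bytes before sending them onward.
function utf16leToUtf8(bytes) {
  const text = new TextDecoder("utf-16le").decode(bytes);
  return new TextEncoder().encode(text);
}

const wide = encodeUtf16le("h\u00e9llo 🎉");
console.log(wide.length);                // 16 (8 code units x 2 bytes)
console.log(utf16leToUtf8(wide).length); // 11
```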

ASCII belongs only in contexts where you can guarantee the input will never contain anything outside the original 128-character range—legacy system integration, certain communication protocols, or deliberate constraints on accepted input. Using it anywhere else means silent data corruption or runtime errors when non-ASCII characters appear.

Measuring byte size accurately—rather than character count—is the habit that prevents a category of bugs that are notoriously hard to reproduce because they only manifest with specific character combinations. A tool that shows you all three encoding sizes at once makes that measurement immediate.

FAQ

Why does the same text have a different byte count in UTF-8 vs UTF-16?
UTF-8 uses a variable number of bytes per character: 1 byte for ASCII characters, 2 bytes for accented Latin letters and for scripts such as Greek, Cyrillic, Hebrew, and Arabic, 3 bytes for CJK and the remaining BMP characters, and 4 bytes for most emoji and other supplementary plane characters. UTF-16 uses 2 bytes for BMP characters and 4 bytes for supplementary characters, plus the 2-byte Byte Order Mark this tool counts at the start. Pure ASCII text is smaller in UTF-8 (1 byte per character) than in UTF-16 (2 bytes per character plus BOM), while text made up mostly of CJK characters is smaller in UTF-16 (2 bytes per character) than in UTF-8 (3 bytes per character).
Why does my text show 'Not encodable' for ASCII?
ASCII can only represent 128 characters — specifically, the basic English alphabet, digits, punctuation, and control characters (code points 0–127). Any character outside that range, including accented letters like é or ñ, emoji, Chinese characters, Arabic script, or any non-English letter, simply has no valid ASCII representation. If your text contains even a single such character, the entire string is considered un-encodable in ASCII.
Why is the UTF-16 byte count always even and seemingly large?
UTF-16 encodes each character as one or two 16-bit code units, meaning each unit is 2 bytes wide, so the total is always a multiple of 2. Additionally, the figure shown here includes a 2-byte Byte Order Mark (BOM), an optional prefix that identifies whether the data is big-endian or little-endian. With the BOM counted, even a single-character string occupies at least 4 bytes (2 for BOM + 2 for the character), and an empty string occupies 2 bytes. For text that is mostly ASCII, UTF-16 will use roughly twice as many bytes as UTF-8.
Does JavaScript's string .length property give me the byte count?
No. JavaScript stores strings as sequences of UTF-16 code units, so .length returns the number of 16-bit code units — not the number of Unicode characters and definitely not the byte count in any encoding. For most everyday text they coincide, but emoji and characters from supplementary Unicode planes (code points above U+FFFF) each require two code units, making .length return 2 for a single visible character. To get UTF-8 bytes in a browser, use new TextEncoder().encode(str).length. To count actual Unicode characters (code points), use Array.from(str).length.
When does the difference between character count and byte count actually matter?
The difference matters most in these practical scenarios: database limits (MySQL caps row size and index key length in bytes, even though VARCHAR lengths are declared in characters), bcrypt password hashing (silently truncates at 72 bytes, not 72 characters), HTTP header size limits enforced in bytes by servers and proxies, file name length limits on Linux (255 bytes per component), API rate limits or payload restrictions defined in bytes, and network protocol fields with fixed byte widths. Relying on character count rather than byte count in any of these contexts can cause data truncation, silent errors, or security vulnerabilities.
What is a Unicode code point and how is it different from a byte?
A Unicode code point is an integer assigned to each character in the Unicode standard, ranging from U+0000 to U+10FFFF (about 1.1 million possible values). It's the abstract identity of a character, independent of any encoding. A byte is a unit of computer storage: 8 bits, with a value from 0 to 255. An encoding like UTF-8 or UTF-16 defines the rules for translating code points into bytes. A single code point can require 1, 2, 3, or 4 bytes depending on the encoding used and the code point's numeric value. The character 'A' is code point U+0041, which encodes to 1 byte in both ASCII and UTF-8, but 2 bytes in UTF-16.