Giulia Guglielmi

What's Really Inside a .txt File? ASCII, ANSI, and the UTF-8 Mojibake Mystery

  • encoding
  • unicode
  • computer-science

Have you ever saved a .txt file, sent it to someone, and watched your words turn into a soup of strange symbols? You type a perfectly normal sentence like “We’ll see tomorrow.” Your friend opens it on their machine and sees:

Weâ€™ll see tomorrow.

This digital glitch has a name — Mojibake — and it's caused by a mismatch in text encoding. It points at one of the most fundamental ideas in computer science: computers don't actually understand letters. They only understand numbers. Depending on the rules a program uses to read those numbers, the exact same file can display two completely different things.

Computers don't understand letters

Your hard drive has no idea what the letter "A" or "B" is. It only knows two things: ones and zeros. These are called bits.

So how do we get from a drive that only stores ones and zeros to the word "hello" rendered on your screen? The answer lives in three names: ASCII, ANSI, and UTF-8.

Bits, bytes, and hex

A single bit is just a 1 or a 0. Group eight of them together and you get a byte:

01001000

Every unique sequence of bits can represent a specific character. Basic combinatorics tells us that with 8 slots, each either 1 or 0, you get 2^8 = 256 possible combinations — values from 0 to 255.

Nobody, not even engineers, wants to read endless binary. So we use a shortcut called hexadecimal (hex). A single byte maps cleanly to exactly two hex digits:

binary 01001000  →  decimal 72  →  hex 0x48

The 0x prefix just tells the computer "read this as hexadecimal." Open any file in a hex editor and you'll see a tidy grid of two-digit codes instead of a wall of bits.
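To check the conversion above by hand, here is a minimal Python sketch. The variable names are only illustrative; the byte is the same 01001000 used in the example.

```python
# The eight bits from the example, written as a string.
bits = "01001000"

# Interpret the string as a base-2 number.
value = int(bits, 2)

# Format the same number as two hex digits with the 0x prefix.
hex_form = f"0x{value:02X}"

# Eight slots, each 0 or 1, give 2**8 possible values.
combinations = 2 ** len(bits)

print(value, hex_form, combinations)  # 72 0x48 256
```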

But that raises the real question: if a character is stored as a byte, and a byte is written as a hex code, how does the computer decide that 0x48 means the capital letter H? It needs a translation blueprint — a text encoding. Here are the three most important ones in history.

ASCII (1963)

The grandfather of text encoding is ASCII — the American Standard Code for Information Interchange.

To save space, ASCII used only 7 bits per character instead of 8. That's 2^7 = 128 combinations. Those 128 slots covered everything American English needed: uppercase and lowercase letters, digits 0–9, basic punctuation, and a set of control codes such as tab, line feed, and carriage return.

In ASCII, 0x43 means capital C and 0x61 means lowercase a. For a few decades this was perfect — as long as you only wrote American English.

But if you lived in the UK, ASCII didn't even have the pound sign (£). And it had no room at all for é, ñ, or ü. As computers spread across the world, this became a hard limit.
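You can see both the lookup and the limit in a few lines of Python. This is a rough sketch: fits_in_ascii is a hypothetical helper written for this post, not a standard function.

```python
# Decode two single bytes using the ASCII table.
print(bytes([0x43]).decode("ascii"))  # C
print(bytes([0x61]).decode("ascii"))  # a


def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text has an ASCII code (0 to 127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character has no slot in the 128-entry ASCII table.
        return False


print(fits_in_ascii("hello"))  # True
print(fits_in_ascii("£5"))     # False
```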

The ASCII extension: ANSI / Windows-1252

Remember that a real byte has 8 bits, but ASCII only used 7? That left 128 unused slots — values 128 through 255.

Different vendors and countries started filling those empty slots with their own regional characters. Microsoft's layout, Windows-1252 (often loosely called ANSI), extended ASCII for Western European languages:

Byte   Windows-1252 character
0xE9   é (French, Italian)
0xF1   ñ (Spanish)
0xA3   £ (British pound)
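Here is a small sketch that decodes those three bytes. It assumes Python, where the codec for Windows-1252 is named "cp1252".

```python
# Three bytes from the 128-255 range that ASCII leaves unused.
upper_bytes = bytes([0xE9, 0xF1, 0xA3])

# Decode them with the Windows-1252 table.
decoded = upper_bytes.decode("cp1252")
print(decoded)  # éñ£

# The same bytes are not valid ASCII, so a strict ASCII decoder refuses them.
try:
    upper_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)  # False
```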

It worked for Western Europe — but it still left out Greek, Cyrillic, Japanese, Chinese, and most of the world.

UTF-8 (1993)

The breakthrough came with Unicode, and specifically the UTF-8 encoding.

Early on, engineers considered giving every character a fixed 2 bytes (16 bits). That covers 2^16 = 65,536 characters, but it instantly doubles the size of every plain-English text file. Wasteful.

UTF-8 solved this with a variable-length scheme:

  • Every ASCII character (English letters, digits, basic punctuation) uses 1 byte with the same value it has in ASCII, so any valid ASCII file is already valid UTF-8.
  • Anything else, such as accented letters, other alphabets, and emoji, expands on the fly, using 2, 3, or 4 bytes for a single character.
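To see the variable length in practice, this sketch encodes four sample characters and counts the bytes. The samples are my own picks for illustration, one for each possible length.

```python
# One sample character for each UTF-8 length: 1, 2, 3, and 4 bytes.
samples = ["A", "é", "€", "😀"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    # Print the code point, the byte count, and the bytes themselves.
    print(f"U+{ord(ch):04X}", len(encoded), hex_bytes)

# For plain ASCII text, the UTF-8 bytes and the ASCII bytes are identical.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")
print(lengths, same_as_ascii)  # [1, 2, 3, 4] True
```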

Today UTF-8 encodes over 97% of all web pages.

The demo: same bytes, two meanings

Let's tie it together. Take our sentence from the start: “We’ll see tomorrow.” Notice the curly apostrophe (’) in We’ll.

When you save this as UTF-8, letters like W, e, l, s are stored as their single-byte ASCII values. But the curly apostrophe needs more room — UTF-8 stores that one mark as three bytes:

’  →  0xE2 0x80 0x99

Now your friend opens the file in an old client hardcoded to read Windows-1252. That encoding expects every byte to be its own character, so it decodes the three bytes one at a time:

Byte   Windows-1252 shows
0xE2   â
0x80   €
0x99   ™

And that is exactly why a single apostrophe became â€™.
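You can reproduce the whole accident in a few lines. This is a sketch in Python, using its "cp1252" codec to stand in for the old Windows-1252 client; the curly apostrophe is written as the escape \u2019 so the code does not depend on how this page itself is encoded.

```python
# The original sentence, with a curly apostrophe (U+2019).
original = "We\u2019ll see tomorrow."

# Save as UTF-8: the apostrophe alone becomes three bytes.
utf8_bytes = original.encode("utf-8")
apostrophe_bytes = "\u2019".encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in apostrophe_bytes))  # 0xE2 0x80 0x99

# Read the same bytes back with the wrong rulebook, one byte per character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # Weâ€™ll see tomorrow.
```

Nothing in the file changed between the two steps. Only the decoder did.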

The fix

The good news: you don't need to retype anything. The bytes are fine — only the interpretation is wrong. Open the file in your editor, find the encoding setting, and force it to read the file as UTF-8 instead of Western European / Windows-1252. The text snaps back to normal.
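If you are reading the file from code rather than an editor, the same fix is a matter of naming the encoding explicitly. The sketch below is one way to show it in Python, under a few assumptions: it writes a throwaway temporary file in place of your real one, and the last step (re-encoding garbled text) only works when the wrong decoder did not drop or replace any bytes along the way.

```python
import os
import tempfile

text = "We\u2019ll see tomorrow."

# Write a throwaway UTF-8 file (a stand-in for the real .txt file).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write(text)
    path = f.name

# Wrong rulebook: the bytes are read as Windows-1252.
with open(path, encoding="cp1252") as f:
    wrong = f.read()

# Right rulebook: the same bytes are read as UTF-8.
with open(path, encoding="utf-8") as f:
    right = f.read()

os.remove(path)

# If all you have is the garbled string, reversing the mistake may recover it.
repaired = wrong.encode("cp1252").decode("utf-8")

print(wrong)     # Weâ€™ll see tomorrow.
print(right)     # We’ll see tomorrow.
print(repaired)  # We’ll see tomorrow.
```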

The deeper lesson is worth keeping: a file is just bytes. Bytes have no meaning until something decides how to read them. Get the encoding right and Mojibake disappears.