Unicode and UTF – A Quick Overview

Last week I looked at compact strings in Java. I explained that strings are internally represented in either Latin-1 or UTF-16. By this stage, we should all know what ASCII is. But what are Unicode and UTF, and why is it important to know about them?

Here’s a quick overview of ASCII, Unicode and UTF. I also explain the differences between UTF-8, UTF-16 and UTF-32. The way a string is represented in a computer system can have a considerable impact on both memory usage and processing speed.

ASCII

ASCII is an acronym for American Standard Code for Information Interchange. It is a character encoding standard for representing text in computers and telecommunication systems. It was originally based on the English alphabet.

I won’t go into too much history, except to say that ASCII was first commercially used as a teleprinter code.
A teleprinter was an electromechanical device that sent and received typed messages, generally over a telephone line.

ASCII encodes 128 characters as seven-bit integers. Of these, 95 are printable: the lowercase and uppercase alphabetic letters (a to z; A to Z), the numeric digits from 0 to 9, and punctuation symbols. The original ASCII specification also included 33 non-printing control codes used with teleprinters. Most of these are obsolete, although we still use a few, such as the carriage return, line feed and tab characters.
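
As a quick illustration in Java: casting a char to an int gives its code point, which for ASCII characters is a value from 0 to 127. (The class name here is just for illustration.)

    public class AsciiDemo {
        public static void main(String[] args) {
            // For ASCII characters, the code point fits in seven bits.
            System.out.println((int) 'A');  // 65
            System.out.println((int) 'a');  // 97
            System.out.println((int) '0');  // 48
            System.out.println((int) '\n'); // 10 (the line feed control code)
        }
    }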

For more details, see https://en.wikipedia.org/wiki/ASCII and https://www.unicode.org/charts/PDF/U0000.pdf.

Unicode

Like ASCII, Unicode is a character encoding standard for representing text in computers and telecommunications systems. It allows us to encode and handle text in most of the world’s different written languages.

The name Unicode is derived from the term Universal Coded Character Set.

Unicode is maintained by the Unicode Consortium. The current version (Unicode 14.0) defines 144,697 characters covering 159 modern and historic scripts, as well as symbols, emojis, and non-printing control and formatting codes.

The first version of Unicode used a 16-bit encoding, but from Unicode 2.0, it uses 21-bit encoding. This means that more than 1 million characters can be encoded. That’s a lot of characters!

The Unicode Standard specifies a numeric value (code point) and a name for each of its characters. It contains 1,114,112 code points. Of these, 2,048 are surrogates (used to form the surrogate pairs of UTF-16), 66 are non-characters, and 137,468 are reserved for private use. This leaves 974,530 code points for public assignment.

Unicode treats all characters the same, whether they are alphabetic, numeric or symbols. No escape sequences or control codes are needed to specify any character, so all characters can be used with equal ease in any combination.

Unicode defines 17 planes, numbered from 0 to 16. A plane is a contiguous group of 65,536 code points. Plane 0 is the Basic Multilingual Plane (BMP), which contains the most commonly used characters of the major languages. The higher planes, 1 through 16, are called supplementary planes. Unicode character values are in the range U+0000 to U+10FFFF (a numeric range of 0 to 10FFFF in hex).
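
We can explore these ranges with the helper methods on java.lang.Character. A small sketch (again, the class name is just for illustration):

    public class PlanesDemo {
        public static void main(String[] args) {
            int euro = 0x20AC;    // the € sign lives in the BMP (plane 0)
            int rocket = 0x1F680; // the 🚀 emoji lives in plane 1

            System.out.println(Character.isBmpCodePoint(euro));             // true
            System.out.println(Character.isSupplementaryCodePoint(rocket)); // true

            // MAX_CODE_POINT is U+10FFFF, the last code point of plane 16.
            System.out.printf("U+%X%n", Character.MAX_CODE_POINT); // U+10FFFF
        }
    }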

For more details, see Introduction to Unicode and https://en.wikipedia.org/wiki/Unicode.

Unicode Representations and UTF

A Unicode transformation format (UTF) defines an algorithm to map every Unicode code point (except surrogate code points) to a unique byte sequence.

Unicode characters can be represented in one of three UTF encoding forms. The encoding form determines how each character is represented in memory. There is an 8-bit form (UTF-8), a 16-bit form (UTF-16), and a 32-bit form (UTF-32):

  • UTF-8 uses a sequence of between one and four 8-bit bytes to represent each code point.
  • UTF-16 uses either one or two 16-bit code units.
  • UTF-32 uses a single 32-bit code unit.
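
To make the size differences concrete, here is a small Java sketch that encodes the same string in all three forms and prints the byte counts. (I use the big-endian charset variants so that no byte order mark is added to the output; the class name is illustrative.)

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class UtfSizes {
        public static void main(String[] args) {
            // 16 ASCII characters, plus the € sign and the 🚀 emoji.
            String text = "Hello, Unicode \u20AC \uD83D\uDE80";

            System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 23
            System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length);   // 38
            System.out.println(text.getBytes(Charset.forName("UTF-32BE")).length); // 72
        }
    }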

We can easily convert between the various forms. This allows us to support data input or output in multiple formats, while using a particular UTF for internal processing. Conversions between these encoding forms are fast and lossless.
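
Since Java strings are UTF-16 internally, a quick way to see this lossless conversion is to decode bytes in one form and re-encode them in another. A minimal sketch, assuming the standard java.nio.charset classes:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class RoundTrip {
        public static void main(String[] args) {
            byte[] utf8In = "caf\u00E9 \uD83D\uDE00".getBytes(StandardCharsets.UTF_8);

            // UTF-8 -> String (UTF-16 internally) -> UTF-32 -> String -> UTF-8.
            String decoded = new String(utf8In, StandardCharsets.UTF_8);
            byte[] utf32 = decoded.getBytes(Charset.forName("UTF-32"));
            String restored = new String(utf32, Charset.forName("UTF-32"));

            // The round trip loses nothing.
            System.out.println(Arrays.equals(
                    utf8In, restored.getBytes(StandardCharsets.UTF_8))); // true
        }
    }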

For more details, see https://www.unicode.org/faq/utf_bom.html.

UTF-8 is the dominant encoding on the web and is used in over 98% of all websites. UTF-16 is used by Java, JavaScript and Windows. UTF-8 and UTF-32 are used by Linux and various Unix systems.

UTF-8

UTF-8 was designed to be easily used with existing ASCII-based systems. It uses one byte (8 bits) for each of the first 128 Unicode code points, and between two and four bytes for all other characters. The first 128 code points are the ASCII characters, so any valid ASCII text is also valid UTF-8 encoded Unicode text.
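
We can verify this compatibility in a couple of lines of Java (the class name is illustrative):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class AsciiCompat {
        public static void main(String[] args) {
            String text = "Plain ASCII text";

            // Pure ASCII text produces byte-for-byte identical output
            // in the US-ASCII and UTF-8 encodings.
            System.out.println(Arrays.equals(
                    text.getBytes(StandardCharsets.US_ASCII),
                    text.getBytes(StandardCharsets.UTF_8))); // true
        }
    }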

UTF-8 is the main encoding on the web. It is used in over 98% of websites, and on most Unix-like operating systems.

If we only use the ASCII character set, UTF-8 is much faster to process than UTF-16 because there is less data. However, for many non-European scripts, UTF-8 needs more memory than UTF-16: most of their characters take three bytes in UTF-8 but only two in UTF-16.
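
The sketch below shows the trade-off for some sample text (again using the big-endian UTF-16 variant to avoid a byte order mark):

    import java.nio.charset.StandardCharsets;

    public class SizeTradeOff {
        public static void main(String[] args) {
            String english = "hello";
            String japanese = "\u3053\u3093\u306B\u3061\u306F"; // こんにちは

            // ASCII text: UTF-8 wins (5 bytes versus 10).
            System.out.println(english.getBytes(StandardCharsets.UTF_8).length);    // 5
            System.out.println(english.getBytes(StandardCharsets.UTF_16BE).length); // 10

            // Japanese text: UTF-16 wins (10 bytes versus 15).
            System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);    // 15
            System.out.println(japanese.getBytes(StandardCharsets.UTF_16BE).length); // 10
        }
    }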

For more details, see https://en.wikipedia.org/wiki/UTF-8.

UTF-16

UTF-16 is a variable-length character encoding: code points are encoded with either one or two 16-bit code units. A single 16-bit code unit encodes the 65,536 code points of the Basic Multilingual Plane, which covers the most common characters. A pair of 16-bit code units (a surrogate pair) encodes the roughly one million less commonly used code points of the supplementary planes.
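
Java makes this easy to see, because String.length() counts 16-bit code units rather than code points. A short sketch:

    public class SurrogatePairs {
        public static void main(String[] args) {
            String rocket = "\uD83D\uDE80"; // 🚀 (U+1F680), outside the BMP

            // One code point, but two UTF-16 code units: a surrogate pair.
            System.out.println(rocket.length());                             // 2
            System.out.println(rocket.codePointCount(0, rocket.length()));   // 1
            System.out.println(Character.isHighSurrogate(rocket.charAt(0))); // true
            System.out.println(Character.isLowSurrogate(rocket.charAt(1)));  // true
        }
    }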

UTF-16 is used in Java, JavaScript, and the Windows API. It is sometimes used for plain text and word-processing data files on Microsoft Windows.

For more details, see https://en.wikipedia.org/wiki/UTF-16

UTF-32

UTF-32 is a fixed-length encoding, unlike UTF-8 and UTF-16, which are variable-length. It always uses four bytes (32 bits) per code point. Each UTF-32 value represents one code point and is equal to that code point’s numerical value.

The main advantage of UTF-32 is that the Unicode code points are directly indexed: finding the Nth code point in a string is a constant-time O(1) operation. In contrast, a variable-length encoding requires O(n) (linear) time, because it must scan the string from the start to count off N code points.
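
Java strings show the variable-length side of this trade-off, because they are UTF-16 internally: charAt(n) indexes 16-bit code units in constant time, but finding the nth code point needs a linear scan. A sketch:

    public class CodePointIndexing {
        public static void main(String[] args) {
            String s = "a\uD83D\uDE00b\uD83D\uDE80c"; // a 😀 b 🚀 c

            // charAt indexes code units, not code points: index 1 is only
            // the first half of the 😀 surrogate pair.
            System.out.println(Character.isHighSurrogate(s.charAt(1))); // true

            // To reach the third code point ('b'), the string has to be
            // scanned from the start, a linear-time operation.
            int index = s.offsetByCodePoints(0, 2);
            System.out.printf("U+%04X%n", s.codePointAt(index)); // U+0062 ('b')
        }
    }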

The main disadvantage of UTF-32 is that it uses much more memory. Each code point is mapped to four bytes, even though the top 11 bits of every code point value are always zero. This makes UTF-32 nearly double the size of UTF-16, and up to four times the size of UTF-8, depending on how many of the characters are ASCII. Characters beyond the BMP are relatively rare (except in text that uses a lot of emojis).

UTF-32 is used on Linux and various Unix systems, mainly internally (for example, as the wide-character type in some C libraries).

For more details, see https://en.wikipedia.org/wiki/UTF-32.

Ending Off

The last link I’ll share with you is a great article by Joel Spolsky titled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Don’t forget to share your comments and Java experiences.

Code like a Java Guru!
