Java 9 introduced the concept of compact strings as a performance enhancement.
Strings are a major component of heap usage and can occupy as much as 25% of the heap memory. Reducing the internal size of a string can result in a significant saving in memory usage, as well as a corresponding reduction in garbage collection overhead.
Background
Java uses UTF-16 internally to represent characters, which means that a char is two (2) bytes wide. Up to and including JDK 8, the sequence of characters in each String object is stored internally in a char array, as follows:
private final char[] value;
In the years between Java 1.0 and Java 9, it was found that most strings contain only Latin-1 characters. A Latin-1 character (which includes the ASCII set) can be represented in a single byte, while other Unicode characters need at least 2 bytes. This is mainly the case when the application uses a language, such as English, that can be represented entirely in ASCII/Latin-1.
If all the characters inside a specific String object can be represented using a single byte each, then half of the space in the internal char array is not being used. Using a byte array instead of a char array therefore has the potential to reduce heap memory usage and improve GC performance.
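As a rough illustration, the sketch below (a hypothetical StringHeapDemo class, not from the JDK) allocates a large number of ASCII-only strings and prints approximately how much heap they occupy. The numbers depend on the JVM version, the garbage collector and the heap state, so treat it as a demonstration only; running it on JDK 8 and again on JDK 9 or later should show the saving described above.

public class StringHeapDemo {

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();

        System.gc(); // only a hint to the JVM, so the numbers are approximate
        long before = runtime.totalMemory() - runtime.freeMemory();

        // Allocate a large number of distinct, ASCII-only strings.
        String[] strings = new String[1_000_000];
        for (int i = 0; i < strings.length; i++) {
            strings[i] = "customer-record-" + i;
        }

        System.gc();
        long after = runtime.totalMemory() - runtime.freeMemory();

        System.out.printf("Approximate heap used by the strings: %,d bytes%n",
                after - before);

        // Keep the array reachable so the strings are not collected early.
        System.out.println("Last string: " + strings[strings.length - 1]);
    }
}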
Compact Strings Implementation
From Java 9, the String class uses a byte array to store the characters, as follows:
private final byte[] value;
If there is even a single character in the string that needs more than one byte to represent it, then every character in the sequence is stored using 2 bytes, i.e. in the UTF-16 representation. The String class still uses a byte[] internally, but it allocates twice as many bytes as there are characters.
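To see this doubling in action, the sketch below (a hypothetical CompactStringDemo class) uses reflection to read the private value field and compare its length with the character count. On Java 9 and later the JVM must be started with --add-opens java.base/java.lang=ALL-UNNAMED for this reflective access to work, so treat it as an illustration only, not production code.

import java.lang.reflect.Field;

public class CompactStringDemo {

    public static void main(String[] args) throws Exception {
        // "hello" contains only Latin-1 characters: one byte per character.
        inspect("hello");
        // The euro sign (U+20AC) is outside Latin-1, so the whole string
        // is stored in UTF-16: two bytes per character.
        inspect("hello\u20AC");
    }

    private static void inspect(String s) throws Exception {
        Field valueField = String.class.getDeclaredField("value");
        valueField.setAccessible(true);
        byte[] value = (byte[]) valueField.get(s);

        System.out.printf("length() = %d, value.length = %d bytes%n",
                s.length(), value.length);
    }
}

On a typical Java 9+ JVM, the first call should report 5 characters stored in 5 bytes, while the second should report 6 characters stored in 12 bytes.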
Latin-1 vs UTF-16
How does a String differentiate between the LATIN-1 and UTF-16 representations? There is a field called coder that is used to specify the representation, as follows:
/* can have the value of either LATIN1 or UTF16 */
private final byte coder;
static final byte LATIN1 = 0;
static final byte UTF16 = 1;
Most of the String methods first check the coder value using a call to the isLatin1() method. The call is then dispatched to a specific implementation in either the StringLatin1 or the StringUTF16 class. These changes do not affect any public interfaces of String or of any other related classes.
For example,
private boolean isLatin1() {
    return COMPACT_STRINGS && coder == LATIN1;
}

public char charAt(int index) {
    if (isLatin1()) {
        return StringLatin1.charAt(value, index);
    } else {
        return StringUTF16.charAt(value, index);
    }
}
Most of the classes that work with strings (such as StringBuilder and StringBuffer) have been updated to support the new String representation.
Disabling Compact Strings
Processing a 2-byte String is slower because there is additional logic for handling both cases. Fortunately, 2-byte strings are in the minority in most Java applications. If your application uses more 2-byte strings than 1-byte strings, the best choice is to disable compact strings when running the VM.
Compact strings are enabled by default. To disable the feature, we can use the following VM option at runtime:
-XX:-CompactStrings
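For example, assuming a hypothetical main class called MyApplication:

java -XX:-CompactStrings MyApplication

You can confirm the current setting with the standard diagnostic flag (on Windows, use findstr instead of grep):

java -XX:+PrintFlagsFinal -version | grep CompactStrings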
For more details on compact strings, see https://openjdk.org/jeps/254
Don’t forget to share your comments and Java experiences.