Overview: The following data types are offered by modern Pascal compilers to store strings. This page also covers PChar, const parameters, and memory and reference counting.
Array of Char | historical |
ShortString | UTF-8 or ANSI, max. 255 bytes, no reference counting |
String | platform-dependent: AnsiString or UnicodeString |
AnsiString | UTF-8 or ANSI |
UTF8String | UTF-8-typed AnsiString |
RawByteString | untyped AnsiString, adopts codepage of the source |
WideString | UTF-16, no reference counting |
UnicodeString | UTF-16 |
UCS4String | UTF-32, array of UCS4Char |
With the types AnsiString and UnicodeString, Delphi and FreePascal/Lazarus have gained very powerful, practical, and fast string types. Both types carry length information with them (a "counted string" as opposed to a null-terminated string), can dynamically allocate as much memory as the system architecture and operating system allow, know the character encoding used for their content, and use reference counting to avoid unnecessary copying of the content.
All of this is very convenient and leaves little to be desired — except perhaps that the language of the content (e.g., de, en, fr, etc.) cannot be stored in the metadata. And in one special case, reference counting can be a disadvantage: when a string's content is used across multiple threads (multi-threading), the programmer must ensure that each thread receives its own copy of the string in advance.
From today's perspective, the naming of some string types is a bit unfortunate. The UnicodeString should actually be called UTF16String (analogous to UTF8String), because potentially all string types can contain Unicode characters, as UTF-8 and UTF-32 are just as much part of the Unicode system as UTF-16. And UCS4String would be better named UTF32String, as UCS4 has been superseded by the UTF-32 standard (functionally identical).
The type String is not a clearly defined type, but rather depends on the environment and compiler settings. In FreePascal and Lazarus, a new unit by default includes the compiler directive {$H+}, which means that the type String should be a long string. In FreePascal, a long string is an AnsiString (UTF-8). In contrast, in Delphi from version 2009 onward, a long string is a UnicodeString (UTF-16).
If this compiler directive is not present in a unit or is specified as {$H-}, then the type String is a ShortString with 255 bytes statically reserved for content.
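To make the effect of the directive visible, here is a minimal sketch for FreePascal. It assumes that {$H} can be switched between two type declarations (it is a local directive); the type names TLongStr and TShortStr are hypothetical and exist only for this illustration.

```pascal
program StringKind;
{$mode objfpc}

{$H+}
type
  TLongStr = String;    // declared under {$H+}: a long string (pointer-sized variable)
{$H-}
type
  TShortStr = String;   // declared under {$H-}: a ShortString with 255 bytes of content

begin
  // A long string variable is only a pointer (4 or 8 bytes).
  WriteLn('SizeOf with $H+: ', SizeOf(TLongStr));
  // A ShortString reserves 1 length byte + 255 content bytes.
  WriteLn('SizeOf with $H-: ', SizeOf(TShortStr));
end.
```

The second value should be 256, while the first depends on the pointer size of the target platform.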
The modern string types in Delphi and FreePascal, i.e., AnsiString and UnicodeString, use reference counting.
Exceptions are the types ShortString, WideString, and UCS4String.
Reference counting means that a string variable is internally just a pointer to the content, which may be shared among multiple string variables.
If a constant value is assigned to an AnsiString, e.g., s := 'Hello';, the variable s only receives a pointer to the constant stored in a read-only segment of the program. The function StringRefCount(s) will return -1 in this case, which indicates that a constant was assigned.
Heap-allocated: If the string content is created at runtime, e.g., by concatenation such as s := s + ' world!';, then memory is allocated on the heap (dynamic memory) and the new content is stored as a complete string "Hello world!". As long as the new string is assigned to a single variable only, StringRefCount(s) will return 1 as the reference count.
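The two cases can be tried out with a short FreePascal program. This is only a sketch of the behavior described above; the exact counter values may differ between compiler versions and optimization settings.

```pascal
program RefCountDemo;
{$mode objfpc}{$H+}
var
  s: AnsiString;
begin
  s := 'Hello';                 // s points to a constant
  WriteLn(StringRefCount(s));   // the text above describes -1 for this case

  s := s + ' world!';           // concatenation at runtime: new heap content
  WriteLn(StringRefCount(s));   // the text above describes 1 for this case
  WriteLn(s);
end.
```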
The AnsiString variable itself is always just a pointer to the content. This pointer is 4 bytes (32-bit) or 8 bytes (64-bit) in size and is stored in the variable’s scope — that is, for a local variable on the stack, or for example in a record or object.
The metadata is stored in memory along with the content. The following table shows the memory layout used for AnsiString content. The string variable points directly to the content, which means that the metadata is accessed internally by reading before the pointer address. Likewise, a type cast like PChar(s) returns a pointer directly to the content (the first character).
CodePage | ElementSize | ReferenceCount | Length | Content | Terminator |
2 bytes | 2 bytes | 4 or 8 bytes | 4 or 8 bytes | Chars | Char #0 |
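The fields of this layout do not have to be read by pointer arithmetic. As a sketch, the following FreePascal program queries them through the System functions StringCodePage, StringElementSize, StringRefCount and Length; the reported code page depends on the system and compiler settings.

```pascal
program MetaDemo;
{$mode objfpc}{$H+}
var
  s: AnsiString;
begin
  s := 'Hello';
  s := s + ' world!';   // make sure the content lives on the heap

  WriteLn('CodePage:    ', StringCodePage(s));
  WriteLn('ElementSize: ', StringElementSize(s));   // 1 byte per AnsiChar
  WriteLn('RefCount:    ', StringRefCount(s));
  WriteLn('Length:      ', Length(s));
  // The terminator sits directly behind the last character.
  WriteLn('Terminator:  ', Ord(PAnsiChar(s)[Length(s)]));
end.
```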
The string content in a 32-bit application can theoretically be up to 2 GB in size. In a 64-bit application, a string could theoretically grow up to 9 exabytes — that’s 9 billion GB.
If a string variable is assigned another string of the same type, the content is not copied; only a pointer to the existing content is set, and the reference count for that content is increased. This is faster than copying the content — especially if the string is 9 exabytes large. ;)
The "copy-on-write" mechanism automatically copies the content only when a modification to it becomes necessary. If the string variable is assigned new content, the reference count of the previous content is decreased. If the count drops to 0, it means no variable is using (referencing) that content anymore, and the previously allocated memory is released.
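Sharing and copy-on-write can be observed by comparing the pointer values of two string variables. The following is a minimal sketch for FreePascal; the variable names are chosen only for this example.

```pascal
program CopyOnWrite;
{$mode objfpc}{$H+}
var
  a, b: AnsiString;
begin
  a := 'Hello';
  a := a + ' world!';                  // heap-allocated content
  b := a;                              // no copy: b references the same content
  WriteLn(Pointer(a) = Pointer(b));    // both variables hold the same address

  b[1] := 'J';                         // writing forces a private copy for b
  WriteLn(Pointer(a) = Pointer(b));    // the addresses now differ
  WriteLn(a);                          // Hello world!
  WriteLn(b);                          // Jello world!
end.
```

The content of a stays untouched, because b received its own copy at the moment of the write access.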
A content can be referenced up to about 2 billion times on 32-bit systems (assuming a signed 32-bit counter). On 64-bit systems, it can be referenced up to about 9 quintillion times, since according to the Wiki: “On 64-bit targets, fields RefCount/Length consume 8 bytes each, not 4.”
procedure MyProcedure(const s: String);
In a function header, string parameters should ideally be declared as constant using const, provided that the procedure does not need to modify the string content. The advantage is better performance. This also applies to other structured types such as records or arrays.
If the parameter is passed without const, the content is usually copied for the local variable. This ensures that any modifications remain within the procedure and do not affect the original data, allowing the caller to continue working with its unmodified values. However, a local copy is unnecessary if the content is only read and not changed. Using const tells the compiler that no modification will occur.
However, for modern reference-counted string types (AnsiString and UnicodeString), the const declaration is less critical than it was with older types. Without const, for example, a ShortString is copied entirely. But with a reference-counted string, only the reference count is incremented. Even so, using const still has an advantage: it avoids even that reference counter change, resulting in a slight performance gain.
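The difference can be made visible by reading the reference count inside the called routine. This is a sketch for FreePascal; the function names CountByValue and CountByConst are hypothetical, and the exact numbers may vary with compiler version and optimization level.

```pascal
program ConstParamDemo;
{$mode objfpc}{$H+}

// Value parameter: the routine holds its own reference to the content.
function CountByValue(s: AnsiString): SizeInt;
begin
  Result := StringRefCount(s);
end;

// Const parameter: no additional reference is taken.
function CountByConst(const s: AnsiString): SizeInt;
begin
  Result := StringRefCount(s);
end;

var
  msg: AnsiString;
begin
  msg := 'Hello';
  msg := msg + ' world!';   // heap-allocated, referenced once by msg
  WriteLn('by value: ', CountByValue(msg));
  WriteLn('by const: ', CountByConst(msg));
end.
```

With the const version, the count seen inside the routine is the same as outside, which is the small saving described above.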
Looking back: In the early Pascal versions of the 1970s, there was no string type yet. Instead, you had to define a static array for a character sequence manually, with a fixed length depending on your needs.
In Pascal, the Char array was traditionally defined with an index starting at 1, because the creator of Pascal considered counting from 1 to be more natural for humans. To this day, all newer Pascal string types still use index 1 as the starting point, with the exception of UCS4String, which is simply an array of UCS4Char starting at index 0.
If the content does not fill the entire length of the array, you must store the actual used length separately somewhere. That was really quite inconvenient. Alternatively, you could place a #0 character after the content, making it a null-terminated string like in the C programming language.
PChar is not a string type, but simply a pointer to a character. PChar is mainly used to pass a pointer to a function in an API or another function library, for example, functions in a .dll (such as the Windows API) or a .so file.
The PChar type does not allocate its own memory for content. This means that when using PChar, you must ensure that the memory is allocated elsewhere and remains unchanged for as long as the PChar might be used to access the content.
PChar is not a fixed type but may represent either PAnsiChar or PWideChar, depending on the environment, specifically on the compiler switch that determines whether a String is an AnsiString or a UnicodeString.
When you apply PChar(s) as a type cast, you get a pointer to the first character of the string. For an empty string, the cast yields a pointer to a #0 terminator rather than NIL; only the untyped cast Pointer(s) returns NIL in that case. It is important to remember that the string variable is responsible for memory management. So, as long as you want to use or pass the PChar pointer, the string variable must remain unchanged.
A PChar should point to null-terminated content, because it carries no metadata and thus does not know the length of the content. Fortunately, the types AnsiString and UnicodeString are automatically null-terminated, so you typically don’t need to worry about this. If you are working with content that is not null-terminated, such as a ShortString, you must store and pass the length separately.
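As an illustration of the type cast, the following sketch for FreePascal obtains a pointer to the content of an AnsiString and reads it like a null-terminated C string. StrLen is taken from the SysUtils unit; the string variable s must stay unchanged while p is in use.

```pascal
program PCharDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  s: AnsiString;
  p: PAnsiChar;
begin
  s := 'Hello';
  s := s + ' world!';
  p := PAnsiChar(s);      // points to the first character; s still owns the memory

  WriteLn(StrLen(p));     // counts up to the #0 terminator: 12
  WriteLn(p^);            // first character: H
  WriteLn(p[6]);          // pointer indexing is 0-based: w
end.
```

Note that indexing through the pointer starts at 0, while indexing the string itself starts at 1.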
ShortString was the first string type introduced in Turbo Pascal. In principle, a ShortString works similarly to an array of Char, with the difference that the first element s[0] stores the length actually used by the content. Since the length indicator is just one Char (1 byte), the maximum content length is 255 bytes (Char value range #0..#255). The first character of the content is at s[1], thus continuing Pascal's tradition of 1-based indexing for strings.
The capacity of a short string must be defined in advance, e.g. var s: String[40]; for a maximum of 40 bytes of content. Including the length byte, String[40] always occupies 41 bytes, even if the string is empty. If you declare ShortString as a type, it corresponds to String[255], i.e. the maximum short string size of up to 255 characters with a fixed memory size of 256 bytes.
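The layout can be checked with SizeOf and by reading the length byte at index 0. A minimal sketch for FreePascal:

```pascal
program ShortDemo;
{$mode objfpc}{$H+}
var
  s: String[40];                   // short string with a capacity of 40 bytes
begin
  s := 'Pascal';
  WriteLn(SizeOf(s));              // 40 content bytes + 1 length byte = 41
  WriteLn(Ord(s[0]));              // length byte: 6
  WriteLn(Length(s));              // same value via Length: 6
  WriteLn(s[1]);                   // first character: P
  WriteLn(SizeOf(ShortString));    // String[255] occupies 256 bytes
end.
```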
The memory location of a ShortString depends on the scope in which the variable is declared: as a global variable it is stored in the data segment, as a local variable on the stack, as a record field inside the record, and as a class field in the object instance on the heap.
The character encoding of the content depends on the system or inherits the encoding of the assigned content. In FreePascal and Lazarus today, the content is usually UTF-8 encoded. In Delphi, however, the content is ANSI-encoded — in Western Europe typically using code page Windows-1252.
A ShortString is not null-terminated, which means if you want to use it with PChar, you would need to manually append a #0 to the content or pass the string length separately (if possible).
When using a short string with only a few characters, String[n] can have advantages in terms of memory and speed compared to reference-counted strings. For example, for a language code like "de" or a country code like "CH", a String[3] takes just 4 bytes of memory, making it very compact and faster to copy than increasing and decreasing a reference counter. The same applies to codes like "de-AT" or short words that fit into a String[7]. Its fixed 8-byte size is no larger than the pointer used by an AnsiString on a 64-bit system. However, to use such a string in string functions, it must be converted to an AnsiString or UnicodeString, which cancels out any memory or performance benefits.
An AnsiString consists of single-byte elements of type AnsiChar. The AnsiString type was introduced in 1996 with Delphi 2, the first 32-bit version of Delphi. Before that, one had to make do with ShortString. For details on memory management of AnsiString, see Heap and Reference Counting.
Character Encoding: In Delphi, the 1-byte encoding follows the system’s code page, e.g., code page “Windows-1252” in Western Europe. In Lazarus, UTF-8 became the default encoding for AnsiString in 2016. In console applications, the compiler directive {$codepage utf8} also sets UTF-8 as the default encoding for AnsiString.
In UTF-8, a character can occupy between 1 and 4 bytes. The advantage is that it can represent any alphabet and any script. In contrast, traditional code page encodings use exactly 1 byte per character. This makes accessing individual characters very simple, but each code page is limited to just one alphabet.
Typing via code page number: It is possible to use a specific character encoding within an AnsiString by specifying the desired code page when declaring the AnsiString.
var
  s_cp1251: AnsiString(1251);     // Cyrillic only
  s_cp1252: AnsiString(1252);     // Western European only
  s_cp1253: AnsiString(1253);     // Greek only
  s_utf8  : AnsiString(CP_UTF8);  // = UTF8String
  s_raw   : AnsiString(CP_NONE);  // = RawByteString
begin
  s_utf8   := 'Γειά σου, Κόσμε!'; // Greek: "Hello World!"
  s_cp1253 := s_utf8;             // valid: cp1253 supports Greek characters
  s_raw    := s_cp1253;           // valid: a RawByteString accepts any code page
  s_cp1252 := s_cp1253;           // lossy: cp1252 has no Greek characters, they become '?'
end;
If one AnsiString variable is assigned to another, recoding to the target code page is performed when needed. The example above demonstrates such special cases. However, in most programs this is rarely necessary, since everything typically uses the same default code page.
Assigning to a UTF8String always works, because UTF-8 can encode all Unicode characters (in 1 to 4 bytes). However, if you assign s_cp1253 (Greek) to s_cp1252 (Latin only), the Greek characters have no equivalents in the Western European code page. As a result, these characters will be replaced with '?' in s_cp1252, rendering the string useless.
Be careful when assigning a constant to a typed string: The constant (UTF-8 encoded) content is always assigned, even if the target string is typed with a different code page and would require recoding. In my opinion, this is a compiler bug — when reading the variable later, the UTF-8 content is misinterpreted according to the declared code page, making it unusable.
I also encountered another issue with TStringList.Text: If you add strings to a TStringList that do not match the system’s default code page, this is not accounted for when processing via .Text. As a result, you may get an unusable combined string, because the default code page doesn’t match the differently encoded content.
A RawByteString is a special type of AnsiString that is not assigned a standard language code page. When an AnsiString is assigned to a RawByteString, it simply adopts its code page, and no recoding of the content is performed.
function ToWesternEuropean(const s: RawByteString): RawByteString;
begin
  Result := s;                        // only referencing here, no copy
  if StringCodePage(Result) <> 1252 then
    SetCodePage(Result, 1252);        // dereferencing and recoding
end;
This function header accepts any code page without triggering an automatic recoding (e.g., to UTF-8). Inside the function: If the input string s already uses code page 1252 (Western European), no recoding takes place, and the same string reference is returned, effectively with no time cost.
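That a RawByteString merely adopts the code page of its source can be checked by comparing code page and pointer after the assignment. The following is a sketch for FreePascal; the content is pure ASCII and is built at runtime so that the example does not depend on the source file encoding.

```pascal
program RawAdoptDemo;
{$mode objfpc}{$H+}
var
  u: UTF8String;
  r: RawByteString;
begin
  // Build the content at runtime (heap-allocated, typed as UTF-8).
  SetLength(u, 3);
  u[1] := 'a';
  u[2] := 'b';
  u[3] := 'c';

  r := u;   // no recoding: r references the same content

  WriteLn(StringCodePage(r) = StringCodePage(u));   // same code page
  WriteLn(Pointer(r) = Pointer(u));                 // same content address
end.
```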
A UTF8String is a typed AnsiString(CP_UTF8), which means it is guaranteed to contain UTF-8 encoded content. This type is mainly relevant for Delphi, where AnsiString by default uses the system’s ANSI code page instead of UTF-8.
In Lazarus, UTF-8 has been the default encoding for AnsiString since 2016, and AnsiString is equivalent to the String type. Therefore, in Lazarus, you can usually just use String and don’t need UTF8String. However, if a unit written in FreePascal is meant to be Delphi-compatible and relies on UTF-8 for specific reasons, you should explicitly declare UTF8String to ensure it is treated the same way by both compilers.
WideString is a UTF-16 string type introduced with Delphi 3 (1997), specifically for Windows. It implements the Windows BSTR (Basic String) type, which is used as the native string type for the COM interface (OLE Automation).
On Windows, the WideString content is not stored on the application’s own heap but managed through the Windows API using SysAllocString() and SysFreeString(). Because of this, WideString does not use reference counting, and the indirect memory management via the Windows API is slightly slower.
Since Linux does not support COM, the string content in FreePascal under Linux is stored in the regular heap. Therefore, if you do not need to access the Windows COM interface but want to use a UTF-16 string, you should instead use the UnicodeString type.
UnicodeString is strictly for UTF-16 encoding (with WideChar elements) and was introduced with Delphi 2009. The name UnicodeString is an unfortunate choice, because UTF-8 and UTF-32 are also part of the Unicode standard, but this string type specifically and only refers to UTF-16.
For memory management, see Heap and Reference Counting; it works as with AnsiString. Unlike AnsiString, UnicodeString cannot contain content in different encodings: it always contains UTF-16 (the successor to the UCS-2 standard).
In Delphi, UTF-16 is useful for working with the Windows API. That’s why the String type has been synonymous with UnicodeString in Delphi since 2009. In FreePascal/Lazarus, which is also used for Linux development, UTF-8 was chosen instead, and AnsiString remained the default. As a result, UnicodeString plays a minor role in Lazarus (mainly for Delphi compatibility).
Unfortunately, UTF-16 has a similar issue to UTF-8: a character can consist of multiple elements. Characters outside the Basic Multilingual Plane are stored as a surrogate pair of two WideChar elements. Therefore, UnicodeString offers no advantage over UTF8String when it comes to accessing complete characters.
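The multi-element issue can be shown with a character outside the Basic Multilingual Plane. The sketch below, for FreePascal, builds the code point U+1F600 from its two UTF-16 surrogates (written as numeric values so that the example does not depend on the source file encoding) and converts it to UTF-8 with UTF8Encode.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
var
  u: UnicodeString;
  b: RawByteString;
begin
  // One character (U+1F600), stored as a UTF-16 surrogate pair.
  SetLength(u, 2);
  u[1] := WideChar($D83D);   // high surrogate
  u[2] := WideChar($DE00);   // low surrogate

  b := UTF8Encode(u);        // the same character in UTF-8

  WriteLn(Length(u));        // 2 WideChar elements for a single character
  WriteLn(Length(b));        // 4 bytes in UTF-8
end.
```

In both encodings, Length returns the number of elements, not the number of characters.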
UCS4String is simply an array of UCS4Char (each element is 32 bits = 4 bytes) and is designed to store a UTF-32 encoded string. UTF-32 has the unbeatable advantage that each element represents exactly one Unicode code point.
Unfortunately, the type is not well supported. There are only a few functions available for UCS4String. The last element should contain a null character to be compatible with null-terminated strings, but this must be handled entirely manually. The string length is calculated as Length(Array) - 1 and may be 0 or -1 if there is no content.
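One of the few available helpers is the System function UnicodeStringToUCS4String. As a sketch for FreePascal, the following program converts a short ASCII text and applies the length rule from above; the conversion function is expected to append the null element itself, whereas an array filled by hand would need it added manually.

```pascal
program UCS4Demo;
{$mode objfpc}{$H+}
var
  u: UnicodeString;
  c: UCS4String;
begin
  u := 'Pascal';
  c := UnicodeStringToUCS4String(u);   // array of UCS4Char, 0-based

  WriteLn(Length(c) - 1);              // string length without the terminator: 6
  WriteLn(c[0] = Ord('P'));            // the first character is at index 0
  WriteLn(c[High(c)] = 0);             // the last element is the null terminator
end.
```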
As such, UCS4String is sadly comparable to the old array of Char, which no one really wants to use anymore. That’s unfortunate, because 4 bytes per character is no longer an issue on modern systems. If there were good support for UCS4String, similar to what AnsiString offers, it could actually be a modern and practical type. But the existing string types already cover most use cases, and the current definition as an array of UCS4Char is, compatibility-wise, a dead end.
There are many options, but in practice, most people just use the String type, because it leaves little to be desired and is fully supported by all string functions as the system’s standard string type.
If issues do arise during development, they often involve typecasting to PChar, or failure to account for the fact that a character in UTF-8 may occupy up to 4 bytes (and the same applies to UTF-16). Code page conversions are usually only needed for special cases, such as reading/writing ANSI text files or working with databases.