String Types in Pascal (programming language)

Overview: The following data types are offered by modern Pascal compilers to store strings. This page also covers the following topics: PChar, const parameters, and heap memory and reference counting.

Array of Char	historical
ShortString	UTF-8 or ANSI, max. 255 bytes, no reference counting
String	platform-dependent: AnsiString or UnicodeString
AnsiString	UTF-8 or ANSI
UTF8String	UTF-8-typed AnsiString
RawByteString	untyped AnsiString, adopts codepage of the source
WideString	UTF-16, no reference counting
UnicodeString	UTF-16
UCS4String	UTF-32, array of UCS4Char

Introduction

With the types AnsiString and UnicodeString, Delphi and FreePascal/Lazarus have gained very powerful, practical, and fast string types. Both types carry length information with them (a "counted string" as opposed to a null-terminated string), can dynamically allocate as much memory as the system architecture and operating system allow, know the character encoding used for their content, and use reference counting to avoid unnecessary copying of the content.

All of this is very convenient and leaves little to be desired — except perhaps that the language of the content (e.g., de, en, fr, etc.) cannot be stored in the metadata. And in one special case, reference counting can be a disadvantage: when a string's content is used across multiple threads (multi-threading), the programmer must ensure that each thread receives its own copy of the string in advance.

From today's perspective, the naming of some string types is a bit unfortunate. The UnicodeString should actually be called UTF16String (analogous to UTF8String), because potentially all string types can contain Unicode characters, as UTF-8 and UTF-32 are just as much part of the Unicode system as UTF-16. And UCS4String would be better named UTF32String, as UCS4 has been superseded by the UTF-32 standard (functionally identical).

String

The type String is not a clearly defined type, but rather depends on the environment and compiler settings. In FreePascal and Lazarus, a new unit by default includes the compiler directive {$H+}, which means that the type String should be a long string. In FreePascal, a long string is an AnsiString (UTF-8). In contrast, in Delphi from version 2009 onward, a long string is a UnicodeString (UTF-16).

If this compiler directive is not present in a unit or is specified as {$H-}, then the type String is a ShortString with 255 bytes statically reserved for content.
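
The effect of the switch can be checked with SizeOf. The following is a minimal sketch, assuming Free Pascal in objfpc mode; the program name and the two type aliases are made up for this illustration.

```pascal
program StringSwitchDemo;
{$mode objfpc}

{$H-}
type
  TStringWithoutH = String;   // declared under H-: a ShortString

{$H+}
type
  TStringWithH = String;      // declared under H+: a long string (pointer)

begin
  // A ShortString reserves 1 length byte plus 255 content bytes.
  WriteLn('String under H-: ', SizeOf(TStringWithoutH), ' bytes');  // 256
  // A long string variable is only a pointer (4 or 8 bytes).
  WriteLn('String under H+: ', SizeOf(TStringWithH), ' bytes');
end.
```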

Heap Memory and Reference Counting

The modern string types in Delphi and FreePascal, i.e., AnsiString and UnicodeString, use reference counting. Exceptions are the types ShortString, WideString, and UCS4String.
Referencing a string content means that a string variable internally is just a pointer that points to the content, which may be shared among multiple string variables.

If a constant value is assigned to an AnsiString, e.g., s := 'Hello';, the variable s only receives a pointer to the constant stored in the program's read-only constant data. The function StringRefCount(s) will return -1 in this case, which indicates that a constant was assigned.

Heap-allocated: If the string content is created at runtime, e.g., by concatenation such as s := s + ' world!';, then memory is allocated on the heap (dynamic memory) and the new content is stored as a complete string "Hello world!". As long as the new string is assigned to a single variable only, StringRefCount(s) will return 1 as the reference count.
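
The two cases described above can be observed directly. This is a small sketch, assuming Free Pascal 3.x, where StringRefCount is part of the System unit; the variable names are arbitrary.

```pascal
program RefCountDemo;
{$mode objfpc}{$H+}

var
  s, t: AnsiString;
begin
  s := 'Hello';                 // points to a constant
  WriteLn(StringRefCount(s));   // expected: -1 (constant)

  s := s + ' world!';           // new content is built on the heap
  WriteLn(StringRefCount(s));   // expected: 1 (one variable references it)

  t := s;                       // second reference to the same content
  WriteLn(StringRefCount(s));   // expected: 2
  WriteLn(t);
end.
```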

The AnsiString variable itself is always just a pointer to the content. This pointer is 4 bytes (32-bit) or 8 bytes (64-bit) in size and is stored in the variable’s scope — that is, for a local variable on the stack, or for example in a record or object.

The metadata is stored in memory along with the content. The following table shows the memory layout used for AnsiString content. The string variable as a pointer points directly to the content, meaning that metadata is accessed internally by reading before the pointer address. Likewise, a type cast like PChar(s) returns a pointer directly to the content (the first character), as is well known.

CodePage	ElementSize	ReferenceCount	Length	Content	Terminator
2 bytes	2 bytes	4 or 8 bytes	4 or 8 bytes	Chars	Char #0
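
The metadata fields of this layout can be read through functions of the System unit instead of peeking before the pointer. A sketch, assuming Free Pascal 3.x; the reported code page depends on the system and on compiler settings, so no value is given for it here.

```pascal
program StringMetaDemo;
{$mode objfpc}{$H+}

var
  s: AnsiString;
begin
  s := 'Hello';
  s := s + ' world!';           // heap-allocated content, 12 characters

  WriteLn('CodePage:    ', StringCodePage(s));      // system-dependent
  WriteLn('ElementSize: ', StringElementSize(s));   // 1 byte per AnsiChar
  WriteLn('RefCount:    ', StringRefCount(s));
  WriteLn('Length:      ', Length(s));              // 12
  // PAnsiChar indexing is 0-based, so index Length(s) is the terminator.
  WriteLn('Terminator:  ', Ord(PAnsiChar(s)[Length(s)]));  // 0
end.
```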

The string content in a 32-bit application can theoretically be up to 2 GB in size. In a 64-bit application, a string could theoretically grow up to 9 exabytes — that’s 9 billion GB.

If a string variable is assigned another string of the same type, the content is not copied; only a pointer to the existing content is set, and the reference count for that content is increased. This is faster than copying the content — especially if the string is 9 exabytes large. ;)

The "copy-on-write" mechanism automatically copies the content only when a modification to it becomes necessary. If the string variable is assigned new content, the reference count of the previous content is decreased. If the count drops to 0, it means no variable is using (referencing) that content anymore, and the previously allocated memory is released.
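
Copy-on-write can be made visible by comparing the content pointers of two variables before and after a modification. A minimal sketch, assuming Free Pascal in objfpc mode; the variable names are arbitrary.

```pascal
program CopyOnWriteDemo;
{$mode objfpc}{$H+}

var
  a, b: AnsiString;
begin
  a := 'Hello';
  a := a + ' world';                  // heap-allocated content

  b := a;                             // no copy, b references the same content
  WriteLn(Pointer(a) = Pointer(b));   // TRUE

  b[1] := 'J';                        // writing forces a private copy for b
  WriteLn(Pointer(a) = Pointer(b));   // FALSE

  WriteLn(a);                         // Hello world
  WriteLn(b);                         // Jello world
end.
```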

A content can be referenced up to about 2 billion times on 32-bit systems, because the reference count is a signed 32-bit integer. On 64-bit systems, it can even be referenced up to about 9 quintillion times (according to the Wiki: “On 64-bit targets, fields RefCount/Length consume 8 bytes each, not 4.”).

const Parameters

procedure MyProcedure(const s: String);

In a function header, string parameters should ideally be declared as constant using const, provided that the procedure does not need to modify the string content. The advantage is better performance. This also applies to other structured types such as records or arrays.

If the parameter is passed without const, the content is usually copied for the local variable. This ensures that any modifications remain within the procedure and do not affect the original data, allowing the caller to continue working with its unmodified values. However, a local copy is unnecessary if the content is only read and not changed. Using const tells the compiler that no modification will occur.

However, for modern reference-counted string types (AnsiString and UnicodeString), the const declaration is less critical than it was with older types. Without const, for example, a ShortString is copied entirely. But with a reference-counted string, only the reference count is incremented. Even so, using const still has an advantage: it avoids even that reference counter change, resulting in a slight performance gain.
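
The difference in reference counting can be observed from inside the called procedure. This is a sketch, assuming Free Pascal 3.x; ByValue and ByConst are hypothetical names, and the counts in the comments are what one would typically see, not guaranteed values.

```pascal
program ConstParamDemo;
{$mode objfpc}{$H+}

procedure ByValue(s: AnsiString);
begin
  // The local parameter holds its own reference to the content.
  WriteLn('by value: ', StringRefCount(s));   // typically 2
end;

procedure ByConst(const s: AnsiString);
begin
  // No additional reference is taken for a const parameter.
  WriteLn('by const: ', StringRefCount(s));   // typically 1
end;

var
  msg: AnsiString;
begin
  msg := 'abc';
  msg := msg + 'def';   // heap-allocated, so the count is meaningful
  ByValue(msg);
  ByConst(msg);
end.
```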

Array[1..n] of Char

Looking back: In the early Pascal versions of the 1970s, there was no string type yet. Instead, you had to define a static array for a character sequence manually, with a fixed length depending on your needs.

In Pascal, the Char array was traditionally defined with an index starting at 1, because the creator of Pascal considered counting from 1 to be more natural for humans. To this day, all newer Pascal string types still use index 1 as the starting point, with the exception of UCS4String, which is simply an array of UCS4Char starting at index 0.

If the content does not fill the entire length of the array, you must store the actual used length separately somewhere. That was really quite inconvenient. Alternatively, you could place a #0 character after the content — making it a null-terminated string like in the C programming language.
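
To illustrate how inconvenient this was, here is a sketch of a fixed Char array with a separately stored length. It assumes Free Pascal in objfpc mode; the names buf and used are chosen for this example only.

```pascal
program CharArrayDemo;
{$mode objfpc}{$H+}

const
  Source = 'Pascal';
var
  buf: array[1..20] of Char;   // fixed capacity of 20 characters
  used, i: Integer;            // 'used' stores the actual content length
begin
  FillChar(buf, SizeOf(buf), #0);   // unused elements become #0
  used := Length(Source);
  for i := 1 to used do
    buf[i] := Source[i];            // copy character by character

  for i := 1 to used do
    Write(buf[i]);                  // Pascal
  WriteLn;
  WriteLn(used);                    // 6
end.
```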

PChar

PChar is not a string type, but simply a pointer to a character. PChar is mainly used to pass a pointer to a function in an API or another function library, for example, functions in a .dll (such as the Windows API) or a .so file.

The PChar type does not allocate its own memory for content. This means that when using PChar, you must ensure that the memory is allocated elsewhere and remains unchanged for as long as the PChar might be used to access the content.

PChar is not a fixed type but may represent either PAnsiChar or PWideChar, depending on the environment — specifically on the compiler switch that determines whether a String is an AnsiString or a UnicodeString.

When you apply PChar(s) as a type cast, you get a pointer to the first character of the string. For an empty string, the cast yields a pointer to a #0 character rather than NIL; only the untyped cast Pointer(s) returns NIL in that case. It is important to remember that the string variable is responsible for memory management. So, as long as you want to use or pass the PChar pointer, the string variable must remain unchanged.

A PChar should point to a null-terminated content, because it carries no metadata and thus does not know the length of the content. Fortunately, the types AnsiString and UnicodeString are automatically null-terminated, so you typically don’t need to worry about this. If you are working with content that is not null-terminated, such as a ShortString, you must store and pass the length separately.
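
A short sketch of the type cast, assuming Free Pascal with the SysUtils unit. StrLen stands in here for any API function that expects a null-terminated string.

```pascal
program PCharDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  s: AnsiString;
  p: PAnsiChar;
begin
  s := 'Hello';
  p := PAnsiChar(s);     // points to the first character; s still owns the memory

  WriteLn(p^);           // H
  WriteLn(StrLen(p));    // 5, found by scanning for the #0 terminator
  WriteLn(Length(s));    // 5, read from the string's metadata
  // s must not be changed while p is still in use.
end.
```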

ShortString
String[n]

ShortString was the first string type introduced in Turbo Pascal. In principle, a ShortString works similarly to an array of Char, with the difference that the first element s[0] stores the length actually used by the content. Since the length indicator is just one Char (1 byte), the maximum content length is 255 bytes (Char value range #0..#255). The first character of the content is at s[1], thus continuing Pascal's tradition of 1-based indexing for strings.

The capacity of a short string must be defined in advance, e.g. var s: String[40]; for a maximum of 40 bytes of content. Including the length byte, String[40] always occupies 41 bytes — even if the string is empty. If you declare ShortString as a type, it corresponds to String[255], i.e. the maximum short string size of up to 255 characters with a fixed memory size of 256 bytes.
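
The sizes and the length byte can be verified with a few lines. A minimal sketch, assuming Free Pascal; note that String[40] remains a short string even under {$H+}.

```pascal
program ShortStringDemo;
{$mode objfpc}{$H+}

var
  s: String[40];
  t: ShortString;
begin
  s := 'Pascal';
  t := '';

  WriteLn(SizeOf(s));    // 41: 1 length byte + 40 content bytes
  WriteLn(SizeOf(t));    // 256: 1 length byte + 255 content bytes
  WriteLn(Ord(s[0]));    // 6: the length byte
  WriteLn(Length(s));    // 6
  WriteLn(s[1]);         // P: content starts at index 1
end.
```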

The memory location of a ShortString depends on the scope in which the variable is declared: as a global variable it is stored in the data segment, as a local variable on the stack, as a record field inside the record, and as a field of a class in the object instance on the heap.

The character encoding of the content depends on the system or inherits the encoding of the assigned content. In FreePascal and Lazarus today, the content is usually UTF-8 encoded. In Delphi, however, the content is ANSI-encoded — in Western Europe typically using code page Windows-1252.

A ShortString is not null-terminated, which means if you want to use it with PChar, you would need to manually append a #0 to the content or pass the string length separately (if possible).

When using a short string with only a few characters, String[n] can have advantages in terms of memory and speed compared to reference-counted strings. For example, for a language code like "de" or a country code like "CH", a String[3] takes just 4 bytes of memory, making it very compact and faster to copy than increasing and decreasing a reference counter. The same applies to codes like "de-AT" or short words that fit into a String[7]. The fixed 8-byte size is no larger than the pointer used by an AnsiString on a 64-bit system. However, to use such a string in string functions, it must be converted to an AnsiString or UnicodeString, which reverses any memory or performance benefits.

AnsiString

An AnsiString contains single-byte characters of type AnsiChar. The AnsiString type was introduced in 1996 with Delphi 2, the first 32-bit version of Delphi. Before that, one had to make do with ShortString. For details on memory management of AnsiString, see Heap Memory and Reference Counting.

Character Encoding: In Delphi, the 1-byte encoding follows the system’s code page, e.g., code page “Windows-1252” in Western Europe. In Lazarus, UTF-8 became the default encoding for AnsiString in 2016. In console applications, the compiler directive {$codepage utf8} also sets UTF-8 as the default encoding for AnsiString.

In UTF-8, a character can occupy between 1 and 4 bytes. The advantage is that it can represent any alphabet and any script. In contrast, traditional code page encodings use exactly 1 byte per character. This makes accessing individual characters very simple, but each code page is limited to just one alphabet.
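
The difference between bytes and characters in UTF-8 can be sketched as follows. The example assumes Free Pascal; CountCodePoints is a hypothetical helper written for this illustration, and the word "Grüße" is entered as raw bytes so that the result does not depend on the source file encoding.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping
// UTF-8 continuation bytes (bit pattern 10xxxxxx).
function CountCodePoints(const s: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: UTF8String;
begin
  SetLength(s, 7);
  s[1] := 'G';
  s[2] := 'r';
  s[3] := #$C3; s[4] := #$BC;   // U+00FC (u with diaeresis), 2 bytes
  s[5] := #$C3; s[6] := #$9F;   // U+00DF (sharp s), 2 bytes
  s[7] := 'e';

  WriteLn(Length(s));            // 7 bytes
  WriteLn(CountCodePoints(s));   // 5 characters
end.
```

Length counts bytes, so s[3] alone is only half of a character here.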

AnsiString(Codepage)

Typing via code page number: It is possible to use a specific character encoding within an AnsiString by specifying the desired code page when declaring the AnsiString.

var
 s_cp1251: AnsiString(1251);    // Cyrillic only
 s_cp1252: AnsiString(1252);    // Latin-1 only (Western European)
 s_cp1253: AnsiString(1253);    // Greek only

 s_utf8  : AnsiString(CP_UTF8); // = UTF8String
 s_raw   : AnsiString(CP_NONE); // = RawByteString

begin
 s_utf8 := 'Γειά σου, Κόσμε!';  // Greek: "Hello World!"

 s_cp1253 := s_utf8;    // valid: cp1253 supports Greek characters

 s_raw := s_cp1253;     // valid: a RawByteString accepts any code page

 s_cp1252 := s_cp1253;  // lossy: cp1252 has no Greek characters, they become '?'
end;

If one AnsiString variable is assigned to another, recoding to the target code page is performed when needed. The example above demonstrates such special cases. However, in most programs this is rarely necessary, since everything typically uses the same default code page.

When assigning to a UTF8String, it always works, because UTF-8 can encode all Unicode characters (in 1 to 4 bytes). However, if you assign s_cp1253 (Greek) to s_cp1252 (Latin only), the Greek characters have no equivalents in the Western European code page. As a result, these characters will be replaced with '?' in s_cp1252, rendering the string useless.

Be careful when assigning a constant to a typed string: The constant (UTF-8 encoded) content is always assigned, even if the target string is typed with a different code page and would require recoding. In my opinion, this is a compiler bug — when reading the variable later, the UTF-8 content is misinterpreted according to the declared code page, making it unusable.

I also encountered another issue with TStringList.Text: If you add strings to a TStringList that do not match the system’s default code page, this is not accounted for when processing via .Text. As a result, you may get an unusable combined string, because the default code page doesn’t match the differently encoded content.

RawByteString

A RawByteString is a special type of AnsiString that is not assigned a standard language code page. When an AnsiString is assigned to a RawByteString, it simply adopts its code page, and no recoding of the content is performed.

function ToWesternEuropean(const s: RawByteString): RawByteString;
begin
  Result := s;  // only referencing here
  if StringCodePage(Result) <> 1252 then
    SetCodePage(Result, 1252);  // dereferencing and recoding
end;

This function header accepts any code page without triggering an automatic recoding (e.g., to UTF-8).
Inside the function: If the input string s already uses code page 1252 (Western European), no recoding takes place, and the same string reference is returned — effectively with no time cost.

UTF8String

A UTF8String is a typed AnsiString(CP_UTF8), which means it is guaranteed to contain UTF-8 encoded content. This type is only relevant for Delphi, where AnsiString by default uses the system’s ANSI code page instead of UTF-8.

In Lazarus, UTF-8 has been the default encoding for AnsiString since 2016, and AnsiString is equivalent to the String type. Therefore, in Lazarus, you can usually just use String and don’t need UTF8String.

However, if a unit written in FreePascal is meant to be Delphi-compatible and relies on UTF-8 for specific reasons, you should explicitly declare UTF8String to ensure it is treated the same way by both compilers.

WideString

WideString is a UTF-16 string type introduced with Delphi 3 (1997), specifically for Windows. It implements the Windows BSTR (Basic String) type, which is used as the native string type for the COM interface (OLE Automation).

On Windows, the WideString content is not stored on the application’s own heap but managed through the Windows API using SysAllocString() and SysFreeString(). Because of this, WideString does not use reference counting, and the indirect memory management via the Windows API is slightly slower.

Since Linux does not support COM, the string content in FreePascal under Linux is stored in the regular heap. Therefore, if you do not need to access the Windows COM interface but want to use a UTF-16 string, you should instead use the UnicodeString type.

UnicodeString

UnicodeString is strictly for UTF-16 encoding (with WideChar elements) and was introduced with Delphi 2009. The name UnicodeString is an unfortunate choice, because UTF-8 and UTF-32 are also part of the Unicode standard, whereas this string type refers specifically and only to UTF-16.

Memory management works as with AnsiString; see Heap Memory and Reference Counting. Unlike AnsiString, UnicodeString cannot contain content in different encodings: it always contains UTF-16 (the successor to the UCS-2 standard).

In Delphi, UTF-16 is useful for working with the Windows API. That’s why the String type has been synonymous with UnicodeString in Delphi since 2009. In FreePascal/Lazarus, which is also used for Linux development, UTF-8 was chosen instead, and AnsiString remained the default. As a result, UnicodeString plays a minor role in Lazarus (mainly for Delphi compatibility).

Unfortunately, UTF-16 has a similar issue to UTF-8: a character can consist of multiple elements. Therefore, UnicodeString offers no advantage over UTF8String when it comes to accessing complete characters.
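
A character outside the Basic Multilingual Plane shows the effect. The sketch below assumes Free Pascal with SysUtils and uses U+1F600 as an arbitrary example; the variable names lead and trail are chosen for this illustration.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  u: UnicodeString;
  lead, trail, cp: Integer;
begin
  // U+1F600 is stored in UTF-16 as the surrogate pair D83D DE00.
  SetLength(u, 2);
  u[1] := WideChar($D83D);
  u[2] := WideChar($DE00);

  WriteLn(Length(u));   // 2 elements for a single character

  // Recombine the pair into the code point.
  lead  := Ord(u[1]);
  trail := Ord(u[2]);
  cp := $10000 + ((lead - $D800) shl 10) + (trail - $DC00);
  WriteLn('U+', IntToHex(cp, 5));   // U+1F600
end.
```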

UCS4String

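UCS4String is simply an array of UCS4Char (each element is 32 bits = 4 bytes) and is designed to store a UTF-32 encoded string. UTF-32 has the advantage that each element represents exactly one Unicode code point, so no code point is split across several elements as in UTF-8 or UTF-16. A visible character composed of several code points (e.g., a letter plus a combining accent) still spans several elements.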

Unfortunately, the type is not well supported. There are only a few functions available for UCS4String. The last element should contain a null character to be compatible with null-terminated strings, but this must be handled entirely manually. The string length is calculated as Length(Array) - 1 and may be 0 or -1 if there is no content.
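
The manual handling looks roughly like this. A sketch, assuming Free Pascal; the content "Hi" is filled in by hand to show the 0-based indexing and the terminator element.

```pascal
program Ucs4Demo;
{$mode objfpc}{$H+}

var
  u: UCS4String;
  i: Integer;
begin
  SetLength(u, 3);          // 2 characters + 1 terminator element
  u[0] := Ord('H');         // index starts at 0, unlike other string types
  u[1] := Ord('i');
  u[2] := 0;                // null terminator, set manually

  WriteLn(Length(u) - 1);   // 2: the string length without the terminator

  for i := 0 to Length(u) - 2 do
    Write(AnsiChar(u[i]));  // valid here because the content is pure ASCII
  WriteLn;                  // Hi
end.
```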

As such, UCS4String is sadly comparable to the old array of Char, which no one really wants to use anymore. That’s unfortunate, because 4 bytes per character is no longer an issue on modern systems. If there were good support for UCS4String — similar to what AnsiString offers — it could actually be a modern and practical type. But the existing string types already cover most use cases, and the current definition as an array of UCS4Char is, compatibility-wise, a dead end.

Conclusion

There are many options, but in practice, most people just use the String type, because it leaves little to be desired and is fully supported by all string functions as the system’s standard string type.

If issues do arise during development, they often involve typecasting to PChar, or failure to account for the fact that a character in UTF-8 may occupy up to 4 bytes — and the same applies to UTF-16. Code page conversions are usually only needed for special cases, such as reading/writing ANSI text files or working with databases.
