Now that we know what arrays are, have gotten in touch with immutable, and had a quick look at the basic types, it's time to introduce two new constructs in one line:
alias string = immutable(char)[];
The term string is defined by an alias statement, which declares it as a slice of immutable(char)s. This means that once a string has been constructed, its content will never change again. And this is actually the second introduction: welcome, UTF-8 string!
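A minimal sketch of what this immutability means in practice. The helper name shout is made up for this illustration; everything else is plain language features.

```d
import std.stdio : writeln;

// Hypothetical helper, used only for this illustration.
string shout(string s)
{
    // Writing to an element is rejected at compile time,
    // because the characters are immutable.
    static assert(!__traits(compiles, s[0] = 'X'));

    // Appending builds a new string; the caller's data stays untouched.
    return s ~ "!";
}

void main()
{
    // string really is just an alias.
    static assert(is(string == immutable(char)[]));

    string greeting = "hello";
    string louder = shout(greeting);
    writeln(greeting); // hello
    writeln(louder);   // hello!

    // Only the content is immutable. The variable itself
    // may be pointed at a different string.
    greeting = "goodbye";
    writeln(greeting); // goodbye
}
```

Note that the slice variable can still be reassigned; it is the characters it refers to that can never change.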
Due to their immutability, strings can be shared safely among different threads. As a string is a slice, parts can be taken out of it without allocating memory. The standard function std.algorithm.splitter, for example, splits a string by newline without any memory allocations.
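As a rough illustration of slicing without allocation, the following sketch checks that both a manual slice and the pieces produced by splitter point into the original string's memory. The sample text is made up for this example.

```d
import std.algorithm : equal, splitter;

void main()
{
    string text = "first line\nsecond line";

    // A slice is a pointer plus a length into the same memory.
    string word = text[0 .. 5];
    assert(word == "first");
    assert(word.ptr == text.ptr);

    // splitter is lazy and yields slices of the original string.
    assert(text.splitter('\n').equal(["first line", "second line"]));

    // Every piece lies within the memory of the original string.
    foreach (line; text.splitter('\n'))
    {
        assert(line.ptr >= text.ptr);
        assert(line.ptr + line.length <= text.ptr + text.length);
    }
}
```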
Besides the UTF-8 string, there are two more types:
alias wstring = immutable(wchar)[]; // UTF-16
alias dstring = immutable(dchar)[]; // UTF-32
The variants are most easily converted between each other using the to function from std.conv:
dstring myDstring = to!dstring(myString);
string myString = to!string(myDstring);
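A small self-contained sketch of such a round trip. The sample word is an assumption chosen so that the three encodings differ in length: it contains U+00E9, which takes two code units in UTF-8 but only one in UTF-16 and UTF-32.

```d
import std.conv : to;

void main()
{
    string s = "h\u00E9llo"; // "héllo" with a precomposed é

    wstring w = s.to!wstring;
    dstring d = s.to!dstring;

    // .length counts code units of the respective encoding.
    assert(s.length == 6); // é needs two UTF-8 code units
    assert(w.length == 5);
    assert(d.length == 5);

    // Converting back yields the original UTF-8 string.
    assert(d.to!string == s);
    assert(w.to!string == s);
}
```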
By its definition, a plain string is an array of 8-bit Unicode code units. All array operations can be used on strings, but they work at the code unit level, not at the character level. At the same time, standard library algorithms interpret strings as sequences of code points, and there is also an option to treat them as sequences of graphemes by explicitly using std.uni.byGrapheme.
This small example illustrates the difference in interpretation:
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
string s = "\u0041\u0308"; // Ä
writeln(s.length); // 3 (code units)
writeln(s.walkLength); // 2 (code points)
writeln(s.byGrapheme.walkLength); // 1 (grapheme)
Here the actual array length of s is 3, because it contains 3 code units: 0x41, 0xCC and 0x88. The latter two encode a single code point (a combining diacritic character), so walkLength (a standard library function that calculates the length of an arbitrary range) counts two code points in total. Finally, byGrapheme performs rather expensive calculations to recognize that these two code points combine into a single displayed character.
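To make the code units visible, one option is std.string.representation, which reinterprets a string as its raw bytes. The following is only a sketch that inspects the same string at the byte and code point levels.

```d
import std.conv : to;
import std.string : representation;

void main()
{
    string s = "\u0041\u0308";

    // The raw UTF-8 code units: 'A' followed by the two-byte
    // encoding of U+0308 (combining diaeresis).
    immutable(ubyte)[] expected = [0x41, 0xCC, 0x88];
    assert(s.representation == expected);

    // Decoded to UTF-32, each element is one code point.
    dstring d = s.to!dstring;
    assert(d.length == 2);
    assert(d[0] == 0x41);
    assert(d[1] == 0x0308);
}
```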
Correct processing of Unicode can be very complicated, but most of the time, D developers can simply consider string variables as magical byte arrays and rely on standard library algorithms to do the right job.
If element-by-element (code unit) iteration is desired, one can use byCodeUnit from std.utf.
Auto-decoding in D is explained in more detail in the Unicode gems chapter.
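The difference between code unit and code point iteration can be sketched as follows, reusing the string from the earlier example. The counter names are made up for this illustration.

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "\u0041\u0308";

    // byCodeUnit disables decoding: the range yields one char at a time.
    assert(s.byCodeUnit.walkLength == 3);

    // foreach picks the level of iteration from the element type.
    size_t units, points;
    foreach (char c; s)  // no decoding, one UTF-8 code unit per step
        ++units;
    foreach (dchar c; s) // decodes one code point per step
        ++points;

    assert(units == 3);
    assert(points == 2);
}
```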
String literals in D can always span multiple lines:
string multiline = "
This
may be a
long document
";
When quotes appear in the document, Wysiwyg strings (see below) or heredoc strings can be used.
It is also possible to use raw strings to minimize laborious escaping of reserved symbols. Raw strings can be declared using either backticks (` ... `) or the r(aw)-prefix (r" ... ").
string raw = `raw "string"`; // raw "string"
string raw2 = r"raw `string`"; // raw `string`
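As a brief sketch of how these literal forms relate, the following compares raw strings with their escaped equivalents and shows a heredoc string. The delimiter name EOS and the sample contents are arbitrary choices for this illustration.

```d
void main()
{
    // Raw strings take backslashes literally, so no doubling is needed.
    string pattern = r"\d+\.\d+";
    assert(pattern == "\\d+\\.\\d+");

    // Backtick strings may contain double quotes unescaped.
    string quoted = `raw "string"`;
    assert(quoted == "raw \"string\"");

    // A heredoc string runs until the closing identifier. The newline
    // right before that identifier is part of the string.
    string doc = q"EOS
She said "hi" and `left`.
EOS";
    assert(doc == "She said \"hi\" and `left`.\n");
}
```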
D provides even more ways to represent strings - don't hesitate to explore them.