07 - literals

When you initialize or assign an object, data you write in the code (e.g. numbers) is generally stored in some read-only memory block inside the executable.

This data also has a type and there are ways to specify it.

Integer literals

By default, integer literals are of type int. The type can be changed by applying a specific suffix:

123    // int
123u   // unsigned
123ul  // unsigned long
123ull // unsigned long long

// C++23
123z   // signed version of std::size_t
123uz  // std::size_t (an alias, typically for unsigned long or unsigned long long)

  • The suffix is not case-sensitive, but for long long it must be either ll or LL. Mixed forms (lL or Ll) are not valid. Lowercase is recommended.

  • There is no suffix for short.

The size type is the integer type which the target platform uses for storing sizes, indexes etc. Its unsigned variant has the alias name std::size_t; on 64-bit platforms it is usually unsigned long or unsigned long long.
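The suffix rules can be checked by the compiler itself. The sketch below assumes a C++17 compiler; the helper name size_of_ull_literal is made up for this illustration.

```cpp
#include <cstddef>
#include <type_traits>

// Each static_assert is checked at compile time: if a literal had a
// different type than stated, this file would not compile.
static_assert(std::is_same_v<decltype(123),    int>);
static_assert(std::is_same_v<decltype(123u),   unsigned int>);
static_assert(std::is_same_v<decltype(123l),   long>);
static_assert(std::is_same_v<decltype(123ul),  unsigned long>);
static_assert(std::is_same_v<decltype(123ll),  long long>);
static_assert(std::is_same_v<decltype(123ull), unsigned long long>);

// Uppercase suffixes and a different order of u and l are accepted too.
static_assert(std::is_same_v<decltype(123UL), decltype(123lu)>);

// std::size_t is an alias for some unsigned integer type; which one
// depends on the platform, so only its signedness is checked here.
static_assert(std::is_unsigned_v<std::size_t>);

// Hypothetical helper (not from the lesson): size in bytes of the type
// selected by the ull suffix.
constexpr std::size_t size_of_ull_literal() { return sizeof(123ull); }
```

The z and uz suffixes are left out because they need C++23.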

Integers may be written using multiple numeric systems:

int decimal      = 42;         // base 10 (digits 0 - 9)
int octal        = 052;        // base  8 (digits 0 - 7)
int hexadecimal1 = 0x2a;       // base 16 (digits 0 - 9 and a - f)
int hexadecimal2 = 0X2A;       // as above (case doesn't matter)
int binary       = 0b00101010; // base  2 (digits 0 and 1), requires C++14

All of the above represent the same value.

Since C++14, numbers may use single quotes to separate digit groups. There are no requirements on grouping: all quotes are simply ignored.

int x1 = 123456789;
int x2 = 123'456'789;
int x3 = 1'2345'6'7'89;
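Both claims (different bases, ignored separators) can be verified at compile time. This is a small sketch assuming a C++17 compiler; the names flags and color are made up for the example.

```cpp
// All four spellings denote the same int value (binary literals need C++14).
static_assert(42 == 052);
static_assert(42 == 0x2a);
static_assert(42 == 0X2A);
static_assert(42 == 0b00101010);

// A leading zero changes the base, which is an easy mistake to make.
static_assert(010 == 8);

// Digit separators are ignored, wherever they are placed.
static_assert(123456789 == 123'456'789);
static_assert(123456789 == 1'2345'6'7'89);

// Separators also work in other bases, e.g. to group bits or bytes.
constexpr int      flags = 0b0010'1010; // hypothetical name, equals 42
constexpr unsigned color = 0xff'80'00u; // hypothetical name
```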

What happens if I try to assign/initialize a variable with a literal of a different type?

The value will be converted. If the destination type cannot represent the data, some of it might be lost (narrowing conversion). Conversions will be explained in depth in a later lesson.
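A few conversions of this kind, as a minimal sketch (the variable names are made up; compilers may warn about the lossy lines, which is expected here):

```cpp
// The literal 3 is an int; converting it to long loses nothing.
constexpr long exact = 3;

// The literal 7.9 is a double; converting it to int drops the fraction.
constexpr int truncated = 7.9;

// Converting a negative value to an unsigned type wraps around,
// so -1 becomes the largest value of unsigned int.
constexpr unsigned int wrapped = -1;

// Initialization with braces rejects lossy conversions of literals:
// int rejected{7.9}; // error: narrowing conversion
```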

Character and string literals

Characters are stored as numbers. Depending on the encoding, the value in memory can be interpreted as different text.

Character literals use prefixes. They are case-sensitive.

It's possible to write a character's value directly (in the hexadecimal system) by using the \U escape sequence.

 'a'          // char, ASCII encoding (compatible with UTF-8)
u'貓'         // char16_t, UTF-16, supports only BMP part of Unicode
U'🍌'         // char32_t, UTF-32, supports any Unicode character
U'\U0001f34c' // same as line above, value written as hex
L'β'          // wchar_t, encoding and value implementation-defined

// introduced in C++17, char
// breaking change in C++20: now it has type char8_t
u8'a'         // UTF-8, single byte characters from ISO 10646

// these forms are not really used
// compilers often give a warning about such code because
// multiple characters within '' are usually a typo
 'abc' // int, value implementation-defined
L'abc' // wchar_t, value implementation-defined

Don't worry if you don't get the UTF information right now. It will be useful later, once you learn more about Unicode.

In actual programs, it's really rare to have a need to write characters directly in the code - programs offering translations usually load language-specific text from files.

String literals like "abc" follow identical rules and can have identical prefixes. Double-quoted text is simply an array of multiple objects of the same character type.
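The types of character and string literals can be inspected with static_assert. The sketch assumes a C++17 compiler and an ASCII-compatible encoding; the name third is made up. u8 literals are left out because their type differs between C++17 and C++20.

```cpp
#include <type_traits>

// The prefix selects the character type of the literal.
static_assert(std::is_same_v<decltype('a'),  char>);
static_assert(std::is_same_v<decltype(u'a'), char16_t>);
static_assert(std::is_same_v<decltype(U'a'), char32_t>);
static_assert(std::is_same_v<decltype(L'a'), wchar_t>);

// A character is a number: with an ASCII-compatible encoding 'a' is 97.
static_assert('a' == 97);

// \U spells the code point in hexadecimal (here U+1F34C).
static_assert(U'\U0001f34c' == 0x1f34c);

// A string literal is an array of const characters. The array has one
// extra element at the end, the null character, so "abc" has 4 elements.
static_assert(std::is_same_v<decltype("abc"),  const char (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"), const char16_t (&)[4]>);
static_assert(sizeof(U"abc") == 4 * sizeof(char32_t));

// Elements of the array can be accessed like in any other array.
constexpr char third = "abc"[2]; // 'c'
```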

In practice std::cout does not convert text: it passes the bytes of char strings on as they are, and it is the terminal that interprets them. Most terminals and source code editors use UTF-8 by default, so you usually shouldn't have problems storing and outputting Unicode text in your programs. If you do, it's usually a matter of the project's compiler settings or of the terminal's encoding (the Windows console in particular may not default to UTF-8).

std::cout will not accept every possible character type. For wchar_t (and its arrays) you will need to use std::wcout.

String literal concatenation

If multiple string literals are next to each other, with only whitespace between them, they behave as one long string literal. This allows splitting and formatting large blocks of text embedded in code without introducing unwanted line breaks.

// this will be one long string literal
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
"tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, "
"quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo "
"consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse "
"cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat "
"non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Each quoted string can have its own prefix. If one prefix is present, the concatenated string will have the type specified by that prefix. If multiple prefixes are present, they all must be the same.
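A short sketch of both rules, assuming a C++17 compiler (std::string_view is used only because it can be compared at compile time; the name joined is made up):

```cpp
#include <string_view>
#include <type_traits>

// Adjacent literals are joined at compile time; no line break is inserted.
constexpr std::string_view joined =
    "first part, "
    "second part";
static_assert(joined == "first part, second part");

// One prefixed piece is enough to decide the type of the whole literal.
static_assert(std::is_same_v<decltype(u"ab" "cd"), const char16_t (&)[5]>);

// Identical prefixes on every piece are fine as well.
static_assert(std::is_same_v<decltype(L"ab" L"cd"), const wchar_t (&)[5]>);

// Different prefixes must not be mixed:
// auto bad = u"ab" U"cd"; // not allowed
```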

The placeholder text in the example above is Lorem ipsum: https://en.wikipedia.org/wiki/Lorem_ipsum

Escape sequences

Some characters cannot be represented easily in text - you have already seen this with line breaks. How about other conflicts? What if we want to output ' or "?

Escape sequences allow you to represent troublesome characters in source code by specifying their numeric value or a special predefined sequence.

sequence    description                                                representation
\'          single quote                                               byte 0x27
\"          double quote                                               byte 0x22
\?          question mark                                              byte 0x3f
\\          backslash                                                  byte 0x5c
\a          audible bell                                               byte 0x07
\b          backspace                                                  byte 0x08
\f          form feed (new page)                                       byte 0x0c
\n          line feed (new line)                                       byte 0x0a
\r          carriage return                                            byte 0x0d
\t          horizontal tab                                             byte 0x09
\v          vertical tab                                               byte 0x0b
\nnn        arbitrary octal value                                      byte nnn
\xnn        arbitrary hexadecimal value                                byte nn
\c          implementation-defined (c: any character not listed here)  implementation-defined
\unnnn      arbitrary Unicode value; may result in several code units  code point U+nnnn
\Unnnnnnnn  arbitrary Unicode value; may result in several code units  code point U+nnnnnnnn

A very common sequence is \0 (represented as byte 0x00) which is the null character, used to denote end of data in various contexts.
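Some rows of the table, checked at compile time. The sketch assumes a C++17 compiler and an ASCII-compatible encoding; with_null and visible_length are names made up for the example.

```cpp
#include <cstddef>
#include <cstring>

// Simple escapes stand for a single character with a fixed value.
static_assert('\'' == 0x27);
static_assert('\"' == 0x22);
static_assert('\\' == 0x5c);
static_assert('\n' == 0x0a);
static_assert('\t' == 0x09);

// Octal and hexadecimal escapes give the value directly: both are 'A'.
static_assert('\101' == 'A');
static_assert('\x41' == 'A');
static_assert('A' == 65);

// \0 is a real element of the array, but functions that expect
// null-terminated text stop at the first one they find.
constexpr char with_null[] = "ab\0cd"; // 6 elements: a b \0 c d \0
static_assert(sizeof(with_null) == 6);

// Hypothetical helper: length as seen by C string functions.
inline std::size_t visible_length() { return std::strlen(with_null); }
```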

\? is not necessary: you can write ? directly too, but the sequence is kept for backwards compatibility. In the past (before C++17) there was a very weird feature called trigraphs: special 3-character sequences starting with ??, replaced even before comments were parsed. An unescaped ?? in the text could trigger this feature by accident.

Because Unicode is ASCII-compatible and ASCII is backwards compatible with very old telegraph systems, you can see some historical control characters:

  • \a caused the machine to output specific sound - see https://en.wikipedia.org/wiki/Bell_character for its history

  • \f - https://en.wikipedia.org/wiki/Page_break#Form_feed

  • \r was used to cause the machine to reset its position to the beginning of a line. The \r\n sequence was very common and, in fact, Windows uses it to this day: the Enter key in Windows-based programs produces this 2-character sequence, while other systems produce only \n. Many editors have a setting for how line endings should be written: LF (Unix) or CRLF (Windows). As \r has lost its meaning in the era of screens (not telegrams), programs which display text simply ignore this character. For more history, see https://en.wikipedia.org/wiki/Carriage_return.

  • \t, \v - historically they meant advancing to the next multiple of 8 characters horizontally and 6 lines vertically. \v is not used anymore, but \t is still widely used to indent code. Editors often allow you to change the tab size (usually 2/4/8) and to convert indentation to/from spaces.

Backspace (\b) is used by keyboards to indicate a pressed backspace key. If you use this character in a program, its meaning can differ depending on which other program consumes the data:

  • If \b is written to a text file, it's up to the file reading/displaying program what will be done with it. Most will simply ignore this character.

  • If \b is written to an interactive shell (such as the one in which you run your compiled programs), the shell will usually discard/overwrite the previously output character, just as if it were a telegraph machine. A similar behavior can be observed with \r, which will discard/overwrite the entire line.
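The \r behavior is sometimes used to redraw a single line, for example a progress indicator. This is only a sketch: progress_frame and show_progress are hypothetical helpers, and the visible result depends on the terminal.

```cpp
#include <initializer_list>
#include <iostream>
#include <string>

// Hypothetical helper: builds one frame of a progress indicator.
// The leading \r returns to the start of the line, so on most terminals
// each frame is drawn over the previous one.
inline std::string progress_frame(int percent) {
    return "\rProgress: " + std::to_string(percent) + "%";
}

// Writes three frames to the same line, then ends the line.
inline void show_progress() {
    for (int percent : {0, 50, 100})
        std::cout << progress_frame(percent) << std::flush;
    std::cout << '\n';
}
```

If the output is redirected to a file, the \r characters are stored as they are and it is up to the program reading the file what to do with them.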

Raw string literals

An alternative to escape sequences are raw strings, in which special characters lose their meaning and everything between the delimiters is treated as it is.

The syntax is:

prefix(optional) R"delimiter(raw_characters)delimiter"

Example:

// all are equivalent
R"(\a\b\c\d\e\f)"
R"xXx(\a\b\c\d\e\f)xXx"
"\\a\\b\\c\\d\\e\\f"

Raw string literals may span multiple code lines (without concatenation) and they will contain all characters between the delimiters, including whitespace such as line breaks.

Raw string literals can be concatenated with other string literals.
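The properties listed above, as a compile-time sketch (it assumes a C++17 compiler; the variable names are made up for the example):

```cpp
#include <string_view>

// Inside a raw string a backslash is an ordinary character.
static_assert(std::string_view(R"(\a\b)") == "\\a\\b");

// A custom delimiter lets the text contain )" itself.
constexpr std::string_view quoted = R"xXx(say "(hi)")xXx";
static_assert(quoted == "say \"(hi)\"");

// Line breaks between the delimiters become part of the string.
constexpr std::string_view two_lines = R"(first
second)";
static_assert(two_lines == "first\nsecond");

// Raw and ordinary literals can be concatenated.
constexpr std::string_view mixed = R"(C:\temp\)" "file.txt\n";
static_assert(mixed == "C:\\temp\\file.txt\n");
```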

Floating-point literals

By default, floating-point literals are of type double.

  • With suffix f or F, they are float.

  • With suffix l or L, they are long double.

Floating-point literals support various formats, including exponential notation and hexadecimal fractions. When a dot (.) is used, the digit sequence on one side of it may be omitted.

Examples:

42.0f // 42, float
42.f  // 42, float
0.42l // 0.42, long double
.42l  // 0.42, long double
3e10  // 30000000000, double
123.456e-67 // 123.456 * 10^(-67), double
.1E4l       // 1000, long double
0x10.1p2f   // 64.25 (16.0625 * 2^2), float (hexadecimal digits, requires C++17)
0x1.2p3     // 9 (1.125 * 2^3), double (hexadecimal digits, requires C++17)

NAN // NaN, float, macro constant defined in <cmath>
    // it's implementation-defined, may not exist if not supported
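The types and values from the listing can be confirmed with static_assert. The sketch assumes a C++17 compiler (needed for the hexadecimal notation); the name nine is made up.

```cpp
#include <type_traits>

// The suffix selects the type.
static_assert(std::is_same_v<decltype(42.0), double>);
static_assert(std::is_same_v<decltype(42.f), float>);
static_assert(std::is_same_v<decltype(.42l), long double>);

// Without a dot or an exponent the literal is an integer, not a double.
static_assert(std::is_same_v<decltype(42), int>);

// Exponent notation: the number after e/E is a power of 10.
static_assert(3e10 == 30000000000.0);
static_assert(.1E4l == 1000.0l);

// Hexadecimal notation: the number after p/P is a power of 2.
static_assert(0x10.1p2f == 64.25f); // (16 + 1/16) * 2^2
static_assert(0x1.2p3 == 9.0);      // (1 + 2/16) * 2^3

constexpr double nine = 0x1.2p3; // hypothetical name
```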

Other literals

It's worth noting that not all literals have to be made of characters or digits - some literals are keywords. You already know 2 of them: false and true are literals of type bool.

Later you will learn about one more keyword literal - nullptr.

Automatic type

A simple but very useful feature is the type placeholder auto. It will deduce the type based on the expression used in initialization:

auto b = true; // bool
auto i = 1;    // int
auto l = 1l;   // long
auto f = 1.0f; // float
auto x; // error: can't deduce without initializer
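The deduced types can be confirmed with decltype, which reports the type of a variable. A minimal sketch, assuming a C++17 compiler:

```cpp
#include <type_traits>

auto b = true; // bool
auto i = 1;    // int
auto l = 1l;   // long
auto u = 1u;   // unsigned int
auto d = 1.0;  // double
auto f = 1.0f; // float
auto c = 'a';  // char

// The deduced type is exactly the type of the literal.
static_assert(std::is_same_v<decltype(b), bool>);
static_assert(std::is_same_v<decltype(i), int>);
static_assert(std::is_same_v<decltype(l), long>);
static_assert(std::is_same_v<decltype(u), unsigned int>);
static_assert(std::is_same_v<decltype(d), double>);
static_assert(std::is_same_v<decltype(f), float>);
static_assert(std::is_same_v<decltype(c), char>);
```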

Exercise

Try to output the text "\n??='\\ using both escape sequences and raw strings.