6.1 Lexical conventions
Blanks
The following characters are considered blanks: space, horizontal
tabulation, carriage return, line feed and form feed. Blanks are
ignored, but they separate adjacent identifiers, literals and
keywords that would otherwise be read as a single identifier,
literal or keyword.
Comments
Comments are introduced by the two characters (*, with no
intervening blanks, and terminated by the characters *), with
no intervening blanks. Comments are treated as blank characters.
The sequences (* and *) are not treated as comment delimiters inside
string or character literals. Comments may be nested: each (* must be
closed by its own matching *).
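The rules above can be illustrated with a minimal sketch. The names answer, after_nested and not_a_comment are hypothetical and chosen only for this example.

```ocaml
(* A comment may span
   several lines. *)
let answer = 42 (* a comment counts as a blank between tokens *)

(* Outer comment (* nested comment *) still inside the outer comment. *)
let after_nested = answer + 1

(* Inside a string literal, comment delimiters are ordinary characters. *)
let not_a_comment = "(* still a string *)"
```

The last definition binds a 20-character string; nothing inside the double quotes is skipped by the lexer.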
Identifiers
ident  ::= (letter ∣ _) { letter ∣ 0…9 ∣ _ ∣ ' }
letter ::= A … Z ∣ a … z
Identifiers are sequences of letters, digits, _ (the underscore
character), and ' (the single quote), starting with a
letter or an underscore.
Letters include at least the 52 lowercase and uppercase
letters from the ASCII set. The current implementation (except on
MacOS 9) also recognizes as letters all accented characters from the ISO
8859-1 (“ISO Latin 1”) set. All characters in an identifier are
significant. The current implementation accepts identifiers up to
16000000 characters in length.
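As a short sketch of the identifier rules, the definitions below use only ASCII letters, digits, underscores and quotes. All names are hypothetical and serve only as examples.

```ocaml
let x = 1
let x' = x + 1                    (* a quote may follow the first character *)
let _scratch = 0                  (* an identifier may start with an underscore *)
let point_2d = (3, 4)             (* digits are allowed after the first character *)

(* Every character is significant: these two names differ only in the last one. *)
let total_count_of_items_a = 10
let total_count_of_items_b = 20
```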
Integer literals
integer-literal ::= [-] (0…9) { 0…9 ∣ _ }
                  ∣ [-] (0x ∣ 0X) (0…9 ∣ A…F ∣ a…f) { 0…9 ∣ A…F ∣ a…f ∣ _ }
                  ∣ [-] (0o ∣ 0O) (0…7) { 0…7 ∣ _ }
                  ∣ [-] (0b ∣ 0B) (0…1) { 0…1 ∣ _ }
An integer literal is a sequence of one or more digits, optionally
preceded by a minus sign. By default, integer literals are in decimal
(radix 10). The following prefixes select a different radix:
Prefix | Radix
0x, 0X | hexadecimal (radix 16)
0o, 0O | octal (radix 8)
0b, 0B | binary (radix 2)
(The initial 0 is the digit zero; the O for octal is the letter O.)
The interpretation of integer literals that fall outside the range of
representable integer values is undefined.
For convenience and readability, underscore characters (_) are accepted
(and ignored) within integer literals.
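The following sketch writes the same value in each radix and shows underscores used as digit separators. The binding names are hypothetical.

```ocaml
let decimal = 255
let hex     = 0xFF          (* radix 16 *)
let octal   = 0o377         (* radix 8: digit zero, then the letter o *)
let binary  = 0b1111_1111   (* radix 2, with an underscore for readability *)

let million  = 1_000_000    (* underscores are ignored *)
let negative = -0x10        (* a minus sign may precede any radix *)
```

The first four bindings denote the same integer, and million equals 1000000.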
Floating-point literals
float-literal ::= [-] (0…9) { 0…9 ∣ _ } [. { 0…9 ∣ _ }]
                  [(e ∣ E) [+ ∣ -] (0…9) { 0…9 ∣ _ }]
Floating-point literals consist of an integer part, a decimal part and
an exponent part. The integer part is a sequence of one or more
digits, optionally preceded by a minus sign. The decimal part is a
decimal point followed by zero or more digits.
The exponent part is the character e or E followed by an
optional + or - sign, followed by one or more digits.
The decimal part or the exponent part can be omitted, but not both, to
avoid ambiguity with integer literals.
The interpretation of floating-point literals that fall outside the
range of representable floating-point values is undefined.
For convenience and readability, underscore characters (_) are accepted
(and ignored) within floating-point literals.
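A few literals that fit the grammar above, as a minimal sketch with hypothetical binding names:

```ocaml
let a = 1.            (* decimal part with zero digits, no exponent *)
let b = 1e3           (* exponent part only, no decimal part *)
let c = 1.5E-2        (* both parts, with a signed exponent *)
let d = 1_000.000_1   (* underscores are ignored *)
let e = -0.25         (* optional leading minus sign *)
```

A literal such as 1 with neither part is an integer literal, and the integer part cannot be omitted, so .5 is not a floating-point literal.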
Character literals
char-literal    ::= ' regular-char '
                  ∣ ' escape-sequence '
escape-sequence ::= \ (\ ∣ " ∣ ' ∣ n ∣ t ∣ b ∣ r ∣ space)
                  ∣ \ (0…9) (0…9) (0…9)
                  ∣ \x (0…9 ∣ A…F ∣ a…f) (0…9 ∣ A…F ∣ a…f)
Character literals are delimited by ' (single quote) characters.
The two single quotes enclose either one character different from
' and \, or one of the escape sequences below:
Sequence | Character denoted
\\ | backslash (\)
\" | double quote (")
\' | single quote (')
\n | linefeed (LF)
\r | carriage return (CR)
\t | horizontal tabulation (TAB)
\b | backspace (BS)
\space | space (SPC)
\ddd | the character with ASCII code ddd in decimal
\xhh | the character with ASCII code hh in hexadecimal
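As an illustration of the escape sequences, the sketch below writes the letter A in three ways and checks a few character codes with Char.code from the standard library. The binding names are hypothetical.

```ocaml
let plain_a   = 'A'
let decimal_a = '\065'   (* three decimal digits: ASCII code 65 *)
let hex_a     = '\x41'   (* two hexadecimal digits: ASCII code 0x41 *)

let newline   = '\n'
let quote     = '\''     (* the single quote must be escaped *)
let backslash = '\\'     (* and so must the backslash *)

let newline_code = Char.code newline
```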
String literals
string-literal   ::= " { string-character } "
string-character ::= regular-char-str
                   ∣ escape-sequence
String literals are delimited by " (double quote) characters.
The two double quotes enclose a sequence of either characters
different from " and \, or escape sequences from the
table given above for character literals.
To allow splitting long string literals across lines, the sequence
\newline blanks (a \ at end-of-line followed by any
number of blanks at the beginning of the next line) is ignored inside
string literals.
The current implementation places practically no restrictions on the
length of string literals.
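The sketch below shows escape sequences inside a string and a literal split across two lines. The binding names are hypothetical.

```ocaml
(* Escaped double quotes and a trailing linefeed. *)
let greeting = "Hello, \"world\"\n"

(* The backslash at end of line, the newline, and the blanks that begin
   the next line are all dropped; the space before the backslash is kept. *)
let long_text = "The quick brown fox \
                 jumps over the lazy dog"
```

Here long_text is a single-line string with exactly one space between fox and jumps.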
Naming labels
To avoid ambiguities, naming labels cannot just be defined
syntactically as the sequence of the three tokens ~, ident and
:, and have to be defined at the lexical level.
label    ::= ~ (a … z) { letter ∣ 0…9 ∣ _ ∣ ' } :
optlabel ::= ? (a … z) { letter ∣ 0…9 ∣ _ ∣ ' } :
Naming labels come in two flavours: label for normal arguments and
optlabel for optional ones. They are simply distinguished by their
first character, either ~ or ?.
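To show where these tokens appear, here is a minimal sketch with two hypothetical functions, area and scaled_area. Each of ~width:, ~height: and ?scale: is read as a single token.

```ocaml
(* ~width: and ~height: are label tokens for normal arguments. *)
let area ~width:w ~height:h = w * h

(* ?scale: is an optlabel token; the argument defaults to 1 when omitted. *)
let scaled_area ?scale:(s = 1) ~width:w ~height:h () = s * w * h

let plain     = area ~width:3 ~height:4
let doubled   = scaled_area ?scale:(Some 2) ~width:3 ~height:4 ()
let defaulted = scaled_area ~width:3 ~height:4 ()
```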
Prefix and infix symbols
infix-symbol  ::= (= ∣ < ∣ > ∣ @ ∣ ^ ∣ | ∣ & ∣ + ∣ - ∣ * ∣ / ∣ $ ∣ %) { operator-char }
prefix-symbol ::= (! ∣ ? ∣ ~) { operator-char }
operator-char ::= ! ∣ $ ∣ % ∣ & ∣ * ∣ + ∣ - ∣ . ∣ / ∣ : ∣ < ∣ = ∣ > ∣ ? ∣ @ ∣ ^ ∣ | ∣ ~
Sequences of “operator characters”, such as <=> or !!,
are read as a single token from the infix-symbol or prefix-symbol
class. These symbols are parsed as prefix and infix operators inside
expressions, but otherwise behave much as identifiers.
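As a sketch, the two symbols mentioned above can be given definitions and then used like any other value. The meanings chosen here (a three-way comparison and "apply twice") are assumptions made for the example, not predefined operators.

```ocaml
(* <=> begins with <, so it is an infix-symbol. *)
let ( <=> ) a b = compare a b

(* !! begins with !, so it is a prefix-symbol. *)
let ( !! ) f = fun x -> f (f x)

let ordering = 1 <=> 2                      (* used as an infix operator *)
let three    = !!succ 1                     (* used as a prefix operator *)
let sorted   = List.sort ( <=> ) [3; 1; 2]  (* passed around like an identifier *)
```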
Keywords
The identifiers below are reserved as keywords, and cannot be employed
otherwise:
and as assert asr begin class
constraint do done downto else end
exception external false for fun function
functor if in include inherit initializer
land lazy let lor lsl lsr
lxor match method mod module mutable
new object of open or private
rec sig struct then to true
try type val virtual when while
with
The following character sequences are also keywords:
!= # & && ' ( ) * + , -
-. -> . .. : :: := :> ; ;; <
<- = > >] >} ? ?? [ [< [> [|
] _ ` { {< | |] } ~
Note that the following identifiers and character sequences are keywords
of the Camlp4 extensions and should be avoided for compatibility reasons:
parser << <: >> $ $$ $:
Ambiguities
Lexical ambiguities are resolved according to the “longest match”
rule: when a character sequence can be decomposed into two tokens in
several different ways, the decomposition retained is the one with the
longest first token.
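Two consequences of the longest-match rule can be sketched as follows. The operator +- and its definition are hypothetical, introduced only to make the tokenization visible.

```ocaml
(* A user-defined infix-symbol made of two operator characters. *)
let ( +- ) a b = a * b

let one_token  = 10+-3     (* read as 10 (+-) 3, since +- is the longest first token *)
let two_tokens = 10 + -3   (* the blank separates + from the minus sign *)

(* The spaces matter here: without them, the parenthesis followed by
   the star would be read as the start of a comment. *)
let mul = ( * )
```

With this definition of +-, the first binding evaluates to 30 and the second to 7.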
Line number directives
linenum-directive ::= # {0 … 9}+
                    ∣ # {0 … 9}+ " { string-character } "
Preprocessors that generate Caml source code can insert line number
directives in their output so that error messages produced by the
compiler contain line numbers and file names referring to the source
file before preprocessing, instead of after preprocessing.
A line number directive is composed of a # (sharp sign), followed by
a positive integer (the source line number), optionally followed by a
character string (the source file name).
Line number directives are treated as blank characters during lexical
analysis.
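The effect of a directive can be observed through the location carried by the Assert_failure exception, as in the sketch below. The file name grammar.mly, the line number 12 and the helper position_of_failure are hypothetical; the directive is assumed to start at the beginning of a line, as a preprocessor would emit it.

```ocaml
(* Run f and return the (file, line) recorded in an Assert_failure, if any. *)
let position_of_failure f =
  try f (); ("", 0) with Assert_failure (file, line, _) -> (file, line)

# 12 "grammar.mly"
let reported = position_of_failure (fun () -> assert false)
```

The line following the directive is taken to be line 12 of grammar.mly, so reported should hold that file name and line number rather than the position in the generated file.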