2.2. Syntax
This chapter describes how program texts are syntactically analyzed.
2.2.1. Steps of analysis
A program text is a Unicode string.
A program text is first sliced into a sequence of lexemes by repeatedly taking the longest possible match, from the start of the text to the end.
Next, the lexeme sequence is reduced to a sequence of tokens by filtering out non-token lexemes.
Finally, the token sequence is parsed as a program abstract source tree.
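The first two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the names `LEXEME_PATTERNS`, `slice_lexemes` and `to_tokens` are hypothetical, and only a few lexeme categories from this chapter (whitespace, comments, symbols and base10 nums) are covered. The parsing step is omitted.

```python
import re

# Hypothetical subset of the lexeme patterns given in the tables of this chapter.
LEXEME_PATTERNS = [
    ("whitespace", re.compile(r"[\u0020\u000d\u000a]")),
    ("comment", re.compile(r"#[^\u000a]*")),
    ("verb", re.compile(r"[a-z_][a-z0-9_?]*")),
    ("noun", re.compile(r"([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]*")),
    ("num", re.compile(r"[0-9][0-9_]*(\.[0-9][0-9_]*)?(?![a-zA-Z0-9_?])")),
]


def slice_lexemes(text):
    """Step 1: slice the text into lexemes, taking the longest match each time."""
    pos, lexemes = 0, []
    while pos < len(text):
        candidates = [
            (m.end(), category)
            for category, pattern in LEXEME_PATTERNS
            if (m := pattern.match(text, pos))
        ]
        if not candidates:
            raise SyntaxError(f"no lexeme matches at offset {pos}")
        end, category = max(candidates, key=lambda c: c[0])
        lexemes.append((category, text[pos:end]))
        pos = end
    return lexemes


def to_tokens(lexemes):
    """Step 2: drop the non-token lexemes (whitespace and comments)."""
    return [lx for lx in lexemes if lx[0] not in ("whitespace", "comment")]


tokens = to_tokens(slice_lexemes("take_5 Count 42 # answer"))
print(tokens)  # [('verb', 'take_5'), ('noun', 'Count'), ('num', '42')]
```

Note that longest matching is what makes a symbol such as rarely_Used a single noun rather than the verb rarely_ followed by a noun.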
2.2.2. Lexemes
Whitespace and comments
Regular expression | Lexeme category |
---|---|
[\u0020\u000d\u000a] | Whitespace |
#[^\u000a]* | Comment |
Whitespace characters are used to make a program text easy to read and to separate tokens from each other. Space (U+0020), carriage return (U+000d) and line feed (U+000a) are whitespace characters.
Note
Horizontal tab (U+0009) is not a valid whitespace character. This design decision is intended to avoid bikeshed discussion about indentation.
The categories of the lexemes (, [ and { may differ when they are located after whitespace characters.
See the description of terminal symbols for details.
Comments are used to describe the program text.
A number sign # (U+0023) indicates the start of a comment.
The comment continues to the end of the line, that is, up to but not including the next line feed character, or to the end of the program text.
Comments are treated as whitespace characters.
stdout.print_line('foo'*2) # => foofoo
stdout.print_line( 'foo' * 2 ) # => foofoo
# Comment line
do_something # trailing comment
Whitespace characters and comments are not tokens.
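The two regular expressions above can be tried directly with Python's `re` module. The following is a small sketch; the names `WHITESPACE`, `COMMENT` and `comment_at` are hypothetical helpers introduced only for this illustration.

```python
import re

WHITESPACE = re.compile(r"[\u0020\u000d\u000a]")
COMMENT = re.compile(r"#[^\u000a]*")


def comment_at(text, pos):
    """Return the comment lexeme starting at pos, or None if there is none."""
    m = COMMENT.match(text, pos)
    return m.group() if m else None


line = "do_something # trailing comment\nnext_line"

# The comment stops before the line feed character.
print(comment_at(line, line.index("#")))  # '# trailing comment' without the newline

# Horizontal tab (U+0009) is not matched by the whitespace pattern.
print(WHITESPACE.fullmatch("\t") is None)  # True
```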
Symbol tokens
Regular expression | Lexeme category |
---|---|
[a-z_][a-z0-9_?]* | Verb symbol token |
([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]* | Noun symbol token |
A symbol token consists of a leading ASCII letter (a-zA-Z) or an underscore _ (U+005f),
followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9), underscores _ (U+005f) and question marks ? (U+003f).
There are two types of symbols: verbs and nouns.
A verb symbol token is a symbol token which does not contain uppercase letters [A-Z].
Verbs are commonly used for names of variables which contain functions. The following are examples of verbs.
empty?
_loop
take_5
A noun symbol token is a symbol token which contains one or more uppercase letters [A-Z].
Nouns are commonly used for names of variables which contain ordinary values. The following are examples of nouns.
More_lines?
ArrayList_class
FLAT_MAP
_HASH_TABLE
rarely_Used
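The distinction between verbs and nouns can be checked mechanically with the two regular expressions from the table. The sketch below uses a hypothetical helper named `classify_symbol`; it only classifies a string that is already known to be a candidate symbol.

```python
import re

VERB = re.compile(r"[a-z_][a-z0-9_?]*")
NOUN = re.compile(r"([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]*")


def classify_symbol(text):
    """Return 'verb', 'noun', or None when text is not a single symbol token."""
    if VERB.fullmatch(text):
        return "verb"
    if NOUN.fullmatch(text):
        return "noun"
    return None


for symbol in ["empty?", "_loop", "take_5", "More_lines?", "_HASH_TABLE", "rarely_Used"]:
    print(symbol, classify_symbol(symbol))
```

The two patterns are mutually exclusive: the verb pattern admits no uppercase letter, while the noun pattern requires at least one.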
Num tokens
Regular expression | Lexeme category |
---|---|
[0-9][0-9_]*(\.[0-9][0-9_]*)?(?![a-zA-Z0-9_?]) | Num token (base10) |
0b_*[01][01_]*(?![a-zA-Z0-9_?]) | Num token (base2) |
0x_*[0-9a-f][0-9a-f_]*(?![a-zA-Z0-9_?]) | Num token (base16) |
There are three types of num tokens: base10, base16 and base2. A base10 num token represents a num in decimal, base16 in hexadecimal, and base2 in binary.
The fractional part can be represented only by base10 num tokens.
Digits before the period . (U+002e) represent the integer portion,
and digits after the period represent the fractional portion.
Underscore characters _ (U+005f) can be inserted for spacing.
These are examples of num tokens, each of which represents 42.
42
42__
0042
0x2a
0b_10_1010
These are examples of num tokens with the fractional portion.
0.0
0.001
3.141_592_653
No num token, whether base10, base2 or base16, can be directly followed by a character which can form a symbol.
This restriction is expressed by the negative lookahead (?![a-zA-Z0-9_?]).
Thus, for example, the code fragment 24h causes a syntax error.
Note
Without this rule, 0b123 would, for example, be parsed as 0b1 and 23, which is certainly error-prone.
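The effect of the lookahead can be observed with Python's `re` module. In this sketch, `NUM_BASE2_NO_GUARD` is a hypothetical variant of the base2 pattern with the lookahead removed, introduced only to show what the rule prevents; `match_text` is likewise a hypothetical helper.

```python
import re

NUM_BASE10 = re.compile(r"[0-9][0-9_]*(\.[0-9][0-9_]*)?(?![a-zA-Z0-9_?])")
NUM_BASE2 = re.compile(r"0b_*[01][01_]*(?![a-zA-Z0-9_?])")
# Hypothetical variant without the lookahead, for comparison only.
NUM_BASE2_NO_GUARD = re.compile(r"0b_*[01][01_]*")


def match_text(pattern, text):
    """Return the text matched at the start of the input, or None."""
    m = pattern.match(text)
    return m.group() if m else None


print(match_text(NUM_BASE2, "0b123"))           # None: no num token matches here
print(match_text(NUM_BASE2_NO_GUARD, "0b123"))  # 0b1: the rest, 23, would be a second token
print(match_text(NUM_BASE10, "24h"))            # None: 24 is directly followed by h
print(match_text(NUM_BASE10, "24 h"))           # 24
```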
Each num token represents a number whose scale is the number of digits in the fractional portion, and whose mantissa is the integer formed by all the digits. For example, 3.14 has the scale 2 and the mantissa 314.
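The mantissa and the scale can be computed from the text of a num token as sketched below. The function name `decode_num` is hypothetical, the input is assumed to be an already valid num token, and the sketch assumes the usual reading that the represented value equals the mantissa multiplied by 10 to the power of minus the scale.

```python
def decode_num(token):
    """Return (mantissa, scale) of a valid num token, ignoring spacing underscores."""
    digits = token.replace("_", "")
    if digits.startswith("0x"):
        return int(digits[2:], 16), 0  # base16 tokens have no fractional portion
    if digits.startswith("0b"):
        return int(digits[2:], 2), 0   # base2 tokens have no fractional portion
    integer, _, fraction = digits.partition(".")
    return int(integer + fraction), len(fraction)


for token in ["42", "42__", "0042", "0x2a", "0b_10_1010"]:
    print(token, decode_num(token))  # each line ends with (42, 0)

print(decode_num("0.001"))           # (1, 3)
print(decode_num("3.141_592_653"))   # (3141592653, 9)
```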
String tokens
Regular expression | Lexeme category |
---|---|
'(''|[^'])*' | String token (simple) |
"([^"\\]|\\[0abtnvfre"\\]|\\x\{(0[0-9a-f]{0,5}|[2-9a-f][0-9a-f]{0,4}|10[0-9a-f]{0,4}|1[0-9a-f]{0,4})\})*" | String token (rich) |
There are two types of string tokens: simple and rich.
In a simple string token, all characters between the two enclosing single quotation marks ' (U+0027) are the content of the string.
To include a single quotation mark itself in the string, write two consecutive single quotation marks.
These are examples of simple strings.
'Hello world'
'Let''s go!' (it represents "Let's go!")
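Decoding a simple string token therefore takes only two operations: dropping the enclosing quotation marks and collapsing each doubled quotation mark. A minimal sketch, with the hypothetical function name `decode_simple_string` and the assumption that the input is already a valid simple string token:

```python
def decode_simple_string(token):
    """Return the content represented by a simple string token."""
    if len(token) < 2 or token[0] != "'" or token[-1] != "'":
        raise ValueError("not a simple string token")
    # Two consecutive quotation marks inside the token stand for one.
    return token[1:-1].replace("''", "'")


print(decode_simple_string("'Hello world'"))  # Hello world
print(decode_simple_string("'Let''s go!'"))   # Let's go!
```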
In a rich string token, characters between the two enclosing double quotation marks " (U+0022) are the content of the string.
In the token, a sequence of characters prefixed by a backslash \ (U+005c) represents a special character, such as a line feed (\n) or a double quotation mark (\").
These are examples of rich string tokens.
"Let's go!"
"GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"
Here is a list of backslash notations.
Notation | Unicode | Description |
---|---|---|
\0 | U+0000 | Null character |
\a | U+0007 | Bell |
\b | U+0008 | Backspace |
\t | U+0009 | Horizontal tab |
\n | U+000a | Line feed |
\v | U+000b | Vertical tab |
\f | U+000c | Form feed |
\r | U+000d | Carriage return |
\e | U+001b | Escape |
\" | U+0022 | Double quotation mark " |
\\ | U+005c | Backslash \ |
\x{xxxxxx} | U+xxxxxx | Character specified by its Unicode code point. xxxxxx is one to six hexadecimal digits (0-9a-f), whose value must be less than or equal to 10ffff. |
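The backslash notations listed above can be decoded as sketched below. The names `SIMPLE_ESCAPES`, `ESCAPE` and `decode_rich_string` are hypothetical, and the input is assumed to be a rich string token including its enclosing double quotation marks; the range check repeats the limit of 10ffff stated in the table.

```python
import re

# Single-character notations from the table, mapped to the characters they represent.
SIMPLE_ESCAPES = {
    "0": "\u0000", "a": "\u0007", "b": "\u0008", "t": "\u0009",
    "n": "\u000a", "v": "\u000b", "f": "\u000c", "r": "\u000d",
    "e": "\u001b", '"': '"', "\\": "\\",
}
# A backslash followed by either x{hex digits} or a single character.
ESCAPE = re.compile(r'\\(?:x\{([0-9a-f]{1,6})\}|(.))', re.DOTALL)


def decode_rich_string(token):
    """Return the content represented by a rich string token."""
    def replace(m):
        if m.group(1) is not None:
            code_point = int(m.group(1), 16)
            if code_point > 0x10FFFF:
                raise ValueError("code point is greater than 10ffff")
            return chr(code_point)
        notation = m.group(2)
        if notation not in SIMPLE_ESCAPES:
            raise ValueError(f"unknown backslash notation \\{notation}")
        return SIMPLE_ESCAPES[notation]

    return ESCAPE.sub(replace, token[1:-1])


print(repr(decode_rich_string(r'"Host: host.example.org\r\n"')))
print(decode_rich_string(r'"\x{41}\x{42}"'))  # AB
```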
Mark tokens
Tokens other than those described above are generically called mark tokens.
Some lexemes are reduced to different mark tokens depending on the following conditions.
- WS
- There are tokens preceding the current lexeme, and one or more whitespace characters are placed between the previous token and the current lexeme.
- not WS
- There are no tokens preceding the current lexeme, or no whitespace character is placed between the previous token and the current lexeme.
Pattern | Condition | Mark token | Note |
---|---|---|---|
literal ! | always | ! | op_lognot operator |
literal ~ | always | ~ | op_not operator |
literal <- | always | <- | op_store operator |
literal || | always | || | op_logor operator |
literal && | always | && | op_logand operator |
literal == | always | == | op_eq operator |
literal != | always | != | not-equal operator |
literal < | always | < | op_lt operator |
literal > | always | > | greater-than operator |
literal <= | always | <= | less-than-or-equal-to operator |
literal >= | always | >= | greater-than-or-equal-to operator |
literal | | always | | | op_or operator |
literal ^ | always | ^ | op_xor operator |
literal & | always | & | op_and operator |
literal << | always | << | op_shl operator |
literal >> | always | >> | op_shr operator |
literal + | always | + | op_add operator |
literal - | always | - | op_sub operator, or op_minus operator |
literal * | always | * | op_mul operator |
literal / | always | / | op_div operator |
literal // | always | // | op_intdiv operator |
literal % | always | % | op_rem operator |
literal = | always | = | Let clauses |
literal : | not WS | COLON | Local or attributional variable references |
literal : | WS | WS_COLON | Local variable references |
literal $ | not WS | DOLLAR | Local or attributional dereference of verb variables |
literal $ | WS | WS_DOLLAR | Local dereference of verb variables |
literal . | always | . | Access to variables or functions |
literal !! | always | !! | Direct call expressions |
literal ... | always | ... | Elements spreading |
literal [ | not WS | OPENBRACKET | Vectors, formal receivers, or actual receivers |
literal [ | WS | WS_OPENBRACKET | Vectors |
literal ] | always | ] | Closing |
literal { | not WS | OPENBRACE | Function expressions or function arguments |
literal { | WS | WS_OPENBRACE | Function expressions |
literal } | always | } | Closing |
literal ( | not WS | OPENPAREN | Parentheses expressions, formal arguments, or actual arguments |
literal ( | WS | WS_OPENPAREN | Parentheses expressions |
literal ) | always | ) | Closing |
regex \binding(?![a-zA-Z0-9_?]) | always | \binding | Context binding expressions |
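The WS and not WS conditions can be expressed as a small function. This is only a sketch of the rule as stated in the table: the names `WS_SENSITIVE` and `mark_token` are hypothetical, and the two boolean parameters are assumed to be supplied by the lexer.

```python
# Lexemes whose mark token depends on the WS condition.
WS_SENSITIVE = {
    ":": "COLON",
    "$": "DOLLAR",
    "[": "OPENBRACKET",
    "{": "OPENBRACE",
    "(": "OPENPAREN",
}


def mark_token(lexeme, has_previous_token, whitespace_before):
    """Reduce a whitespace-sensitive lexeme to the name of its mark token."""
    name = WS_SENSITIVE[lexeme]
    # WS holds only when a token precedes the lexeme AND whitespace separates them.
    is_ws = has_previous_token and whitespace_before
    return "WS_" + name if is_ws else name


print(mark_token("(", True, False))  # OPENPAREN, as in f(x)
print(mark_token("(", True, True))   # WS_OPENPAREN, as in f (x)
print(mark_token("[", False, True))  # OPENBRACKET: no token precedes the lexeme
```

Because whitespace at the very beginning of a program text has no preceding token, a lexeme there always satisfies the not WS condition.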
2.2.3. Abstract source tree
The following list describes the rules for parsing the program abstract source tree, or AST, from the token sequence of a program text.
There are several shift/reduce conflicts in the rules, and the parser always chooses to shift.
program ::= toplevel
toplevel ::= empty | expression toplevel
seq ::= empty | substantial_seq
substantial_seq ::= expression
    | expression substantial_seq
    | expression '=' expression substantial_seq
expression ::= store_op
store_op ::= logor_op | logor_op '<-' logor_op
logor_op ::= logand_op | logand_op '||' logor_op
logand_op ::= relation_op | relation_op '&&' logand_op
relation_op ::= add_op
    | add_op '==' add_op
    | add_op '!=' add_op
    | add_op '<' add_op
    | add_op '>' add_op
    | add_op '<=' add_op
    | add_op '>=' add_op
add_op ::= multiply_op
    | add_op '+' multiply_op
    | add_op '-' multiply_op
    | add_op '|' multiply_op
    | add_op '^' multiply_op
multiply_op ::= unary_op
    | multiply_op '*' unary_op
    | multiply_op '/' unary_op
    | multiply_op '//' unary_op
    | multiply_op '%' unary_op
    | multiply_op '&' unary_op
    | multiply_op '<<' unary_op
    | multiply_op '>>' unary_op
unary_op ::= primary | '-' unary_op | '!' unary_op | '~' unary_op
primary ::= num | str | context_binding | paren | vec | fun
    | local_deref | attr_deref | local_ref | attr_ref
    | local_call | attr_call | direct_call
num ::= NUM
str ::= STRING
context_binding ::= '\binding'
paren ::= OPENPAREN seq ')' | WS_OPENPAREN seq ')'
vec ::= OPENBRACKET vec_body ']' | WS_OPENBRACKET vec_body ']'
fun ::= OPENBRACE fun_body '}' | WS_OPENBRACE fun_body '}'
local_deref ::= NOUN | DOLLAR VERB | WS_DOLLAR VERB
local_ref ::= COLON NOUN | WS_COLON NOUN | COLON VERB | WS_COLON VERB
local_call ::= VERB recv args
attr_deref ::= primary '.' NOUN | primary DOLLAR VERB
attr_ref ::= primary COLON NOUN | primary COLON VERB
attr_call ::= primary '.' VERB recv args
direct_call ::= primary '!!' recv args
recv ::= empty | OPENBRACKET expression ']'
args ::= paren_args fun_args
paren_args ::= empty | OPENPAREN vec_body ')'
fun_args ::= empty | fun_arg fun_args
fun_arg ::= OPENBRACE fun_body '}'
vec_body ::= empty | elements_producer vec_body
elements_producer ::= expression | '...' expression
fun_body ::= formal_receiver formal_args seq
formal_receiver ::= empty | OPENBRACKET expression ']'
formal_args ::= empty | OPENPAREN vec_body ')'
empty ::=
The following symbols shown in the production rules are terminal symbols.
Terminal symbol | Description |
---|---|
VERB | Verb symbol such as print |
NOUN | Noun symbol such as Count |
NUM | Num literal such as 42 or 3.14159 |
STRING | String literal such as 'Peng!' |
COLON | Mark token COLON |
WS_COLON | Mark token WS_COLON |
DOLLAR | Mark token DOLLAR |
WS_DOLLAR | Mark token WS_DOLLAR |
OPENPAREN | Mark token OPENPAREN |
WS_OPENPAREN | Mark token WS_OPENPAREN |
OPENBRACKET | Mark token OPENBRACKET |
WS_OPENBRACKET | Mark token WS_OPENBRACKET |
OPENBRACE | Mark token OPENBRACE |
WS_OPENBRACE | Mark token WS_OPENBRACE |
Those enclosed by single quotation marks | Corresponding mark token |
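The left-recursive rules for add_op and multiply_op in the production list imply that these binary operators associate to the left, and that the operators of multiply_op (including &, << and >>) bind tighter than those of add_op (including | and ^). The following sketch illustrates this for a small subset of the grammar, with num tokens as the only primaries. It is not the actual parser: `tokenize` and the `parse_*` functions are hypothetical, the tokenizer is heavily simplified, and loops are used in place of left recursion.

```python
import re

# Simplified tokenizer for decimal integers and a few operator mark tokens.
TOKEN = re.compile(r"[ \r\n]*(//|<<|>>|[0-9]+|[-+|^*/%&!~])")
ADD_OPS = {"+", "-", "|", "^"}
MULTIPLY_OPS = {"*", "/", "//", "%", "&", "<<", ">>"}
UNARY_OPS = {"-", "!", "~"}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip(" \r\n")
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_add_op(tokens):
    # add_op ::= multiply_op | add_op ('+' | '-' | '|' | '^') multiply_op
    node = parse_multiply_op(tokens)
    while tokens and tokens[0] in ADD_OPS:
        op = tokens.pop(0)
        node = (op, node, parse_multiply_op(tokens))
    return node


def parse_multiply_op(tokens):
    # multiply_op ::= unary_op | multiply_op OPERATOR unary_op
    node = parse_unary_op(tokens)
    while tokens and tokens[0] in MULTIPLY_OPS:
        op = tokens.pop(0)
        node = (op, node, parse_unary_op(tokens))
    return node


def parse_unary_op(tokens):
    # unary_op ::= primary | '-' unary_op | '!' unary_op | '~' unary_op
    if tokens and tokens[0] in UNARY_OPS:
        op = tokens.pop(0)
        return (op, parse_unary_op(tokens))
    if tokens and tokens[0].isdigit():
        return int(tokens.pop(0))
    raise SyntaxError("expected a num token")


def parse(text):
    tokens = tokenize(text)
    tree = parse_add_op(tokens)
    if tokens:
        raise SyntaxError(f"unexpected token {tokens[0]!r}")
    return tree


print(parse("1 + 2 * 3"))   # ('+', 1, ('*', 2, 3))
print(parse("8 - 4 - 2"))   # ('-', ('-', 8, 4), 2)
print(parse("1 << 2 + 3"))  # ('+', ('<<', 1, 2), 3)
```

The last example shows a consequence of the rules that may be surprising to readers used to other languages: a shift is grouped before an addition.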