2.2. Syntax
This chapter describes how program texts are syntactically analyzed.
2.2.1. Steps of analysis
A program text is a Unicode string.
A program text is first split into a sequence of lexemes, using longest possible matching from the start to the end.
Next, the lexeme sequence is reduced to a sequence of tokens by filtering out non-token lexemes.
Finally, the token sequence is parsed into a program abstract source tree.
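The first two steps can be sketched in Python as follows. This is only an illustration: the lexeme patterns, the category names, and the helper names `slice_lexemes` and `to_tokens` are assumptions made for this sketch, and they cover just a small subset of the lexemes described in this chapter.

```python
import re

# Hypothetical lexeme patterns, chosen only to illustrate the pipeline.
# They are NOT the regular expressions of the specification.
LEXEME_PATTERNS = [
    ("whitespace", re.compile(r"[ \r\n]+")),
    ("comment", re.compile(r"#[^\n]*")),
    ("symbol", re.compile(r"[a-zA-Z_][a-zA-Z0-9_?]*")),
    ("num", re.compile(r"[0-9][0-9_]*(?![a-zA-Z0-9_?])")),
    ("string", re.compile(r"'(?:[^']|'')*'")),
    ("mark", re.compile(r"<-|==|[().*+\-=\[\]{}:$]")),
]
NON_TOKEN_CATEGORIES = {"whitespace", "comment"}


def slice_lexemes(text):
    """Step 1: split text into (category, lexeme) pairs by longest match."""
    lexemes = []
    pos = 0
    while pos < len(text):
        best = None
        for category, pattern in LEXEME_PATTERNS:
            m = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier pattern wins.
            if m and (best is None or m.end() > best[1].end()):
                best = (category, m)
        if best is None:
            raise SyntaxError(f"no lexeme matches at offset {pos}")
        lexemes.append((best[0], best[1].group()))
        pos = best[1].end()
    return lexemes


def to_tokens(lexemes):
    """Step 2: drop the lexemes which are not tokens."""
    return [lx for lx in lexemes if lx[0] not in NON_TOKEN_CATEGORIES]
```

With these assumed patterns, `stdout.print_line('foo'*2)` and `stdout.print_line( 'foo' * 2 )` yield the same token texts, and a horizontal tab raises `SyntaxError` because no pattern matches it.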
2.2.2. Lexemes
Whitespace and comments
| Regular expression | Lexeme category |
|---|---|
| | Whitespace |
| | Comment |
Whitespace characters are used to make a program text easy to read and to separate tokens from each other. Space (U+0020), carriage return (U+000d) and line feed (U+000a) are whitespace characters.
Note
Horizontal tab (U+0009) is not a valid whitespace character. This design decision is intended to avoid bikeshed discussion about indentation.
The categories of the lexemes (, [ and { may differ when they are located after whitespace characters.
See the description of terminal symbols for details.
Comments are used to describe the program text.
A number sign # (U+0023) indicates the start of a comment.
The comment continues to the end of the line, that is, up to the next line feed character or the end of the program text.
Comments are treated as whitespace characters.
stdout.print_line('foo'*2) # => foofoo
stdout.print_line( 'foo' * 2 ) # => foofoo
# Comment line
do_something # trailing comment
Whitespace characters and comments are not tokens.
Symbol tokens
| Regular expression | Lexeme category |
|---|---|
| | Verb symbol token |
| | Noun symbol token |
A symbol token consists of a leading ASCII letter (a-zA-Z) or an underscore _ (U+005f), followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9), underscores _ (U+005f) and question marks ? (U+003f).
There are two types of symbols: verbs and nouns.
A verb symbol token is a symbol token which does not contain uppercase letters [A-Z].
Verbs are commonly used for names of variables which contain functions. The following are examples of verbs.
empty?
_loop
take_5
A noun symbol token is a symbol token which contains one or more uppercase letters [A-Z].
Nouns are commonly used for names of variables which contain ordinary values. The following are examples of nouns.
More_lines?
ArrayList_class
FLAT_MAP
_HASH_TABLE
rarely_Used
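The verb/noun distinction can be checked mechanically. The following Python sketch builds a regular expression from the prose description above (it is not copied from the specification), and the function name `classify_symbol` is a hypothetical helper.

```python
import re

# Built from the prose description: a leading ASCII letter or underscore,
# then zero or more ASCII letters, ASCII digits, underscores and question marks.
SYMBOL = re.compile(r"[a-zA-Z_][a-zA-Z0-9_?]*")


def classify_symbol(lexeme):
    """Return 'verb' or 'noun' for a symbol lexeme, else raise ValueError."""
    if not SYMBOL.fullmatch(lexeme):
        raise ValueError(f"not a symbol: {lexeme!r}")
    # A noun contains one or more uppercase letters; a verb contains none.
    has_upper = any("A" <= ch <= "Z" for ch in lexeme)
    return "noun" if has_upper else "verb"
```

For instance, `classify_symbol("take_5")` gives `'verb'`, while `classify_symbol("rarely_Used")` gives `'noun'` because of the single uppercase letter.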
Num tokens
| Regular expression | Lexeme category |
|---|---|
| | Num token (base10) |
| | Num token (base2) |
| | Num token (base16) |
There are three types of num tokens: base10, base16 and base2. A base10 num token represents a num in decimal, base16 in hexadecimal, and base2 in binary.
The fractional part can be represented only by base10 num tokens.
Digits before the period . (U+002e) represent the integer portion, and digits after the period represent the fractional portion.
Underscore characters _ (U+005f) can be inserted for spacing.
These are examples of num tokens, each of which represents 42.
42
42__
0042
0x2a
0b_10_1010
These are examples of num tokens with the fractional portion.
0.0
0.001
3.141_592_653
A base10, base2 or base16 num token cannot be directly followed by a character which can form a symbol.
This limitation is expressed by the negative lookahead (?![a-zA-Z0-9_?]).
Thus, for example, the code fragment 24h causes a syntax error.
Note
Without this rule, 0b123, for example, would be parsed as 0b1 followed by 23, which is certainly error prone.
Each num token represents a number whose scale is the number of digits in the fractional portion, and whose mantissa is the integer formed by all of the digits. For example, 0.001 has the mantissa 1 and the scale 3.
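As a rough illustration of the rules above, the following Python sketch reads a num token at the start of a string and returns its mantissa and scale. The three patterns are assumptions reconstructed from the examples in this section (for instance, that hexadecimal digits are lowercase and that underscores may follow the `0b` or `0x` prefix), and `parse_num` is a hypothetical name; only the lookahead `(?![a-zA-Z0-9_?])` is taken from the text.

```python
import re

# The lookahead is quoted from the text; the rest of each pattern is assumed.
NOT_SYMBOL_CHAR = r"(?![a-zA-Z0-9_?])"
BASE2 = re.compile(r"0b_*[01][01_]*" + NOT_SYMBOL_CHAR)
BASE16 = re.compile(r"0x_*[0-9a-f][0-9a-f_]*" + NOT_SYMBOL_CHAR)
BASE10 = re.compile(r"([0-9][0-9_]*)(?:\.([0-9][0-9_]*))?" + NOT_SYMBOL_CHAR)


def parse_num(text):
    """Return (mantissa, scale) of the num token at the start of text."""
    m = BASE2.match(text)
    if m:
        return int(m.group()[2:].replace("_", ""), 2), 0
    m = BASE16.match(text)
    if m:
        return int(m.group()[2:].replace("_", ""), 16), 0
    m = BASE10.match(text)
    if m:
        int_digits = m.group(1).replace("_", "")
        frac_digits = (m.group(2) or "").replace("_", "")
        # The mantissa is made of all digits; the scale counts the fractional ones.
        return int(int_digits + frac_digits), len(frac_digits)
    raise SyntaxError(f"invalid num token: {text!r}")
```

Under these assumptions, all five spellings of 42 listed above give the mantissa 42 with scale 0, while `24h` and `0b123` are rejected because the lookahead fails.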
String tokens
| Regular expression | Lexeme category |
|---|---|
| | String token (simple) |
| | String token (rich) |
There are two types of string tokens: simple and rich.
In a simple string token, all characters between the two single quotation marks ' (U+0027) are the content of the string.
To include a single quotation mark itself in the string, write two consecutive single quotation marks.
These are examples of simple strings.
'Hello world'
'Let''s go!' (it represents "Let's go!")
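The doubling rule for simple strings is easy to express in code. The following Python sketch, with the hypothetical helper name `decode_simple_string`, recovers the content of a simple string token.

```python
def decode_simple_string(token):
    """Return the content of a simple string token such as 'Let''s go!'."""
    if len(token) < 2 or token[0] != "'" or token[-1] != "'":
        raise ValueError(f"not a simple string token: {token!r}")
    body = token[1:-1]
    # Inside the token, single quotation marks may only appear in pairs.
    if "'" in body.replace("''", ""):
        raise ValueError(f"unpaired quotation mark in: {token!r}")
    # Each pair stands for one quotation mark in the content.
    return body.replace("''", "'")
```

No other character needs special treatment in this sketch, since a simple string has no backslash notation.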
In a rich string token, the characters between the two double quotation marks " (U+0022) are the content of the string.
Within the token, a sequence of characters prefixed by a backslash \ (U+005c) represents a special character, such as a line feed (\n) or a double quotation mark (\").
These are examples of rich string tokens.
"Let's go!"
"GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"
Here is a list of backslash notations.
| Notation | Unicode | Description |
|---|---|---|
| | U+0000 | Null character |
| | U+0007 | Bell |
| | U+0008 | Backspace |
| | U+0009 | Horizontal tab |
| | U+000a | Line feed |
| | U+000b | Vertical tab |
| | U+000c | Form feed |
| | U+000d | Carriage return |
| | U+001b | Escape |
| | U+0022 | Double quotation mark |
| | U+005c | Backslash |
| | U+xxxxxx | Character specified by Unicode. xxxxxx are one to six hexadecimal digits (0-9a-f). The digits must be less than or equal to |
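The following Python sketch shows how backslash notations in a rich string token might be decoded. It is deliberately partial: it handles only `\n` and `\"`, which the text names, `\r`, which appears in the example, and `\\`, which is an assumption of this sketch. The remaining notations of the table are not covered, and `decode_rich_string` is a hypothetical name.

```python
# Partial table: \n and \" are named in the text, \r appears in the example,
# and \\ is assumed. Other notations are intentionally left out.
ESCAPES = {"n": "\n", "r": "\r", '"': '"', "\\": "\\"}


def decode_rich_string(token):
    """Return the content of a rich string token, decoding known escapes."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError(f"not a rich string token: {token!r}")
    body = token[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        # A backslash must be followed by a character this sketch knows.
        if i + 1 >= len(body) or body[i + 1] not in ESCAPES:
            raise SyntaxError(f"unsupported backslash notation at offset {i}")
        out.append(ESCAPES[body[i + 1]])
        i += 2
    return "".join(out)
```

A complete decoder would extend `ESCAPES` with the other single-character notations and add a branch for the Unicode code point notation.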
Mark tokens
Tokens other than those described above are generically called mark tokens.
Some lexemes are reduced to different mark tokens depending on the following conditions.
- WS
There are tokens preceding the current lexeme, and one or more whitespace characters are placed between the previous token and the current lexeme.
- not WS
There are no tokens preceding the current lexeme, or no whitespace character is placed between the previous token and the current lexeme.
| Pattern | Condition | Mark token | Note |
|---|---|---|---|
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | not-equal operator |
| literal | always | | |
| literal | always | | greater-than operator |
| literal | always | | less-than-or-equal-to operator |
| literal | always | | greater-than-or-equal-to operator |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | Let clauses |
| literal | not WS | COLON | Local variable references |
| literal | WS | WS_COLON | Attributional variable references |
| literal | not WS | DOLLAR | Local load of verb variables |
| literal | WS | WS_DOLLAR | Attributional load of verb variables |
| literal | always | | Access to variables or functions |
| literal | always | | Elements spreading |
| literal | not WS | OPENBRACKET | Vectors, formal receivers, or actual receivers |
| literal | WS | WS_OPENBRACKET | Vectors |
| literal | always | | Closing |
| literal | not WS | OPENBRACE | Function expressions or function arguments |
| literal | WS | WS_OPENBRACE | Function expressions |
| literal | always | | Closing |
| literal | not WS | OPENPAREN | Parentheses expressions, formal arguments, or actual arguments |
| literal | WS | WS_OPENPAREN | Parentheses expressions |
| literal | always | | Closing |
| regex | always | | Context binding expressions |
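The WS and not WS conditions can be sketched in Python as follows. The mark token names come from the table above; the mapping from the characters `:`, `$`, `[`, `{` and `(` to those names, the `(category, text)` input format, and the function name `name_marks` are assumptions of this sketch.

```python
# Assumed lexeme characters for the context-sensitive mark tokens.
CONTEXT_SENSITIVE = {
    ":": "COLON",
    "$": "DOLLAR",
    "[": "OPENBRACKET",
    "{": "OPENBRACE",
    "(": "OPENPAREN",
}


def name_marks(lexemes):
    """Map (category, text) lexemes to token names, applying WS / not WS."""
    names = []
    seen_token = False        # is there a token before the current lexeme?
    ws_since_token = False    # whitespace or comment since the previous token?
    for category, text in lexemes:
        if category in ("whitespace", "comment"):
            # Comments are treated as whitespace characters.
            ws_since_token = True
            continue
        if category == "mark" and text in CONTEXT_SENSITIVE:
            is_ws = seen_token and ws_since_token
            names.append(("WS_" if is_ws else "") + CONTEXT_SENSITIVE[text])
        else:
            names.append(text)
        seen_token = True
        ws_since_token = False
    return names
```

In this sketch, the parenthesis in `f(x)` becomes OPENPAREN, whereas the one in `f (x)` becomes WS_OPENPAREN. A bracket at the very start of the program text is OPENBRACKET even after leading whitespace, because no token precedes it.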
2.2.3. Abstract source tree
The following list describes the rules for parsing the program abstract source tree, or AST, from the token sequence of a program text.
There are several shift/reduce conflicts in the rules, and the parser always chooses to shift.
program ::= toplevel
toplevel ::= empty | expression toplevel
seq ::= empty | substantial_seq
substantial_seq ::= expression | expression substantial_seq | expression '=' expression substantial_seq
expression ::= store_op
store_op ::= logor_op | logor_op '<-' logor_op
logor_op ::= logand_op | logand_op '||' logor_op
logand_op ::= relation_op | relation_op '&&' logand_op
relation_op ::= add_op | add_op '==' add_op | add_op '!=' add_op | add_op '<' add_op | add_op '>' add_op | add_op '<=' add_op | add_op '>=' add_op
add_op ::= multiply_op | add_op '+' multiply_op | add_op '-' multiply_op | add_op '|' multiply_op | add_op '^' multiply_op
multiply_op ::= unary_op | multiply_op '*' unary_op | multiply_op '/' unary_op | multiply_op '//' unary_op | multiply_op '%' unary_op | multiply_op '&' unary_op | multiply_op '<<' unary_op | multiply_op '>>' unary_op
unary_op ::= primary | '-' unary_op | '!' unary_op | '~' unary_op
primary ::= num | str | binding | paren | vec | fun | local_load | attr_load | local_ref | attr_ref | local_call | attr_call
num ::= NUM
str ::= STRING
binding ::= '\binding'
paren ::= OPENPAREN seq ')' | WS_OPENPAREN seq ')'
vec ::= OPENBRACKET vec_body ']' | WS_OPENBRACKET vec_body ']'
fun ::= OPENBRACE fun_body '}' | WS_OPENBRACE fun_body '}'
local_load ::= NOUN | DOLLAR VERB | WS_DOLLAR VERB
local_varref ::= COLON NOUN | WS_COLON NOUN | COLON VERB | WS_COLON VERB
local_call ::= VERB recv args
attr_load ::= primary '.' NOUN | primary DOLLAR VERB
attr_varref ::= primary COLON NOUN | primary COLON VERB
attr_call ::= primary '.' VERB recv args
recv ::= empty | OPENBRACKET expression ']'
args ::= paren_args fun_args
paren_args ::= empty | OPENPAREN vec_body ')'
fun_args ::= empty | fun_arg fun_args
fun_arg ::= OPENBRACE fun_body '}'
vec_body ::= empty | elements_producer vec_body
elements_producer ::= expression | '...' expression
fun_body ::= formal_receiver formal_args seq
formal_receiver ::= empty | OPENBRACKET expression ']'
formal_args ::= empty | OPENPAREN vec_body ')'
empty ::=
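To illustrate how the add_op, multiply_op and unary_op rules determine precedence and associativity, here is a small Python sketch of a recursive descent parser for that fragment only. It is a simplification, not the actual parser: tokens are plain strings, primary is restricted to decimal integers, the left-recursive rules are implemented as loops, and all function names are hypothetical.

```python
ADD_OPS = {"+", "-", "|", "^"}
MULTIPLY_OPS = {"*", "/", "//", "%", "&", "<<", ">>"}
UNARY_OPS = {"-", "!", "~"}


def parse_add_op(tokens):
    """Parse a whole token list as add_op and return a nested tuple tree."""
    tree, rest = _add_op(list(tokens))
    if rest:
        raise SyntaxError(f"unexpected token: {rest[0]!r}")
    return tree


def _add_op(tokens):
    # add_op ::= multiply_op | add_op ADD_OP multiply_op   (left associative)
    left, tokens = _multiply_op(tokens)
    while tokens and tokens[0] in ADD_OPS:
        op = tokens[0]
        right, tokens = _multiply_op(tokens[1:])
        left = (op, left, right)
    return left, tokens


def _multiply_op(tokens):
    # multiply_op ::= unary_op | multiply_op MULTIPLY_OP unary_op
    left, tokens = _unary_op(tokens)
    while tokens and tokens[0] in MULTIPLY_OPS:
        op = tokens[0]
        right, tokens = _unary_op(tokens[1:])
        left = (op, left, right)
    return left, tokens


def _unary_op(tokens):
    # unary_op ::= primary | '-' unary_op | '!' unary_op | '~' unary_op
    if not tokens:
        raise SyntaxError("unexpected end of tokens")
    if tokens[0] in UNARY_OPS:
        op = tokens[0]
        operand, rest = _unary_op(tokens[1:])
        return (op, operand), rest
    return _primary(tokens)


def _primary(tokens):
    # Simplification: only decimal integers are accepted as primary.
    head = tokens[0]
    if not head.isdigit():
        raise SyntaxError(f"unexpected token: {head!r}")
    return int(head), tokens[1:]
```

The sketch makes two consequences of the rules visible: `|` and `^` share the precedence of `+` and `-`, while `&`, `<<` and `>>` share that of `*`, so `1 | 2 & 3` groups as `1 | (2 & 3)`.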
The following symbols, which appear in the production rules, are terminal symbols.
| Terminal symbol | Description |
|---|---|
| Verb symbol such as | |
| Noun symbol such as | |
| Num literal such as | |
| String literal such as | |
| COLON | Mark token COLON |
| WS_COLON | Mark token WS_COLON |
| DOLLAR | Mark token DOLLAR |
| WS_DOLLAR | Mark token WS_DOLLAR |
| OPENPAREN | Mark token OPENPAREN |
| WS_OPENPAREN | Mark token WS_OPENPAREN |
| OPENBRACKET | Mark token OPENBRACKET |
| WS_OPENBRACKET | Mark token WS_OPENBRACKET |
| OPENBRACE | Mark token OPENBRACE |
| WS_OPENBRACE | Mark token WS_OPENBRACE |
| Those enclosed by single quotation marks | Corresponding mark token |