3.2. Syntax
This chapter describes how program texts are syntactically analyzed.
3.2.1. Steps of analysis
A program text is a Unicode string.
The program text is first split into a sequence of lexemes, using longest possible matching from the start to the end.
Next, the lexeme sequence is reduced to a sequence of tokens by filtering out non-token lexemes.
Finally, the token sequence is parsed as
a program abstract source tree.
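The first two steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the category names and the handful of patterns in `LEXEME_PATTERNS` are hypothetical stand-ins, not the lexeme definitions of this chapter. The point is the mechanism: at each offset every pattern is tried, the longest match wins, and non-token lexemes are dropped afterwards.

```python
import re

# Hypothetical, heavily simplified lexeme categories (assumption, for illustration only).
LEXEME_PATTERNS = [
    ("WHITESPACE", re.compile(r"[ \r\n]+")),
    ("COMMENT", re.compile(r"#[^\n]*")),
    ("SYMBOL", re.compile(r"[a-zA-Z_][a-zA-Z0-9_?]*")),
    ("EQ", re.compile(r"=")),
    ("EQEQ", re.compile(r"==")),
]
NON_TOKEN_CATEGORIES = {"WHITESPACE", "COMMENT"}


def slice_lexemes(text):
    """Split text into (category, lexeme) pairs using longest possible matching."""
    pos = 0
    lexemes = []
    while pos < len(text):
        best = None
        for category, pattern in LEXEME_PATTERNS:
            m = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier pattern wins.
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (category, m)
        if best is None:
            raise SyntaxError(f"no lexeme matches at offset {pos}")
        lexemes.append((best[0], best[1].group()))
        pos = best[1].end()
    return lexemes


def to_tokens(lexemes):
    """Filter out the lexemes which are not tokens."""
    return [lx for lx in lexemes if lx[0] not in NON_TOKEN_CATEGORIES]
```

With these assumed patterns, `==` is sliced as one `EQEQ` lexeme rather than two `EQ` lexemes, because the longer match is preferred.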
3.2.2. Lexemes
3.2.2.1. Whitespace and comments
| Regular expression | Lexeme category |
|---|---|
| | Whitespace |
| | Comment |
Whitespace characters are used to make a program text easy to read and to separate tokens from each other. Space (U+0020), carriage return (U+000d) and line feed (U+000a) are whitespace characters.
Note
Horizontal tab (U+0009) is not a valid whitespace character. This design decision is intended to avoid bikeshedding about indentation.
The categories of lexemes (, [ and { may
differ when they are located after whitespace characters.
See the description of terminal symbols for detail.
Comments are used to describe the program text.
A number sign # (U+0023) indicates the start of a comment.
The comment continues until the end of the line,
that is, up to just before the next line feed character or the end of the program text.
Comments are treated as whitespace characters.
stdout.print_line('foo'*2) # => foofoo
stdout.print_line( 'foo' * 2 ) # => foofoo
# Comment line
do_something # trailing comment
Whitespace characters and comments are not tokens.
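The rules above can be sketched as a small Python helper. It is a minimal illustration, not the reference implementation; the function name `skip_whitespace_and_comments` is hypothetical, and the sketch assumes that `pos` is at a lexeme boundary (so a `#` found there is not inside a string token).

```python
# The three whitespace characters named above; horizontal tab is deliberately absent.
WHITESPACE_CHARS = {"\u0020", "\u000d", "\u000a"}


def skip_whitespace_and_comments(text, pos):
    """Return the offset of the first character at or after pos
    which is neither a whitespace character nor part of a comment."""
    while pos < len(text):
        ch = text[pos]
        if ch in WHITESPACE_CHARS:
            pos += 1
        elif ch == "#":
            # A comment ends just before the line feed or at the end of the text.
            end = text.find("\n", pos)
            pos = len(text) if end == -1 else end
        else:
            break
    return pos
```

Because a tab is not in `WHITESPACE_CHARS`, the helper stops at a tab instead of skipping it, which mirrors the note about horizontal tabs.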
3.2.2.2. Symbol tokens
| Regular expression | Lexeme category |
|---|---|
| | Fun symbol token |
| | Data symbol token |
A symbol token consists of a leading ASCII letter (a-zA-Z) or an underscore _ (U+005f),
followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9),
underscores _ (U+005f) and question marks ? (U+003f).
There are two types of symbols: function symbols and data symbols.
A function symbol token, or function symbol, is a symbol token
which does not contain any uppercase letters [A-Z].
Fun symbols are commonly used for names of variables which contain functions. The following are examples of function symbols.
- empty?
- _loop
- take_5
A data symbol token is a symbol token
which contains one or more uppercase letters [A-Z].
Data symbols are commonly used for names of variables which contain ordinary values. The following are examples of data symbols.
- More_lines?
- ArrayList_class
- FLAT_MAP_HASH_TABLE
- rarely_Used
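The distinction between the two kinds of symbols can be sketched in Python. The regular expression below is written from the prose description of symbol tokens, and the function name `classify_symbol` is hypothetical.

```python
import re

# A leading ASCII letter or underscore, then letters, digits, underscores or question marks.
SYMBOL = re.compile(r"[a-zA-Z_][a-zA-Z0-9_?]*")


def classify_symbol(lexeme):
    """Return "fun" or "data" for a symbol token, or None if lexeme is not a symbol token."""
    if not SYMBOL.fullmatch(lexeme):
        return None
    # One or more uppercase letters make a data symbol; otherwise it is a fun symbol.
    return "data" if re.search(r"[A-Z]", lexeme) else "fun"
```

Note that a single uppercase letter anywhere in the symbol is enough to make it a data symbol, which is why `rarely_Used` is classified as data.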
3.2.2.3. Num tokens
| Regular expression | Lexeme category |
|---|---|
| | Num token (base10) |
| | Num token (base2) |
| | Num token (base16) |
There are three types of num tokens: base10, base16 and base2. A base10 num token represents a num in decimal, base16 in hexadecimal, and base2 in binary.
A fractional part can be represented only by base10 num tokens.
Digits before the period . (U+002e) represent the integer portion,
and digits after the period represent the fractional portion.
Underscore characters _ (U+005f) can be placed for spacing.
These are examples of num tokens, each of which represents 42.
- 42
- 42__
- 0042
- 0x2a
- 0b_10_1010
These are examples of num tokens with the fractional portion.
- 0.0
- 0.001
- 3.141_592_653
No base10, base2 or base16 num token can be directly followed
by a character which can form a symbol.
This limitation is expressed by the negative lookahead (?![a-zA-Z0-9_?]).
Thus, for example, the code fragment 24h causes a syntax error.
Note
Without this rule, for example, 0b123 would be parsed as 0b1 followed by 23,
which is certainly error prone.
Each num token represents a number whose mantissa is the integer formed by all of its digits, and whose scale is the number of digits in the fractional portion. For example, 3.14 has the mantissa 314 and the scale 2.
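The mantissa and scale of a num token can be computed as in the following Python sketch. The three regular expressions are assumptions reconstructed from the prose and the examples in this section, not the patterns of the language definition (in particular, accepting uppercase hexadecimal digits is a guess), and the name `read_num` is hypothetical. Only the trailing lookahead is taken literally from the text.

```python
import re

# The lookahead quoted in the text: no symbol-forming character may follow.
LOOKAHEAD = r"(?![a-zA-Z0-9_?])"
# Assumed patterns (not taken from the language definition).
BASE16 = re.compile(r"0x([0-9a-fA-F_]+)" + LOOKAHEAD)
BASE2 = re.compile(r"0b([01_]+)" + LOOKAHEAD)
BASE10 = re.compile(r"([0-9][0-9_]*)(?:\.([0-9][0-9_]*))?" + LOOKAHEAD)


def read_num(text):
    """Return (mantissa, scale) for a num token at the start of text, or None."""
    for pattern, base in ((BASE16, 16), (BASE2, 2)):
        m = pattern.match(text)
        if m:
            digits = m.group(1).replace("_", "")  # underscores are only spacing
            if not digits:
                return None
            return int(digits, base), 0  # no fractional portion, so the scale is 0
    m = BASE10.match(text)
    if not m:
        return None
    int_digits = m.group(1).replace("_", "")
    frac_digits = (m.group(2) or "").replace("_", "")
    # The mantissa is the integer made of all digits; the scale counts fractional digits.
    return int(int_digits + frac_digits), len(frac_digits)
```

Under these assumptions, `24h` and `0b123` yield no num token at all, because the lookahead rejects every candidate match instead of letting a shorter prefix succeed.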
3.2.2.4. String tokens
| Regular expression | Lexeme category |
|---|---|
| | String token (simple) |
| | String token (rich) |
There are two types of string tokens: simple and rich.
In a simple string token, any characters between the two single quotation marks ' (U+0027)
are the content of the string.
To include a single quotation mark itself in the string,
write two consecutive single quotation marks.
These are examples of simple strings.
- 'Hello world'
- 'Let''s go!' (it represents "Let's go!")
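Decoding a simple string token can be sketched in Python as follows. The function name `decode_simple_string` is hypothetical, and the sketch assumes that it receives the whole token including the enclosing single quotation marks.

```python
def decode_simple_string(token):
    """Return the content of a simple string token such as 'Let''s go!'."""
    if len(token) < 2 or token[0] != "'" or token[-1] != "'":
        raise ValueError("a simple string token must be enclosed in single quotation marks")
    body = token[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == "'":
            # Inside the token, a quotation mark must appear as a doubled pair.
            if body[i:i + 2] != "''":
                raise ValueError("lone quotation mark inside a simple string")
            out.append("'")
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

No other character is treated specially, so a backslash in a simple string token stays a plain backslash.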
In a rich string token, characters between the two double quotation marks " (U+0022)
are the content of the string.
In the token, a sequence of characters prefixed by a backslash \ (U+005c) represents
a special character, such as a line feed (\n) or a double quotation mark (\").
These are examples of rich string tokens.
- "Let's go!"
- "GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"
Here is a list of backslash notations.
| Notation | Unicode | Description |
|---|---|---|
| | U+0000 | Null character |
| | U+0007 | Bell |
| | U+0008 | Backspace |
| | U+0009 | Horizontal tab |
| \n | U+000a | Line feed |
| | U+000b | Vertical tab |
| | U+000c | Form feed |
| | U+000d | Carriage return |
| | U+001b | Escape |
| \" | U+0022 | Double quotation mark |
| | U+005c | Backslash |
| | U+xxxxxx | Character specified by Unicode. xxxxxx are one to six hexadecimal digits (0-9a-f). The digits must be less than or equal to |
3.2.2.5. Mark tokens
Tokens other than those described above are generically called mark tokens.
Some lexemes are reduced to different mark tokens depending on the following conditions.
- WS
There are tokens preceding the current lexeme, and one or more whitespace characters are placed between the previous token and the current lexeme.
- not WS
There is no token preceding the current lexeme, or no whitespace character is placed between the previous token and the current lexeme.
| Pattern | Condition | Mark token | Note |
|---|---|---|---|
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | not-equal operator |
| literal | always | | |
| literal | always | | greater-than operator |
| literal | always | | less-than-or-equal-to operator |
| literal | always | | greater-than-or-equal-to operator |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | |
| literal | always | | Let clauses |
| literal | not WS | COLON | Local variable references |
| literal | WS | WS_COLON | Member variable references |
| literal | not WS | DOLLAR | Local variable load of a function symbol |
| literal | WS | WS_DOLLAR | Member variable load of a function symbol |
| literal | always | | Access to variables or functions |
| literal | always | | Elements spreading |
| literal | not WS | OPENBRACKET | Vectors, formal receivers, or actual receivers |
| literal | WS | WS_OPENBRACKET | Vectors |
| literal | always | | Closing |
| literal | not WS | OPENBRACE | Function expressions or function arguments |
| literal | WS | WS_OPENBRACE | Function expressions |
| literal | always | | Closing |
| literal | not WS | OPENPAREN | Parentheses expressions, formal arguments, or actual arguments |
| literal | WS | WS_OPENPAREN | Parentheses expressions |
| literal | always | | Closing |
| regex | always | | Context binding expressions |
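The WS / not WS distinction can be sketched in Python as below. The mapping from literal characters to token names is an assumption inferred from the token names themselves (for example, that OPENPAREN corresponds to `(` and COLON to `:`), and the function name `mark_token_name` is hypothetical.

```python
# Assumed literals for the whitespace-sensitive mark tokens (inferred from the token names).
WS_SENSITIVE = {
    ":": "COLON",
    "$": "DOLLAR",
    "[": "OPENBRACKET",
    "{": "OPENBRACE",
    "(": "OPENPAREN",
}


def mark_token_name(literal, has_previous_token, whitespace_before):
    """Choose between X and WS_X for a whitespace-sensitive lexeme.

    has_previous_token: whether any token precedes the current lexeme.
    whitespace_before: whether whitespace (or a comment) separates the
    previous token from the current lexeme.
    """
    base = WS_SENSITIVE[literal]
    # WS holds only if there is a preceding token AND whitespace in between.
    is_ws = has_previous_token and whitespace_before
    return "WS_" + base if is_ws else base
```

A lexeme at the very start of a program text is always in the not WS condition, even when whitespace precedes it, because no token comes before it.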
3.2.3. Abstract source tree
The following list describes the rules for parsing the program abstract source tree, or AST,
from the token sequence of a program text.
There are several shift/reduce conflicts in the rules, and the parser always chooses to shift.
program ::= toplevel
toplevel ::= empty | expression toplevel
seq ::= empty | substantial_seq
substantial_seq ::= expression | expression substantial_seq | expression '=' expression substantial_seq
expression ::= store_op
store_op ::= logor_op | logor_op '<-' logor_op
logor_op ::= logand_op | logand_op '||' logor_op
logand_op ::= relation_op | relation_op '&&' logand_op
relation_op ::= add_op | add_op '==' add_op | add_op '!=' add_op | add_op '<' add_op | add_op '>' add_op | add_op '<=' add_op | add_op '>=' add_op
add_op ::= multiply_op | add_op '+' multiply_op | add_op '-' multiply_op | add_op '|' multiply_op | add_op '^' multiply_op
multiply_op ::= unary_op | multiply_op '*' unary_op | multiply_op '/' unary_op | multiply_op '//' unary_op | multiply_op '%' unary_op | multiply_op '&' unary_op | multiply_op '<<' unary_op | multiply_op '>>' unary_op
unary_op ::= primary | '-' unary_op | '!' unary_op | '~' unary_op
primary ::= num | str | binding | paren | vec | fun | local_load | member_load | local_varref | member_varref | local_call | member_call
num ::= NUM
str ::= STRING
binding ::= '\binding'
paren ::= OPENPAREN seq ')' | WS_OPENPAREN seq ')'
vec ::= OPENBRACKET vec_body ']' | WS_OPENBRACKET vec_body ']'
fun ::= OPENBRACE fun_body '}' | WS_OPENBRACE fun_body '}'
local_load ::= DATA_SYM | DOLLAR FUN_SYM | WS_DOLLAR FUN_SYM
local_varref ::= COLON DATA_SYM | WS_COLON DATA_SYM | COLON FUN_SYM | WS_COLON FUN_SYM
local_call ::= FUN_SYM recv args
member_load ::= primary '.' DATA_SYM | primary DOLLAR FUN_SYM
member_varref ::= primary COLON DATA_SYM | primary COLON FUN_SYM
member_call ::= primary '.' FUN_SYM recv args
recv ::= empty | OPENBRACKET expression ']'
args ::= paren_args fun_args
paren_args ::= empty | OPENPAREN vec_body ')'
fun_args ::= empty | fun_arg fun_args
fun_arg ::= OPENBRACE fun_body '}'
vec_body ::= empty | elements_producer vec_body
elements_producer ::= expression | '...' expression
fun_body ::= formal_receiver formal_args seq
formal_receiver ::= empty | OPENBRACKET expression ']'
formal_args ::= empty | OPENPAREN vec_body ')'
empty ::=
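To illustrate how the left-recursive add_op and multiply_op rules determine precedence and associativity, here is a minimal recursive-descent sketch in Python. It covers only a small subset of the rules and makes several simplifying assumptions: tokens are plain strings, a primary is either a decimal integer or a parenthesized add_op (the real rule for parentheses contains a seq), and the function name `parse_expression` and the tuple-shaped tree are hypothetical.

```python
def parse_expression(tokens):
    """Parse a list of token strings into a nested tuple tree (subset of the grammar)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def add_op():
        # add_op ::= multiply_op | add_op ('+' | '-' | '|' | '^') multiply_op
        left = multiply_op()
        while peek() in ("+", "-", "|", "^"):
            op = advance()
            left = (op, left, multiply_op())  # left recursion becomes a loop
        return left

    def multiply_op():
        # multiply_op ::= unary_op | multiply_op <operator> unary_op
        left = unary_op()
        while peek() in ("*", "/", "//", "%", "&", "<<", ">>"):
            op = advance()
            left = (op, left, unary_op())
        return left

    def unary_op():
        # unary_op ::= primary | ('-' | '!' | '~') unary_op
        if peek() in ("-", "!", "~"):
            op = advance()
            return (op, unary_op())
        return primary()

    def primary():
        tok = peek()
        if tok == "(":
            advance()
            inner = add_op()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return inner
        if tok is not None and tok.isdigit():
            return ("num", int(advance()))
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = add_op()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

In this sketch, `1 - 2 - 3` groups as `(1 - 2) - 3`, reflecting the left recursion of add_op, and `1 + 2 * 3` groups as `1 + (2 * 3)`, because multiply_op sits below add_op in the rule chain.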
The following symbols in the production rules are terminal symbols.
| Terminal symbol | Description |
|---|---|
| FUN_SYM | Fun symbol |
| DATA_SYM | Data symbol |
| NUM | Num literal |
| STRING | String literal |
| COLON | Mark token COLON |
| WS_COLON | Mark token WS_COLON |
| DOLLAR | Mark token DOLLAR |
| WS_DOLLAR | Mark token WS_DOLLAR |
| OPENPAREN | Mark token OPENPAREN |
| WS_OPENPAREN | Mark token WS_OPENPAREN |
| OPENBRACKET | Mark token OPENBRACKET |
| WS_OPENBRACKET | Mark token WS_OPENBRACKET |
| OPENBRACE | Mark token OPENBRACE |
| WS_OPENBRACE | Mark token WS_OPENBRACE |
| Those enclosed by single quotation marks | Corresponding mark token |