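3.1. Syntax
This chapter describes how a program text is parsed.
3.1.1. Parsing steps
A program text is a sequence of Unicode scalar values.
The program text is first split into a sequence of morphemes. The split proceeds from the beginning of the text to the end, using longest match.
Next, the sequence of morphemes is converted into a sequence of tokens by dropping the morphemes that are not tokens.
Finally, the sequence of tokens is parsed into the abstract syntax tree of program.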
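The first two steps can be sketched in Python as follows. This is only an illustration: the pattern table is a small hypothetical stand-in for the morpheme definitions of the following sections, and the names `MORPHEME_PATTERNS`, `split_into_morphemes` and `to_tokens` are made up for this sketch. The point is the longest-match loop and the dropping of non-token morphemes.

```python
import re

# Hypothetical, heavily simplified morpheme patterns for illustration only.
# They are NOT the regular expressions of the specification.
MORPHEME_PATTERNS = [
    ("whitespace", re.compile(r"(?: |\r\n|\n)+")),
    ("comment", re.compile(r"#[^\r\n]*")),
    ("symbol", re.compile(r"[a-zA-Z_][a-zA-Z0-9_?]*")),
    ("num", re.compile(r"[0-9][0-9_]*(?![a-zA-Z0-9_?])")),
] + [("operator", re.compile(re.escape(op))) for op in ("=", "==", "<", "<=", "<-")]

# Morpheme kinds that are not tokens.
NON_TOKEN_KINDS = {"whitespace", "comment"}


def split_into_morphemes(text):
    """Split text from beginning to end, taking the longest match at each offset."""
    morphemes = []
    pos = 0
    while pos < len(text):
        best = None
        for kind, pattern in MORPHEME_PATTERNS:
            m = pattern.match(text, pos)
            # Keep the candidate that consumes the most characters.
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            raise SyntaxError(f"no morpheme matches at offset {pos}")
        morphemes.append((best[0], best[1].group()))
        pos = best[1].end()
    return morphemes


def to_tokens(morphemes):
    """Drop the morphemes that are not tokens."""
    return [m for m in morphemes if m[0] not in NON_TOKEN_KINDS]
```

With these assumed patterns, `a==b` yields the operator `==` rather than two `=` morphemes, because the longer candidate wins.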
3.1.2. Morphemes
3.1.2.1. Whitespace and comments
| Regular expression | Type of morpheme |
|---|---|
| | Whitespace |
| | Comment |
Whitespace is used to make a program text easy to read and to separate tokens from each other. Space (U+0020), LF (U+000a), and CR+LF (U+000d, U+000a) are whitespace character sequences.
Note
Horizontal tab (U+0009) is not a valid whitespace character. This design decision is intended to avoid bikeshedding over indentation.
The categories of the morphemes (, [ and { may differ when they are located just after whitespace characters. See the description of token types for details.
Comments are used to describe the program text.
A number sign # (U+0023) indicates the start of a comment.
The comment continues until the end of the line,
that is, up to but not including the next CR or LF character, or up to the end of the program text.
Comments are treated as whitespace characters.
stdout.print_line('foo'*2) # => foofoo
stdout.print_line( 'foo' * 2 ) # => foofoo
# Comment line
do_something # trailing comment
Whitespace characters and comments are not tokens.
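As a rough illustration of the rules above, the following Python sketch skips whitespace and comments from a given offset. The two patterns are assumptions derived from the prose, not the regular expressions of the specification, and `skip_blank` is a hypothetical helper name.

```python
import re

# Assumed patterns derived from the prose above (hypothetical).
# Whitespace: space, LF, or CR+LF. Horizontal tab and a lone CR are excluded.
WHITESPACE = re.compile(r"(?: |\r\n|\n)+")
# Comment: from '#' up to, but not including, the next CR or LF.
COMMENT = re.compile(r"#[^\r\n]*")


def skip_blank(text, pos=0):
    """Return the first offset at or after pos that starts neither whitespace nor a comment."""
    while True:
        m = WHITESPACE.match(text, pos) or COMMENT.match(text, pos)
        if m is None:
            return pos
        pos = m.end()
```

Under these assumptions, `skip_blank("\tfoo")` stays at offset 0, which reflects that a horizontal tab is not whitespace.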
3.1.2.2. Symbol tokens
| Regular expression | Type of morpheme |
|---|---|
| | Fun symbol token |
| | Data symbol token |
A symbol token consists of a leading ASCII letter (a-zA-Z) or an underscore _ (U+005f),
followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9),
underscores _ (U+005f) and question marks ? (U+003f).
There are two types of symbol tokens: function symbol tokens and data symbol tokens.
A function symbol token is a symbol token which represents a function symbol. Function symbols are commonly used for names of variables which contain functions. The following are examples of function symbols.
any?
_loop
take_5
A data symbol token is a symbol token which represents a data symbol. Data symbols are commonly used for names of variables which contain ordinary values. The following are examples of data symbols.
More_lines?
ArrayList_class
FLAT_MAP_HASH_TABLE
rarely_Used
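The general shape of a symbol token described above can be checked with a short Python sketch. The pattern is transcribed from the prose, and `is_symbol` is a hypothetical helper. It deliberately does not distinguish function symbols from data symbols, since that distinction is defined by the regular expressions of the table and not by this sketch.

```python
import re

# Leading ASCII letter or underscore, then zero or more letters, digits,
# underscores and question marks (transcribed from the prose above).
SYMBOL = re.compile(r"[a-zA-Z_][a-zA-Z0-9_?]*")


def is_symbol(text):
    """Return True if the whole text has the shape of a symbol token."""
    return SYMBOL.fullmatch(text) is not None
```

For instance, `is_symbol("take_5")` holds, while `is_symbol("5take")` does not because of the leading digit.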
3.1.2.3. Num tokens
| Regular expression | Type of morpheme |
|---|---|
| | Num token (base10) |
| | Num token (base2) |
| | Num token (base16) |
There are three types of num tokens: base10, base16 and base2. A base10 num token represents a num in decimal, base16 in hexadecimal, and base2 in binary.
The fractional part can be represented only by base10 num tokens.
Digits before the period . (U+002e) represent the integer portion,
and digits after the period represent the fractional portion.
Underscore characters _ (U+005f) can be placed for spacing.
These are examples of num tokens, each of which represents 42.
42
42__
0042
0x2a
0b_10_1010
These are examples of num tokens with the fractional portion.
0.0
0.001
3.141_592_653
No num token, whether base10, base2 or base16, can be directly followed
by a character which can form a symbol.
This restriction is expressed in the regular expressions as (?![a-zA-Z0-9_?]).
Thus, for example, the code fragment 24h causes a syntax error.
Note
Without this rule, 0b123, for example, would be parsed as 0b1 followed by 23,
which is certainly error prone.
Each num token represents a number whose scale is the count of digits in the fractional portion, and whose mantissa is the integer made of all the digits. For example, 3.141_592_653 has the scale 9 and the mantissa 3141592653.
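The mapping from a num token to its mantissa and scale can be sketched in Python as below. The three patterns are assumptions reconstructed from the prose and the examples, not the regular expressions of the specification. In particular, the exact placement of underscores and the acceptance of uppercase hexadecimal digits are guesses. `parse_num` is a hypothetical helper name.

```python
import re

# A num token must not be directly followed by a character that can form a symbol.
NOT_SYMBOL_CHAR = r"(?![a-zA-Z0-9_?])"

# Assumed patterns (hypothetical); underscores are allowed for spacing.
BASE10 = re.compile(r"([0-9][0-9_]*)(?:\.([0-9][0-9_]*))?" + NOT_SYMBOL_CHAR)
BASE16 = re.compile(r"0x_*([0-9a-fA-F][0-9a-fA-F_]*)" + NOT_SYMBOL_CHAR)
BASE2 = re.compile(r"0b_*([01][01_]*)" + NOT_SYMBOL_CHAR)


def parse_num(token):
    """Return (mantissa, scale) for a num token; the value is mantissa * 10 ** -scale."""
    for pattern, base in ((BASE16, 16), (BASE2, 2)):
        m = pattern.fullmatch(token)
        if m:
            # Only base10 tokens can carry a fractional portion, so the scale is 0.
            return int(m.group(1).replace("_", ""), base), 0
    m = BASE10.fullmatch(token)
    if m is None:
        raise SyntaxError(f"not a num token: {token!r}")
    int_digits = m.group(1).replace("_", "")
    frac_digits = (m.group(2) or "").replace("_", "")
    return int(int_digits + frac_digits), len(frac_digits)
```

With these assumed patterns, neither `BASE10.match("24h")` nor `BASE2.match("0b123")` finds a match, because the negative lookahead rejects every candidate prefix.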
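3.1.2.4. String tokens
| Regular expression | Type of morpheme |
|---|---|
| | String token (simple) |
| | String token (rich) |
There are two types of string tokens: simple and rich.
In a simple string token, the characters between two single quotation marks ' (U+0027) form the content of the string. To include a single quotation mark itself in the string, write two single quotation marks in a row.
Examples of simple string tokens:
'Hello world'
'Let''s go!' (this represents "Let's go!")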
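The quoting rule of simple string tokens can be illustrated with a small Python sketch. The pattern is an assumption based on the prose (content between single quotation marks, doubled quotation marks for a literal one, no line breaks inside the token), and `decode_simple_string` is a hypothetical helper name.

```python
import re

# Assumed shape of a simple string token (hypothetical): the body consists of
# characters other than ', CR and LF, or of doubled single quotation marks.
SIMPLE_STRING = re.compile(r"'((?:[^'\r\n]|'')*)'")


def decode_simple_string(token):
    """Return the string content represented by a simple string token."""
    m = SIMPLE_STRING.fullmatch(token)
    if m is None:
        raise SyntaxError(f"not a simple string token: {token!r}")
    # Each doubled quotation mark stands for one quotation mark.
    return m.group(1).replace("''", "'")
```

For example, `decode_simple_string("'Let''s go!'")` gives `Let's go!`.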
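In a rich string token, the characters between two double quotation marks " (U+0022) form the content of the string. Within this token, a sequence starting with a backslash \ (U+005c) represents a special character, such as a line feed (\n) or a double quotation mark (\").
Examples of rich string tokens:
"Let's go!"
"GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"
List of backslash notations: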
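| Notation | Unicode | Description |
|---|---|---|
| | U+0000 | Null character |
| | U+0007 | Bell character |
| | U+0008 | Backspace |
| | U+0009 | Horizontal tab |
| | U+000a | Line feed |
| | U+000b | Vertical tab |
| | U+000c | Form feed |
| | U+000d | Carriage return |
| | U+001b | Escape character |
| | U+0022 | Double quotation mark |
| | U+005c | Backslash |
| | U+xxxxxx | The character designated by the Unicode scalar value. xxxxxx is a sequence of one to six hexadecimal digits (0-9a-f) which represents an integer in the range from 0 to d7ff (hexadecimal) or from e000 to 10ffff (hexadecimal), both ranges inclusive. |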
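The range condition on xxxxxx is the definition of a Unicode scalar value: any code point except the surrogates d800 to dfff. A minimal Python sketch of that check follows. The helper names are hypothetical, and the sketch takes only the bare hexadecimal digits, leaving the surrounding backslash notation aside.

```python
import re

# One to six lowercase hexadecimal digits, as stated in the table.
HEX_DIGITS = re.compile(r"[0-9a-f]{1,6}")


def is_unicode_scalar_value(code_point):
    """True for 0..0xd7ff and 0xe000..0x10ffff (both ranges inclusive)."""
    return 0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def char_from_hex(hex_digits):
    """Return the character for the given hexadecimal digits, or raise ValueError."""
    if HEX_DIGITS.fullmatch(hex_digits) is None:
        raise ValueError(f"expected 1 to 6 hex digits: {hex_digits!r}")
    code_point = int(hex_digits, 16)
    if not is_unicode_scalar_value(code_point):
        raise ValueError(f"not a Unicode scalar value: {hex_digits!r}")
    return chr(code_point)
```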
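Note
As the regular expressions show, neither simple nor rich string tokens can span multiple lines of the program text. This is so that external tools, such as editor plugins, can perform lexical analysis easily. To handle multi-line text in a program, use backslash notations in a rich string token, or bundle a data file module.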
3.1.2.5. Binding token
| Regular expression | Type of morpheme |
|---|---|
| | Binding token |
A binding token is literally \binding.
It is used for a binding expression.
3.1.2.6. Punctuation marks and operators
Tokens other than those described above are punctuation marks or operators.
Some morphemes, like $, are reduced to different tokens depending on
whether they are placed just after a whitespace character or not.
If the morpheme is placed just after a whitespace character,
the token is written with the prefix ws,
like ws$.
If the morpheme is not placed just after a whitespace character,
the token is written with the prefix nows, like nows$.
Other morphemes, like ==, are reduced to a single token
regardless of whether they are placed just after a whitespace character.
Those tokens are written literally, like ==.
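The reduction described above can be sketched in Python as follows. The set of whitespace-sensitive morphemes is collected from this chapter ($, :, (, [ and {), the morpheme kinds are hypothetical labels, and treating the start of the program text as "after whitespace" is purely an assumption of this sketch, since the text above does not say how that case is handled.

```python
# Morphemes whose token depends on preceding whitespace (collected from this chapter).
WS_SENSITIVE = {"$", ":", "(", "[", "{"}


def reduce_to_tokens(morphemes):
    """morphemes: list of (kind, text) pairs; kind is a hypothetical label such as
    'whitespace', 'comment', 'punct' or 'symbol'. Returns a list of token names."""
    tokens = []
    # ASSUMPTION: the start of the program text counts as "after whitespace".
    after_ws = True
    for kind, text in morphemes:
        if kind in ("whitespace", "comment"):
            # Comments are treated as whitespace characters.
            after_ws = True
            continue
        if kind == "punct" and text in WS_SENSITIVE:
            tokens.append(("ws" if after_ws else "nows") + text)
        else:
            tokens.append(text)
        after_ws = False
    return tokens
```

For the morphemes of `f(x) (y)`, this gives `nows(` for the first parenthesis and `ws(` for the second, while a morpheme such as `==` is passed through unchanged.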
The list of punctuation marks and operators:
| Token | Note |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | not-equal operator |
| | less-than operator |
| | greater-than operator |
| | |
| | greater-than-or-equal-to operator |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | Let clauses |
| | Local variable references |
| | Member variable references |
| | Local variable load of a function symbol |
| | Member variable load of a function symbol |
| | Access to variables or functions |
| | Elements spreading |
| | Vectors, formal receivers, or actual receivers |
| | Vectors |
| | Closing |
| | Function expressions or function arguments |
| | Function expressions |
| | Closing |
| | Parentheses expressions, formal arguments, or actual arguments |
| | Parentheses expressions |
| | Closing |
3.1.3. Abstract syntax tree
The following list describes the rules for parsing the token sequence of a program text
into the abstract syntax tree (AST) of the program.
There are several shift/reduce conflicts in the rules, and the parser always chooses to shift.
program ::= toplevel
toplevel ::= empty | expression toplevel
seq ::= empty | substantial_seq
substantial_seq ::= expression | expression substantial_seq | expression '=' expression substantial_seq
expression ::= store_op
store_op ::= logor_op | logor_op '<-' logor_op
logor_op ::= logand_op | logand_op '||' logor_op
logand_op ::= relation_op | relation_op '&&' logand_op
relation_op ::= add_op | add_op '==' add_op | add_op '!=' add_op | add_op '<' add_op | add_op '>' add_op | add_op '<=' add_op | add_op '>=' add_op
add_op ::= multiply_op | add_op '+' multiply_op | add_op '-' multiply_op | add_op '|' multiply_op | add_op '^' multiply_op
multiply_op ::= unary_op | multiply_op '*' unary_op | multiply_op '/' unary_op | multiply_op '//' unary_op | multiply_op '%' unary_op | multiply_op '&' unary_op | multiply_op '<<' unary_op | multiply_op '>>' unary_op
unary_op ::= primary | '-' unary_op | '!' unary_op | '~' unary_op
primary ::= num | str | binding | paren | vec | fun | local_load | member_load | local_varref | member_varref | local_call | member_call
num ::= NUM
str ::= STRING
binding ::= BINDING
paren ::= 'nows(' seq ')' | 'ws(' seq ')'
vec ::= 'nows[' vec_body ']' | 'ws[' vec_body ']'
fun ::= 'nows{' fun_body '}' | 'ws{' fun_body '}'
local_load ::= DATA_SYM | 'nows$' FUN_SYM | 'ws$' FUN_SYM
local_varref ::= 'nows:' DATA_SYM | 'ws:' DATA_SYM | 'nows:' FUN_SYM | 'ws:' FUN_SYM
local_call ::= FUN_SYM recv args
member_load ::= primary '.' DATA_SYM | primary 'nows$' FUN_SYM
member_varref ::= primary 'nows:' DATA_SYM | primary 'nows:' FUN_SYM
member_call ::= primary '.' FUN_SYM recv args
recv ::= empty | 'nows[' expression ']'
args ::= paren_args fun_args
paren_args ::= empty | 'nows(' vec_body ')'
fun_args ::= empty | fun_arg fun_args
fun_arg ::= 'nows{' fun_body '}'
vec_body ::= empty | elements_producer vec_body
elements_producer ::= expression | '...' expression
fun_body ::= formal_receiver formal_args seq
formal_receiver ::= empty | 'nows[' expression ']'
formal_args ::= empty | 'nows(' vec_body ')'
empty ::=
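To show how the operator levels of these rules nest, here is a small recursive-descent sketch in Python covering only add_op, multiply_op and unary_op. It is a simplification under stated assumptions, not the actual parser: NUM tokens are represented by Python integers, operators by strings, primary is restricted to NUM, and the left-recursive rules are implemented with loops, which gives the same left-associative trees.

```python
ADD_OPS = {"+", "-", "|", "^"}
MULTIPLY_OPS = {"*", "/", "//", "%", "&", "<<", ">>"}
UNARY_OPS = {"-", "!", "~"}


def parse_add_op(tokens, pos=0):
    """add_op ::= multiply_op | add_op OP multiply_op, parsed left-associatively."""
    left, pos = parse_multiply_op(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ADD_OPS:
        op = tokens[pos]
        right, pos = parse_multiply_op(tokens, pos + 1)
        left = (op, left, right)
    return left, pos


def parse_multiply_op(tokens, pos):
    """multiply_op ::= unary_op | multiply_op OP unary_op, parsed left-associatively."""
    left, pos = parse_unary_op(tokens, pos)
    while pos < len(tokens) and tokens[pos] in MULTIPLY_OPS:
        op = tokens[pos]
        right, pos = parse_unary_op(tokens, pos + 1)
        left = (op, left, right)
    return left, pos


def parse_unary_op(tokens, pos):
    """unary_op ::= primary | '-' unary_op | '!' unary_op | '~' unary_op."""
    if pos < len(tokens) and tokens[pos] in UNARY_OPS:
        op = tokens[pos]
        operand, pos = parse_unary_op(tokens, pos + 1)
        return (op, operand), pos
    return parse_primary(tokens, pos)


def parse_primary(tokens, pos):
    """Simplified primary: only NUM, represented here by a Python int."""
    if pos < len(tokens) and isinstance(tokens[pos], int):
        return tokens[pos], pos + 1
    raise SyntaxError(f"expected a num at token index {pos}")
```

In this sketch `[1, "|", 2, "&", 3]` parses as `("|", 1, ("&", 2, 3))`, which reflects that `&` sits on the multiply_op level while `|` sits on the add_op level.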
The following identifiers represent token types.
| Token type | Description |
|---|---|
| FUN_SYM | Function symbol token |
| DATA_SYM | Data symbol token |
| NUM | Num token |
| STRING | String token |
| BINDING | Binding token |
| Those enclosed by single quotation marks | |