2.2. Syntax

This chapter describes how program texts are syntactically analyzed.

2.2.1. Steps of analysis

A program text is a Unicode string.

A program text is first sliced into a sequence of lexemes, using longest-possible matching from the start of the text to the end.

Next, the lexeme sequence is reduced to a sequence of tokens by filtering out non-token lexemes.

Finally, the token sequence is parsed into a program abstract source tree.
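
The following is a minimal Python sketch of the first two steps. It is illustrative only and not part of the specification: the LEXEME_PATTERNS table, the function names and the category names are made up, and only a few of the lexeme categories defined in the rest of this chapter are shown.

import re

# Illustrative lexeme table; only a few categories from this chapter are listed.
LEXEME_PATTERNS = [
    ("whitespace", re.compile(r"[\u0020\u000d\u000a]")),
    ("comment",    re.compile(r"#[^\u000a]*")),
    ("verb",       re.compile(r"[a-z_][a-z0-9_?]*")),
    ("integer10",  re.compile(r"[0-9][0-9_]*")),
]

def lexemes(text):
    # Slice the program text into lexemes, always taking the longest match.
    pos = 0
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in LEXEME_PATTERNS:
            m = pattern.match(text, pos)
            if m and (best_match is None or len(m.group()) > len(best_match.group())):
                best_name, best_match = name, m
        if best_match is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        yield best_name, best_match.group()
        pos = best_match.end()

def tokens(text):
    # Whitespace and comments are lexemes but not tokens, so they are dropped.
    return [(n, s) for n, s in lexemes(text) if n not in ("whitespace", "comment")]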

2.2.2. Lexemes

Whitespace and comments

Regular expression     Lexeme category
[\u0020\u000d\u000a]   Whitespace
#[^\u000a]*            Comment

Whitespace characters are used to make a program text easy to read and to separate tokens from each other. Space (U+0020), carriage return (U+000d) and line feed (U+000a) are whitespace characters.

Note

Horizontal tab (U+0009) is not a valid whitespace character. This design decision is intended to avoid bikeshed discussion about indentation.

The lexeme categories of (, [ and { may differ depending on whether they are located after whitespace characters. See the description of terminal symbols for details.

Comments are used to describe the program text. A number sign # (U+0023) indicates the start of a comment. The comment continues until the end of the line, that is, up to (but not including) the next line feed character or to the end of the program text. Comments are treated as whitespace characters.

stdout.print_line('foo'*2)      # => foofoo

stdout.print_line( 'foo' * 2 )  # => foofoo

# Comment line
do_something  # trailing comment

Whitespace characters and comments are not tokens.

Symbol tokens

Regular expression                        Lexeme category
[a-z_][a-z0-9_?]*                         Verb symbol token
([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]*   Noun symbol token

A symbol token consists of a leading ASCII letter (a-zA-Z) or underscore _ (U+005f), followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9), underscores _ (U+005f) and question marks ? (U+003f). There are two types of symbol tokens: verbs and nouns.

A verb symbol token is a symbol token which does not contain uppercase letters [A-Z].

Verbs are commonly used for names of variables which contain functions. The following are examples of verbs.

  • empty?
  • _loop
  • take_5

A noun symbol token is a symbol token which contains one or more uppercase letters [A-Z].

Nouns are commonly used for names of variables which contain ordinary values. The following are examples of nouns.

  • More_lines?
  • ArrayList_class
  • FLAT_MAP
  • _HASH_TABLE
  • rarely_Used
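
As an informal illustration (not part of the specification), the two regular expressions in the table above can be used directly to classify a symbol token; the function name symbol_kind is made up for this sketch.

import re

# The verb and noun regular expressions from the table above.
VERB = re.compile(r"[a-z_][a-z0-9_?]*")
NOUN = re.compile(r"([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]*")

def symbol_kind(text):
    if VERB.fullmatch(text):   # no uppercase letters at all
        return "verb"
    if NOUN.fullmatch(text):   # at least one uppercase letter
        return "noun"
    return None                # not a symbol token

assert symbol_kind("take_5") == "verb"
assert symbol_kind("_HASH_TABLE") == "noun"
assert symbol_kind("rarely_Used") == "noun"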

Integer tokens

Regular expression       Lexeme category
[0-9][0-9_]*             Integer token (base10)
0x_*[0-9a-f][0-9a-f_]*   Integer token (base16)
0b_*[01][01_]*           Integer token (base2)

There are three types of integer tokens: base10, base16 and base2. A base10 integer token represents an integer number in decimal, base16 in hexadecimal, and base2 in binary.

Underscore characters _ (U+005f) may be inserted for visual spacing; they do not affect the value.

These are examples of integer tokens, each of which represents 42.

  • 42
  • 42__
  • 0042
  • 0x2a
  • 0b_10_1010

Note that octal integer notation is not supported. A sequence of digits which starts with 0 is read as a base10 integer.

Each integer token represents a number whose scale is zero, and whose mantissa is the specified integer.
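
The following Python sketch shows one way to read the value of an integer token under the rules above; it is illustrative only, and the function name is made up.

def integer_value(lexeme):
    digits = lexeme.replace("_", "")   # underscores are spacing only
    if digits.startswith("0x"):
        return int(digits[2:], 16)     # base16
    if digits.startswith("0b"):
        return int(digits[2:], 2)      # base2
    return int(digits, 10)             # base10; a leading 0 is still decimal

# Each of the example tokens above represents 42.
assert integer_value("0042") == 42     # not octal
assert integer_value("0b_10_1010") == 42
assert integer_value("0x2a") == 42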

Decimal tokens

Regular expression           Lexeme category
[0-9][0-9_]*\.[0-9][0-9_]*   Decimal token

A decimal token consists of one or more digits (0-9) representing the integer portion, a period . (U+002e), and one or more digits (0-9) representing the fractional portion.

Underscore characters _ (U+005f) may be inserted for visual spacing; they do not affect the value.

These are examples of decimal tokens.

  • 0.0
  • 0.001
  • 3.141_592_653

Each decimal token represents a number whose scale is the number of digits on the right side of the period, and whose mantissa is the integer formed by all of the digits with the period (and any underscores) removed.
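
A small Python sketch of this reading follows; it is illustrative only and the function name is made up.

def decimal_parts(lexeme):
    digits = lexeme.replace("_", "")            # underscores are spacing only
    integer_part, fractional_part = digits.split(".")
    scale = len(fractional_part)                # digits right of the period
    mantissa = int(integer_part + fractional_part)
    return mantissa, scale

assert decimal_parts("0.001") == (1, 3)
assert decimal_parts("3.141_592_653") == (3141592653, 9)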

String tokens

Regular expression   Lexeme category
'(''|[^'])*'         String token (simple)
"([^"\\]|\\[0abtnvfre"\\]|\\x\{(0[0-9a-f]{0,5}|[2-9a-f][0-9a-f]{0,4}|10[0-9a-f]{0,4}|1[0-9a-f]{0,4})\})*"   String token (rich)

There are two types of string tokens: simple and rich.

In a simple string token, any characters between the two single quotation marks ' (U+0027) are the content of the string. To include a single quotation mark itself in the string, write two consecutive single quotation marks.

These are examples of simple strings.

  • 'Hello world'
  • 'Let''s go!' (it represents "Let's go!")

In a rich string token, characters between the two double quotation marks " (U+0022) are the content of the string. In the token, a sequence of characters prefixed by a backslash \ (U+005c) represents a special character, such as a line feed (\n) or a double quotation mark (\").

These are examples of rich string tokens.

  • "Let's go!"
  • "GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"

Here is a list of backslash notations.

Notation     Unicode    Description
\0           U+0000     Null character
\a           U+0007     Bell
\b           U+0008     Backspace
\t           U+0009     Horizontal tab
\n           U+000a     Line feed
\v           U+000b     Vertical tab
\f           U+000c     Form feed
\r           U+000d     Carriage return
\e           U+001b     Escape
\"           U+0022     Double quotation mark "
\\           U+005c     Backslash \
\x{xxxxxx}   U+xxxxxx   Character specified by its Unicode code point. xxxxxx is one
                        to six hexadecimal digits (0-9a-f) and must be less than or
                        equal to 10ffff.
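
As an informal illustration (not part of the specification), the content of a string token can be decoded along these lines; the function names and the SIMPLE_ESCAPES table are made up for this sketch.

import re

SIMPLE_ESCAPES = {
    "0": "\x00", "a": "\a", "b": "\b", "t": "\t", "n": "\n",
    "v": "\v", "f": "\f", "r": "\r", "e": "\x1b", '"': '"', "\\": "\\",
}

def simple_string_value(token):
    # Strip the surrounding quotes; '' inside the token stands for a single '.
    return token[1:-1].replace("''", "'")

def rich_string_value(token):
    def decode(match):
        esc = match.group(1)
        if esc.startswith("x{"):
            return chr(int(esc[2:-1], 16))   # \x{...}: hexadecimal code point
        return SIMPLE_ESCAPES[esc]
    return re.sub(r"\\(x\{[0-9a-f]{1,6}\}|.)", decode, token[1:-1])

assert simple_string_value("'Let''s go!'") == "Let's go!"
assert rich_string_value('"a\\tb\\x{41}"') == "a\tbA"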

Mark tokens

Tokens other than those described above are generically called mark tokens.

Some lexemes are reduced to different mark tokens depending on the following conditions.

WS
    There is a token preceding the current lexeme, and one or more whitespace characters are placed between the previous token and the current lexeme.
not WS
    There is no token preceding the current lexeme, or no whitespace character is placed between the previous token and the current lexeme.
Literal pattern   Condition   Mark token       Note
!                 always      !                op_lognot operator
~                 always      ~                op_not operator
<-                always      <-               op_store operator
||                always      ||               op_logor operator
&&                always      &&               op_logand operator
==                always      ==               op_eq operator
!=                always      !=               op_ne operator
<                 always      <                op_lt operator
>                 always      >                op_gt operator
<=                always      <=               op_le operator
>=                always      >=               op_ge operator
|                 always      |                op_or operator
^                 always      ^                op_xor operator
&                 always      &                op_and operator
<<                always      <<               op_shl operator
>>                always      >>               op_shr operator
+                 always      +                op_add operator
-                 always      -                op_sub operator, or op_minus operator
*                 always      *                op_mul operator
/                 always      /                op_div operator
//                always      //               op_intdiv operator
%                 always      %                op_rem operator
=                 always      =                Let clauses
:                 not WS      COLON            Local variable references
:                 WS          WS_COLON         Attributional variable references
\                 always      \                Prefix of pseudo variables
$                 not WS      DOLLAR           Local dereference of verb variables
$                 WS          WS_DOLLAR        Attributional dereference of verb variables
.                 always      .                Access to variables or functions
!!                always      !!               Direct call expressions
...               always      ...              Elements spreading
[                 not WS      OPENBRACKET      Vectors, formal receivers, or actual receivers
[                 WS          WS_OPENBRACKET   Vectors
]                 always      ]                Closing
{                 not WS      OPENBRACE        Function expressions or function arguments
{                 WS          WS_OPENBRACE     Function expressions
}                 always      }                Closing
(                 not WS      OPENPAREN        Parentheses expressions, formal arguments, or actual arguments
(                 WS          WS_OPENPAREN     Parentheses expressions
)                 always      )                Closing
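
The following Python sketch illustrates, informally and outside the specification, how the WS and non-WS variants can be chosen while reducing lexemes to tokens; the names are made up for this sketch.

WS_SENSITIVE = {
    ":": "COLON", "$": "DOLLAR", "[": "OPENBRACKET",
    "{": "OPENBRACE", "(": "OPENPAREN",
}

def mark_token(lexeme, has_previous_token, whitespace_before):
    base = WS_SENSITIVE.get(lexeme)
    if base is None:
        return lexeme                # e.g. '+', '<-', '...': always the same token
    # WS condition: a previous token exists and whitespace separates it
    # from this lexeme; otherwise the non-WS variant is produced.
    if has_previous_token and whitespace_before:
        return "WS_" + base
    return base

assert mark_token("(", True, False) == "OPENPAREN"
assert mark_token("(", True, True)  == "WS_OPENPAREN"

Under the grammar in the next section, this distinction is what makes f(X) read as a call of f with an actual argument, while f (X) reads as a call of f with no arguments followed by a parenthesized expression.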

2.2.3. Abstract source tree

The following list describes the parsing rules that build the program abstract source tree, or AST, from the token sequence of a program text.

There are several shift/reduce conflicts in the rules, and the parser always chooses to shift.

program           ::=  toplevel
toplevel          ::=  empty
                       expression toplevel
seq               ::=  empty
                       substantial_seq
substantial_seq   ::=  expression
                       expression substantial_seq
                       expression '=' expression substantial_seq
expression        ::=  store_op
store_op          ::=  logor_op
                       logor_op '<-' logor_op
logor_op          ::=  logand_op
                       logand_op '||' logor_op
logand_op         ::=  relation_op
                       relation_op '&&' logand_op
relation_op       ::=  add_op
                       add_op '==' add_op
                       add_op '!=' add_op
                       add_op '<' add_op
                       add_op '>' add_op
                       add_op '<=' add_op
                       add_op '>=' add_op
add_op            ::=  multiply_op
                       add_op '+' multiply_op
                       add_op '-' multiply_op
                       add_op '|' multiply_op
                       add_op '^' multiply_op
multiply_op       ::=  unary_op
                       multiply_op '*' unary_op
                       multiply_op '/' unary_op
                       multiply_op '//' unary_op
                       multiply_op '%' unary_op
                       multiply_op '&' unary_op
                       multiply_op '<<' unary_op
                       multiply_op '>>' unary_op
unary_op          ::=  primary
                       '-' unary_op
                       '!' unary_op
                       '~' unary_op
primary           ::=  num
                       str
                       context_env
                       context_recv
                       context_args
                       paren
                       vec
                       fun
                       local_deref
                       attr_deref
                       local_ref
                       attr_ref
                       local_call
                       attr_call
                       direct_call
num               ::=  INTEGER
                       DECIMAL
str               ::=  STRING
context_env       ::=  '\' VERB('env')
context_recv      ::=  '\' VERB('recv')
context_args      ::=  '\' VERB('args')
paren             ::=  OPENPAREN seq ')'
                       WS_OPENPAREN seq ')'
vec               ::=  OPENBRACKET vec_body ']'
                       WS_OPENBRACKET vec_body ']'
fun               ::=  OPENBRACE fun_body '}'
                       WS_OPENBRACE fun_body '}'
local_deref       ::=  NOUN
                       DOLLAR VERB
                       WS_DOLLAR VERB
local_ref         ::=  COLON NOUN
                       WS_COLON NOUN
                       COLON VERB
                       WS_COLON VERB
local_call        ::=  VERB recv args
attr_deref        ::=  primary '.' NOUN
                       primary DOLLAR VERB
attr_ref          ::=  primary COLON NOUN
                       primary COLON VERB
attr_call         ::=  primary '.' VERB recv args
direct_call       ::=  primary '!!' recv args
recv              ::=  empty
                       OPENBRACKET expression ']'
args              ::=  paren_args fun_args
paren_args        ::=  empty
                       OPENPAREN vec_body ')'
fun_args          ::=  empty
                       fun_arg fun_args
fun_arg           ::=  OPENBRACE fun_body '}'
vec_body          ::=  empty
                       elements_producer vec_body
elements_producer ::=  expression
                       '...' expression
fun_body          ::=  formal_receiver formal_args seq
formal_receiver   ::=  empty
                       OPENBRACKET expression ']'
formal_args       ::=  empty
                       OPENPAREN vec_body ')'
empty             ::= 

The following symbols used in the production rules are terminal symbols.

Terminal symbol   Description
VERB              Verb symbol such as print, including env, recv and args
VERB('env')       Verb symbol env
VERB('recv')      Verb symbol recv
VERB('args')      Verb symbol args
NOUN              Noun symbol such as Count
INTEGER           Integer literal such as 42
DECIMAL           Decimal literal such as 3.14159
STRING            String literal such as 'Peng!'
COLON             Mark token COLON
WS_COLON          Mark token WS_COLON
DOLLAR            Mark token DOLLAR
WS_DOLLAR         Mark token WS_DOLLAR
OPENPAREN         Mark token OPENPAREN
WS_OPENPAREN      Mark token WS_OPENPAREN
OPENBRACKET       Mark token OPENBRACKET
WS_OPENBRACKET    Mark token WS_OPENBRACKET
OPENBRACE         Mark token OPENBRACE
WS_OPENBRACE      Mark token WS_OPENBRACE
Symbols enclosed in single quotation marks   The corresponding mark token
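
As an informal worked example (not part of the normative rules), the earlier line stdout.print_line('foo' * 2) is parsed roughly as follows; single-production steps from expression down to primary are omitted.

stdout.print_line('foo' * 2)

expression
  attr_call        primary '.' VERB recv args
    primary        local_call: VERB(stdout) with empty recv and empty args
    '.'
    VERB           print_line
    recv           empty
    args           paren_args fun_args
      paren_args   OPENPAREN vec_body ')'
        vec_body   one elements_producer, the expression 'foo' * 2
                   (multiply_op '*' unary_op with STRING('foo') and INTEGER(2))
      fun_args     empty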