2.2. Syntax

This chapter describes how program texts are syntactically analyzed.

2.2.1. Steps of analysis

A program text is a Unicode string.

A program text is first sliced into a sequence of lexemes, using longest-possible matching from the start of the text to the end.

Next, the lexeme sequence is reduced to a sequence of tokens, filtering out non-token lexemes.

Finally, the token sequence is parsed into the abstract source tree of the program.
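The following sketch illustrates the first two steps in Python. The regular expressions are taken from the tables later in this chapter (only a few are shown), and the names lex, tokens and LEXEME_PATTERNS are illustrative only; they are not part of the specification.

    import re

    # A few of the lexeme patterns defined later in this chapter (not exhaustive).
    LEXEME_PATTERNS = [
        ("Whitespace",  r"[\u0020\u000d\u000a]"),
        ("Comment",     r"#[^\u000a]*"),
        ("Num",         r"[0-9][0-9_]*(\.[0-9][0-9_]*)?(?![a-zA-Z0-9_?])"),
        ("Fun symbol",  r"[a-z_][a-z0-9_?]*"),
        ("Data symbol", r"([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]*"),
    ]

    def lex(text):
        """Slice a program text into lexemes using longest-possible matching."""
        pos, lexemes = 0, []
        while pos < len(text):
            best = None
            for category, pattern in LEXEME_PATTERNS:
                m = re.compile(pattern).match(text, pos)
                if m and (best is None or len(m.group()) > len(best[1])):
                    best = (category, m.group())
            if best is None:
                raise SyntaxError("no lexeme matches at offset %d" % pos)
            lexemes.append(best)
            pos += len(best[1])
        return lexemes

    def tokens(lexemes):
        """Reduce the lexeme sequence to tokens, filtering out non-token lexemes."""
        return [lx for lx in lexemes if lx[0] not in ("Whitespace", "Comment")]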

2.2.2. Lexemes

2.2.2.1. Whitespace and comments

Lexeme category    Regular expression
----------------------------------------
Whitespace         [\u0020\u000d\u000a]
Comment            #[^\u000a]*

Whitespace characters are used to make a program text easy to read and to separate tokens from each other. Space (U+0020), carriage return (U+000d) and line feed (U+000a) are whitespace characters.

Note

Horizontal tab (U+0009) is not a valid whitespace character. This design decision is intended to avoid bikeshed discussion about indentation.

The categories of the lexemes (, [, {, : and $ may differ depending on whether they are located after whitespace characters. See the description of mark tokens for details.

Comments are used to describe the program text. A number sign # (U+0023) indicates the start of a comment. The comment continues to the end of the line, that is, up to the next line feed character or to the end of the program text. Comments are treated as whitespace characters.

stdout.print_line('foo'*2)      # => foofoo

stdout.print_line( 'foo' * 2 )  # => foofoo

# Comment line
do_something  # trailing comment

Whitespace characters and comments are not tokens.

2.2.2.2. Symbol tokens

Lexeme category      Regular expression
-------------------------------------------------------------
Fun symbol token     [a-z_][a-z0-9_?]*
Data symbol token    ([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]*

A symbol token consists of a leading ASCII letter (a-zA-Z) or underscore _ (U+005f), followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9), underscores _ (U+005f) and question marks ? (U+003f). There are two types of symbol tokens: function symbols and data symbols.

function symbol token
function symbol

A symbol token which does not contain uppercase letters [A-Z].

Function symbols are commonly used for names of variables which contain functions. The following are examples of function symbols.

  • empty?

  • _loop

  • take_5

data symbol token
data symbol

A symbol token which contains one or more uppercase letters [A-Z].

Data symbols are commonly used for names of variables which contain ordinary values. The following are examples of data symbols.

  • More_lines?

  • ArrayList_class

  • FLAT_MAP

  • _HASH_TABLE

  • rarely_Used
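As an informal illustration, the following Python sketch classifies a symbol token using the two regular expressions above (the helper name classify_symbol is not part of the specification).

    import re

    FUN_SYM  = re.compile(r"[a-z_][a-z0-9_?]*\Z")
    DATA_SYM = re.compile(r"([a-z_][a-z0-9_?]*)?[A-Z][a-zA-Z0-9_?]*\Z")

    def classify_symbol(token):
        """Data symbols contain an uppercase letter; function symbols do not."""
        if DATA_SYM.match(token):
            return "data symbol"
        if FUN_SYM.match(token):
            return "function symbol"
        raise ValueError("not a symbol token: %r" % token)

    assert classify_symbol("empty?") == "function symbol"
    assert classify_symbol("More_lines?") == "data symbol"
    assert classify_symbol("rarely_Used") == "data symbol"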

2.2.2.3. Num tokens

Lexeme category       Regular expression
----------------------------------------------------------------------
Num token (base10)    [0-9][0-9_]*(\.[0-9][0-9_]*)?(?![a-zA-Z0-9_?])
Num token (base2)     0b_*[01][01_]*(?![a-zA-Z0-9_?])
Num token (base16)    0x_*[0-9a-f][0-9a-f_]*(?![a-zA-Z0-9_?])

There are three types of num tokens: base10, base16 and base2. A base10 num token represents a num in decimal, base16 in hexadecimal, and base2 in binary.

A fractional part can be written only in base10 num tokens. Digits before the period . (U+002e) represent the integer portion, and digits after the period represent the fractional portion.

Underscore characters _ (U+005f) can be placed for spacing.

These are examples of num tokens, each of which represents 42.

  • 42

  • 42__

  • 0042

  • 0x2a

  • 0b_10_1010

These are examples of num tokens with the fractional portion.

  • 0.0

  • 0.001

  • 3.141_592_653

A num token, whether base10, base2 or base16, cannot be directly followed by a character which can form a symbol. This limitation is expressed by the trailing negative lookahead (?![a-zA-Z0-9_?]) in the regular expressions above. Thus, for example, the code fragment 24h causes a syntax error.

Note

Without this rule, 0b123 would be parsed as 0b1 followed by 23, which would certainly be error prone.

Each num token represents a number whose scale is the number of digits in the fractional portion, and whose mantissa is the integer made of all of its digits.
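For illustration, the following Python sketch computes the mantissa and scale of a base10 num token as described above (the helper name num_value is not part of the specification).

    def num_value(token):
        """Return (mantissa, scale) for a base10 num token such as '3.141_592_653'."""
        digits = token.replace("_", "")         # underscores are only for spacing
        integer, _, fraction = digits.partition(".")
        mantissa = int(integer + fraction)      # the integer made of all digits
        scale = len(fraction)                   # digits in the fractional portion
        return mantissa, scale

    assert num_value("42__") == (42, 0)
    assert num_value("0.001") == (1, 3)
    assert num_value("3.141_592_653") == (3141592653, 9)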

2.2.2.4. String tokens

Lexeme category          Regular expression
----------------------------------------------------------------------
String token (simple)    '(''|[^'])*'
String token (rich)      "([^"\\]|\\[0abtnvfre"\\]|\\x\{(0[0-9a-f]{0,5}|[2-9a-f][0-9a-f]{0,4}|10[0-9a-f]{0,4}|1[0-9a-f]{0,4})\})*"

There are two types of string tokens: simple and rich.

In a simple string token, any characters between the two single quotation marks ' (U+0027) are the content of the string. If you want to include a quotation mark itself in the string, put two consecutive quotation marks.

These are examples of simple strings.

  • 'Hello world'

  • 'Let''s go!' (it represents "Let's go!")

In a rich string token, characters between the two double quotation marks " (U+0022) are the content of the string. In the token, a sequence of characters prefixed by a backslash \ (U+005c) represents a special character, such as a line feed (\n) or a double quotation mark (\").

These are examples of rich string tokens.

  • "Let's go!"

  • "GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"

Here is a list of backslash notations.

Notation      Unicode     Description
-----------------------------------------------------------------------------
\0            U+0000      Null character
\a            U+0007      Bell
\b            U+0008      Backspace
\t            U+0009      Horizontal tab
\n            U+000a      Line feed
\v            U+000b      Vertical tab
\f            U+000c      Form feed
\r            U+000d      Carriage return
\e            U+001b      Escape
\"            U+0022      Double quotation mark "
\\            U+005c      Backslash \
\x{xxxxxx}    U+xxxxxx    Character specified by a Unicode code point. xxxxxx
                          is one to six hexadecimal digits (0-9a-f), the value
                          of which must be less than or equal to 10ffff.
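As an informal illustration, the Python sketch below decodes the content of a simple string token and of a rich string token according to the rules above (the helper names decode_simple and decode_rich are not part of the specification).

    import re

    # Code points of the single-character backslash notations listed above.
    ESCAPES = {"0": "\u0000", "a": "\u0007", "b": "\u0008", "t": "\u0009",
               "n": "\u000a", "v": "\u000b", "f": "\u000c", "r": "\u000d",
               "e": "\u001b", '"': "\u0022", "\\": "\u005c"}

    def decode_simple(token):
        """Two consecutive quotation marks stand for one quotation mark."""
        return token[1:-1].replace("''", "'")

    def decode_rich(token):
        """Replace the backslash notations inside a rich string token."""
        def replace(m):
            esc = m.group(1)
            if esc.startswith("x{"):
                return chr(int(esc[2:-1], 16))   # \x{...}: explicit code point
            return ESCAPES[esc]
        return re.sub(r'\\(x\{[0-9a-f]{1,6}\}|.)', replace, token[1:-1])

    assert decode_simple("'Let''s go!'") == "Let's go!"
    assert decode_rich('"a\\tb\\x{263a}"') == "a\tb\u263a"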

2.2.2.5. Mark tokens

Tokens other than those described above are generically called mark tokens.

Some lexemes are reduced to different mark tokens depending on the following conditions.

WS

There is a token preceding the current lexeme, and one or more whitespace characters are placed between that token and the current lexeme.

not WS

There is no token preceding the current lexeme, or no whitespace character is placed between the previous token and the current lexeme.

Pattern                          Condition  Mark token      Note
------------------------------------------------------------------------------------------
literal !                        always     !               op_lognot operator
literal ~                        always     ~               op_not operator
literal <-                       always     <-              op_store operator
literal ||                       always     ||              op_logor operator
literal &&                       always     &&              op_logand operator
literal ==                       always     ==              op_eq operator
literal !=                       always     !=              not-equal operator
literal <                        always     <               op_lt operator
literal >                        always     >               greater-than operator
literal <=                       always     <=              less-than-or-equal-to operator
literal >=                       always     >=              greater-than-or-equal-to operator
literal |                        always     |               op_or operator
literal ^                        always     ^               op_xor operator
literal &                        always     &               op_and operator
literal <<                       always     <<              op_shl operator
literal >>                       always     >>              op_shr operator
literal +                        always     +               op_add operator
literal -                        always     -               op_sub operator, or op_minus operator
literal *                        always     *               op_mul operator
literal /                        always     /               op_div operator
literal //                       always     //              op_intdiv operator
literal %                        always     %               op_rem operator
literal =                        always     =               Let clauses
literal :                        not WS     COLON           Local or member variable references
literal :                        WS         WS_COLON        Local variable references
literal $                        not WS     DOLLAR          Local or member variable load of a function symbol
literal $                        WS         WS_DOLLAR       Local variable load of a function symbol
literal .                        always     .               Access to variables or functions
literal ...                      always     ...             Elements spreading
literal [                        not WS     OPENBRACKET     Vectors, formal receivers, or actual receivers
literal [                        WS         WS_OPENBRACKET  Vectors
literal ]                        always     ]               Closing
literal {                        not WS     OPENBRACE       Function expressions or function arguments
literal {                        WS         WS_OPENBRACE    Function expressions
literal }                        always     }               Closing
literal (                        not WS     OPENPAREN       Parentheses expressions, formal arguments, or actual arguments
literal (                        WS         WS_OPENPAREN    Parentheses expressions
literal )                        always     )               Closing
regex \binding(?![a-zA-Z0-9_?])  always     \binding        Context binding expressions
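A minimal sketch of how the WS and not-WS conditions select a mark token, in Python (the function name mark_token and its parameters are illustrative only and are not part of the specification):

    def mark_token(lexeme, prev_token, ws_between):
        """Choose the plain or the WS_ variant of a condition-dependent mark token.

        prev_token -- the previous token, or None if there is no preceding token
        ws_between -- True if whitespace separates the previous token from the lexeme
        """
        variants = {"(": ("OPENPAREN", "WS_OPENPAREN"),
                    "[": ("OPENBRACKET", "WS_OPENBRACKET"),
                    "{": ("OPENBRACE", "WS_OPENBRACE"),
                    ":": ("COLON", "WS_COLON"),
                    "$": ("DOLLAR", "WS_DOLLAR")}
        ws = prev_token is not None and ws_between    # the WS condition above
        if lexeme in variants:
            return variants[lexeme][1] if ws else variants[lexeme][0]
        return lexeme                                 # other mark tokens are unconditional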

2.2.3. Abstract source tree

The following rules describe how the abstract source tree (AST) of a program is built from the token sequence of its program text.

There are several shift/reduce conflicts in the rules, and the parser always chooses to shift.

program           ::=  toplevel
toplevel          ::=  empty
                       expression toplevel
seq               ::=  empty
                       substantial_seq
substantial_seq   ::=  expression
                       expression substantial_seq
                       expression '=' expression substantial_seq
expression        ::=  store_op
store_op          ::=  logor_op
                       logor_op '<-' logor_op
logor_op          ::=  logand_op
                       logand_op '||' logor_op
logand_op         ::=  relation_op
                       relation_op '&&' logand_op
relation_op       ::=  add_op
                       add_op '==' add_op
                       add_op '!=' add_op
                       add_op '<' add_op
                       add_op '>' add_op
                       add_op '<=' add_op
                       add_op '>=' add_op
add_op            ::=  multiply_op
                       add_op '+' multiply_op
                       add_op '-' multiply_op
                       add_op '|' multiply_op
                       add_op '^' multiply_op
multiply_op       ::=  unary_op
                       multiply_op '*' unary_op
                       multiply_op '/' unary_op
                       multiply_op '//' unary_op
                       multiply_op '%' unary_op
                       multiply_op '&' unary_op
                       multiply_op '<<' unary_op
                       multiply_op '>>' unary_op
unary_op          ::=  primary
                       '-' unary_op
                       '!' unary_op
                       '~' unary_op
primary           ::=  num
                       str
                       binding
                       paren
                       vec
                       fun
                       local_load
                       member_load
                       local_varref
                       member_varref
                       local_call
                       member_call
num               ::=  NUM
str               ::=  STRING
binding           ::=  '\binding'
paren             ::=  OPENPAREN seq ')'
                       WS_OPENPAREN seq ')'
vec               ::=  OPENBRACKET vec_body ']'
                       WS_OPENBRACKET vec_body ']'
fun               ::=  OPENBRACE fun_body '}'
                       WS_OPENBRACE fun_body '}'
local_load        ::=  DATA_SYM
                       DOLLAR FUN_SYM
                       WS_DOLLAR FUN_SYM
local_varref      ::=  COLON DATA_SYM
                       WS_COLON DATA_SYM
                       COLON FUN_SYM
                       WS_COLON FUN_SYM
local_call        ::=  FUN_SYM recv args
member_load       ::=  primary '.' DATA_SYM
                       primary DOLLAR FUN_SYM
member_varref     ::=  primary COLON DATA_SYM
                       primary COLON FUN_SYM
member_call       ::=  primary '.' FUN_SYM recv args
recv              ::=  empty
                       OPENBRACKET expression ']'
args              ::=  paren_args fun_args
paren_args        ::=  empty
                       OPENPAREN vec_body ')'
fun_args          ::=  empty
                       fun_arg fun_args
fun_arg           ::=  OPENBRACE fun_body '}'
vec_body          ::=  empty
                       elements_producer vec_body
elements_producer ::=  expression
                       '...' expression
fun_body          ::=  formal_receiver formal_args seq
formal_receiver   ::=  empty
                       OPENBRACKET expression ']'
formal_args       ::=  empty
                       OPENPAREN vec_body ')'
empty             ::= 

The symbols shown in the production rules above are terminal symbols; they are listed below.

Terminal symbol    Description
-----------------------------------------------------
FUN_SYM            Fun symbol such as print
DATA_SYM           Data symbol such as Count
NUM                Num literal such as 42 or 3.14159
STRING             String literal such as 'Peng!'
COLON              Mark token COLON
WS_COLON           Mark token WS_COLON
DOLLAR             Mark token DOLLAR
WS_DOLLAR          Mark token WS_DOLLAR
OPENPAREN          Mark token OPENPAREN
WS_OPENPAREN       Mark token WS_OPENPAREN
OPENBRACKET        Mark token OPENBRACKET
WS_OPENBRACKET     Mark token WS_OPENBRACKET
OPENBRACE          Mark token OPENBRACE
WS_OPENBRACE       Mark token WS_OPENBRACE

A terminal symbol enclosed in single quotation marks denotes the corresponding mark token.
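As an informal illustration of these rules, the example stdout.print_line('foo' * 2) from earlier in this chapter can be parsed as the following tree (a sketch only; chains of single-alternative productions between expression and primary are elided).

    member_call
        primary: local_call
            FUN_SYM "stdout"            (recv and args are both empty)
        '.'
        FUN_SYM "print_line"
        recv: empty
        args
            paren_args: OPENPAREN vec_body ')'
                vec_body: elements_producer
                    expression: multiply_op
                        multiply_op: STRING "'foo'"
                        '*'
                        unary_op: NUM "2"
            fun_args: empty

This member_call is the single expression of toplevel, which in turn forms program.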