2.2. Syntax

A program is regarded as a sequence of tokens. Whitespace characters, comments and line feed characters can appear almost anywhere between tokens.

Tokens are analyzed by the greedy-match rule: at each position, the longest possible token is taken. For example, the fragment catch22 is analyzed as one token, “catch22”. If you want the text to be analyzed differently, separate the tokens with whitespace characters, as in catch 22, which is analyzed as two tokens: “catch” and “22”.
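The greedy-match rule can be illustrated with a small Python sketch. It is not the actual lexer: the TOKEN pattern is an assumption that covers only symbols and plain digit runs, and the function name tokenize is hypothetical.

```python
import re

# Assumed, simplified token patterns: a symbol, or a run of digits.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_?]*|[0-9]+")

def tokenize(text):
    """Split text into tokens, taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t\r\n":
            pos += 1  # whitespace and line feeds only separate tokens
            continue
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens

print(tokenize("catch22"))   # ['catch22']
print(tokenize("catch 22"))  # ['catch', '22']
```

Because the symbol pattern keeps consuming letters and digits for as long as it can, catch22 never splits unless whitespace forces a token boundary.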

2.2.1. Whitespace, comments and line feed characters

Whitespace characters are used to make a program easy to read and to separate tokens from each other. Space (U+0020), horizontal tab (U+0009) and carriage return (U+000d) are whitespace characters.

Comments are used to describe the program. A number sign # (U+0023) indicates the start of a comment. The comment continues to the end of the line, that is, up to the next line feed character or the end of the program. Comments are treated as whitespace characters.

print_line(21*2)      # => 42

print_line( 21 * 2 )  # => 42

# Comment line
do_something  # trailing comment

Line feed characters (U+000a) are used to make a program easy to read, like whitespace characters.

# All expressions in one line
:Num = Argv.first.int  :Result = Num * 3  print_line(Result)

# Separated by line feed characters
:Num = Argv.first.int
:Result = Num * 3
print_line(Result)

(, [ and { may have different meanings when located after whitespace characters or line feed characters. See the description of terminal symbols for details.

2.2.2. Symbol tokens

A symbol token consists of a leading ASCII letter (a-zA-Z) or an underscore _ (U+005f), followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9), underscores _ (U+005f) and question marks ? (U+003f). There are two types of symbols: verbs and nouns.

If the first character of a symbol is a lower case letter (a-z), it is a verb. Verbs are commonly used for names of variables which contain functions.

These are examples of verbs.

  • any?
  • _loop
  • getClassLoader

If the first character of a symbol is an upper case letter (A-Z) or an underscore _, it is a noun. Nouns are commonly used for names of variables which contain regular values.

These are examples of nouns.

  • ArrayList
  • MAX_VALUE
  • More_lines?
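The shape of a symbol token described above can be written as a regular expression. The following Python sketch only checks whether a piece of text is a well-formed symbol; it does not decide between verb and noun. The names SYMBOL and is_symbol are hypothetical.

```python
import re

# A leading ASCII letter or underscore, then zero or more ASCII letters,
# ASCII digits, underscores and question marks.
SYMBOL = re.compile(r"[A-Za-z_][A-Za-z0-9_?]*")

def is_symbol(text):
    """Return True if the whole text has the shape of a symbol token."""
    return SYMBOL.fullmatch(text) is not None

for text in ["any?", "_loop", "getClassLoader", "ArrayList", "MAX_VALUE", "More_lines?"]:
    print(text, is_symbol(text))  # each line ends with True

print(is_symbol("22catch"))  # False: a symbol cannot start with a digit
```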

2.2.3. Integer tokens

There are three types of integer tokens: base10, base16 and base2.

An integer in the base10 notation consists of one or more digits (0-9). An integer in the base16 notation consists of a prefix 0x and one or more hexadecimal digits (0-9a-f). An integer in the base2 notation consists of a prefix 0b and one or more binary digits (0-1).

You can place spacing underscores _ (U+005f) after a prefix 0x or 0b, between digits, and after digits. The underscores are simply ignored.

These are examples of integer tokens, each of which represents 42.

  • 42
  • 42__
  • 0042
  • 0x2a
  • 0b_10_1010

Note that octal integer notation is not supported. A sequence of digits which starts with 0 is read as a base10 integer.

Each integer token represents a number whose scale is zero, and whose unscaled value is the specified integer.
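As a rough illustration of these rules, the Python sketch below computes the value of an integer token. It is a model of the description above, not the real lexer, and the names INTEGER and integer_value are hypothetical.

```python
import re

# One alternative per notation; underscores may appear after the prefix,
# between digits and after digits.
INTEGER = re.compile(
    r"0x(?P<hex>[0-9a-f_]+)|0b(?P<bin>[01_]+)|(?P<dec>[0-9][0-9_]*)"
)

def integer_value(token):
    """Return the integer represented by a base10, base16 or base2 token."""
    match = INTEGER.fullmatch(token)
    if match is None:
        raise ValueError(f"not an integer token: {token!r}")
    for group, base in (("hex", 16), ("bin", 2), ("dec", 10)):
        digits = match.group(group)
        if digits is not None:
            digits = digits.replace("_", "")  # underscores are ignored
            if not digits:
                raise ValueError(f"no digits in token: {token!r}")
            return int(digits, base)

for token in ["42", "42__", "0042", "0x2a", "0b_10_1010"]:
    print(token, integer_value(token))  # every token yields 42
```

Note that 0042 is read in base 10, in line with the remark that octal notation is not supported.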

2.2.4. Decimal tokens

A decimal token consists of one or more digits (0-9) which represent the integer portion, a period . (U+002e), and one or more digits (0-9) which represent the fractional portion.

In the integer portion and the fractional portion, you can place spacing underscores _ (U+005f) between digits and after digits. The underscores are simply ignored.

These are examples of decimals.

  • 0.0
  • 0.001
  • 3.141_592_653

Each decimal token represents a number whose scale is the number of digits on the right side of the period, and whose unscaled value is the integer formed by the digits of both portions. For example, 0.001 has a scale of 3 and an unscaled value of 1.
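The relation between a decimal token, its unscaled value and its scale can be sketched in Python as follows. This is an illustrative model only; the names DECIMAL and decimal_parts are hypothetical.

```python
import re

# Integer portion, period, fractional portion; underscores are allowed
# between digits and after digits of each portion.
DECIMAL = re.compile(r"(?P<int>[0-9][0-9_]*)\.(?P<frac>[0-9][0-9_]*)")

def decimal_parts(token):
    """Return (unscaled value, scale) for a decimal token."""
    match = DECIMAL.fullmatch(token)
    if match is None:
        raise ValueError(f"not a decimal token: {token!r}")
    int_digits = match.group("int").replace("_", "")
    frac_digits = match.group("frac").replace("_", "")
    # The represented number is unscaled * 10 ** -scale.
    return int(int_digits + frac_digits), len(frac_digits)

print(decimal_parts("0.0"))            # (0, 1)
print(decimal_parts("0.001"))          # (1, 3)
print(decimal_parts("3.141_592_653"))  # (3141592653, 9)
```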

2.2.5. String tokens

There are two types of string tokens: simple string tokens and rich string tokens.

In a simple string token, any characters between the two single quotation marks ' (U+0027) are the content of the string. If you want to include a quotation mark itself in the string, put two consecutive quotation marks.

These are examples of simple strings.

  • 'Hello world'
  • 'Let''s go!' (it represents “Let’s go!”)
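The doubling rule for quotation marks can be modeled with a few lines of Python. This is a sketch under the rules stated above, and the function name simple_string_content is hypothetical.

```python
def simple_string_content(token):
    """Return the content of a simple string token such as 'Let''s go!'."""
    if len(token) < 2 or token[0] != "'" or token[-1] != "'":
        raise ValueError(f"not a simple string token: {token!r}")
    body = token[1:-1]
    # Inside the token, a quotation mark may only appear doubled.
    if "'" in body.replace("''", ""):
        raise ValueError(f"lone quotation mark in token: {token!r}")
    return body.replace("''", "'")

print(simple_string_content("'Hello world'"))  # Hello world
print(simple_string_content("'Let''s go!'"))   # Let's go!
```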

In a rich string token, characters between the two double quotation marks " (U+0022) are the content of the string. In the token, a sequence of characters prefixed by a backslash \ (U+005c) represents a special character, such as a line feed (\n) or a double quotation mark (\").

These are examples of rich string tokens.

  • "Hey! ho! let's go!"
  • "GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"

Here is a list of backslash notations.

Notation  Unicode   Description
\0        U+0000    Null character
\a        U+0007    Bell
\b        U+0008    Backspace
\t        U+0009    Horizontal tab
\n        U+000a    Line feed
\v        U+000b    Vertical tab
\f        U+000c    Form feed
\r        U+000d    Carriage return
\e        U+001b    Escape
\"        U+0022    Double quotation mark "
\\        U+005c    Backslash \
\uxxxx    U+xxxx    Character specified by Unicode. xxxx are four hexadecimal digits (0-9a-f).
\Uxxxxxx  U+xxxxxx  Character specified by Unicode. xxxxxx are six hexadecimal digits (0-9a-f).
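The backslash notations listed above can be decoded as in the following Python sketch. It works on the text between the double quotation marks and assumes that any notation not in the list is an error, which the list itself does not state. The names SIMPLE_ESCAPES, ESCAPE and decode_rich_body are hypothetical.

```python
import re

# Single-character notations from the list of backslash notations.
SIMPLE_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "t": "\t", "n": "\n", "v": "\v",
    "f": "\f", "r": "\r", "e": "\x1b", '"': '"', "\\": "\\",
}

# \uxxxx, \Uxxxxxx, or a backslash followed by any single character.
ESCAPE = re.compile(r"\\(?:u([0-9a-f]{4})|U([0-9a-f]{6})|(.))", re.DOTALL)

def decode_rich_body(body):
    """Replace backslash notations in the body of a rich string token."""
    def replace(match):
        short_hex, long_hex, simple = match.groups()
        if short_hex is not None:
            return chr(int(short_hex, 16))
        if long_hex is not None:
            return chr(int(long_hex, 16))
        if simple in SIMPLE_ESCAPES:
            return SIMPLE_ESCAPES[simple]
        # Assumption: notations outside the list are rejected.
        raise ValueError(f"unknown backslash notation: \\{simple}")
    return ESCAPE.sub(replace, body)

print(repr(decode_rich_body(r"Host: host.example.org\r\n")))
# 'Host: host.example.org\r\n'
print(decode_rich_body(r"\u0041"))  # A
```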

2.2.6. Mark tokens

Here is a list of mark tokens.

Mark  Usage
!     op_lognot operator
~     op_not operator
=     op_set operator
||=   op_logor_set operator
&&=   op_logand_set operator
|=    op_or_set operator
^=    op_xor_set operator
&=    op_and_set operator
<<=   op_shl_set operator
>>=   op_shr_set operator
+=    op_add_set operator
-=    op_sub_set operator
*=    op_mul_set operator
/=    op_div_set operator
//=   op_intdiv_set operator
%=    op_rem_set operator
**=   op_pow_set operator
||    op_logor operator
&&    op_logand operator
==    op_eq operator
!=    op_ne operator
<     op_lt operator
>     op_gt operator, or formal receiver
<=    op_le operator
>=    op_ge operator
<=>   op_cmp operator
|     op_or operator
^     op_xor operator
&     op_and operator
<<    op_shl operator
>>    op_shr operator
+     op_add operator
-     op_sub operator, or op_minus operator
*     op_mul operator, or formal rest arguments
/     op_div operator
//    op_intdiv operator
%     op_rem operator
**    op_pow operator
..    op_range_ii operator
<..   op_range_ei operator
..<   op_range_ie operator
<..<  op_range_ee operator
:     Local variable reference
::    Attributional variable reference
\     Pseudo variable
$     Local dereference of a verb variable
$$    Attributional dereference of a verb variable
.     Access to variables or functions
->    Delimiter between the formal receiver and formal arguments and the chunk that follows them in a function body
[     Opening bracket of a list expression
]     Closing bracket of a list expression
{     Opening brace of local function expressions, dotted function expressions, or function arguments (just after a call, without whitespace characters)
}     Closing brace
(     Opening parenthesis for higher operation precedence, or opening parenthesis for formal arguments (just after {, without line feed characters), or opening parenthesis for actual arguments (just after a verb, without whitespace or line feed characters)
)     Closing parenthesis
[|    Opening of elements expansion
|]    Closing of elements expansion
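Since several marks are prefixes of longer marks, the greedy-match rule decides how a run of mark characters is split. The Python sketch below applies longest-match to the marks in the list above. It assumes the rule applies uniformly to all marks and ignores any context the real lexer may take into account; the names MARKS, BY_LENGTH and split_marks are hypothetical.

```python
# All mark tokens from the list above.
MARKS = [
    "!", "~", "=", "||=", "&&=", "|=", "^=", "&=", "<<=", ">>=",
    "+=", "-=", "*=", "/=", "//=", "%=", "**=", "||", "&&", "==", "!=",
    "<", ">", "<=", ">=", "<=>", "|", "^", "&", "<<", ">>",
    "+", "-", "*", "/", "//", "%", "**", "..", "<..", "..<", "<..<",
    ":", "::", "\\", "$", "$$", ".", "->",
    "[", "]", "{", "}", "(", ")", "[|", "|]",
]

# Try longer marks first so that the longest match wins.
BY_LENGTH = sorted(MARKS, key=len, reverse=True)

def split_marks(text):
    """Split a run of mark characters into mark tokens, longest match first."""
    tokens = []
    pos = 0
    while pos < len(text):
        for mark in BY_LENGTH:
            if text.startswith(mark, pos):
                tokens.append(mark)
                pos += len(mark)
                break
        else:
            raise ValueError(f"no mark at position {pos}: {text[pos]!r}")
    return tokens

print(split_marks("<..<"))  # ['<..<']
print(split_marks("<=>="))  # ['<=>', '=']
print(split_marks("||]"))   # ['||', ']']
```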

2.2.7. Production rules in BNF

This list describes the production rules of programs. The rules contain several shift/reduce conflicts, and the parser always chooses to shift.

program           ::=  chunk
chunk             ::=  empty
                       expression chunk
expression        ::=  assign_op
assign_op         ::=  logor_op
                       logor_op '=' logor_op    # op_set
                       logor_op '||=' logor_op  # op_logor_set
                       logor_op '&&=' logor_op  # op_logand_set
                       logor_op '|=' logor_op   # op_or_set
                       logor_op '^=' logor_op   # op_xor_set
                       logor_op '&=' logor_op   # op_and_set
                       logor_op '<<=' logor_op  # op_shl_set
                       logor_op '>>=' logor_op  # op_shr_set
                       logor_op '+=' logor_op   # op_add_set
                       logor_op '-=' logor_op   # op_sub_set
                       logor_op '*=' logor_op   # op_mul_set
                       logor_op '/=' logor_op   # op_div_set
                       logor_op '//=' logor_op  # op_intdiv_set
                       logor_op '%=' logor_op   # op_rem_set
                       logor_op '**=' logor_op  # op_pow_set
logor_op          ::=  logand_op
                       logand_op '||' logor_op      # op_logor
logand_op         ::=  relation_op
                       relation_op '&&' logand_op  # op_logand
relation_op       ::=  range_op
                       range_op '==' range_op   # op_eq
                       range_op '!=' range_op   # op_ne
                       range_op '<' range_op    # op_lt
                       range_op '>' range_op    # op_gt
                       range_op '<=' range_op   # op_le
                       range_op '>=' range_op   # op_ge
                       range_op '<=>' range_op  # op_cmp
range_op          ::=  or_op
                       or_op '..' or_op      # op_range_ii
                       or_op '..<' or_op     # op_range_ie
                       or_op '<..' or_op     # op_range_ei
                       or_op '<..<' or_op    # op_range_ee
or_op             ::=  and_op
                       or_op '|' and_op         # op_or
                       or_op '^' and_op         # op_xor
and_op            ::=  shift_op
                       and_op '&' shift_op     # op_and
shift_op          ::=  add_op
                       shift_op '<<' add_op  # op_shl
                       shift_op '>>' add_op  # op_shr
add_op            ::=  multiply_op
                       add_op '+' multiply_op  # op_add
                       add_op '-' multiply_op  # op_sub
multiply_op       ::=  power_op
                       multiply_op '*' power_op   # op_mul
                       multiply_op '/' power_op   # op_div
                       multiply_op '//' power_op  # op_intdiv
                       multiply_op '%' power_op   # op_rem
power_op          ::=  unary_op
                       unary_op '**' power_op  # op_pow
unary_op          ::=  primary
                       '-' unary_op   # op_minus
                       '!' unary_op   # op_lognot
                       '~' unary_op   # op_not
primary           ::=  num
                       str
                       context_env
                       context_recv
                       context_args
                       context_arg
                       paren
                       list
                       local_fun
                       dotted_fun
                       local_deref
                       attr_deref
                       local_ref
                       attr_ref
                       local_call
                       attr_call
num               ::=  INTEGER
                       DECIMAL
str               ::=  STRING
context_env       ::=  '\' VERB('env')
context_recv      ::=  '\' VERB('recv')
context_args      ::=  '\' VERB('args')
context_arg       ::=  '\' INTEGER
paren             ::=  OPENPAREN chunk ')'
                       NL_OPENPAREN chunk ')'
                       WS_OPENPAREN chunk ')'
list              ::=  OPENBRACKET list_body ']'
                       WS_OPENBRACKET list_body ']'
                       NL_OPENBRACKET list_body ']'
local_fun         ::=  OPENBRACE fun_body '}'
                       WS_NL_OPENBRACE fun_body '}'
local_deref       ::=  NOUN | '$' VERB
local_ref         ::=  ':' NOUN | ':' VERB
local_call        ::=  call
dotted_fun        ::=  primary '.' OPENBRACE fun_body '}'
                       primary '.' WS_NL_OPENBRACE fun_body '}'
attr_deref        ::=  primary '.' NOUN
                       primary '$$' VERB
attr_ref          ::=  primary '::' NOUN
                       primary '::' VERB
attr_call         ::=  primary '.' call
call              ::=  VERB recv args
recv              ::=  empty
                       OPENBRACKET expression ']'
args              ::=  paren_args fun_args
paren_args        ::=  empty
                       OPENPAREN list_body ')'
fun_args          ::=  empty
                       fun_arg fun_args
fun_arg           ::=  OPENBRACE fun_body '}'
list_body         ::=  empty
                       elements_producer list_body
elements_producer ::=  expression
                       '[|' expression '|]'
fun_body          ::=  formal_receiver formal_args chunk
                       formal_receiver formal_args '->' chunk
formal_receiver   ::=  empty
                       OPENBRACKET expression ']'
                       WS_OPENBRACKET expression ']'
formal_args       ::=  empty
                       OPENPAREN list_body ')'
                       WS_OPENPAREN list_body ')'
empty             ::= 
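The shape of the rules above determines precedence and associativity. The Python sketch below is a recursive-descent parser for a small subset of the rules (add_op, multiply_op, power_op, unary_op with only op_minus, and a primary that is either an INTEGER or a parenthesized single expression). It is an illustration, not the actual parser: chunks, whitespace-sensitive terminals and all other rules are left out, and the function names are hypothetical.

```python
import re

# Assumed token set for the subset: operators, parentheses, base10 integers.
TOKEN = re.compile(r"\s*(\*\*|//|[-+*/%()]|[0-9]+)")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse(text):
    """Parse text into nested tuples named after the operators in the rules."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def add_op():  # add_op ::= multiply_op | add_op ('+' | '-') multiply_op
        names = {"+": "op_add", "-": "op_sub"}
        left = multiply_op()
        while peek() in names:
            name = names[take()]
            left = (name, left, multiply_op())  # left-associative
        return left

    def multiply_op():  # multiply_op ::= power_op | multiply_op OP power_op
        names = {"*": "op_mul", "/": "op_div", "//": "op_intdiv", "%": "op_rem"}
        left = power_op()
        while peek() in names:
            name = names[take()]
            left = (name, left, power_op())  # left-associative
        return left

    def power_op():  # power_op ::= unary_op | unary_op '**' power_op
        left = unary_op()
        if peek() == "**":
            take()
            return ("op_pow", left, power_op())  # right-associative
        return left

    def unary_op():  # unary_op ::= primary | '-' unary_op
        if peek() == "-":
            take()
            return ("op_minus", unary_op())
        return primary()

    def primary():  # simplified: INTEGER, or one expression in parentheses
        token = take()
        if token == "(":
            inner = add_op()
            if take() != ")":
                raise ValueError("expected )")
            return inner
        if token.isdigit():
            return int(token)
        raise ValueError(f"unexpected token: {token!r}")

    tree = add_op()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return tree

print(parse("1 - 2 - 3"))    # ('op_sub', ('op_sub', 1, 2), 3)
print(parse("2 ** 3 ** 2"))  # ('op_pow', 2, ('op_pow', 3, 2))
print(parse("-2 ** 2"))      # ('op_pow', ('op_minus', 2), 2)
```

The last line follows from unary_op sitting below power_op in the rules: the operand of ** is a unary_op, so the minus sign binds to 2 before the power is taken.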

The following symbols shown in the production rules are terminal symbols.

Terminal symbol  Description
VERB             Verb symbol such as print, including env, recv and args
VERB('env')      Verb symbol env
VERB('recv')     Verb symbol recv
VERB('args')     Verb symbol args
NOUN             Noun symbol such as COUNT
INTEGER          Integer literal such as 42
DECIMAL          Decimal literal such as 3.14159
STRING           String literal such as 'Peng!'
OPENPAREN        Opening parenthesis ( just after the previous token, without any whitespace or line feed characters
WS_OPENPAREN     Opening parenthesis ( with at least one whitespace character and no line feed characters after the previous token
NL_OPENPAREN     Opening parenthesis ( with at least one line feed character and optional whitespace characters after the previous token
OPENBRACKET      Opening bracket [ just after the previous token, without any whitespace or line feed characters
WS_OPENBRACKET   Opening bracket [ with at least one whitespace character and no line feed characters after the previous token
NL_OPENBRACKET   Opening bracket [ with at least one line feed character and optional whitespace characters after the previous token
OPENBRACE        Opening brace { just after the previous token, without any whitespace or line feed characters
WS_NL_OPENBRACE  Opening brace { with at least one whitespace or line feed character after the previous token
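How an opening (, [ or { maps to a terminal symbol can be summarized with a short Python sketch. It takes the text found between the previous token and the opening character and assumes that text contains only whitespace and line feed characters (comments are not modeled). The function name opening_terminal is hypothetical.

```python
NAMES = {"(": "PAREN", "[": "BRACKET", "{": "BRACE"}

def opening_terminal(gap, opener):
    """Return the terminal symbol for opener, given the gap before it."""
    if any(ch not in " \t\r\n" for ch in gap):
        raise ValueError("gap must contain only whitespace or line feeds")
    name = NAMES[opener]
    if gap == "":
        return "OPEN" + name          # directly after the previous token
    if opener == "{":
        return "WS_NL_OPENBRACE"      # braces do not distinguish WS from NL
    if "\n" in gap:
        return "NL_OPEN" + name       # at least one line feed in the gap
    return "WS_OPEN" + name           # whitespace only, no line feed

print(opening_terminal("", "("))       # OPENPAREN
print(opening_terminal(" ", "["))      # WS_OPENBRACKET
print(opening_terminal(" \n  ", "("))  # NL_OPENPAREN
print(opening_terminal("\n", "{"))     # WS_NL_OPENBRACE
```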