To learn about typecheckers, I decided to make a small typed calculator. I figured this could be useful to other folks trying to learn about implementing typed languages.
High-level requirements ๐
- Support integer, float and boolean datatypes and common operations on them
- Support typechecking of expressions
- Support variables
- Take multiple expressions, one per line, and evaluate each in order:
- Report type errors if any
- Print evaluation result if no errors
- An expression can use variables set in previous expressions
Example program ๐
|
|
Running the above should print
|
|
Implementation design ๐
The calculator is structured like a very typical amateur interpreter project:
The scanner first splits the input program into a list of tokens. For our program, this
step produces the tokens [Ident(a), Sym(=), Int(10), Sym(;), Ident(b), Sym(=), Int(42), ...]
. Note how no meaning is ascribed to groups of symbols yet, this
is all about grouping characters into meaningful symbols.
Next, the parser groups tokens together into expressions, and emits a list of
parse trees, one per expression. Note how our expressions are delimited by the
;
symbol, and just splitting the input text on the semicolon would also be
kind of a grouping of expressions. But our parser produces a richer, tree
structure that encodes precedence. In each expression tree node, any child
expressions need to be evaluated before evaluating the node itself. For example,
the last expression in our program, a + b < 100;
is parsed into
BinaryExpr(Sym(<),
BinaryExpr(Sym(+),
IdentExpr(Ident(a); attrs={}),
IdentExpr(Ident(b); attrs={}); attrs={}),
Literal(Int(100); attrs={}); attrs={})
Note how the +
operation is a child of the <
operation, indicating that +
should be evaluated first (i.e., binary +
has a higher precedence than binary
<
in our language).
Next, the typechecker attaches type information to each expression. That’s what
will go in the attrs
attached to each parse tree node. For instance, this is how the above tree for a + b < 100;
looks when type-annotated:
BinaryExpr(Sym(<),
BinaryExpr(Sym(+),
IdentExpr(Ident(a); attrs={'type': <type:Int>}),
IdentExpr(Ident(b); attrs={'type': <type:Int>}); attrs={'type': <type:Int>}),
Literal(Int(100); attrs={'type': <type:Int>}); attrs={'type': <type:Bool>})
See how each sub-expression is now annotated with its deduced type, and so is the top-level expression.
Finally, the interpreter walks the list of expression trees, evaluating each in post-order.
At any point, any of the components can have a fatal error that causes the program to halt with an exception.
Grammar ๐
To implement the scanner and parser components, it is useful to write down the the grammar of our little expression language. Here’s the grammar in PEG syntax:
TyCalc {
Prog = Stmt*
Stmt = Expr ";"
Expr = Assignment
Assignment = AssignmentAux | LogicalOr
AssignmentAux = ident "=" Assignment
LogicalOr = LogicalAnd ("||" LogicalAnd)*
LogicalAnd = Equality ("&&" Equality)*
Equality = Comparison (("!=" | "==") Comparison)*
Comparison = Term (("<=" | "<" | ">" | ">=") Term)*
Term = Factor (("+" | "-") Factor)*
Factor = Unary (("*" | "/" | "%") Unary)*
Unary = ("+" | "-")? Exponent
Exponent = ExponentExpr | Primary
ExponentExpr = Exponent ("**" Exponent)*
Primary = ParenExpr | ident | numlit | boollit
ParenExpr = "(" Expr ")"
ident = letter alnum*
numlit = fractional | integral
fractional = digit* "." digit+
integral = digit+
boollit = "true" | "false"
}
My favorite tool to play with a grammar is the online OhmJS editor. Paste this grammar, add examples and see the editor explain the parses.
Nothing fancy, most of the complexity in the grammar is to properly encode operator precedence, and some quirks of the Ohm PEG notation. Here’s some examples:
- Variable assignment
1 2
a = 1 b = true
- Arithmetic
1
a + 12.3 ** 6
- Logic
1
b && true
- Equality
1
b && a < 10
- Parenthesized expressions
1
(2 + a) * 4
Parsing ๐
Once the grammar is written down, you can either generate a parser for it using OhmJS or ANTLR, or roll your own recursive-descent parser. I rolled my own parser, and if you are new to this, I highly recommend Crafting Interpreters.
You can print the parse tree of any tycalc program by invoking parsing.py as follows:
|
|
Typechecking ๐
Typechecking is implemented in typechecking.py.
The goal of the typechecking phase is to assign a type to each expression in the program, producing an error if this is not possible. Modern languages have sophisticated type inference, and it is not always an “error” if the typechecker cannot figure out the type of an expression – it just might need more type annotations to eliminate ambiguity. Our language is, however, very small, and the typechecker is also quite simple. Here are the rules for type assignment:
-
Literals get their types trivially:
true
is always aBool
1234
is anInt
12.3
,.4
,99.
are allFloat
s
-
Each variable assignments sets the type of the variable to the assignment’s right hand side, and this is the type of the variable until it is reassigned later.
-
Arithmetic operations (
+
,-
,*
,/
) expect numeric arguments (eitherInt
orFloat
). If either of the arguments is aFloat
, the result is also aFloat
, else it is anInt
. -
The integer division remainder (mod) operation (
%
) expectsInt
arguments. -
The operators (
<=
,>=
,<
,>
) expect numeric arguments and produce aBool
. -
The operators
==
and!=
expect either two numeric arguments, or twoBool
arguments, and produce aBool
. -
Logic operators (
&&
,||
) take twoBool
s and produce aBool
.
We can model type assignment of an expression as a recursive problem of assigning types to the child expressions in the expression’s AST, and then calculating the resulting type using the rules above.
Let’s look at an example:
a = 42;
b = 10.;
a + b < 100;
Here’s the result of running our parser on it.
|
|
The empty attrs
associated with each subexpression will eventually contain the
type. Let’s work out the type deduction by hand.
For the first expression assigns the integer literal 42
to a
. The result of this assignment expression is twofold:
- The binary assignment expression itself is typed as
Int
, - The typechecker remembers that subsequent references to the variable
a
areInt
-typed, untila
is redefined.
Similarly, the next expression assigns the Float
literal 10.
to the variable
b
, so the expression has type Float
, and the typechecker remembers that b
has type Float
.
Now let’s zoom into the last expression:
|
|
We know that the type of a valid <
operation is always a boolean, and the
operation is valid if both its arguments are numbers. Therefore, we need to
first find the types of subexpr 1
and subexpr 2
.
-
To find the type of subexpr 1, we need the types of subexprs 1.1 and 1.2
- subexpr 1.1 is a variable reference to
a
, anda
is previously defined as anInt
ina = 42;
. - subexpr 1.2 is a variable reference to
b
, andb
is previously defined as aFloat
ina = 10.;
- Therefore, subexpr 1 is typed as a
Float
by rule (3) above.
- subexpr 1.1 is a variable reference to
-
subexpr 2 is easy, it is an
Int
literal, which always has the typeInt
. -
Therefore, the arguments to our
<
operation have typesFloat
andInt
. This is a valid operation, and the result has the typeBool
.
Here’s the result of running the typechecker alone on our program above:
|
|
Notice how each subexpression now has types, the same ones we worked out manually.
Now let’s look at an example where the typechecking fails:
|
|
So here, we attempted to add an Int
and a Bool
, and the typechcker produced
the “type” of the final expression as a special type, which actually represents
a typecheck error. This strategy, of progressively typing as many expressions
as possible without breaking the program, allows catching multiple errors at
once.
Evaluation ๐
Evaluation is implemented in interp.py.
The evaluation is very simple, it just visits the type-annotated expression trees, remembering values of variables, and evaluates each expression by walking its tree. In fact, for this simple calculator, we did not need a separate typechecking pass, and could have bundled typechecking along with evaluation. But in a more interesting language, it is almost always a good idea to keep typechecking and evaluation as different passes. This will make your life easier as you add more features to the language.