tisdag 12 november 2013

Parsing Expressions (Part 2)

In this part I will cover ASTs. In later parts I will add error reporting. I also noticed a mistake in one of the code examples from the previous part - specifically that the left parentheses were tokenized as unary, causing trouble later, that I overlooked. Also, like an idiot, I didn't actually push in the last number as a token. This is fixed in the upload to github and in the examples.


Abstract Syntax Trees

ASTs are used to represent syntactic structure in tree form. Consider the expression 1+2. The way we represent this in tree form is to say that the plus sign is the parent of the 1 and the 2, this will look like this:

   +  
  / \  
 1   2  

Of course, the advantage of building trees like this isn't apparent until we build trees from more complex expressions. Let's consider this: 1*(2+3)/4. The AST for that will look like this:

       _(/)_  
      /     \  
   _(*)_     4  
  /     \  
 1     _(+)_  
      /     \  
     2       3  
The question is then, what does that mean? One way to think about it, and in fact this is the way we will think about it, is to say that every node in the tree 'executes' the operator in that node using the children at either sides as operands. If the node is a leaf then the node is a number (in fact that must be the case). So leaf nodes are the exception, as they merely return their value (instead of doing an operation on its children nodes). Unary operators is simply an operator that only has one child node. We can build entire code segments this way; in fact a compiler could represent the entire program it's compiling as an AST. If you're using clang there is a command that makes the compiler pretty print the AST for the code you feed it (-ast-print). The upside to using ASTs is that they are very efficient to analyze. For example, let's say we allow variables in our program, and we enter the same expression as before with the addition of replacing one of the numbers with a variable: x*(2+3)/4. If we build an AST from this it becomes very apparent that we can prune one of the nodes but just performing 2+3 before we are told to evaluate the expression. This:
       _(/)_  
      /     \  
   _(*)_     4  
  /     \  
 x     _(+)_  
      /     \  
     2       3  
becomes this:
       _(/)_  
      /     \  
   _(*)_     4  
  /     \  
 x       5
We save ourselves one evaluation. This is a very elementary way of optimizing, but it shows the strength of the AST contra the stack-based RPN we built earlier. Note that it is possible to do the same optimization from the RPN, but it's a lot easier and straight-forward to do it here. Another advantage is that we can parse our expression over multiple passes if we wanted to: one of the nodes could simply be part of the expression flat out. Herb Sutter in his reveal of the VC++-compiler claimed that the VC++ compiler never used an AST and this made parsing some details of the new language either hard or impossible (if I understand correctly).

Operands and Operators

Before we start, we need to build ourselves a convenient way of representing operators and operands in the tree. As I hinted in the previous post we can do better than using a switch statement to do the operations. We are going to use polymorphism for this, so we build up this a class hierarchy where we call our base class Node. A Node can be evaluated. To make things a little easier on us we will define a somewhat deep hierarchy consisting of classes: UnaryOperator, BinaryOperator and Number. The code for the classes is below, I wrote everything in one monolithic header-file:

 #include <memory>  
 #include <algorithm>  
 #include "Token.h"  
   
 class Node  
 {  
 public:  
     virtual ~Node() {};  
   
     virtual int evaluate() const = 0;  
     virtual int getDepth(int current) const = 0;  
     virtual std::string toString() const = 0;  
   
     virtual std::shared_ptr<Node> getLeft() const { return nullptr; };  
     virtual std::shared_ptr<Node> getRight() const { return nullptr; };  
   
 };  
   
 class BinaryOperator : public Node  
 {  
 public:  
     BinaryOperator(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
         :left(left_), right(right_) {}  
   
     int getDepth(int current) const  
     {  
         return std::max(left->getDepth(current + 1), right->getDepth(current + 1));  
     }  
   
     std::shared_ptr<Node> getLeft() const { return left; };  
     std::shared_ptr<Node> getRight() const { return right; };  
   
 protected:  
     const std::shared_ptr<Node> left;  
     const std::shared_ptr<Node> right;  
 };  
   
 class UnaryOperator : public Node  
 {  
 public:  
     UnaryOperator(std::shared_ptr<Node> left_)  
         :left(left_) {}  
   
     int getDepth(int current) const  
     {  
         return left->getDepth(current + 1);  
     }  
   
     std::shared_ptr<Node> getLeft() const { return left; };  
   
 protected:  
     const std::shared_ptr<Node> left;  
 };  
   
 class Number : public Node  
 {  
 public:  
     Number(int value_)  
         :value(value_) {}  
   
     int getDepth(int current) const  
     {  
         return current + 1;  
     }  
       
     int evaluate() const  
     {  
         return value;  
     }  
        
     std::string toString() const  
     {  
         return std::to_string(value);  
     }  
   
 private:  
     const int value;  
 };  
   
 class Pos : public UnaryOperator  
 {  
 public:  
     Pos(std::shared_ptr<Node> left_)  
     :UnaryOperator(left_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "+";  
     }  
 };  
   
 class Neg : public UnaryOperator  
 {  
 public:  
     Neg(std::shared_ptr<Node> left_)  
         :UnaryOperator(left_)  
     {}  
   
     int evaluate() const  
     {  
         return -left->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "-";  
     }  
 };  
   
 class Add : public BinaryOperator  
 {  
 public:  
     Add(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
         :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() + right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "+";  
     }  
 };  
   
 class Sub : public BinaryOperator  
 {  
 public:  
     Sub(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
     :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() - right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "-";  
     }  
 };  
   
 class Mul : public BinaryOperator  
 {  
 public:  
     Mul(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
     :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() * right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "*";  
     }  
 };  
   
 class Div : public BinaryOperator  
 {  
 public:  
     Div(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
     :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() / right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "/";  
     }  
 };  
   

All interface functions except for evaluate are for pretty printing the tree, which I will explain later. Other than that it should be pretty straight forward. The concept is very common and very useful. It's explained extremely well here if you have problem understand it. Note that even though unary operators only have one node we still call it the left node and to the world it will look like its leaf node. AST are not binary trees so we are not required to do this. I chose to do it this way for convenience in this application. Since we never have more than 2 children per node, we can view our tree as a binary tree, and so we can use that terminology. In an AST derived from code in a programming language a node can have any number of children, depending on what token the node represents.

Growing a tree

After we've decided on a way to represent our tokens in the AST, we will begin to create it. We can use the tokens we get after the parsing (using the method found in the previous part) to build an AST using the following algorithm:

  • For every token in the queue of tokens:
    • Push operands on a stack as leaf-nodes.
    • For operators, pop the necessary operands off the stack, create a node with the operator at the top, and the operands hanging below it, push the new node onto the stack
The queue of tokens will simply be the list of tokens from the parser, and the stack is simply a vector of Node-pointers (we will use shared_ptr for simplicity). We also need function to convert tokens to nodes. If we wanted our code it be easily extensible we could use a factory similar to what I described in this post, but using a Token as the key instead of string. In fact this is essentially what I did in my old implementation of SML. In this case we will opt for what is faster to implement: switch statements. The full code for creating ASTs can be found here:

 #include <vector>  
 #include <memory>  
   
 #include "Node.h"  
 #include "Token.h"  
 #include <exception>  
   
 std::shared_ptr<Node> pop(std::vector<std::shared_ptr<Node>>& stack)  
 {  
     auto ret = stack.back();  
     stack.pop_back();  
     return ret;  
 }  
   
 std::shared_ptr<Node> makeNewBinary(Token token, std::shared_ptr<Node> left, std::shared_ptr<Node> right)  
 {  
     switch (token.token.op)  
     {  
     case (Token::Add) :  
         return std::make_shared<Add>(left, right);  
     case (Token::Sub) :  
         return std::make_shared<Sub>(left, right);  
     case (Token::Mul) :  
         return std::make_shared<Mul>(left, right);  
     case (Token::Div) :  
         return std::make_shared<Div>(left, right);  
     }  
     return nullptr; //should never ever ever happen.  
 }  
   
 std::shared_ptr<Node> makeNewUnary(Token token, std::shared_ptr<Node> left)  
 {  
     switch (token.token.op)  
     {  
     case (Token::Pos) :  
         return std::make_shared<Pos>(left);  
     case (Token::Neg) :  
         return std::make_shared<Neg>(left);  
     }  
     return nullptr; //should never ever ever happen.  
 }  
   
 std::shared_ptr<Node> makeTree(const std::vector<Token>& queue)  
 {  
     std::vector<std::shared_ptr<Node>> stack;  
   
     for (auto token : queue)  
     {  
         if (token.type == Token::Type::Operator)  
         {  
             bool unary = (token.token.op == Token::OpType::Pos ||  
                 token.token.op == Token::OpType::Neg);  
             if (unary)  
             {  
                 auto left = pop(stack);  
                 auto node = makeNewUnary(token, left);  
                 if (node)  
                     stack.push_back(node);  
                 else  
                 {  
                     stack.clear();  
                     break;  
                 }  
             }  
             else  
             {  
                 auto right = pop(stack);  
                 auto left = pop(stack);  
                 auto node = makeNewBinary(token, left, right);  
                 if (node)  
                     stack.push_back(node);  
                 else  
                 {  
                     stack.clear();  
                     break;  
                 }  
             }  
         }  
         else  
         {  
             stack.push_back(std::make_shared<Number>(token.token.value));  
         }  
     }  
     return stack.back();  
 }  
  

The AST then is simply the pointer to the root of the tree. One reason for using an AST is that it makes analyzing easier, so we might as well offer a way to analyze the tree. To do this we will simply pretty print the tree using breadth first search. The way we do this is to start at the root node, print it, then push all of its children on a vector, print the vector and clear it, push of all the new nodes' children, print that and clear the vector, etcetc. Because we use a queue, we have to reverse the order in which I print the nodes, but that's fine. Because division is the same character as I would normally use for edges, I decided to use the pipe ('|') for edges instead. I think the output is as clear as you would expect from a console program though. It does limit how deep the expression can be (essentially how many nested parenthesis we can have), but the code can be expanded to print deeper trees. The code is below, and will be in the same file as the code above:

 std::string prettyNodes(std::vector<std::string> row, int spacing)  
 {  
     std::string ret;  
   
     std::string fill(spacing / 2, ' ');  
     ret += fill;  
     for (auto riter = row.rbegin(); riter != row.rend(); ++riter)  
     {  
         std::string fill(spacing - riter->size(), ' ');  
         ret += *riter + fill;  
     }  
     return std::move(ret);  
 }  
   
 std::string prettyEdges(std::vector<std::string> row, int spacing)  
 {  
     std::string ret;  
     std::string fill(spacing / 2, ' ');  
     ret += fill;  
     for (auto riter = row.rbegin(); riter != row.rend(); ++riter)  
     {  
         std::string fill(spacing - riter->size() , ' ');  
         ret += *riter + fill;  
     }  
     return std::move(ret);  
 }  
   
   
 std::string pretty(std::shared_ptr<Node> node)  
 {  
     int depth = node->getDepth(0);  
     int level = depth;  
     std::string ret;  
   
     std::vector<std::shared_ptr<Node>> currentLevel, nextLevel;  
   
     currentLevel.push_back(node);  
     std::vector<std::string> nodes;  
     std::vector<std::string> edges;  
     while (!currentLevel.empty()) {  
         auto currNode = currentLevel.front();  
         bool isOccupied = false;  
         if (currNode == nullptr)  
         {  
             nodes.push_back("");  
         }  
         else  
             nodes.push_back(currNode->toString());  
   
         currentLevel.erase(currentLevel.begin());  
         if (currNode) {  
   
             if (currNode->getRight())  
                 edges.push_back("|");  
             else  
                 edges.push_back("");  
             if (currNode->getLeft())  
                 edges.push_back("|");  
             else  
                 edges.push_back("");  
               
   
             nextLevel.push_back(currNode->getRight());  
             nextLevel.push_back(currNode->getLeft());  
   
             isOccupied = true;  
         }  
         else  
         {  
             edges.push_back("");  
             edges.push_back("");  
   
             nextLevel.push_back(nullptr);  
             nextLevel.push_back(nullptr);  
         }  
         if (currentLevel.empty()) {  
   
             ret += prettyNodes(nodes, std::pow(2, level)) + "\n";  
             ret += prettyEdges(edges, std::pow(2, (level - 1))) + "\n";  
               
             nodes.clear();  
             edges.clear();  
             swap(currentLevel, nextLevel);  
             --level;  
             if (level <= 0)  
                 currentLevel.clear();  
         }  
     }  
     return std::move(ret);  
 }  

We can try some inputs and see what happens:
 #include "Tokenization.h"  
 #include "Parsing.h"  
 #include <iostream>  
 #include "AST.h"  
   
 int main()  
 {  
     auto AST = [](const std::string& expr){return makeTree(parse(Tokenize(expr))); };  
   
     std::cout << pretty(AST("1*(2+3)/4")) << "\n";  
     std::cout << pretty(AST("(1+1)*(33-12 * (2 - 1))")) << "\n";  
   
     std::cout << "yey!";   
 }  
Output:
     /          
   |    |      
   *    4      
  |  |        
  1  +        
    | |       
    2 3       
           
     *          
   |    |      
   *    -      
  |  |  |  |    
  +  -  2  1    
 | | | |       
 1 1 3312      
           
 yey!  

This is what we expected. The second tree becomes sort of unclear, but it's not worse than we expected.

Next we can try some inputs of expression and evaluate them to see what if the results are correct. The way we do this is to run an expression through the tokenizer, run the resulting list of tokens from that through our parser, and finally make a tree out of the results from the parser. This is what the AST-lambda in the code below does. We can then call the evaluate function on the root node of the tree (which is return from the makeTree-function):
 int main()  
 {  

     auto AST = [](const std::string& expr){return makeTree(parse(Tokenize(expr))); };  
   
     std::cout << AST("9/(3*(2-1))")->evaluate() << "\n";  
     std::cout << AST("(1+1)*(33-12 * (2 - 1))")->evaluate() << "\n";  
   
     std::cout << "yey!";   
 }  
   
Result:
 3  
 42  
Which is correct. This concludes this part. The code is up on github and can be viewed here.

Further Reading

torsdag 24 oktober 2013

Parsing Expressions (Part 1)

How does a program evaluate and calculate an expression such as: 2*3+1 or (3*4)/5^2 etc? In this post I will, attempt to, explain some of the ways this can be done; there are several ways of doing it. I will go through three main topics: tokenization, parsing, and evaluation. Tokenization is fairly easy to do when it's just for mathematical expressions. Parsing can be done in a few ways, but I will go through the shunting-yard algorithm, first described by Dijkstra. As for evaluation I will explain two ways: Reverse Polish Notation (RPN), and the Abstract Syntax Tree (AST). RPN is easier to deal with and understand, while ASTs are stronger and can do more. In this part we will look into RPN.


Tokenization

A token, essentially, is any number of characters from a text (or more abstractly, raw data) that make up a logical unit. For example, in the sentence "The answer is 42", there can be said to be 4 tokens: "The", "Answer", "is", "42". The order of the tokes matter. I used the words "can be said to" for an important reason, it doesn't have to be 4 tokens. Whitespaces can be included, every character may be its own token, the text and the numerals may make up only two tokens ("The answer is" and "42"). It all depends on what it is we want to do with it. (The above description is a mild simplification, in truth the word token usually refers to the type of the token, while the word lexeme is used for the value of the token. For this application we have no reason to differentiate.)

In the case of a mathematical expressions what is a token should be quite clear: the different numbers and operators make up the tokens. For example, the expression "(4 * 10) + 2" are made up by the tokens "(", "4", "*", "10", ")", "+" and "2". Note that we don't care about whitespaces here.

A simple Tokenizer

So it looks like what we want to do is go through the string, one character at a time. If we find a number (for example), we read the number, make a token out of it, and go on to the next token. To help us out we have some information and some rules.

In a well formatted expression:

  • A number is always followed by a binary operator or a right parenthesis.
  • A binary operator is always followed by either a number, a unary operator or a left parenthesis.
  • A unary operator is always followed by a number, a left parenthesis (or another unary operator).

In case you don't already know, a binary operator is an operator that takes two arguments, one on either side (such as +, -, * and /), while a unary number only takes one argument, on its left hand side (negation: -). By the way, I won't claim this list to be exhaustive, it's not, but it should give you an idea that we can form these rules and they will help us tokenize. These rules, or rules similar to them, can also be expressed using formal grammar. Indeed this will be a lot more helpful, but it goes beyond the scope of this post.

Our tokenizer will use these rules lazily, that is to say that it will only use them in order to determine what operator an ambigious token is (differentiating between a negation and subtraction). If the input is wrong, then so be it. At this point, we don't care; keep it simple. We know that if we just read a number, the operator must be a binary operator, whereas if we just read an operator, the operator must be a unary operator. A tokenizer isn't necessarily involved in whether the input is faulty, this could've either been handled earlier, or can be handled after; we're just interested in reading tokens.

The algoritm we will use will be very simple. Go through the string character by character. If we find a numeral, add this to a current number-token. If we find an operator add our current number-token, if any, to the list of tokens, then add the operator to the list of tokens, and keep going. If we find a number, expect a binary operator; in any other case, expect a unary operator. Our expectation doesn't mean that we really expect it to come, just if we do find an operator, it will be either a binary operator or a unary operator. We will, for now, represent tokens as strings, for unary operators we will add a 'U' to the string. The following function takes a string as input and tokenizes it, printing out the tokens:

 #include <string>  
 #include <vector>  
 #include <set>  
 #include <iostream>  
   
 void Tokenize(const std::string& str)  
 {  
     bool nextUnary = true;  
     std::vector<std::string> tokens;  
     std::string currentNumber;  
     std::set<char> operators;  
     for (auto op : "+-/*()")  
     {  
         operators.insert(op);  
     }  
     for (auto chr: str)  
     {  
         if (chr <= ' ')  
             continue;  
         if (chr >= '0' && chr <= '9')  
         {  
             currentNumber.push_back(chr);  
             nextUnary = false;  
         }  
         else if (operators.count(chr) == 1)  
         {  
             std::string token;  
             if (!currentNumber.empty())  
                 tokens.push_back(currentNumber);  
             if (nextUnary)  
                 token.push_back('U');  
             token.push_back(chr);  
             tokens.push_back(token);  
             currentNumber.clear();  
             if (chr != '(' && chr != ')')
    nextUnary = true;
         }  
     }  
   
     std::cout << "The expression : \'" << str << "\' was tokenized to:\n";  
     if (tokens.size() == 0)  
         std::cout << "\'\'\n :(";  
     else  
     {  
         std::cout << "\'" << tokens.front() << "\' ";  
         if (tokens.size() != 1)  
         {  
             for (auto iter = tokens.begin() + 1; iter != tokens.end(); ++iter)  
             {  
                 std::cout << ", \'" << *iter << "\'";  
             }  
         }  
     }  
 }  
 
 int main()   
  {   
    Tokenize("1*(2+3)"); // Wellformatted input   
    Tokenize("1122 313 ++---"); //Gee wiz.   
  }   
Output:
 The expression: '1*(2+3)' was tokenized to:   
  '1', '*', 'U(', '2', '+', '3', ')'   
     
  The expression: '1122 313 ++---' was tokenized to:   
  '1122313', '+', 'U+', 'U-', 'U-', 'U-'   

The reader should note that this code really relies on the user providing valid input. For example the *-operator can be treated as unary (indeed the left parenthesis IS treated as unary following the first input). We will have to fix that later, but for now this will suffice, it's a very simple solution that works when it works. Also note that the program didn't crash when provided with bad input, it just output faulty output. Again this is not really a problem, this will be caught elsewhere, if it ever entered into tokenization in the first place. In the case of this post though, we will simply not care about it.

Representing Tokens

In reality, tokens can be represented in any way at all. You would want them to be as compact as possible though, and only carry just the necessary information. One way of doing it would be using a Union:

 struct Token  
 {  
     enum Type { Operator, Number };  
     Type type;  
     enum OpType { Add, Sub, Pos, Neg, Mul, Div, LParen, RParen };  
     union  
     {  
         int value;  
         OpType op;  
     } token;  
 };  
As long as the program can interpret the representation, it doesn't matter. We would have to look into the naming in that struct, but that would have to do for now.


Parsing

While the formal definition is slightly different, to us parsing will essentially mean taking tokens and making sense of them, somehow. The way I look at it parsing is taking some input, analyze it and turn it into a format that we can use. What we are going to do is take our list of tokens and turn them into a format that we can then use to evaluate the value of the original expression.

Shunting-Yard

We are going to use the Shunting-Yard Algorithm for this. Here's the algorithm as described here, using an operator stack and an operand stack reading tokens from left to right:

  • If we see an operand, push it on the operand stack.
  • If we see an operator:
    • While there’s an operator on top of the operator stack of precedence higher than or equal to that of the operator we’re currently processing, pop it off and apply it. (That is, pop the required operand(s) off the stack, apply the operator to them, and push the result back on the operand stack.)
    • Then, push the current operator on the operator stack.
  • When we get to the end of the formula, apply any operators remaining on the stack, from the top down. Then the result is the only item left on the operand stack (assuming well-formed input).
Later in that post this is said about parentheses:
  • Parens are a bit of a special case. When you see a left paren, push it on the operator stack; no other operators can pop a paren (so it’s as if it has the lowest precedence). Then when you see a right paren, pop-and-apply any operators on the stack until you get back to a left paren, which is popped and discarded.
There is also a special case for unary operators, these are said to be right-to-left associative, while our other operators are left-to-right associative. The difference in handling them is that they only pop operators that are equal in precedence when they are encountered in the operator stack.

This image, from the wikipedia article, gives an overview and an example:

Fair enough. All we have to do is change our tokenizer to return the list tokens, then we can write the algorithm. When it comes to returning tokens from the tokenizer we have to choices: return the tokens as raw text the way we already store them in a vector, or turn them into tokens, as defined in the union structure described earlier. Either way is fine, but keeping with out philosophy that the tokenizer does no error checking, return them as strings, were a string of numerals is an operand token, an operator character is a binary operator, and an operator character preceded by a U is a unary operator. So before we shunt the tokens through the yard, so to say, we should translate them into the more precise representation format (which is more compact). We also need to define the precedence of our operators, luckily this is very well defined, both in mathematics and in every programming language ever, so we can just use someone else's list of precedence.

The algorithm as explained above also assumes that you want to evaluate the expression while you parse it. We don't want that. The algorithm needs to change so that instead of 'applying' the operator to the operands, we just push the operator onto the operand stack. Logically, instead of calling it an operand stack we will call it a output queue. (The reason I posted that version of the algorithm was that it was the most compact version I found, but it still applies). The algorithm written in C++-code looks like follows:

 #include <string>  
 #include <vector>  
 #include "Token.h"  
 #include <exception>  
 #include <algorithm>  
 #include <map>  

 std::vector<Token> toTokens(const std::vector<std::string>& tokens)  
 {  
     std::vector<Token> ret; 
 
     for (auto& token : tokens)  
     {  
         if (token.size() == 0)  
             continue;  
         if (token[0] >= '0' && token[0] <= '9')  
         {  
             if (std::find_if(token.begin(), token.end(),  
                 [](char test) {return !(test >= '0' && test <= '9'); })  
                 != token.end())  
                 continue; //not all numerals, abort!  
             Token t;  
             t.type = Token::Type::Number;  
             t.token.value = atoi(token.c_str());  
             ret.push_back(t);  
         }  
         else if (token[0] == 'U')  
         {  
             if (token.size() == 1)  
                 continue;  
             if (token[1] == '-' || token[1] == '+')  
             {  
                 Token t;  
                 t.type = Token::Type::Operator;  
                 t.token.op = Token::OpType::Neg;  
                 ret.push_back(t);  
             }  
         }  
         else  
         {  
             Token t;  
             t.type = Token::Type::Operator;  
             switch (token[0])  
             {  
             case '+':  
                 t.token.op = Token::OpType::Add;  
                 break;  
             case '-':  
                 t.token.op = Token::OpType::Sub;  
                 break;  
             case '*':  
                 t.token.op = Token::OpType::Mul;  
                 break;  
             case '/':  
                 t.token.op = Token::OpType::Div;  
                 break;  
             case '\\':  
                 t.token.op = Token::OpType::Div;  
                 break;  
             case '(':  
                 t.token.op = Token::OpType::LParen;  
                 break;  
             case ')':  
                 t.token.op = Token::OpType::RParen;  
                 break;  
             default:  
                 continue;  
             }  
             ret.push_back(t);  
         }  
     }  

     return ret;  
 }  

 std::vector<Token> parse(const std::vector<std::string>& tokens)  
 {  
     std::vector<Token> input = toTokens(tokens);  
     std::vector<Token> operators;  
     std::vector<Token> output;  
     //:3 C++14:  
     std::map<Token::OpType, int> precedence =   
     {   
         {Token::Add, 4},   
         {Token::Sub, 4},  
         {Token::Pos, 2},  
         {Token::Neg, 2},  
         {Token::Mul, 3},  
         {Token::Div, 3},  
         {Token::LParen, 1},  
         {Token::RParen, 1},  
     };  

     for (auto token : input)  
     {  
         if (token.type == Token::Number)  
         {  
             output.push_back(token);  
         }  
         else if (token.type == Token::Operator)  
         {  
             if (token.token.op == Token::OpType::LParen)  
                 operators.push_back(token);  
             else if (token.token.op == Token::OpType::RParen)  
             {  
                 bool foundOne = false;  
                 while (!operators.empty())  
                 {  
                     if (operators.back().token.op != Token::OpType::LParen)  
                     {  
                         output.push_back(operators.back());  
                         operators.pop_back();  
                     }  
                     else  
                     {  
                         foundOne = true;  
                         operators.pop_back();  
                         break;  
                     }  
                     if (!foundOne);  
                         //this is an indication of bad input.   
                         //Ignore that for now  
                 }  
             }  
             else  
             {  
                 bool unary = (token.token.op == Token::OpType::Pos ||  
                     token.token.op == Token::OpType::Neg);  
                 while (!operators.empty())  
                 {  
                     if (unary && precedence[operators.back().token.op]  
                         == precedence[token.token.op] &&  
                         operators.back().token.op != Token::OpType::LParen)  
                     {  
                         output.push_back(operators.back());  
                         operators.pop_back();  
                     }  
                     if (!unary && precedence[operators.back().token.op]  
                         >= precedence[token.token.op] &&  
                         operators.back().token.op != Token::OpType::LParen)  
                     {  
                         output.push_back(operators.back());  
                         operators.pop_back();  
                     }  
                     else  
                         break;  
                 }  
                 operators.push_back(token);  
             }  
         }  
     }  

     while (!operators.empty())  
     {  
         //finding any Parens here is an indication of bad input, here  
         //we simply opt out of pushing that unto the stack.  
         if (operators.back().token.op != Token::OpType::LParen ||  
             operators.back().token.op != Token::OpType::RParen)  
             output.push_back(operators.back());  
         operators.pop_back();  
     }  

     return output;  
 }  

Note that toTokens() doesn't report errors (neither does parse()). It just lets errors slide silently by. This is for brevity. In serious code we would handle errors something like this: every token passed into the parser is a pair of an int and a string, where the int points to the position in text the token was found. If the parse finds an error, it can report back (either in an exception or through a special return value) not only that an error was found, but also where it was found. One way of doing that would be to have a third entry in the Token::Type enum being Error, and the value in the token union would let us know what the error was. For brevity I will just let errors slide by silently in these example; this is not production code, and either way errors won't cause crashes. However, if we want our code be at all useful we will have to add error reporting eventually, sooner rather than later.


Evaluation

What the algorithm does is to produce a sequence of tokens from infix notation to postfix notation, this is also known as Reverse Polish Notation. We can use this directly to evaluate it, all it takes is some understanding on how RPN works. We can also very easily turn this into a parse tree (an AST), and then use the parse tree to evaluate it. Let's start with the simpler of them.

Reverse Polish Notation

In my youth I had a calculator, given to my by my dad, where the user would use RPN to calculator. It made the calculation of complex expressions very easy, since you didn't have to use the calculator's memory to temporarily save values, you would use use the calculator's built-in stack.

The way the calculator worked is you would push numbers onto a stack, and then when you pressed an operator it would take the appropriate amount of of numbers (commonly two) and perform that operation, popping the values it used away from the stack, and pushing the new value onto the stack. Here's an online version of an RPN calculator you can play around with. For example: say we want to calculate (1+2)*3. What we would do is to push 1 and 2 onto the stack, then press +, and the calculator would pop the number 2 from the stack, and the number 1 from the stack, and perform addition in, pushing number 3 onto the stack. The stack would now consist of the number 3. To multiply that by 3, we would simply push 3 onto the stack, then press multiply, and the calculator would leave a number 9 on the stack. In postfix notation this would be written as:

 1 2 + 3 *  

This is fairly straight-forward to write, the code looks like this:

 #pragma once  
 #include "Token.h"  
 #include <vector>  
 #include <utility>  

 int pop(std::vector<int>& stack)  
 {  
     int ret = stack.back();  
     stack.pop_back();  
     return ret;  
 }  

 std::pair<bool, int> calculate(const std::vector<Token>& queue)  
 {  
     std::vector<int> stack;  
     std::pair<bool, int> ret = { true, 0 };
  
     for (auto token : queue)  
     {  
         if (token.type == Token::Type::Number)  
             stack.push_back(token.token.value);  
         else if (token.type == Token::Type::Operator)  
         {  
             switch (token.token.op)  
             {  
             case (Token::OpType::Add):  
                 if (stack.size() >= 2)  
                     stack.push_back(pop(stack) + pop(stack));  
                 else  
                     ret.first = false;  
                 break;      
             case (Token::OpType::Sub) :  
                 if (stack.size() >= 2)  
                     stack.push_back(-(pop(stack) - pop(stack)));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Pos) :  
                 if (stack.size() >= 1)  
                     stack.push_back(pop(stack));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Neg) :  
                 if (stack.size() >= 1)  
                     stack.push_back(-pop(stack));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Mul) :  
                 if (stack.size() >= 2)  
                     stack.push_back(pop(stack) * pop(stack));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Div) :  
                 if (stack.size() >= 2)  
                 {  
                     int d = pop(stack);  
                     stack.push_back(pop(stack) / d);  
                 }  
                 else  
                     ret.first = false;  
                 break;  
             default:  
                 ret.first = false;  
                 break;  
             }  
         }  
     }  

     if (stack.size() == 1)  
         ret.second = stack.back();  
     else  
         ret.first = false;  

     return ret;  
 }  

Here we have to do something if the input is faulty. What I decided to do (in keeping with the overall laziness in error handling so far) is to return an std::pair<bool, int> where the first value of the pair is false if we encountered errors. Something that is very important to realize is that when we perform operations for which order matters (subtraction) we must also read the stack in the right order, the right hand operand will be the first operand we take from the stack. Another thing to note is how the evaluation of the different operators are handled: using a switch statement. There are other ways of dealing with this, that are more expandable and less error prone. One way would be to use an std::map from an OpType to a lamda accepting appropriate inputs and returning the value of the expression. We could then just define the lamdas and the mappings earlier in the function. Another way is using inheritance and polymorphism, this youtube-video can give you an idea how that would work. We could then even use a factory to create appropriate operator-objects. In fact I use that method in my old SML-implementation. While it was quite easy to work with I can't get away from the fact that it involves a lot of pointer indirections, which might not be appropriate, I would much rather use the lamda-method if I needed something similar. For now, though, there is no reason not to use a switch-statement. We will see later in our discussion on AST's that this might not be the best way to do it.

Either way, this works. Let's try some inputs. We are nice now, and we won't try any faulty inputs, our system isn't built for that anyway. The more observant reader will realize that it's also only built to handle integer values. That'll have to do for now, but expanding the system for floats instead should not be very difficult (a problem might arise when we want to handle both, accurately). Our new main looks like this:

 #include "Tokenization.h"  
 #include "Parsing.h"  
 #include "RPN.h"  
 #include <iostream>  
   
 int main()  
 {  
     auto RPN = [](const std::string& expr)
         { return calculate(parse(Tokenize(expr))).second; };  
       
     std::cout << RPN("3*-(2+1)") << "\n"; //should equal -9  
     std::cout << RPN("9/(3*(2-1))") << "\n"; //should equal 3  
     std::cout << RPN("(1+1)*(33-12)") << "\n"; //should equal 42  
     std::cout << "yey!";   
 }  
... and produces the expected output:
 -9  
 3  
 42  
 yey!  

AST's are a more complicated subject. I suspect, strongly, that the write up on that will be as long as this entire post. Therefore I will leave it for a second part. I will post the code on github.

Further Reading

An introduction to SML

Since the dawn of time I've been (for unknown reasons) unreasonably interested in parsing strings of text. I have no idea why, but for some reason the idea of solving the problem of making a computer make sense of text have always been a pet problem of mine. My first major projects, plural, when I first learned programming (in Visual Basic) involved parsing text in one form or another. One of those projects was a crude MP3-player that was script-able in a crude language I made up that was mostly inspired by the scripting in the first Half-Life engine.

A year or so ago I undertook a project in designing a scripting language. I made an interpreter for it over a few weeks of work and it did work, but it was a very crude and inefficient engine (as was the language). I do like the idea of the language that I came up with and it did have some interesting points, and I hope to keep designing and writing it during off times, while blogging about it. Hopefully it will make for some interesting articles; I know I will personally learn a lot. I won't go into too much detail about the language in this post, but I might as well write about some of the more general points and the design philosophies.

I'm calling the language SML, which is short for Simply/Scripting Markup Language. Sadly the abbreviation is already taken by another language, so I will have to change that (perhaps SSML but that sounds corny). The name will stand however, since it reflects what the language is supposed to do just perfectly. My idea, or what I want to do with it, is to design a language that blends a markup language akin to XML (just simpler) with a scripting language.

The idea of the language is not to make it useful, for me it's a learning experience, and I believe it's an important experience. It's also something that I am very interested in anyway. Hopefully someone somewhere will find the language useful, but God knows there are enough programming languages as is. In either case I will start a series of blog posts related to the language, covering areas that come up when writing a language. The first post will be about parsing expressions, quite an interesting topic on its own.


(In case of interest the old interpreter can be found here, but it does not reflect what the language will be very well. A lot of the design decisions, even in the design of the language, are flawed. Indeed I knew that when I wrote it, I just wanted so see if I could do it. I could, for some definition of could :).)

onsdag 23 oktober 2013

A TMP factory

Let's say you have a domain, a game for instance, that includes well over 100 different classes in a class hierarchy (this is not an unreasonable number, but it's also an arbitrary number). You need a way of creating all these different objects. Fair enough, just create them. The problem that arises, sooner rather than later, is that you have will eventually have countless of different places where you include different parts of this hierarchy. In the specific case of C++ this bloats your include diagrams into an unreadable form in the worst cases, but more generally it creates a lot of dependencies. For example your player class needs to include the different bullet classes in order to create them. Moreover, whenever you add a new type of bullet to your game, this needs to also be included.

Would it not be nice, then, to have a single class whose only responsibility is to create these objects. It, solely, includes every class it needs to create, and it then provides a method which create specific objects and returns them. This is a factory.

There are different ways of implementing a factory, but (part of) the method described in Design Patterns (by Gamma et. al) is something akin to this:

    Creator.h 

1:  namespace Creator  
2:  {  
3:      Enum ObjectID {player, enemy};  
4:      object* Creator::create(ObjectID id);  
5:  };  

    Creator.cpp  

1:  #include "Player.h"  
2:  #include "Enemy.h"  
3:  object* Creator::create(ObjectID id)  
4:  {  
5:      if (id == player) return new Player();  
6:      else if (id == enemy) return new Enemy();  
7:      /* and so on*/  
8:      else return NULL; /*    could be useful if the enum contains an   
9:                   entry for which we have no class yet     */  
10:  }  

Note: the actual pattern as it's described in the book is more flexible and fleshed out, but this is the gist of it: a method that takes an ID as a parameter, and based on that ID returns an object. Here the ID is in the form of an enum, but it can as easily be a string (which has its own advantages and disadvantages).

The thing to note here is that any class that uses the creator to create its objects have no idea how to create these objects, all it has is a list of IDs that magically turns into an object through the create method. Also note that whenever a new object is included in the creator, only Creator.c has to be recompiled. This is a big step forward from the naive approach.

One disadvantage of this approach is that - most obviously - a new if statement must be added for every class we add to the factory. A better approach would be to use a lookup table somehow:

1:  #include <map>  
2:  #include <string>  
3:  #include <functional>  
4:  class Factory  
5:  {  
6:  public:  
7:      Factory()  
8:      {  
9:          creators.insert(std::make_pair  
10:              ("Player",  
11:              []() -> Object* {return new Player();}));  
12:          creators.insert(std::make_pair  
13:              ("Enemy",  
14:              []() -> Object* {return new Enemy();}));      
15:      }  
16:      Object* create(const std::string& type)  
17:      {  
18:          return creators[type]();  
19:      }      
20:  private:  
21:      std::map<std::string, std::function<Object* ()>> creators;  
22:  };  

Here we instead us a mapping from a string to an object creator-function, supported by std::map and std::functional. For brevity I defined all functions inside the header; outside of this example they should be defined in a separate *.cpp-file to limit dependencies, as we did in the previous code snippet (if we don't we might as well not use a factory to begin with!).

The disadvantages of using a string for an ID is obvious, it's very easy to mistype and the error will not be caught during compilation. That's a huge problem. In the example above we don't even check whether the string is mapped or not, we naively assume the caller will do the right thing. In this case we have undefined behaviour if the string isn't mapped. This is easy to fix of course (std::map::count()).

The advantages are this:

  1. It's very flexible.
  2. There is no enum we need to know about.
  3. We can clearly see what we are creating (as opposed to create(1) for enemy and create(0) for player.
  4. We can create objects directly by reading from a file or similar.

In my opinion the last point is the most enticing. The simplest way to demonstrate this would be something akin to this:

1:  void UserCreatesObjects()  
2:  {      
3:      std::string input;  
4:      std::vector<Object*> objects;  
5:      std::cout << "What should I create? \n";  
6:      std::getline(std::cin, input);  
7:      while (!input.empty());  
8:      {  
9:          objects.push_back(factory.create(input));  
10:          std::cout << "What should I create next \n";  
11:          std::getline(std::cin, input);  
12:      }  
13:      for (auto object : objects)  
14:      {  
15:          std::cout << object->what() << "\n";  
16:      }  
17:  }  

No pesky if statements, just create what the user says. This makes loading a game state from a file a breeze, for example. Note that the only include you need to make this method work is to include the factory-header, which in turn doesn't include any of the concrete classes itself. It's safe to say that the user still have no idea how to create these objects, even when he, literally, does! Here we also use a creator, which is a function a factory uses to create an object. The factory itself doesn't actually create the object, it calls a creator which creates the object for the factory.

Of course, we need to handle bad user input.

A question that might arise is how do we allow parameters? Some objects doesn't make sense to create without parameters. This problem CAN be solved, but it will bloat the code up a lot (at least my solution), and I will not include it here. A problem with allowing parameters is that it makes the coupling between the user of the factory and the objects he creates, he needs to know more about how the object is created, and it's not simply just a string anymore. One can argue that allowing parameters defeats the purpose of the factory. There are cases where you might allow parameters: for example the Unity engine requires every object to have a transform component, in the same way your game engine might require every object to have a position. In this case it makes sense that every object be created with a position, and in then you might have the create method for the factory take a position-parameter, knowing that every object has a constructor for it.

The following code is a template version of the type of factory I've been describing (it's a TMP factory!):
    Factory.h

1:  #pragma once  
2:  #include <memory>  
3:  #include <map>  
4:  #include <string>  
5:  #include <algorithm>  
6:  #include <functional>  
7:  template <class T, class ... Ps>  
8:  class Factory  
9:  {  
10:  public:  
11:      std::shared_ptr<T> create(const std::string& str, Ps... ps)  
12:      {  
13:          auto found = creators.find(str);  
14:          if (found != creators.end())  
15:              return found->second(ps...);  
16:          else  
17:              return nullptr;  
18:      }  
19:      template <class U>  
20:      void addCreator(std::string str)  
21:      {  
22:          creators.insert(std::make_pair(str, 
23:              [&] (Ps... ps) -> std::shared_ptr<U>
24:              {return std::make_shared<U>(ps...);}));  
25:      }  
26:  private:  
27:      std::map<std::string,std::function<std::shared_ptr<T>(Ps...)>> creators;  
28:  };  

A factory object is set up with a type, which is the type the create()-function will return. This type will be the base class of whatever objects we create. We also provide a variadic number of parameters; every object we create should (must) provide a constructor which takes these parameters (in that order). Using parameters with the factory is, of course, completely optional. I should add, this code has been tested, and works, in Visual Studios Express 2013 Preview.

The interface for the class is rather simple. The function addCreator essentially maps a string to a class, by creating a lamda which returns an object (actually a shared_ptr) of that class, mapping that lamda to the string-value. The create-function uses this mapping to return a new object of that class; it also takes some parameters, which are the parameters we specified as the template arguments when creating the Factory object. It handles faulty user input by returning a nullptr if the string is not found in the map.

Example of use:

As an example of use I will use fruits (yum). It should hopefully convey what my implementation of a factory can do:

    Fruit.h

1:  #pragma once  
2:  #include <string>  
3:  #include "Factory.h"  
4:  struct Fruit  
5:  {  
6:      static void addCreators(Factory<Fruit, int>& factory);  
7:      virtual ~Fruit() {};  
8:      virtual int ripeness() = 0;  
9:      virtual std::string what() = 0;  
10:  };  

    Fruit.cpp

1:  #include "Fruit.h"  
2:  struct Apple;  
3:  struct Banana;  
4:  struct Orange;  
5:  void Fruit::addCreators(Factory<Fruit, int>& factory)  
6:  {  
7:      factory.addCreator<Banana>("Banana");  
8:      factory.addCreator<Orange>("Orange");  
9:      factory.addCreator<Apple>("Apple");  
10:  }  
11:  struct Apple : public Fruit  
12:  {  
13:      Apple(int ripeness) : ripeness_(ripeness) {};  
14:      std::string what()  
15:      {  
16:          return "I am an apple!";  
17:      }  
18:      int ripeness()  
19:      {  
20:          return ripeness_;  
21:      }  
22:  private:  
23:      int ripeness_;  
24:  };  
25:  struct Banana : public Fruit  
26:  {  
27:      Banana(int ripeness) : ripeness_(ripeness) {};  
28:      std::string what()  
29:      {  
30:          return "I am a banana!";  
31:      }  
32:      int ripeness()  
33:      {  
34:          return ripeness_;  
35:      }  
36:  private:  
37:      int ripeness_;  
38:  };  
39:  struct Orange : public Fruit  
40:  {  
41:      Orange(int ripeness) : ripeness_(ripeness) {};  
42:      std::string what()  
43:      {  
44:          return "I am an Orange!";  
45:      }  
46:      int ripeness()  
47:      {  
48:          return ripeness_;  
49:      }  
50:  private:  
51:      int ripeness_;  
52:  };  

Note that, whomever includes fruit.h has no idea what an Orange, an Apple, or a Banana is, let alone how to create them. By the time the fruit class creates the creators for these object, neither does the fruit class! (I'm actually curious as to how that really works). We could add any number of different fruits, and this would change nothing for whoever uses the factory to create the fruits. Below is the code for main(), which uses the factory to create fruits:

    main.cpp

1:  #include "Factory.h"  
2:  #include <vector>  
3:  #include <memory>  
4:  #include <iostream>  
5:  #include "Fruit.h"  
6:  int main()  
7:  {  
8:      std::vector<std::shared_ptr<Fruit>> fruits;  
9:      Factory<Fruit, int> factory;  
10:      Fruit::addCreators(factory);  
11:      fruits.push_back(factory.create("Banana", 1));  
12:      fruits.push_back(factory.create("Apple", 9000));  
13:      fruits.push_back(factory.create("Orange", 42));  
14:      fruits.push_back(factory.create("Apple", -14));  
15:      for (auto& fruit : fruits)  
16:      {  
17:          std::cout << fruit->what() << " - " << fruit->ripeness() << "\n";  
18:      }  
19:  }  

Output:

 Banana! - 1  
 Apple! - 9000  
 Orange - 42  
 Apple - -14  

As we expected.

Again, note that main does not knows how to create the different fruits!

If you have any input at all, feel free to comment. I am sure there are better ways of doing things, and I'm also sure at least half of what I wrote was partly wrong! If you would like to use the above code, feel free to do so.

First Entry

I am a second year programming student at a university in Sweden, where I specifically study games programming. This blog is intended to be a place where I can write about whatever I'm working on in a tutorial form. Hopefully someone will consider it educational. My main language is C++, although I know C# and I can spell Java, and I have looked into some other languages, like Ruby and Javascript.

Currently I'm working on three projects, that are in different stages. I'm working together with a group of students - we call ourselves Tarhead Studios - on a game called Grief. Initially a student project, we've won two awards, most notably Swedish Game Awards 2013 in the category Best Scenario (which was an amazing experience), as well as a local prize called SAGA. Thanks to the awards we are working on a release of that game. As a studio we're also working on our second title, which we will use to break into the business! So far we're barely out of pre-production. I'm also working on my own original idea for a rogue-like(-like!) - together with a handful of other people that I've involved into the project. That project has fallen into the waysides, until I take the time to learn Qt at an acceptable level, which I will need for the editor. I will go into more detail on that game throughout writing on this blog.