Sebastian's Programming: Parsing Expressions (Part 2)

In this part I will cover ASTs. In later parts I will add error reporting. I also noticed a mistake in one of the code examples from the previous part - specifically that the left parentheses were tokenized as unary, causing trouble later, that I overlooked. Also, like an idiot, I didn't actually push in the last number as a token. This is fixed in the upload to github and in the examples.

Abstract Syntax Trees

ASTs are used to represent syntactic structure in tree form. Consider the expression 1+2. The way we represent this in tree form is to say that the plus sign is the parent of the 1 and the 2, this will look like this:

   +  
  / \  
 1   2

Of course, the advantage of building trees like this isn't apparent until we build trees from more complex expressions. Let's consider this: 1*(2+3)/4. The AST for that will look like this:

       _(/)_  
      /     \  
   _(*)_     4  
  /     \  
 1     _(+)_  
      /     \  
     2       3

The question is then, what does that mean? One way to think about it, and in fact this is the way we will think about it, is to say that every node in the tree 'executes' the operator in that node using the children at either sides as operands. If the node is a leaf then the node is a number (in fact that must be the case). So leaf nodes are the exception, as they merely return their value (instead of doing an operation on its children nodes). Unary operators is simply an operator that only has one child node. We can build entire code segments this way; in fact a compiler could represent the entire program it's compiling as an AST. If you're using clang there is a command that makes the compiler pretty print the AST for the code you feed it (-ast-print). The upside to using ASTs is that they are very efficient to analyze. For example, let's say we allow variables in our program, and we enter the same expression as before with the addition of replacing one of the numbers with a variable: x*(2+3)/4. If we build an AST from this it becomes very apparent that we can prune one of the nodes but just performing 2+3 before we are told to evaluate the expression. This:

       _(/)_  
      /     \  
   _(*)_     4  
  /     \  
 x     _(+)_  
      /     \  
     2       3

becomes this:

       _(/)_  
      /     \  
   _(*)_     4  
  /     \  
 x       5

We save ourselves one evaluation. This is a very elementary way of optimizing, but it shows the strength of the AST contra the stack-based RPN we built earlier. Note that it is possible to do the same optimization from the RPN, but it's a lot easier and straight-forward to do it here. Another advantage is that we can parse our expression over multiple passes if we wanted to: one of the nodes could simply be part of the expression flat out. Herb Sutter in his reveal of the VC++-compiler claimed that the VC++ compiler never used an AST and this made parsing some details of the new language either hard or impossible (if I understand correctly).

Operands and Operators

Before we start, we need to build ourselves a convenient way of representing operators and operands in the tree. As I hinted in the previous post we can do better than using a switch statement to do the operations. We are going to use polymorphism for this, so we build up this a class hierarchy where we call our base class Node. A Node can be evaluated. To make things a little easier on us we will define a somewhat deep hierarchy consisting of classes: UnaryOperator, BinaryOperator and Number. The code for the classes is below, I wrote everything in one monolithic header-file:

 #include <memory>  
 #include <algorithm>  
 #include "Token.h"  
   
 class Node  
 {  
 public:  
     virtual ~Node() {};  
   
     virtual int evaluate() const = 0;  
     virtual int getDepth(int current) const = 0;  
     virtual std::string toString() const = 0;  
   
     virtual std::shared_ptr<Node> getLeft() const { return nullptr; };  
     virtual std::shared_ptr<Node> getRight() const { return nullptr; };  
   
 };  
   
 class BinaryOperator : public Node  
 {  
 public:  
     BinaryOperator(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
         :left(left_), right(right_) {}  
   
     int getDepth(int current) const  
     {  
         return std::max(left->getDepth(current + 1), right->getDepth(current + 1));  
     }  
   
     std::shared_ptr<Node> getLeft() const { return left; };  
     std::shared_ptr<Node> getRight() const { return right; };  
   
 protected:  
     const std::shared_ptr<Node> left;  
     const std::shared_ptr<Node> right;  
 };  
   
 class UnaryOperator : public Node  
 {  
 public:  
     UnaryOperator(std::shared_ptr<Node> left_)  
         :left(left_) {}  
   
     int getDepth(int current) const  
     {  
         return left->getDepth(current + 1);  
     }  
   
     std::shared_ptr<Node> getLeft() const { return left; };  
   
 protected:  
     const std::shared_ptr<Node> left;  
 };  
   
 class Number : public Node  
 {  
 public:  
     Number(int value_)  
         :value(value_) {}  
   
     int getDepth(int current) const  
     {  
         return current + 1;  
     }  
       
     int evaluate() const  
     {  
         return value;  
     }  
        
     std::string toString() const  
     {  
         return std::to_string(value);  
     }  
   
 private:  
     const int value;  
 };  
   
 class Pos : public UnaryOperator  
 {  
 public:  
     Pos(std::shared_ptr<Node> left_)  
     :UnaryOperator(left_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "+";  
     }  
 };  
   
 class Neg : public UnaryOperator  
 {  
 public:  
     Neg(std::shared_ptr<Node> left_)  
         :UnaryOperator(left_)  
     {}  
   
     int evaluate() const  
     {  
         return -left->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "-";  
     }  
 };  
   
 class Add : public BinaryOperator  
 {  
 public:  
     Add(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
         :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() + right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "+";  
     }  
 };  
   
 class Sub : public BinaryOperator  
 {  
 public:  
     Sub(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
     :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() - right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "-";  
     }  
 };  
   
 class Mul : public BinaryOperator  
 {  
 public:  
     Mul(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
     :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() * right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "*";  
     }  
 };  
   
 class Div : public BinaryOperator  
 {  
 public:  
     Div(std::shared_ptr<Node> left_, std::shared_ptr<Node> right_)  
     :BinaryOperator(left_, right_)  
     {}  
   
     int evaluate() const  
     {  
         return left->evaluate() / right->evaluate();  
     }  
   
     std::string toString() const  
     {  
         return "/";  
     }  
 };

All interface functions except for evaluate are for pretty printing the tree, which I will explain later. Other than that it should be pretty straight forward. The concept is very common and very useful. It's explained extremely well here if you have problem understand it. Note that even though unary operators only have one node we still call it the left node and to the world it will look like its leaf node. AST are not binary trees so we are not required to do this. I chose to do it this way for convenience in this application. Since we never have more than 2 children per node, we can view our tree as a binary tree, and so we can use that terminology. In an AST derived from code in a programming language a node can have any number of children, depending on what token the node represents.

Growing a tree

After we've decided on a way to represent our tokens in the AST, we will begin to create it. We can use the tokens we get after the parsing (using the method found in the previous part) to build an AST using the following algorithm:

For every token in the queue of tokens:

Push operands on a stack as leaf-nodes.
For operators, pop the necessary operands off the stack, create a node with the operator at the top, and the operands hanging below it, push the new node onto the stack

The queue of tokens will simply be the list of tokens from the parser, and the stack is simply a vector of Node-pointers (we will use shared_ptr for simplicity). We also need function to convert tokens to nodes. If we wanted our code it be easily extensible we could use a factory similar to what I described in this post, but using a Token as the key instead of string. In fact this is essentially what I did in my old implementation of SML. In this case we will opt for what is faster to implement: switch statements. The full code for creating ASTs can be found here:

 #include <vector>  
 #include <memory>  
   
 #include "Node.h"  
 #include "Token.h"  
 #include <exception>  
   
 std::shared_ptr<Node> pop(std::vector<std::shared_ptr<Node>>& stack)  
 {  
     auto ret = stack.back();  
     stack.pop_back();  
     return ret;  
 }  
   
 std::shared_ptr<Node> makeNewBinary(Token token, std::shared_ptr<Node> left, std::shared_ptr<Node> right)  
 {  
     switch (token.token.op)  
     {  
     case (Token::Add) :  
         return std::make_shared<Add>(left, right);  
     case (Token::Sub) :  
         return std::make_shared<Sub>(left, right);  
     case (Token::Mul) :  
         return std::make_shared<Mul>(left, right);  
     case (Token::Div) :  
         return std::make_shared<Div>(left, right);  
     }  
     return nullptr; //should never ever ever happen.  
 }  
   
 std::shared_ptr<Node> makeNewUnary(Token token, std::shared_ptr<Node> left)  
 {  
     switch (token.token.op)  
     {  
     case (Token::Pos) :  
         return std::make_shared<Pos>(left);  
     case (Token::Neg) :  
         return std::make_shared<Neg>(left);  
     }  
     return nullptr; //should never ever ever happen.  
 }  
   
 std::shared_ptr<Node> makeTree(const std::vector<Token>& queue)  
 {  
     std::vector<std::shared_ptr<Node>> stack;  
   
     for (auto token : queue)  
     {  
         if (token.type == Token::Type::Operator)  
         {  
             bool unary = (token.token.op == Token::OpType::Pos ||  
                 token.token.op == Token::OpType::Neg);  
             if (unary)  
             {  
                 auto left = pop(stack);  
                 auto node = makeNewUnary(token, left);  
                 if (node)  
                     stack.push_back(node);  
                 else  
                 {  
                     stack.clear();  
                     break;  
                 }  
             }  
             else  
             {  
                 auto right = pop(stack);  
                 auto left = pop(stack);  
                 auto node = makeNewBinary(token, left, right);  
                 if (node)  
                     stack.push_back(node);  
                 else  
                 {  
                     stack.clear();  
                     break;  
                 }  
             }  
         }  
         else  
         {  
             stack.push_back(std::make_shared<Number>(token.token.value));  
         }  
     }  
     return stack.back();  
 }

The AST then is simply the pointer to the root of the tree. One reason for using an AST is that it makes analyzing easier, so we might as well offer a way to analyze the tree. To do this we will simply pretty print the tree using breadth first search. The way we do this is to start at the root node, print it, then push all of its children on a vector, print the vector and clear it, push of all the new nodes' children, print that and clear the vector, etcetc. Because we use a queue, we have to reverse the order in which I print the nodes, but that's fine. Because division is the same character as I would normally use for edges, I decided to use the pipe ('|') for edges instead. I think the output is as clear as you would expect from a console program though. It does limit how deep the expression can be (essentially how many nested parenthesis we can have), but the code can be expanded to print deeper trees. The code is below, and will be in the same file as the code above:

 std::string prettyNodes(std::vector<std::string> row, int spacing)  
 {  
     std::string ret;  
   
     std::string fill(spacing / 2, ' ');  
     ret += fill;  
     for (auto riter = row.rbegin(); riter != row.rend(); ++riter)  
     {  
         std::string fill(spacing - riter->size(), ' ');  
         ret += *riter + fill;  
     }  
     return std::move(ret);  
 }  
   
 std::string prettyEdges(std::vector<std::string> row, int spacing)  
 {  
     std::string ret;  
     std::string fill(spacing / 2, ' ');  
     ret += fill;  
     for (auto riter = row.rbegin(); riter != row.rend(); ++riter)  
     {  
         std::string fill(spacing - riter->size() , ' ');  
         ret += *riter + fill;  
     }  
     return std::move(ret);  
 }  
   
   
 std::string pretty(std::shared_ptr<Node> node)  
 {  
     int depth = node->getDepth(0);  
     int level = depth;  
     std::string ret;  
   
     std::vector<std::shared_ptr<Node>> currentLevel, nextLevel;  
   
     currentLevel.push_back(node);  
     std::vector<std::string> nodes;  
     std::vector<std::string> edges;  
     while (!currentLevel.empty()) {  
         auto currNode = currentLevel.front();  
         bool isOccupied = false;  
         if (currNode == nullptr)  
         {  
             nodes.push_back("");  
         }  
         else  
             nodes.push_back(currNode->toString());  
   
         currentLevel.erase(currentLevel.begin());  
         if (currNode) {  
   
             if (currNode->getRight())  
                 edges.push_back("|");  
             else  
                 edges.push_back("");  
             if (currNode->getLeft())  
                 edges.push_back("|");  
             else  
                 edges.push_back("");  
               
   
             nextLevel.push_back(currNode->getRight());  
             nextLevel.push_back(currNode->getLeft());  
   
             isOccupied = true;  
         }  
         else  
         {  
             edges.push_back("");  
             edges.push_back("");  
   
             nextLevel.push_back(nullptr);  
             nextLevel.push_back(nullptr);  
         }  
         if (currentLevel.empty()) {  
   
             ret += prettyNodes(nodes, std::pow(2, level)) + "\n";  
             ret += prettyEdges(edges, std::pow(2, (level - 1))) + "\n";  
               
             nodes.clear();  
             edges.clear();  
             swap(currentLevel, nextLevel);  
             --level;  
             if (level <= 0)  
                 currentLevel.clear();  
         }  
     }  
     return std::move(ret);  
 }

We can try some inputs and see what happens:

 #include "Tokenization.h"  
 #include "Parsing.h"  
 #include <iostream>  
 #include "AST.h"  
   
 int main()  
 {  
     auto AST = [](const std::string& expr){return makeTree(parse(Tokenize(expr))); };  
   
     std::cout << pretty(AST("1*(2+3)/4")) << "\n";  
     std::cout << pretty(AST("(1+1)*(33-12 * (2 - 1))")) << "\n";  
   
     std::cout << "yey!";   
 }

Output:

This is what we expected. The second tree becomes sort of unclear, but it's not worse than we expected.

Next we can try some inputs of expression and evaluate them to see what if the results are correct. The way we do this is to run an expression through the tokenizer, run the resulting list of tokens from that through our parser, and finally make a tree out of the results from the parser. This is what the AST-lambda in the code below does. We can then call the evaluate function on the root node of the tree (which is return from the makeTree-function):

 int main()  
 {  

     auto AST = [](const std::string& expr){return makeTree(parse(Tokenize(expr))); };  
   
     std::cout << AST("9/(3*(2-1))")->evaluate() << "\n";  
     std::cout << AST("(1+1)*(33-12 * (2 - 1))")->evaluate() << "\n";  
   
     std::cout << "yey!";   
 }

Result:

 3  
 42

Which is correct. This concludes this part. The code is up on github and can be viewed here.

Sebastian's Programming

tisdag 12 november 2013

Parsing Expressions (Part 2)

Abstract Syntax Trees

Operands and Operators

Growing a tree

Inga kommentarer:

Skicka en kommentar