Sebastian's Programming: Parsing Expressions (Part 1)

How does a program evaluate and calculate an expression such as: 2*3+1 or (3*4)/5^2 etc? In this post I will, attempt to, explain some of the ways this can be done; there are several ways of doing it. I will go through three main topics: tokenization, parsing, and evaluation. Tokenization is fairly easy to do when it's just for mathematical expressions. Parsing can be done in a few ways, but I will go through the shunting-yard algorithm, first described by Dijkstra. As for evaluation I will explain two ways: Reverse Polish Notation (RPN), and the Abstract Syntax Tree (AST). RPN is easier to deal with and understand, while ASTs are stronger and can do more. In this part we will look into RPN.

Tokenization

A token, essentially, is any number of characters from a text (or more abstractly, raw data) that make up a logical unit. For example, in the sentence "The answer is 42", there can be said to be 4 tokens: "The", "Answer", "is", "42". The order of the tokes matter. I used the words "can be said to" for an important reason, it doesn't have to be 4 tokens. Whitespaces can be included, every character may be its own token, the text and the numerals may make up only two tokens ("The answer is" and "42"). It all depends on what it is we want to do with it. (The above description is a mild simplification, in truth the word token usually refers to the type of the token, while the word lexeme is used for the value of the token. For this application we have no reason to differentiate.)

In the case of a mathematical expressions what is a token should be quite clear: the different numbers and operators make up the tokens. For example, the expression "(4 * 10) + 2" are made up by the tokens "(", "4", "*", "10", ")", "+" and "2". Note that we don't care about whitespaces here.

A simple Tokenizer

So it looks like what we want to do is go through the string, one character at a time. If we find a number (for example), we read the number, make a token out of it, and go on to the next token. To help us out we have some information and some rules.

In a well formatted expression:

A number is always followed by a binary operator or a right parenthesis.
A binary operator is always followed by either a number, a unary operator or a left parenthesis.
A unary operator is always followed by a number, a left parenthesis (or another unary operator).

In case you don't already know, a binary operator is an operator that takes two arguments, one on either side (such as +, -, * and /), while a unary number only takes one argument, on its left hand side (negation: -). By the way, I won't claim this list to be exhaustive, it's not, but it should give you an idea that we can form these rules and they will help us tokenize. These rules, or rules similar to them, can also be expressed using formal grammar. Indeed this will be a lot more helpful, but it goes beyond the scope of this post.

Our tokenizer will use these rules lazily, that is to say that it will only use them in order to determine what operator an ambigious token is (differentiating between a negation and subtraction). If the input is wrong, then so be it. At this point, we don't care; keep it simple. We know that if we just read a number, the operator must be a binary operator, whereas if we just read an operator, the operator must be a unary operator. A tokenizer isn't necessarily involved in whether the input is faulty, this could've either been handled earlier, or can be handled after; we're just interested in reading tokens.

The algoritm we will use will be very simple. Go through the string character by character. If we find a numeral, add this to a current number-token. If we find an operator add our current number-token, if any, to the list of tokens, then add the operator to the list of tokens, and keep going. If we find a number, expect a binary operator; in any other case, expect a unary operator. Our expectation doesn't mean that we really expect it to come, just if we do find an operator, it will be either a binary operator or a unary operator. We will, for now, represent tokens as strings, for unary operators we will add a 'U' to the string. The following function takes a string as input and tokenizes it, printing out the tokens:

 #include <string>  
 #include <vector>  
 #include <set>  
 #include <iostream>  
   
 void Tokenize(const std::string& str)  
 {  
     bool nextUnary = true;  
     std::vector<std::string> tokens;  
     std::string currentNumber;  
     std::set<char> operators;  
     for (auto op : "+-/*()")  
     {  
         operators.insert(op);  
     }  
     for (auto chr: str)  
     {  
         if (chr <= ' ')  
             continue;  
         if (chr >= '0' && chr <= '9')  
         {  
             currentNumber.push_back(chr);  
             nextUnary = false;  
         }  
         else if (operators.count(chr) == 1)  
         {  
             std::string token;  
             if (!currentNumber.empty())  
                 tokens.push_back(currentNumber);  
             if (nextUnary)  
                 token.push_back('U');  
             token.push_back(chr);  
             tokens.push_back(token);  
             currentNumber.clear();  
             if (chr != '(' && chr != ')')
    nextUnary = true;
         }  
     }  
   
     std::cout << "The expression : \'" << str << "\' was tokenized to:\n";  
     if (tokens.size() == 0)  
         std::cout << "\'\'\n :(";  
     else  
     {  
         std::cout << "\'" << tokens.front() << "\' ";  
         if (tokens.size() != 1)  
         {  
             for (auto iter = tokens.begin() + 1; iter != tokens.end(); ++iter)  
             {  
                 std::cout << ", \'" << *iter << "\'";  
             }  
         }  
     }  
 }  
 
 int main()   
  {   
    Tokenize("1*(2+3)"); // Wellformatted input   
    Tokenize("1122 313 ++---"); //Gee wiz.   
  }

Output:

 The expression: '1*(2+3)' was tokenized to:   
  '1', '*', 'U(', '2', '+', '3', ')'   
     
  The expression: '1122 313 ++---' was tokenized to:   
  '1122313', '+', 'U+', 'U-', 'U-', 'U-'

The reader should note that this code really relies on the user providing valid input. For example the *-operator can be treated as unary (indeed the left parenthesis IS treated as unary following the first input). We will have to fix that later, but for now this will suffice, it's a very simple solution that works when it works. Also note that the program didn't crash when provided with bad input, it just output faulty output. Again this is not really a problem, this will be caught elsewhere, if it ever entered into tokenization in the first place. In the case of this post though, we will simply not care about it.

Representing Tokens

In reality, tokens can be represented in any way at all. You would want them to be as compact as possible though, and only carry just the necessary information. One way of doing it would be using a Union:

 struct Token  
 {  
     enum Type { Operator, Number };  
     Type type;  
     enum OpType { Add, Sub, Pos, Neg, Mul, Div, LParen, RParen };  
     union  
     {  
         int value;  
         OpType op;  
     } token;  
 };

As long as the program can interpret the representation, it doesn't matter. We would have to look into the naming in that struct, but that would have to do for now.

Parsing

While the formal definition is slightly different, to us parsing will essentially mean taking tokens and making sense of them, somehow. The way I look at it parsing is taking some input, analyze it and turn it into a format that we can use. What we are going to do is take our list of tokens and turn them into a format that we can then use to evaluate the value of the original expression.

Shunting-Yard

We are going to use the Shunting-Yard Algorithm for this. Here's the algorithm as described here, using an operator stack and an operand stack reading tokens from left to right:

If we see an operand, push it on the operand stack.
If we see an operator:
- While there’s an operator on top of the operator stack of precedence higher than or equal to that of the operator we’re currently processing, pop it off and apply it. (That is, pop the required operand(s) off the stack, apply the operator to them, and push the result back on the operand stack.)
- Then, push the current operator on the operator stack.
When we get to the end of the formula, apply any operators remaining on the stack, from the top down. Then the result is the only item left on the operand stack (assuming well-formed input).

Later in that post this is said about parentheses:

Parens are a bit of a special case. When you see a left paren, push it on the operator stack; no other operators can pop a paren (so it’s as if it has the lowest precedence). Then when you see a right paren, pop-and-apply any operators on the stack until you get back to a left paren, which is popped and discarded.

There is also a special case for unary operators, these are said to be right-to-left associative, while our other operators are left-to-right associative. The difference in handling them is that they only pop operators that are equal in precedence when they are encountered in the operator stack.

This image, from the wikipedia article, gives an overview and an example:

Fair enough. All we have to do is change our tokenizer to return the list tokens, then we can write the algorithm. When it comes to returning tokens from the tokenizer we have to choices: return the tokens as raw text the way we already store them in a vector, or turn them into tokens, as defined in the union structure described earlier. Either way is fine, but keeping with out philosophy that the tokenizer does no error checking, return them as strings, were a string of numerals is an operand token, an operator character is a binary operator, and an operator character preceded by a U is a unary operator. So before we shunt the tokens through the yard, so to say, we should translate them into the more precise representation format (which is more compact). We also need to define the precedence of our operators, luckily this is very well defined, both in mathematics and in every programming language ever, so we can just use someone else's list of precedence.

The algorithm as explained above also assumes that you want to evaluate the expression while you parse it. We don't want that. The algorithm needs to change so that instead of 'applying' the operator to the operands, we just push the operator onto the operand stack. Logically, instead of calling it an operand stack we will call it a output queue. (The reason I posted that version of the algorithm was that it was the most compact version I found, but it still applies). The algorithm written in C++-code looks like follows:

 #include <string>  
 #include <vector>  
 #include "Token.h"  
 #include <exception>  
 #include <algorithm>  
 #include <map>  

 std::vector<Token> toTokens(const std::vector<std::string>& tokens)  
 {  
     std::vector<Token> ret; 
 
     for (auto& token : tokens)  
     {  
         if (token.size() == 0)  
             continue;  
         if (token[0] >= '0' && token[0] <= '9')  
         {  
             if (std::find_if(token.begin(), token.end(),  
                 [](char test) {return !(test >= '0' && test <= '9'); })  
                 != token.end())  
                 continue; //not all numerals, abort!  
             Token t;  
             t.type = Token::Type::Number;  
             t.token.value = atoi(token.c_str());  
             ret.push_back(t);  
         }  
         else if (token[0] == 'U')  
         {  
             if (token.size() == 1)  
                 continue;  
             if (token[1] == '-' || token[1] == '+')  
             {  
                 Token t;  
                 t.type = Token::Type::Operator;  
                 t.token.op = Token::OpType::Neg;  
                 ret.push_back(t);  
             }  
         }  
         else  
         {  
             Token t;  
             t.type = Token::Type::Operator;  
             switch (token[0])  
             {  
             case '+':  
                 t.token.op = Token::OpType::Add;  
                 break;  
             case '-':  
                 t.token.op = Token::OpType::Sub;  
                 break;  
             case '*':  
                 t.token.op = Token::OpType::Mul;  
                 break;  
             case '/':  
                 t.token.op = Token::OpType::Div;  
                 break;  
             case '\\':  
                 t.token.op = Token::OpType::Div;  
                 break;  
             case '(':  
                 t.token.op = Token::OpType::LParen;  
                 break;  
             case ')':  
                 t.token.op = Token::OpType::RParen;  
                 break;  
             default:  
                 continue;  
             }  
             ret.push_back(t);  
         }  
     }  

     return ret;  
 }  

 std::vector<Token> parse(const std::vector<std::string>& tokens)  
 {  
     std::vector<Token> input = toTokens(tokens);  
     std::vector<Token> operators;  
     std::vector<Token> output;  
     //:3 C++14:  
     std::map<Token::OpType, int> precedence =   
     {   
         {Token::Add, 4},   
         {Token::Sub, 4},  
         {Token::Pos, 2},  
         {Token::Neg, 2},  
         {Token::Mul, 3},  
         {Token::Div, 3},  
         {Token::LParen, 1},  
         {Token::RParen, 1},  
     };  

     for (auto token : input)  
     {  
         if (token.type == Token::Number)  
         {  
             output.push_back(token);  
         }  
         else if (token.type == Token::Operator)  
         {  
             if (token.token.op == Token::OpType::LParen)  
                 operators.push_back(token);  
             else if (token.token.op == Token::OpType::RParen)  
             {  
                 bool foundOne = false;  
                 while (!operators.empty())  
                 {  
                     if (operators.back().token.op != Token::OpType::LParen)  
                     {  
                         output.push_back(operators.back());  
                         operators.pop_back();  
                     }  
                     else  
                     {  
                         foundOne = true;  
                         operators.pop_back();  
                         break;  
                     }  
                     if (!foundOne);  
                         //this is an indication of bad input.   
                         //Ignore that for now  
                 }  
             }  
             else  
             {  
                 bool unary = (token.token.op == Token::OpType::Pos ||  
                     token.token.op == Token::OpType::Neg);  
                 while (!operators.empty())  
                 {  
                     if (unary && precedence[operators.back().token.op]  
                         == precedence[token.token.op] &&  
                         operators.back().token.op != Token::OpType::LParen)  
                     {  
                         output.push_back(operators.back());  
                         operators.pop_back();  
                     }  
                     if (!unary && precedence[operators.back().token.op]  
                         >= precedence[token.token.op] &&  
                         operators.back().token.op != Token::OpType::LParen)  
                     {  
                         output.push_back(operators.back());  
                         operators.pop_back();  
                     }  
                     else  
                         break;  
                 }  
                 operators.push_back(token);  
             }  
         }  
     }  

     while (!operators.empty())  
     {  
         //finding any Parens here is an indication of bad input, here  
         //we simply opt out of pushing that unto the stack.  
         if (operators.back().token.op != Token::OpType::LParen ||  
             operators.back().token.op != Token::OpType::RParen)  
             output.push_back(operators.back());  
         operators.pop_back();  
     }  

     return output;  
 }

Note that toTokens() doesn't report errors (neither does parse()). It just lets errors slide silently by. This is for brevity. In serious code we would handle errors something like this: every token passed into the parser is a pair of an int and a string, where the int points to the position in text the token was found. If the parse finds an error, it can report back (either in an exception or through a special return value) not only that an error was found, but also where it was found. One way of doing that would be to have a third entry in the Token::Type enum being Error, and the value in the token union would let us know what the error was. For brevity I will just let errors slide by silently in these example; this is not production code, and either way errors won't cause crashes. However, if we want our code be at all useful we will have to add error reporting eventually, sooner rather than later.

Evaluation

What the algorithm does is to produce a sequence of tokens from infix notation to postfix notation, this is also known as Reverse Polish Notation. We can use this directly to evaluate it, all it takes is some understanding on how RPN works. We can also very easily turn this into a parse tree (an AST), and then use the parse tree to evaluate it. Let's start with the simpler of them.

Reverse Polish Notation

In my youth I had a calculator, given to my by my dad, where the user would use RPN to calculator. It made the calculation of complex expressions very easy, since you didn't have to use the calculator's memory to temporarily save values, you would use use the calculator's built-in stack.

The way the calculator worked is you would push numbers onto a stack, and then when you pressed an operator it would take the appropriate amount of of numbers (commonly two) and perform that operation, popping the values it used away from the stack, and pushing the new value onto the stack. Here's an online version of an RPN calculator you can play around with. For example: say we want to calculate (1+2)*3. What we would do is to push 1 and 2 onto the stack, then press +, and the calculator would pop the number 2 from the stack, and the number 1 from the stack, and perform addition in, pushing number 3 onto the stack. The stack would now consist of the number 3. To multiply that by 3, we would simply push 3 onto the stack, then press multiply, and the calculator would leave a number 9 on the stack. In postfix notation this would be written as:

 1 2 + 3 *

This is fairly straight-forward to write, the code looks like this:

 #pragma once  
 #include "Token.h"  
 #include <vector>  
 #include <utility>  

 int pop(std::vector<int>& stack)  
 {  
     int ret = stack.back();  
     stack.pop_back();  
     return ret;  
 }  

 std::pair<bool, int> calculate(const std::vector<Token>& queue)  
 {  
     std::vector<int> stack;  
     std::pair<bool, int> ret = { true, 0 };
  
     for (auto token : queue)  
     {  
         if (token.type == Token::Type::Number)  
             stack.push_back(token.token.value);  
         else if (token.type == Token::Type::Operator)  
         {  
             switch (token.token.op)  
             {  
             case (Token::OpType::Add):  
                 if (stack.size() >= 2)  
                     stack.push_back(pop(stack) + pop(stack));  
                 else  
                     ret.first = false;  
                 break;      
             case (Token::OpType::Sub) :  
                 if (stack.size() >= 2)  
                     stack.push_back(-(pop(stack) - pop(stack)));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Pos) :  
                 if (stack.size() >= 1)  
                     stack.push_back(pop(stack));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Neg) :  
                 if (stack.size() >= 1)  
                     stack.push_back(-pop(stack));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Mul) :  
                 if (stack.size() >= 2)  
                     stack.push_back(pop(stack) * pop(stack));  
                 else  
                     ret.first = false;  
                 break;  
             case (Token::OpType::Div) :  
                 if (stack.size() >= 2)  
                 {  
                     int d = pop(stack);  
                     stack.push_back(pop(stack) / d);  
                 }  
                 else  
                     ret.first = false;  
                 break;  
             default:  
                 ret.first = false;  
                 break;  
             }  
         }  
     }  

     if (stack.size() == 1)  
         ret.second = stack.back();  
     else  
         ret.first = false;  

     return ret;  
 }

Here we have to do something if the input is faulty. What I decided to do (in keeping with the overall laziness in error handling so far) is to return an std::pair<bool, int> where the first value of the pair is false if we encountered errors. Something that is very important to realize is that when we perform operations for which order matters (subtraction) we must also read the stack in the right order, the right hand operand will be the first operand we take from the stack. Another thing to note is how the evaluation of the different operators are handled: using a switch statement. There are other ways of dealing with this, that are more expandable and less error prone. One way would be to use an std::map from an OpType to a lamda accepting appropriate inputs and returning the value of the expression. We could then just define the lamdas and the mappings earlier in the function. Another way is using inheritance and polymorphism, this youtube-video can give you an idea how that would work. We could then even use a factory to create appropriate operator-objects. In fact I use that method in my old SML-implementation. While it was quite easy to work with I can't get away from the fact that it involves a lot of pointer indirections, which might not be appropriate, I would much rather use the lamda-method if I needed something similar. For now, though, there is no reason not to use a switch-statement. We will see later in our discussion on AST's that this might not be the best way to do it.

Either way, this works. Let's try some inputs. We are nice now, and we won't try any faulty inputs, our system isn't built for that anyway. The more observant reader will realize that it's also only built to handle integer values. That'll have to do for now, but expanding the system for floats instead should not be very difficult (a problem might arise when we want to handle both, accurately). Our new main looks like this:

 #include "Tokenization.h"  
 #include "Parsing.h"  
 #include "RPN.h"  
 #include <iostream>  
   
 int main()  
 {  
     auto RPN = [](const std::string& expr)
         { return calculate(parse(Tokenize(expr))).second; };  
       
     std::cout << RPN("3*-(2+1)") << "\n"; //should equal -9  
     std::cout << RPN("9/(3*(2-1))") << "\n"; //should equal 3  
     std::cout << RPN("(1+1)*(33-12)") << "\n"; //should equal 42  
     std::cout << "yey!";   
 }

... and produces the expected output:

 -9  
 3  
 42  
 yey!

AST's are a more complicated subject. I suspect, strongly, that the write up on that will be as long as this entire post. Therefore I will leave it for a second part. I will post the code on github.

Sebastian's Programming

torsdag 24 oktober 2013

Parsing Expressions (Part 1)