The Allegro Wiki is migrating to github at https://github.com/liballeg/allegro_wiki/wiki

Difference between revisions of "Yacc-less parse"

From Allegro Wiki
Jump to: navigation, search
m
m (YaccLessParse moved to Yacc-less parse: Twiki -> MediaWiki title conversion)
(No difference)

Revision as of 07:08, July 15, 2007

As a thank you to the allegro community, since it's where I learned all about C/C++.

A Simple Equation Parser

By Winston Ewert

In this article I'm going to show you how to build an equation parser without using an external tools like flex/bison. Why would we do this? Firstly, it's a good idea to understand some of the mechanics of parsing. Secondly, I think that writing using native programming constructs to describe the language is far better then using an external tool. Thirdly, because it was just fun to try and do something out of the ordinary.

So, first let's define our input class. This class will be called CInput and provide the basic facilities required to get input from the user.

<highlightSyntax language="cpp"> class CInput {

   public:
       CInput();
       char Look();        // returns the current lookahead
       void Match(char c); // matches the next character
       int ReadNumber();   // returns the number currently being looked at
   protected:
       void Forward();     // moves forward in the input stream
       char mLookAhead;    // holds the next character

}; </highlightSyntax>

The lookahead character represents the first character we have not dealt with. It is returned by calling Look(). In our parser we will always be able to tell what the user wants by looking one character ahead. This isn't always true, but for this simple case it is. Match() is called to take care of characters in the input stream. If you call match with a different character then what is really up there, it will throw an exception. ReadNumber() will read a number starting with the current lookahead character and return it to you.

mLookAhead obviously holds the lookahead character. Forward() is used by the Match() and ReadNumber() methods to go to the next character. The user himself cannot call Forward(). He must use one of the other methods. Why? One good reason is the fact that Match() and ReadNumber() will report errors. Bypassing the error system is not a good idea.

Now for some implementation:

<highlightSyntax language="cpp"> CInput() { Forward(); } </highlightSyntax>

When the class is created it will need to read the first bit of input.

<highlightSyntax language="cpp">

       char Look() // returns the current lookahead
       {
           return mLookAhead;
       }

</highlightSyntax> That's pretty simple, just return a member variable.

<highlightSyntax language="cpp"> void CInput::Match(char c) {

   if(c == mLookAhead) // if this really is the correct character
   {
       Forward(); // then simply move forward
   } else {    
       string error = "Expected: "; // make a string
       error.push_back(c); // and throw it
       throw error;
   }    

} </highlightSyntax>

This function is a little bit more complicated, but it's still rather simple. We just check to make sure it was a match. If it wasn't a match, then we throw a string. All the errors in this program will result in throwing a string. In a more heavy duty program you'd probably want an exception class and include line information.

<highlightSyntax language="cpp"> int CInput::ReadNumber() {

   if(isdigit(mLookAhead)) // first check if the lookahead is a digit
   {
       string number; // a string
       while(isdigit(mLookAhead)) // as long as the look ahead is a digit
       {
           number.push_back(mLookAhead);
           Forward();
       }
       
       istringstream ss(number); // open a stream
       int value; 
       ss >> value; // convert string into number
       return value; // return it
   } else {
       throw string("That's not a number!");
   }

} </highlightSyntax>

Again mostly self explanatory. In case some people don't know, isdigit() is a function that returns true if the letter you pass it is a number as opposed to a non-numeric symbol. We use this simple test to read all the digits we can. istreamstream is like cin except that it uses a string as its input instead of the keyboard. Here we use it to convert our number in string form into a plain C++ int. Again, if we didn't find a number, we will throw an exception.

<highlightSyntax language="cpp"> void CInput::Forward() {

   cin >> mLookAhead; // just read the next char

} </highlightSyntax>

Well, that was short. Basically we just used the standard C input. As a bonus from this it will ignore space, newlines, and tabs. So we don't have to handle whitespace. But in a more serious parser, we would want to so we can handle whitespace.

Next, let's do a test. Put code like this in your main function.

<highlightSyntax language="cpp"> try {

 CInput input;
 cout << input.ReadNumber() << endl;
 input.Match('(');
 input.Match(')');
 cout << input.ReadNumber() << endl;

} catch(string error) {

   cout << error << endl;

} </highlightSyntax>

Basically, we'll just call the class a bunch of times to make sure it does what it should. One thing to note is that cin will wait until you press enter before anything happens. So typing 2 () 3 will not even start the checking process until you push enter. Another thing to note is that it ignores spaces. So, 2 3 is considered the same as 23. This is okay for now, but we'll probably decide to fix it later.

Now let's see about actually parsing something:

<highlightSyntax language="cpp">int Expression(CInput & input);</highlightSyntax>

The idea here is simple. We pass to this function the input and it returns the evaluated expression. For our first try we'll only have adding/subtracting.

But first, let's define

<highlightSyntax language="cpp"> int Term(CInput & input) {

   return input.ReadNumber(); // dead simple

} </highlightSyntax>

Yes, there is a reason I'm creating this as a function rather then just calling it, but we'll worry about that later.

<highlightSyntax language="cpp"> int Expression(CInput & input) {

   int a = Term(input);
   switch(input.Look())
   {
       case('+'):
           return a + Term(input);
       case('-'):
           return a - Term(input);
   }

} </highlightSyntax>

Simple enough, first we grab a number. Then we check to see if we are adding or subtracting. We grab a second number and do the requested action, returning the result to our main program. Let's test it, here is new testing code.

<highlightSyntax language="cpp"> try {

 CInput input;
 cout << Expression(input) << endl;

} catch(string error) {

   cout << error << endl;

} </highlightSyntax>

So, I run it, and ask it what 2+3 is. An it tells me that's not a number. So I'll break out the debugger. Now it can be a bit weird with the debugger because the input is read before the parsing begins, but with pratice you can get used to it.

It would appear that I forgot to call match in my expression method. Here is the new version:

<highlightSyntax language="cpp"> int Expression(CInput & input) {

   int a = Term(input);
   switch(input.Look())
   {
       case('+'):
           input.Match('+');
           return a + Term(input);
       case('-'):
           input.Match('-');
           return a - Term(input);
   }

} </highlightSyntax>

It works! But now, what if we want to add 2+3+5? Well, normally for such things in C++ we'd use a while loop. And since we are using C++, we use the native features.

You may ask why I don't throw an error if it's not a plus or a minus? First, I don't like having to exit out of a program by having an error. Secondly, what would normally happen is that it would assume that the character which didn't fit is part of the next construct after the expression. It just so happens we don't have one. But since you brought that up, let's make a rule. From now on, the statement must be ended with a semicolon. Simply add input.Match(';'); to the end of the last function. That way exiting can be done naturally.

<highlightSyntax language="cpp"> int Expression(CInput & input) {

   int a = Term(input);
   int cont = true;
   while(cont)
   {
       switch(input.Look())
       {
           case('+'):
               input.Match('+');
               a = a + Term();
           case('-'):
               input.Match('-');
               a = a - Term();
           default:
               cont = false;
        }
    }    
    input.Match(';');

} </highlightSyntax>

The obvious change is the addition of the while. But you will also note that instead of returning the value, our two actions change the a value. Also note the semicolon match at the bottom.

Hmm, I had a semicolon but then it gave me an error saying I didn't. Doh! Break statements are very important.

Yet another version of Expression.

<highlightSyntax language="cpp"> int Expression(CInput & input) {

   int a = Term(input);
   int cont = true;
   while(cont)
   {
       switch(input.Look())
       {
           case('+'):
               input.Match('+');
               a = a + Term(input);
               break;
           case('-'):
               input.Match('-');
               a = a - Term(input);
               break;
           default:
               cont = false;
               break;
        }
    }    
    input.Match(';');
    return a; // we'd best return our result

} </highlightSyntax>

All right, so now we can calculate infinite amount of pluses and minuses. But how about multiplication and division? The thing that makes those more interesting is that they require operator precedence. But fourtunately, that's not as hard to deal with as it sounds. All we need to do is work backwards in our expression function. That is, our bottom level function, expression, will take care of the items at the lowest level of precedence. Our next level function, term, will take care of our next level of precedence. So, let's change term around.

<highlightSyntax language="cpp"> int Factor(CInput & input) {

   return input.ReadNumber(); // dead simple

}

int Term(CInput & input) {

   int a = Factor(input);
   int cont = true;
   while(cont)
   {
       switch(input.Look())
       {
           case('*'):
               input.Match('*');
               a = a * Factor(input);
               break;
           case('/'):
               input.Match('/');
               a = a / Factor(input);
               break;
           default:
               cont = false;
               break;
        }
    }    
    return a;

} </highlightSyntax>

The new factor is the same as the old term. Term is almost identical to expression. As a matter of fact, I copied expression and changed a few things to create Term. Now this similaritly bugs me. But I can't think of a good way to abstract the differences. So, for the time being we'll ignore them. Now that we've done that, let's see about paranthesis.

<highlightSyntax language="cpp"> int Number(CInput & input) {

   return input.ReadNumber(); // dead simple

} int Expression(CInput & input); int Factor(CInput & input) {

   if(input.Look() == '(')
   {
       input.Match('(');
       return Expression(input);
       input.Match(')');
   }else{
       return Number(input);
   }

} </highlightSyntax>

Once again, Number is the old factor. Now factor does a simple check. Is there a parenthesis coming up? If there is, it matches the first paranthesis, reads an expression and then matches the closing one. If there are no paranthesis then we'll assume it's a number. I've removed the input.Match(';') in Expression since that wouldn't work properly now that we are calling expression again from down below. If you are attached to that semicolon, move the match down into main.

Now, let's add simple variables. We'll just have single character vars. For this simple example, I'll create a global array to hold the values.

<highlightSyntax language="cpp">int values[26];</highlightSyntax>

Next, we'll need to be able to get values from the array, so we'll change number around a bit.

<highlightSyntax language="cpp"> int Number(CInput & input) {

   if(isalpha(input.Look()))
   {
       // we have a variable!
      char var = input.ReadChar();
      int index = tolower(var)-97;
      return values[index];     
   } else {    
   return input.ReadNumber(); // dead simple
   }    

} </highlightSyntax>

Basically, if we encounter a letter instead of a digit we'll grab the var and push that. Finally, we need a way to assign.

<highlightSyntax language="cpp"> void Action(CInput & input) {

   // there are two types of actions, expressions and assignments
   if(isalpha(input.Look()))
   {
      // tis a var!
      char var = input.ReadChar();
      int index = tolower(var)-97; // in theory this should result
                                  // in a usuable index
      input.Match('=');
      values[index] = Expression(input);
  } else {
      cout << Expression(input);
  }
  input.Match(';');

} </highlightSyntax>

Okay, simple enough situation. If we encounter a letter, then we are assigning, otherwise try to parse an expression.

Our main code changes a bit to.

<highlightSyntax language="cpp"> try {

 CInput input;
 while(input.Look() != ';')
 {
  Action(input);
 }   

} catch(string error) {

   cout << error << endl;

} </highlightSyntax>

Now, as long as the next character is not a semicolon we'll stop. This means to exit the program cleanly you need to do two semicolons. One to get out of action, the other to get out of the program.

But, we have a problem. Try assiging the variable a, and then the equation "a" The parser says, "= expected!" Not good. The problem is like this: our parser only looks at one character at a time, but in order to predict whether this is an assignment or a value-get we need to look one past the variable. Now, we could solve this by creating a double look ahead. That way we always read two characters past what we have just parsed. This way, we can check to see whether this is an assignment rather easily. Another way would be to allow infinite lookahead and maintain an internal buffer.

Or we could let the C++ classes take care of all the work for us. It just so happens that the cin objects has a putback method. This means we can add a method to CInput

<highlightSyntax language="cpp"> void Reverse(char c) {

  cin.putback(mLookAhead);
  mLookAhead =c

} </highlightSyntax>

After this call, the program will reverse one character if the character passed is the actual last character that was removed. With this in mind, let's change Action().

<highlightSyntax language="cpp"> void Action(CInput & input) {

   // there are two types of actions, expressions and assignments
   if(isalpha(input.Look()))
   {
      // tis a var!
      char var = input.ReadChar();
      int index = tolower(var)-97; // in theory this should result
                                  // in a usuable index
      if(input.Look() == '=')
      {
             input.Match('=');
             values[index] = Expression(input);
     } else {
         input.Reverse(var); // put our variable back.
         cout << Expression(input);          
     }            
  } else {
      cout << Expression(input);
  }
  input.Match(';');

} </highlightSyntax>

Basically, we read the variable, and then check to see if we had an equal sign. If we did not have an equal sign, we put the variable back and then attempt to parse it as a normal expression.

Well, that's it for now. But we can throw in a few hints as to what can be done to further the product. First, the input class needs to be rewritten. Instead of dealing with characters it should deal with tokens. In our case each important piece in the language was represented by a single character. But that won't last long. Consider variable names, keywords, operators like ==, && and the like. Instead, the input class needs to return an integer that represents what it found. So we have enum which can say we found a string, identifier, keyword, or number. The lookahead will still exist, but it will return the next token not the next letter.

Next, the language is probably going to outgrow ints. For an interpreter, it should probably be replaced by a CValue class. This class can hold any of the types your program supports and will be returned from function to function. However, if you want to write a compiler, you should have a node class. The node class represents an infinite amount of actions. It will have virtual functions that allow you to actually generate the code. It may also allow querying, so that one node can discover the types of it subnodes and perhaps gather information in order to optimize its action.

What I really like about this approach is the fact that I think it gives me better organization. When playing with yacc, and creating nodes I basically had to parse it in the yacc file and then have my action return a newed class. This way, each node can have the neccesary information in order to parse itself.