View Full Version: Compiler/Interpreters

C++ Learning Community > C++ Tips > Compiler/Interpreters


Title: Compiler/Interpreters
Description: How To


myork - September 22, 2004 01:10 PM (GMT)

Anybody want to know how to use lex(flex) and yacc(bison) to build the front end of a compiler/Interpretor?

Sombody else into the Assembley could then add a topic about how to convert the resulting data structures into an executable (or we could just dump out C).


C-Man - September 22, 2004 01:13 PM (GMT)
Yeah would be interesting , but not now ... :P

FrozenKnight - September 22, 2004 06:01 PM (GMT)
the best opne source example i can give you is OllyDbg it has a line by line assembler built in and is the best refrence i can give you on this topic. (and it's written in C so you shouldnt have too much troubble understanding it)

KTC - September 22, 2004 10:44 PM (GMT)
He already know how to do this, he's asking whether anyone want to know it :P

FrozenKnight - September 23, 2004 08:06 AM (GMT)
ya sure i'd like to know more about this. (i got lost in that source anyway LOL)

IllegalEnzyme - September 27, 2004 09:38 AM (GMT)
i'd like to know, its for an interpreter so i don't need any of the assembly stuff.

Consumed - September 27, 2004 05:31 PM (GMT)
I'd love to know how to do it. Always have wanted to make an interpreter. :)

IllegalEnzyme - October 5, 2004 09:06 AM (GMT)
myork - so are you actually going to tell us or not?

BloodGhst - November 15, 2004 08:56 AM (GMT)
I saw a chapter on interpreters in The Art of C++. Interesting book, but I'd be more than happy to learn more.

myork - June 8, 2005 09:30 PM (GMT)
Inspired by Dr V. and remirez.

I put together this:

Incubator - June 8, 2005 09:51 PM (GMT)
I get an error from tar in Dutch stating that it isnt a valid archive

(for those who read dutch: )

CODE

$ tar xvzf calc.tar.gz
tar: Dit ziet er niet uit als een tar-archief
tar: Overslaan tot volgende kop
tar: Afsluitcode uitgesteld na eerdere fouten


yes I have LANG="nl_BE@euro" :P

ramirez - June 8, 2005 09:56 PM (GMT)
Archive is corrupt. :o

Consumed - June 9, 2005 01:34 AM (GMT)
Thanks alot myork. :) One prob I'd like to point out though: Before I can build it properly I have to edit calc.tab.c and comment out
CODE
#if defined (__GNUC_MINOR__) && 2093 <= (__GNUC__ * 1000 + __GNUC_MINOR__)
 __attribute__ ((__unused__))
#endif
or I get this:
CODE
g++ main.cpp calc.tab.c lex.yy.c -o calc
calc.tab.c: In function `int yyparse()':
calc.tab.c:1164: error: expected primary-expression before "__attribute__"
calc.tab.c:1164: error: expected `;' before "__attribute__"
make: *** [calc] Error 1
It works great otherwise. I even extended it a bit with ^ (+ cmath + pow() :P) for power. :)

dr voodoo - June 9, 2005 01:18 PM (GMT)
Corrupt for me to. :(

myork - June 9, 2005 02:33 PM (GMT)
QUOTE (Consumed @ Jun 8 2005, 08:34 PM)
Thanks alot myork. :) One prob I'd like to point out though: Before I can build it properly I have to edit calc.tab.c and comment out
CODE
#if defined (__GNUC_MINOR__) && 2093 <= (__GNUC__ * 1000 + __GNUC_MINOR__)
 __attribute__ ((__unused__))
#endif


calc.tab.c is a generated file.
So this is a problem between your compiler and bison (yacc). You should make sure that the versions you have for each are compatable.

I have to go to a meeting I will post the source shortly.
Its Windows fault :o that the tar.gz was corrupted honest.

myork - June 9, 2005 02:57 PM (GMT)
OK. Here are five files:
CODE

calc.l:   Defines the tokens.
calc.y:   Defines the syntax
main.cpp: The main() and some utilities.
calc.h:   Some headers.
Makefile:



file: calc.l
CODE
%{
#include "calc.h"
#include "calc.tab.h"
#include <sstream>
#include <string>
#include <iomanip>
#include <iostream>


static double decodeNumber(int radex,const char* begin,int length);

%}

BinDigit      [0-1]
OctDigit      [0-7]
DecDigit      [0-9]
HexDigit      [0-9A-Fa-f]

BinNumber     0b{BinDigit}+
OctNumber     0{OctDigit}+
DecNumber     [1-9]{DecDigit}*
HexNumber     0x{HexDigit}+

FloatNumber   {DecNumber}?"."{DecNumber}

WhiteSpace    [ \t]+


%%


"+"             {return('+');}
"-"             {return('-');}
"*"             {return('*');}
"/"             {return('/');}
"%"             {return('%');}
"^"             {return('^');}

"("             {return('(');}
")"             {return(')');}

"\n"            {return('\n');}


{BinNumber}     {value = decodeNumber( 2,yytext+2,yyleng-2);return(NUMBER);}
{OctNumber}     {value = decodeNumber( 8,yytext,yyleng);return(NUMBER);}
{DecNumber}     {value = decodeNumber(10,yytext,yyleng);return(NUMBER);}
{HexNumber}     {value = decodeNumber(16,yytext+2,yyleng-2);return(NUMBER);}

{FloatNumber}   {value = decodeNumber(10,yytext,yyleng);return(NUMBER);}


{WhiteSpace}    {/*Ignore*/}
.               {
                 /*Unexpected Text that does not match anything above*/
                 std::cout << "ERROR: bad input (" << std::string(yytext,yytext+yyleng) << ")" << std::endl;
               }


%%

static double decodeNumber(int radex,const char* begin,int length)
{
 std::stringstream str(std::string(begin,begin+length));
 double result;
 str >> std::setbase(radex) >> result;
 return(result);
}


Explanation:
This file is the source for flex (fast lex). It defines the symbols (tokens/lexemes) of the language. In this case the symbols are very simple:

Arithmatic operations: + - * / % ^
Parathasis: ( )
Numbers

I also use '\n' to mark the end of an expression. That was just a design desision not a required choice.


The flex file looks like this:
CODE


DEFININITIONS

%%

TOKEN DEFINITION

%%

<Normal C/C++ source>


DEFINITIONS:
This is where NUMBER is defined. But it can be used to define any complex token type and assign it a name so it can be used in the TOKEN DEFINITION

The definitions Look like this:

CODE

Name      Definition (This are very similar to Regular expressions)

EBNF:
[ <STUFF> ]  Means match any character between the '[' and ']'
.           Match any 1 character.
+            Means match the previous symbol 1 or more times.
*            Means match the previous symbol 0 or more times.
?            Means match the previous symbol 0 or 1 time.
{NAME}        Means use a previously defined name.

So:

DecDigit        [0-9]

DecDigit will match 1 character that has a value in the range '0' -> '9'

DecNumber     [1-9]{DecDigit}*

DecNumber will match a sequence of 1 or more digits that does not start with 0.
                    [1-9] -> Match 1 character that has a value in the range '1' -> '9'
                    {DecDigit}* Match 0 or more DecDigits.



TOKEN DEFINITIONS
This section maps tokens to C/C++ source actions. This generally means returning the token number to the syntax parser so it can see what token comes next:

For tokens that are a single character long it is traditional to just convert the character code to an integer and return that value. For other values you should return a unique integer value for the token: (Traditionally you should start your unique values at 128 so that they never clash with single character tokens).

CODE
'+'      {return('+');}


In our parser the only non single character token is a number.

CODE
{DecNumber}     {value = decodeNumber(10,yytext,yyleng);return(NUMBER);}


This sets the global variable value with the decoded DecNumber and returns the token NUMBER to the syntax parser.

yytext: is a char pointer to start of the matching text.
yyleng: is the number of characters that match the expression.


If we were to extend the lexer to recognise other C/C++ keywords it would look somthing like this:
CODE
if       {return (STATMENT_IF);}
else     {return (STATMENT_ELSE);}
for      {return (STATMENT_FOR);}
while    {return (STATMENT_WHILE);}



The other things to note is that I ignore whitespace with:

CODE
{WhiteSpace}    {/*Ignore*/}

This is done by matching some text and not returning. This causes the parser to match the text execute the source then continue to find the next piece of matching text.

Any character that is not matched above is therefore an error.
This is caught by the following expression:

CODE
.               {
                 /*Unexpected Text that does not match anything above*/
                 std::cout << "ERROR: bad input (" << std::string(yytext,yytext+yyleng) << ")" << std::endl;
               }

myork - June 9, 2005 03:48 PM (GMT)
file: clac.y
CODE

%{
#include "calc.h"
#include "calc.tab.h"
#include <iostream>


%}
%token  NUMBER

%%

input:        expr  '\n'            {std::cout << "RESULT: " << $1 << std::endl;return(0);};
;

expr:         expr '+' term         {$$ = $1 + $3;}
         |   expr '-' term         {$$ = $1 - $3;}
         |   term                  {$$ = $1;}
;

term:         term '*' factor       {$$ = $1 * $3;}
         |   term '/' factor       {$$ = $1 / $3;}
         |   term '%' factor       {$$ = static_cast<double>(static_cast<int>($1) % static_cast<int>($3));}
         |   factor                {$$ = $1;}
;

factor:       '(' expr ')'          {$$ = $2;}
         |   '-' factor            {$$ = -$2;}
         |   '+' factor            {$$ = +$2;}
         |   NUMBER                {$$ = value;}
;



%%




Explanation:
This the bison (a hairy yacc) file. It has the following defineiton:

CODE


DECLARATIONS

%%

SYNTAX DEFINITION

%%

C/C++ source



DECLARATIONS:

Anything inside
CODE
%{

%}

Is copied to the output file:


CODE
%token NUMBER
This defines a valid token that can be returned by the lexer.


SYNTAX DEFINITION:

The first SYNTAX DEFINITION defines what we are looking for. All the other expressions are only reached via the first SYNTAX DEFINITION.

Lets us define some terms first:

CODE
Terminal:        A value returned by the lexer (or defined by %token above).
NonTerminal:     A definition that is defined in the SYNTAX DEFINITION section.
Token:           A Terminal or a NonTerminal.
Token List:      A list of tokens.


It is traditional the Terminals by either single characters ('+') or defined as all uppercase values. While NonTerminals are traditionally all lowercase.

The syntax:
CODE
<NonTerminal>:      <Token List>  {C/C++ Source} [ | <Token List> {C/C++ Source}]*;


What this means is that a NonTerminal can be defined as a list of Tokens. When the list of tokens has been matched the Code is executed.

Using the '|' operator defines an alternative set of tokens that can also be matched. You can have 0 or more alternative Token lists. Only one of the alternatives will be matched.

Each definition must be terminated with a ';'


Example:
CODE
input:        expr  '\n'            {return(0);};


This is the definition of the NonTerminal 'input'.
It is defined as an expr followed by a new line character.

Example:
CODE
expr:         expr '+' term         {$$ = $1 + $3;}
         |   expr '-' term         {$$ = $1 - $3;}
         |   term                  {$$ = $1;}
;


This is the definition of expr:
Here we define that 'expr' can take three forms:
CODE
expr '+' term
CODE
expr '-' term
CODE
term


We also see the use of the $ notation used by bision. This defines the resulting value of the non terminal.

CODE
$$  : The value of the defined NonTerminal.
$<X>: Is the value of the <X> token.
     For Terminals this is undefined.
     For NonTerminals it is the value set to $$ in that NonTerminals definition.



The definition of a C function would look somthing like this:

CODE

functionDef:   type funcName '(' paramList ')' functionBody {return(new func($1,$2,$4,$6));}
functionBody:  '{' statmentList '}'                         {return($2);}
funcName:      identifier                                   {return($1);}
paramList:                                                  {return(new ParameterList());}
           |   params                                       {return($1);}
params:        parameterDef                                 {return(new ParameterList($1));}
           |   params ',' parameterDef                      {$1->push_back($3);}

myork - June 9, 2005 03:59 PM (GMT)
file: main.cpp
CODE


#include "calc.h"
#include <iostream>

double value;

int main(int argc,char* argv[])
{
 std::cout << "Welcome to Expression Calculator (c) 2005 By myork ^^;;;" << std::endl <<
     "You can use CTRL+C to quit" << std::endl << std::endl;
 for(;;)
 {
   std::cout << "Enter Expression: ";
   yyparse();
 }
 return 0;
}

/*
* yyerror()
*  Generated diagnostic error message.
*/
void yyerror(const char* errorMsg)
{
 std::cout << "ERROR: " << errorMsg << std::endl;
}

/*
* yywrap()
*    Let the lexer know if there is more input after it has reached the end of the input stream.
*/
int yywrap()
{
 /*
  * When the input stream is empty this function is called.
  *
  * If you return 1 (non zero) then we have finished parsing and yyparse() will exit.
  *
  * If you return 0.
  * Then you must set yyin to an open FILE where more input can be read from.
  */
 return(1);
}

myork - June 9, 2005 04:01 PM (GMT)
file:calc.h
CODE


#ifndef CALC_H
#define CALC_H

#define YYSTYPE double
extern double value;

extern "C" int  yywrap();
extern "C" int  yylex();
extern "C" void yyerror(const char* errorMsg);
extern "C" int  yyparse();

#endif


file: Makefile
CODE

YACC  = bison
LEX   = flex

calc:   main.cpp calc.tab.c lex.yy.c
 g++ main.cpp calc.tab.c lex.yy.c -o calc


calc.tab.c: calc.y
 $(YACC) -d calc.y

lex.yy.c:   calc.l
 $(LEX) calc.l

ramirez - June 9, 2005 04:23 PM (GMT)
Nice example.
However, one little 'fix' in the explanation part..
QUOTE
Name      EBNF Definition (This is very similar to Regular expressions)

EBNF:
[ <STUFF> ]  Means match any character between the '[' and ']'
.           Match any 1 character.
+            Means match the previous symbol 1 or more times.
*            Means match the previous symbol 0 or more times.
?            Means match the previous symbol 0 or 1 time.
{NAME}        Means use a previously defined name.

So:

DecDigit        [0-9]

DecDigit will match 1 character that has a value in the range '0' -> '9'

DecNumber     [1-9]{DecDigit}*

DecNumber will match a sequence of 1 or more digits that does not start with 0.
                    [1-9] -> Match 1 character that has a value in the range '1' -> '9'
                    {DecDigit}* Match 0 or more DecDigits.

The flex doesn't use EBNF, flex uses extended regular expressions (EBNF is completely different, for example in EBNF there are no character classes). This is explained in flex manual: http://www.monmouth.com/~wstreett/lex-yacc/flex_1.html
Scroll down to the Patterns section to see the rules of the extended regexps flex uses.

Anyway, it's YACC/Bison that uses EBNF (EDIT. or BNF, not sure).

Other than that, nice.




Hosted for free by InvisionFree