When used alone, the lex program generator makes a lexical analyzer that recognizes simple, one-word input or collects statistical data from input. You can also use the lex program with a parser generator, such as the yacc command. The yacc command generates a program, called a parser, that analyzes the construction of multiword input. This parser operates well with the lexical analyzers that the lex command generates. The parsers recognize many types of grammar with no regard to context, but they need a preprocessor to recognize input tokens, such as the preprocessor that the lex command produces.
The lex program recognizes only extended regular expressions and formats them into character packages called tokens, as specified by the input file. When using the lex program to make a lexical analyzer for a parser, the lexical analyzer (created from the lex command) partitions the input stream. The parser (from the yacc command) assigns structure to the resulting pieces. You can also use other programs along with the programs generated by either the lex or yacc commands.
A token is the smallest independent unit of meaning as defined by either the parser or the lexical analyzer. A token can contain data, a language keyword, an identifier, or other parts of a language syntax.
The yacc program looks for a lexical analyzer subroutine named yylex, which is generated by the lex command. Normally, the default main program in the lex library calls the yylex subroutine. However, if the yacc command is loaded and its main program is used, yacc calls the yylex subroutine. In this case, each lex rule should end with:
return(token);
where the appropriate token value is returned.
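For example, a lex rule that recognizes a hypothetical NUMBER token (the token name is illustrative; the real names come from the yacc grammar file) might look like:

```
[0-9]+    { return(NUMBER); }
```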
The yacc command assigns an integer value to each token defined in the yacc grammar file through a #define preprocessor statement. The lexical analyzer must have access to these macros to return the tokens to the parser. Use the yacc -d option to create a y.tab.h file, and include the y.tab.h file in the lex specification file by adding the following lines to the definition section of the lex specification file:
%{
#include "y.tab.h"
%}
Alternately, you can include the lex.yy.c file in the yacc output file by adding the following line after the second %% (percent sign, percent sign) delimiter in the yacc grammar file:
#include "lex.yy.c"
The yacc library should be loaded before the lex library to get a main program that invokes the yacc parser. You can generate lex and yacc programs in either order.
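Assuming a grammar file named gram.y and a lex specification named scan.l (the file names are illustrative), a typical build sequence that observes this library order is:

```
yacc -d gram.y                  # produces y.tab.c and y.tab.h
lex scan.l                      # produces lex.yy.c
cc y.tab.c lex.yy.c -ly -ll     # yacc library (-ly) before lex library (-ll)
```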
The yacc program creates parsers that define and enforce structure for character input to a computer program. To use this program, you must supply the following inputs:
When the yacc command gets a specification, it generates a file of C language functions called y.tab.c. When compiled using the cc command, these functions form the yyparse subroutine and return an integer. When called, the yyparse subroutine calls the yylex subroutine to get input tokens. The yylex subroutine continues providing input until either the parser detects an error or the yylex subroutine returns an end-marker token to indicate the end of operation. If an error occurs and the yyparse subroutine cannot recover, it returns a value of 1 to main. If it finds the end-marker token, the yyparse subroutine returns a value of 0 to main.
To use the yacc command to generate a parser, give it a grammar file that describes the input data stream and what the parser is to do with the data. The grammar file includes rules describing the input structure, code to be invoked when these rules are recognized, and a subroutine to do the basic input.
The yacc command uses the information in the grammar file to generate a parser that controls the input process. This parser calls an input subroutine (the lexical analyzer) to pick up the basic items (called tokens) from the input stream. The parser organizes these tokens according to the structure rules in the grammar file. The structure rules are called grammar rules. When the parser recognizes one of these rules, it executes the user code supplied for that rule. The user code is called an action. Actions return values and use the values returned by other actions.
Use the C programming language to write the action code and other subroutines. The yacc command uses many of the C language syntax conventions for the grammar file.
You must provide the main and yyerror subroutines for the parser. To ease the initial effort of using the yacc command, the yacc library contains simple versions of the main and yyerror subroutines. Include these subroutines by using the -ly argument to the ld command (or to the cc command). The source code for the main library program is:
#include <locale.h>

main()
{
    setlocale(LC_ALL, "");
    yyparse();
}
The source code for the yyerror library program is:
#include <stdio.h>

yyerror(s)
char *s;
{
    fprintf(stderr, "%s\n", s);
}
The argument to the yyerror subroutine is a string containing an error message, usually the string syntax error.
These are very limited programs; you should provide more function in these subroutines. For example, keep track of the input line number and print it along with the message when a syntax error is detected. You might also want to use the value in the yychar external integer variable, which contains the look-ahead token number at the time the error was detected.
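As a sketch of such an improvement, a replacement yyerror might report the line number and the look-ahead token along with the message. The lineno variable and the message format here are assumptions, not part of the yacc library; lineno is assumed to be incremented by the lexical analyzer on each newline:

```c
#include <stdio.h>
#include <string.h>

int yychar = 0;   /* look-ahead token number at the time of the error */
int lineno = 1;   /* assumed to be maintained by the lexical analyzer */

/* Format a more informative diagnostic into buf. */
int format_error(char *buf, size_t size, const char *s)
{
    return snprintf(buf, size, "line %d: %s (look-ahead token %d)",
                    lineno, s, yychar);
}

/* The improved yyerror prints the formatted diagnostic to stderr. */
void yyerror(char *s)
{
    char buf[128];
    format_error(buf, sizeof buf, s);
    fprintf(stderr, "%s\n", buf);
}
```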
The input subroutine that you supply must be able to:
A token is a symbol or name that tells the parser which pattern is being sent to it by the input subroutine. A nonterminal symbol is the structure that the parser recognizes.
For example, if the input subroutine separates an input stream into the tokens of WORD, NUMBER, and PUNCTUATION, and it receives the following input:
I have 9 turkeys.
the program could choose to pass the following strings and tokens to the parser:
String       Token
I            WORD
have         WORD
9            NUMBER
turkeys      WORD
.            PUNCTUATION
The parser must contain definitions for the tokens passed to it by the input subroutine. Using the -d option for the yacc command, it generates a list of tokens in a file called y.tab.h. This list is a set of #define statements that allow the lexical analyzer (yylex) to use the same tokens as the parser.
To avoid conflict with the parser, do not use names that begin with the letters yy.
You can use the lex command to generate the input subroutine or you can write the routine in the C language.
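As a sketch, a hand-written C input subroutine for the WORD/NUMBER/PUNCTUATION example might look like the following. The token numbers are hypothetical; in a real parser they would come from the y.tab.h file, and the routine would be named yylex:

```c
#include <ctype.h>

/* Hypothetical token numbers; real values come from y.tab.h. */
#define WORD        257
#define NUMBER      258
#define PUNCTUATION 259

static const char *input;   /* current position in the input stream */

/* Return the token number of the next item, or 0 at end of input. */
int next_token(void)
{
    while (*input == ' ')               /* skip blanks */
        input++;
    if (*input == '\0')
        return 0;                       /* endmarker token */
    if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input))
            input++;
        return NUMBER;
    }
    if (isalpha((unsigned char)*input)) {
        while (isalpha((unsigned char)*input))
            input++;
        return WORD;
    }
    input++;                            /* any other single character */
    return PUNCTUATION;
}
```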
A yacc grammar file consists of three sections: declarations, rules, and programs.
Two adjacent %% (double percent signs) separate each section of the grammar file. To make the file easier to read, put the %% on a line by themselves. A complete grammar file looks like:
declarations
%%
rules
%%
programs
The declarations section may be empty. If you omit the programs section, omit the second set of %%. Therefore, the smallest yacc grammar file is:
%%
rules
The yacc command ignores blanks, tabs, and newline characters in the grammar file. Therefore, use these characters to make the grammar file easier to read. Do not, however, use blanks, tabs, or newlines in names or reserved symbols.
Put comments in the grammar file to explain what the program is doing. You can put comments anywhere in the grammar file that you can put a name. However, to make the file easier to read, put the comments on lines by themselves at the beginning of functional blocks of rules. A comment in a yacc grammar file looks the same as a comment in a C language program. The comment is enclosed between /* (slash, asterisk) and */ (asterisk, slash). For example:
/* This is a comment on a line by itself. */
A literal string is one or more characters enclosed in '' (single quotes). As in the C language, the \ (backslash) is an escape character within literals, and all the C language escape codes are recognized. Thus, the yacc command accepts the symbols in the following table:
Never use \0 or 0 (the null character) in grammar rules.
Use the following guidelines to help make the yacc grammar file more readable:
The yacc command cannot produce a parser for all sets of grammar specifications. If the grammar rules contradict themselves or require matching techniques that are different from what the yacc command provides, the yacc command will not produce a parser. In most cases, the yacc command provides messages to indicate the errors. To correct these errors, redesign the rules in the grammar file, or provide a lexical analyzer (input program to the parser) to recognize the patterns that the yacc command cannot.
The declarations section of the yacc grammar file contains:
It is also possible to keep semantic information associated with the tokens currently on the parse stack in a user-defined C language union, if the members of the union are associated with the various names in the grammar file.
A declaration for a variable or constant uses the syntax of the C programming language:
TypeSpecifier Declarator ;
TypeSpecifier is a data type keyword and Declarator is the name of the variable or constant. Names can be any length and consist of letters, dots, underscores, and digits. A name cannot begin with a digit. Uppercase and lowercase letters are distinct.
Terminal (or token) names can be declared using the %token declaration and nonterminal names can be declared using the %type declaration. The %type declaration is not required for nonterminal names. Nonterminal names are defined automatically if they appear on the left side of at least one rule. Without declaring a name in the declarations section, you can use that name only as a nonterminal symbol. The #include statements are identical to C language syntax and perform the same function.
The yacc program has a set of keywords that define processing conditions for the generated parser. Each of these keywords begins with a % (percent sign) and is followed by a list of token or nonterminal names. These keywords are:
The %token, %left, %right, and %nonassoc keywords optionally support the name of a C union member (as defined by %union) called a <Tag> (literal angle brackets surrounding a union member name). The %type keyword requires a <Tag>. The use of <Tag> specifies that the tokens named on the line are to be of the same C type as the union member referenced by <Tag>. For example, the declaration:
%token [<Tag>] Name [Number] [Name [Number]]...
declares the Name parameter to be a token. If <Tag> is present, the C type for all tokens on this line is declared to be the type referenced by <Tag>. If a positive integer, Number, follows the Name parameter, that value is assigned to the token.
All of the tokens on the same line have the same precedence level and associativity. The lines appear in the file in order of increasing precedence or binding strength. For example:
%left '+' '-'
%left '*' '/'
describes the precedence and associativity of the four arithmetic operators. The + (plus sign) and - (minus sign) are left associative and have lower precedence than * (asterisk) and / (slash), which are also left associative.
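A declarations-and-rules sketch that uses these precedence declarations for a small arithmetic grammar (the NUMBER token and the rule bodies are illustrative) might be:

```
%token NUMBER
%left '+' '-'
%left '*' '/'
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | NUMBER
     ;
```

Because '*' and '/' are declared on a later %left line, the parser groups 2 + 3 * 4 as 2 + (3 * 4).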
To define variables to be used by some or all actions, as well as by the lexical analyzer, enclose the declarations for those variables between %{ (percent sign, left brace) and %} (percent sign, right brace) symbols. Declarations enclosed in these symbols are called global variables. For example, to make the var variable available to all parts of the complete program, use the following entry in the declarations section of the grammar file:
%{
int var = 0;
%}
The parser recognizes a special symbol called the start symbol. The start symbol is the name of the rule in the rules section of the grammar file that describes the most general structure of the language to be parsed. Because it is the most general, this is the structure where the parser starts in its top-down analysis of the input stream. Declare the start symbol in the declarations section using the %start keyword. If you do not declare the name of the start symbol, the parser uses the name of the first grammar rule in the grammar file.
For example, when parsing a C language function, the most general structure for the parser to recognize is:
main() { code_segment }
The start symbol should point to the rule that describes this structure. All remaining rules in the file describe ways to identify lower-level structures within the function.
Token numbers are nonnegative integers that represent the names of tokens. If the lexical analyzer passes the token number to the parser instead of the actual token name, both programs must agree on the numbers assigned to the tokens.
You can assign numbers to the tokens used in the yacc grammar file. If you do not assign numbers to the tokens, the yacc command assigns numbers using the following rules:
Note: Do not assign a token number of 0. This number is assigned to the endmarker token. You cannot redefine it.
To assign a number to a token (including literals) in the declarations section of the grammar file, put a positive integer (not 0) immediately following the token name in the %token line. This integer is the token number of the name or literal. Each token number must be unique. All lexical analyzers used with the yacc command must return a 0 or a negative value for a token when they reach the end of their input.
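For example, the following hypothetical declarations assign explicit, unique token numbers:

```
%token WORD        257
%token NUMBER      258
%token PUNCTUATION 259
```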
The rules section contains one or more grammar rules. Each rule describes a structure and gives it a name. A grammar rule has the form:
A : BODY;
where A is a nonterminal name, and BODY is a sequence of 0 or more names, literals, and semantic actions that can optionally be followed by precedence rules. Only the names and literals are required to form the grammar. Semantic actions and precedence rules are optional. The colon and the semicolon are required yacc punctuation.
Semantic actions allow you to associate actions to be performed each time a rule is recognized in the input process. An action can be an arbitrary C statement, and as such, perform input or output, call subprograms, or alter external variables. Actions can also refer to the actions of the parser; for example, shift and reduce.
Precedence rules are defined by the %prec keyword and change the precedence level associated with a particular grammar rule. The reserved symbol %prec can appear immediately after the body of the grammar rule and can be followed by a token name or a literal. The construct causes the precedence of the grammar rule to become that of the token name or literal.
If several grammar rules have the same nonterminal name, use the | (pipe symbol) to avoid rewriting the left side. In addition, use the ; (semicolon) only at the end of all rules joined by pipe symbols. For example, the grammar rules:
A : B C D ;
A : E F ;
A : G ;
can be given to the yacc command by using the pipe symbol as follows:
A : B C D
  | E F
  | G
  ;
Recursion is the process of using a function to define itself. In language definitions, these rules normally take the form:
rule : EndCase | rule EndCase
Therefore, the simplest case of the rule is the EndCase, but rule can also consist of more than one occurrence of EndCase. The entry in the second line that uses rule in the definition of rule is the recursion. The parser cycles through the input until the stream is reduced to the final EndCase.
When using recursion in a rule, always put the call to the name of the rule as the leftmost entry in the rule (as it is in the preceding example). If the call to the name of the rule occurs later in the line, such as in the following example, the parser may run out of internal stack space and stop:
rule : EndCase | EndCase rule
The following example defines the line rule as one or more combinations of a string followed by a newline character (\n):
lines : line
      | lines line
      ;

line  : string '\n'
      ;
To indicate a nonterminal symbol that matches the empty string, use a ; (semicolon) by itself in the body of the rule. To define a symbol empty that matches the empty string, use a rule similar to the following rule:
empty : ;

OR

empty :
      | x ;
When the lexical analyzer reaches the end of the input stream, it sends an end-of-input marker to the parser. This marker is a special token called endmarker, and has a token value of 0. When the parser receives an end-of-input marker, it checks to see that it has assigned all input to defined grammar rules and that the processed input forms a complete unit (as defined in the yacc grammar file). If the input is a complete unit, the parser stops. If the input is not a complete unit, the parser signals an error and stops.
The lexical analyzer must send the end-of-input marker at the appropriate time, such as the end of a file, or the end of a record.
With each grammar rule, you can specify actions to be performed each time the parser recognizes the rule in the input stream. An action is a C language statement that does input and output, calls subprograms, and alters external vectors and variables. Actions return values and obtain the values returned by previous actions. The lexical analyzer can also return values for tokens.
Specify an action in the grammar file with one or more statements enclosed in {} (braces). The following examples are grammar rules with actions:
A : '(' B ')'   { hello(1, "abc"); }
AND
XXX : YYY ZZZ
    {
    printf("a message\n");
    flag = 25;
    }
To get values generated by other actions, an action can use the yacc parameter keywords that begin with a dollar sign ($1, $2, ... ). These keywords refer to the values returned by the components of the right side of a rule, reading from left to right. For example, if the rule is:
A : B C D ;
then $1 has the value returned by the rule that recognized B, $2 has the value returned by the rule that recognized C, and $3 the value returned by the rule that recognized D.
To return a value, the action sets the pseudo-variable $$ to some value. For example, the following action returns a value of 1:
{ $$ = 1;}
By default, the value of a rule is the value of the first element in it ($1). Therefore, you do not need to provide an action for rules that have the following form:
A : B ;
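For example, a rule whose action computes a value from its components and returns it through $$ (a sketch; the expr name is illustrative) might be written as:

```
expr : expr '+' expr
        { $$ = $1 + $3; }
     ;
```

Here $1 and $3 are the values of the two subexpressions, and the value of the '+' token itself ($2) is ignored.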
The following additional yacc parameter keywords beginning with a $ (dollar sign) allow for type checking:
$<Tag>Number imposes on the reference the type of the union member referenced by <Tag>, so that the union member identified by Tag is accessed. This construct is equivalent to specifying $$.Tag or $1.Tag. Use it when actions occur in the middle of rules, where the return type cannot be specified through a %type declaration. If a %type has been declared for a nonterminal name, do not use the <Tag> construct; the union reference is done automatically.
To get control of the parsing process before a rule is completed, write an action in the middle of a rule. If this rule returns a value through the $ keywords, actions that follow this rule can use that value. This rule can also use values returned by actions that precede it. Therefore, the rule:
A : B { $$ = 1; } C { x = $2; y = $3; } ;
sets x to 1 and y to the value returned by C. The value of rule A is the value returned by B, following the default rule.
Internally, the yacc command creates a new nonterminal symbol name for the action that occurs in the middle. It also creates a new rule matching this name to the empty string. Therefore, the yacc command treats the preceding program as if it were written in the following form:
$ACT : /* empty */
    { $$ = 1; }
    ;

A : B $ACT C
    { x = $2; y = $3; }
    ;
where $ACT is an empty action.
When the parser reads an input stream, that input stream might not match the rules in the grammar file. The parser detects the problem as early as possible. If there is an error-handling subroutine in the grammar file, the parser can allow for entering the data again, skipping over the bad data, or initiating a cleanup and recovery action. When the parser finds an error, for example, it may need to reclaim parse tree storage, delete or alter symbol table entries, and set switches to avoid generating further output.
When an error occurs, the parser stops unless you provide error-handling subroutines. To continue processing the input to find more errors, restart the parser at a point in the input stream where the parser can try to recognize more input. One way to restart the parser when an error occurs is to discard some of the tokens following the error. Then try to restart the parser at that point in the input stream.
The yacc command has a special token name, error, to use for error handling. Put this token in the rules file at places that an input error might occur so that you can provide a recovery subroutine. If an input error occurs in this position, the parser executes the action for the error token, rather than the normal action.
The following macros can be placed in yacc actions to assist in error handling:
To prevent a single error from producing many error messages, the parser remains in error state until it processes 3 tokens following an error. If another error occurs while the parser is in the error state, the parser discards the input token and does not produce a message.
For example, a rule of the form:
stat : error ';'
tells the parser that when there is an error, it should skip over the token and all following tokens until it finds the next semicolon. All tokens after the error and before the next semicolon are discarded. After finding the semicolon, the parser reduces this rule and performs any cleanup action associated with it.
You can also allow the person entering the input stream in an interactive environment to correct any input errors by entering a line in the data stream again. For example:
input : error '\n'
        { printf("Reenter last line: "); }
        input
        { $$ = $4; }
      ;
is one way to do this. However, in this example the parser stays in the error state for 3 input tokens following the error. If the corrected line contains an error in the first 3 tokens, the parser deletes the tokens and does not give a message. To allow for this condition, use the yacc statement:
yyerrok;
When the parser finds this statement, it leaves the error state and begins processing normally. The error-recovery example then becomes:
input : error '\n'
        { yyerrok;
          printf("Reenter last line: "); }
        input
        { $$ = $4; }
      ;
The look-ahead token is the next token that the parser examines. When an error occurs, the look-ahead token becomes the token at which the error was detected. However, if the error recovery action includes code to find the correct place to start processing again, that code must also change the look-ahead token. To clear the look-ahead token include the following statement in the error-recovery action:
yyclearin ;
You must provide a lexical analyzer to read the input stream and send tokens (with values, if required) to the parser that the yacc command generates. To build a lexical analyzer that works well with the parser that yacc generates, use the lex command.
The lex command generates a lexical analyzer called yylex. The yylex program must return an integer that represents the kind of token that was read. The integer is called the token number. In addition, if a value is associated with the token, the lexical analyzer must assign that value to the yylval external variable.
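For example, a lex rule for a hypothetical NUMBER token could assign the numeric value to yylval before returning the token number:

```
[0-9]+    { yylval = atoi(yytext); return(NUMBER); }
```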
The yacc command turns the grammar file into a C language program. That program, when compiled and executed, parses the input according to the grammar specification given.
The parser is a finite state machine with a stack. The parser can read and remember the look-ahead token. The current state is always the state at the top of the stack. The states of the finite state machine are represented by small integers. Initially, the machine is in state 0, the stack contains only 0, and no look-ahead token has been read.
The machine can perform one of four actions: shift, reduce, accept, or error.
The parser performs the following actions during one process step:
The shift action is the most common action the parser takes. Whenever the parser does a shift, there is always a look-ahead token. For example, consider the following entry in one of the parser's states:
IF shift 34
If the parser is in the state that contains this entry and the look-ahead token is IF, the parser:
The reduce action keeps the stack from growing too large. The parser uses reduce actions after matching the right side of a rule with the input stream. The parser is then ready to replace the characters in the input stream with the left side of the rule. The parser may have to use the look-ahead token to decide if the pattern is a complete match.
Reduce actions are associated with individual grammar rules. Because grammar rules also have small integer numbers, you can easily confuse the meanings of the numbers in the two actions, shift and reduce. For example, the following action refers to grammar rule 18:
. reduce 18
The following action refers to state 34:
IF shift 34
For example, to reduce the following rule, the parser pops off the top three states from the stack:
A : x y z ;
The number of states popped equals the number of symbols on the right side of the rule. These states are the ones put on the stack while recognizing x, y, and z. After popping these states, a state is uncovered, which is the state the parser was in before beginning to process the rule, that is, the state that needed to recognize rule A to satisfy its rule. Using this uncovered state and the symbol on the left side of the rule, the parser performs an action called goto, which is similar to a shift of A. A new state is obtained, pushed onto the stack, and parsing continues.
The goto action is different from an ordinary shift of a token. The look-ahead token is cleared by a shift but is not affected by a goto action. When the three states are popped, the uncovered state contains an entry such as:
A goto 20
This entry causes state 20 to be pushed onto the stack and become the current state.
The reduce action is also important in the treatment of user-supplied actions and values. When a rule is reduced, the parser executes the code that you included in the rule before adjusting the stack. Another stack, running in parallel with the stack holding the states, holds the values returned from the lexical analyzer and the actions. When a shift takes place, the yylval external variable is copied onto the value stack. After executing the code that you provide, the parser performs the reduction. When the parser performs the goto action, it copies the yyval external variable (which holds the value the action assigned to $$) onto the value stack. The yacc keywords that begin with $ refer to the value stack.
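The value-stack bookkeeping described above can be sketched in C as follows. This is an illustration of the mechanism, not the actual yacc-generated code; the names shift_value, reduce_value, and top_value are hypothetical:

```c
#define STACKSIZE 150

static int vstack[STACKSIZE];   /* value stack, parallel to the state stack */
static int vtop = -1;

/* On a shift, the parser copies yylval onto the value stack. */
void shift_value(int yylval_copy)
{
    vstack[++vtop] = yylval_copy;
}

/* On a reduce of a rule with n right-side symbols, the parser pops n
 * values and pushes the value the action assigned to $$. */
void reduce_value(int n, int dollar_dollar)
{
    vtop -= n;
    vstack[++vtop] = dollar_dollar;
}

int top_value(void)
{
    return vstack[vtop];
}
```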
A set of grammar rules is ambiguous if any possible input string can be structured in two or more different ways. For example, the grammar rule:
expr : expr '-' expr
states a rule that forms an arithmetic expression by putting two other expressions together with a minus sign between them. Unfortunately, this grammar rule does not specify how to structure all complex inputs. For example, if the input is:
expr - expr - expr
a program could structure this input as either left associative:

( expr - expr ) - expr

or right associative:

expr - ( expr - expr )

and produce different results.
When the parser tries to handle an ambiguous rule, confusion occurs over which of its four actions to perform when processing the input. Two major types of conflict develop: shift/reduce conflicts and reduce/reduce conflicts.
A shift/shift conflict is not possible. The shift/reduce and reduce/reduce conflicts result from a rule that is not completely stated. For example, using the ambiguous rule stated in the previous section, if the parser receives the input:
expr - expr - expr
after reading the first three parts the parser has:
expr - expr
which matches the right side of the preceding grammar rule. The parser can reduce the input by applying this rule. After applying the rule, the input becomes:
expr
which is the left side of the rule. The parser then reads the final part of the input:
- expr
and reduces it. This produces a left associative interpretation.
However, the parser can also look ahead in the input stream. If, when the parser receives the first three parts:
expr - expr
it reads the input stream until it has the next two parts, it then has the following input:
expr - expr - expr
Applying the rule to the rightmost three parts reduces them to expr. The parser then has the expression:
expr - expr
Reducing the expression once more produces a right associative interpretation.
Therefore, at the point where the parser has read only the first three parts, it can take two valid actions: a shift or a reduce. If the parser has no rule to decide between the two actions, a shift/reduce conflict results.
A similar situation occurs if the parser can choose between two valid reduce actions. That situation is called a reduce/reduce conflict.
When shift/reduce or reduce/reduce conflicts occur, the yacc command still produces a parser by selecting one of the valid steps wherever it has a choice. If you do not provide a rule that makes the choice, yacc uses the following two rules:

1. In a shift/reduce conflict, the default is to shift.
2. In a reduce/reduce conflict, the default is to reduce by the grammar rule that appears earlier in the grammar file.
Using actions within rules can cause conflicts if the action must be performed before the parser is sure which rule is being recognized. In such cases, the preceding rules result in an incorrect parser. For this reason, the yacc program reports the number of shift/reduce and reduce/reduce conflicts resolved using the preceding rules.
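For the ambiguous subtraction rule discussed earlier, declaring the operator's associativity removes the shift/reduce conflict entirely rather than relying on the default resolution. A sketch (NUMBER is an illustrative token):

```
%left '-'
%%
expr : expr '-' expr
     | NUMBER
     ;
```

With %left '-', the parser reduces expr - expr - expr left associatively, as ( expr - expr ) - expr.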
You can access the debugging code either by invoking the yacc command with the -t option or compiling the y.tab.c file with -DYYDEBUG.
For normal operation, the yydebug external integer variable is set to 0. However, if you set it to a nonzero value, the parser generates a description of:
Set this variable in one of the following ways:
int yydebug = 1;