LLLLeeeexxxx ---- AAAA LLLLeeeexxxxiiiiccccaaaallll AAAAnnnnaaaallllyyyyzzzzeeeerrrr GGGGeeeennnneeeerrrraaaattttoooorrrr M. E. Lesk and E. Schmidt Bell Laboratories Murray Hill, New Jersey 07974 _AAA_BBB_SSS_TTT_RRR_AAA_CCC_TTT Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor- script type transformations and for segmenting input in preparation for a parsing routine. Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and parti- tioning the input into strings which match the given expressions. As each such string is recog- nized the corresponding program fragment is exe- cuted. The recognition of the expressions is per- formed by a deterministic finite automaton gen- erated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream. The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it. Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is avail- able on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will only dis- cuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfac- ing with Yacc, for those with access to this compiler-compiler system. - 2 - _111. _III_nnn_ttt_rrr_ooo_ddd_uuu_ccc_ttt_iii_ooo_nnn. cessing programs in the same and often inappropriate Lex is a program gen- string handling language. erator designed for lexical processing of character Lex is not a complete input streams. It accepts a language, but rather a gen- high-level, problem oriented erator representing a new specification for character language feature which can string matching, and pro- be added to different pro- duces a program in a general gramming languages, called purpose language which ``host languages.'' Just as recognizes regular expres- general purpose languages sions. The regular expres- can produce code to run on sions are specified by the different computer hardware, user in the source specifi- Lex can write code in dif- cations given to Lex. The ferent host languages. The Lex written code recognizes host language is used for these expressions in an the output code generated by input stream and partitions Lex and also for the program the input stream into fragments added by the user. strings matching the expres- Compatible run-time sions. At the boundaries libraries for the different between strings program sec- host languages are also pro- tions provided by the user vided. This makes Lex are executed. The Lex adaptable to different source file associates the environments and different regular expressions and the users. Each application may program fragments. As each be directed to the combina- expression appears in the tion of hardware and host input to the program written language appropriate to the by Lex, the corresponding task, the user's background, fragment is executed. and the properties of local implementations. At The user supplies the present, the only supported additional code beyond host language is C, although expression matching needed Fortran (in the form of Rat- to complete his tasks, pos- for [2] has been available sibly including code written in the past. Lex itself by other generators. The exists on UNIX, GCOS, and program that recognizes the OS/370; but the code gen- expressions is generated in erated by Lex may be taken the general purpose program- anywhere the appropriate ming language employed for compilers exist. the user's program frag- ments. Thus, a high level Lex turns the user's expression language is pro- expressions and actions vided to write the string (called _sss_ooo_uuu_rrr_ccc_eee in this memo) expressions to be matched into the host general- while the user's freedom to purpose language; the gen- write actions is unimpaired. erated program is named This avoids forcing the user _yyy_yyy_lll_eee_xxx. The _yyy_yyy_lll_eee_xxx program who wishes to use a string will recognize expressions manipulation language for in a stream (called _iii_nnn_ppp_uuu_ttt in input analysis to write pro- this memo) and perform the - 3 - specified actions for each a newline character, and expression as it is executing the desired rule detected. See Figure 1. action. The first rule _______ matches all strings of Source -> |__L_e_x___| -> yylex blanks or tabs at the end of lines, and the second rule all remaining strings of _______ blanks or tabs. Input -> |_y_y_l_e_x__| -> Output Lex can be used alone An overview of Lex for simple transformations, Figure 1 or for analysis and statis- tics gathering on a lexical For a trivial example, level. Lex can also be used consider a program to delete with a parser generator to from the input all blanks or perform the lexical analysis tabs at the ends of lines. phase; it is particularly %% easy to interface Lex and [ \t]+$ ; Yacc [3]. Lex programs is all that is required. recognize only regular The program contains a %% expressions; Yacc writes delimiter to mark the begin- parsers that accept a large ning of the rules, and one class of context free gram- rule. This rule contains a mars, but require a lower regular expression which level analyzer to recognize matches one or more input tokens. Thus, a com- instances of the characters bination of Lex and Yacc is blank or tab (written \t for often appropriate. When visibility, in accordance used as a preprocessor for a with the C language conven- later parser generator, Lex tion) just prior to the end is used to partition the of a line. The brackets input stream, and the parser indicate the character class generator assigns structure made of blank and tab; the + to the resulting pieces. indicates ``one or more The flow of control in such ...''; and the $ indicates a case (which might be the ``end of line,'' as in QED. first half of a compiler, No action is specified, so for example) is shown in the program generated by Lex Figure 2. Additional pro- (yylex) will ignore these grams, written by other gen- characters. Everything else erators or by hand, can be will be copied. To change added easily to programs any remaining string of written by Lex. blanks or tabs to a single lexical grammar blank, add another rule: rules rules %% [ \t]+$ ; _________ _________ [ \t]+ printf(" "); |___L_e_x____| |__Y_a_c_c____| The finite automaton gen- erated for this source will _________ _________ scan for both rules at once, Input -> |__y_y_l_e_x___| -> |_y_y_p_a_r_s_e__| -> Parsed input observing at the termination of the string of blanks or Lex with Yacc tabs whether or not there is Figure 2 - 4 - Yacc users will realize that preted on the basis of one the name _yyy_yyy_lll_eee_xxx is what Yacc character lookahead. For expects its lexical analyzer example, if there are two to be named, so that the use rules, one looking for _aaa_bbb of this name by Lex simpli- and another for _aaa_bbb_ccc_ddd_eee_fff_ggg, and fies interfacing. the input stream is _aaa_bbb_ccc_ddd_eee_fff_hhh, Lex will recognize _aaa_bbb and Lex generates a deter- leave the input pointer just ministic finite automaton before _ccc_ddd. . . Such backup from the regular expressions is more costly than the pro- in the source [4]. The cessing of simpler automaton is interpreted, languages. rather than compiled, in order to save space. The _222. _LLL_eee_xxx _SSS_ooo_uuu_rrr_ccc_eee. result is still a fast analyzer. In particular, The general format of the time taken by a Lex pro- Lex source is: gram to recognize and parti- {definitions} tion an input stream is pro- %% portional to the length of {rules} the input. The number of %% Lex rules or the complexity {user subroutines} of the rules is not impor- where the definitions and tant in determining speed, the user subroutines are unless rules which include often omitted. The second forward context require a %% is optional, but the significant amount of re- first is required to mark scanning. What does the beginning of the rules. increase with the number and The absolute minimum Lex complexity of rules is the program is thus size of the finite automa- %% ton, and therefore the size (no definitions, no rules) of the program generated by which translates into a pro- Lex. gram which copies the input to the output unchanged. In the program written by Lex, the user's fragments In the outline of Lex (representing the _aaa_ccc_ttt_iii_ooo_nnn_sss to programs shown above, the be performed as each regular _rrr_uuu_lll_eee_sss represent the user's expression is found) are control decisions; they are gathered as cases of a a table, in which the left switch. The automaton column contains _rrr_eee_ggg_uuu_lll_aaa_rrr interpreter directs the con- _eee_xxx_ppp_rrr_eee_sss_sss_iii_ooo_nnn_sss (see section 3) trol flow. Opportunity is and the right column con- provided for the user to tains _aaa_ccc_ttt_iii_ooo_nnn_sss, program frag- insert either declarations ments to be executed when or additional statements in the expressions are recog- the routine containing the nized. Thus an individual actions, or to add subrou- rule might appear tines outside this action integer printf("found keyword INT"); routine. to look for the string _iii_nnn_ttt_eee_ggg_eee_rrr in the input stream Lex is not limited to and print the message source which can be inter- ``found keyword INT'' when- - 5 - ever it appears. In this _OOO_ppp_eee_rrr_aaa_ttt_ooo_rrr_sss. The opera- example the host procedural tor characters are language is C and the C " \ [ ] ^ - ? . * + | ( ) $ / { } % < > library function _ppp_rrr_iii_nnn_ttt_fff is and if they are to be used used to print the string. as text characters, an The end of the expression is escape should be used. The indicated by the first blank quotation mark operator (") or tab character. If the indicates that whatever is action is merely a single C contained between a pair of expression, it can just be quotes is to be taken as given on the right side of text characters. Thus the line; if it is compound, xyz"++" or takes more than a line, matches the string _xxx_yyy_zzz++ it should be enclosed in when it appears. Note that braces. As a slightly more a part of a string may be useful example, suppose it quoted. It is harmless but is desired to change a unnecessary to quote an number of words from British ordinary text character; the to American spelling. Lex expression rules such as "xyz++" colour printf("color"); is the same as the one mechanise printf("mechanize");above. Thus by quoting petrol printf("gas"); every non-alphanumeric char- would be a start. These acter being used as a text rules are not quite enough, character, the user can since the word _ppp_eee_ttt_rrr_ooo_lll_eee_uuu_mmm avoid remembering the list would become _ggg_aaa_sss_eee_uuu_mmm; a way above of current operator of dealing with this will be characters, and is safe described later. should further extensions to Lex lengthen the list. _333. _LLL_eee_xxx _RRR_eee_ggg_uuu_lll_aaa_rrr _EEE_xxx_ppp_rrr_eee_sss_sss_iii_ooo_nnn_sss. An operator character The definitions of reg- may also be turned into a ular expressions are very text character by preceding similar to those in QED [5]. it with \ as in A regular expression speci- xyz\+\+ fies a set of strings to be which is another, less read- matched. It contains text able, equivalent of the characters (which match the above expressions. Another corresponding characters in use of the quoting mechanism the strings being compared) is to get a blank into an and operator characters expression; normally, as (which specify repetitions, explained above, blanks or choices, and other tabs end a rule. Any blank features). The letters of character not contained the alphabet and the digits within [] (see below) must are always text characters; be quoted. Several normal C thus the regular expression escapes with \ are recog- integer nized: \n is newline, \t is matches the string _iii_nnn_ttt_eee_ggg_eee_rrr tab, and \b is backspace. wherever it appears and the To enter \ itself, use \\. expression Since newline is illegal in a57D an expression, \n must be looks for the string _aaa_555_777_DDD. used; it is not required to - 6 - escape tab and backspace. except a, b, or c, including Every character but blank, all special or control char- tab, newline and the list acters; or above is always a text char- [^a-zA-Z] acter. is any character which is not a letter. The \ charac- _CCC_hhh_aaa_rrr_aaa_ccc_ttt_eee_rrr _ccc_lll_aaa_sss_sss_eee_sss. ter provides the usual Classes of characters can be escapes within character specified using the operator class brackets. pair []. The construction [_aaa_bbb_ccc] matches a single char- _AAA_rrr_bbb_iii_ttt_rrr_aaa_rrr_yyy _ccc_hhh_aaa_rrr_aaa_ccc_ttt_eee_rrr. acter, which may be _aaa, _bbb, or To match almost any charac- _ccc. Within square brackets, ter, the operator character most operator meanings are . ignored. Only three charac- is the class of all charac- ters are special: these are ters except newline. Escap- \ - and ^. The - character ing into octal is possible indicates ranges. For exam- although non-portable: ple, [\40-\176] [a-z0-9<>_] matches all printable char- indicates the character acters in the ASCII charac- class containing all the ter set, from octal 40 lower case letters, the (blank) to octal 176 digits, the angle brackets, (tilde). and underline. Ranges may be given in either order. _OOO_ppp_ttt_iii_ooo_nnn_aaa_lll _eee_xxx_ppp_rrr_eee_sss_sss_iii_ooo_nnn_sss. Using - between any pair of The operator ? indicates an characters which are not optional element of an both upper case letters, expression. Thus both lower case letters, or ab?c both digits is implementa- matches either _aaa_ccc or _aaa_bbb_ccc. tion dependent and will get a warning message. (E.g., _RRR_eee_ppp_eee_aaa_ttt_eee_ddd _eee_xxx_ppp_rrr_eee_sss_sss_iii_ooo_nnn_sss. [0-z] in ASCII is many more Repetitions of classes are characters than it is in indicated by the operators * EBCDIC). If it is desired and +. to include the character - _aaa_*** in a character class, it is any number of consecutive should be first or last; _aaa characters, including thus zero; while [-+0-9] a+ matches all the digits and is one or more instances of the two signs. _aaa. For example, [a-z]+ In character classes, is all strings of lower case the ^ operator must appear letters. And as the first character after [A-Za-z][A-Za-z0-9]* the left bracket; it indi- indicates all alphanumeric cates that the resulting strings with a leading string is to be complemented alphabetic character. This with respect to the computer is a typical expression for character set. Thus recognizing identifiers in [^abc] computer languages. matches all characters - 7 - _AAA_lll_ttt_eee_rrr_nnn_aaa_ttt_iii_ooo_nnn _aaa_nnn_ddd _GGG_rrr_ooo_uuu_ppp_--- Lex by _sss_ttt_aaa_rrr_ttt _ccc_ooo_nnn_ddd_iii_ttt_iii_ooo_nnn_sss as _iii_nnn_ggg. The operator | indi- explained in section 10. If cates alternation: a rule is only to be exe- (ab|cd) cuted when the Lex automaton matches either _aaa_bbb or _ccc_ddd. interpreter is in start con- Note that parentheses are dition _xxx, the rule should be used for grouping, although prefixed by they are not necessary on the outside level; using the angle bracket ab|cd operator characters. If we would have sufficed. considered ``being at the Parentheses can be used for beginning of a line'' to be more complex expressions: start condition _OOO_NNN_EEE, then (ab|cd+)?(ef)* the ^ operator would be matches such strings as equivalent to _aaa_bbb_eee_fff_eee_fff, _eee_fff_eee_fff_eee_fff, _ccc_ddd_eee_fff, or _ccc_ddd_ddd_ddd; but not _aaa_bbb_ccc, _aaa_bbb_ccc_ddd, or Start conditions are _aaa_bbb_ccc_ddd_eee_fff. explained more fully later. _CCC_ooo_nnn_ttt_eee_xxx_ttt _sss_eee_nnn_sss_iii_ttt_iii_vvv_iii_ttt_yyy. _RRR_eee_ppp_eee_ttt_iii_ttt_iii_ooo_nnn_sss _aaa_nnn_ddd _DDD_eee_fff_iii_nnn_iii_--- Lex will recognize a small _ttt_iii_ooo_nnn_sss. The operators {} amount of surrounding con- specify either repetitions text. The two simplest (if they enclose numbers) or operators for this are ^ and definition expansion (if $. If the first character they enclose a name). For of an expression is ^, the example expression will only be {digit} matched at the beginning of looks for a predefined a line (after a newline string named _ddd_iii_ggg_iii_ttt and character, or at the begin- inserts it at that point in ning of the input stream). the expression. The defini- This can never conflict with tions are given in the first the other meaning of ^, com- part of the Lex input, plementation of character before the rules. In con- classes, since that only trast, applies within the [] opera- a{1,5} tors. If the very last looks for 1 to 5 occurrences character is $, the expres- of _aaa. sion will only be matched at the end of a line (when Finally, initial % is immediately followed by new- special, being the separator line). The latter operator for Lex source segments. is a special case of the / operator character, which _444. _LLL_eee_xxx _AAA_ccc_ttt_iii_ooo_nnn_sss. indicates trailing context. The expression When an expression ab/cd written as above is matched, matches the string _aaa_bbb, but Lex executes the correspond- only if followed by _ccc_ddd. ing action. This section Thus describes some features of ab$ Lex which aid in writing is the same as actions. Note that there is ab/\n a default action, which con- Left context is handled in sists of copying the input - 8 - to the output. This is per- want to know the actual text formed on all strings not that matched some expression otherwise matched. Thus the like [_aaa-_zzz]+. Lex leaves Lex user who wishes to this text in an external absorb the entire input, character array named without producing any out- _yyy_yyy_ttt_eee_xxx_ttt. Thus, to print the put, must provide rules to name found, a rule like match everything. When Lex [a-z]+ printf("%s", yytext); is being used with Yacc, will print the string in this is the normal situa- _yyy_yyy_ttt_eee_xxx_ttt. The C function tion. One may consider that _ppp_rrr_iii_nnn_ttt_fff accepts a format actions are what is done argument and data to be instead of copying the input printed; in this case, the to the output; thus, in gen- format is ``print string'' eral, a rule which merely (% indicating data conver- copies can be omitted. sion, and _sss indicating Also, a character combina- string type), and the data tion which is omitted from are the characters in the rules and which appears _yyy_yyy_ttt_eee_xxx_ttt. So this just places as input is likely to be the matched string on the printed on the output, thus output. This action is so calling attention to the gap common that it may be writ- in the rules. ten as ECHO: [a-z]+ ECHO; One of the simplest is the same as the above. things that can be done is Since the default action is to ignore the input. just to print the characters Specifying a C null state- found, one might ask why ment, ; as an action causes give a rule, like this one, this result. A frequent which merely specifies the rule is default action? Such rules [ \t\n] ; are often required to avoid which causes the three spac- matching some other rule ing characters (blank, tab, which is not desired. For and newline) to be ignored. example, if there is a rule which matches _rrr_eee_aaa_ddd it will Another easy way to normally match the instances avoid writing actions is the of _rrr_eee_aaa_ddd contained in _bbb_rrr_eee_aaa_ddd action character |, which or _rrr_eee_aaa_ddd_jjj_uuu_sss_ttt; to avoid this, indicates that the action a rule of the form [_aaa-_zzz]+ is for this rule is the action needed. This is explained for the next rule. The pre- further below. vious example could also have been written Sometimes it is more " " convenient to know the end "\t" of what has been found; "\n" hence Lex also provides a with the same result, count _yyy_yyy_lll_eee_nnn_ggg of the number although in different style. of characters matched. To The quotes around \n and \t count both the number of are not required. words and the number of characters in words in the In more complex input, the user might write actions, the user will often [a-zA-Z]+ {words++; chars += yyleng;} - 9 - which accumulates in _ccc_hhh_aaa_rrr_sss a string such as "_aaa_bbb_ccc_\\\"_ddd_eee_fff" the number of characters in first match the five charac- the words recognized. The ters "_aaa_bbb_ccc_\\\; then the call to last character in the string _yyy_yyy_mmm_ooo_rrr_eee() will cause the next matched can be accessed by part of the string, "_ddd_eee_fff, to yytext[yyleng-1] be tacked on the end. Note that the final quote ter- Occasionally, a Lex minating the string should action may decide that a be picked up in the code rule has not recognized the labeled ``normal process- correct span of characters. ing''. Two routines are provided to aid with this situation. The function _yyy_yyy_lll_eee_sss_sss() First, _yyy_yyy_mmm_ooo_rrr_eee() can be might be used to reprocess called to indicate that the text in various cir- next input expression recog- cumstances. Consider the C nized is to be tacked on to problem of distinguishing the end of this input. Nor- the ambiguity of ``=-a''. mally, the next input string Suppose it is desired to would overwrite the current treat this as ``=- a'' but entry in _yyy_yyy_ttt_eee_xxx_ttt. Second, print a message. A rule _yyy_yyy_lll_eee_sss_sss (_nnn) may be called to might be indicate that not all the =-[a-zA-Z] { characters matched by the printf("Op (=-) ambiguous\n"); currently successful expres- yyless(yyleng-1); sion are wanted right now. ... action for =- ... The argument _nnn indicates the } number of characters in which prints a message, _yyy_yyy_ttt_eee_xxx_ttt to be retained. returns the letter after the Further characters previ- operator to the input ously matched are returned stream, and treats the to the input. This provides operator as ``=-''. Alter- the same sort of lookahead natively it might be desired offered by the / operator, to treat this as ``= -a''. but in a different form. To do this, just return the minus sign as well as the _EEE_xxx_aaa_mmm_ppp_lll_eee: Consider a letter to the input: language which defines a =-[a-zA-Z] { string as a set of charac- printf("Op (=-) ambiguous\n"); ters between quotation (") yyless(yyleng-2); marks, and provides that to ... action for = ... include a " in a string it } must be preceded by a \. will perform the other The regular expression which interpretation. Note that matches that is somewhat the expressions for the two confusing, so that it might cases might more easily be be preferable to write written \"[^"]* { =-/[A-Za-z] if (yytext[yyleng-1] ==in'\t\h'e)first case and yymore(); =/-[A-Za-z] else in the second; no backup ... normal user prwoocuelsdsinbge required in the } rule action. It is not which will, when faced with necessary to recognize the - 10 - whole identifier to observe also necessary to match an the ambiguity. The possi- expression that is a prefix bility of ``=-3'', however, of another expression. See makes below for a discussion of =-/[^ \t\n] the character set used by a still better rule. Lex. The standard Lex library imposes a 100 char- In addition to these acter limit on backup. routines, Lex also permits access to the I/O routines Another Lex library it uses. They are: routine that the user will sometimes want to redefine 1) _iii_nnn_ppp_uuu_ttt() which returns is _yyy_yyy_www_rrr_aaa_ppp() which is called the next input charac- whenever Lex reaches an ter; end-of-file. If _yyy_yyy_www_rrr_aaa_ppp returns a 1, Lex continues 2) _ooo_uuu_ttt_ppp_uuu_ttt(_ccc) which writes with the normal wrapup on the character _ccc on the end of input. Sometimes, output; and however, it is convenient to arrange for more input to 3) _uuu_nnn_ppp_uuu_ttt(_ccc) pushes the arrive from a new source. character _ccc back onto In this case, the user the input stream to be should provide a _yyy_yyy_www_rrr_aaa_ppp read later by _iii_nnn_ppp_uuu_ttt(). which arranges for new input and returns 0. This By default these routines instructs Lex to continue are provided as macro defin- processing. The default itions, but the user can _yyy_yyy_www_rrr_aaa_ppp always returns 1. override them and supply private versions. These This routine is also a routines define the rela- convenient place to print tionship between external tables, summaries, etc. at files and internal charac- the end of a program. Note ters, and must all be that it is not possible to retained or modified con- write a normal rule which sistently. They may be recognizes end-of-file; the redefined, to cause input or only access to this condi- output to be transmitted to tion is through _yyy_yyy_www_rrr_aaa_ppp. In or from strange places, fact, unless a private ver- including other programs or sion of _iii_nnn_ppp_uuu_ttt() is supplied internal memory; but the a file containing nulls can- character set used must be not be handled, since a consistent in all routines; value of 0 returned by _iii_nnn_ppp_uuu_ttt a value of zero returned by is taken to be end-of-file. _iii_nnn_ppp_uuu_ttt must mean end of file; and the relationship between _555. _AAA_mmm_bbb_iii_ggg_uuu_ooo_uuu_sss _SSS_ooo_uuu_rrr_ccc_eee _RRR_uuu_lll_eee_sss. _uuu_nnn_ppp_uuu_ttt and _iii_nnn_ppp_uuu_ttt must be retained or the Lex look- Lex can handle ambigu- ahead will not work. Lex ous specifications. When does not look ahead at all more than one expression can if it does not have to, but match the current input, Lex every rule ending in + * ? chooses as follows: or $ or containing / implies lookahead. Lookahead is - 11 - 1) The longest match is tor will not match newline. preferred. Thus expressions like .* stop on the current line. 2) Among rules which Don't try to defeat this matched the same number with expressions like [._\\\_nnn]+ of characters, the rule or equivalents; the Lex gen- given first is pre- erated program will try to ferred. read the entire input file, causing internal buffer Thus, suppose the rules overflows. integer keyword action ...; [a-z]+ identifier action ...; Note that Lex is nor- to be given in that order. mally partitioning the input If the input is _iii_nnn_ttt_eee_ggg_eee_rrr_sss, it stream, not searching for is taken as an identifier, all possible matches of each because [_aaa-_zzz]+ matches 8 expression. This means that characters while _iii_nnn_ttt_eee_ggg_eee_rrr each character is accounted matches only 7. If the for once and only once. For input is _iii_nnn_ttt_eee_ggg_eee_rrr, both rules example, suppose it is match 7 characters, and the desired to count occurrences keyword rule is selected of both _sss_hhh_eee and _hhh_eee in an because it was given first. input text. Some Lex rules Anything shorter (e.g. _iii_nnn_ttt) to do this might be will not match the expres- she s++; sion _iii_nnn_ttt_eee_ggg_eee_rrr and so the he h++; identifier interpretation is \n | used. . ; where the last two rules The principle of ignore everything besides _hhh_eee preferring the longest match and _sss_hhh_eee. Remember that . makes rules containing does not include newline. expressions like .* Since _sss_hhh_eee includes _hhh_eee, Lex dangerous. For example, will normally _nnn_ooo_ttt recognize '.*' the instances of _hhh_eee included might seem a good way of in _sss_hhh_eee, since once it has recognizing a string in sin- passed a _sss_hhh_eee those charac- gle quotes. But it is an ters are gone. invitation for the program to read far ahead, looking Sometimes the user for a distant single quote. would like to override this Presented with the input choice. The action REJECT 'first' quoted string here, 'secomneda'nshere``go do the next the above expression will alternative.'' It causes match whatever rule was second 'first' quoted string here, 'secocnhdo'ice after the current which is probably not what rule to be executed. The was wanted. A better rule position of the input is of the form pointer is adjusted accord- '[^'\n]*' ingly. Suppose the user which, on the above input, really wants to count the will stop after '_fff_iii_rrr_sss_ttt'. included instances of _hhh_eee: The consequences of errors she {s++; REJECT;} like this are mitigated by he {h++; REJECT;} the fact that the . opera- \n | - 12 - . ; [a-z][a-z] { these rules are one way of digram[yytext[0]][yytext[1]]++; changing the previous exam- REJECT; ple to do just that. After } counting each expression, it . ; is rejected; whenever \n ; appropriate, the other where the REJECT is neces- expression will then be sary to pick up a letter counted. In this example, pair beginning at every of course, the user could character, rather than at note that _sss_hhh_eee includes _hhh_eee every other character. but not vice versa, and omit the REJECT action on _hhh_eee; in _666. _LLL_eee_xxx _SSS_ooo_uuu_rrr_ccc_eee _DDD_eee_fff_iii_nnn_iii_ttt_iii_ooo_nnn_sss. other cases, however, it would not be possible a Remember the format of priori to tell which input the Lex source: characters were in both {definitions} classes. %% {rules} Consider the two rules %% a[bc]+ { ... ; REJECT;} {user routines} a[cd]+ { ... ; REJECT;} So far only the rules have If the input is _aaa_bbb, only the been described. The user first rule matches, and on needs additional options, _aaa_ddd only the second matches. though, to define variables The input string _aaa_ccc_ccc_bbb for use in his program and matches the first rule for for use by Lex. These can four characters and then the go either in the definitions second rule for three char- section or in the rules sec- acters. In contrast, the tion. input _aaa_ccc_ccc_ddd agrees with the second rule for four charac- Remember that Lex is ters and then the first rule turning the rules into a for three. program. Any source not intercepted by Lex is copied In general, REJECT is into the generated program. useful whenever the purpose There are three classes of of Lex is not to partition such things. the input stream but to detect all examples of some 1) Any line which is not items in the input, and the part of a Lex rule or instances of these items may action which begins overlap or include each with a blank or tab is other. Suppose a digram copied into the Lex table of the input is generated program. desired; normally the Such source input prior digrams overlap, that is the to the first %% delim- word _ttt_hhh_eee is considered to iter will be external contain both _ttt_hhh and _hhh_eee. to any function in the Assuming a two-dimensional code; if it appears array named _ddd_iii_ggg_rrr_aaa_mmm to be immediately after the incremented, the appropriate first %%, it appears in source is an appropriate place %% for declarations in the - 13 - function written by Lex associated with the name. which contains the The name and translation actions. This material must be separated by at must look like program least one blank or tab, and fragments, and should the name must begin with a precede the first Lex letter. The translation can rule. then be called out by the {name} syntax in a rule. As a side effect of the Using {D} for the digits and above, lines which {E} for an exponent field, begin with a blank or for example, might abbrevi- tab, and which contain ate rules to recognize a comment, are passed numbers: through to the gen- D [0-9] erated program. This E [DEde][-+]?{D}+ can be used to include %% comments in either the {D}+ printf("integer"); Lex source or the gen- {D}+"."{D}*({E})? | erated code. The com- {D}*"."{D}+({E})? | ments should follow the {D}+{E} host language conven- Note the first two rules for tion. real numbers; both require a decimal point and contain an 2) Anything included optional exponent field, but between lines contain- the first requires at least ing only %{ and %} is one digit before the decimal copied out as above. point and the second The delimiters are dis- requires at least one digit carded. This format after the decimal point. To permits entering text correctly handle the problem like preprocessor posed by a Fortran expres- statements that must sion such as _333_555._EEE_QQQ._III, which begin in column 1, or does not contain a real copying lines that do number, a context-sensitive not look like programs. rule such as [0-9]+/"."EQ printf("integer"); 3) Anything after the could be used in addition to third %% delimiter, the normal rule for regardless of formats, integers. etc., is copied out after the Lex output. The definitions section may also contain other com- Definitions intended mands, including the selec- for Lex are given before the tion of a host language, a first %% delimiter. Any character set table, a list line in this section not of start conditions, or contained between %{ and %}, adjustments to the default and begining in column 1, is size of arrays within Lex assumed to define Lex sub- itself for larger source stitution strings. The for- programs. These possibili- mat of such lines is ties are discussed below name translation under ``Summary of Source and it causes the string Format,'' section 12. given as a translation to be - 14 - _777. _UUU_sss_aaa_ggg_eee. default main program on the Lex library calls this rou- There are two steps in tine, but if Yacc is loaded, compiling a Lex source pro- and its main program is gram. First, the Lex source used, Yacc will call must be turned into a gen- _yyy_yyy_lll_eee_xxx(). In this case each erated program in the host Lex rule should end with general purpose language. return(token); Then this program must be where the appropriate token compiled and loaded, usually value is returned. An easy with a library of Lex sub- way to get access to Yacc's routines. The generated names for tokens is to com- program is on a file named pile the Lex output file as _lll_eee_xxx._yyy_yyy._ccc. The I/O library part of the Yacc output file is defined in terms of the C by placing the line standard library [6]. # include "lex.yy.c" in the last section of Yacc The C programs gen- input. Supposing the gram- erated by Lex are slightly mar to be named ``good'' and different on OS/370, because the lexical rules to be the OS compiler is less named ``better'' the UNIX powerful than the UNIX or command sequence can just GCOS compilers, and does be: less at compile time. C yacc good programs generated on GCOS lex better and UNIX are the same. cc y.tab.c -ly -ll The Yacc library (-ly) _UUU_NNN_III_XXX. The library is should be loaded before the accessed by the loader flag Lex library, to obtain a -_lll_lll. So an appropriate set main program which invokes of commands is the Yacc parser. The gen- lex source cc lex.yy.c erations of Lex and Yacc -ll programs can be done in The resulting program is either order. placed on the usual file _aaa._ooo_uuu_ttt for later execution. _999. _EEE_xxx_aaa_mmm_ppp_lll_eee_sss. To use Lex with Yacc see below. Although the default As a trivial problem, Lex I/O routines use the C consider copying an input standard library, the Lex file while adding 3 to every automata themselves do not positive number divisible by do so; if private versions 7. Here is a suitable Lex of _iii_nnn_ppp_uuu_ttt, _ooo_uuu_ttt_ppp_uuu_ttt and _uuu_nnn_ppp_uuu_ttt source program are given, the library can %% be avoided. int k; [0-9]+ { _888. _LLL_eee_xxx _aaa_nnn_ddd _YYY_aaa_ccc_ccc. k = atoi(yytext); if (k%7 == 0) If you want to use Lex printf("%d", k+3); with Yacc, note that what else Lex writes is a program printf("%d",k); named _yyy_yyy_lll_eee_xxx(), the name } required by Yacc for its to do just that. The rule analyzer. Normally, the [0-9]+ recognizes strings of - 15 - digits; _aaa_ttt_ooo_iii converts the if (lengs[i] > 0) digits to binary and stores printf("%5d%10d\n",i,lengs[i]); the result in _kkk. The opera- return(1); tor % (remainder) is used to } check whether _kkk is divisible This program accumulates the by 7; if it is, it is incre- histogram, while producing mented by 3 as it is written no output. At the end of out. It may be objected the input it prints the that this program will alter table. The final statement such input items as _444_999._666_333 or _rrr_eee_ttt_uuu_rrr_nnn(_111); indicates that _XXX_777. Furthermore, it incre- Lex is to perform wrapup. ments the absolute value of If _yyy_yyy_www_rrr_aaa_ppp returns zero all negative numbers divisi- (false) it implies that ble by 7. To avoid this, further input is available just add a few more rules and the program is to con- after the active one, as tinue reading and process- here: ing. To provide a _yyy_yyy_www_rrr_aaa_ppp %% that never returns true int k; causes an infinite loop. -?[0-9]+ { k = atoi(yytextA)s; a larger example, printf("%dh"e,re are some parts of a k%7 == 0pr?ogkr+a3m :wkr)i;tten by N. L. } Schryer to convert double -?[0-9.]+ ECHO; precision Fortran to single [A-Za-z][A-Za-z0-9]+ ECHO; precision Fortran. Because Numerical strings containing Fortran does not distinguish a ``.'' or preceded by a upper and lower case letter will be picked up by letters, this routine begins one of the last two rules, by defining a set of classes and not changed. The including both cases of each _iii_fff-_eee_lll_sss_eee has been replaced by letter: a C conditional expression a [aA] to save space; the form b [bB] _aaa?_bbb:_ccc means ``if _aaa then _bbb c [cC] else _ccc''. ... z [zZ] For an example of An additional class recog- statistics gathering, here nizes white space: is a program which histo- W [ \t]* grams the lengths of words, The first rule changes where a word is defined as a ``double precision'' to string of letters. ``real'', or ``DOUBLE PRECI- int lengs[100S]I;ON'' to ``REAL''. %% {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} { [a-z]+ lengs[yyleng]++; printf(yytext[0]=='d'? "real" : "REAL"); . | } \n ; Care is taken throughout %% this program to preserve the yywrap() case (upper or lower) of the { original program. The con- int i; ditional operator is used to printf("Length No. words\n"); select the proper form of for(i=0; i<100; i++) the keyword. The next rule - 16 - copies continuation card {d}{m}{i}{n}1 | indications to avoid confus- {d}{m}{a}{x}1 { ing them with constants: yytext[0] =+ 'a' - 'd'; ^" "[^ 0] ECHO; ECHO; In the regular expression, } the quotes surround the And one routine must have blanks. It is interpreted initial _ddd changed to initial as ``beginning of line, then _rrr: five blanks, then anything {d}1{m}{a}{c}{h} {yytext[0] =+ 'r' - 'd'; but blank or zero.'' Note the two different meanings of ^. There follow some To avoid such names as _ddd_sss_iii_nnn_xxx rules to change double pre- being detected as instances cision constants to ordinary of _ddd_sss_iii_nnn, some final rules floating constants. pick up longer words as [0-9]+{W}{d}{W}[+-]?{W}[0-9]+ id|entifiers and copy some [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9s]u+rvivin|g characters: "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9[]A+-Za-z]{[A-Za-z0-9]* | /* convert constants */ [0-9]+ | for(p=yytext; *p != 0; p++)\n | { . ECHO; if (*p == 'd' || *p ==N'oDt'e)that this program is *p=+ 'e'- 'd'; not complete; it does not ECHO; deal with the spacing prob- } lems in Fortran or with the After the floating point use of keywords as identif- constant is recognized, it iers. is scanned by the _fff_ooo_rrr loop _111_000. _LLL_eee_fff_ttt _CCC_ooo_nnn_ttt_eee_xxx_ttt _SSS_eee_nnn_sss_iii_--- to find the letter _ddd or _DDD. _ttt_iii_vvv_iii_ttt_yyy. The program than adds '_eee'-'_ddd', which converts it Sometimes it is desir- to the next letter of the able to have several sets of alphabet. The modified con- lexical rules to be applied stant, now single-precision, at different times in the is written out again. There input. For example, a com- follow a series of names piler preprocessor might which must be respelled to distinguish preprocessor remove their initial _ddd. By statements and analyze them using the array _yyy_yyy_ttt_eee_xxx_ttt the differently from ordinary same action suffices for all statements. This requires the names (only a sample of sensitivity to prior con- a rather long list is given text, and there are several here). ways of handling such prob- {d}{s}{i}{n} | lems. The ^ operator, for {d}{c}{o}{s} | example, is a prior context {d}{s}{q}{r}{t} | operator, recognizing {d}{a}{t}{a}{n} | immediately preceding left ... context just as $ recognizes {d}{f}{l}{o}{a}{t} printf("%s",iymymteedxita+t1e)l;y following right Another list of names must context. Adjacent left con- have initial _ddd changed to text could be extended, to initial _aaa: produce a facility similar {d}{l}{o}{g} | to that for adjacent right {d}{l}{o}{g}10 | context, but it is unlikely - 17 - to be as useful, since often letter _aaa, changing _mmm_aaa_ggg_iii_ccc to the relevant left context _sss_eee_ccc_ooo_nnn_ddd on every line which appeared some time earlier, began with the letter _bbb, and such as at the beginning of changing _mmm_aaa_ggg_iii_ccc to _ttt_hhh_iii_rrr_ddd on a line. every line which began with the letter _ccc. All other This section describes words and all other lines three means of dealing with are left unchanged. different environments: a simple use of flags, when These rules are so sim- only a few rules change from ple that the easiest way to one environment to another, do this job is with a flag: the use of _sss_ttt_aaa_rrr_ttt _ccc_ooo_nnn_ddd_iii_ttt_iii_ooo_nnn_sss int flag; on rules, and the possibil- %% ity of making multiple lexi- ^a {flag = 'a'; ECHO;} cal analyzers all run ^b {flag = 'b'; ECHO;} together. In each case, ^c {flag = 'c'; ECHO;} there are rules which recog- \n {flag = 0 ; ECHO;} nize the need to change the magic { environment in which the switch (flag) following input text is { analyzed, and set some case 'a': printf("first"); break; parameter to reflect the case 'b': printf("second"); break; change. This may be a flag case 'c': printf("third"); break; explicitly tested by the default: ECHO; break; user's action code; such a } flag is the simplest way of } dealing with the problem, should be adequate. since Lex is not involved at all. It may be more con- To handle the same venient, however, to have problem with start condi- Lex remember the flags as tions, each start condition initial conditions on the must be introduced to Lex in rules. Any rule may be the definitions section with associated with a start con- a line reading dition. It will only be %Start name1 name2 ... recognized when Lex is in where the conditions may be that start condition. The named in any order. The current start condition may word _SSS_ttt_aaa_rrr_ttt may be abbrevi- be changed at any time. ated to _sss or _SSS. The condi- Finally, if the sets of tions may be referenced at rules for the different the head of a rule with the environments are very dis- <> brackets: similar, clarity may be best expression achieved by writing several is a rule which is only distinct lexical analyzers, recognized when Lex is in and switching from one to the start condition _nnn_aaa_mmm_eee_111. another as desired. To enter a start condition, execute the action statement Consider the following BEGIN name1; problem: copy the input to which changes the start con- the output, changing the dition to _nnn_aaa_mmm_eee_111. To resume word _mmm_aaa_ggg_iii_ccc to _fff_iii_rrr_sss_ttt on every the normal state, line which began with the BEGIN 0; - 18 - resets the initial condition about it, by giving a trans- of the Lex automaton inter- lation table. This table preter. A rule may be must be in the definitions active in several start con- section, and must be brack- ditions: eted by lines containing only ``%T''. The table con- is a legal prefix. Any rule tains lines of the form not beginning with the <> {integer} {character string} prefix operator is always which indicate the value active. associated with each charac- ter. Thus the next example The same example as %T before can be written: 1 Aa %START AA BB CC 2 Bb %% ... ^a {ECHO; BEGIN AA;} 26 Zz ^b {ECHO; BEGIN BB;} 27 \n ^c {ECHO; BEGIN CC;} 28 + \n {ECHO; BEGIN 0;} 29 - magic printf("first"); 30 0 magic printf("second"); 31 1 magic printf("third"); ... where the logic is exactly 39 9 the same as in the previous %T method of handling the prob- lem, but Lex does the work Sample character table. rather than the user's code. maps the lower and upper case letters together into _111_111. _CCC_hhh_aaa_rrr_aaa_ccc_ttt_eee_rrr _SSS_eee_ttt. the integers 1 through 26, newline into 27, + and - The programs generated into 28 and 29, and the by Lex handle character I/O digits into 30 through 39. only through the routines Note the escape for newline. _iii_nnn_ppp_uuu_ttt, _ooo_uuu_ttt_ppp_uuu_ttt, and _uuu_nnn_ppp_uuu_ttt. If a table is supplied, Thus the character represen- every character that is to tation provided in these appear either in the rules routines is accepted by Lex or in any valid input must and employed to return be included in the table. values in _yyy_yyy_ttt_eee_xxx_ttt. For No character may be assigned internal use a character is the number 0, and no charac- represented as a small ter may be assigned a bigger integer which, if the stan- number than the size of the dard library is used, has a hardware character set. value equal to the integer value of the bit pattern _111_222. _SSS_uuu_mmm_mmm_aaa_rrr_yyy _ooo_fff _SSS_ooo_uuu_rrr_ccc_eee _FFF_ooo_rrr_--- representing the character _mmm_aaa_ttt. on the host computer. Nor- mally, the letter _aaa is The general form of a represented as the same form Lex source file is: as the character constant {definitions} '_aaa'. If this interpretation %% is changed, by providing I/O {rules} routines which translate the %% characters, Lex must be told {user subroutines} - 19 - The definitions section con- [xy] the character x or y. tains a combination of [x-z] the characters x, y or z. [^x] any character but x. 1) Definitions, in the . any character but newline. form ``name space ^x an x at the beginning of a line. translation''. x an x when Lex is in start condition y. x$ an x at the end of a line. 2) Included code, in the x? an optional x. form ``space code''. x* 0,1,2, ... instances of x. x+ 1,2,3, ... instances of x. 3) Included code, in the x|y an x or a y. form (x) an x. %{ x/y an x but only if followed by y. code {xx} the translation of xx from the %} definitions section. 4) Start conditions, given x{m,n} _mmm through _nnn occurrences of x in the form %S name1 name2 ... _111_333. _CCC_aaa_vvv_eee_aaa_ttt_sss _aaa_nnn_ddd _BBB_uuu_ggg_sss. 5) Character set tables, in the form There are pathological %T expressions which produce number space character-string exponential growth of the ... tables when converted to %T deterministic machines; for- 6) Changes to internal tunately, they are rare. array sizes, in the form REJECT does not rescan %_xxx _nnn_nnn_nnn the input; instead it where _nnn_nnn_nnn is a decimal remembers the results of the integer representing an previous scan. This means array size and _xxx that if a rule with trailing selects the parameter context is found, and REJECT as follows: executed, the user must not Letter Parameter have used _uuu_nnn_ppp_uuu_ttt to change p positions the characters forthcoming n states from the input stream. This e tree nodes is the only restriction on a transitions the user's ability to mani- k packed character classespulate the not-yet-processed o output array size input. Lines in the rules section _111_444. _AAA_ccc_kkk_nnn_ooo_www_lll_eee_ddd_ggg_mmm_eee_nnn_ttt_sss. have the form ``expression action'' where the action As should be obvious may be continued on succeed- from the above, the outside ing lines by using braces to of Lex is patterned on Yacc delimit it. and the inside on Aho's string matching routines. Regular expressions in Therefore, both S. C. John- Lex use the following opera- son and A. V. Aho are really tors: originators of much of Lex, x the character "x" as well as debuggers of it. "x" an "x", even if x is anMoapneyratthoarn.ks are due to both. \x an "x", even if x is an operator. - 20 - The code of the current Murray Hill, NJ 07974. version of Lex was designed, written, and debugged by Eric Schmidt. MH-1274-MEL-unix M. E. Lesk and E. Schmidt _111_555. _RRR_eee_fff_eee_rrr_eee_nnn_ccc_eee_sss. 1. B. W. Kernighan and D. M. Ritchie, _TTT_hhh_eee _CCC _PPP_rrr_ooo_--- _ggg_rrr_aaa_mmm_mmm_iii_nnn_ggg _LLL_aaa_nnn_ggg_uuu_aaa_ggg_eee, Prentice-Hall, N. J. (1978). 2. B. W. Kernighan, _RRR_aaa_ttt_--- _fff_ooo_rrr: _AAA _PPP_rrr_eee_ppp_rrr_ooo_ccc_eee_sss_sss_ooo_rrr _fff_ooo_rrr _aaa _RRR_aaa_ttt_iii_ooo_nnn_aaa_lll _FFF_ooo_rrr_ttt_rrr_aaa_nnn, Software - Practice and Experience, 5555, pp. 395-496 (1975). 3. S. C. Johnson, _YYY_aaa_ccc_ccc: _YYY_eee_ttt _AAA_nnn_ooo_ttt_hhh_eee_rrr _CCC_ooo_mmm_ppp_iii_lll_eee_rrr _CCC_ooo_mmm_ppp_iii_lll_eee_rrr, Computing Science Technical Report No. 32, 1975, Bell Laboratories, Mur- ray Hill, NJ 07974. 4. A. V. Aho and M. J. Corasick, _EEE_fff_fff_iii_ccc_iii_eee_nnn_ttt _SSS_ttt_rrr_iii_nnn_ggg _MMM_aaa_ttt_ccc_hhh_iii_nnn_ggg: _AAA_nnn _AAA_iii_ddd _ttt_ooo _BBB_iii_bbb_lll_iii_ooo_ggg_rrr_aaa_ppp_hhh_iii_ccc _SSS_eee_aaa_rrr_ccc_hhh, Comm. ACM _111_888, 333-340 (1975). 5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, _QQQ_EEE_DDD _TTT_eee_xxx_ttt _EEE_ddd_iii_--- _ttt_ooo_rrr, Computing Science Technical Report No. 5, 1972, Bell Labora- tories, Murray Hill, NJ 07974. 6. D. M. Ritchie, private communication. See also M. E. Lesk, _TTT_hhh_eee _PPP_ooo_rrr_ttt_aaa_bbb_lll_eee _CCC _LLL_iii_bbb_rrr_aaa_rrr_yyy, Computing Science Technical Report No. 31, Bell Laboratories,