LEX, Flex, JLEX - SisInf Lab
Formal Languages and Compilers
Master’s Degree Course in Computer Engineering, A.Y. 2015/2016
DEI – Politecnico di Bari

FORMAL LANGUAGES AND COMPILERS
LEX, FLEX, JLEX
Floriano Scioscia

LEX/FLEX/JLEX (1/2)
• Since converting regular expressions to deterministic finite-state automata and implementing the latter are mechanical (and boring) processes, automatic scanner generators are often used.
• LEX is a well-known and widely used scanner generator in Unix. It was designed expressly to work with the parser generator YACC. Many assumptions in the code generated by LEX fit well with those of YACC. For example, the scanner produced by LEX is a C function named yylex(), which is exactly what YACC expects from the lexical analyzer.
• LEX was developed by M. E. Lesk and E. Schmidt at AT&T Bell Laboratories. It generates a scanner in C from a set of regular expressions defining tokens.
• The input is a specification: a text file containing token patterns as regular expressions. LEX produces a whole scanner module which can be compiled and linked with the other modules of a compiler.

LEX/FLEX/JLEX (2/2)
• FLEX (Fast LEXical analyzer generator), originally written in C by V. Paxson in 1987, is a free software alternative to LEX and represents a more recent and faster version.
• It is often used with Bison, which is in turn a parser generator alternative to YACC.
• LEX is distributed with the Unix operating system, while FLEX is free software distributed under a BSD-style license and commonly shipped alongside the GNU tools.
• JLEX is a Java version of LEX. Its regular expressions are very similar to the ones used by LEX/FLEX. JLEX generates a scanner in Java. It is often paired with CUP (Constructor of Useful Parsers), a Java alternative to YACC/Bison.
• JLEX: http://www.cs.princeton.edu/~appel/modern/java/JLex/
• CUP: http://www2.cs.tum.edu/projects/cup/
• LEX, FLEX and JLEX are mostly non-procedural: one does not need to state how the tools must perform scanning. Stating what must be scanned, by means of a definition of valid tokens, is all that is needed. This approach greatly simplifies scanner construction, since most scanning details (I/O, buffering, etc.) are managed automatically.

FLEX: description and operation
• The Flex generator takes as input (in a file with the .l extension, e.g. test.l) a set of regular expressions (rules) with the actions (C code) associated with each expression, and produces as output a scanning routine yylex() (in the file lex.yy.c) which can detect and return the individual lexemes admitted in a language.

    Regular expressions -> LEX -> C program

• The file lex.yy.c contains the scanning routine yylex() with other auxiliary routines and macros, but no main(). For a standalone scanner, a default main() is supplied by the fl library when linking with the option -lfl.
• The output file is compiled and linked with library fl to generate an executable file.

    lex test.l
    cc lex.yy.c -o test -lfl

• When the executable (test) is run, it scans the input file(s) looking for character sequences that match the defined patterns. When one is detected, the associated C code is executed.

Useful FLEX links
• http://flex.sourceforge.net/ home page
• http://flex.sourceforge.net/manual/ manual
• http://www.quut.com/c/ANSI-C-grammar-l-1998.html ANSI C grammar (LEX specification)

FLEX: description and operation
• Dual significance
  – Language
  – Compiler (scanner generator)

    file.l -> FLEX compiler -> lex.yy.c -> C compiler -> lexer

• Finally

    source file -> lexer -> tokens

FLEX: structure of the generated program
• FLEX produces a C program without main() whose entry point is the function int yylex().
• This function reads from file yyin and copies the unrecognized text to file yyout.
• Unless an action specifies otherwise (by means of a return statement), the function ends only when the whole input file has been analyzed.
• After each action, the automaton returns to the start state to recognize new tokens.
• By default, files yyin and yyout are initialized to stdin and stdout respectively.
• The user (programmer) can change this setting by re-initializing these global variables.

Regular expressions in FLEX (1/4)
• For the specification of the scanner, LEX uses regular expressions, a formalism more efficient but less powerful than context-free grammars (CFGs).
• The difference between CFGs and regular expressions lies in the fact that regular expressions cannot recognize recursive syntactic structures, while CFGs can.
• A syntactic structure like balanced arithmetic expressions, requiring the same number of open and closed parentheses, cannot be recognized by a scanner. That is why a parser must be used.
• In contrast, numeric constants, identifiers and keywords are recognized by a scanner.
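The limit on balanced parentheses stated above can be made concrete with a small sketch. It is not part of the original slides, and the function name parens_balanced is hypothetical. The point it illustrates is that checking balance needs a counter that can grow with the nesting depth, which is exactly the kind of memory a finite automaton (and therefore a regular expression) does not have.

```c
/* Returns 1 if the parentheses in s are balanced, 0 otherwise.
   Hypothetical helper for illustration; characters other than
   '(' and ')' are ignored. */
int parens_balanced(const char *s)
{
    /* The counter plays the role of the stack a parser would use.
       A finite automaton has a fixed number of states and cannot
       count to an arbitrary depth. */
    long depth = 0;

    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0)
                return 0;   /* closing parenthesis with no open one */
            depth--;
        }
    }
    return depth == 0;      /* every open parenthesis was closed */
}
```

For instance, parens_balanced("(a+(b*c))") yields 1, while parens_balanced("(a+b") yields 0. A scanner generated by LEX would only hand the individual '(' and ')' tokens to the parser, which does the counting through its grammar rules.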
Regular expressions in FLEX (2/4)
• Regular expressions describe ASCII character sequences and use a set of operators:
    "\[]^-?.*+|()$/{}%<>
• Letters and digits in the text stand for themselves:
  – the regular expression val1 represents the sequence 'v' 'a' 'l' '1' in the input text.
• Non-alphanumeric characters are represented in LEX by enclosing them in double quotes, in order to avoid ambiguity with operators:
  – the expression xyz"++" represents the sequence 'x' 'y' 'z' '+' '+' in the input text.
• Non-alphanumeric characters can also be written with a preceding \ symbol:
  – the expression xyz\+\+ represents the sequence 'x' 'y' 'z' '+' '+' in the input text.

Regular expressions in FLEX (3/4)
• Character classes are described through the operators []:
  – the expression [0123456789] represents a digit in the input text.
• In character class descriptions, the - symbol denotes a character range:
  – the expression [0-9] represents a digit in the input text.
• To include the character - in a character class, it must be specified as the first or the last one:
  – the expression [-+0-9] represents a digit or a sign in the input text.
• In character class descriptions, the ^ symbol at the beginning denotes a set of characters to exclude:
  – the expression [^0-9] represents any character except a digit in the input text.
• The set of all characters except the newline is denoted with .
• The newline character is denoted with \n
• The tab character is denoted with \t

Regular expressions in FLEX (4/4)
• The operator ? denotes that the preceding expression is optional:
  – ab?c represents both ac and abc.
• The operator * denotes that the preceding expression can repeat 0 or more times:
  – ab*c represents all the sequences starting with a, ending with c and having any number of b characters in between.
• The operator + denotes that the preceding expression can repeat 1 or more times:
  – ab+c represents all sequences starting with a, ending with c and having at least one b in between.
• The operator | denotes an alternative between two expressions:
  – ab|cd represents the sequence ab or the sequence cd.
• Parentheses ( ) make it possible to express precedence among operators:
  – (ab|cd+)?ef represents sequences such as ef, abef, cdddef.

FLEX: main supported regular expressions

    x         character 'x'
    .         any character but '\n'
    [xyz]     a character class; in this case, 'x', 'y' or 'z'
    [a-z]     a class with a range; any character between 'a' and 'z'
    [^A-Z]    a negated class: any character NOT in the class
    r*        zero or more r, with r a regular expression
    r+        one or more r
    r?        zero or one r
    r{2,5}    between two and five r
    r{2,}     two or more r
    r{4}      exactly four r
    {name}    the expansion of the definition of name
    (r)       r, parenthesized for grouping
    rs        concatenation: r followed by s
    r|s       alternative: r or s
    r/s       restriction: r but only if followed by s
    ^r        r but only at the beginning of a line
    r$        r but only at the end of a line

FLEX input file format (1/5)
A LEX/FLEX input file consists of three distinct sections, separated by the %% symbol.

    Definitions    %{
                   #include directives, constant definitions, scanner macros
                   %}
                   basic definitions
    %%
    Rules          token definitions and actions
    %%
    User code      support procedures, C user code
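The operator examples from the regular-expression slides above (ab?c, ab*c, ab+c, (ab|cd+)?ef) can be tried outside of FLEX. The sketch below is not from the original slides. It assumes a POSIX system providing <regex.h>, and it uses POSIX extended regular expressions, which share the ?, *, +, | and ( ) operators with LEX but are a different dialect (for example, they have no {name} expansion or "..." quoting). The helper name matches_whole is hypothetical.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole string text matches pattern, 0 if it does not,
   and -1 if the pattern is too long or cannot be compiled.
   Hypothetical helper; POSIX extended syntax, not FLEX syntax. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int rc;

    /* Wrap the pattern as ^(pattern)$ so that it must cover the
       entire string, not just a substring of it. */
    if (snprintf(anchored, sizeof anchored, "^(%s)$", pattern)
            >= (int)sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, matches_whole("(ab|cd+)?ef", "cdddef") gives 1 and matches_whole("ab+c", "ac") gives 0, in line with the descriptions of + and ? given on the slides.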
FLEX input file format (2/5)
• Section 1 (which may be empty) contains:
  – definitions of custom constants and/or macros and library #include directives of the user program, enclosed within %{ and %}; this text is copied literally into the output C program; when the scanner is used in combination with a YACC- or Bison-generated parser, this section should contain the directive #include "y.tab.h", the header file of the generated parser, which contains the definitions of multi-character tokens for parsing purposes;
  – the basic definitions used in the next section to describe regular expressions.
• Section 2 contains the token definitions with the related actions to be executed, in the form
    pattern action
  – Actions must start on the same line where the pattern regular expression ends and are separated from it by blanks (spaces or tabs).
• Section 3 (which may be empty) contains the support routines the developer intends to use in the actions defined in the previous section; if this section is empty, the second %% delimiter can be omitted.

FLEX input file format (3/5)
• In the first section, every line starting with a non-whitespace character is a definition:
    number [+-]?[0-9]+
• Expressions defined that way can be used in section 2 by enclosing their name within braces:
    {number} printf("number found\n");
• Code fragments can be inserted both in the first section (within %{ %}) and in the second section (within { } right after any regular expression to be recognized), and they are copied entirely into the output file.
• If no action is specified next to a pattern, a token of the corresponding type is discarded when it is recognized during lexical analysis.
• Lines in the third section (for support routines) are also copied into the lex.yy.c output file generated by LEX.
FLEX input file format (4/5)
• Example 1 of input file content

    %{
    #include <stdio.h>
    %}
    %%
    [0-9]+                  printf("INTEGER NUMBER\n");
    [a-zA-Z][a-zA-Z0-9]*    printf("IDENTIFIER\n");
    %%

• This file describes 2 patterns, i.e. 2 types of tokens: [0-9]+ with the associated action of printing INTEGER NUMBER, and [a-zA-Z][a-zA-Z0-9]* with the associated action of printing IDENTIFIER.
• Notice the presence, in section 1, of the directive #include <stdio.h>, needed to enable the use of printf.
• This simple example presumes LEX/FLEX is used independently of YACC/Bison.

FLEX input file format (5/5)
• Example 2 of input file content

    %{
    #include <stdio.h>
    int num_lines = 0, num_chars = 0;
    %}
    %%
    \n    { num_lines = num_lines + 1; num_chars = num_chars + 1; }
    .     num_chars = num_chars + 1;
    %%
    int main(void)
    {
        yylex();
        printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
        return 0;
    }

• This scanner counts the number of characters and of lines in the input file and prints the values of these counters.
• Notice that the first section declares two global variables, accessible to the yylex() function as well as to main(), which is defined after the second %%.

Lexical ambiguities
• There are two types of lexical ambiguities:
  1. A prefix of a character sequence recognized by a regular expression is also matched by another regular expression.
     – In this case, the scanner executes the action associated with the regular expression which has recognized the longest string (longest match or maximal munch rule).
  2. The same character sequence is matched by two different regular expressions.
     – In this case, the scanner executes the action associated with the regular expression declared first in the LEX/FLEX input file.
• Example: consider the file

    %%
    for       {return FOR_CMD;}
    format    {return FORMAT_CMD;}
    [a-z]+    {return GENERIC_ID;}

  and the input string "format": the yylex function returns FORMAT_CMD, preferring the second rule to the first one because it matches a longer string, and to the third one because it is defined earlier in the input file.

Solving lexical ambiguities
• Given the LEX/FLEX ambiguity resolution strategies, the rules for keywords must be defined before the ones for identifiers.
• The longest-match principle requires caution:
    '.*' {return QUOTED_STRING;}
  looks for the closing quote as far ahead as possible: hence, with the following input
    'first' quoted string here, 'second' here
  the scanner will take 36 characters instead of 7.
• A better rule is therefore:
    '[^'\n]+' {return QUOTED_STRING;}

Actions associated with regular expressions in LEX
• In LEX, each regular expression is associated with an action, executed upon recognition.
• Actions are expressed in C code: if the code fragment includes more than one statement or spans more than one line, it must be enclosed in curly braces.
• The simplest action is ignoring the recognized text: an empty action is expressed with the ; character.
• Recognized text is stored in the yytext variable, defined as a char pointer. Working on this variable, more complex actions can be specified.
• The number of recognized characters is stored in the yyleng variable, defined as an integer.
• A default action exists for text not recognized by any regular expression: the unrecognized text is copied to the output, character by character.

Examples of LEX input (1/3)
1. Prefix each input line with its ordinal number.

    %{
    #include <stdio.h>
    int l = 1;
    %}
    line    .*\n
    %%
    {line}  {printf("%d %s", l++, yytext);}
    %%
    int main(void) { yylex(); return 0; }

Slide callouts: regular expression, action, lexeme, compilable scanner code.

Examples of LEX input (2/3)
2. Replace numbers in decimal notation with their hexadecimal notation and print the number of actual substitutions.

    %{
    #include <stdio.h>
    #include <stdlib.h>
    int count = 0;
    %}
    digit   [0-9]
    num     {digit}+
    %%
    {num}   { int n = atoi(yytext);
              printf("%x", n);
              if (n > 9) count++; }
    %%
    int main(void)
    {
        yylex();
        fprintf(stderr, "Substitution count = %d\n", count);
        return 0;
    }

Note: by default, text that is not part of any token is echoed to the output (ECHO).

Examples of LEX input (3/3)
3. Delete the lines that neither start nor end with character a.

    %{
    #include <stdio.h>
    %}
    a_line    a.*\n
    line_a    .*a\n
    %%
    {a_line}  ECHO;
    {line_a}  ECHO;
    .*\n      ;   /* empty action */
    %%
    int main(void) { yylex(); return 0; }

Notice: this rule set is ambiguous, since a string can match several regular expressions (e.g. the line a). Built-in priority guidelines:
  1. Maximal munch principle.
  2. If several rules match the string, select the earliest specified one.
Rule order therefore matters:
  – With the catch-all rule listed first, it wins every equal-length tie and the output is empty:
        .*\n      ;
        {a_line}  ECHO;
        {line_a}  ECHO;
  – Without the catch-all rule, unmatched text falls through to the default action and output = input:
        {a_line}  ECHO;
        {line_a}  ECHO;

FLEX: combined usage with BISON (1/3)
• The output of LEX/FLEX is a lex.yy.c file: a C program without main(), containing the yylex() scanning routine with other auxiliary routines and macros.
• By default, yylex() is declared as:

    int yylex()
    {
        ... various definitions and the actions in here ...
    }

• When one combines lexical and syntax analysis, the lex.yy.c file (produced by the scanner generator) is typically included (by means of #include) in the source code generated by YACC. Many declarations, such as tokens and data structures to communicate with the parser, are declared in the source generated by YACC, y.tab.c.

FLEX: combined usage with BISON (2/3)
• Example: BASIC compiler

    bas.l      lexical rules
    bas.y      syntax rules with token definitions
    y.tab.h    token definitions for LEX
    cc         command to create the compiler
    bas.exe    compiler

    yacc -d bas.y                     # create y.tab.h, y.tab.c
    lex bas.l                         # create lex.yy.c
    cc lex.yy.c y.tab.c -o bas.exe    # compile/link

FLEX: combined usage with BISON (3/3)
[Figure only; its content is not recoverable from the transcript.]

FLEX: output (1/3)
• When the generated scanner is executed, it scans the input, recognizing character strings that match the specified patterns.
• As already noted, if more than one possible match is found, the longest one is taken. In the case of two equal-length matches, the one corresponding to the rule appearing earlier in the FLEX input file is chosen.
• Once the match is found, the corresponding text (which represents a token) is made globally available through the char pointer (char *yytext), and its length is globally reported in the integer (int yyleng).
• Then the scanner executes the action which corresponds to the matched pattern and proceeds to scan the remaining text.
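The two disambiguation rules recalled above (longest match first, then earliest rule) can be sketched by hand for the three-rule example for / format / [a-z]+ from the lexical ambiguities slide. This is a simplified illustration and not how FLEX works internally: the generated scanner uses a single automaton rather than trying each rule in turn. The function next_token, the helper names and the numeric token codes are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; in a real project they would come from y.tab.h. */
enum { NO_MATCH = 0, FOR_CMD, FORMAT_CMD, GENERIC_ID };

/* Length of the prefix of s matched by the literal keyword kw, or 0. */
static size_t match_literal(const char *s, const char *kw)
{
    size_t n = strlen(kw);
    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length of the prefix of s matched by [a-z]+, or 0. */
static size_t match_lower_run(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Returns the token code for the start of s and stores the lexeme
   length in *len (the role played by yyleng). */
int next_token(const char *s, size_t *len)
{
    static const int code[3] = { FOR_CMD, FORMAT_CMD, GENERIC_ID };
    size_t cand[3];
    int best = -1;
    int i;

    /* Candidate match lengths, listed in rule order. */
    cand[0] = match_literal(s, "for");
    cand[1] = match_literal(s, "format");
    cand[2] = match_lower_run(s);

    *len = 0;
    for (i = 0; i < 3; i++) {
        /* Strict '>' means a later rule wins only with a longer match,
           so equal-length ties go to the earliest rule. */
        if (cand[i] > *len) {
            *len = cand[i];
            best = i;
        }
    }
    return best < 0 ? NO_MATCH : code[best];
}
```

On the input "format" all three rules match a prefix (3, 6 and 6 characters); the first is discarded as shorter and the tie between the other two goes to the earlier rule, giving FORMAT_CMD. On "formats" only [a-z]+ covers all 7 characters, so the result is GENERIC_ID.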
FLEX: output (2/3)
• Once a token is recognized, various possibilities exist: it can be ignored (as is usually done for whitespace) and scanning moves on to the next token, or the code of the recognized token is returned. When a token is returned, the yylex() function ends, but it will be called again by the parser when it needs another token.
• Some supplementary action can be required, besides returning or ignoring a token. For example, when a newline is found, the input line counter is increased.
• Even more important is the fact that for some tokens something more must be known beyond their type. For example, it is not enough to know a variable has been found: we must know which variable it is.

FLEX: output (3/3)
• FLEX calls the yywrap() function when it reaches the end of its input. The scanner exposes the global variable char *yytext, storing the characters of the current token, and the global variable int yyleng, storing the length of that string.
• If yywrap() returns 0 (false), this means that yyin has been set to another input file, so that scanning can continue on that file. If it returns a non-zero value (true), the scanner terminates and returns 0 to the caller.
• Flow of yylex():
  1. Initialize file yyin (default: stdin).
  2. Symbol processing: scan the input file and, for each match, execute the action (possibly a return).
  3. End of scanning: when EOF is found, yywrap() is called. If there are more files, yyin is readdressed and scanning continues; otherwise yylex() returns.

Example exercises
• Write the LEX specification for a scanner which, given a C program as input, produces an identical but comment-free program as output.
• Change the above specification so that the scanner recognizes #include directives and, when it does, reports an error and ends the analysis.
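Returning to the yywrap() protocol described under FLEX: output (3/3), the following is a minimal sketch of a user-supplied yywrap() that moves on to the next input file. It is not from the original slides. The list pending_files and the setter set_pending_files are hypothetical names, and yyin is defined locally only so that the fragment compiles on its own; in a real scanner it is defined in lex.yy.c and would be declared here with extern.

```c
#include <stdio.h>

/* Assumption for a standalone sketch: in a real scanner this would be
   "extern FILE *yyin;", with the definition living in lex.yy.c. */
FILE *yyin;

/* Hypothetical NULL-terminated list of file names still to be scanned. */
static const char *const *pending_files;

void set_pending_files(const char *const *files)
{
    pending_files = files;
}

/* Called by the scanner at end of input.
   Returns 0 if yyin now points to another file to scan,
   non-zero if there is nothing left and scanning must stop. */
int yywrap(void)
{
    while (pending_files != NULL && *pending_files != NULL) {
        FILE *next = fopen(*pending_files++, "r");
        if (next != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);      /* done with the previous file */
            yyin = next;
            return 0;              /* keep scanning on the new file */
        }
        /* Files that cannot be opened are skipped silently here;
           a real tool would probably report them. */
    }
    return 1;                      /* no more input */
}
```

A main() using this sketch would open the first file into yyin, pass the remaining names to set_pending_files() and then call yylex() once; the scanner would then run through all files as if they were one input.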
JLEX: LEX’s Java version
• LEX generates scanners in the C language. To generate scanners in Java, the JLEX scanner generator can be used.
• This tool, entirely written in Java, produces as output Java classes implementing methods to perform lexical analysis of an input string.
• The main produced class is Yylex, containing the yylex() method, which gets and analyzes the next input token.
• Another method of the class is yytext(), which returns the text recognized by yylex().
• JLEX also requires an input specification file, containing all details about the lexical analysis to be performed.
• As said, JLEX is often used together with CUP (Constructor of Useful Parsers), a Java alternative to YACC/Bison.