Select this to skip to main content [CSUSB] >> [CNS] >> [Comp Sci Dept] >> [R J Botting] >> [CSci620] >> 06 [Source]
[Index] [Schedule] [Syllabi] [Text] [Labs] [Projects] [Resources] [Search] [Grading]
Mon Apr 26 16:06:08 PDT 2004


    CSCI620 Session 6


      [ 05.html ]


      Study Chapter 4 pages 65..87 lexical and syntactic translators Prepare some notes and at least one question you need to ask on it.


      1. lexemes::=plural of lexeme.
      2. lexeme::= a lexical unit, a sequence of characters that can be treated as a single item in recognizing, parsing, and interpreting a language.

      3. tokens::=plural of token
      4. token::=an object that represents a lexeme

      5. lexer::=a component of a translator that inputs characters, recognizes lexemes and produces a token on demand. Lexers can also enter tokens in a table of symbols.

      6. recognizer::=a program, automata, or component that returns true if the input fits a known pattern.

      7. parser::=a program, automata, or component that returns a data structure representing the input fits a known pattern.

      8. interpreter::=a program, automata, or component that executes a program in a given language.

      9. compiler::=a program, automata, or component that translates a program in a given high level language into a low level language.

      10. lookahead::=the number of extra items that a lexer or parser has to read to be able to choose the correct action next. For example a lexer for C++ has to lookahead at least one character to treat "++" correctly in strings like "i+++++j".

        Pseudo Code on page 66

        This is a recognizer. It merely reports on the correctness of the input vs the definition of
      11. S::= constant "+" S | constant.

        Here is a rough translation of the pseudo-pascal on page 66 into C++:

         class Token; //defined later
         Token accept();//global function gets next Token.
         Token lex;
         bool S()
            if (lex.get_type() == CONSTANT)
            	if( lex.get_value() == "+" )
            		return S(); //recursive. don't panic.
            	} else
            		return lex.endOfFile();
            } else
            return false;

        Here is a rough description of

         class Token {
            	TokenType t;
            	string value;
            	boolean eof;
            	TokenType get_type(){return t;}
            	string get_value(){return v;}
            	bool endOfFile(){return eof;}

         typedef enum {CONSTANT, VARIABLE, OPERATOR, EOF} TokenType;

      12. pop::=an operation that takes the top item from a stack.
      13. push::=an operation that puts an item on top of a stack.
      14. stack::=a standard data structure that stores items: Last in, First Out.

        Pages 76 to 81

        Syntax improvement is part of our compilers classes. I do not expect you to recall how to do this in the final. I'm much more interested in making sure you know how a good syntax is parsed.

        Pages 82 to 87

        The book starts with incomplete Pascal and grows it statement by statement. Do not expect the begins and ends to match until finished.

        The syntax of Pascal is defined on [ pascal.syntax.html ] with the lexemes in [ pascal.lexicon.html ] and I also have collected a lot of working examples [ ] , enjoy!

      . . . . . . . . . ( end of section Notes) <<Contents | Index>>


        What is a Just-In-Time (JIT) compiler

        A just-in-time compiler interprets a program and translates it piece by piece. It never translates a piece of the program unless it is about to be executed. As a result it avoids compiling code that is not needed on a give execution of the program. This can save time.

        The term came to fame when Sun released a JIT compiler for Java. The idea JIT was around in the 1990's as a way to do business where you aimed to not order and store raw materials ready to meet demand. Instead when the demand for a good arrived you ordered the raw materials and produced it "Just in time" to sell it.

        Define procedure and function

        In computer science (and in particular Algol, Pascal, and Ada) sub-programs are divided into two types:
      1. subprogram::=procedure | function.
      2. procedure::=a subprogram that does not return a value.
      3. function::=a subprogram that does return a value.

        This is different to the C/C++ terminology where all subprograms are functions, and functions that don't return anything are called void functions.

        Where are the assemblers?

        The theory of programming languages has been focused on high level languages for 3 or 4 decades. The theory of assemblers is not very interesting. In practice, assemblers have been used less and less. Assemblers are studied in other topics: System Programming, Machine Architectures.

        Does a lexer and parser use the same grammar?

        No. The original BNF for the language is split into two levels to prepare for writing a translator. The grammar for the lexer describes lexemes in terms of characters. It is typically a regular grammar and is often expressed as a set of regular expressions. So, in a way, the lexemes are at the top of the lexical analysis grammar.

        The grammar for the parser defines the language in terms of the lexemes. So the lexemes are at the bottom of this grammar. The start symbol, typically stands for program.

        On page 68 the function S has one push and two pops -- can it work

        Yes. Before the two pop() functions, there is
         			if( S() ){ Op2=pop(SAS); Op1=pop(SAS); ...}

        How does a quad table work -- pages 69-70

        A quad table is no more than an array of quads, and each quad is a struc/class with 4 items. The table becomes a list of instructions that can be executed or further translated into a specific machine code.

        Could Quads (page 69) be mapped into Byte Code

        Yes. I don't think the book's set of quads would map into the Java Virtual machine's byte code, but the idea is very similar. All we need is to allocate a certain number of bytes to each item in a quad, and to allocate numbers to represent the operators. The operands would also need to become numeric addresses in a virtual machine rather than symbolic variables.

        For example:

         class Quad{ public: char operator; int operand1, operand2, result};
         typedef vector<Quad> QuadTable(3000);
        We can use a list instead of a vector.

        On Page 73 where does the I come from and where does it go?

        The I in the production
         		S::=...... | I
        is short for identifier and fits with the set listed on the line but one above it. It becomes V again in the grammar G describing the lexemes for the language.

        On Page 75 why must we make 3 improvements before we write a compiler?

        The recursive descent, single lookahead compiler is simple to develop compared to the others. As a rule it is better to write a simple prgram than a complex one. Hence we may choose to change the grammar to get a simpler compiler/translator/interpreter.

        However recursive descent will fail to execute correctly if the grammar is not deliberately adjusted first. For example a definition like

         		train::=train carriage| engine carriage.
        would have code like this:
         		bool train()
         		{	if( train() )
         			  return carriage();
         			{ if (engine())
          		   return carriage();
         			  else return false;
        The recursion on the second line is reentered until the program runs out of stack space.

        Similarly, a common prefix is needed to distinguish to alternatives.

        Similarly with working out the selection sets.

        on page 76 I don't understand how the recursion is eliminated

        We don't eliminate all recursion! If we could we would have a Type 3 or regular language. No interesting languages are regular! In the example
         		E::=E P | P
        we replace it by
         		E::=P E'
         		E' ::= P E' | empty

        What we do is: replace left recursion by right recursion.

        We do this because left recursion does not lead to an infinite loop in the parser!

        Explain "deterministic ... one symbol lookahead parser" on page 77

        A deterministic program can evaluate all conditions before it makes a choice between alternatives. The one symbol lookahead means that the program can make a choice as to the next action by looking at the next symbol, and nothing else on the input.

        Programs that only look one symbol ahead to make choices are easier to write because there can be no backtracking.

        In the examples shown the production

         			T P' | ( E ) P' | () P'
        has two alternatives with a common first symbol("("). This means that looking at the "(" doesn't tell the program whether to parse "( E ) P'" or "() P').

        Similarly in the definition of T on page 77.

        Here is an extreme and silly example of a system that needs more than a single lookahead:

          As the railway climbs into the mountains the trains have to be switched between two tracks. One track is steep and short. Only trains with two engines can go this route. The other track goes the long way round and through some passes.... but only trains with a single engine can go by that route.

          You have to write a program that switches the trains correctly.

          Now the two difficulties. (1) the trains emerge from a long tunnel and the switch is close to the tunnel. You must make a choice the moment the front of the train is visible. (2) The second train is at the back of the train:

        1. train::= one_engined_train | two_engined_train.
        2. one_engined_train::= engine carriages caboose.
        3. two_engined_train::= engine carriages engine caboose.
        4. carriages::= empty | carriage carriages.

          Basically you need to negotiate a change of specification before you try to program this problem....

          This problem is courtesy Michael A Jackson's books, methods, and training.

        (End of Net)

        On page 78 is the T'' an error?

        Yes. I think it should be T'.

        Will we have to improve any syntax in the final

        No. However You need to recall that improvements can be needed and what they are likely to be.

        Explain the initial construction of follow sets

        Firstly: YAGNI (it won't be on the final or in a lab). It is done properly in the compilers electives in our degrees. Second, I've found that in most real languages it can be done automatically or intuitively. In this class I don't expect you to have mastered this computation.

        The idea is figuring out how to handle productions that have an empty in them. In essence you list the things that might be on the other side of the empty.

        On page 78 and 79 what is the dollar sign for?

        The dollar sign is the symbol use in compiler theory and UNIX practice to indicate the end of a string or file.

      . . . . . . . . . ( end of section Questions) <<Contents | Index>>



      . . . . . . . . . ( end of section Exercises) <<Contents | Index>>


      Research on lexical and syntactic analysis, [ lab06.html ] TBA


      Study Chapter 4 pages 87.. semantic interpretation. [ 07.html ]

    . . . . . . . . . ( end of section CSCI620 Session 6) <<Contents | Index>>


  1. BNF::="Backus-Naur Form", for syntax and grammar, developed by Backus and Naur.
  2. EBNF::="Extended " BNF.
  3. HTML::= "HyperText Markup Language", used on the WWW.
  4. HTML_page::syntax= "<HTML>" head body.
  5. Java::="An " OO " Language from Sun".
  6. LISP::= "LISt Processing Language".
  7. LRM::="Language Reference Manual".
  8. OO::="Object-Oriented".
  9. Prolog::="Programming in Logic".
  10. TBA::="To Be Announced".
  11. UML::="Unified Modeling Language".
  12. URL::=Universal_Resource_Locator,
  13. Universal_Resource_Locator::syntax= protocol ":" location, where
    1. protocol::= "http" | "ftp" | "mailto" | ... ,
    2. location::= O( "//" host) O(pathname).

    (End of Net)
  14. WWW::= See, index to web site for this class.
  15. XBNF::="eXtreme" BNF, developed by the teacher from EBNF, designed to ASCII input of syntax, semantics, and other formal specifications.

Formulae and Definitions in Alphabetical Order