Preface#
The steps for building the lexer (tokenizer) will not be elaborated here; you can refer to the tokenizer in the repo or write one of your own, as long as the tokens it returns can interface with the tokens in the parser (which are instances of the `Symbol` class).
This project is implemented based on the Dragon Book[1]. A significant portion of the details in the Parser can be found in the corresponding sections of the Dragon Book, so the code can serve as a set of worked examples to aid in understanding the book.
This project uses Python. Why? Python's flexibility lets you focus on the program logic rather than the minutiae of syntax, which makes it well suited to putting ideas into practice, and its type annotations enable static checking. If performance becomes a requirement later, a C++ version can be built on top of this one.
Finally, thanks to the course instructor for setting such a challenging lab assignment; otherwise I would never have thought of attempting something like this, nor immersed myself in it for about a week.
Route#
Preparation#
The development environment is Windows 11 + JetBrains PyCharm + Python 3.9.
- The newer Python features used include:
  - Since Python 3.6, the built-in `dict` type preserves insertion order, which is used to print item set families in a clear order.
  - `from __future__ import annotations` (supported since Python 3.7) is used for type annotations; it relaxes the ordering of class definitions, allowing a class defined later in the file to appear in earlier annotations.
- Programming habits:
  - Place constants in a unified file such as `constants.py` to avoid magic numbers and magic strings.
  - Use type annotations extensively to clarify variable types and facilitate static error checking.
  - Define a `debug_mode` switch in classes to print debugging information.
  - Perform unit testing; refer to the unit tests in `tester.py`.
- How to debug from the external main program `main.py`?
  - Implement a `create_cfg` method in the external main program to define and modify the context-free grammar.
  - Prepare the test program text, such as `test.c`.
    - For an example CFG grammar (the one used for `test.c`), refer to `c_generators`.
  - Define a Tokenizer instance to read all tokens.
  - Define a Parser instance, enable debug mode, and initialize the structures.
  - You can use a Tester instance to encapsulate the testing process.
  - Observe and analyze the output, adjusting the implementation iteratively.
    - For convenience, you can redirect the output to a file, as in the snippet below.
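A generic Python pattern for the redirection mentioned above; the file name and the `parser.run(tokens)` call are illustrative, not the repo's API:

```python
# Redirect anything printed by the parser's debug_mode into a log file.
from contextlib import redirect_stdout

with open("parse.log", "w", encoding="utf-8") as log, redirect_stdout(log):
    print("... parser debug output goes here ...")  # e.g. parser.run(tokens)
```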
Modeling Symbols, Productions, Items, Item Sets, etc.#
The modeling of symbols, productions, items, item sets, etc. is located in the `grammar.py` file.
- Grammar symbols -> the `Symbol` class
  - Member variables
    - `sType`: the symbol's type, divided into terminal and non-terminal symbols.
    - `name`: the token's type, divided into numeric constants, identifiers, etc.
      - For example, the `name` corresponding to an identifier is `constants`.
    - `content`: the string content of the token.
      - For example, in `int a;`, `a` is of identifier type, with `name` "constants" and `content` "a".
  - Member functions
    - `__str__`: converts to a string representation for easy printing.
    - `__hash__` and `__eq__`: facilitate use with set types.
    - `fromStr`: converts a `str` into a `Symbol`; the terminal symbol set must be supplied externally.
  - How to represent the empty symbol?
    - Define a constant in `constants.py`, using `ε` to represent it.
    - It can be defined as either a `Symbol` or a `str`.
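A minimal sketch of what such a `Symbol` class could look like, following the member list above; the `TERMINAL`/`NONTERMINAL` constants and the exact field handling are illustrative assumptions, not the repo's code:

```python
from __future__ import annotations

TERMINAL, NONTERMINAL = "terminal", "nonterminal"
EPSILON = "ε"  # the empty-symbol constant, as suggested for constants.py

class Symbol:
    def __init__(self, sType: str, name: str, content: str = "") -> None:
        self.sType = sType      # terminal or non-terminal
        self.name = name        # the token's type
        self.content = content  # the token's concrete text, e.g. "a"

    def __str__(self) -> str:
        return self.name

    # Equality and hashing ignore `content`, so two identifiers with
    # different spellings compare equal as grammar symbols (see the
    # "Existing Issues" section below).
    def __eq__(self, other: object) -> bool:
        return (isinstance(other, Symbol)
                and (self.sType, self.name) == (other.sType, other.name))

    def __hash__(self) -> int:
        return hash((self.sType, self.name))

    @staticmethod
    def fromStr(s: str, terminals: set[str]) -> Symbol:
        # the terminal symbol set must be supplied from outside
        return Symbol(TERMINAL if s in terminals else NONTERMINAL, s)
```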
- Productions -> the `Generator` class
  - Member variables
    - `start`: non-terminal symbol on the left side of the production.
    - `end`: string of grammar symbols on the right side of the production.
  - Member functions
    - `__str__`: converts to a string representation for easy printing and debugging.
  - What is `ComposedGenerator` for?
    - When defining a CFG grammar by hand, writing strings is much easier than constructing `Symbol` instances, so `ComposedGenerator` provides a layer of conversion.
    - Additionally, the same left side may correspond to multiple right sides, such as `A -> a | b`; this is reflected in the `end` member of `ComposedGenerator` being annotated as `list[list[str]]`, hence "Composed".
    - Example: `A -> a | b`
      - Defined as a `ComposedGenerator`: `{start: 'A', end: [['a'], ['b']]}`.
      - Converting to `Generator` yields two instances: `{start: 'A', end: ['a']}` and `{start: 'A', end: ['b']}`.
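A sketch of the conversion described above, using the same `A -> a | b` example; the dataclass layout is an assumption, and the repo's classes may differ:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Generator:
    start: str            # non-terminal on the left side
    end: tuple[str, ...]  # one right-hand-side alternative

@dataclass
class ComposedGenerator:
    start: str
    end: list[list[str]]  # several alternatives, e.g. A -> a | b

    def expand(self) -> list[Generator]:
        # one plain Generator per right-hand-side alternative
        return [Generator(self.start, tuple(alt)) for alt in self.end]

cg = ComposedGenerator(start="A", end=[["a"], ["b"]])
for g in cg.expand():
    print(g)  # Generator(start='A', end=('a',)) / Generator(start='A', end=('b',))
```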
- LR(1) items -> the `Item` class
  - Member variables
    - `start`: non-terminal symbol on the left side of the production.
    - `end`: string of grammar symbols on the right side of the production.
    - `point`: the dot within the production body, indicating how far input has been read.
    - `lf_symbols`: lookahead symbols; LR(1) looks at only one.
  - Member functions
    - `__hash__` and `__eq__`: distinguish instances, facilitating use with set types.
    - `__str__`: converts to a string representation for easy printing.
      - The case where the production body is empty needs special handling.
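A minimal sketch of such an item, including the special case for the empty production; a frozen dataclass supplies `__hash__` and `__eq__`, and the exact representation is an assumption:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True generates __hash__ and __eq__
class Item:
    start: str            # non-terminal on the left side
    end: tuple[str, ...]  # grammar-symbol string on the right side
    point: int            # dot position, 0 <= point <= len(end)
    lf_symbols: str       # a single lookahead symbol for LR(1)

    def __str__(self) -> str:
        if not self.end:  # special handling for the empty production
            body = ["·", "ε"]
        else:
            body = list(self.end[:self.point]) + ["·"] + list(self.end[self.point:])
        return f"[{self.start} -> {' '.join(body)}, {self.lf_symbols}]"

print(Item("A", ("a", "B"), 1, "$"))  # [A -> a · B, $]
```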
- LR(1) item sets -> the `ItemSet` class
  - Member variables
    - `_items`: all items.
  - Member functions
    - `add`: adds an item to the item set, automatically ignoring duplicates.
    - `union`: merges with another `ItemSet`.
    - `toList`: converts to a list.
      - Facilitates subsequent indexing of the LR automaton's states.
    - `__hash__` and `__eq__`: distinguish instances, facilitating use with set types.
    - `__iter__`: returns an internal iterator for `for` traversal.
    - `__len__`: returns the size.
      - Requires inheriting `Sized`.
  - Why model item sets? Can't we just use `set(Item)`?
    - Good question; I thought the same at first.
    - The problem mainly arises when computing the LR(1) canonical collection of item sets.
      - We need to determine when no new item sets are being added; the first instinct is to wrap `set(Item)` in another layer, i.e. `set(set(Item))`.
      - However, `set(Item)` is a composite type without the equivalence and hashing this nesting needs: Python's built-in `set` is mutable and therefore unhashable, so it cannot itself be an element of a set.
        - On composite types, see Chapter 12 of Elements of Programming[2].
      - Therefore equivalence (and hashing) must be defined manually for item sets, as in the sketch below.
    - Additionally, debugging and other requirements make modeling item sets necessary anyway.
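The nesting problem and the resulting class are easy to demonstrate; a sketch along the lines of the member list above, not the repo's `ItemSet`:

```python
# Built-in sets cannot be nested:
items = {("A", "a")}              # stand-in for a set of Item instances
try:
    {items}                       # set(set(...))
except TypeError as err:
    print(err)                    # unhashable type: 'set'

class ItemSet:
    def __init__(self) -> None:
        self._items = set()

    def add(self, item) -> None:
        self._items.add(item)     # duplicates are ignored automatically

    def union(self, other: "ItemSet") -> None:
        self._items |= other._items

    def toList(self) -> list:
        return list(self._items)  # enables indexing of automaton states

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ItemSet) and self._items == other._items

    def __hash__(self) -> int:
        # content-based hash; note the caveat from "Existing Issues": the
        # hash changes if the set is mutated after being stored somewhere
        return hash(frozenset(self._items))

    def __iter__(self):
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)
```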
- Context-free grammar (CFG) -> the `GrammarCFG` class
  - `start_mark`: literal of the start symbol.
  - `end_mark`: end-of-input terminal symbol.
    - Generally the bottom-of-stack marker `$`.
  - `end_symbols`: set of terminal symbols.
  - `generators`: set of productions.
    - The set of non-terminal symbols can be derived from it.
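A sketch of the container per the fields above; the dataclass layout is an assumption:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class GrammarCFG:
    start_mark: str                                     # literal of the start symbol
    end_mark: str = "$"                                 # bottom-of-stack terminal
    end_symbols: set[str] = field(default_factory=set)  # terminal symbols
    generators: list = field(default_factory=list)      # productions (Generator)

    def non_terminals(self) -> set[str]:
        # derivable from the productions' left-hand sides
        return {g.start for g in self.generators}
```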
Implementing FIRST and FOLLOW Sets#
Starting from this step, the Parser itself is implemented. The corresponding reference implementation is in `parser.py`; the detailed comments in the code point to the relevant pages of the Dragon Book.
- Relevant functions in the reference code:
  - `get_first`
  - `get_chain_first` (computes FIRST for a string of grammar symbols)
  - `get_follow`
  - `_get_follow_once`
  - `_get_follow_all`
- Relevant member variables in the reference code:
  - `self._map_first`
  - `self._map_follow`
- Why start with computing FIRST and FOLLOW sets?
  - Computing FIRST and FOLLOW sets is the foundation of the subsequent algorithms.
  - Moreover, the grammar is already modeled; you only need to follow the procedure in the Dragon Book, which involves nothing beyond grammar symbols and productions.
  - Starting with simple operations and iterating with continuous debugging eases you into a programming rhythm.
- How should they be computed?
  - Computing FIRST sets has no pitfalls; the function-call process is a tree.
    - For example, given `A -> B` and `B -> b`:
      - To compute `first(A)`, call `first(B)`.
      - Computing `first(B)` yields `{b}`.
  - Computing FOLLOW sets, however, involves circular function calls.
    - For example, given `A -> aB` and `B -> bA | b`:
      - By the third rule for computing FOLLOW sets:
        - Computing `follow(A)` finds that everything in `follow(A)` belongs to `follow(B)`.
        - Computing `follow(B)` finds that everything in `follow(B)` belongs to `follow(A)`.
    - The solution in the reference implementation is quite blunt:
      - Pre-allocate the FOLLOW sets of all non-terminal symbols and store them in the member variable `self._map_follow`.
      - Repeatedly scan all non-terminal symbols' FOLLOW sets until no new terminal symbol is added.
        - For example, union the FOLLOW sets of the two non-terminals involved and check whether `len` changes.
      - A self-contained sketch of this fixed-point computation follows this list.
    - You can try to find a better implementation.
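A self-contained sketch of the fixed-point approach, using the circular `A -> aB`, `B -> bA | b` example above; the grammar encoding (a dict from left side to tuples of right-hand-side symbols) is an assumption for illustration, not the repo's API:

```python
EPS, END = "ε", "$"

def compute_first(prods: dict, terminals: set) -> dict:
    first = {t: {t} for t in terminals}
    for lhs in prods:
        first.setdefault(lhs, set())
    changed = True
    while changed:
        changed = False
        for lhs, alts in prods.items():
            for alt in alts:
                old = len(first[lhs])
                if not alt:                      # empty production
                    first[lhs].add(EPS)
                for sym in alt:
                    first[lhs] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break
                else:
                    if alt:                      # every symbol nullable
                        first[lhs].add(EPS)
                changed |= len(first[lhs]) != old
    return first

def compute_follow(prods: dict, first: dict, start: str) -> dict:
    follow = {lhs: set() for lhs in prods}
    follow[start].add(END)                       # rule 1: $ follows the start
    changed = True
    while changed:                               # rescan until nothing grows
        changed = False
        for lhs, alts in prods.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in prods:         # terminals have no FOLLOW
                        continue
                    old = len(follow[sym])
                    nullable_rest = True
                    for nxt in alt[i + 1:]:      # rule 2: FIRST of what follows
                        follow[sym] |= first[nxt] - {EPS}
                        if EPS not in first[nxt]:
                            nullable_rest = False
                            break
                    if nullable_rest:            # rule 3: inherit FOLLOW(lhs)
                        follow[sym] |= follow[lhs]
                    changed |= len(follow[sym]) != old
    return follow

prods = {"A": [("a", "B")], "B": [("b", "A"), ("b",)]}
first = compute_first(prods, terminals={"a", "b"})
print(compute_follow(prods, first, start="A"))   # {'A': {'$'}, 'B': {'$'}}
```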
Implementing CLOSURE and GOTO#
With the tools for computing FIRST and FOLLOW sets in place, the next step can focus on the algorithms around LR(1) items. The corresponding reference implementation is in `parser.py`, with detailed comments in the code pointing to the relevant pages of the Dragon Book.
- Relevant functions in the reference code:
  - `get_closure`
  - `get_goto`
- Parts not spelled out in the book:
  - Following the dependency order, implement CLOSURE first, then GOTO.
  - Closure sets:
    - Pay attention to deduplication.
    - Store the items newly added during one pass separately, and traverse only those newly added items in the next pass.
    - Make sure the latest batch of new items is appended at the end, to avoid invalidating the iterator.
    - My implementation in the reference code is quite poor; in hindsight it only happened to work, probably written in a haze at the time, so I recommend not using it as a reference. A cleaner worklist pattern is sketched below.
  - The GOTO set has no pitfalls:
    - For the next symbol X, just move the dot past X and add the item.
    - Finally, return the closure of the result.
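A cleaner version of the deduplication pattern described above, as a generic worklist; `new_items_from(item)` stands in for the Dragon Book's expansion of `[A -> α·Bβ, a]` and is a hypothetical callback, not the repo's API:

```python
def closure(seed_items: set, new_items_from) -> set:
    result = set(seed_items)
    frontier = list(seed_items)          # only items added in the last round
    while frontier:
        next_frontier = []
        for item in frontier:
            for new in new_items_from(item):
                if new not in result:    # deduplicate before re-queueing
                    result.add(new)
                    next_frontier.append(new)
        frontier = next_frontier         # never mutate a list being iterated
    return result
```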
Implementing LR(1) Item Sets#
Starting from the closure of the first production of the augmented grammar, and with GOTO in place, the LR(1) item sets can be computed: keep applying GOTO until no new item set is added. The corresponding reference implementation is in `parser.py`, with detailed comments in the code pointing to the relevant pages of the Dragon Book.
- You can use a method similar to the FOLLOW-set computation to determine that "no new item sets are added"; see the sketch after this list.
- Relevant function in the reference code:
  - `get_lr1_items`
- After completing this step, print the process of computing GOTO and view each item set as a state: you will find that you have effectively built the LR(1) item-set automaton. At this point you are already halfway there.
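A sketch of that loop; `closure` and `goto` are the functions from the previous step, and the list index of each item set doubles as its state number (details assumed):

```python
def lr1_items(start_item, closure, goto, grammar_symbols) -> list:
    states = [closure({start_item})]     # state 0: closure of the start item
    seen = {frozenset(states[0])}
    i = 0
    while i < len(states):               # `states` may grow while scanning
        for x in grammar_symbols:
            nxt = goto(states[i], x)
            if nxt and frozenset(nxt) not in seen:
                seen.add(frozenset(nxt)) # a genuinely new item set
                states.append(nxt)
        i += 1
    return states                        # index = state number of the automaton
```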
Implementing the LR(1) Parsing Table#
- Relevant function in the reference code:
  - `get_lr`
- Requirements:
  - Model an action to fill into the parsing table; a sketch follows this list.
    - Its members must include:
      - The type of action to be executed.
      - The information necessary for executing the action, for example:
        - Which production to reduce by.
        - Which state to move to next.
  - Model an LR(1) parsing table to be filled in cell by cell.
    - Corresponds to `LRSheet`.
    - One dimension is the state index; the other is the grammar symbol.
  - The canonical LR(1) collection of item sets serves as the set of all states.
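A sketch of such an action model and table; the `LRSheet` name follows the text, while the concrete layout is an assumption:

```python
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ActionType(Enum):
    SHIFT = auto()
    REDUCE = auto()
    GOTO = auto()
    ACCEPT = auto()

@dataclass(frozen=True)
class Action:
    kind: ActionType
    target_state: Optional[int] = None   # SHIFT/GOTO: which state to move to
    production: Optional[object] = None  # REDUCE: which Generator to reduce by

class LRSheet:
    """One dimension is the state index, the other the grammar symbol."""
    def __init__(self) -> None:
        self._cells: dict[tuple[int, str], Action] = {}

    def fill(self, state: int, symbol: str, action: Action) -> None:
        prev = self._cells.get((state, symbol))
        if prev is not None and prev != action:
            # a cell filled twice: an implementation bug or a non-LR(1) grammar
            raise ValueError(f"conflict at ({state}, {symbol}): {prev} vs {action}")
        self._cells[(state, symbol)] = action

    def get(self, state: int, symbol: str) -> Optional[Action]:
        return self._cells.get((state, symbol))
```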
- You can fill in the actions one by one according to the rules, as in the reference code, or combine the rules and fill them in together.
- What if a cell is filled twice during the filling process? Either:
  - there is an error in the implementation, or
  - the grammar is not an LR(1) grammar.
- Other:
  - In the reference code, reduction by empty productions is implemented by trying to place the empty symbol on top of the stack after an error occurs.
Implementing the LR(1) Configuration#
- Model the configuration; a minimal sketch follows this list.
  - Assume all input is known in advance.
    - This does not strictly follow the definition on page 158[1].
  - Besides the symbols on the stack and the remaining input, also keep the stack of state indices, for easy debugging and observation of the bottom-up process.
  - In the reference code, the execution of one action step is also placed in the configuration, for easy state printing.
- At this point, the parser is basically complete.
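A minimal sketch of such a configuration, with the extra state-index stack mentioned above; the field names are assumptions:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Configuration:
    state_stack: list[int] = field(default_factory=lambda: [0])
    symbol_stack: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)  # all input known up front

    def __str__(self) -> str:
        # prints the states alongside the symbols to watch the bottom-up process
        return f"{self.state_stack} {self.symbol_stack} | {' '.join(self.remaining)}"

print(Configuration(remaining=["int", "a", ";", "$"]))  # [0] [] | int a ; $
```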
Adding Simple SDD Syntax-Directed Analysis#
The SDD here is limited to S-attributed SDD, defined in the style of recursive subroutines inside the body of `create_cfg()`, and triggered only when the action is a reduction by a production. You can try modifying it to trigger in the middle of a production body or on shift actions.
- You can follow these steps to add SDD:
  - Extend the modeling (e.g. `Symbol`) to carry various semantic attributes.
  - Implement simple type declarations.
  - Implement simple expression evaluation.
  - Implement simple type-checking rules.
  - Implement control-flow SDD.
- Note:
  - Python assignment shares object references rather than copying, so a deep copy is needed before executing semantic actions in `forward`, to prevent overwriting the SDD attributes of other `Symbol` instances.
- How to extend?
  - Agree on a format for the recursive subroutines; see the sketch after this list. For example:
    - The parameter list is `(input_symbols: list[grammar.Symbol], reduction_symbol: grammar.Symbol, st: parser.SymbolTable)`.
    - Define the attribute names in `constants.py`.
      - This shows the benefit of unified constant definitions: no need to retype literals, reducing the chance of typos and unexpected bugs.
  - Bind a function as the recursive subroutine of each defined production.
  - Also add this parameter to the `Item` class.
  - When creating new `Item` instances in the various methods of the Parser class, pass the bound function along.
  - Execute the semantic action during the reduce action in `forward`.
  - For specifics, see the Git diff in the reference code.
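A self-contained sketch of the subroutine format and the deep-copy caveat; `Sym`, the `VAL` attribute name, and the `E -> E + T` action are illustrative stand-ins, not the repo's code:

```python
import copy

VAL = "val"  # attribute name, as it would be defined in constants.py

class Sym:   # tiny stand-in for grammar.Symbol extended with attributes
    def __init__(self, name: str, **attrs) -> None:
        self.name, self.attrs = name, dict(attrs)

def on_reduce_add(input_symbols, reduction_symbol, st) -> None:
    """Recursive subroutine for E -> E + T (S-attributed: synthesized only)."""
    reduction_symbol.attrs[VAL] = (
        input_symbols[0].attrs[VAL] + input_symbols[2].attrs[VAL]
    )

# Deep-copy the popped symbols before running the action, so the subroutine
# cannot clobber attributes shared with other Symbol instances through
# Python's reference semantics.
E, plus, T = Sym("E", val=1), Sym("+"), Sym("T", val=2)
result = Sym("E")
on_reduce_add([copy.deepcopy(s) for s in (E, plus, T)], result, st=None)
print(result.attrs[VAL])  # 3
```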
Existing Issues#
- Initially, the modeling did not fully consider the equivalence of `Symbol`, etc.
  - For example, `a` and `b` are both of type `identifier` in productions, so comparing them yields equality.
  - Ultimately, only the representation in the production is compared, not the string content.
- Why keep both `sType` and `name`? Wouldn't it be better to merge them, since both represent a kind of token type?
  - The initial design separated the two mainly because of interface-integration issues.
  - The types in the lexer (tokenizer) were not divided into terminals and non-terminals, but subdivided into five categories.
  - When the CFG grammar was finally defined externally, they had to be divided again into terminals and non-terminals; this was indeed handled poorly and should have been settled in the tokenizer.
- Some functions within the Parser are public.
  - This is for the convenience of unit testing, but it is not tidy.
- `if-else` handling is not done well; the dangling else is special-cased.
  - It can actually be resolved in the grammar definition; see page 134 of the Dragon Book.
- No true panic mode is implemented; tokens are simply discarded until the required token appears.
- Implementation efficiency was not considered.
  - For example, the checks for "no new items or item sets added" in the FOLLOW-set and LR(1) canonical-collection computations are very simple and crude.
- The implementation of SDD and three-address code is not standardized.
- There is high coupling between `constants.py` and the external `create_cfg`.
- `ItemSet` does not fully account for hash computation after modification.
- The deduplication used in computing closures is poorly implemented.
  - For example, `cur_items` is used for deduplication instead of `f_list`.
  - `f_list` mixes lookahead symbols and the right-hand side of productions, posing a latent bug.
References#
- [1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. *Compilers: Principles, Techniques, and Tools*, 2nd ed. Chinese translation by Zhao Jianhua et al. Beijing: China Machine Press, 2009.
- [2] Alexander Stepanov and Paul McJones. *Elements of Programming*. Chinese translation by Qiu Zongyan. Beijing: China Machine Press, 2011.