Preface#
The steps for building the lexer (tokenizer) are not elaborated here; you can use the tokenizer in the repo or write your own, as long as the tokens it returns can be converted into the tokens the parser expects (instances of the `Symbol` class).
This project is implemented based on the Dragon Book[1]. A significant portion of the details in the Parser can be found in the corresponding sections of the Dragon Book, which can serve as examples to aid in understanding the book.
This project uses the Python language. Why? Python's flexibility lets you focus on the program logic rather than syntactic minutiae, which makes it well suited to trying out ideas, and type annotations can be used for static checking. If performance later becomes a requirement, a C++ version can be built on top of this one.
Finally, thanks to the course instructor for presenting such a challenging experimental problem; otherwise, I wouldn't have thought of doing something like this, nor would I have immersed myself in it for about a week.
Route#
Preparation#
The development environment is Windows 11 + JetBrains PyCharm + Python 3.9.
- The newer Python features used include:
  - Since Python 3.6, the built-in type `dict` preserves insertion order, which is used to print the item-set family clearly.
  - `from __future__ import annotations`, available since Python 3.7, is used for type annotations; it relaxes the ordering of class definitions, so classes defined later in a file can already be referenced in earlier annotations.
- Programming habits:
  - Try to keep constants in a single file such as constants.py, to avoid magic numbers and magic strings.
  - Use type annotations extensively to clarify variable types and enable static error checking.
  - Define a `debug_mode` switch in classes to print debugging information.
  - Perform unit testing.
    - See the unit tests in tester.py.
- How to debug from the external main program main.py? (A minimal sketch follows this list.)
  - Implement a `create_cfg` method in the external main program to define and modify the context-free grammar.
  - Prepare the test program text, such as test.c.
    - For an example, the CFG grammar `c_generators` pairs with test.c.
  - Define a Tokenizer instance to read all tokens.
  - Define a Parser instance, enable debug mode, and initialize its structures.
  - You can use a Tester instance to encapsulate the testing process.
  - Observe and analyze the output, adjusting the implementation iteratively.
    - For convenience, you can redirect the output to a file.
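For orientation, here is what such a main.py debugging flow might look like. Everything below is a hypothetical sketch: the constructor and method names (`get_all_tokens`, `initialize`, `run`, the `debug_mode` keyword) are assumptions, not the repo's actual interface.

```python
# Hypothetical debugging flow for main.py; method names and signatures are
# assumptions, not the repo's actual interface.
from contextlib import redirect_stdout

from tokenizer import Tokenizer   # assumed module layout
from parser import Parser
from tester import Tester


def create_cfg():
    """Define and return the context-free grammar under test (placeholder)."""
    raise NotImplementedError


def main() -> None:
    cfg = create_cfg()
    tokens = Tokenizer("test.c").get_all_tokens()   # read all tokens up front
    p = Parser(cfg, debug_mode=True)                # enable debug printing
    p.initialize()                                  # e.g. build item sets / tables

    # Redirect the (potentially large) debug output to a file for inspection.
    with open("debug.log", "w", encoding="utf-8") as f, redirect_stdout(f):
        Tester(p, tokens).run()                     # wrap the test procedure


if __name__ == "__main__":
    main()
```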
Modeling Symbols, Productions, Items, Item Sets, etc.#
Modeling of symbols, productions, items, item sets, etc., is located in the grammar.py file.
- Grammar Symbols -> `Symbol` class
  - Member variables
    - `sType`: type of the symbol.
      - Divided into terminal and non-terminal symbols.
    - `name`: type of the token.
      - Divided into numeric constants, identifiers, etc.
      - For example, the `name` corresponding to an identifier is "constants".
    - `content`: string representation of the token.
      - For example, in `int a;`, `a` is of identifier type, with `name` being "constants" and `content` being "a".
  - Member functions
    - `__str__`: converts to a string representation for easy printing.
    - `__hash__` and `__eq__`: make instances usable with set types.
    - `fromStr`: converts a `str` into a `Symbol`; the terminal-symbol set must be supplied externally.
  - How to represent the empty symbol?
    - Define a constant in constants.py, using ε to represent it.
    - It can be defined as either a `Symbol` or a `str`.
- Productions -> `Generator` class
  - Member variables
    - `start`: the non-terminal symbol on the left side of the production.
    - `end`: the string of grammar symbols on the right side of the production.
  - Member functions
    - `__str__`: converts to a string representation for easy printing and debugging.
  - What is `ComposedGenerator` for? (See the first sketch at the end of this section.)
    - When defining a CFG grammar by hand, writing strings is much easier than constructing `Symbol` instances, so `ComposedGenerator` provides a layer of conversion.
    - In addition, the same left side may correspond to multiple right sides, such as A -> a | b; this is reflected in the member `end` of `ComposedGenerator` being annotated as `list[list[str]]`, hence "Composed".
      - Example: A -> a | b
        - Defined as a `ComposedGenerator`: {start: 'A', end: [['a'], ['b']]}
        - Converted to `Generator`, it yields two instances: {start: 'A', end: ['a']} and {start: 'A', end: ['b']}
- LR(1) Item -> `Item` class
  - Member variables
    - `start`: the non-terminal symbol on the left side of the production.
    - `end`: the string of grammar symbols on the right side of the production.
    - `point`: the dot within the production body, marking how much has been read.
    - `lf_symbols`: lookahead symbols; LR(1) looks at only one.
  - Member functions
    - `__hash__` and `__eq__`: distinguish instances and make them usable with set types.
    - `__str__`: converts to a string representation for easy printing.
      - The case of an empty production is handled specially.
- LR(1) Item Set -> `ItemSet` class (see the second sketch at the end of this section)
  - Member variables
    - `_items`: all items.
  - Member functions
    - `add`: adds an item to the item set, automatically ignoring duplicates.
    - `union`: merges with another `ItemSet`.
    - `toList`: converts to a list.
      - Useful later for indexing LR-automaton states.
    - `__hash__` and `__eq__`: distinguish instances and make them usable with set types.
    - `__iter__`: returns an iterator over the internal set for `for` traversal.
    - `__len__`: returns the size.
      - Requires inheriting from `Sized`.
  - Why model item sets? Can't we just use `set(Item)`?
    - Good question; I thought the same at first.
    - The problem mainly shows up when computing the canonical LR(1) item-set family.
      - You need to detect when no new item set is added; the first instinct is to wrap `set(Item)` in another layer, `set(set(Item))`.
      - However, `set(Item)` has no default definition of partial order and equivalence; it is a composite type.
        - For "composite type", see Chapter 12 of Elements of Programming[2].
      - Therefore, partial order and equivalence must be defined manually for item sets.
    - Debugging and other requirements also make modeling item sets necessary.
- Context-Free Grammar (CFG) -> `GrammarCFG` class
  - `start_mark`: literal of the start symbol.
  - `end_mark`: the end-of-input terminal symbol.
    - Generally the bottom-of-stack marker '$'.
  - `end_symbols`: the set of terminal symbols.
  - `generators`: the set of productions.
    - From these, the set of non-terminal symbols can be derived.
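To make the modeling concrete, here is a minimal sketch of the production-related classes. The field names follow the text above; the dataclass style, the `to_generators` helper, and the equality semantics are my assumptions and differ from the repo (which, as noted under "Existing Issues", compares symbols by their role in the production rather than by content).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Grammar symbol (sketch). Hashing/equality over all fields is a
    simplification; the repo compares only the symbol's role in a production."""
    sType: str    # terminal or non-terminal
    name: str     # token type, e.g. the name used for identifiers
    content: str  # concrete text, e.g. "a" in "int a;"


@dataclass
class Generator:
    """A single production: start -> end."""
    start: str
    end: list[str]


@dataclass
class ComposedGenerator:
    """One left side with several right sides, e.g. A -> a | b."""
    start: str
    end: list[list[str]]

    def to_generators(self) -> list[Generator]:
        """Expand into plain Generator instances (name of this helper is assumed)."""
        return [Generator(self.start, list(rhs)) for rhs in self.end]


if __name__ == "__main__":
    cg = ComposedGenerator(start="A", end=[["a"], ["b"]])
    print(cg.to_generators())
    # [Generator(start='A', end=['a']), Generator(start='A', end=['b'])]
```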
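And here is the point about `set(set(Item))` made concrete: Python's built-in `set` is unhashable, so before item sets can themselves be put into a set, a wrapper needs its own `__eq__`/`__hash__` (or you use `frozenset`). A minimal sketch, not the repo's `ItemSet`:

```python
from __future__ import annotations

from collections.abc import Iterator, Sized


class ItemSet(Sized):
    """Hashable wrapper around a set of items (sketch, not the repo's code)."""

    def __init__(self) -> None:
        self._items: set = set()

    def add(self, item) -> None:
        self._items.add(item)           # duplicates are ignored automatically

    def union(self, other: ItemSet) -> None:
        self._items |= other._items

    def toList(self) -> list:
        return list(self._items)        # handy for indexing automaton states

    def __iter__(self) -> Iterator:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ItemSet) and self._items == other._items

    def __hash__(self) -> int:
        # Hashing the current contents; this becomes stale if items are added
        # after the ItemSet is already inside a set/dict (one of the known
        # issues listed at the end of this write-up).
        return hash(frozenset(self._items))
```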
Implementing FIRST and FOLLOW Sets#
Starting from this step, the Parser is implemented, with corresponding reference implementations in parser.py, and detailed comments in the code refer to the relevant pages in the Dragon Book.
- Relevant functions in the reference code:
  - `get_first`
  - `get_chain_first` (computes FIRST for a string of grammar symbols)
  - `get_follow`
  - `_get_follow_once`
  - `_get_follow_all`
- Relevant member variables in the reference code:
  - `self._map_follow`
  - `self._map_first`
- Why start with computing FIRST and FOLLOW sets?
  - Computing FIRST and FOLLOW sets is the foundation for the subsequent algorithms.
  - Moreover, the grammar is already modeled; you just need to follow the procedure in the Dragon Book, which involves only grammar symbols and productions.
  - Starting with simple operations and iterating with continuous debugging helps you gradually get into the flow of the implementation.
- How should it be computed?
  - There are no pitfalls in computing FIRST sets; the function-call structure is a tree.
    - For example, with A -> B and B -> b:
      - To compute first(A), recursively compute first(B).
      - first(B) is {b}.
  - Computing FOLLOW sets, however, leads to circular function calls.
    - For example, with A -> aB and B -> bA | b:
      - By the third rule for computing FOLLOW sets,
        - computing follow(A) requires everything in follow(B) (because of B -> bA), and
        - computing follow(B) requires everything in follow(A) (because of A -> aB),
      - so the two calls depend on each other.
    - The solution in the reference implementation is quite straightforward (a sketch follows this list):
      - Pre-compute the FOLLOW sets of all non-terminal symbols and store them in the member variable `self._map_follow`.
      - Keep rescanning the FOLLOW sets of all non-terminal symbols until no new terminal symbol is added.
        - For example, union the FOLLOW sets of two non-terminal symbols and check whether the `len` changes.
    - You can try to find a better implementation.
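A minimal sketch of the fixed-point idea. The grammar is represented here as plain strings and tuples rather than the repo's `Symbol`/`Generator` classes, and the function names are illustrative, not the repo's `_get_follow_all`; terminals not present in the `first` map are taken to be their own FIRST set.

```python
from __future__ import annotations

EPSILON = "ε"


def first_of_string(symbols: list[str],
                    first: dict[str, set[str]],
                    non_terminals: set[str]) -> set[str]:
    """FIRST of a string of grammar symbols (the role of get_chain_first)."""
    result: set[str] = set()
    for sym in symbols:
        sym_first = first[sym] if sym in non_terminals else {sym}
        result |= sym_first - {EPSILON}
        if EPSILON not in sym_first:
            return result
    result.add(EPSILON)  # every symbol in the string can derive ε (or it is empty)
    return result


def compute_follow(productions: list[tuple[str, list[str]]],
                   non_terminals: set[str],
                   first: dict[str, set[str]],
                   start: str,
                   end_mark: str = "$") -> dict[str, set[str]]:
    """Fixed-point FOLLOW computation: rescan until no set grows any further."""
    follow: dict[str, set[str]] = {nt: set() for nt in non_terminals}
    follow[start].add(end_mark)

    changed = True
    while changed:
        changed = False
        for head, body in productions:
            for i, sym in enumerate(body):
                if sym not in non_terminals:
                    continue
                rest_first = first_of_string(body[i + 1:], first, non_terminals)
                before = len(follow[sym])
                # Rule 2: FIRST of what follows sym (minus ε) goes into FOLLOW(sym).
                follow[sym] |= rest_first - {EPSILON}
                # Rule 3: if the rest can vanish, FOLLOW(head) flows into FOLLOW(sym).
                if EPSILON in rest_first:
                    follow[sym] |= follow[head]
                changed |= len(follow[sym]) != before
    return follow
```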
Implementing CLOSURE and GOTO#
With the tools for computing FIRST and FOLLOW sets, the next step can focus on the algorithms related to LR(1) items. The corresponding reference implementation is in parser.py, with detailed comments in the code referring to the relevant pages in the Dragon Book.
- Relevant functions in the reference code:
  - `get_closure`
  - `get_goto`
- Parts not mentioned in the book?
  - Following the dependency order, implement closure first, then goto.
  - Closure sets (a sketch follows this list):
    - Pay attention to deduplication.
    - During one traversal, store the newly added items separately, and only traverse those new items in the next round.
    - Add the final batch of new items only after the traversal finishes, to avoid invalidating the iterator.
    - My implementation in the reference code is quite poor; I now realize it only happened to work by chance, possibly due to being intoxicated at the time, so I recommend not using it as a reference.
  - The goto set has no pitfalls.
    - For each item whose next symbol is X, move the dot past X and add the resulting item.
    - Finally, return the closure of the result.
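A sketch of the "only traverse the newly added items" idea for CLOSURE. Items are plain tuples here instead of the repo's `Item` class, and `first_of_string` is a one-argument helper (for instance, the one from the previous sketch with the grammar baked in):

```python
from __future__ import annotations

from collections.abc import Callable

# An LR(1) item as (head, body, dot, lookahead); the repo uses an Item class.
LR1Item = tuple[str, tuple[str, ...], int, str]


def closure(items: set[LR1Item],
            productions: dict[str, list[tuple[str, ...]]],
            non_terminals: set[str],
            first_of_string: Callable[[list[str]], set[str]]) -> set[LR1Item]:
    result = set(items)
    frontier = set(items)                      # items added in the previous round
    while frontier:
        new_items: set[LR1Item] = set()        # collect additions separately so the
        for head, body, dot, la in frontier:   # set is never mutated mid-iteration
            if dot >= len(body) or body[dot] not in non_terminals:
                continue
            nxt = body[dot]
            lookaheads = first_of_string(list(body[dot + 1:]) + [la])
            for rhs in productions.get(nxt, []):
                for b in lookaheads:
                    candidate = (nxt, tuple(rhs), 0, b)
                    if candidate not in result:    # deduplicate against all seen items
                        new_items.add(candidate)
        result |= new_items
        frontier = new_items                   # next round only looks at the new items
    return result
```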
Implementing LR(1) Item Sets#
Starting from the closure of the first production of the augmented grammar, with goto established, LR(1) item sets can be computed. Continuously compute goto until no new item sets are added. The corresponding reference implementation is in parser.py, with detailed comments in the code referring to the relevant pages in the Dragon Book.
- You can use a method similar to the one for computing FOLLOW sets to determine that "no new item sets are added" (a sketch follows this list).
- Relevant functions in the reference code:
  - `get_lr1_items`
- After successfully completing this step, print the goto computation and view each item set as a state: you will find that you have effectively built the LR(1) item-set automaton, and at that point you are already halfway to success.
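A sketch of the outer loop, with frozensets standing in for the repo's `ItemSet` and with `closure`/`goto` passed in as functions:

```python
from __future__ import annotations


def lr1_items(start_item, grammar_symbols, closure, goto):
    """Canonical LR(1) collection: keep applying GOTO until no new item set appears."""
    initial = frozenset(closure({start_item}))
    states = {initial}                     # all item sets found so far
    frontier = [initial]                   # item sets not yet expanded
    transitions = {}                       # (item set, symbol) -> item set

    while frontier:                        # stops once nothing new is added
        state = frontier.pop()
        for sym in grammar_symbols:
            target = frozenset(goto(state, sym))
            if not target:
                continue
            transitions[(state, sym)] = target
            if target not in states:
                states.add(target)
                frontier.append(target)
    return states, transitions
```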
Implementing the LR(1) Parsing Table#
- Relevant functions in the reference code:
  - `get_lr`
- Requirements:
  - Model an action with which to fill the parsing table (a sketch follows this list).
    - Its members must include:
      - the type of action to execute, and
      - the information needed to execute that action, for example:
        - which production to reduce by, or
        - which state to move to next.
  - Model an LR(1) parsing table to be filled in cell by cell.
    - This corresponds to `LRSheet`.
    - One dimension is the state index; the other is the symbol.
  - The canonical LR(1) item-set family serves as the set of states.
- You can fill in actions rule by rule, as in the reference code, or combine the rules and fill them in together.
- What if a cell gets filled twice during the process?
  - Either there is an error in the implementation,
  - or the grammar is not an LR(1) grammar.
- Other:
  - In the reference code, reduction by empty productions is implemented by trying to place an empty symbol on top of the stack after an error occurs.
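A sketch of one way to model the action and the table. The enum values, field names, and the conflict check are illustrative; the repo's `LRSheet` may be organized differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum, auto


class ActionType(Enum):
    SHIFT = auto()     # consume the lookahead and move to another state
    REDUCE = auto()    # reduce by a production
    GOTO = auto()      # state transition on a non-terminal
    ACCEPT = auto()


@dataclass(frozen=True)
class Action:
    aType: ActionType
    target_state: int | None = None    # for SHIFT / GOTO: which state to move to
    production: object | None = None   # for REDUCE: which production to reduce by


class LRSheet:
    """LR(1) parsing table: one dimension is the state index, the other the symbol."""

    def __init__(self) -> None:
        self._cells: dict[tuple[int, str], Action] = {}

    def fill(self, state: int, symbol: str, action: Action) -> None:
        key = (state, symbol)
        if key in self._cells and self._cells[key] != action:
            # A cell filled twice with different actions means either a bug in
            # the construction or a grammar that is not LR(1).
            raise ValueError(f"conflict at state {state}, symbol {symbol!r}")
        self._cells[key] = action

    def get(self, state: int, symbol: str) -> Action | None:
        return self._cells.get((state, symbol))
```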
Implementing the LR(1) Configuration#
- Model the configuration.
  - Assume all input is known.
    - This does not strictly follow the definition on page 158[1].
  - Besides the stack contents and the remaining input, it also keeps the stack of state indices, for easy debugging and for observing the bottom-up processing.
  - In the reference code, the execution of one action step is also placed in the configuration, for easy printing of the state (a sketch follows this list).
- At this point, it is basically complete.
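Continuing with the `Action`/`LRSheet` shapes from the previous sketch, a configuration with a single-step `forward` might look roughly like this (the real class also handles empty productions and error recovery):

```python
from __future__ import annotations

from dataclasses import dataclass, field

# ActionType and LRSheet are the classes from the parsing-table sketch above.


@dataclass
class Configuration:
    """All input is assumed known; a parallel stack of state indices aids debugging."""
    state_stack: list[int] = field(default_factory=lambda: [0])
    symbol_stack: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)  # unread input, ending with '$'

    def forward(self, sheet: LRSheet) -> bool:
        """Execute one action; return False once the input has been accepted."""
        lookahead = self.remaining[0]
        action = sheet.get(self.state_stack[-1], lookahead)
        if action is None:
            raise SyntaxError(f"no action for {lookahead!r}")
        if action.aType is ActionType.SHIFT:
            self.symbol_stack.append(self.remaining.pop(0))
            self.state_stack.append(action.target_state)
        elif action.aType is ActionType.REDUCE:
            head, body = action.production          # e.g. ("A", ["a", "B"])
            for _ in body:                          # pop |body| symbols and states
                self.symbol_stack.pop()
                self.state_stack.pop()
            self.symbol_stack.append(head)
            goto = sheet.get(self.state_stack[-1], head)
            self.state_stack.append(goto.target_state)
        elif action.aType is ActionType.ACCEPT:
            return False
        return True
```

The main loop then just calls `forward` until it returns False, printing the configuration after each step when debug mode is on.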
Adding Simple SDD Syntax-Directed Analysis#
The SDD here is limited to S-attributed SDD, defined in a recursive-subroutine style inside the create_cfg() body and triggered only when the action is a reduction. You can try modifying it to trigger in the middle of a production body, or on shift actions.
- You can refer to the following steps to add SDD:
  - Extend the modeling (e.g., `Symbol`) to carry various semantic attributes.
  - Implement simple type declarations.
  - Implement simple expression evaluation.
  - Implement simple type-checking rules.
  - Implement control-flow SDD.
- Note:
  - Python copies objects shallowly, so a deep copy is needed before executing semantic actions in `forward`, to prevent overwriting the SDD attributes of other `Symbol` instances.
- How to extend?
  - Agree on a format for the recursive subroutines, for example:
    - The parameter list is `(input_symbols: list[grammar.Symbol], reduction_symbol: grammar.Symbol, st: parser.SymbolTable)`.
    - Define the names of the attributes in constants.py.
      - This is where unified constant definitions pay off.
      - There is no need to repeatedly type literals, reducing the chance of typos and unexpected bugs.
  - Bind a function as a recursive subroutine to each defined production.
    - Also add this parameter to the `Item` class.
    - When creating new `Item` instances in the various methods of the Parser class, pass the bound function along.
  - Execute the semantic action during the reduction action in `forward` (a sketch follows this list).
  - For specifics, see the Git diff in the reference code.
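As an illustration of the agreed format, a semantic action for a hypothetical addition production could be bound like this. Only the parameter list comes from the write-up; the attribute dictionary, the `ATTR_VALUE` name, and the deep-copy snippet are assumptions about the surrounding code.

```python
from __future__ import annotations

# Attribute names would normally live in constants.py.
ATTR_VALUE = "value"


def reduce_addition(input_symbols, reduction_symbol, st) -> None:
    """S-attributed action for a production like E -> E + T.

    Parameter list follows the convention agreed above:
    (input_symbols: list[grammar.Symbol], reduction_symbol: grammar.Symbol,
     st: parser.SymbolTable)
    """
    left, _plus, right = input_symbols
    # Synthesize the head's attribute from the attributes of the body symbols.
    reduction_symbol.attributes[ATTR_VALUE] = (
        left.attributes[ATTR_VALUE] + right.attributes[ATTR_VALUE]
    )


# During the reduce step in forward(), the parser would deep-copy the popped
# symbols before calling the bound action, so that later reductions do not
# overwrite the attributes of other Symbol instances:
#
#     body = [copy.deepcopy(s) for s in popped_symbols]
#     item.bound_action(body, head_symbol, symbol_table)
```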
Existing Issues#
- Initially, the modeling did not fully consider the equivalence semantics of `Symbol`, etc.
  - For example, `a` and `b` are both `identifier` in productions, so comparing them yields equal.
  - In the end, only the symbol as it appears in the production is compared, not the string content.
- Why have both `sType` and `name`? Wouldn't it be better to merge them, since both represent token types?
  - The initial design separated the two mainly because of interface-integration issues.
    - The token types in the lexer tokenizer part were not divided into terminal and non-terminal symbols but were subdivided into five categories.
    - When defining the CFG grammar externally, they then had to be re-divided into terminal and non-terminal symbols; this was indeed handled poorly and should have been done properly in the tokenizer.
- Some functions within the Parser are public.
  - This is for the convenience of unit testing, but it is not tidy.
- `if-else` handling is not done well; the dangling else is special-cased.
  - It can actually be resolved in the grammar definition; see page 134 of the Dragon Book.
- No true panic mode is implemented; it simply discards tokens until the required token appears.
- Efficiency was not a consideration in the implementation.
  - For example, the checks for "no new items / item sets added" during the computation of FOLLOW sets and the canonical LR(1) item-set family are very crude.
- The implementation of SDD and three-address code is not standardized.
- High coupling between constants.py and the external `create_cfg`.
- `ItemSet` does not fully consider how its hash changes after modification.
- The deduplication in the closure computation is implemented poorly.
  - For example, `cur_items` is used for deduplication instead of `f_list`.
  - `f_list` mixes lookahead symbols and the right side of productions, which is a potential source of bugs.
References#
- [1] Compilers: Principles, Techniques, and Tools, 2nd edition, Alfred V. Aho et al.; Chinese translation by Zhao Jianhua et al. Beijing: China Machine Press, 2009.
- [2] Elements of Programming, Alexander Stepanov and Paul McJones; Chinese translation by Qiu Zongyan. Beijing: China Machine Press, 2011.