Preface#
The steps for building the lexer (tokenizer) will not be elaborated here; you can refer to the tokenizer in the repo or write one of your own, as long as the tokens it returns can interface with the tokens in the parser (which are instances of the `Symbol` class).
This project is implemented based on the Dragon Book[1]. A significant portion of the details in the Parser can be found in the corresponding sections of the Dragon Book, so the code can serve as a set of worked examples to aid in understanding the book.
This project uses Python. Why? Python's flexibility lets you focus on the program logic rather than the minutiae of syntax, which makes it well suited to putting ideas into practice, and its type annotations enable static checking. If performance becomes a requirement later, a C++ version can be built on top of this one.
Finally, thanks to the course instructor for setting such a challenging lab assignment; otherwise I would never have thought of attempting something like this, nor immersed myself in it for about a week.
Route#
Preparation#
The development environment is Windows 11 + JetBrains PyCharm + Python 3.9.
- The newer Python features used include:
  - Since Python 3.6, the built-in `dict` type preserves insertion order, which is used to print item set families in a clear order.
  - `from __future__ import annotations` (supported since Python 3.7) is used for type annotations; it relaxes the ordering of class definitions, allowing a class defined later in the file to appear in earlier annotations.
- Programming habits:
  - Place constants in a unified file such as `constants.py` to avoid magic numbers and magic strings.
  - Use type annotations extensively to clarify variable types and facilitate static error checking.
  - Define a `debug_mode` switch in classes to print debugging information.
  - Perform unit testing; refer to the unit tests in `tester.py`.
- How to debug from the external main program `main.py`?
  - Implement a `create_cfg` method in the external main program to define and modify the context-free grammar.
  - Prepare the test program text, such as `test.c`.
    - For an example CFG grammar (the one used for `test.c`), refer to `c_generators`.
  - Define a Tokenizer instance to read all tokens.
  - Define a Parser instance, enable debug mode, and initialize the structures.
  - You can use a Tester instance to encapsulate the testing process.
  - Observe and analyze the output, adjusting the implementation iteratively.
    - For convenience, you can redirect the output to a file, as in the snippet below.
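A generic Python pattern for the redirection mentioned above; the file name and the `parser.run(tokens)` call are illustrative, not the repo's API:

```python
# Redirect anything printed by the parser's debug_mode into a log file.
from contextlib import redirect_stdout

with open("parse.log", "w", encoding="utf-8") as log, redirect_stdout(log):
    print("... parser debug output goes here ...")  # e.g. parser.run(tokens)
```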
Modeling Symbols, Productions, Items, Item Sets, etc.#
The modeling of symbols, productions, items, item sets, etc. is located in the `grammar.py` file.
- Grammar symbols -> the `Symbol` class
  - Member variables
    - `sType`: the symbol's type, divided into terminal and non-terminal symbols.
    - `name`: the token's type, divided into numeric constants, identifiers, etc.
      - For example, the `name` corresponding to an identifier is `constants`.
    - `content`: the string content of the token.
      - For example, in `int a;`, `a` is of identifier type, with `name` "constants" and `content` "a".
  - Member functions
    - `__str__`: converts to a string representation for easy printing.
    - `__hash__` and `__eq__`: facilitate use with set types.
    - `fromStr`: converts a `str` into a `Symbol`; the terminal symbol set must be supplied externally.
  - How to represent the empty symbol?
    - Define a constant in `constants.py`, using `ε` to represent it.
    - It can be defined as either a `Symbol` or a `str`.
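A minimal sketch of what such a `Symbol` class could look like, following the member list above; the `TERMINAL`/`NONTERMINAL` constants and the exact field handling are illustrative assumptions, not the repo's code:

```python
from __future__ import annotations

TERMINAL, NONTERMINAL = "terminal", "nonterminal"
EPSILON = "ε"  # the empty-symbol constant, as suggested for constants.py

class Symbol:
    def __init__(self, sType: str, name: str, content: str = "") -> None:
        self.sType = sType      # terminal or non-terminal
        self.name = name        # the token's type
        self.content = content  # the token's concrete text, e.g. "a"

    def __str__(self) -> str:
        return self.name

    # Equality and hashing ignore `content`, so two identifiers with
    # different spellings compare equal as grammar symbols (see the
    # "Existing Issues" section below).
    def __eq__(self, other: object) -> bool:
        return (isinstance(other, Symbol)
                and (self.sType, self.name) == (other.sType, other.name))

    def __hash__(self) -> int:
        return hash((self.sType, self.name))

    @staticmethod
    def fromStr(s: str, terminals: set[str]) -> Symbol:
        # the terminal symbol set must be supplied from outside
        return Symbol(TERMINAL if s in terminals else NONTERMINAL, s)
```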
- Productions -> the `Generator` class
  - Member variables
    - `start`: non-terminal symbol on the left side of the production.
    - `end`: string of grammar symbols on the right side of the production.
  - Member functions
    - `__str__`: converts to a string representation for easy printing and debugging.
  - What is `ComposedGenerator` for?
    - When defining a CFG grammar by hand, writing strings is much easier than constructing `Symbol` instances, so `ComposedGenerator` provides a layer of conversion.
    - Additionally, the same left side may correspond to multiple right sides, such as `A -> a | b`; this is reflected in the `end` member of `ComposedGenerator` being annotated as `list[list[str]]`, hence "Composed".
    - Example: `A -> a | b`
      - Defined as a `ComposedGenerator`: `{start: 'A', end: [['a'], ['b']]}`.
      - Converting to `Generator` yields two instances: `{start: 'A', end: ['a']}` and `{start: 'A', end: ['b']}`.
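A sketch of the conversion described above, using the same `A -> a | b` example; the dataclass layout is an assumption, and the repo's classes may differ:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Generator:
    start: str            # non-terminal on the left side
    end: tuple[str, ...]  # one right-hand-side alternative

@dataclass
class ComposedGenerator:
    start: str
    end: list[list[str]]  # several alternatives, e.g. A -> a | b

    def expand(self) -> list[Generator]:
        # one plain Generator per right-hand-side alternative
        return [Generator(self.start, tuple(alt)) for alt in self.end]

cg = ComposedGenerator(start="A", end=[["a"], ["b"]])
for g in cg.expand():
    print(g)  # Generator(start='A', end=('a',)) / Generator(start='A', end=('b',))
```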
- LR(1) items -> the `Item` class
  - Member variables
    - `start`: non-terminal symbol on the left side of the production.
    - `end`: string of grammar symbols on the right side of the production.
    - `point`: the dot within the production body, indicating how far input has been read.
    - `lf_symbols`: lookahead symbols; LR(1) looks at only one.
  - Member functions
    - `__hash__` and `__eq__`: distinguish instances, facilitating use with set types.
    - `__str__`: converts to a string representation for easy printing.
      - The case where the production body is empty needs special handling.
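A minimal sketch of such an item, including the special case for the empty production; a frozen dataclass supplies `__hash__` and `__eq__`, and the exact representation is an assumption:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True generates __hash__ and __eq__
class Item:
    start: str            # non-terminal on the left side
    end: tuple[str, ...]  # grammar-symbol string on the right side
    point: int            # dot position, 0 <= point <= len(end)
    lf_symbols: str       # a single lookahead symbol for LR(1)

    def __str__(self) -> str:
        if not self.end:  # special handling for the empty production
            body = ["·", "ε"]
        else:
            body = list(self.end[:self.point]) + ["·"] + list(self.end[self.point:])
        return f"[{self.start} -> {' '.join(body)}, {self.lf_symbols}]"

print(Item("A", ("a", "B"), 1, "$"))  # [A -> a · B, $]
```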
- LR(1) item sets -> the `ItemSet` class
  - Member variables
    - `_items`: all items.
  - Member functions
    - `add`: adds an item to the item set, automatically ignoring duplicates.
    - `union`: merges with another `ItemSet`.
    - `toList`: converts to a list.
      - Facilitates subsequent indexing of the LR automaton's states.
    - `__hash__` and `__eq__`: distinguish instances, facilitating use with set types.
    - `__iter__`: returns an internal iterator for `for` traversal.
    - `__len__`: returns the size.
      - Requires inheriting `Sized`.
  - Why model item sets? Can't we just use `set(Item)`?
    - Good question; I thought the same at first.
    - The problem mainly arises when computing the LR(1) canonical collection of item sets.
      - We need to determine when no new item sets are being added; the first instinct is to wrap `set(Item)` in another layer, i.e. `set(set(Item))`.
      - However, `set(Item)` is a composite type without the equivalence and hashing this nesting needs: Python's built-in `set` is mutable and therefore unhashable, so it cannot itself be an element of a set.
        - On composite types, see Chapter 12 of Elements of Programming[2].
      - Therefore equivalence (and hashing) must be defined manually for item sets, as in the sketch below.
    - Additionally, debugging and other requirements make modeling item sets necessary anyway.
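The nesting problem and the resulting class are easy to demonstrate; a sketch along the lines of the member list above, not the repo's `ItemSet`:

```python
# Built-in sets cannot be nested:
items = {("A", "a")}              # stand-in for a set of Item instances
try:
    {items}                       # set(set(...))
except TypeError as err:
    print(err)                    # unhashable type: 'set'

class ItemSet:
    def __init__(self) -> None:
        self._items = set()

    def add(self, item) -> None:
        self._items.add(item)     # duplicates are ignored automatically

    def union(self, other: "ItemSet") -> None:
        self._items |= other._items

    def toList(self) -> list:
        return list(self._items)  # enables indexing of automaton states

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ItemSet) and self._items == other._items

    def __hash__(self) -> int:
        # content-based hash; note the caveat from "Existing Issues": the
        # hash changes if the set is mutated after being stored somewhere
        return hash(frozenset(self._items))

    def __iter__(self):
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)
```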
- Context-free grammar (CFG) -> the `GrammarCFG` class
  - `start_mark`: literal of the start symbol.
  - `end_mark`: end-of-input terminal symbol.
    - Generally the bottom-of-stack marker `$`.
  - `end_symbols`: set of terminal symbols.
  - `generators`: set of productions.
    - The set of non-terminal symbols can be derived from it.
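A sketch of the container per the fields above; the dataclass layout is an assumption:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class GrammarCFG:
    start_mark: str                                     # literal of the start symbol
    end_mark: str = "$"                                 # bottom-of-stack terminal
    end_symbols: set[str] = field(default_factory=set)  # terminal symbols
    generators: list = field(default_factory=list)      # productions (Generator)

    def non_terminals(self) -> set[str]:
        # derivable from the productions' left-hand sides
        return {g.start for g in self.generators}
```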
Implementing FIRST and FOLLOW Sets#
Starting from this step, the Parser itself is implemented. The corresponding reference implementation is in `parser.py`; the detailed comments in the code point to the relevant pages of the Dragon Book.
- Relevant functions in the reference code:
  - `get_first`
  - `get_chain_first` (computes FIRST for a string of grammar symbols)
  - `get_follow`
  - `_get_follow_once`
  - `_get_follow_all`
- Relevant member variables in the reference code:
  - `self._map_first`
  - `self._map_follow`
- Why start with computing FIRST and FOLLOW sets?
  - Computing FIRST and FOLLOW sets is the foundation of the subsequent algorithms.
  - Moreover, the grammar is already modeled; you only need to follow the procedure in the Dragon Book, which involves nothing beyond grammar symbols and productions.
  - Starting with simple operations and iterating with continuous debugging eases you into a programming rhythm.
- How should they be computed?
  - Computing FIRST sets has no pitfalls; the function-call process is a tree.
    - For example, given `A -> B` and `B -> b`:
      - To compute `first(A)`, call `first(B)`.
      - Computing `first(B)` yields `{b}`.
  - Computing FOLLOW sets, however, involves circular function calls.
    - For example, given `A -> aB` and `B -> bA | b`:
      - By the third rule for computing FOLLOW sets:
        - Computing `follow(A)` finds that everything in `follow(A)` belongs to `follow(B)`.
        - Computing `follow(B)` finds that everything in `follow(B)` belongs to `follow(A)`.
    - The solution in the reference implementation is quite blunt:
      - Pre-allocate the FOLLOW sets of all non-terminal symbols and store them in the member variable `self._map_follow`.
      - Repeatedly scan all non-terminal symbols' FOLLOW sets until no new terminal symbol is added.
        - For example, union the FOLLOW sets of the two non-terminals involved and check whether `len` changes.
      - A self-contained sketch of this fixed-point computation follows this list.
    - You can try to find a better implementation.
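A self-contained sketch of the fixed-point approach, using the circular `A -> aB`, `B -> bA | b` example above; the grammar encoding (a dict from left side to tuples of right-hand-side symbols) is an assumption for illustration, not the repo's API:

```python
EPS, END = "ε", "$"

def compute_first(prods: dict, terminals: set) -> dict:
    first = {t: {t} for t in terminals}
    for lhs in prods:
        first.setdefault(lhs, set())
    changed = True
    while changed:
        changed = False
        for lhs, alts in prods.items():
            for alt in alts:
                old = len(first[lhs])
                if not alt:                      # empty production
                    first[lhs].add(EPS)
                for sym in alt:
                    first[lhs] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break
                else:
                    if alt:                      # every symbol nullable
                        first[lhs].add(EPS)
                changed |= len(first[lhs]) != old
    return first

def compute_follow(prods: dict, first: dict, start: str) -> dict:
    follow = {lhs: set() for lhs in prods}
    follow[start].add(END)                       # rule 1: $ follows the start
    changed = True
    while changed:                               # rescan until nothing grows
        changed = False
        for lhs, alts in prods.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in prods:         # terminals have no FOLLOW
                        continue
                    old = len(follow[sym])
                    nullable_rest = True
                    for nxt in alt[i + 1:]:      # rule 2: FIRST of what follows
                        follow[sym] |= first[nxt] - {EPS}
                        if EPS not in first[nxt]:
                            nullable_rest = False
                            break
                    if nullable_rest:            # rule 3: inherit FOLLOW(lhs)
                        follow[sym] |= follow[lhs]
                    changed |= len(follow[sym]) != old
    return follow

prods = {"A": [("a", "B")], "B": [("b", "A"), ("b",)]}
first = compute_first(prods, terminals={"a", "b"})
print(compute_follow(prods, first, start="A"))   # {'A': {'$'}, 'B': {'$'}}
```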
Implementing CLOSURE and GOTO#
With the tools for computing FIRST and FOLLOW sets in place, the next step can focus on the algorithms around LR(1) items. The corresponding reference implementation is in `parser.py`, with detailed comments in the code pointing to the relevant pages of the Dragon Book.
- Relevant functions in the reference code:
  - `get_closure`
  - `get_goto`
- Parts not spelled out in the book:
  - Following the dependency order, implement CLOSURE first, then GOTO.
  - Closure sets:
    - Pay attention to deduplication.
    - Store the items newly added during one pass separately, and traverse only those newly added items in the next pass.
    - Make sure the latest batch of new items is appended at the end, to avoid invalidating the iterator.
    - My implementation in the reference code is quite poor; in hindsight it only happened to work, probably written in a haze at the time, so I recommend not using it as a reference. A cleaner worklist pattern is sketched below.
  - The GOTO set has no pitfalls:
    - For the next symbol X, just move the dot past X and add the item.
    - Finally, return the closure of the result.
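A cleaner version of the deduplication pattern described above, as a generic worklist; `new_items_from(item)` stands in for the Dragon Book's expansion of `[A -> α·Bβ, a]` and is a hypothetical callback, not the repo's API:

```python
def closure(seed_items: set, new_items_from) -> set:
    result = set(seed_items)
    frontier = list(seed_items)          # only items added in the last round
    while frontier:
        next_frontier = []
        for item in frontier:
            for new in new_items_from(item):
                if new not in result:    # deduplicate before re-queueing
                    result.add(new)
                    next_frontier.append(new)
        frontier = next_frontier         # never mutate a list being iterated
    return result
```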
Implementing LR(1) Item Sets#
Starting from the closure of the first production of the augmented grammar, and with GOTO in place, the LR(1) item sets can be computed: keep applying GOTO until no new item set is added. The corresponding reference implementation is in `parser.py`, with detailed comments in the code pointing to the relevant pages of the Dragon Book.
- You can use a method similar to the FOLLOW-set computation to determine that "no new item sets are added"; see the sketch after this list.
- Relevant function in the reference code:
  - `get_lr1_items`
- After completing this step, print the process of computing GOTO and view each item set as a state: you will find that you have effectively built the LR(1) item-set automaton. At this point you are already halfway there.
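A sketch of that loop; `closure` and `goto` are the functions from the previous step, and the list index of each item set doubles as its state number (details assumed):

```python
def lr1_items(start_item, closure, goto, grammar_symbols) -> list:
    states = [closure({start_item})]     # state 0: closure of the start item
    seen = {frozenset(states[0])}
    i = 0
    while i < len(states):               # `states` may grow while scanning
        for x in grammar_symbols:
            nxt = goto(states[i], x)
            if nxt and frozenset(nxt) not in seen:
                seen.add(frozenset(nxt)) # a genuinely new item set
                states.append(nxt)
        i += 1
    return states                        # index = state number of the automaton
```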
Implementing the LR(1) Parsing Table#
- Relevant function in the reference code:
  - `get_lr`
- Requirements:
  - Model an action to fill into the parsing table; a sketch follows this list.
    - Its members must include:
      - The type of action to be executed.
      - The information necessary for executing the action, for example:
        - Which production to reduce by.
        - Which state to move to next.
  - Model an LR(1) parsing table to be filled in cell by cell.
    - Corresponds to `LRSheet`.
    - One dimension is the state index; the other is the grammar symbol.
  - The canonical LR(1) collection of item sets serves as the set of all states.
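A sketch of such an action model and table; the `LRSheet` name follows the text, while the concrete layout is an assumption:

```python
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ActionType(Enum):
    SHIFT = auto()
    REDUCE = auto()
    GOTO = auto()
    ACCEPT = auto()

@dataclass(frozen=True)
class Action:
    kind: ActionType
    target_state: Optional[int] = None   # SHIFT/GOTO: which state to move to
    production: Optional[object] = None  # REDUCE: which Generator to reduce by

class LRSheet:
    """One dimension is the state index, the other the grammar symbol."""
    def __init__(self) -> None:
        self._cells: dict[tuple[int, str], Action] = {}

    def fill(self, state: int, symbol: str, action: Action) -> None:
        prev = self._cells.get((state, symbol))
        if prev is not None and prev != action:
            # a cell filled twice: an implementation bug or a non-LR(1) grammar
            raise ValueError(f"conflict at ({state}, {symbol}): {prev} vs {action}")
        self._cells[(state, symbol)] = action

    def get(self, state: int, symbol: str) -> Optional[Action]:
        return self._cells.get((state, symbol))
```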
- You can fill in the actions one by one according to the rules, as in the reference code, or combine the rules and fill them in together.
- What if a cell is filled twice during the filling process? Either:
  - there is an error in the implementation, or
  - the grammar is not an LR(1) grammar.
- Other:
  - In the reference code, reduction by empty productions is implemented by trying to place the empty symbol on top of the stack after an error occurs.
Implementing the LR(1) Configuration#
- Model the configuration; a minimal sketch follows this list.
  - Assume all input is known in advance.
    - This does not strictly follow the definition on page 158[1].
  - Besides the symbols on the stack and the remaining input, also keep the stack of state indices, for easy debugging and observation of the bottom-up process.
  - In the reference code, the execution of one action step is also placed in the configuration, for easy state printing.
- At this point, the parser is basically complete.
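A minimal sketch of such a configuration, with the extra state-index stack mentioned above; the field names are assumptions:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Configuration:
    state_stack: list[int] = field(default_factory=lambda: [0])
    symbol_stack: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)  # all input known up front

    def __str__(self) -> str:
        # prints the states alongside the symbols to watch the bottom-up process
        return f"{self.state_stack} {self.symbol_stack} | {' '.join(self.remaining)}"

print(Configuration(remaining=["int", "a", ";", "$"]))  # [0] [] | int a ; $
```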
Adding Simple SDD Syntax-Directed Analysis#
The SDD here is limited to S-attributed SDD, defined in the style of recursive subroutines inside the body of `create_cfg()`, and triggered only when the action is a reduction by a production. You can try modifying it to trigger in the middle of a production body or on shift actions.
- You can follow these steps to add SDD:
  - Extend the modeling (e.g. `Symbol`) to carry various semantic attributes.
  - Implement simple type declarations.
  - Implement simple expression evaluation.
  - Implement simple type-checking rules.
  - Implement control-flow SDD.
- Note:
  - Python assignment shares object references rather than copying, so a deep copy is needed before executing semantic actions in `forward`, to prevent overwriting the SDD attributes of other `Symbol` instances.
- How to extend?
  - Agree on a format for the recursive subroutines; see the sketch after this list. For example:
    - The parameter list is `(input_symbols: list[grammar.Symbol], reduction_symbol: grammar.Symbol, st: parser.SymbolTable)`.
    - Define the attribute names in `constants.py`.
      - This shows the benefit of unified constant definitions: no need to retype literals, reducing the chance of typos and unexpected bugs.
  - Bind a function as the recursive subroutine of each defined production.
  - Also add this parameter to the `Item` class.
  - When creating new `Item` instances in the various methods of the Parser class, pass the bound function along.
  - Execute the semantic action during the reduce action in `forward`.
  - For specifics, see the Git diff in the reference code.
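A self-contained sketch of the subroutine format and the deep-copy caveat; `Sym`, the `VAL` attribute name, and the `E -> E + T` action are illustrative stand-ins, not the repo's code:

```python
import copy

VAL = "val"  # attribute name, as it would be defined in constants.py

class Sym:   # tiny stand-in for grammar.Symbol extended with attributes
    def __init__(self, name: str, **attrs) -> None:
        self.name, self.attrs = name, dict(attrs)

def on_reduce_add(input_symbols, reduction_symbol, st) -> None:
    """Recursive subroutine for E -> E + T (S-attributed: synthesized only)."""
    reduction_symbol.attrs[VAL] = (
        input_symbols[0].attrs[VAL] + input_symbols[2].attrs[VAL]
    )

# Deep-copy the popped symbols before running the action, so the subroutine
# cannot clobber attributes shared with other Symbol instances through
# Python's reference semantics.
E, plus, T = Sym("E", val=1), Sym("+"), Sym("T", val=2)
result = Sym("E")
on_reduce_add([copy.deepcopy(s) for s in (E, plus, T)], result, st=None)
print(result.attrs[VAL])  # 3
```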
Existing Issues#
- Initially, the modeling did not fully consider the equivalence of `Symbol`, etc.
  - For example, `a` and `b` are both of type `identifier` in productions, so comparing them yields equality.
  - Ultimately, only the representation in the production is compared, not the string content.
- Why keep both `sType` and `name`? Wouldn't it be better to merge them, since both represent a kind of token type?
  - The initial design separated the two mainly because of interface-integration issues.
  - The types in the lexer (tokenizer) were not divided into terminals and non-terminals, but subdivided into five categories.
  - When the CFG grammar was finally defined externally, they had to be divided again into terminals and non-terminals; this was indeed handled poorly and should have been settled in the tokenizer.
- Some functions within the Parser are public.
  - This is for the convenience of unit testing, but it is not tidy.
- `if-else` handling is not done well; the dangling else is special-cased.
  - It can actually be resolved in the grammar definition; see page 134 of the Dragon Book.
- No true panic mode is implemented; tokens are simply discarded until the required token appears.
- Implementation efficiency was not considered.
  - For example, the checks for "no new items or item sets added" in the FOLLOW-set and LR(1) canonical-collection computations are very simple and crude.
- The implementation of SDD and three-address code is not standardized.
- There is high coupling between `constants.py` and the external `create_cfg`.
- `ItemSet` does not fully account for hash computation after modification.
- The deduplication used in computing closures is poorly implemented.
  - For example, `cur_items` is used for deduplication instead of `f_list`.
  - `f_list` mixes lookahead symbols and the right-hand side of productions, posing a latent bug.
References#
- [1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. *Compilers: Principles, Techniques, and Tools*, 2nd ed. Chinese translation by Zhao Jianhua et al. Beijing: China Machine Press, 2009.
- [2] Alexander Stepanov and Paul McJones. *Elements of Programming*. Chinese translation by Qiu Zongyan. Beijing: China Machine Press, 2011.