Saved searches
Use saved searches to filter your results more quickly
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.
C compiler written in Python
pacocoursey/compiler
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Sign In Required
Please sign in to use Codespaces.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching Xcode
If nothing happens, download Xcode and try again.
Launching Visual Studio Code
Your codespace will open once ready.
There was a problem preparing your codespace, please try again.
Latest commit
Git stats
Files
Failed to load latest commit information.
README.md
Group Members: Charles Hermann, Izeah Olguin, Paco Coursey
$ git clone https://github.com/pacocoursey/compiler.git
Install the dependencies (requires pip3):
Our python compiler currently meets the standards expected of the third and final deliverable.
Our scanner reads in a file as a string of characters and produces a list of labeled tokens. These tokens are acquired by the parser and are used to produce a parse tree. The parse tree is used to construct a symbol table. Then the parse tree and symbol table are used to generate an intermediate representation of the program.
The parse tree is created using action and goto tables generated from our grammar rules. We use an LR(1) shift reduce parser. We then flatten out the recursive grammar rules of the parse tree using a recursive depth first traversal to remove redundant tree nodes.
The symbol table is created by scanning the parse tree in depth first order and creating new scopes for each function declaration. We error check for undefined identifiers and duplicate declarations here.
Our intermediate representation is also generated by a depth first traversal of the parse tree. We call an IR method for each node of the tree and save the result as a list of strings. We use three-address code in combination with temporary variables to represent the program.
The assembly code is generated by transforming each intermediate representation instruction into its relevant assembly instruction. We store all variables on the stack and only use registers for temporary storage or instructions requiring their use.
As a group, we have met numerous times to discuss our design, implementation and overall expectations of our compiler.
Our compiler uses Python 3.
This program returns the number of characters found in a given file. Run using:
$ python3 -m src.character_count FILENAME
Run the compiler without any flags to complete all stages of the compiler and generate a .s file (equivalent to gcc -O0 -S filename ):
$ python3 -m src.main FILENAME # compare with: $ gcc -O0 -S FILENAME
Logs additional output to a file in /logs .
Forcefully generate new action and goto tables from the specified grammar rules and save them in /tables .
Specify a grammar to use. Defaults to /grammars/main_grammar.txt . Run using:
$ python3 -m src.main -g GRAMMAR_FILENAME FILENAME # or $ python3 -m src.main --grammar GRAMMAR_FILENAME FILENAME
Tokenizes a C program and returns a list of the known tokens. Run using:
$ python3 -m src.main -s FILENAME # or $ python3 -m src.main --scan FILENAME
Converts a list of tokens generated by the scanner and constructs an abstract representation of the program using a pre-defined C grammar. Run using:
$ python3 -m src.main -p FILENAME # or $ python3 -m src.main --parse FILENAME
Generate a symbol table using the parse tree to keep a list of scopes and defined variables. Run using:
$ python3 src/main -t FILENAME # or $ python3 src/main --table FILENAME
Generate an Intermediate Representation. Run using:
$ python3 -m src.main -r FILENAME # or $ python3 -m src.main --ir FILENAME
Read in an IR instead of source file. Run using:
$ python3 -m src.main -i INPUT_IR # or $ python3 -m src.main --input INPUT_IR
Write out the IR to a file. Run using:
$ python3 -m src.main -r -o OUTPUT_FILENAME FILENAME # or $ python3 -m src.main --ir --output OUTPUT_FILENAME FILENAME
Generate assembly instructions from the IR. Run using:
$ python3 -m src.main -a FILENAME # or $ python3 -m src.main --asm FILENAME
Write out the assembly to a file. Run using:
$ python3 -m src.main -a -n OUTPUT_FILENAME FILENAME # or $ python3 -m src.main --asm --asmOutput OUTPUT_FILENAME FILENAME
Our scanner parses characters in a ‘chunk’, with a start an and end counter. We iterate through the chunk to match our tokens in the following order: Symbols, Operators, Keywords, Identifiers and Numbers.
We are not optimizing for speed, which is why we opted not to use regular expressions. This gives us more logical control over what tokens we recognize, and it is easier to handle edge cases like multi-line comments.
Our parser uses action and goto tables generated from the rules in grammars/main_grammar.txt . Because generating the tables takes so long, after the first generation they are saved in JSON format in the tables/ directory for future compiler executions. The parser outputs a parse tree consisting of instances of custom node classes defined in grammar.py . The parse tree nodes are highly abstracted and do not include unimportant tokens like brackets or parentheses.
After the initial creating of the parse tree, it is «flattened» by un-nesting recursive grammar nodes. This makes it easier to generate the symbol table and removes useless duplicate nodes from the tree.
Symbol Table Implementation
Our symbol table uses the parse tree to create a new scope for each function declaration. We save each variable declaration inside the appropriate scope, including a global scope. While generating the symbol table we also check for duplicate variable, function declaration and undefined identifiers.
The symbol table also keeps track of goto labels and checks for invalid code that calls goto labels that are never declared. This is done differently to undefined identifier and function checking, as goto labels can be declared anywhere in the program.
The intermediate representation is generated by visiting each node of the parse tree in depth first order and calling an IR method on each node. These IR methods are defined differently for each node in grammar.py . The IR methods currently return strings, and are saved in a simple list.
We use three-address code and intermediate variables to assist with our IR generation. The intermediate variables are uniquely generated so that we can avoid unclear assignments.
Our compiler can skip all of the above steps and start from an already generated IR file by using the -i or —input flags. You can dump the intermediate representation of a program to a file using the -o or —output flags.
The assembly code is generated by converting each line of the intermediate representation to equivalent assembly instructions. We keep all variables on the stack and assume all registers to be strictly temporary. There are no race conditions, however, as each IR instruction is parsed and converted as an atomic operation. Parameters to function calls are stored in the registers %r8d through %r15d . This means we can support function calls that have up to 8 parameters, but not more than that. Those registers are also temporary, and cannot be relied upon outside of the logic surrounding a callq statement.
At the start of every function, we grow the stack by 4 * number of local variables . This ensures there is no memory address overlap when calling other functions. Each function is declared in the ASM as the name given in the C code with an underscore prepended. The only exception is on Linux, where we leave the main function as main instead of _main to prevent errors.
As an example, the IR instruction i = 10 / 2 would be translated into the assembly instruction movl $5, -4(%rbp) (where the variable i is stored at -4(%rbp) ). As you can see in that example, we do a small optimization and pre-calculate any simple expressions in which both operands are numbers.
- Few files, things can be modified quickly
- Design and hierarchy is well established
- Good separation of concepts into modularized files
- Able to iterate quickly on all aspects of the compiler
- Does not support function calls with more than 8 parameters
- Cannot nest function calls i.e. sum(sum(2, 3), 5)
- Switch cases must be wrapped in curly braces <>
- Math expressions must be parenthesized to be evaluated correctly
- Grammar does not support chained variable declarations like int i, j
Required Features | Scanner | Parser | IR | ASM |
---|---|---|---|---|
Identifiers | ✔ | ✔ | ✔ | ✔ |
Variables | ✔ | ✔ | ✔ | ✔ |
Functions | ✔ | ✔ | ✔ | ✔ |
Keywords | ✔ | ✔ | ✔ | ✔ |
Arithmetic Expressions | ✔ | ✔ | ✔ | ✔ |
Assignment | ✔ | ✔ | ✔ | ✔ |
Boolean Expressions | ✔ | ✔ | ✔ | ✔ |
Goto Statements | ✔ | ✔ | ✔ | ✔ |
If/Else Control Flow | ✔ | ✔ | ✔ | ✔ |
Unary Operators | ✔ | ✔ | ✔ | ✔ |
Return Statements | ✔ | ✔ | ✔ | ✔ |
Break Statements | ✔ | ✔ | ✔ | ✔ |
While Loops | ✔ | ✔ | ✔ | ✔ |
Optional Features | Scanner | Parser | IR | ASM |
---|---|---|---|---|
Floats | ✔ | ✔ | ✔ | |
++, —, -=, +=, *=, /= | ✔ | ✔ | ✔ | ✔ |
For Loops | ✔ | ✔ | ✔ | |
&, |, ^ Operators | ✔ | ✔ | ✔ | ✔ |
>, ~ Operators | ✔ | ✔ | ✔ | ✔ |
Switch Statements | ✔ | ✔ | ✔ | ✔ |
Extra Features | Scanner | Parser | IR | ASM |
---|---|---|---|---|
Pointers | ||||
Arrays | ||||
Strings | ✔ | ✔ | ||
Include Statements | ✔ | ✔ | ||
Struct | ✔ | ✔ | ||
Enum | ✔ | ✔ | ||
Casting | ||||
Type Promotion | ||||
Type Specs |
Formatted using Black. Linted using PyLint.