tags : Programming Languages, Context Free Grammar (CFG), Parsing Expression Grammar (PEG)

Types

Top-Down Parser

  • From the root and move down to the tree’s leaf.
  • Sees a source code, which contains functions, which contains expression.
  • Eg. Recursive Descent Parser(handwritten) and LL(1) parsers, example of Recursive Descent parser for Ruby

Bottom-Up Parser

  • From leaves to the tree’s root.
  • Sees expression, which belongs to functions, which results the full source.
  • Eg. LR parser, SLR parser, CLR parser, etc.
  • Can be implemented using backtracking, but mostly uses some shift-reduce sort of algorithm

High level steps/concepts

Inputs and Outputs

Input

  • Stream of tokens from a Lexer

Output

  • parse-tree?: Not really, you can evaluates arithmetic expressions in-line with the parse. Subexpression is immediately evaluated, no tree is built.
  • parse tree traversal?: Probably. It gives us a way we’d traverse the tree.
    • Polish and Reverse Polish notation fully encode a tree structure and the steps you would take to traverse it.
    • While we can get a tree, we don’t really need to do that always
    • Because it gives a way to traverse, we can consume the output of the parser while parsing happens without generating the whole tree!
  • But if you think about it, the parser just adds a bunch of structure that is defined on top of the sequence of input tokens. i.e The leaf nodes are always exactly the input tokens themselves, in exactly the order of the input.
  • So in a very reduced manner, we can say parsers output a stream of tokens with rules added to appropriate places giving us pre/post-order traversal.

LL vs LR

  • LL parser outputs a pre-order traversal of the parse tree
    • Produces a left most derivation
    • Topdown
    • Predictive parsers
  • LR parser outputs a post-order traversal of the parse tree
    • Produces a reversed rightmost most derivation
    • Bottomup
    • Shift reducers parsers

In Practice

  • Not all parsers are purely based on formalism.
  • Usually hand-written parsers are not based on any formalism at all.
    • Language specifications are often defined in terms of a formalism like BNF
    • But it’s almost never the case that real parsers can be generated directly from this formalism.
    • Eg. GCC moved away from their Bison-based parser to a handwritten recursive descent parser. So did Go. While some notable language implementations do use Bison (like Ruby and PHP), many choose not to

Parsing techniques

  • See Parsers
  • Different parsing techniques can deal with different subsets of CFG (Some of these parsing techniques are beyond CFG too)
  • Each of these techniques have their own set of algorithms that implement them. Eg. The Operator Precedence Parser uses the Shunting yard algorithm
    • “LL parser” and “LR parser” are not actually specific algorithms at all, but rather generic terms referring to families of algorithms.

Classification (non-exhaustive)

  • TD: Top Down, BU: Bottom Up
  • A: Active, P: Passive
  • OP: Operator Precedence
  • PS: Push, PU: Pull
  • TD: Table Driven
  • INC: Incremental
  • PR: Predictive
TechniqueTypeSubsetUse/Example
Chart Parser{TD/BU}+{A/P}+INCAll of CFG
EarleyChart Parser (TD)All of CFG
CYK AlgorithmChart Parser (BU)All of CFG
GLRShift ReduceAll of CFGTree-sitter
Shift ReduceBU+TD
LR(k)Shift ReduceLR(k)
Simple PrecedenceShift ReduceSimple Precedence Grammar LR(1)
Operator PrecedenceShift Reduce+OPOperator Precedence GrammarCalculators
PackratShift ReducePEG
Recursive AscentShift Reduce (Recursive LALR)
LL(k)TD+PRLL(k)
Recursive DescentTD+PRNot limited to CFG, TDPL
PrattTD+OP

Earley Parser

Shift Reduce

  • table driven, bottom up parsing
  • LR parsing and its variations are shift-reduce
  • Simple Precedence Parsers are also shift-Reduce
  • Operator Precedence Parsers are also shift-Reduce
    • Can produce either a postfix notation string/RPN or an abstract syntax tree (AST)
  • Steps
    • shift: add this token to the stack for later reduction
    • reduce: pop tokens from the stack and form a syntactic construct
    • end, error: no known rule applies
    • conflict: does not know whether to shift or reduce

Recursive Descent

  • Real language grammars are hardly neatly classified into LL, LR et al. Hence the vast majority of languages today use handwritten adhoc Recursive Descent parsers.
  • Can be directly encoded in the host language
  • It can parse more than CFG. TDPL grammar can be viewed as an extremely minimalistic formal representation of a recursive descent parser
  • The Recursion in “recursive descent” happens when the parser “descends” into a nested rule.

Packrat

  • A form of parser similar to a recursive descent parser in construction
  • Except during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions
  • Ensuring that each parsing function is only invoked at most once at a given input position
  • Takes Parsing Expression Grammar (PEG) instead of LL

Recursive Ascent

LL(k)

  • (L)eft-to-right, (L)eftmost derivation, (k) lookahead
    • Parses the input from left to right
    • Performing leftmost derivation of the sentence
    • With k token of lookahead
  • Usually LL refers to LL(1)
  • Converts input grammar into parser tables OR can be parsed recursively

LL(1)

  • The traditional approach to constructing or generating an LL(1) parser is to produce a parse table which encodes the possible transitions between all possible states of the parser.
  • Such CFG can be converted into Pushdown Automata (PDA)

Programming Languages and LL(k)

  • Most Programming Languages have LL(1) grammars, they are never ambiguous. (They need to be unambiguous for what they do)
  • Not all programming languages can be parsed with LL(1). Some languages are designed for easy parsing, some grow adhoc.
    • C : You need an LR(1) parser for it.
    • C++ : You need a GLR/Earley/etc. parser to parse it (although gcc/clang use a recursive descent parser with some cool tricks)

Conflicts

  • first/first set
  • first/follow set

LR(k)

  • (L)eft-to-right, (R)rightmost derivation, (k) lookahead
  • Donald Knuth invented the LR parser (?)
  • All LR(k) grammars can be mechanically transformed into LR(1) grammars, with the consequence that there are only two categories of LR languages: LR(0) and LR(1)
  • Converts input grammar into parser tables

Variations

VariantsDescription
LALRLook Ahead LR, Simpler than LR(1)
SLRSimple LR, Simpler than LR(1)
CLRCanonical LR, canonical name for the LR(1) parser
GLRGeneralized LR, Can handle ambiguous grammar

Pratt Parsing (TDOP)

Concrete Syntax Tree (CST) / Parse tree

GrammarParse Tree
Transformational Generative GrammarPhase Markers
Phrase structure grammarConstituency-based parse trees
Dependency grammarDependency-based parse trees
  • See Parse Tree
  • Reflect the syntax of the input language

Abstract Syntax Tree (AST)

  • The syntax is “abstract” in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details.
  • Eg. For instance, grouping parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes.
  • An AST usually contains extra information about the program, due to the consecutive stages of analysis by the compiler.
    • Eg. it may store the position of each element in the source code, allowing the compiler to print useful error messages.
  • Since the language parsed by the parser is usually more than context-free and falls in the realm of context-sensitive
    • AST helps in terms of it allows you to add that context. Eg. allows new types to be declared, duck typing, operator overloading etc.
  • Semantic Analysis
    • AST is used intensively during semantic analysis, where the compiler checks for correct usage of the elements of the program and the language.
    • The compiler also generates symbol tables based on the AST during semantic analysis.

Parse table

  • Encodes the possible transitions between all possible states of the parser.

Online and Offline

  • online: It consumes the input tokens in a specific order and performs as much work as possible after consuming each token
    • Can start producing output while they are still consuming input.
    • Both LL and LR are online parsers because of lookahead

Predictive

  • Predictive parsing is parsers that do not require backtracking
  • Predictive parsing is possible only for the class of LL(k) grammars

Generators/Tools

Parser Generators

Parser Combinators

A combinator is, classically speaking, a function which takes functions and returns functions.

  • Only Accepts several parsers as input and returns a new parser as its output.
    • Here parser is a function accepting strings as input and returning some structure as output, typically a parse tree
  • Parser Combinators let you build parsers easily, just by specifying the grammar.
    • Sometimes they let you specify the grammar in a way it looks like normal code
    • Sometimes they might also allow to use some special grammar notation
  • Eg. Parsec

FAQ

LL vs LR

  • There is no LL(0) parser, LL’s need a lookahead.
  • LR parsers(LR(0)) can do without a lookahead because LR’s lookahead starts from the back and it gets to see all of the rule’s tokens.
  • Since LR lookahead starts from the end of a rule, a LR(1) parser has strictly more information available to it when making a decision than an LL(1) parser.
  • LR parsers can handle left recursion

Resources