tags : Programming Languages, Computation and Computer Theory, Automata Theory

History

We get 2 definitions of regular languages from these 2 events.

  • 1951, Kleene: Regular language is a language which is recognized by a finite automata.
  • 1956, Chomsky: Regular languages defined by languages generated by Type 3 / regular grammars

Theory

  • RE describes language by an algebra
  • They describe “exactly” the regular language
  • L(E) is the language, E is the regex.

Operations

Union

  • Eg. {01,111,10} \cup {00, 01} = {01,111,10,00}

Concatenation

  • Eg. {01,111,10}{00, 01} = {0100, 0101, 11100, 11101, 1000, 1001}

Kleene Star

  • A* : Kleene closure / Kleene star
  • L* = {\epsilon} \cup L \cup LL \cup LLL \cup ...
  • Eg. {0,10}* = {ε, 0, 10, 00, 010, 100, 1010,…}

Lexical Analysis

  • tokens: substrings that together represent a unit.

Basics

  • The first thing a compiler does is break a program into tokens
  • We can write regex for each different kind of token
  • Each regex has a specific action associated
    • Eg. Just print out, put it into a symbol table etc.

Issues

  1. Convert RE for each token to ε-NFA
    • Eg. identifier ε-NFA
    • Eg. Reserved word ε-NFA
  2. Combine all RE by w new start state w ε moves to start state of each ε-NFA
  3. Convert to DFA
  4. Set priority in DFA. Eg. DFA accepting if reserved word should have higher priority than DFA accepting identifier if

Implementations

Modern Implementations

  • Modern regex engines are augmented with features that allow the recognition of non-regular languages
  • Characters like, [ and - have special meanings, so you need to escape them with \
  • Some operators
    • Concatenation: [a1,a2,…an] is shorthand for a1+a2+…+an
    • Union: | operator.
    • One or more: + , E+ = EE* (E concatenated w E*)
    • Zero or one of: ?, E? = E + ε, [ab]? = a + b + ε

Flavors

Learning Resources

Tutorials

Tools

Practical Concepts

Greedy and Lazy

Greedy

let regexp = /".+"/g;
 
let str = 'a "witch" and her "broom" is one';
 
alert( str.match(regexp) ); // "witch" and her "broom"
  • What happens?

    • For every position in the string
      • Try to match the pattern at that position.
      • If there’s no match, go to the next position.
      • If we found the match for the current part of the pattern then we try to find a match for the next part of the pattern
        • Eg. in ".*" , " will be first part, .* will be second and so on
      • If we reach the end of the string(no more characters!) and we find no match, we backtrack.
      • We keep backtracking till we find a match for the entire regex pattern.
        • Eg. In 'a "witch" and her "broom" is one' , the engine first goes till end of the string because of .+ but then backtrack till it finds the ending " in end of broom"

Lazy

  • ?
    • Usually ? is a quantifier by itself (zero or one)
    • If added after another quantifier (or even itself): It switches the matching mode from greedy to lazy
    • Eg. .*?, .+? lazy search for .* and .+
  • What happens?

    • For every position in the string
      • Try to match the pattern at that position.
      • If there’s no match go to the next position.
      • If we found the match for the current part of the pattern then we try to find a match for the next part of the pattern
        • Eg. in ".*?" , " will be first part, .*? will be second and so on
      • DIFFERENCE: Now, because .*? is lazy, engine try to match the part after .*? which is " . If it doesn’t find, it’ll just match with .*
        • This is the nature of the lazy, it’ll try to end the match as soon as possible.
      • If we reach the end of the string(no more characters!) and we find no match, we backtrack.
      • We keep backtracking till we find a match for the entire regex pattern.
        • Eg. In 'a "witch" and her "broom" is one' , the engine first goes till end of the string because of .+ but then backtrack till it finds the ending " in end of broom"

Lookaround

  • Lookaround = Lookahead + Lookbehind
  • This allow you to match some something but also tell the engine to make sure that this and that should be before and after what I want to match, if they are there then only it’ll be a match

Look ahead

  • (?=) : Postitive
  • (?!) : Negative

Look behind

When you want to negate certain characters in a string, you can use character class but when you want to negate more than one character in a particular sequence, you need to use negative look ahead

  • (?<=) : Postitive
  • (?<!) : Negative