Regular Expressions

tags : Programming Languages, Computation and Computer Theory, Automata Theory

History

We get 2 definitions of regular languages from these 2 events.

1951, Kleene: Regular language is a language which is recognized by a finite automata.
1956, Chomsky: Regular languages defined by languages generated by Type 3 / regular grammars

Theory

RE describes language by an algebra
They describe “exactly” the regular language
L(E) is the language, E is the regex.

Operations

Union

Eg. {01,111,10} \cup {00, 01} = {01,111,10,00}

Concatenation

Eg. {01,111,10}{00, 01} = {0100, 0101, 11100, 11101, 1000, 1001}

Kleene Star

A* : Kleene closure / Kleene star
L* = {\epsilon} \cup L \cup LL \cup LLL \cup ...
Eg. {0,10}* = {ε, 0, 10, 00, 010, 100, 1010,…}

Lexical Analysis

tokens: substrings that together represent a unit.

Basics

The first thing a compiler does is break a program into tokens
We can write regex for each different kind of token
Each regex has a specific action associated
- Eg. Just print out, put it into a symbol table etc.

Issues

Convert RE for each token to ε-NFA
- Eg. identifier ε-NFA
- Eg. Reserved word ε-NFA
Combine all RE by w new start state w ε moves to start state of each ε-NFA
Convert to DFA
Set priority in DFA. Eg. DFA accepting if reserved word should have higher priority than DFA accepting identifier if

Implementations

Modern Implementations

Modern regex engines are augmented with features that allow the recognition of non-regular languages
Characters like, [ and - have special meanings, so you need to escape them with \
Some operators
- Concatenation: [a1,a2,…an] is shorthand for a1+a2+…+an
- Union: | operator.
- One or more: + , E+ = EE* (E concatenated w E*)
- Zero or one of: ?, E? = E + ε, [ab]? = a + b + ε

Flavors

PCRE
Javascript’s regex in certain cases the engine is stateful
- See javascript - Why does a RegExp with global flag give wrong results?

Learning Resources

Tutorials

Tools

Practical Concepts

Greedy and Lazy

Greedy

let regexp = /".+"/g;
 
let str = 'a "witch" and her "broom" is one';
 
alert( str.match(regexp) ); // "witch" and her "broom"

What happens?
- For every position in the string
  - Try to match the pattern at that position.
  - If there’s no match, go to the next position.
  - If we found the match for the current part of the pattern then we try to find a match for the next part of the pattern
    - Eg. in ".*" , " will be first part, .* will be second and so on
  - If we reach the end of the string(no more characters!) and we find no match, we backtrack.
  - We keep backtracking till we find a match for the entire regex pattern.
    - Eg. In 'a "witch" and her "broom" is one' , the engine first goes till end of the string because of .+ but then backtrack till it finds the ending " in end of broom"

Lazy

?
- Usually ? is a quantifier by itself (zero or one)
- If added after another quantifier (or even itself): It switches the matching mode from greedy to lazy
- Eg. .*?, .+? lazy search for .* and .+

What happens?
- For every position in the string
  - Try to match the pattern at that position.
  - If there’s no match go to the next position.
  - If we found the match for the current part of the pattern then we try to find a match for the next part of the pattern
    - Eg. in ".*?" , " will be first part, .*? will be second and so on
  - DIFFERENCE: Now, because .*? is lazy, engine try to match the part after .*? which is " . If it doesn’t find, it’ll just match with .*
    - This is the nature of the lazy, it’ll try to end the match as soon as possible.
  - If we reach the end of the string(no more characters!) and we find no match, we backtrack.
  - We keep backtracking till we find a match for the entire regex pattern.
    - Eg. In 'a "witch" and her "broom" is one' , the engine first goes till end of the string because of .+ but then backtrack till it finds the ending " in end of broom"

Lookaround

Lookaround = Lookahead + Lookbehind
This allow you to match some something but also tell the engine to make sure that this and that should be before and after what I want to match, if they are there then only it’ll be a match

Look ahead

(?=) : Postitive
(?!) : Negative

Look behind

When you want to negate certain characters in a string, you can use character class but when you want to negate more than one character in a particular sequence, you need to use negative look ahead

(?<=) : Postitive
(?<!) : Negative

🐏 mogoz

Table of Contents

Regular Expressions

History

Theory

Operations

Union

Concatenation

Kleene Star

Lexical Analysis

Basics

Issues

Implementations

Modern Implementations

Flavors

Learning Resources

Tutorials

Tools

Practical Concepts

Greedy and Lazy

Greedy

Lazy

Lookaround

Look ahead

Look behind

Graph View

Backlinks

🐏 mogoz

Table of Contents

Regular Expressions

History §

Theory §

Operations §

Union §

Concatenation §

Kleene Star §

Lexical Analysis §

Basics §

Issues §

Implementations §

Modern Implementations §

Flavors §

Learning Resources §

Tutorials §

Tools §

Practical Concepts §

Greedy and Lazy §

Greedy §

Lazy §

Lookaround §

Look ahead §

Look behind §

Graph View

Backlinks

History

Theory

Operations

Union

Concatenation

Kleene Star

Lexical Analysis

Basics

Issues

Implementations

Modern Implementations

Flavors

Learning Resources

Tutorials

Tools

Practical Concepts

Greedy and Lazy

Greedy

Lazy

Lookaround

Look ahead

Look behind