tags : Programming Languages, Computation and Computer Theory, Automata Theory
History
We get 2 definitions of regular languages from these 2 events.
- 1951, Kleene: Regular language is a language which is recognized by a finite automata.
- 1956, Chomsky: Regular languages defined by languages generated by Type 3/regular grammars
Theory
- RE describes language by an algebra
- They describe “exactly” the regular language
- L(E)is the- language,- Eis the regex.
Operations
Union
- Eg. {01,111,10} \cup {00, 01}={01,111,10,00}
Concatenation
- Eg. {01,111,10}{00, 01}={0100, 0101, 11100, 11101, 1000, 1001}
Kleene Star
- A*: Kleene closure / Kleene star
- L* = {\epsilon} \cup L \cup LL \cup LLL \cup ...
- Eg. {0,10}* = {ε, 0, 10, 00, 010, 100, 1010,…}
Lexical Analysis
tokens: substrings that together represent a unit.
Basics
- The first thing a compiler does is break a program into tokens
- We can write regexfor each different kind oftoken
- Each regexhas a specific action associated- Eg. Just print out, put it into a symbol table etc.
 
Issues

- Convert REfor eachtokento ε-NFA- Eg. identifier ε-NFA
- Eg. Reserved word ε-NFA
 
- Combine all REby w new start state w ε moves to start state of each ε-NFA
- Convert to DFA
- Set priority in DFA. Eg. DFA accepting ifreserved word should have higher priority than DFA accepting identifierif
Implementations
Modern Implementations
- Modern regex engines are augmented with features that allow the recognition of non-regular languages
- Characters like, [and-have special meanings, so you need to escape them with\
- Some operators
- Concatenation: [a1,a2,…an] is shorthand for a1+a2+…+an
- Union: | operator.
- One or more: +,E+=EE*(E concatenated w E*)
- Zero or one of: ?, E? = E + ε,[ab]?= a + b + ε
 
Flavors
- PCRE
- Javascript’s regex in certain cases the engine is stateful
Learning Resources
Tutorials
- Why you really can parse HTML (and anything else) with regular expressions – Neil Madden 🌟
- Regular expression Denial of Service - ReDoS
- A Visual Guide to Regular Expression
- Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex) - YouTube
- Emacs: basics of regular expressions (regexp) - YouTube
- https://github.com/ziishaned/learn-regex
- My most useful RegExp trick — surma.dev
- Python 3.11: possessive quantifiers and atomic grouping added to re module
- Building regex.help
- https://javascript.info/regular-expressions
Tools
- https://github.com/aloisdg/awesome-regex
- https://regex101.com/
- https://regexr.com/
- https://regexone.com/
- https://projects.lukehaas.me/regexhub/
- https://remram44.github.io/regex-cheatsheet/regex.html
Practical Concepts
Greedy and Lazy
Greedy
let regexp = /".+"/g;
 
let str = 'a "witch" and her "broom" is one';
 
alert( str.match(regexp) ); // "witch" and her "broom"- 
What happens? - For every position in the string
- Try to match the pattern at that position.
- If there’s no match, go to the next position.
- If we found the match for the current part of the patternthen we try to find a match for thenext part of the pattern- Eg. in ".*","will be first part,.*will be second and so on
 
- Eg. in 
- If we reach the end of the string(no more characters!) and we find no match, we backtrack.
- We keep backtracking till we find a match for the entire regex pattern.
- Eg. In 'a "witch" and her "broom" is one', the engine first goes till end of the string because of.+but thenbacktracktill it finds the ending"in end ofbroom"
 
- Eg. In 
 
 
- For every position in the string
Lazy
- ?- Usually ?is a quantifier by itself (zero or one)
- If added after another quantifier (or even itself): It switches the matching mode from greedytolazy
- Eg. .*?,.+?lazy search for.*and.+
 
- Usually 
- 
What happens? - For every position in the string
- Try to match the pattern at that position.
- If there’s no match go to the next position.
- If we found the match for the current part of the patternthen we try to find a match for thenext part of the pattern- Eg. in ".*?","will be first part,.*?will be second and so on
 
- Eg. in 
- DIFFERENCE: Now, because .*?is lazy, engine try to match the part after.*?which is". If it doesn’t find, it’ll just match with.*- This is the nature of the lazy, it’ll try to end the match as soon as possible.
 
- If we reach the end of the string(no more characters!) and we find no match, we backtrack.
- We keep backtracking till we find a match for the entire regex pattern.
- Eg. In 'a "witch" and her "broom" is one', the engine first goes till end of the string because of.+but thenbacktracktill it finds the ending"in end ofbroom"
 
- Eg. In 
 
 
- For every position in the string
Lookaround
- Lookaround = Lookahead + Lookbehind
- This allow you to match some something but also tell the engine to make sure that this and that should be before and after what I want to match, if they are there then only it’ll be a match
Look ahead
- (?=) : Postitive
- (?!) : Negative
Look behind
When you want to negate certain characters in a string, you can use character class but when you want to negate more than one character in a particular sequence, you need to use negative look ahead
- (?<=) : Postitive
- (?<!) : Negative