tags : Programming Languages, Computation and Computer Theory, Automata Theory
History
We get 2 definitions of regular languages
from these 2 events.
- 1951, Kleene: Regular language is a language which is recognized by a finite automata.
- 1956, Chomsky: Regular languages defined by languages generated by
Type 3
/regular grammars
Theory
- RE describes language by an algebra
- They describe “exactly” the regular language
L(E)
is thelanguage
,E
is the regex.
Operations
Union
- Eg.
{01,111,10} \cup {00, 01}
={01,111,10,00}
Concatenation
- Eg.
{01,111,10}{00, 01}
={0100, 0101, 11100, 11101, 1000, 1001}
Kleene Star
A*
: Kleene closure / Kleene star- L* =
{\epsilon} \cup L \cup LL \cup LLL \cup ...
- Eg. {0,10}* = {ε, 0, 10, 00, 010, 100, 1010,…}
Lexical Analysis
tokens
: substrings that together represent a unit.
Basics
- The first thing a compiler does is break a program into
tokens
- We can write
regex
for each different kind oftoken
- Each
regex
has a specific action associated- Eg. Just print out, put it into a symbol table etc.
Issues
- Convert
RE
for eachtoken
to ε-NFA- Eg. identifier ε-NFA
- Eg. Reserved word ε-NFA
- Combine all
RE
by w new start state w ε moves to start state of each ε-NFA - Convert to DFA
- Set priority in DFA. Eg. DFA accepting
if
reserved word should have higher priority than DFA accepting identifierif
Implementations
Modern Implementations
- Modern regex engines are augmented with features that allow the recognition of non-regular languages
- Characters like,
[
and-
have special meanings, so you need to escape them with\
- Some operators
- Concatenation: [a1,a2,…an] is shorthand for a1+a2+…+an
- Union: | operator.
- One or more:
+
,E+
=EE*
(E concatenated w E*) - Zero or one of:
?
, E? = E + ε,[ab]?
= a + b + ε
Flavors
- PCRE
- Javascript’s regex in certain cases the engine is stateful
Learning Resources
Tutorials
- Why you really can parse HTML (and anything else) with regular expressions – Neil Madden 🌟
- Regular expression Denial of Service - ReDoS
- A Visual Guide to Regular Expression
- Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex) - YouTube
- Emacs: basics of regular expressions (regexp) - YouTube
- https://github.com/ziishaned/learn-regex
- My most useful RegExp trick — surma.dev
- Python 3.11: possessive quantifiers and atomic grouping added to re module
- Building regex.help
- https://javascript.info/regular-expressions
Tools
- https://github.com/aloisdg/awesome-regex
- https://regex101.com/
- https://regexr.com/
- https://regexone.com/
- https://projects.lukehaas.me/regexhub/
- https://remram44.github.io/regex-cheatsheet/regex.html
Practical Concepts
Greedy and Lazy
Greedy
let regexp = /".+"/g;
let str = 'a "witch" and her "broom" is one';
alert( str.match(regexp) ); // "witch" and her "broom"
-
What happens?
- For every position in the string
- Try to match the pattern at that position.
- If there’s no match, go to the next position.
- If we found the match for the
current part of the pattern
then we try to find a match for thenext part of the pattern
- Eg. in
".*"
,"
will be first part,.*
will be second and so on
- Eg. in
- If we reach the end of the string(no more characters!) and we find no match, we
backtrack
. - We keep backtracking till we find a match for the entire regex pattern.
- Eg. In
'a "witch" and her "broom" is one'
, the engine first goes till end of the string because of.+
but thenbacktrack
till it finds the ending"
in end ofbroom"
- Eg. In
- For every position in the string
Lazy
?
- Usually
?
is a quantifier by itself (zero or one) - If added after another quantifier (or even itself): It switches the matching mode from
greedy
tolazy
- Eg.
.*?
,.+?
lazy search for.*
and.+
- Usually
-
What happens?
- For every position in the string
- Try to match the pattern at that position.
- If there’s no match go to the next position.
- If we found the match for the
current part of the pattern
then we try to find a match for thenext part of the pattern
- Eg. in
".*?"
,"
will be first part,.*?
will be second and so on
- Eg. in
- DIFFERENCE: Now, because
.*?
is lazy, engine try to match the part after.*?
which is"
. If it doesn’t find, it’ll just match with.*
- This is the nature of the lazy, it’ll try to end the match as soon as possible.
- If we reach the end of the string(no more characters!) and we find no match, we
backtrack
. - We keep backtracking till we find a match for the entire regex pattern.
- Eg. In
'a "witch" and her "broom" is one'
, the engine first goes till end of the string because of.+
but thenbacktrack
till it finds the ending"
in end ofbroom"
- Eg. In
- For every position in the string
Lookaround
- Lookaround = Lookahead + Lookbehind
- This allow you to match some something but also tell the engine to make sure that this and that should be before and after what I want to match, if they are there then only it’ll be a match
Look ahead
- (?=) : Postitive
- (?!) : Negative
Look behind
When you want to negate certain characters in a string, you can use character class but when you want to negate more than one character in a particular sequence, you need to use negative look ahead
- (?<=) : Postitive
- (?<!) : Negative