Notes from CodeStock 2014’s “Regular Expressions” presentation

Brian Friesen’s talk on Regular Expressions was probably my favorite of the conference. You gotta admire a guy who builds a regular expression engine to properly demo and train folks up on the ‘devil’s language’ (as I and others I’ve known have called it)… Folks that hate regex and those that live and breathe the stuff all got something from the session. Good stuff.

Regular Expressions – now you have two problems

Brian Friesen
@brianfriesen

Works at Quicken Loans (side note, I used QL last year – hands down the best UX I’ve ever seen in a mortgage product/service)

github.com/QuickenLoans/RegExpose

Regexper.com

Regular-expressions.info

The Bible:
Mastering Regular Expressions by Jeffrey Friedl
– reading just the first 2/5 of the book gives you enough to work with
———————————————————————–

1. RegEx are hierarchical

Root node = the whole expression

ABC would tree out like this:

ABC – root
A – character literal
B – character literal
C – character literal
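To make the tree idea concrete, here is a toy sketch of how a literal-only pattern like ABC could be represented as a root node with one child per character. The Node class and the parse_literals and render helpers are names made up for this illustration; this is not how RegExpose or any real engine is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "root" or "character literal"
    text: str                      # the part of the pattern this node covers
    children: list[Node] = field(default_factory=list)


def parse_literals(pattern: str) -> Node:
    """Build a one-level tree for a pattern made only of literal characters."""
    root = Node("root", pattern)
    root.children = [Node("character literal", ch) for ch in pattern]
    return root


def render(node: Node, depth: int = 0) -> list[str]:
    """Return one indented line per node, parents before children."""
    lines = [f"{'  ' * depth}{node.text} - {node.kind}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


tree = parse_literals("ABC")
print("\n".join(render(tree)))
# ABC - root
#   A - character literal
#   B - character literal
#   C - character literal
```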

2. RegEx do their thing sequentially

A RegEx will match if each of its child nodes matches in sequence
After a match, the RegEx engine will continue trying to find further matches until it has covered the entire string.

RegEx are by default case-sensitive
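
A quick way to see both behaviours (scanning on for further matches, and case sensitivity) is with Python's built-in re module. This is only a sketch with a made-up sample string; the talk itself used RegExpose for its demos.

```python
import re

text = "ABC abc xABCx"

# The engine keeps scanning after each match until the string is exhausted.
matches = [(m.start(), m.group()) for m in re.finditer(r"ABC", text)]
print(matches)  # [(0, 'ABC'), (9, 'ABC')]

# Matching is case-sensitive unless an option says otherwise.
insensitive = re.findall(r"ABC", text, flags=re.IGNORECASE)
print(insensitive)  # ['ABC', 'abc', 'ABC']
```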

Character classes are surrounded by square brackets
– for a range, use a dash.
– if you need a dash in the match, put it at the beginning of your set within square brackets
– you can also include specific characters or numbers to match against
– can include a-z and A-Z, or you can pass an additional param to ignore case
– a negated character class = add a “^” caret character right after the opening square bracket, i.e. [^a-f0-9] matches any character that is NOT a-f or 0-9
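
The character class rules above can be tried out in Python's re module. The sample strings are invented for the example, and re.IGNORECASE stands in for the "additional param to ignore case".

```python
import re

# A range-based class: hex digits (lowercase only).
hex_chars = re.findall(r"[a-f0-9]", "c0ffee!")
print(hex_chars)  # ['c', '0', 'f', 'f', 'e', 'e']

# The negated version matches everything the class above does not.
non_hex = re.findall(r"[^a-f0-9]", "c0ffee!")
print(non_hex)  # ['!']

# A literal dash goes first in the set so it is not read as a range.
phone_chars = re.findall(r"[-0-9]", "555-1234")
print(phone_chars)  # ['5', '5', '5', '-', '1', '2', '3', '4']

# Ignoring case with a flag instead of spelling out a-f and A-F.
any_case = re.findall(r"[a-f]", "AbC", flags=re.IGNORECASE)
print(any_case)  # ['A', 'b', 'C']
```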

Shorthand matches
– \d is the same as [0-9]
– \D is the same as [^0-9]
– \s matches whitespace chars
– \w matches any word character, meaning [A-Za-z0-9_]
– . matches any character, depending on options (i.e. by default it matches everything except the newline character)
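
Here is a small sketch of the shorthands in Python's re module, using made-up sample strings. One caveat: with Python 3 string patterns, \d and \w are Unicode-aware and so are broader than [0-9] and [A-Za-z0-9_] unless the re.ASCII flag is passed. The samples below are plain ASCII, so the results are the same either way.

```python
import re

sample = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", sample)
print(digits)  # ['4', '2', '3']

spaces = re.findall(r"\s", sample)
print(spaces)  # [' ', '\t', '\n']

words = re.findall(r"\w+", sample)
print(words)  # ['Room', '42', 'floor_3']

non_digits = re.findall(r"\D+", "a1b22c")
print(non_digits)  # ['a', 'b', 'c']

# The dot skips newlines by default; re.DOTALL is the option that changes it.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL)
print(dot_default)  # ['a', 'b']
print(dot_all)      # ['a', '\n', 'b']
```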

Alternation
(it’s a pipe dream)

| character means “or”
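– lions|tigers|bears: against ‘lions’ the first choice would match, but the RegEx engine doesn’t know yet whether the rest of the expression will succeed, so it saves state (a breadcrumb) that lets it come back and check the other choices.
– if a match hits on the last option in a set of choices, no state will be saved (no breadcrumbs etc.), since there are no other choices left to come back to.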
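
The breadcrumbs aren't visible from the outside, but their effect is. In this sketch with Python's re module (also a backtracking engine), the cat/catalog patterns are invented to show that the first choice that works wins, and that the engine can return to a later choice when the rest of the expression fails.

```python
import re

found = re.findall(r"lions|tigers|bears", "lions and tigers and bears")
print(found)  # ['lions', 'tigers', 'bears']

# Choices are tried left to right; the first one that succeeds is used.
first_wins = re.match(r"cat|catalog", "catalog").group()
print(first_wins)  # 'cat'

reordered = re.match(r"catalog|cat", "catalog").group()
print(reordered)  # 'catalog'

# Here 'cat' matches first, but the following 's' fails against 'a', so the
# engine goes back to its saved state and tries 'catalog' instead.
backtracked = re.match(r"(?:cat|catalog)s", "catalogs").group()
print(backtracked)  # 'catalogs'
```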

Quantifiers (quantifiers are always AFTER)
(Because sometimes, quantity trumps quality)

Greedy Quantifiers (greedy means the quantifier always wants more, i.e. it will keep matching for as long as it can)
——————
– ? = optional
– * = will match zero or more times
– PO*P would match “PP” as well as “POOP”
– + = must match at least once to succeed
– NO+! would match “NO!” as well as “NOOOOOOOOOOO!”
– {} = match a specific number of things
– \d{3} = this means match exactly 3 digits
– \d{3,15} = this means match at least 3, but up to 15
– \d{3,} = this means match at least 3 with no high end limit
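
The quantifiers listed above, sketched with Python's re module. The "colou?r" pattern and the digit string are extra examples invented for this sketch; the PO*P and NO+! patterns come from the notes.

```python
import re

# ? makes the preceding item optional.
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour")]
print(optional_u)  # [True, True]

# * allows zero or many.
star = [bool(re.fullmatch(r"PO*P", w)) for w in ("PP", "POOP")]
print(star)  # [True, True]

# + needs at least one.
plus = [bool(re.fullmatch(r"NO+!", w)) for w in ("NO!", "NOOOOOOOOOOO!", "N!")]
print(plus)  # [True, True, False]

numbers = "12 345 6789"

exactly_three = re.findall(r"\d{3}", numbers)
print(exactly_three)  # ['345', '678']

three_to_fifteen = re.findall(r"\d{3,15}", numbers)
print(three_to_fifteen)  # ['345', '6789']

three_or_more = re.findall(r"\d{3,}", numbers)
print(three_or_more)  # ['345', '6789']
```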

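The ultimate lazy (as in low-effort) pattern: .* (which is itself a greedy quantifier)

Caution against using “.*” – it can lead to unexpected matches, since the greedy ‘any character, as many as possible’ matching swallows text that the more specific nodes after the .* were meant to match, and the engine then has to backtrack to give some of it back.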

Lazy Quantifiers – will only match as much as is needed, without going overboard
Once a match is made it will pass control to the next node in the RegEx until it’s needed again…
—————-
.*? = Lazy
{3,5}? = also Lazy (e.g. in \d{3,5}?, once 3 digits are matched, the next node in the RegEx is tried)

– ab.*?cd would match “abc12345cd”: at every position the engine first tries the ‘c’ (then ‘d’) nodes, and only when they fail does the lazy quantifier consume the next character
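
The difference between greedy and lazy is easiest to see side by side. A sketch with Python's re module; the longer sample string is invented so that there are two possible places for the match to end.

```python
import re

text = "abc12345cd and ab--cd"

# Greedy: .* runs to the end of the string, then backtracks to the LAST "cd".
greedy = re.findall(r"ab.*cd", text)
print(greedy)  # ['abc12345cd and ab--cd']

# Lazy: .*? stops at the FIRST "cd" that lets the whole expression succeed.
lazy = re.findall(r"ab.*?cd", text)
print(lazy)  # ['abc12345cd', 'ab--cd']

# The same idea with a counted quantifier.
greedy_digits = re.match(r"\d{3,5}", "1234567").group()
lazy_digits = re.match(r"\d{3,5}?", "1234567").group()
print(greedy_digits)  # '12345'
print(lazy_digits)    # '123'

# A lazy quantifier still takes more when the next node needs it to.
lazy_extended = re.match(r"\d{3,5}?x", "1234x").group()
print(lazy_extended)  # '1234x'
```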

More alternation
—————-
(?:white|dog|brick) house
– match against “dog house”
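
The (?: ... ) wrapper is a non-capturing group: it limits how far the alternation reaches without storing what it matched. A sketch in Python's re module, with a sample sentence invented for the example:

```python
import re

houses = re.compile(r"(?:white|dog|brick) house")

grouped = houses.findall("dog house, white house, tree house")
print(grouped)  # ['dog house', 'white house']

# Non-capturing means nothing is stored as a group.
captured = houses.search("dog house").groups()
print(captured)  # ()

# Without the group, " house" belongs only to the last choice,
# so the pattern means: white OR dog OR "brick house".
ungrouped = re.findall(r"white|dog|brick house", "dog house")
print(ungrouped)  # ['dog']
```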

Quantifiers + grouping
———————
(?:NaN)+
– match against “NaNNaNNaNNaNNaN”
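
Since quantifiers apply to whatever sits immediately before them, the group is what makes the + repeat the whole “NaN” rather than just the last N. A sketch in Python's re module; the ungrouped NaN+ pattern is added here only for contrast.

```python
import re

batman = "NaNNaNNaNNaNNaN"

# The + applies to the whole group.
grouped = re.fullmatch(r"(?:NaN)+", batman)
repeats = len(grouped.group()) // len("NaN")
print(repeats)  # 5

# Without the group, the + only applies to the final N.
ungrouped_fails = re.fullmatch(r"NaN+", "NaNNaN")
ungrouped_ok = re.fullmatch(r"NaN+", "NaNNN")
print(ungrouped_fails is None)   # True
print(ungrouped_ok is not None)  # True
```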