Regular Expression

No one likes to use it, but man is it powerful. I recently set out to learn regular expressions, and while https://en.wikipedia.org/wiki/Regular_expression was a very dry read, it lays out the basics in a fairly easy-to-understand way. Reading alone rarely makes it stick, so I suggest playing around with it, much like I have and will continue to do.

There are several parts of RegEx that can be used separately or together to form your search string: Boolean, Grouping, Quantification, and the Wildcard. An understanding of one or more of these parts will let you begin using RegEx in the various tools that incorporate it.

Boolean - Using a | (Shift+\ on a US keyboard) lets you specify an OR clause. For example, "cat|dog" would pull all results that include cat OR dog, whereas "c(a|o)t" would give you all results that include cat OR cot. This incorporates grouping, which we will get into next, but to break it down, "c(a|o)t" searches for text that contains a c, followed by an a OR an o, followed by a t. Because the pattern can match anywhere inside a word, the results are not limited to just cat OR cot; they could include tomcat or apricot as well.
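To see the difference for yourself, here is a small sketch using grep -E, the same tool used further down. The word list and the variable names are made up for illustration.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' cat dog cot tomcat apricot bird)

# OR across whole alternatives: lines containing "cat" OR "dog".
either=$(printf '%s\n' "$words" | grep -E 'cat|dog')

# OR scoped by a group: "c", then "a" OR "o", then "t".
scoped=$(printf '%s\n' "$words" | grep -E 'c(a|o)t')

printf 'cat|dog matched:\n%s\n' "$either"
printf 'c(a|o)t matched:\n%s\n' "$scoped"
```

The first search should pick up cat, dog and tomcat, while the second should pick up cat, cot, tomcat and apricot, since nothing in either pattern says the match has to be the whole word.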

Grouping - Grouping is invoked by an opening parenthesis "(" and a closing parenthesis ")". It sets the scope of the operators you are calling. For instance, take "(cat|cot)": this is the same as "cat|cot" because the scope is the same; the | is operating on the same set of data. Now "c(a|o)t" will return the same results; however, the scope of what you're operating on has changed to only look at an a OR an o instead of a cat OR a cot. You can expand this to multiple ORs, "c(a|o|u)t", to get cat OR cot OR cut, or use multiple groups as well: "c(a|o)(a|o)t" would give you results of caat, caot, coat, coot.
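A rough sketch of those two grouping variations, again with grep -E and an invented word list (the variable names are only for illustration):

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' cat cot cut cit caat caot coat coot)

# One group with three alternatives: cat, cot, or cut.
three=$(printf '%s\n' "$words" | grep -E 'c(a|o|u)t')

# Two groups in a row: each position is independently "a" or "o".
pairs=$(printf '%s\n' "$words" | grep -E 'c(a|o)(a|o)t')

printf 'c(a|o|u)t matched:\n%s\n' "$three"
printf 'c(a|o)(a|o)t matched:\n%s\n' "$pairs"
```

Note that cit is skipped by the first pattern because i is not one of the listed alternatives, and the second pattern skips every three-letter word because it demands two letters between the c and the t.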

Quantification - Quantification is just a fancy word for how many. Let's take the last example and play with it using some bits from the quantification section. I have copied the list of most-used quantifiers from the wiki article above for ease of access.

  • ? The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
  • * The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
  • + The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
  • {n} The preceding item is matched exactly n times.
  • {min,} The preceding item is matched min or more times.
  • {min,max} The preceding item is matched at least min times, but not more than max times.
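Each quantifier in the list can be tried out with a one-liner. This is only a sketch: the sample words are the ones from the list plus a few invented ones, the variable names are arbitrary, and I am using grep's -x option (match the whole line) so that each result is an exact match rather than a match somewhere inside a longer word.

```shell
# ? : zero or one "u"
optional=$(printf '%s\n' color colour colouur | grep -Ex 'colou?r')

# * : zero or more "b"
star=$(printf '%s\n' ac abc abbc adc | grep -Ex 'ab*c')

# + : one or more "b"
plus=$(printf '%s\n' ac abc abbc | grep -Ex 'ab+c')

# {n} : exactly two "b"
exact=$(printf '%s\n' abc abbc abbbc | grep -Ex 'ab{2}c')

# {min,} : two or more "b"
at_least=$(printf '%s\n' abc abbc abbbc | grep -Ex 'ab{2,}c')

# {min,max} : one or two "b"
between=$(printf '%s\n' ac abc abbc abbbc | grep -Ex 'ab{1,2}c')

printf 'colou?r:\n%s\n' "$optional"
printf 'ab*c:\n%s\n' "$star"
printf 'ab+c:\n%s\n' "$plus"
printf 'ab{2}c:\n%s\n' "$exact"
printf 'ab{2,}c:\n%s\n' "$at_least"
printf 'ab{1,2}c:\n%s\n' "$between"
```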

Now, the example we used earlier, "c(a|o)(a|o)t", has a fairly barbaric way of quantifying, but it works, right? I made a small list of items to play around with for this.

Here is the list:

$ ls -1
    apricot.txt
    caat.txt
    caaat.txt
    caot.txt
    cat.txt
    coat.txt
    coot.txt
    cooot.txt
    cot.txt
    scaat.txt
    scoot.txt
    tomcat.txt

We can see in this first search, "c(a|o)t", that we are pulling all names that contain a c followed by an a or an o followed by a t, but not the names that have 2 As or 2 Os, since we didn't say we wanted that.

$ ls | grep -E "c(a|o)t"
    apricot.txt
    cat.txt
    cot.txt
    tomcat.txt

Next is the term we wanted to look at, "c(a|o)(a|o)t". It pulls all instances that contain a c followed by 2 letters that are each either a or o, followed by a t. This works, but it is a bit verbose. When working in limited space you want to save as much as you can, and that is where the quantifiers come into play.

$ ls | grep -E "c(a|o)(a|o)t"
    caat.txt
    caot.txt
    coat.txt
    coot.txt
    scaat.txt
    scoot.txt

Here we try the + quantifier, "c(a|o)+t", to say we want 1 or more instances of (a|o). While it does what we asked, it's not what we want. The + quantifier on a RegEx token says to match wherever that token occurs at least once. In this instance we said find (a|o) OR (a|o)(a|o) OR (a|o)(a|o)(a|o) and so on to infinity. This is very broad and encompasses way more than we want.

$ ls | grep -E "c(a|o)+t"
    apricot.txt
    caat.txt
    caaat.txt
    caot.txt
    cat.txt
    coat.txt
    coot.txt
    cooot.txt
    cot.txt
    scaat.txt
    scoot.txt
    tomcat.txt

Here's the one! In this example we specify that we want the previous token to occur exactly two times. This has the same scope as "c(a|o)(a|o)t" but takes two fewer characters to express. It may not seem like much, but when your expressions get super long, 2 wasted characters could become 4 or 8 or 100.

$ ls | grep -E "c(a|o){2}t"
    caat.txt
    caot.txt
    coat.txt
    coot.txt
    scaat.txt
    scoot.txt

There is nothing wrong with using either form; it's just good to know that there are multiple ways of expressing these, and in some circumstances there is a clear benefit to one over the other.
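One form from the quantifier list that the searches above did not exercise is {min,max}. As a sketch, the snippet below feeds the same file names through printf instead of ls (so it runs without creating the files) and asks for two or three of (a|o), then exactly three. The variable names are mine.

```shell
# Same names as the directory listing, supplied by printf instead of ls.
files=$(printf '%s\n' apricot.txt caat.txt caaat.txt caot.txt cat.txt \
    coat.txt coot.txt cooot.txt cot.txt scaat.txt scoot.txt tomcat.txt)

# {2,3}: two or three letters between the c and the t, each an a or an o.
two_or_three=$(printf '%s\n' "$files" | grep -E 'c(a|o){2,3}t')

# {3}: exactly three such letters.
exactly_three=$(printf '%s\n' "$files" | grep -E 'c(a|o){3}t')

printf 'c(a|o){2,3}t matched:\n%s\n' "$two_or_three"
printf 'c(a|o){3}t matched:\n%s\n' "$exactly_three"
```

The {2,3} search should drop only the single-vowel names (apricot, cat, cot, tomcat), and the {3} search should leave just caaat.txt and cooot.txt.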

Wildcards - By definition, a wildcard matches any character at its position in the sequence, so a search query of "a.2" would match any number of results (aa2, ab2, a12, a32, a?2, a&2, etc.) until every possible iteration of "a{single instance of anything here}2" is found.
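Here is a quick sketch of the . wildcard on its own and paired with the * quantifier from the list earlier. The sample strings and variable names are invented for the example.

```shell
# Hypothetical sample strings, one per line.
samples=$(printf '%s\n' a12 aa2 'a=2' a2 ab12 a0987652 b2)

# . stands for exactly one character, so "a.2" needs something between a and 2.
one_char=$(printf '%s\n' "$samples" | grep -E 'a.2')

# .* stands for any run of characters, including none at all.
any_run=$(printf '%s\n' "$samples" | grep -E 'a.*2')

printf 'a.2 matched:\n%s\n' "$one_char"
printf 'a.*2 matched:\n%s\n' "$any_run"
```

The first search should skip a2 (nothing sits between the a and the 2) as well as the longer strings, while the second should match everything except b2, which has no a at all.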

The two wildcards are . and *, the . meaning any character and the * (the same quantifier covered above) meaning any number of the previous token, including none. Combined, you can add .* to your query to say any number of any character. This becomes powerful when mixed with some of the meta characters I'll get into later, but some simple examples are:

"a.2" - find any iteration of "a{any single instance of any character}2"; results include a12, aa2, ag2, a=2, etc.

"a.*2" - find all instances where an a occurs and a 2 occurs sometime later; results include asdklfjhsdlfkj2, a2, ab2, a0987652, etc.

"^(file).*\\..*t$" - find all lines that start with "file", contain any number of any characters, and have a file extension that ends in "t". This example will be explained later.