REGEX Functions

A regular expression (shortened as regex or regexp also referred to as rational expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings. Sheetkraft contains various useful regex functions that can be helpful for matching, replacing, searching strings. SheetKraft supports Perl Compatible dialect of regex to create patterns for any operation (Search, Match, Replace, Split):

The pattern used are as follows:

. matches any single character, not just an actual dot character. Dot is thus a special character in a pattern.

\. matches a literal dot. The back-slash is an escape character. It escapes the meaning that the following character would have if the back-slash was not present. Since dot matches any character, escaping this meaning makes it match the literal dot only.

\\ matches a single literal back-slash. The first back-slash escapes the meaning of the second back-slash so that the second back-slash is no longer an escape character.

abc matches the literal text abc. All letters and digits match themselves. They do not have a special meaning.

\s matches a single space character. Other than the normal space (character code 32 there are various other characters like tab, non-breaking space etc that are considered spaces. \s matches any such space character.

\S matches any single character that is not a space. In general, if a back-slash followed by a lower-case letter matches something, a back-slash followed by an upper-case letter matches anything other than that something.

\d matches any single digit.

\w matches any single word character. A word character matches a letter or digit or underscore.

\d\d matches two consecutive digits.

\d\d\.\d\d matches two consecutive digits immediately followed by a literal dot immediately followed by two consecutive digits.

[a-z] matches any single character in the range lower case a to lower case z

[a-z0-9] matches any single character in the range a to z or 0 to 9.

[0-9_] matches any single character in the range 0 to 9 or an underscore.

[-+] matches any single character that is a literal minus or a literal plus. Minus does not have the same meaning as in the examples above because there is no character between the opening square bracket and itself.

[0\-9] matches any single character that is a 0 or a literal minus or a 9. Minus does not form a range because it is escaped.

[[\]] matches any single character that is an opening square bracket or a closing square bracket. Note that the second opening square bracket does not need to be escaped but the first closing square bracket needs to be escaped. This same pattern could also be written as [\[\]] where the second opening square bracket is escaped unnecessarily. This works because there is no meaning defined for an escaped opening square bracket in this context so the escape character is effectively ignored.

\d* matches zero or more consecutive digits

\d+ matches one or more consecutive digits

[a-z]+ matches one or more consecutive characters each of which is in the range a to z.

\d{1,3} matches at least 1 and at most 3 consecutive digits

\d{3,} matches 3 or more consecutive digits

v\d{3,} matches a literal v followed by 3 or more consecutive digits

\d{5} matches exactly 5 consecutive digits

? is a shortcut for {0,1}, just as * is a shortcut for {0,} and + is a shortcut for {1,}

abc|def matches either abc or def. | has very low precedence. So this is not ab followed by either of c or d followed by ef. It is either of abc or def.

We can restrict where the match occurs by using anchors ^ and $

^abc matches any text starting with abc. It does not match xabc

abc$ matches any text ending with abc. It does not match abcy

^a.c$ matches any text starting with a followed by any single character and ending with c. It matches abc but does not match xabc or abcy

^^abc is equivalent to ^abc. To get a literal ^, it will need to be escaped as in ^\^abc which matches ^abc.

Capture Groups: Regex can be used to not just match text but also to extract matching portions. To extract the entire match, no change to the pattern is required. To extract parts of the match, the parts need to be put in parenthesis. This is called a capturing group.

Examples:

(\d{3})\.(\d{4}) matches a sequence of 3 digits followed by a literal dot followed by a sequence of 4 digits.

(?:abc){2}(\d{3}) matches two consecutive occurrences of abc followed by three digits. , for example abcabc121. Parenthesis create a capturing group, but sometimes we want to group without capturing the grouped part. Here abc is grouped but not captured. Grouping it allows specification of the quantifier {2} that applies to the whole group. \d{3} is grouped and captured. Since there is no capturing group before it, it is numbered 1. Non-capturing groups are not numbered as there would be no point to numbering them.

((?:abc){2}|(?:def){2})xyz matches either abcabcxyz or defdefxyz. This contains a single capturing group.

((abc){2}|(def){2})xyz also matches either abcabcxyz or defdefxyz. But this contains three capturing groups. The number of the capturing group is based on the order in which the opening parenthesis of the capturing group occurs in the pattern. The first capturing group is (abc){2}|(def){2}, the second group is abc and the third is def.

[^\\/]+ matches a sequence of at least one character that is not a forward or backward slash. ^ negates the character class. This can be used to match a part of a file system path. For example [^\\/]+$ can be used to extract the file name from a path.

In regex functions, “$1”, “$2” is used, which depicts the 1st capturing group and 2nd capturing group, respectively. This can be used in 3rd argument of Regexsearch and RegexSearchAll functions, Replace functions.


Let us consider a simple case:

(Note: This is a simple application of regex functions. This can be implemented using various other search patterns. During Implementation you can come across various problem statements, for example: file path parsing, date extraction, Account number extraction from a bank statement, which can be achieved effectively using regex functions.)

Question: You need to extract the buyer, seller, number of units and date from the given data. Then from the original data you need to remove the Date to form a new data. Check whether units were not lost from the new data. After that finding the day, month, year from date. All the above tasks should be done using regex functions.

  • Regex Search: This function returns a first matched sub-string, given the regex search pattern, pattern formats and original string as inputs.

We can see the function arguments from UI of Microsoft Excel (See Figure)

First argument is input text. It can be a single value or a vertical array/range of values. Second argument is the regex pattern. There is an optional third argument i.e. capture groups, that can be specified to customize the result using replacement patterns.

Use search pattern with RegexSearch to extract buyer, Seller, No of units, Date.

  • Regex Replace: This function finds the sub-string matching the regex search pattern and replaces with the expression based on replace pattern.

We can see the function arguments from UI of Microsoft Excel (See Figure)

RegexReplace takes a text input (or vertical array), a pattern and a replacement pattern. There can be more than one match in the input. Each matched portion in the text input is replaced by using the replacement pattern. Any text before / between / after the matches is retained.

Replace Pattern:

RegexReplace.SK("abc123def456","\D\d","[$&]") returns ab[c1]23de[f4]56.

Here $& in the replacement pattern represents the full match. The match occurs twice. Characters outside the match are retained. As with RegexSearch, $n or ${n} can also be used in the replacement pattern. The syntax for the replacement pattern is the same.

A more common use of RegexReplace is to simply remove matched portions of the input text:

RegexReplace.SK("abc123def456","\D\d","") returns ab23de56.


Use RegexReplace to remove the date from the original text.

  • Regex Match: This function finds the substring matching the regex search pattern from the main string.

First argument is input text, second argument is the pattern. It returns true if the text matches the pattern.

The output of this function is either TRUE or FALSE. Use RegexMatch to check whether there are numbers in the new sentence.

  • Regex Split: This function splits the main string into substrings based on the delimiter and width as inputs.

We can see the function arguments from UI of Microsoft Excel (See Figure)

First argument is the input text (or a vertical array). Result will contain one row per value in the first argument.

Second argument is the pattern for the delimiter.

If third argument is true and there are consecutive occurrences of the delimiter, empty strings in the result will be skipped. In other words, consecutive delimiters will be treated as a single delimiter.

Fourth argument is the number of columns in the result. If this is omitted, the number of columns is determined dynamically. This can cause problems because the formula may expand to the right.

Use RegexSplit to extract the day, month, year from the Date. The 4th argument is width which is kept 3 for day, month and year.

  • Let us fill down for all the data in source column:

  • Note: Use of Filldown.sk should be avoided while using regex functions. FillDown is problematic because regexes are expensive to build. With FillDown, you pay the cost on each cell, whereas with RegionDown, the same regex is used for all values. RegexMatch does not allow array arguments. So if you want to match many different input values, you have to do a FillDown. This is inefficient as the regex pattern will have to be parsed and processed in each call to RegexMatch.

  • Regex Search All: A very similar function to Regexsearch.sk, the only difference is that regexsearch.sk returns the first matched sub-string, whereas, regexsearchall.sk returns all the sub-strings, that satisfies the regex search pattern from the original string. Let us see the example in an HTML file (See Figure).

We want to store all the details in the database. We will use lazy regex here instead of greedy. We will fetch table data using Regex 2a and Regex 2b between two strings (See Figure).

Additional Example: Regex Search with Capture Group

If the pattern has multiple capture groups, then the 3rd argument is used (if required).

For example: We want to maintain database of employees’ vehicle in some database (assuming 1 vehicle per head) in multiple capture groups such as State/Union Territory, RTO.

First capture group is State which is in text format. 2nd capture group is RTO format of which if not provided will capture data in text format only. As we want RTO data in number format, we need to specify its format as number.

Regex inside other functions/ (Indirect use of Regex through Sheetkraft UI)

  • In Filter: For Example, we want to filter employees based on their pan cards if entered in a correct manner. For that we can filter the pan using regex (See figures)

Final Output will look like this:

  • Regex Replace in Clean Text:

The same example in Regex Replace can be seen here. Regex replace can be used using sheetkraft UI in “More” options. (See Figure)