A Concise Guide for Strings and Regular Expressions in R
Tame your strings in an effortless way
Finding data is easy nowadays. Finding high-quality data is harder than ever. One of the chronic traits of low-quality data is that it is messy and imprecisely kept. No matter how much we data professionals love to talk about algorithms and model validation, what takes most of our time is cleaning and tidying the data.
In that sense, dealing with strings requires a slightly different skill set than dealing with, say, a data.frame or a list. The topic of this article, as you might have guessed already, is how to manipulate strings and tame them as effortlessly as possible. Let’s get started!
Pasting and Splitting
Pasting and splitting strings are two of the most common tasks we face.
For these two simple tasks, we have two equally simple functions: paste() and strsplit().
paste('Ugurcan' , 'Demir' , sep = " ")
## [1] "Ugurcan Demir"
strsplit("Ugurcan Demir" , split = " ")
## [[1]]
## [1] "Ugurcan" "Demir"
paste("The","United" ,"States" ,"of" ,"America" , sep = " ")
## [1] "The United States of America"
unlist(strsplit("The United States of America" , split = " "))
## [1] "The" "United" "States" "of" "America"
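As a small complement to the calls above, the sketch below shows two base R conveniences that often come up alongside paste(): the collapse argument and paste0(). The vector names and file names are made up purely for illustration.

```r
# Hypothetical vector, used only for illustration
words <- c("The", "United", "States")

# collapse joins the elements of ONE vector into a single string
joined <- paste(words, collapse = " ")
print(joined)

# paste0() behaves like paste() with sep = ""
file_names <- paste0("file_", 1:3, ".csv")
print(file_names)

# strsplit() always returns a list with one element per input string
parts <- strsplit(c("a-b", "c-d-e"), split = "-")
print(lengths(parts))
```

Because strsplit() returns a list, unlist() or `[[1]]` is usually needed when only one string was split.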
Number of Characters and Slicing
R and Python users tend to intersect a lot. Both of these languages are easy to pick up and with their versatile libraries they both target statisticians, machine learning practitioners, or anyone interested in scientific computing in general. But if you come from Python the examples we are about to show you may seem a little bit strange.
To find the total number of characters, for instance, length() would be an intuitive choice. But unlike in Python, that is not how it is done in R: length() counts the elements of a vector, so we use nchar() instead.
length("The United States of America")
## [1] 1
nchar("The United States of America")
## [1] 28
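The difference becomes clearer with a vector of several strings. The following is a minimal sketch; the vector `office` is an illustrative name, not something defined earlier in this article.

```r
# Hypothetical character vector with three elements
office <- c("Jim", "Dwight", "Pam")

# length() counts elements of the vector
n_elements <- length(office)
print(n_elements)

# nchar() is vectorised and counts characters in each element
n_chars <- nchar(office)
print(n_chars)
```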
Slicing is not regular slicing either. We have two functions for that: substr() and substring(). These twins behave the same when you specify both the start and the stop position. However, substring() has a default stop value (its parameters are named first and last) while substr() does not.
substr("The United States of America" , start = 10 , stop = 20)
## [1] "d States of"
substring("The United States of America" , first = 10 , last = 20)
## [1] "d States of"
Here is what happens when we don’t pass an argument to the “stop” or “last” parameter.
substring("The United States of America" , first = 10 )
## [1] "d States of America"
substr("The United States of America" , start = 10 )
## Error in substr("The United States of America", start = 10): argument "stop" is missing, with no default
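One common way to combine the two ideas so far is to compute the stop position with nchar(). The sketch below takes the last seven characters of a string; the variable names are illustrative.

```r
country <- "The United States of America"

# Total number of characters in the string
n <- nchar(country)

# Slice the last 7 characters by counting back from the end
last7 <- substr(country, start = n - 6, stop = n)
print(last7)
```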
regexec() , gregexpr() and grep()
I can hear you saying out loud, “How are we supposed to know the indices to pass as arguments in the first place?” Well, counting by hand is easy when you have a single string, but it is highly impractical when you have thousands or even millions of rows of data. Luckily, we are equipped with two beautiful functions for that.
Our first function, regexec(), is used to find the first occurrence of a substring inside a larger string.
regexec(pattern = 'United' , text = "The United States of America" )
## [[1]]
## [1] 5
## attr(,"match.length")
## [1] 6
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
regexec(pattern = 'U' , text = "The United States of America" )
## [[1]]
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gregexpr(), on the other hand, finds all occurrences of a substring inside a larger string.
gregexpr(pattern = 'e' , text = "The United States of America" )
## [[1]]
## [1] 3 9 16 24
## attr(,"match.length")
## [1] 1 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
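To connect these positions back to slicing, here is a hedged sketch that feeds a match position into substr(). It uses regexpr(), a sibling of regexec() in base R that is not covered in this article and returns a plain integer vector, and regmatches(), another base R helper that extracts the matched text directly.

```r
country <- "The United States of America"

# regexpr() returns the start of the first match, with the
# match length stored as an attribute
m <- regexpr(pattern = "United", text = country)
start <- as.integer(m)
len <- attr(m, "match.length")

# Use the position and length to slice the match out
found <- substr(country, start = start, stop = start + len - 1)
print(found)

# regmatches() does the extraction for us, here for all matches of "e"
all_e <- regmatches(country, gregexpr("e", country))[[1]]
print(all_e)
```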
The grep() function takes as its second argument not just a single string but a vector of strings, and returns the indices of the elements that contain the substring. If you set the value parameter to TRUE, it returns the elements themselves.
grep(pattern = 'wigh' , x = c('Michael' , "Jim" , "Dwight" , "Pam") )
## [1] 3
grep(pattern = 'm' , x = c('Michael' , "Jim" , "Dwight" , "Pam") )
## [1] 2 4
grep(pattern = 'm' , x = c('Michael' , "Jim" , "Dwight" , "Pam") , value = T)
## [1] "Jim" "Pam"
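When a TRUE/FALSE answer per element is more convenient than indices (for example, to filter rows), base R also offers grepl(). The sketch below reuses the same names; `office` is just an illustrative variable name.

```r
office <- c("Michael", "Jim", "Dwight", "Pam")

# grepl() returns one logical value per element
has_m <- grepl(pattern = "m", x = office)
print(has_m)

# The logical vector can be used directly for subsetting
print(office[has_m])

# ignore.case = TRUE makes the match case-insensitive
any_m <- grep(pattern = "m", x = office, ignore.case = TRUE, value = TRUE)
print(any_m)
```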
sub() and gsub()
sub() and gsub() take it to the next level: they replace the part of a larger string that matches the given substring with another string passed as an argument.
sub(pattern = 'm' , replacement = "n" , x = c('Michael',"Jim","Dwight","Pam"))
## [1] "Michael" "Jin" "Dwight" "Pan"
sub(pattern = "i" , replacement = "a" , x = "The United States of America")
## [1] "The Unated States of America"
As you may have noticed, the word ‘America’ stayed the same. That is because sub() replaces only the first occurrence of the substring in each string. To replace all of them, we use gsub().
gsub(pattern = "i" , replacement = "a" , x = "The United States of America")
## [1] "The Unated States of Ameraca"
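A typical cleaning task where the difference between the two functions matters is replacing separators. The sketch below uses made-up snake_case labels as input.

```r
# Hypothetical labels that use underscores instead of spaces
raw <- c("the_united_states", "new_south_wales")

# sub() touches only the first underscore in each element
first_only <- sub(pattern = "_", replacement = " ", x = raw)
print(first_only)

# gsub() replaces every underscore
all_replaced <- gsub(pattern = "_", replacement = " ", x = raw)
print(all_replaced)
```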
Regular Expressions (REGEX)
Until now, we have searched for simple substrings inside other strings. The substring to search for may not be as simple as in the examples so far. We may not even know which particular substring we need to find, but rather need to find anything that fits a blueprint we give. In these cases, we take advantage of regular expressions, or REGEX for short.
Regular expressions can be found in many programming languages with slightly different implementations. Their main job is to search for a pattern of string in a given larger string.
Regular expressions are not given the exact substring to look for but they rather look for a pattern that resembles the substring that they are given. They have their own mini-language to do that job. We will give a detailed explanation of all the metacharacters and rules that govern regular expressions.
Metacharacters
Before everything else, we should state that the symbols of the mini-language of regular expressions are called metacharacters, and they are the backbone of regular expressions.
- “$”
- “*”
- “+”
- “.”
- “?”
- “[ ]”
- “^”
- “{ }”
- “|”
- “( )”
- “\ ”
Now we shall explain what each of these metacharacters does.
Quantifiers
“?” , “*” , “+” and “{ }” are called quantifiers because they indicate how many times we wish to see a given pattern.
- “*” : matches if the item before appears 0 or more times
- “+” : matches if the item before appears 1 or more times
- “?” : matches if the item before appears 0 or 1 times
- “{ ,m}” : matches if the item before appears m or fewer times
- “{n, }” : matches if the item before appears n or more times
- “{n , m}” : matches if the item before appears between n and m times
- “{m}” : matches if the item before appears exactly m times
letter_vector <- c(
"AACACA","BBCCBC","CCABBB","ABABAA","ACBCAA","BCACBC",
"BABABA","CACABA","BBABAB","BCCBAB","CAABCC","BCCBCA",
"CAAABA","BAABCB","CCABBC","ABABBA","CABAAC","CAABCC",
"CABCAC","AABCAA","CAAACB","BBACCA","BCAAAB","BBACBC",
"CCCCBC","ACABCA","BCBBBC","AABBCC","CCBBBB","BBABBA","BBCAAC"
)
grep(pattern = "ABC" , x = letter_vector , value = T)
## [1] "CAABCC" "BAABCB" "CAABCC" "CABCAC" "AABCAA" "ACABCA"
grep(pattern = "AB*C" , x = letter_vector , value = T)
## [1] "AACACA" "ACBCAA" "BCACBC" "CACABA" "CAABCC" "BAABCB" "CCABBC" "CABAAC"
## [9] "CAABCC" "CABCAC" "AABCAA" "CAAACB" "BBACCA" "BBACBC" "ACABCA" "AABBCC"
## [17] "BBCAAC"
grep(pattern = "AB+C" , x = letter_vector , value = T)
## [1] "CAABCC" "BAABCB" "CCABBC" "CAABCC" "CABCAC" "AABCAA" "ACABCA" "AABBCC"
grep(pattern = "AB?C" , x = letter_vector , value = T)
## [1] "AACACA" "ACBCAA" "BCACBC" "CACABA" "CAABCC" "BAABCB" "CABAAC" "CAABCC"
## [9] "CABCAC" "AABCAA" "CAAACB" "BBACCA" "BBACBC" "ACABCA" "BBCAAC"
grep(pattern = "AB{,2}C" , x = letter_vector , value = T)
## [1] "AACACA" "ACBCAA" "BCACBC" "CACABA" "CAABCC" "BAABCB" "CCABBC" "CABAAC"
## [9] "CAABCC" "CABCAC" "AABCAA" "CAAACB" "BBACCA" "BBACBC" "ACABCA" "AABBCC"
## [17] "BBCAAC"
grep(pattern = "AB{2,}C" , x = letter_vector , value = T)
## [1] "CCABBC" "AABBCC"
grep(pattern = "AB{1,2}C" , x = letter_vector , value = T)
## [1] "CAABCC" "BAABCB" "CCABBC" "CAABCC" "CABCAC" "AABCAA" "ACABCA" "AABBCC"
grep(pattern = "AB{2}C" , x = letter_vector , value = T)
## [1] "CCABBC" "AABBCC"
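With thirty-one strings it can be hard to see at a glance why each one matched. The following sketch applies the same quantifiers to a tiny hand-made vector, where the number of B's between A and C grows from zero to three, and uses grepl() (base R) to show one TRUE/FALSE per element.

```r
# Hypothetical inputs: zero, one, two and three B's between A and C
x <- c("AC", "ABC", "ABBC", "ABBBC")

star     <- grepl("AB*C", x)      # zero or more B's
plus     <- grepl("AB+C", x)      # one or more B's
optional <- grepl("AB?C", x)      # zero or one B
ranged   <- grepl("AB{2,3}C", x)  # between two and three B's

print(rbind(star, plus, optional, ranged))
```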
Beginning and Ending Metacharacters
“^” and “$” represent the beginning and the end of the string, respectively. They are also called anchors in other resources. Anchors do not match any character; they match a position.
grep(pattern = "^A" , x = letter_vector , value = T)
## [1] "AACACA" "ABABAA" "ACBCAA" "ABABBA" "AABCAA" "ACABCA" "AABBCC"
grep(pattern = "C$" , x = letter_vector , value = T)
## [1] "BBCCBC" "BCACBC" "CAABCC" "CCABBC" "CABAAC" "CAABCC" "CABCAC" "BBACBC"
## [9] "CCCCBC" "BCBBBC" "AABBCC" "BBCAAC"
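Using both anchors in one pattern forces the whole string to match, not just a part of it. A minimal sketch with a made-up vector:

```r
# Hypothetical inputs
x <- c("ABC", "ABCA", "CABC")

starts_with_a <- grepl("^A", x)
ends_with_c   <- grepl("C$", x)

# Both anchors together: the string must be exactly "ABC"
exactly_abc <- grepl("^ABC$", x)

print(rbind(starts_with_a, ends_with_c, exactly_abc))
```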
Placeholder
“.” is our next metacharacter, and it matches ANY single character wherever it is used. The example below searches for any pattern that starts with “C”, ends with “A”, and has exactly two characters of any kind between them.
grep(pattern = "C..A" , x = letter_vector , value = T)
## [1] "AACACA" "ACBCAA" "CACABA" "BCCBAB" "BCCBCA" "CAAABA" "CABAAC" "CAAACB"
## [9] "BCAAAB"
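The placeholder really does accept any character, not only letters. The sketch below checks the same pattern against a few made-up strings.

```r
# Hypothetical inputs: letters, nothing, and punctuation between C and A
x <- c("CABA", "CA", "CXYA", "C--A")

# Each "." must be filled by exactly one character
two_between <- grepl("C..A", x)
print(two_between)
```

"CA" fails because there is nothing to fill the two placeholder positions.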
Sequences
The metacharacter “\ ”, when followed by certain key letters, stands for a whole class of characters (or a position) rather than a literal letter, and our string functions match accordingly. Note that inside an R string literal the backslash itself has to be doubled, so “\d” is written as "\\d". Below you can find a list of the key letters that are frequently used with “\ ”.
- “\d” = Digit
- “\D” = Not a digit
- “\w” = Word character (a-z, A-Z, 0-9 and underscore)
- “\W” = Not a word character
- “\s” = Whitespace
- “\S” = Not whitespace
- “\b” = Word Boundary
- “\B” = Not a word boundary
Let’s see some examples.
string1 <- 'My name is Ugurcan and I am 25.'
gsub(pattern = "\\d" , replacement = "-" , x = string1)
## [1] "My name is Ugurcan and I am --."
gsub(pattern = "\\s" , replacement = "-" , x = string1)
## [1] "My-name-is-Ugurcan-and-I-am-25."
gsub(pattern = "\\w" , replacement = "-" , x = string1)
## [1] "-- ---- -- ------- --- - -- --."
gsub(pattern = "\\b" , replacement = "-" , x = string1)
## [1] "-M-y- -n-a-m-e- -i-s- -U-g-u-r-c-a-n- -a-n-d- -I- -a-m- -2-5-.-"
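Sequences become more useful when combined with a quantifier. The following sketch pulls the number out of the same sentence and splits it on runs of whitespace; regmatches() and regexpr() are base R helpers not covered in this article, so treat this as one possible approach rather than the only one.

```r
string1 <- "My name is Ugurcan and I am 25."

# "\\d+" matches one or more consecutive digits
age_text <- regmatches(string1, regexpr("\\d+", string1))
age <- as.integer(age_text)
print(age)

# "\\s+" splits on one or more whitespace characters
words <- strsplit(string1, split = "\\s+")[[1]]
print(words)
```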
Character Classes
“[ ]” is another metacharacter, and it is frequently used to form sophisticated patterns to analyze complicated and unstructured text data. We can put several characters inside the square brackets; the class then matches exactly one character, which must be one of those listed. The order doesn’t matter, and we can use a hyphen to specify a range of letters or digits.
grep(pattern = "[zp]" , x = state.name , value = T)
## [1] "Arizona" "Mississippi" "New Hampshire"
grep(pattern = "[b-d]" , x = state.name , value = T)
## [1] "Alabama" "Colorado" "Connecticut" "Florida"
## [5] "Idaho" "Indiana" "Kentucky" "Maryland"
## [9] "Massachusetts" "Michigan" "Nebraska" "Nevada"
## [13] "New Mexico" "Rhode Island" "Wisconsin"
grep(pattern = "[od]$" , x = state.name , value = T)
## [1] "Colorado" "Idaho" "Maryland" "New Mexico" "Ohio"
## [6] "Rhode Island"
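Ranges also work for digits and can be chained with anchors. As a small sketch, the pattern below checks whether a code starts with an upper-case letter immediately followed by a digit; the `codes` vector is invented for illustration.

```r
# Hypothetical identifiers
codes <- c("A1", "B22", "7C", "DD")

# One upper-case letter at the start, then one digit
letter_then_digit <- grepl("^[A-Z][0-9]", codes)
print(letter_then_digit)
```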
For certain occasions you can use built-in class names inside square brackets too. Below we give a complete list of those class names.
- [:alnum:] = Alphanumeric characters: [:alpha:] and [:digit:]
- [:alpha:] = Alphabetic characters: [:lower:] and [:upper:]
- [:blank:] = Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space
- [:cntrl:] = Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In another character set, these are the equivalent characters, if any
- [:digit:] = Digits: 0 1 2 3 4 5 6 7 8 9
- [:graph:] = Graphical characters: [:alnum:] and [:punct:]
- [:lower:] = Lower-case letters in the current locale
- [:print:] = Printable characters: [:alnum:], [:punct:] and space
- [:punct:] = Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
- [:space:] = Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters
- [:upper:] = Upper-case letters in the current locale
- [:xdigit:] = Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
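One detail that is easy to miss: the class name, including its own brackets, has to sit inside an outer pair of square brackets, so the pattern is written "[[:digit:]]" and not "[:digit:]". A short sketch using the same sentence as in the Sequences section:

```r
string1 <- "My name is Ugurcan and I am 25."

# Replace every digit with a dash
no_digits <- gsub("[[:digit:]]", "-", string1)
print(no_digits)

# Remove punctuation
no_punct <- gsub("[[:punct:]]", "", string1)
print(no_punct)

# Mask upper-case letters
masked <- gsub("[[:upper:]]", "*", string1)
print(masked)
```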
Grouping and OR Operator
Our last two metacharacters are “( )” and “|”, and they are usually used together. The grouping metacharacter “( )” separates different sets of patterns, and the OR operator “|” works like a regular OR operator: it matches if either of the alternatives matches. Let’s illustrate them with examples.
grep(pattern = "(th|la)" , x = state.name , value = T)
## [1] "Alabama" "Alaska" "Delaware" "Maryland"
## [5] "North Carolina" "North Dakota" "Oklahoma" "Rhode Island"
## [9] "South Carolina" "South Dakota"
grep(pattern = "^New (Y|J)" , x = state.name , value = T)
## [1] "New Jersey" "New York"
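The parentheses matter because “|” otherwise splits the entire pattern in two. The sketch below contrasts a grouped and an ungrouped alternation on a few made-up words.

```r
# Hypothetical inputs
x <- c("gray", "grey", "groy", "hockey")

# Grouped: "gr", then "a" or "e", then "y"
grouped <- grepl("gr(a|e)y", x)
print(grouped)

# Ungrouped: either "gra" anywhere or "ey" anywhere
ungrouped <- grepl("gra|ey", x)
print(ungrouped)
```

Without the group, "hockey" matches too, because it contains "ey".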
Escaping Metacharacters
We have seen all of the metacharacters, but what if our search pattern includes one of these metacharacters?
This is a question you are very welcome to ask. In that case, we need to tell R that we are not using these characters as metacharacters but as regular characters.
To do that, we escape the metacharacter by adding a backslash right before it. Since the backslash is itself an escape character in R string literals, we add another backslash to escape it too. Here are some examples.
string2 <- c("Lionel Messi\\ PSG" , 'file_name$' , "{2022}")
grep(pattern = "\\$" , x = string2 , value = T)
## [1] "file_name$"
grep(pattern = "\\{" , x = string2 , value = T)
## [1] "{2022}"
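When the whole pattern is meant literally, an alternative to escaping is the fixed = TRUE argument, which the base R matching functions accept and which switches off regular-expression interpretation altogether. A minimal sketch with invented file names:

```r
# Hypothetical file names
files <- c("file_name$", "price.csv", "pricexcsv")

# Unescaped "." is a placeholder and matches everything here
unescaped <- grepl(".", files)
print(unescaped)

# Escaped "." matches only a literal dot
escaped <- grepl("\\.", files)
print(escaped)

# fixed = TRUE treats the pattern as plain text, no escaping needed
literal <- grepl(".", files, fixed = TRUE)
print(literal)
```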
In this article, we first discussed all of the functions we use with strings. Then we covered eleven metacharacters with which we can create complex patterns. I hope this article can be a reference guide for you all.