Symbols & Patterns

Published

2024-04-26

Caution

This section is under development. Thank you for your patience.

Regular expressions are powerful tools for searching and manipulating text data. A regular expression (or ‘regex’) is made up of special symbols that define specific patterns to be identified or transformed. The regex patterns are not the data themselves, but rather a framework for locating or modifying text data. In this chapter, we’ll explore wildcards, regular expressions, and other special characters.

Wildcards

Wildcards (also known as glob patterns) are mostly used in commands to match filenames, paths, or filter text (ls, cp, mv, rm, etc.). Arguments can include wildcards, which the shell expands into a list of files or directories that match the pattern.

Asterisk: *

* is a wildcard for matching zero or more characters.

Example

ls data/*.md lists all files in the data/ directory that end with .md:

ls data/*.md
# data/README.md

Question Mark: ?

? is the wildcard for matching exactly one character.

Example

ls myfile?.txt lists files like myfile2.txt, but not myfile.txt or my file 3.txt:

ls myfile?.txt
# myfile2.txt

Square brackets: []

[abc]: Matches any one character listed (a, b, or c).

Example

[new]: Matches any one of the listed characters (n, e, or w).

ls [new]*.txt
# newfile.txt

Example

[a-p]: Matches any one character in the range a to p.

ls data/[a-p]*
# data/ajperlis_epigrams.txt
# data/music_vids.tsv
# data/pwrds.csv
# data/pwrds.tsv
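
The shell, not the command, expands wildcards. As a minimal sketch, the snippet below builds a scratch directory with a few hypothetical file names (they are assumptions for illustration, not files from this chapter) and uses echo to show what each pattern expands to:

```shell
# Work in a scratch directory so the example is self-contained
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch notes.md myfile.txt myfile2.txt newfile.txt

# The shell expands each pattern before echo runs
star=$(echo *.md)
one=$(echo myfile?.txt)
set_match=$(echo [new]*.txt)
quoted=$(echo '*.md')   # quoting suppresses expansion

echo "$star"       # notes.md
echo "$one"        # myfile2.txt
echo "$set_match"  # newfile.txt
echo "$quoted"     # *.md

# Return to the previous directory and clean up
cd "$OLDPWD" && rm -rf "$tmp"
```

Previewing a pattern with echo before passing it to rm or mv is a cheap way to check what it will match.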

Regular Expressions

As introduced above, a regular expression (or ‘regex’) is a pattern built from special symbols that describes the text to be identified or transformed.

Regular expressions operate on text: sequences of characters that can include letters, digits, punctuation, and other character types. The text is the data being searched; the regex describes the pattern to look for in it.

Regular expressions are more complex than wildcards, and are typically used with tools like grep (global regular expression print), sed (stream editor), and awk.

Dot: .

. matches any single character except a newline.

Examples

Matches lines containing “password” or similar patterns where any character stands between ‘p’ and ‘ssword’.

grep "p.ssword" data/pwrds.csv
# password,rank,strength,online_crack
# password,1,8,6.91 years

Replaces “password” where any character is between ‘p’ and ‘ssword’ with “p@ssword”.

sed 's/p.ssword/p@ssword/' data/pwrds.csv | head -n2
# p@ssword,rank,strength,online_crack
# p@ssword,1,8,6.91 years

Select records where “password” or similar patterns appear, with any single character between ‘p’ and ‘ssw’ and any single character between ‘ssw’ and ‘rd’.

awk '/p.ssw.rd/' data/pwrds.csv
# password,rank,strength,online_crack
# password,1,8,6.91 years
# passw0rd,500,28,92.27 years
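
Because . matches any character, a literal period has to be escaped with a backslash. Here is a small sketch using made-up input lines (not data from this chapter) that compares the two patterns:

```shell
# Hypothetical sample input: three short lines
sample=$(printf 'a.c\nabc\na-c\n')

# Unescaped dot: any single character may sit between a and c
any=$(printf '%s\n' "$sample" | grep -c 'a.c')

# Escaped dot: only a literal period matches
literal=$(printf '%s\n' "$sample" | grep -c 'a\.c')

echo "$any"      # 3
echo "$literal"  # 1
```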

Asterisk: *

* matches zero or more of the preceding element.

Examples

We can use grep to find lines where “i” is followed by zero or more “l”s. Because zero “l”s is allowed, this matches every line that contains an “i”:

grep 'il*' data/wu_tang.txt
# RZA   Robert Diggs
# GZA   Gary Grice
# Method Man    Clifford Smith
# Ghostface Killah  Dennis Coles
# U-God     Lamont Hawkins
# Masta Killa   Jamel Irief
# Cappadonna    Darryl Hill
# Ol Dirty Bastard  Russell Tyrone Jones

We can use sed to replace two or more "l"s with 11:

sed 's/lll*/11/g' data/wu_tang.txt
# Member    Name
# RZA   Robert Diggs
# GZA   Gary Grice
# Method Man    Clifford Smith
# Raekwon the Chef  Corey Woods
# Ghostface Ki11ah  Dennis Coles
# Inspectah Deck    Jason Hunter
# U-God     Lamont Hawkins
# Masta Ki11a   Jamel Irief
# Cappadonna    Darryl Hi11
# Ol Dirty Bastard  Russe11 Tyrone Jones

Print lines that start with "R", optionally preceded by zero or more spaces:

awk '/^ *R/' data/wu_tang.txt
# RZA   Robert Diggs
# Raekwon the Chef  Corey Woods
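
Since * also accepts zero occurrences, a pattern like 'il*' is looser than it looks. The sketch below uses three made-up words to show the difference between allowing zero "l"s and requiring at least one:

```shell
# Hypothetical sample input
names=$(printf 'Hi\nHill\nHall\n')

# 'il*'  : i followed by zero or more l, so any line with an i matches
zero_or_more=$(printf '%s\n' "$names" | grep -c 'il*')

# 'ill*' : i, one l, then zero or more l, so at least one l is required
at_least_one=$(printf '%s\n' "$names" | grep -c 'ill*')

echo "$zero_or_more"  # 2
echo "$at_least_one"  # 1
```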

Plus: +

+ matches one or more occurrences of the preceding element.

Examples

Use grep with extended regular expressions to find ‘i’ followed by one or more ’l’s:

grep -E 'il+' data/wu_tang.txt
# Ghostface Killah  Dennis Coles
# Masta Killa   Jamel Irief
# Cappadonna    Darryl Hill

Replace each run of one or more "a"s with @:

sed -E 's/a+/@/g' data/wu_tang.txt
# Member    N@me
# RZA   Robert Diggs
# GZA   G@ry Grice
# Method M@n    Clifford Smith
# R@ekwon the Chef  Corey Woods
# Ghostf@ce Kill@h  Dennis Coles
# Inspect@h Deck    J@son Hunter
# U-God     L@mont H@wkins
# M@st@ Kill@   J@mel Irief
# C@pp@donn@    D@rryl Hill
# Ol Dirty B@st@rd  Russell Tyrone Jones

In grep and sed, the + operator needs the -E option to enable extended regular expressions; awk supports + without any option.

Print lines with text containing one or more "Z"s:

awk '/Z+/' data/wu_tang.txt
# RZA   Robert Diggs
# GZA   Gary Grice
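
To see why -E matters, here is a sketch with made-up input. Without -E, grep treats + as an ordinary character; the POSIX basic-regex spelling of "one or more" is the interval \{1,\}:

```shell
# Hypothetical sample input; the last line contains a literal plus sign
input=$(printf 'Killah\nKing\nil+\n')

# Extended regex: + means "one or more l"
ere=$(printf '%s\n' "$input" | grep -c -E 'il+')

# Basic regex: + is a literal character, so only the line "il+" matches
bre=$(printf '%s\n' "$input" | grep -c 'il+')

# Basic regex equivalent of + using an interval
interval=$(printf '%s\n' "$input" | grep -c 'il\{1,\}')

echo "$ere"       # 2
echo "$bre"       # 1
echo "$interval"  # 2
```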

Question Mark: ?

? makes the preceding element optional (matches zero or one occurrence).

Examples

Use grep with extended regular expressions to find lines containing ‘Kil’ or ‘Kill’ (the second ‘l’ is optional):

grep -E 'Kill?' data/wu_tang.txt
# Ghostface Killah  Dennis Coles
# Masta Killa   Jamel Irief

sed: Replace ‘Ghostface’ (or ‘Ghostfac’, since the final ‘e’ is optional) with ‘Ghost Face’:

sed -E 's/Ghostface?/Ghost Face/' data/wu_tang.txt
# Member    Name
# RZA   Robert Diggs
# GZA   Gary Grice
# Method Man    Clifford Smith
# Raekwon the Chef  Corey Woods
# Ghost Face Killah Dennis Coles
# Inspectah Deck    Jason Hunter
# U-God     Lamont Hawkins
# Masta Killa   Jamel Irief
# Cappadonna    Darryl Hill
# Ol Dirty Bastard  Russell Tyrone Jones

awk: Print lines containing ‘Killa’ or ‘Killah’ (the ‘h’ is optional):

awk '/Killah?/' data/wu_tang.txt
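
As a further sketch of "zero or one", the pattern colou?r below accepts both spellings of a word but rejects a doubled "u". The input string is made up for illustration:

```shell
# u? allows zero or one u, so "color" and "colour" match but "colouur" does not
result=$(echo 'color colour colouur' | sed -E 's/colou?r/X/g')
echo "$result"   # X X colouur
```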

Character Set: [abc]

[abc] matches any single character listed in the set.

Example

Use grep to find lines containing ‘a’, ‘b’, or ‘c’:

grep '[abc]' filename.txt

Caret: ^

^ matches the start of a line.

Example

Use grep to find lines that start with ‘start’:

grep '^start' filename.txt

Dollar: $

$ matches the end of a line.

Example

Use grep to find lines that end with ‘end’:

grep 'end$' filename.txt
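
The three examples above refer to a placeholder filename.txt. As a self-contained sketch, the same patterns can be tried on a few made-up lines piped in with printf:

```shell
# Hypothetical sample input
lines=$(printf 'start here\nthe end\nrestart\nzzz\n')

# [abc]: lines containing a, b, or c
set_count=$(printf '%s\n' "$lines" | grep -c '[abc]')

# ^start: only lines that begin with "start" ("restart" does not)
starts=$(printf '%s\n' "$lines" | grep '^start')

# end$: only lines that finish with "end"
ends=$(printf '%s\n' "$lines" | grep 'end$')

echo "$set_count"  # 2
echo "$starts"     # start here
echo "$ends"       # the end
```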

These patterns are extremely powerful in scripting and command-line operations for filtering and manipulating text data efficiently. Here’s how you might use them in combination across different tools:

  • sed for substitution: Replace ‘foo’ with ‘bar’ only if ‘foo’ appears at the beginning of a line:
sed 's/^foo/bar/' filename.txt
  • awk for selection: Print lines that begin with ‘start’:
awk '/^start/ {print $0}' filename.txt
  • perl for advanced manipulation: Increment numbers found at the end of each line:
perl -pe 's/(\d+)$/ $1+1 /e' filename.txt

Special Characters

Characters such as spaces, quotes, and others have special meanings in the shell. They need to be quoted or escaped when they should be taken literally within arguments.

Braces: {}

Similar to wildcards, brace expansion ({}) generates multiple text strings from a pattern containing braces. Unlike wildcards, the generated strings do not have to match existing files.

Example

cat data/wu_tang.{tsv,dat} prints the contents of data/wu_tang.tsv followed by data/wu_tang.dat:

cat data/wu_tang.{tsv,dat}
# Member    Name
# RZA   Robert Diggs
# GZA   Gary Grice
# Method Man    Clifford Smith
# Raekwon the Chef  Corey Woods
# Ghostface Killah  Dennis Coles
# Inspectah Deck    Jason Hunter
# U-God Lamont Hawkins
# Masta Killa   Jamel Irief
# Cappadonna    Darryl Hill
# Ol Dirty Bastard  Russell Tyrone Jones
# |Member           |Name                 |
# |RZA              |Robert Diggs         |
# |GZA              |Gary Grice           |
# |Method Man       |Clifford Smith       |
# |Raekwon the Chef |Corey Woods          |
# |Ghostface Killah |Dennis Coles         |
# |Inspectah Deck   |Jason Hunter         |
# |U-God            |Lamont Hawkins       |
# |Masta Killa      |Jamel Irief          |
# |Cappadonna       |Darryl Hill          |
# |Ol Dirty Bastard |Russell Tyrone Jones |

Expands into:

cat data/wu_tang.tsv data/wu_tang.dat
# Member    Name
# RZA   Robert Diggs
# GZA   Gary Grice
# Method Man    Clifford Smith
# Raekwon the Chef  Corey Woods
# Ghostface Killah  Dennis Coles
# Inspectah Deck    Jason Hunter
# U-God Lamont Hawkins
# Masta Killa   Jamel Irief
# Cappadonna    Darryl Hill
# Ol Dirty Bastard  Russell Tyrone Jones
# |Member           |Name                 |
# |RZA              |Robert Diggs         |
# |GZA              |Gary Grice           |
# |Method Man       |Clifford Smith       |
# |Raekwon the Chef |Corey Woods          |
# |Ghostface Killah |Dennis Coles         |
# |Inspectah Deck   |Jason Hunter         |
# |U-God            |Lamont Hawkins       |
# |Masta Killa      |Jamel Irief          |
# |Cappadonna       |Darryl Hill          |
# |Ol Dirty Bastard |Russell Tyrone Jones |
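
A handy way to preview a brace expansion is to pass it to echo. Brace expansion is a feature of shells such as bash and zsh rather than of POSIX sh, so this sketch calls bash explicitly; the report_ file names are hypothetical:

```shell
# echo previews the expansion without reading any files
expanded=$(bash -c 'echo data/wu_tang.{tsv,dat}')
echo "$expanded"   # data/wu_tang.tsv data/wu_tang.dat

# The text before and after the braces is attached to every alternative
reports=$(bash -c 'echo report_{jan,feb,mar}.csv')
echo "$reports"    # report_jan.csv report_feb.csv report_mar.csv
```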

Backslash: \

\ escapes the following character, nullifying its special meaning.

Example

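echo File name with spaces \& special characters prints the text with a literal ampersand, because the backslash stops the shell from treating & as the background operator. (Inside double quotes the backslash before & would be printed as-is, so the example is left unquoted.)

echo File name with spaces \& special characters
# File name with spaces & special characters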
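
Backslashes can also escape spaces and dollar signs. The sketch below is a minimal illustration in a scratch directory; the file name my file.txt is made up:

```shell
tmp=$(mktemp -d)

# The escaped space keeps "my file.txt" together as one argument
touch "$tmp"/my\ file.txt
name=$(ls "$tmp")
echo "$name"     # my file.txt

# The escaped dollar sign is printed instead of starting a variable
dollar=$(echo \$HOME)
echo "$dollar"   # $HOME

rm -rf "$tmp"
```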

Single quotes: ''

Single quotes (' ') treat every character literally, ignoring the special meaning of all characters.

Example

echo '$HOME' prints $HOME, not the path to the home directory:

echo '$HOME'
# $HOME

Double quotes: ""

Double quotes (" ") treat most special characters literally, but the dollar sign ($), backticks (` `), and backslash (\) keep their special meaning, so variables and command substitutions are still expanded.

Example

echo "$HOME" prints the path to the home directory:

echo "$HOME"
#> /Users/username
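
The difference between the two kinds of quotes is easiest to see side by side. This sketch uses a hypothetical variable called name:

```shell
name="world"   # hypothetical variable for illustration

single=$(echo 'Hello $name')     # single quotes: no expansion
double=$(echo "Hello $name")     # double quotes: $name is expanded
escaped=$(echo "Hello \$name")   # backslash inside double quotes blocks expansion

echo "$single"    # Hello $name
echo "$double"    # Hello world
echo "$escaped"   # Hello $name
```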

Tilde: ~

~ represents the home directory of the current user.

Example

List the items in the user’s home directory:

ls ~
#> Applications
#> Creative Cloud Files
#> Desktop
#> Documents
#> Downloads
#> Dropbox
#> Fonts
#> Library
#> Movies
#> Music
#> Pictures
#> Public
#> R
#> Themes

Dollar Sign: $

$ indicates a variable.

Example

echo $PATH prints the value of the PATH environment variable:

echo $PATH
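
The same syntax works for variables we define ourselves. In this sketch the variable dir and the suffix _backup are made up; the ${ } form marks where the variable name ends when other text follows it directly:

```shell
dir="data"   # hypothetical variable (no spaces around = in an assignment)

plain=$(echo "$dir")
joined=$(echo "${dir}_backup")   # braces separate the name from the suffix

echo "$plain"    # data
echo "$joined"   # data_backup
```

Without the braces, the shell would look for a variable named dir_backup instead.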

Ampersand: &

& runs a command in the background.

Example

firefox & opens Firefox in the background, allowing the terminal to be used for other commands.

firefox &
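
For a version that runs on any system, the sketch below puts sleep in the background in place of a graphical program. The special parameter $! holds the process ID of the most recent background job, and wait pauses until that job finishes:

```shell
sleep 1 &          # starts in the background; the shell continues immediately
pid=$!             # process ID of the background job
echo "started job $pid"

wait "$pid"        # block until the background job has finished
status=$?
echo "exit status: $status"   # exit status: 0
```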

Semicolon: ;

; separates multiple commands to be run in sequence.

Example

cd data; ls changes the directory to data and then lists its contents:

cd data; ls
# README.md
# ajperlis_epigrams.txt
# music_vids.tsv
# pwrds.csv
# pwrds.tsv
# roxanne.txt
# roxanne_orig.txt
# roxanne_rev.txt
# trees.tsv
# vg_hof.tsv
# who_tb_data.tsv
# who_tb_data.txt
# wu_tang.csv
# wu_tang.dat
# wu_tang.tsv
# wu_tang.txt

Greater Than: >

> redirects a command's output to a file or a device, overwriting the file if it already exists.

Example

echo "This is my 2nd file" > myfile2.txt writes "This is my 2nd file" into myfile2.txt:

echo "This is my 2nd file" > myfile2.txt

Less Than: <

< redirects a command's input so that it is read from a file or a device.

Example

Then wc < myfile2.txt counts the lines, words, and bytes in myfile2.txt:

wc < myfile2.txt
#        1       5      20
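
The two redirection operators can be combined in a short round trip. This sketch writes to a temporary file (created with mktemp, so no chapter files are touched) and shows that a second > replaces the earlier contents:

```shell
tmp=$(mktemp)

echo "first"  > "$tmp"
echo "second" > "$tmp"    # > truncates the file, so "first" is gone

# read takes its input from the file through <
read -r line < "$tmp"
echo "$line"   # second

rm -f "$tmp"
```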

Parentheses: ()

Parentheses group commands so that they run together in a subshell; with a leading dollar sign, $( ), they perform command substitution.

Example

(cd data; ls) runs ls in data without changing the current directory:

(cd data; ls)
# README.md
# ajperlis_epigrams.txt
# music_vids.tsv
# pwrds.csv
# pwrds.tsv
# roxanne.txt
# roxanne_orig.txt
# roxanne_rev.txt
# trees.tsv
# vg_hof.tsv
# who_tb_data.tsv
# who_tb_data.txt
# wu_tang.csv
# wu_tang.dat
# wu_tang.tsv
# wu_tang.txt

$(command) runs command and substitutes its output into the surrounding command line.
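
As a minimal sketch of command substitution, the snippet below captures the output of pwd and basename in variables. The path /usr/local/bin is only an example string; basename does not need it to exist:

```shell
before=$(pwd)
inside=$(cd / && pwd)   # the cd happens in a subshell
after=$(pwd)

echo "$inside"          # /
[ "$before" = "$after" ] && echo "still in $before"

# The captured output can be used anywhere a string is expected
base=$(basename /usr/local/bin)
echo "Last component: $base"   # Last component: bin
```

Like ( ), command substitution runs in a subshell, so the cd inside it does not affect the current directory.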
