ls data/*.md
# data/README.md
Symbols & Patterns
Regular expressions are powerful tools for searching and manipulating text data. A regular expression (or ‘regex’) is made up of special symbols that define specific patterns to be identified or transformed. The regex patterns are not the data themselves, but rather a framework for locating or modifying text data. In this chapter, we’ll explore wildcards, regular expressions, and other special characters.
Wildcards
Wildcards (also known as glob patterns) are mostly used in commands (ls, cp, mv, rm, etc.) to match filenames or paths, or to filter text. Arguments can include wildcards, which the shell expands into a list of files or directories that match the pattern.
Asterisk: *
* is a wildcard for matching zero or more characters.
Example
ls data/*.md lists all files in the data/ directory that end with .md (here, only data/README.md).
Question Mark: ?
? is the wildcard for matching exactly one character.
Example
ls myfile?.txt lists files like myfile2.txt, but not myfile.txt or my file 3.txt:
ls myfile?.txt
# myfile2.txt
Square brackets: []
[abc] matches any one character listed (a, b, or c).
Example
[new] matches any one of the listed characters (n, e, or w):
ls [new]*.txt
# newfile.txt
Example
[a-p] matches any one character in the range a to p:
ls data/[a-p]*
# data/ajperlis_epigrams.txt
# data/music_vids.tsv
# data/pwrds.csv
# data/pwrds.tsv
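The three wildcard forms can be tried safely in a scratch directory. The sketch below is only an illustration: the file names and the start_dir and demo_dir variables are made up for the example, and echo is used instead of ls so that what gets printed is the shell's own expansion.

```shell
# Create a throwaway directory with a few hypothetical files.
start_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch README.md myfile.txt myfile2.txt newfile.txt zebra.csv

# The shell expands each pattern before echo runs.
md_files=$(echo *.md)          # * matches zero or more characters
one_char=$(echo myfile?.txt)   # ? matches exactly one character
from_set=$(echo [new]*)        # [new] matches one of n, e, or w
from_range=$(echo [x-z]*)      # [x-z] matches one character from x to z

echo "$md_files"      # README.md
echo "$one_char"      # myfile2.txt
echo "$from_set"      # newfile.txt
echo "$from_range"    # zebra.csv

# Return to where we started and clean up.
cd "$start_dir" || exit 1
rm -rf "$demo_dir"
```

Note that myfile.txt is not matched by myfile?.txt, because ? must consume exactly one character.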
Regular Expressions
Regular expressions (or the singular ‘regex’) are powerful tools for searching and manipulating text data. A regex is made up of special symbols that define specific patterns to be identified or transformed.
Regular expressions operate on text: sequences of characters that can include letters, digits, punctuation, and other character types. Text is the ‘data’ or ‘medium’ in which the regex searches for the patterns it describes.
Regular expressions are more complex than wildcards, and are typically used with tools like grep (global regular expression print), sed (stream editor), and awk.
Dot: .
. matches any single character except a newline.
Examples
Match lines containing “password” or similar patterns where any single character stands between ‘p’ and ‘ssword’:
grep "p.ssword" data/pwrds.csv
# password,rank,strength,online_crack
# password,1,8,6.91 years
Replace “password”, or any similar text with a single character between ‘p’ and ‘ssword’, with “p@ssword”:
sed 's/p.ssword/p@ssword/' data/pwrds.csv | head -n2
# p@ssword,rank,strength,online_crack
# p@ssword,1,8,6.91 years
Select records where “password” or a similar pattern appears, with any single character between ‘p’ and ‘ssw’ and another between ‘ssw’ and ‘rd’:
awk '/p.ssw.rd/' data/pwrds.csv
# password,rank,strength,online_crack
# password,1,8,6.91 years
# passw0rd,500,28,92.27 years
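To see exactly which strings the dot accepts, it helps to run the same patterns on a handful of known lines. The rows below are hypothetical sample data (loosely modeled on data/pwrds.csv, not taken from it), and the variable names are made up for this sketch.

```shell
# Hypothetical sample rows; only the first column matters here.
sample='password,1
passw0rd,500
pssword,9
p@ssword,7'

# p.ssword: exactly one character must sit between "p" and "ssword".
dot_one=$(printf '%s\n' "$sample" | grep 'p.ssword')

# p.ssw.rd: one arbitrary character after "p" and another after "ssw".
dot_two=$(printf '%s\n' "$sample" | grep 'p.ssw.rd')

printf '%s\n' "$dot_one"
# password,1
# p@ssword,7

printf '%s\n' "$dot_two"
# password,1
# passw0rd,500
# p@ssword,7
```

The row pssword,9 never matches, because the dot requires a character to be present; it does not mean "optional".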
Asterisk: *
* matches zero or more of the preceding element.
Examples
We can use grep to find lines where “i” is followed by zero or more “l”s. Because zero “l”s is allowed, this matches every line that contains an “i”:
grep 'il*' data/wu_tang.txt
# RZA Robert Diggs
# GZA Gary Grice
# Method Man Clifford Smith
# Ghostface Killah Dennis Coles
# U-God Lamont Hawkins
# Masta Killa Jamel Irief
# Cappadonna Darryl Hill
# Ol Dirty Bastard Russell Tyrone Jones
We can use sed to replace two or more "l"s with 11:
sed 's/lll*/11/g' data/wu_tang.txt
# Member Name
# RZA Robert Diggs
# GZA Gary Grice
# Method Man Clifford Smith
# Raekwon the Chef Corey Woods
# Ghostface Ki11ah Dennis Coles
# Inspectah Deck Jason Hunter
# U-God Lamont Hawkins
# Masta Ki11a Jamel Irief
# Cappadonna Darryl Hi11
# Ol Dirty Bastard Russe11 Tyrone Jones
Print lines that start with zero or more spaces followed by "R":
awk '/^ *R/' data/wu_tang.txt
# RZA Robert Diggs
# Raekwon the Chef Corey Woods
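The difference between "zero or more" and "at least one" is easy to miss, so here is a small sketch on made-up input. The names list and the variable names are assumptions for illustration; grep -c is used to count matching lines rather than print them.

```shell
# Hypothetical input: three lines with "il", one with "i" but no "l" after it.
names='Killah
Kilo
Hill
Kite'

# il* accepts zero "l"s, so any line containing "i" matches.
star_any=$(printf '%s\n' "$names" | grep -c 'il*')

# ill* needs one literal "l" before the optional run of extra "l"s.
star_one=$(printf '%s\n' "$names" | grep -c 'ill*')

# lll* means two or more "l"s; each such run is replaced with 11.
star_sub=$(printf '%s\n' "$names" | sed 's/lll*/11/g')

echo "$star_any"   # 4
echo "$star_one"   # 3
printf '%s\n' "$star_sub"
# Ki11ah
# Kilo
# Hi11
# Kite
```

Repeating the element once before the starred copy (ill*, lll*) is the usual way to say "one or more" or "two or more" in basic regular expressions.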
Plus: +
+ matches one or more occurrences of the preceding element.
Examples
Use grep with extended regular expressions to find ‘i’ followed by one or more ‘l’s:
grep -E 'il+' data/wu_tang.txt
# Ghostface Killah Dennis Coles
# Masta Killa Jamel Irief
# Cappadonna Darryl Hill
Replace one or more "a"s with @:
sed -E 's/a+/@/g' data/wu_tang.txt
# Member N@me
# RZA Robert Diggs
# GZA G@ry Grice
# Method M@n Clifford Smith
# R@ekwon the Chef Corey Woods
# Ghostf@ce Kill@h Dennis Coles
# Inspect@h Deck J@son Hunter
# U-God L@mont H@wkins
# M@st@ Kill@ J@mel Irief
# C@pp@donn@ D@rryl Hill
# Ol Dirty B@st@rd Russell Tyrone Jones
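In grep and sed, the + operator needs the -E option, which enables extended regular expressions. awk uses extended regular expressions by default, so it needs no extra option.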
Print lines with text containing one or more "Z"s:
awk '/Z+/' data/wu_tang.txt
# RZA Robert Diggs
# GZA Gary Grice
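The effect of -E can be checked directly by running the same pattern with and without it. This is a minimal sketch on made-up strings; the variable names are hypothetical, and || true is there only because grep returns a non-zero exit status when nothing matches.

```shell
line='Killa'

# Basic syntax (no -E): + is an ordinary character, so nothing matches.
bre_count=$(printf '%s\n' "$line" | grep -c 'il+' || true)

# Extended syntax (-E): + means one or more of the preceding element.
ere_count=$(printf '%s\n' "$line" | grep -c -E 'il+' || true)

# sed needs -E for the same reason; each run of "a"s becomes a single @.
collapsed=$(printf 'Baaad\n' | sed -E 's/a+/@/g')

# awk uses extended syntax by default, so no option is required.
awk_hit=$(printf 'RZA\nU-God\n' | awk '/Z+/')

echo "$bre_count"   # 0
echo "$ere_count"   # 1
echo "$collapsed"   # B@d
echo "$awk_hit"     # RZA
```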
Question Mark: ?
? makes the preceding element optional (matches zero or one occurrence).
Examples
Use grep with extended regular expressions to find lines containing ‘Kil’ or ‘Kill’ (the second ‘l’ is optional), which matches both ‘Killah’ and ‘Killa’:
grep -E 'Kill?' data/wu_tang.txt
# Ghostface Killah Dennis Coles
# Masta Killa Jamel Irief
sed: Replace Ghostface with Ghost Face (the trailing e? makes the final ‘e’ optional, so ‘Ghostfac’ would match as well):
sed -E 's/Ghostface?/Ghost Face/' data/wu_tang.txt
# Member Name
# RZA Robert Diggs
# GZA Gary Grice
# Method Man Clifford Smith
# Raekwon the Chef Corey Woods
# Ghost Face Killah Dennis Coles
# Inspectah Deck Jason Hunter
# U-God Lamont Hawkins
# Masta Killa Jamel Irief
# Cappadonna Darryl Hill
# Ol Dirty Bastard Russell Tyrone Jones
awk: Print lines with one or more digits (the lines of wu_tang.txt shown above contain no digits, so this prints nothing):
awk '/[0-9]+/' data/wu_tang.txt
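Since ? only affects the single element before it, a short test on made-up words shows where the match stops. The word list and variable names below are assumptions for illustration, not part of the chapter's data.

```shell
# Hypothetical words: zero, one, and two "u"s before the "r".
words='color
colour
colouur'

# u? allows zero or one "u", so the third word is rejected.
optional=$(printf '%s\n' "$words" | grep -E 'colou?r')

# [0-9]+ needs at least one digit somewhere on the line.
digits=$(printf 'no digits here\nroom 101\n' | awk '/[0-9]+/')

printf '%s\n' "$optional"
# color
# colour

echo "$digits"   # room 101
```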
Character Set: [abc]
[abc] matches any single character listed in the set.
Example
Use grep to find lines containing ‘a’, ‘b’, or ‘c’:
grep '[abc]' filename.txt
Caret: ^
^ matches the start of a line.
Example
Use grep to find lines that start with ‘start’:
grep '^start' filename.txt
Dollar: $
$ matches the end of a line.
Example
Use grep to find lines that end with ‘end’:
grep 'end$' filename.txt
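The three examples above use a placeholder filename.txt, so here is a sketch that runs the same kinds of patterns on inline text instead. The sample lines and variable names are made up for the illustration.

```shell
# Hypothetical input: "start" and "end" each appear in more than one position.
text='start here
a fresh start
the end
endless'

# ^start only matches when "start" is at the beginning of the line.
at_start=$(printf '%s\n' "$text" | grep '^start')

# end$ only matches when "end" is at the end of the line.
at_end=$(printf '%s\n' "$text" | grep 'end$')

# [fh] matches any line containing an "f" or an "h"; -c counts those lines.
set_count=$(printf '%s\n' "$text" | grep -c '[fh]')

echo "$at_start"    # start here
echo "$at_end"      # the end
echo "$set_count"   # 3
```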
These patterns are extremely powerful in scripting and command-line operations for filtering and manipulating text data efficiently. Here’s how you might use them in combination across different tools:
sed for substitution: Replace ‘foo’ with ‘bar’ only if ‘foo’ appears at the beginning of a line:
sed 's/^foo/bar/' filename.txt
awk for selection: Print lines that begin with ‘start’:
awk '/^start/ {print $0}' filename.txt
perl for advanced manipulation: Increment numbers found at the end of each line:
perl -pe 's/(\d+)$/ $1+1 /e' filename.txt
Special Characters
Characters such as spaces, quotes, and others have special meanings in the shell, so they need to be treated carefully when used within arguments.
Braces: {}
Brace Expansion: Similar to wildcards, brace expansion ({}) allows the creation of multiple text strings from a pattern containing braces.
Example
cat data/wu_tang.{tsv,dat} prints both wu_tang.tsv and wu_tang.dat from a single pattern:
cat data/wu_tang.{tsv,dat}
# Member Name
# RZA Robert Diggs
# GZA Gary Grice
# Method Man Clifford Smith
# Raekwon the Chef Corey Woods
# Ghostface Killah Dennis Coles
# Inspectah Deck Jason Hunter
# U-God Lamont Hawkins
# Masta Killa Jamel Irief
# Cappadonna Darryl Hill
# Ol Dirty Bastard Russell Tyrone Jones
# |Member |Name |
# |RZA |Robert Diggs |
# |GZA |Gary Grice |
# |Method Man |Clifford Smith |
# |Raekwon the Chef |Corey Woods |
# |Ghostface Killah |Dennis Coles |
# |Inspectah Deck |Jason Hunter |
# |U-God |Lamont Hawkins |
# |Masta Killa |Jamel Irief |
# |Cappadonna |Darryl Hill |
# |Ol Dirty Bastard |Russell Tyrone Jones |
The shell expands the braces before cat runs, so the command becomes:
cat data/wu_tang.tsv data/wu_tang.dat
# Member Name
# RZA Robert Diggs
# GZA Gary Grice
# Method Man Clifford Smith
# Raekwon the Chef Corey Woods
# Ghostface Killah Dennis Coles
# Inspectah Deck Jason Hunter
# U-God Lamont Hawkins
# Masta Killa Jamel Irief
# Cappadonna Darryl Hill
# Ol Dirty Bastard Russell Tyrone Jones
# |Member |Name |
# |RZA |Robert Diggs |
# |GZA |Gary Grice |
# |Method Man |Clifford Smith |
# |Raekwon the Chef |Corey Woods |
# |Ghostface Killah |Dennis Coles |
# |Inspectah Deck |Jason Hunter |
# |U-God |Lamont Hawkins |
# |Masta Killa |Jamel Irief |
# |Cappadonna |Darryl Hill |
# |Ol Dirty Bastard |Russell Tyrone Jones |
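Unlike wildcards, brace expansion does not look at the filesystem; it just generates strings. The sketch below shows this with echo. It calls bash explicitly because brace expansion is not part of POSIX sh, and the variable names are hypothetical.

```shell
# Unquoted braces are expanded into one word per alternative,
# whether or not the files exist.
expanded=$(bash -c 'echo data/wu_tang.{tsv,dat}')

# Quoted braces are left alone.
quoted=$(bash -c 'echo "data/wu_tang.{tsv,dat}"')

echo "$expanded"   # data/wu_tang.tsv data/wu_tang.dat
echo "$quoted"     # data/wu_tang.{tsv,dat}
```

Prefixing a command with echo like this is a cheap way to preview what an expansion will produce before running the real command.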
Backslash: \
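\ escapes the following character, nullifying its special meaning.
Example
echo File name with spaces \& special characters prints the text with a literal ampersand, because the backslash stops the shell from treating & as the background operator. The text is left unquoted here: inside double quotes, & is already literal and the backslash would be printed as well.
echo File name with spaces \& special characters
# File name with spaces & special characters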
Single quotes: ''
Single quotes (' ') treat every character literally, ignoring the special meaning of all characters.
Example
echo '$HOME' prints $HOME, not the path to the home directory:
echo '$HOME'
# $HOME
Double quotes: ""
Double quotes (" ") treat most special characters in an argument literally, except for the dollar sign ($), backticks (` `), and backslash (\), which keep their special meaning.
Example
echo "$HOME" prints the path to the home directory:
echo "$HOME"
#> /Users/username
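The three quoting mechanisms are easiest to compare side by side on the same text. The following is a small sketch; the name variable and the other variable names are invented for the example.

```shell
name='world'   # hypothetical variable used only for this demo

single='hello $name'          # single quotes: $name stays literal
double="hello $name"          # double quotes: $name is replaced by its value
escaped="hello \$name"        # a backslash inside double quotes makes $ literal
ampersand=$(echo one \& two)  # a backslash outside quotes makes & literal

echo "$single"      # hello $name
echo "$double"      # hello world
echo "$escaped"     # hello $name
echo "$ampersand"   # one & two
```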
Tilde: ~
~ represents the home directory of the current user.
Example
List the items in the user’s home directory:
ls ~
#> Applications
#> Creative Cloud Files
#> Desktop
#> Documents
#> Downloads
#> Dropbox
#> Fonts
#> Library
#> Movies
#> Music
#> Pictures
#> Public
#> R
#> Themes
Dollar Sign: $
$ indicates a variable: the shell replaces the name that follows with the variable's value.
Example
echo $PATH prints the value of the PATH environment variable:
echo $PATH
Ampersand: &
& runs a command in the background.
Example
firefox & opens Firefox in the background, allowing the terminal to be used for other commands:
firefox &
Semicolon: ;
; separates multiple commands to be run in sequence.
Example
cd data; ls changes the directory to data and then lists its contents:
cd data; ls
# README.md
# ajperlis_epigrams.txt
# music_vids.tsv
# pwrds.csv
# pwrds.tsv
# roxanne.txt
# roxanne_orig.txt
# roxanne_rev.txt
# trees.tsv
# vg_hof.tsv
# who_tb_data.tsv
# who_tb_data.txt
# wu_tang.csv
# wu_tang.dat
# wu_tang.tsv
# wu_tang.txt
Greater Than: >
> is a redirection operator that directs output to a file or a device, replacing the file's previous contents if it already exists.
Example
echo "This is my 2nd file" > myfile2.txt writes "This is my 2nd file" into myfile2.txt:
echo "This is my 2nd file" > myfile2.txt
Less Than: <
< is a redirection operator that takes input from a file or a device.
Example
wc < myfile2.txt counts the lines, words, and bytes in myfile2.txt:
wc < myfile2.txt
# 1 5 20
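Both redirection operators can be exercised together on a temporary file, which avoids leaving myfile2.txt behind. This is a sketch under a few assumptions: the tmp_file, word_count, and contents names are made up, and tr -d ' ' is there only because some versions of wc pad their output with spaces.

```shell
tmp_file=$(mktemp)   # hypothetical scratch file

# > sends echo's output into the file instead of the terminal.
echo "This is my 2nd file" > "$tmp_file"

# < feeds the file to wc on standard input, so wc prints no file name.
word_count=$(wc -w < "$tmp_file" | tr -d ' ')

# A second > replaces the earlier contents rather than adding to them.
echo "second version" > "$tmp_file"
contents=$(cat < "$tmp_file")

echo "$word_count"   # 5
echo "$contents"     # second version

rm -f "$tmp_file"
```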
Parentheses: ()
Parentheses can be used to group commands or for command substitution with $( ).
Example
(cd data; ls) runs ls in data without changing the current directory:
(cd data; ls)
# README.md
# ajperlis_epigrams.txt
# music_vids.tsv
# pwrds.csv
# pwrds.tsv
# roxanne.txt
# roxanne_orig.txt
# roxanne_rev.txt
# trees.tsv
# vg_hof.tsv
# who_tb_data.tsv
# who_tb_data.txt
# wu_tang.csv
# wu_tang.dat
# wu_tang.tsv
# wu_tang.txt
$(command) runs command and substitutes its output into the command line.
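The two uses of parentheses can be checked in one short script. This is a minimal sketch: it works in a temporary directory rather than data/, and the start_dir, demo_dir, after_dir, and listing names are assumptions made for the example.

```shell
start_dir=$PWD
demo_dir=$(mktemp -d)            # $( ) captures the path printed by mktemp
touch "$demo_dir/wu_tang.txt"    # hypothetical file to list

# Grouping: the cd only affects the subshell inside the parentheses.
(cd "$demo_dir"; ls)             # prints wu_tang.txt
after_dir=$PWD                   # still the directory we started in

# Command substitution: the output of the commands becomes a value.
listing=$(cd "$demo_dir"; ls)
echo "Found: $listing"           # Found: wu_tang.txt

rm -rf "$demo_dir"
```

Because neither form changes the working directory of the calling shell, there is no need to cd back afterwards.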