Regular Expressions
Regular expressions (often shortened to ‘regex’) are powerful tools for searching and manipulating text data. A regex is made up of symbols and characters that define specific patterns to be matched, identified, or transformed.
A regex operates on text: the sequence of characters that includes letters, digits, punctuation, and other character types. Text is the ‘data’ or ‘medium’ in which the patterns described by the regex are searched for.[1]
Matching occurrences
+ matches one or more occurrences of the preceding element.
[abc]: Matches any one character listed (a, b, or c).
Example
Match any line containing one or more digits in who_tb_data.csv:
grep -E '[0-9]+' data/raw/who_tb_data.csv
# Afghanistan,1999,cases,745
# Afghanistan,1999,population,19987071
# Afghanistan,2000,cases,2666
# Afghanistan,2000,population,20595360
# Brazil,1999,cases,37737
# Brazil,1999,population,172006362
# Brazil,2000,cases,80488
# Brazil,2000,population,174504898
# China,1999,cases,212258
# China,1999,population,1272915272
# China,2000,cases,213766
# China,2000,population,1280428583
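To see what + adds to a bracket expression, the sketch below runs both forms on a made-up line rather than on who_tb_data.csv. It assumes your grep supports -o (print only the matched text), which GNU and BSD grep do but POSIX does not require.

```shell
# Sample text is invented for illustration; it is not from the data files.
sample='room 101, floor 7'

# [0-9] alone matches one digit at a time, so every digit is its own match.
single=$(printf '%s\n' "$sample" | grep -oE '[0-9]')

# [0-9]+ matches each run of one or more digits as a single match.
runs=$(printf '%s\n' "$sample" | grep -oE '[0-9]+')

printf 'single digits:\n%s\n' "$single"
printf 'digit runs:\n%s\n' "$runs"
```

When grep only selects whole lines, both patterns pick the same lines; the difference shows up when the matched text itself is extracted or replaced.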
Starts with
^ matches the start of a line.
[abc]: Matches any one character listed (a, b, or c).
Example
Find lines beginning with an uppercase letter in README.md:
grep -E '^[A-Z]' data/README.md
# The `ajperlis_epigrams.txt` file contains a collection of witty
# These statements cover various topics related to computer programming,
# The file `music_vids.tsv` is a tab-separated values (TSV) data file that
# The dataset contains information about passwords' strength, popularity
# The file `roxanne.txt` contains the lyrics to the song "Roxanne" by The
# Police.[^readme-4] The structure of the lyrics emphasizes the repeated
# The file `trees.csv` is a comma-separated value (CSV) document that
# The file `vg_hof.csv` is a comma-separated values (CSV) document that
# The fields detailed in the dataset are:
# The dataset includes the following fields:
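The effect of the ^ anchor is easier to see next to the unanchored pattern. This is a small sketch on invented lines (not the contents of data/README.md). Setting LC_ALL=C is an assumption added here so that [A-Z] means ASCII uppercase letters regardless of the system locale.

```shell
# Invented sample lines; not taken from data/README.md.
lines=$(printf '%s\n' 'The file is tab-separated.' 'see the Notes section' '- Bullet item')

# Unanchored: counts lines containing an uppercase letter anywhere.
anywhere=$(printf '%s\n' "$lines" | LC_ALL=C grep -cE '[A-Z]')

# Anchored: counts only lines whose first character is an uppercase letter.
at_start=$(printf '%s\n' "$lines" | LC_ALL=C grep -cE '^[A-Z]')

echo "contain an uppercase letter: $anywhere"
echo "begin with an uppercase letter: $at_start"
```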
Ends with
($| ): followed by either end-of-line ($) or a space.
Example
Match lines that contain a literal ah followed by either end-of-line ($) or a space:
grep -E 'ah($| )' data/wu_tang.psv
# |Ghostface Killah |Dennis Coles |
# |Inspectah Deck |Jason Hunter |
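In the output above, each name is followed by a space before the | delimiter, which is why the group allows a space as well as end-of-line. The sketch below, on invented lines, shows which occurrences of ah the pattern accepts and which it rejects.

```shell
# Invented sample lines; not taken from data/wu_tang.psv.
sample=$(printf '%s\n' 'Killah' 'Inspectah Deck' 'ahead of time' 'yeah!')

# ah must be followed by end-of-line or by a space to count as a match.
matched=$(printf '%s\n' "$sample" | grep -E 'ah($| )')

echo "$matched"
```

The last two lines are skipped because their ah is followed by a letter or by punctuation.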
Search
Example
^ matches the start of a line.
Find lines beginning with Ghost in data/raw/wu_tang.csv:
grep -E '^Ghost' data/raw/wu_tang.csv
# Ghostface Killah,Dennis Coles
Example
Print lines with i followed by one or more ls, or lines with i followed by one or more fs:
grep -E '(il+|if+)' data/wu_tang.psv
# |Method Man |Clifford Smith |
# |Ghostface Killah |Dennis Coles |
# |Masta Killa |Jamel Irief |
# |Cappadonna |Darryl Hill |
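Both alternatives in (il+|if+) start with i, so the shared letter can be moved outside the group. The sketch below compares the two spellings on invented words. It assumes a grep with -o (GNU or BSD), used here to print only the matched text.

```shell
# Invented sample words; not taken from data/wu_tang.psv.
words=$(printf '%s\n' 'Killah' 'Clifford' 'Irief' 'Smith')

# Original form: two alternatives, each starting with i.
a=$(printf '%s\n' "$words" | grep -oE '(il+|if+)')

# Factored form: the shared i is written once, outside the group.
b=$(printf '%s\n' "$words" | grep -oE 'i(l+|f+)')

printf 'original:\n%s\n' "$a"
printf 'factored:\n%s\n' "$b"
```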
Deep search
Example
Recursively search for Kill or cybernetic in any file under data/raw:
grep -R -E '(Kill|cybernetic)' data/raw
# data/raw/wu_tang.csv:Ghostface Killah,Dennis Coles
# data/raw/wu_tang.csv:Masta Killa,Jamel Irief
# data/raw/ajperlis_epigrams.txt:The cybernetic exchange between man, computer and algorithm is like a game of musical chairs: The frantic search for balance always leaves one of the three standing ill at ease.
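When a recursive search returns long lines, listing only the matching file names can be easier to read. The sketch below builds a throwaway directory whose paths and contents are invented for illustration, then adds -l, an option the example above does not use.

```shell
# Build a temporary directory tree; all names and contents are hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/raw/nested"
printf '%s\n' 'Masta Killa' 'Raekwon' > "$tmp/raw/names.txt"
printf '%s\n' 'a cybernetic exchange' > "$tmp/raw/nested/notes.txt"
printf '%s\n' 'nothing to see' > "$tmp/raw/other.txt"

# -R descends into subdirectories; -l prints only the names of matching files.
# sort makes the order predictable, since grep follows directory order.
hits=$(cd "$tmp" && grep -R -l -E '(Kill|cybernetic)' raw | sort)
echo "$hits"

# Clean up the temporary tree.
rm -rf "$tmp"
```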
Example
Print lines containing any four-digit sequence, using square brackets ([0-9]) and curly braces ({4}):
awk '/[0-9]{4}/ {print}' data/raw/who_tb_data.csv
# Afghanistan,1999,cases,745
# Afghanistan,1999,population,19987071
# Afghanistan,2000,cases,2666
# Afghanistan,2000,population,20595360
# Brazil,1999,cases,37737
# Brazil,1999,population,172006362
# Brazil,2000,cases,80488
# Brazil,2000,population,174504898
# China,1999,cases,212258
# China,1999,population,1272915272
# China,2000,cases,213766
# China,2000,population,1280428583
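Because [0-9]{4} is unanchored, it also matches the year column and any longer run of digits, so every row above is printed, including the one whose count is 745. The sketch below is one possible way to narrow the match to a final field of exactly four digits. It uses grep -E rather than awk, and three rows copied from the output above.

```shell
# Three rows copied from the output above.
rows=$(printf '%s\n' 'Afghanistan,1999,cases,745' 'Afghanistan,2000,cases,2666' 'Brazil,1999,cases,37737')

# Unanchored: every row matches, because each contains a four-digit year.
unanchored=$(printf '%s\n' "$rows" | grep -cE '[0-9]{4}')

# Anchored between the last comma and end-of-line: exactly four digits.
exact=$(printf '%s\n' "$rows" | grep -E ',[0-9]{4}$')

echo "rows matched without anchors: $unanchored"
echo "rows whose last field has four digits: $exact"
```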
Regular expressions can combine wildcards and special characters and are typically used with tools like grep (global regular expression print), sed (stream editor), and awk. cat (concatenate) does not interpret patterns itself, but it often feeds text to these tools through a pipe.
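The examples in this section use grep and awk to select lines. As a brief sketch of the "manipulating" side, the same kind of pattern can drive a substitution in sed. The row is copied from the who_tb_data.csv output, the placeholder N is arbitrary, and -E assumes a GNU or BSD sed.

```shell
# One row copied from the who_tb_data.csv output.
row='Afghanistan,1999,cases,745'

# s/pattern/replacement/g rewrites every match on the line.
# -E enables extended regular expressions, as with grep -E.
masked=$(printf '%s\n' "$row" | sed -E 's/[0-9]+/N/g')

echo "$masked"
```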
[1] The regex patterns are not the data themselves, but rather a framework for locating or modifying text data.