Regular Expressions

Published

2025-06-25

Caution

This section is under development. Thank you for your patience.

Regular expressions (often shortened to ‘regex’) are powerful tools for searching and manipulating text data. A regex is made up of symbols and characters that define a pattern to be matched, identified, or transformed.

A regex operates on text: a sequence of characters that can include letters, digits, punctuation, and other character types. Text is the ‘data’ or ‘medium’ in which the regex searches for the patterns it describes.¹

Matching occurrences

+ matches one or more occurrences of the preceding element.

[abc] matches any one character listed (a, b, or c). A range such as [0-9] matches any one digit.

Example

Match any line containing one or more digits in who_tb_data.csv:

grep -E '[0-9]+' data/raw/who_tb_data.csv
# Afghanistan,1999,cases,745
# Afghanistan,1999,population,19987071
# Afghanistan,2000,cases,2666
# Afghanistan,2000,population,20595360
# Brazil,1999,cases,37737
# Brazil,1999,population,172006362
# Brazil,2000,cases,80488
# Brazil,2000,population,174504898
# China,1999,cases,212258
# China,1999,population,1272915272
# China,2000,cases,213766
# China,2000,population,1280428583
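To see what `[0-9]+` actually consumes, it can help to print only the matched text rather than the whole line. The following is a minimal sketch, assuming a grep that supports `-o` (a common extension in GNU and BSD grep, not required by POSIX). The `sample` rows, including the header row, are hypothetical stand-ins shaped like who_tb_data.csv.

```shell
# Hypothetical sample rows shaped like who_tb_data.csv (the header row is an assumption)
sample='country,year,type,count
Afghanistan,1999,cases,745
Brazil,2000,population,174504898'

# Whole-line matching: the header has no digits, so it is filtered out
matched=$(printf '%s\n' "$sample" | grep -E '[0-9]+')
printf '%s\n' "$matched"

# -o prints each matched run of digits on its own line;
# + is greedy, so every run is as long as possible
runs=$(printf '%s\n' "$sample" | grep -E -o '[0-9]+')
printf '%s\n' "$runs"
```

Each data row contributes two runs (the year and the count), which shows that `+` extends a match across consecutive digits instead of stopping at the first one.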

Starts with

^ matches the start of a line.

[abc] matches any one character listed (a, b, or c); a range such as [A-Z] matches any one uppercase letter.

Example

Find lines beginning with an uppercase letter in README.md:

grep -E '^[A-Z]' data/README.md
# The `ajperlis_epigrams.txt` file contains a collection of witty
# These statements cover various topics related to computer programming,
# The file `music_vids.tsv` is a tab-separated values (TSV) data file that
# The dataset contains information about passwords' strength, popularity
# The file `roxanne.txt` contains the lyrics to the song "Roxanne" by The
# Police.[^readme-4] The structure of the lyrics emphasizes the repeated
# The file `trees.csv` is a comma-separated value (CSV) document that
# The file `vg_hof.csv` is a comma-separated values (CSV) document that
# The fields detailed in the dataset are:
# The dataset includes the following fields:
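The effect of the `^` anchor is easiest to see by counting matches with and without it. This is a small sketch on hypothetical lines (not the real README.md). `LC_ALL=C` is set as a precaution, since in some locales a range like `[A-Z]` can match more than the 26 ASCII capitals.

```shell
# Hypothetical lines standing in for README.md content
sample='The file is a CSV document.
- a bullet that mentions CSV
lowercase start, no capitals'

# -c prints the number of matching lines instead of the lines themselves
anchored=$(printf '%s\n' "$sample" | LC_ALL=C grep -E -c '^[A-Z]')
unanchored=$(printf '%s\n' "$sample" | LC_ALL=C grep -E -c '[A-Z]')

echo "anchored: $anchored, unanchored: $unanchored"
# → anchored: 1, unanchored: 2
```

Without the anchor, the bullet line also matches because it contains capitals somewhere in the line; with `^`, only the line whose first character is a capital is counted.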

Ends with

($| ) matches either the end of the line ($) or a space.

Example

Match lines containing a literal ah followed by either the end of the line ($) or a space:

grep -E 'ah($| )' data/wu_tang.psv
# |Ghostface Killah |Dennis Coles         |
# |Inspectah Deck   |Jason Hunter         |
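The group `($| )` matters because the padded rows end in a pipe, so `ah$` alone would find nothing. The sketch below compares the three variants. The first two rows are copied from the output above; the third row is a made-up placeholder added only to show a non-match.

```shell
# Two rows from the output above plus one hypothetical row ("Graham" has ah mid-word)
sample='|Ghostface Killah |Dennis Coles         |
|Inspectah Deck   |Jason Hunter         |
|Graham Placeholder |Made Up Row        |'

# ah followed by end-of-line or a space: the first two rows
either=$(printf '%s\n' "$sample" | grep -E -c 'ah($| )')

# ah at end-of-line only: no row ends in ah, so grep exits 1 (hence || true)
eol_only=$(printf '%s\n' "$sample" | grep -E -c 'ah$' || true)

# bare ah: also catches the ah inside "Graham"
plain=$(printf '%s\n' "$sample" | grep -E -c 'ah')

# On unpadded text, the $ branch is the one that matches
bare=$(printf 'Ghostface Killah\n' | grep -E -c 'ah($| )')

echo "either=$either eol_only=$eol_only plain=$plain bare=$bare"
# → either=2 eol_only=0 plain=3 bare=1
```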

Search

Example

^ matches the start of a line.

Find lines beginning with Ghost in data/raw/wu_tang.csv:

grep -E '^Ghost' data/raw/wu_tang.csv
# Ghostface Killah,Dennis Coles
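The same anchored pattern would not work unchanged on the pipe-separated rows of wu_tang.psv, because those lines begin with `|` and a bare `|` means alternation in extended regex. A minimal sketch, using one row copied from the wu_tang.psv output earlier in this section:

```shell
# One row in the padded, pipe-separated layout
row='|Ghostface Killah |Dennis Coles         |'

# The line starts with a pipe, not with G, so this finds nothing (grep exits 1)
unescaped=$(printf '%s\n' "$row" | grep -E -c '^Ghost' || true)

# \| is a literal pipe in extended regex; [|] would work as well
escaped=$(printf '%s\n' "$row" | grep -E -c '^\|Ghost')

echo "unescaped=$unescaped escaped=$escaped"
# → unescaped=0 escaped=1
```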

Example

Print lines with i followed by one or more ls or lines with i followed by one or more fs:

grep -E '(il+|if+)' data/wu_tang.psv
# |Method Man       |Clifford Smith       |
# |Ghostface Killah |Dennis Coles         |
# |Masta Killa      |Jamel Irief          |
# |Cappadonna       |Darryl Hill          |
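To check which branch of the alternation fired on each line, the matched text can be printed with `-o` (an extension available in GNU and BSD grep). This sketch uses four names taken from the rows above; the factored pattern `i(l+|f+)` is an equivalent rewrite offered for comparison, not something the original command uses.

```shell
# Names taken from the wu_tang rows shown above
names='Clifford Smith
Ghostface Killah
Darryl Hill
Dennis Coles'

# Print only the matched part: i followed by one or more l, or one or more f
parts=$(printf '%s\n' "$names" | grep -E -o '(il+|if+)')
printf '%s\n' "$parts"
# → iff
# → ill
# → ill

# Factoring out the shared i gives the same matches
factored=$(printf '%s\n' "$names" | grep -E -o 'i(l+|f+)')
```

"Dennis Coles" produces no output because neither i is followed by l or f.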

Deep search

Example

Recursively search for Kill or cybernetic in any file under data/raw:

grep -R -E '(Kill|cybernetic)' data/raw
# data/raw/wu_tang.csv:Ghostface Killah,Dennis Coles
# data/raw/wu_tang.csv:Masta Killa,Jamel Irief
# data/raw/ajperlis_epigrams.txt:The cybernetic exchange between man, computer and algorithm is like a game of musical chairs: The frantic search for balance always leaves one of the three standing ill at ease.
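When a recursive search returns long lines, listing only file names (`-l`) or adding line numbers (`-n`) can make the results easier to scan. The sketch below builds a throwaway directory so it can run anywhere; the directory layout, file names, and file contents are all hypothetical.

```shell
# Build a temporary tree (file names and contents are hypothetical)
tmp=$(mktemp -d)
mkdir -p "$tmp/raw/nested"
printf 'Ghostface Killah,Dennis Coles\nMethod Man,Clifford Smith\n' > "$tmp/raw/names.csv"
printf 'A cybernetic exchange.\nNothing to see here.\n' > "$tmp/raw/nested/notes.txt"

# -R descends into subdirectories; -l prints only the names of matching files.
# sort keeps the order stable, since traversal order is not guaranteed.
files=$(cd "$tmp" && grep -R -l -E '(Kill|cybernetic)' raw | sort)
printf '%s\n' "$files"

# -n prefixes each hit with its line number within the file
hits=$(cd "$tmp" && grep -R -n -E '(Kill|cybernetic)' raw | sort)
printf '%s\n' "$hits"

# Clean up the temporary tree
rm -rf "$tmp"
```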

Example

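Print lines containing a run of at least four digits, using square brackets for the character class and curly braces for the repetition count (longer numbers also match, because they contain four consecutive digits):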

awk '/[0-9]{4}/ {print}' data/raw/who_tb_data.csv
# Afghanistan,1999,cases,745
# Afghanistan,1999,population,19987071
# Afghanistan,2000,cases,2666
# Afghanistan,2000,population,20595360
# Brazil,1999,cases,37737
# Brazil,1999,population,172006362
# Brazil,2000,cases,80488
# Brazil,2000,population,174504898
# China,1999,cases,212258
# China,1999,population,1272915272
# China,2000,cases,213766
# China,2000,population,1280428583
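Because `[0-9]{4}` matches anywhere inside a longer number, limiting the match to a field of exactly four digits needs delimiters or anchors around it. The sketch below uses hypothetical two-column rows. It also spells the class out four times in awk, on the assumption that some older awk implementations do not support the `{n}` interval syntax.

```shell
# Hypothetical rows: a 1-digit, a 4-digit, and an 8-digit value
sample='id,7
id,1234
id,19987071'

# Unanchored: any line with four consecutive digits (the last two rows)
anywhere=$(printf '%s\n' "$sample" | grep -E -c '[0-9]{4}')

# Delimited: the four digits must fill a whole comma-separated field
exact=$(printf '%s\n' "$sample" | grep -E '(^|,)[0-9]{4}(,|$)')

# awk alternative: test the second field with ^ and $ anchors,
# repeating the class instead of using {4}
portable=$(printf '%s\n' "$sample" | awk -F, '$2 ~ /^[0-9][0-9][0-9][0-9]$/ {print}')

echo "anywhere=$anywhere exact=$exact portable=$portable"
# → anywhere=2 exact=id,1234 portable=id,1234
```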

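Regular expressions can combine wildcards and special characters and are typically used with tools like grep (global regular expression print), sed (stream editor), and awk. Commands such as cat (concatenate) do not interpret patterns themselves, but their output is often piped into these tools.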
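The examples in this section search with grep and awk; for completeness, here is a minimal sketch of the transforming side using sed. It assumes a sed that accepts `-E` for extended regex (GNU and BSD sed both do). The row is copied from the who_tb_data.csv output shown earlier in this section.

```shell
# One row from the who_tb_data.csv output
row='Afghanistan,1999,cases,745'

# Replace every run of digits with the letter N (g = all matches on the line)
masked=$(printf '%s\n' "$row" | sed -E 's/[0-9]+/N/g')
echo "$masked"
# → Afghanistan,N,cases,N

# Capture groups: swap the first two fields using \1 and \2
swapped=$(printf '%s\n' "$row" | sed -E 's/^([^,]+),([0-9]+)/\2,\1/')
echo "$swapped"
# → 1999,Afghanistan,cases,745
```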


¹ The regex patterns are not the data themselves, but rather a framework for locating or modifying text data.