Regular Expressions

Published

2025-06-25

Caution

This section is under development. Thank you for your patience.

Regular expressions (often shortened to ‘regex’) are powerful tools for searching and manipulating text data. A regex is a sequence of literal characters and special symbols that defines a pattern to be matched, identified, or transformed.

A regex operates on text: the sequence of characters that includes letters, digits, punctuation, and other character types. Text is the ‘data’ or ‘medium’ in which the regex searches for the patterns it describes.[1]

Matching occurrences

+ matches one or more occurrences of the preceding element.

[abc] matches any one character listed (a, b, or c). A range such as [0-9] matches any one digit.

Example

Match any line containing one or more digits in who_tb_data.csv:

grep -E '[0-9]+' data/raw/who_tb_data.csv
# Afghanistan,1999,cases,745
# Afghanistan,1999,population,19987071
# Afghanistan,2000,cases,2666
# Afghanistan,2000,population,20595360
# Brazil,1999,cases,37737
# Brazil,1999,population,172006362
# Brazil,2000,cases,80488
# Brazil,2000,population,174504898
# China,1999,cases,212258
# China,1999,population,1272915272
# China,2000,cases,213766
# China,2000,population,1280428583
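
Because grep prints whole lines, the output above does not show which characters the pattern matched. The sketch below uses a few hypothetical sample lines (not taken from who_tb_data.csv) and the -o option, which prints only the matched text. Note that -o is an extension supported by GNU and BSD grep rather than part of POSIX, so this assumes one of those implementations.

```shell
# Hypothetical sample lines, not the contents of who_tb_data.csv
sample='country,year,type,count
Afghanistan,1999,cases,745
Brazil,2000,population,174504898'

# Lines containing at least one digit (the header has none, so it is skipped)
matched=$(printf '%s\n' "$sample" | grep -E '[0-9]+')
printf '%s\n' "$matched"

# -o prints each match on its own line: + consumes a whole run of digits
runs=$(printf '%s\n' "$sample" | grep -oE '[0-9]+')
printf '%s\n' "$runs"
```

With -o, each run of digits (for example 1999 or 745) is printed as a single match, which is the effect of + applied to [0-9].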

Starts with

^ matches the start of a line.

[abc] matches any one character listed (a, b, or c). A range such as [A-Z] matches any one uppercase letter.

Example

Find lines beginning with an uppercase letter in README.md:

grep -E '^[A-Z]' data/README.md
# The `ajperlis_epigrams.txt` file contains a collection of witty
# These statements cover various topics related to computer programming,
# The file `music_vids.tsv` is a tab-separated values (TSV) data file that
# The dataset contains information about passwords' strength, popularity
# The file `roxanne.txt` contains the lyrics to the song "Roxanne" by The
# Police.[^readme-4] The structure of the lyrics emphasizes the repeated
# The file `trees.csv` is a comma-separated value (CSV) document that
# The file `vg_hof.csv` is a comma-separated values (CSV) document that
# The fields detailed in the dataset are:
# The dataset includes the following fields:
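
To make the effect of ^ concrete, the following sketch compares an anchored and an unanchored pattern on hypothetical sample text (not the contents of data/README.md). LC_ALL=C is set as a precaution, on the assumption that the A-Z range should mean ASCII uppercase letters only; in other locales the meaning of a range can differ.

```shell
# Hypothetical sample text, not the contents of data/README.md
sample='The file is a CSV document.
  indented Line with a capital later
lowercase start, then CSV'

# Unanchored: counts every line containing an uppercase letter anywhere
unanchored=$(printf '%s\n' "$sample" | LC_ALL=C grep -cE '[A-Z]')

# Anchored: keeps only lines whose first character is an uppercase letter
anchored=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z]')

echo "lines with an uppercase letter anywhere: $unanchored"
printf '%s\n' "$anchored"
```

All three sample lines contain an uppercase letter somewhere, but only the first one starts with one, so the anchored pattern keeps that line alone.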

Ends with

$ matches the end of a line.

($| ) matches either the end of a line ($) or a space; | separates the alternatives and the parentheses group them.

Example

Match lines in wu_tang.psv that contain a literal ah followed by either end-of-line ($) or a space:

grep -E 'ah($| )' data/wu_tang.psv
# |Ghostface Killah |Dennis Coles         |
# |Inspectah Deck   |Jason Hunter         |
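
The alternation matters because the names in the file are padded with spaces, so ah is rarely the last thing on the line. As a rough illustration, the sketch below runs 'ah($| )' and the stricter 'ah$' against hypothetical pipe-delimited lines (invented for this example, not taken from wu_tang.psv).

```shell
# Hypothetical pipe-delimited sample lines, not the contents of wu_tang.psv
sample='|Ghostface Killah |
|Hannah|
|Method Man       |
|Savannah'

# ah followed by a space OR by end-of-line
either=$(printf '%s\n' "$sample" | grep -E 'ah($| )')
printf '%s\n' "$either"

# ah only when it is the last thing on the line
end_only=$(printf '%s\n' "$sample" | grep -E 'ah$')
printf '%s\n' "$end_only"
```

The line |Hannah| matches neither pattern because its ah is followed by a pipe character, while the padded |Ghostface Killah | line matches only the pattern that allows a trailing space.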

  1. The regex patterns are not the data themselves, but rather a framework for locating or modifying text data.