Files

Published

2025-01-25

Caution

This section is under development. Thank you for your patience.

The following commands can be used for creating, managing, and manipulating files. Some of these commands also work on directories (which we covered in the previous chapter).

Create

The commands below can be used to create new files or update the time stamp of an existing file.

touch

We’ll start by creating a new empty file (data/who_tb_data.tsv) with touch:

touch data/who_tb_data.tsv

To confirm the new who_tb_data.tsv file was created, we’ll use the tree command to check the data folder:

tree -P who_tb_data.tsv data
# data
# └── who_tb_data.tsv
# 
# 1 directory, 1 file

The -P option lets us specify a pattern to search for in the data folder, which we’ll cover more in Symbols & Patterns.
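touch also works on files that already exist: it leaves the contents alone and only updates the time stamp. The sketch below illustrates this, assuming a throwaway directory from mktemp and a hypothetical demo.tsv file (neither is part of the data/ folder):

```shell
# Work in a throwaway directory so nothing in data/ is changed
tmp_dir=$(mktemp -d)
demo_file="$tmp_dir/demo.tsv"    # hypothetical file name

touch "$demo_file"               # the file doesn't exist yet: an empty file is created
printf 'a\tb\n' > "$demo_file"   # add one tab-separated line
touch "$demo_file"               # the file exists: only the time stamp is updated

cat "$demo_file"                 # the line we wrote is still there
```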

echo

We can add some contents to the data/who_tb_data.tsv file using echo and the > operator.1

echo "country   year    type    count
Afghanistan 1999    cases   745
Afghanistan 1999    population  19987071
Afghanistan 2000    cases   2666
Afghanistan 2000    population  20595360
Brazil  1999    cases   37737
Brazil  1999    population  172006362
Brazil  2000    cases   80488
Brazil  2000    population  174504898
China   1999    cases   212258
China   1999    population  1272915272
China   2000    cases   213766
China   2000    population  1280428583" > data/who_tb_data.tsv
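Note that > replaces whatever the file already contains. If we want to add lines to the end of a file instead, the shell also provides the >> operator. A small sketch, assuming a temporary directory and a hypothetical example.tsv file:

```shell
tmp_dir=$(mktemp -d)
out_file="$tmp_dir/example.tsv"           # hypothetical file name

printf 'country\tyear\n' > "$out_file"    # > creates the file (or overwrites it)
printf 'Brazil\t1999\n' >> "$out_file"    # >> appends to the end of the file

cat "$out_file"                           # header line, then the Brazil line
```

Using printf with \t is one way to be sure the columns are separated by real tab characters.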

View

cat

cat concatenates and displays file contents. We can use this to view the entire data/who_tb_data.tsv file we just created:

cat data/who_tb_data.tsv
# country   year    type    count
# Afghanistan   1999    cases   745
# Afghanistan   1999    population  19987071
# Afghanistan   2000    cases   2666
# Afghanistan   2000    population  20595360
# Brazil    1999    cases   37737
# Brazil    1999    population  172006362
# Brazil    2000    cases   80488
# Brazil    2000    population  174504898
# China 1999    cases   212258
# China 1999    population  1272915272
# China 2000    cases   213766
# China 2000    population  1280428583

more & less

less and more let you page through a file, moving forwards and backwards as you please. These commands are helpful for larger files, like the Video Game Hall of Fame data stored in the data/vg_hof.tsv file:2

more data/vg_hof.tsv

Press ‘q’ to exit more.

less data/vg_hof.tsv

Press ‘q’ to exit less.

head & tail

The head and tail commands let us view the first and last lines of a file (the -n3 option requests three lines from data/vg_hof.tsv).

head -n3 data/vg_hof.tsv
# year  game    developer   year_released
# 2015  DOOM    id Software 1993
# 2015  Pac-Man Namco   1980
tail -n3 data/vg_hof.tsv
# 2024  Tony Hawk's Pro Skater  Neversoft   1999
# 2024  Ultima  Richard Garriott, Origin Systems    1981
# 2024  You Don't Know Jack Jellyvision 1995
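To see exactly which lines head and tail select, here is a small sketch using a hypothetical ten-line file generated with seq in a temporary directory:

```shell
tmp_dir=$(mktemp -d)
seq 1 10 > "$tmp_dir/numbers.txt"   # hypothetical file with the numbers 1 to 10

head -n 3 "$tmp_dir/numbers.txt"    # first three lines
# 1
# 2
# 3
tail -n 3 "$tmp_dir/numbers.txt"    # last three lines
# 8
# 9
# 10
```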

Manage

The commands below can be used for copying, moving, renaming, and creating links to files. Assume we want to create backups of the delimiter-separated data files in the data/ folder. We’ll store these backups in folders according to the file extension.

First we need to create folders for each type of delimiter (.tsv and .psv):

mkdir data/tsv
mkdir data/psv

Confirm these new folders with tree -d:

tree data -d
# data
# ├── psv
# ├── raw
# └── tsv
# 
# 4 directories

cp

We’ll copy the files into their respective folders based on their extension using cp and the * wildcard. For reference, the tree output below shows the data/ folder before and after copying the .tsv files:

cp data/*.tsv data/tsv/

Before copying .tsv files into data/tsv/:

tree data -L 2 -P '*.tsv' --filesfirst
# data
# ├── music_vids.tsv
# ├── pwrds.tsv
# ├── trees.tsv
# ├── vg_hof.tsv
# ├── who_tb_data.tsv
# ├── wu_tang.tsv
# ├── psv
# ├── raw
# └── tsv
# 
# 4 directories, 6 files

After copying .tsv files into data/tsv/:4

tree data -L 2 -P '*.tsv' --filesfirst
# data
# ├── music_vids.tsv
# ├── pwrds.tsv
# ├── trees.tsv
# ├── vg_hof.tsv
# ├── who_tb_data.tsv
# ├── wu_tang.tsv
# ├── psv
# ├── raw
# └── tsv
#     ├── music_vids.tsv
#     ├── pwrds.tsv
#     ├── trees.tsv
#     ├── vg_hof.tsv
#     ├── who_tb_data.tsv
#     └── wu_tang.tsv
# 
# 4 directories, 12 files

Note that the number of .tsv files doubled (from six files to twelve files). We’ll do the same for the .psv files in data/:

cp data/*.psv data/psv/

Confirm the .psv files were copied:

tree data/psv/
# data/psv/
# └── wu_tang.psv
# 
# 1 directory, 1 file
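The * wildcard is expanded by the shell before cp runs, so cp receives a list of matching file names. The sketch below shows this on a small scale, assuming a temporary directory and hypothetical file names:

```shell
tmp_dir=$(mktemp -d)
mkdir "$tmp_dir/tsv"
touch "$tmp_dir/a.tsv" "$tmp_dir/b.tsv" "$tmp_dir/c.psv"   # hypothetical files

# *.tsv expands to a.tsv and b.tsv only; c.psv and the tsv/ folder don't match
cp "$tmp_dir"/*.tsv "$tmp_dir/tsv/"

ls -1 "$tmp_dir/tsv"
# a.tsv
# b.tsv
```

The originals stay where they were, because cp copies rather than moves.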

mv

The commands below create a .psv version of the who_tb_data.tsv we created above:

touch data/who_tb_data.psv
echo "| country     | year | type       | count      |
|-------------|------|------------|------------|
| Afghanistan | 1999 | cases      | 745        |
| Afghanistan | 1999 | population | 19987071   |
| Afghanistan | 2000 | cases      | 2666       |
| Afghanistan | 2000 | population | 20595360   |
| Brazil      | 1999 | cases      | 37737      |
| Brazil      | 1999 | population | 172006362  |
| Brazil      | 2000 | cases      | 80488      |
| Brazil      | 2000 | population | 174504898  |
| China       | 1999 | cases      | 212258     |
| China       | 1999 | population | 1272915272 |
| China       | 2000 | cases      | 213766     |
| China       | 2000 | population | 1280428583 |" > data/who_tb_data.psv

Oops, we created who_tb_data.psv in the data folder and not the data/psv folder:5

tree data -L 2 -P '*who_tb_data.psv'
# data
# ├── psv
# ├── raw
# ├── tsv
# └── who_tb_data.psv
# 
# 4 directories, 1 file

We’ll use mv to move data/who_tb_data.psv into data/psv/ and confirm with tree:6

mv data/who_tb_data.psv data/psv/who_tb_data.psv
tree data/psv
# data/psv
# ├── who_tb_data.psv
# └── wu_tang.psv
# 
# 1 directory, 2 files
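mv handles both moving and renaming: the only difference is whether the destination is a folder or a new file name. A minimal sketch, assuming a temporary directory and hypothetical file names:

```shell
tmp_dir=$(mktemp -d)
mkdir "$tmp_dir/psv"
touch "$tmp_dir/report.psv"                             # hypothetical file

mv "$tmp_dir/report.psv" "$tmp_dir/psv/"                # move it, keeping the name
mv "$tmp_dir/psv/report.psv" "$tmp_dir/psv/final.psv"   # rename it in place

ls -1 "$tmp_dir/psv"
# final.psv
```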

ln

ln is used to create links between files. It can create two types of links: hard links and symbolic (or soft) links.

A hard link is an additional name for an existing file on the same file system, and is effectively an additional directory entry for the file. In Linux file systems, all file names are technically hard links.

A symbolic link (often called a symlink) is a file that points to another file or directory, and it contains a path to another entry somewhere in the file system.

With the ln command, you need to specify the target file first (the original file) and then the name of the new link:

ln original_file.txt new_link.txt

ln doesn’t produce any output and returns an exit status of zero when it’s successful.

We’ll use ln to create data/who_tb_hardlink.psv, a hard link for the data file in data/psv/who_tb_data.psv:

ln data/psv/who_tb_data.psv data/who_tb_hardlink.psv

If we check the data folder with tree, we see the new who_tb_hardlink.psv file is listed just like the other files:

tree data -P '*.psv'
# data
# ├── psv
# │   ├── who_tb_data.psv
# │   └── wu_tang.psv
# ├── raw
# ├── tsv
# ├── who_tb_hardlink.psv
# └── wu_tang.psv
# 
# 4 directories, 4 files

Hard links are not copies: both names refer to the same underlying data, so changes made through one name are visible through the other.
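To see that both names share the same data, we can write through one name and read through the other. This is a small sketch in a temporary directory, and the file names are hypothetical:

```shell
tmp_dir=$(mktemp -d)
echo "first line" > "$tmp_dir/original.txt"          # hypothetical file
ln "$tmp_dir/original.txt" "$tmp_dir/hardlink.txt"   # second name for the same data

# Append through the hard link, then read through the original name
echo "second line" >> "$tmp_dir/hardlink.txt"
cat "$tmp_dir/original.txt"
# first line
# second line
```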

Now we’ll use ln -s to create data/who_tb_symlink.csv, a symlink for the data file in data/raw/who_tb_data.csv.

ln -s raw/who_tb_data.csv data/who_tb_symlink.csv

When we look at the folder with tree now, we see the symlink is listed with a special pointer (->) to the original file:

tree data -P '*.csv'
# data
# ├── psv
# ├── raw
# │   ├── music_vids.csv
# │   ├── pwrds.csv
# │   ├── trees.csv
# │   ├── vg_hof.csv
# │   ├── who_tb_data.csv
# │   └── wu_tang.csv
# ├── tsv
# └── who_tb_symlink.csv -> raw/who_tb_data.csv
# 
# 4 directories, 7 files

The symbolic link only references the actual file, but doesn’t store the data itself.
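One detail worth knowing: a relative symlink target (like raw/who_tb_data.csv) is stored exactly as written and resolved relative to the folder containing the link, not the folder we ran ln from. The sketch below illustrates this with hypothetical files in a temporary directory, using readlink to print the stored path:

```shell
tmp_dir=$(mktemp -d)
mkdir -p "$tmp_dir/data/raw"
echo "a,b" > "$tmp_dir/data/raw/file.csv"      # hypothetical file

# The target is resolved relative to the link's own folder (data/)
ln -s raw/file.csv "$tmp_dir/data/link.csv"

readlink "$tmp_dir/data/link.csv"              # print the stored target path
# raw/file.csv
cat "$tmp_dir/data/link.csv"                   # reading the link reads the target
# a,b
```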

In a terminal, tree lists the link first, followed by the target (the original file), and uses color to tell them apart:

Color for symlinks with tree

Info

The commands below return different types of information from a file (or files).

ls

The ls command lists the contents of a directory.

ls data/who_tb_hardlink.psv
# data/who_tb_hardlink.psv

Adding -l returns the contents in a detailed long format. The information below is for the original .psv file in data/psv/.

ls -l data/psv/who_tb_data.psv
# -rw-r--r--@ 2 username  staff  686 May 16 07:47 data/psv/who_tb_data.psv

-l will add information like file permissions, number of links, owner, group, file size, and last modification date for each item.

If we check the hardlink for the .psv file:

ls -l data/who_tb_hardlink.psv
# -rw-r--r--@ 2 username  staff  686 May 16 07:47 data/who_tb_hardlink.psv

Apart from the path, both ls -l commands return identical information (permissions, link count, owner, size, and time stamp), which is consistent with them being two paths to the same file rather than separate files.

Compare this to the symlink we created and its original file:

ls -l data/who_tb_symlink.csv
# lrwxr-xr-x@ 1 username  staff  19 May 16 07:50 data/who_tb_symlink.csv -> raw/who_tb_data.csv

The ls -l output for data/who_tb_symlink.csv (the symlink) returns lrwxr-xr-x, where the leading l marks a symbolic link, and a size of 19 bytes (the length of the stored path raw/who_tb_data.csv).

ls -l data/raw/who_tb_data.csv
# -rw-r--r--@ 1 username  staff  381 May 15 21:55 data/raw/who_tb_data.csv

The who_tb_data.csv file in data/raw returns -rw-r--r-- (standard permissions for a regular file) and a size of 381 bytes.

File link info (@)

The 2 that follows the permissions in the ls -l data/who_tb_hardlink.psv output is the link count: the file has two hard links, meaning the data is physically located in one place but can be accessed from two paths, data/who_tb_hardlink.psv and data/psv/who_tb_data.psv. The 1 from ls -l data/raw/who_tb_data.csv indicates a single hard link (creating a symbolic link does not increase the original file’s link count). The @ after the permissions is a macOS marker indicating the file has extended attributes.

We’ll cover this output more in the Permissions section below.
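If we only want the link count, one option is to pull the second field out of the ls -l output. The sketch below does this for a hypothetical file in a temporary directory and also uses the shell’s -ef test, which is true when two paths refer to the same file:

```shell
tmp_dir=$(mktemp -d)
touch "$tmp_dir/original.txt"                        # hypothetical file
ln "$tmp_dir/original.txt" "$tmp_dir/hardlink.txt"

# The second field of ls -l is the number of hard links
link_count=$(ls -l "$tmp_dir/original.txt" | awk '{print $2}')
echo "$link_count"
# 2

# -ef is true when both paths point to the same file (same device and inode)
[ "$tmp_dir/original.txt" -ef "$tmp_dir/hardlink.txt" ] && echo "same file"
# same file
```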

diff

diff compares the contents of two files line-by-line. We’ll use diff to compare the pipe-separated values file (data/wu_tang.psv) to the comma-separated values file (data/wu_tang.csv):

diff data/wu_tang.psv data/wu_tang.csv
# 1,11c1,11

The first line of the output indicates that lines 1 through 11 in the first file (data/wu_tang.psv) have been changed compared to lines 1 through 11 in the second file (data/wu_tang.csv).

Lines starting with < indicate the content from the first file (data/wu_tang.psv). These entries are separated by pipes or spaces (as commonly used in PSV files).

# < |Member           |Name                 |
# < |RZA              |Robert Diggs         |
# < |GZA              |Gary Grice           |
# < |Method Man       |Clifford Smith       |
# < |Raekwon the Chef |Corey Woods          |
# < |Ghostface Killah |Dennis Coles         |
# < |Inspectah Deck   |Jason Hunter         |
# < |U-God            |Lamont Hawkins       |
# < |Masta Killa      |Jamel Irief          |
# < |Cappadonna       |Darryl Hill          |
# < |Ol Dirty Bastard |Russell Tyrone Jones |
# ---
# > Member,Name
# > RZA,Robert Diggs
# > GZA,Gary Grice
# > Method Man,Clifford Smith
# > Raekwon the Chef,Corey Woods
# > Ghostface Killah,Dennis Coles
# > Inspectah Deck,Jason Hunter
# > U-God,Lamont Hawkins
# > Masta Killa,Jamel Irief
# > Cappadonna,Darryl Hill
# > Ol Dirty Bastard,Russell Tyrone Jones

Lines starting with > show the content from the second file (data/wu_tang.csv). These entries are separated by commas, as is typical for CSV files.

There is no difference in the actual data (Member or Name). Both files contain the same information, so the difference is purely in the formatting: the PSV (pipe-separated values) file uses vertical bars (|) and spaces to separate data fields, whereas the CSV (comma-separated values) file uses commas (,).7

What happens when we compare the symlink (data/who_tb_symlink.csv) and its original file (data/raw/who_tb_data.csv) with diff?

diff data/who_tb_symlink.csv data/raw/who_tb_data.csv

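diff doesn’t produce any output (and returns an exit status of zero) when there are no differences between the two files. Here, diff follows the symlink, so it ends up comparing the original file with itself.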
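Because diff is silent when files match, its exit status is the thing to check in a script: 0 means no differences and 1 means the files differ. A minimal sketch, assuming two hypothetical files in a temporary directory:

```shell
tmp_dir=$(mktemp -d)
printf 'Member,Name\n' > "$tmp_dir/a.csv"   # hypothetical files
cp "$tmp_dir/a.csv" "$tmp_dir/b.csv"

# diff exits with 0 when the files match and 1 when they differ
if diff "$tmp_dir/a.csv" "$tmp_dir/b.csv" > /dev/null; then
  result="identical"
else
  result="different"
fi
echo "$result"
# identical
```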

Deleting, moving, or renaming the original file does not affect the integrity of a hard link:

# remove original file
rm data/psv/who_tb_data.psv
# check hard link
cat data/who_tb_hardlink.psv
# | country     | year | type       | count      |
# |-------------|------|------------|------------|
# | Afghanistan | 1999 | cases      | 745        |
# | Afghanistan | 1999 | population | 19987071   |
# | Afghanistan | 2000 | cases      | 2666       |
# | Afghanistan | 2000 | population | 20595360   |
# | Brazil      | 1999 | cases      | 37737      |
# | Brazil      | 1999 | population | 172006362  |
# | Brazil      | 2000 | cases      | 80488      |
# | Brazil      | 2000 | population | 174504898  |
# | China       | 1999 | cases      | 212258     |
# | China       | 1999 | population | 1272915272 |
# | China       | 2000 | cases      | 213766     |
# | China       | 2000 | population | 1280428583 |

However, if the original file for a symlink is deleted, moved, or renamed, the symbolic link breaks and typically becomes a ‘dangling’ link that points to a non-existent path.
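We can reproduce a dangling link safely in a temporary directory. In this sketch the file names are hypothetical, -L checks that the path is a symlink, and -e (which follows the link) checks whether the target still exists:

```shell
tmp_dir=$(mktemp -d)
echo "a,b" > "$tmp_dir/original.csv"         # hypothetical file
ln -s original.csv "$tmp_dir/symlink.csv"

rm "$tmp_dir/original.csv"                   # remove the symlink's target

# The link itself still exists, but the path it points to is gone
if [ -L "$tmp_dir/symlink.csv" ] && [ ! -e "$tmp_dir/symlink.csv" ]; then
  link_state="dangling"
else
  link_state="ok"
fi
echo "$link_state"
# dangling
```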

file

file gives us a summary of what a file is or what it contains, like telling us what’s in the file behind data/who_tb_symlink.csv.

file data/who_tb_symlink.csv
# data/who_tb_symlink.csv: CSV text

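On macOS, the -i option tells us whether this is a regular file without classifying its contents (with GNU file on Linux, -i prints the MIME type instead):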

file -i data/who_tb_symlink.csv
# data/who_tb_symlink.csv: regular file

wc

wc (word count) counts the number of lines, words, and characters in the given input. If a file name is provided, it performs the count on the file; otherwise, it reads from the standard input.

wc data/who_tb_symlink.csv
#       13      13     381 data/who_tb_symlink.csv
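The three numbers in the default wc output are lines, words, and bytes, in that order. Each count can also be requested on its own with -l, -w, or -c. A small sketch using a hypothetical two-line file in a temporary directory:

```shell
tmp_dir=$(mktemp -d)
printf 'a,b\nc,d\n' > "$tmp_dir/tiny.csv"   # hypothetical two-line file

wc -l < "$tmp_dir/tiny.csv"   # lines: 2
wc -w < "$tmp_dir/tiny.csv"   # words: 2 (no spaces, so each line is one word)
wc -c < "$tmp_dir/tiny.csv"   # bytes: 8
```

Reading from standard input with < keeps the file name out of the output.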

stat

stat displays detailed information about files.

stat data/who_tb_symlink.csv
# 16777221 317774438 lrwxr-xr-x 1 username staff 0 19 "May 13 13:10:34 2024" \
#     "May 13 13:10:34 2024" "May 13 13:10:34 2024" "May 13 13:10:34 2024"     \ 
#     4096 0 0 data/who_tb_symlink.csv

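Adding the -l option (available in the macOS/BSD version of stat) prints the output in ls -lT format, which shows the symbolic link pointing to its original file.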

stat -l data/who_tb_symlink.csv
# lrwxr-xr-x 1 username staff 19 May 13 13:10:34 2024 \
#     data/who_tb_symlink.csv -> raw/who_tb_data.csv

du

du estimates file space usage. By default, the size is reported in blocks (512-byte blocks on macOS, so 24 blocks is 12K).

du data/raw/pwrds.csv
# 24    data/raw/pwrds.csv

The -h option makes the output human readable.

du -h data/raw/pwrds.csv
#  12K  data/raw/pwrds.csv

If we pass the original file and symlink of who_tb_data.csv to du, we see the symlink doesn’t contain any actual data:

du -h data/raw/who_tb_data.csv
# 4.0K  data/raw/who_tb_data.csv
du -h data/who_tb_symlink.csv
#   0B  data/who_tb_symlink.csv

Permissions

We’ll go over file permissions in-depth in the Permissions chapter, but I’ll quickly summarize two common uses of chmod below. The file permissions are printed with the ls -l output:

ls -l data/README.md
# -rw-r--r--@ 1 username  staff  8834 May 14 09:33 data/README.md

chmod

chmod changes file permissions. Permissions can be specified using either symbolic or octal (numeric) notation. Below is a simple example using symbolic notation.8

To grant the permissions above (i.e., -rw-r--r--) using symbolic notation with chmod, we could use:

chmod u=rw,g=r,o=r data/README.md

u=rw: sets the user (owner) permissions to read and write.

g=r: sets the group permissions to read.

o=r: sets the permissions for others to read.
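To check the result without touching data/README.md, we can try the same chmod call on a hypothetical file in a temporary directory and read the permissions back from ls -l:

```shell
tmp_dir=$(mktemp -d)
touch "$tmp_dir/notes.md"                  # hypothetical file
chmod u=rw,g=r,o=r "$tmp_dir/notes.md"

# Keep only the first ten characters of ls -l (file type plus permissions)
perms=$(ls -l "$tmp_dir/notes.md" | cut -c1-10)
echo "$perms"
# -rw-r--r--
```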

Recap



  1. data/who_tb_data.tsv comes from the WHO global tuberculosis programme.↩︎

  2. data/vg_hof.tsv is the Video Game Hall of Fame data↩︎

  3. locate sometimes requires the search database to be generated or updated. Read more here↩︎

  4. The -L 2 option tells tree to descend at most two levels (the data folder and its immediate subfolders), -P '*.tsv' matches the .tsv files, and --filesfirst lists the files first, then the directories.↩︎

  5. The -L 2 option tells tree to descend at most two levels (the data folder and its immediate subfolders) and -P '*who_tb_data.psv' matches the who_tb_data.psv file.↩︎

  6. cp and mv also work with directories.↩︎

  7. This type of difference is significant if the format impacts how data is parsed or used. For example, a software program expecting data in CSV format might not correctly parse a PSV file, and vice versa.↩︎

  8. We’ll cover octal (numeric) notation in the Permissions chapter.↩︎