Files
The following commands can be used for creating, managing, and manipulating files. Some of these commands also work on directories (which we covered in the previous chapter).
Create
The commands below can be used to create new files or update the time stamp of an existing file.
touch
We’ll start by creating a new empty file (data/who_tb_data.tsv) with touch:
touch data/who_tb_data.tsv
To confirm the new who_tb_data.tsv file was created, we’ll use the tree command to check the data folder:
tree -P who_tb_data.tsv data
# data
# └── who_tb_data.tsv
#
# 1 directory, 1 file
The -P option lets us specify a pattern to search for in the data folder, which we’ll cover more in Symbols & Patterns.
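As noted above, touch can also update the time stamp of an existing file. The sketch below illustrates both behaviors in a throwaway folder created with mktemp (the workdir variable and the example.tsv file name are placeholders for illustration, not files from this book):

```shell
# Create a temporary folder so nothing in data/ is modified
workdir=$(mktemp -d)

# touch creates an empty file when the path doesn't exist yet
touch "$workdir/example.tsv"
wc -c < "$workdir/example.tsv"    # reports 0 bytes: the new file is empty

# Add a line, then touch the file again
echo "country year type count" > "$workdir/example.tsv"
touch "$workdir/example.tsv"

# The contents are unchanged; only the time stamp was updated
cat "$workdir/example.tsv"
# country year type count
```

Running touch on an existing file never alters its contents, which makes it safe to use on files that already hold data.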
echo
We can add some contents to the data/who_tb_data.tsv file using echo and the > operator.1
echo "country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583" > data/who_tb_data.tsv
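The > operator overwrites the target file, while >> appends to it. If we want to be sure the columns are separated by real tab characters, printf with \t is one option. The following is a minimal sketch that writes to a placeholder file in a temporary folder (the tsv_file variable is an assumption for illustration), using the first rows of the data above:

```shell
# Write to a temporary folder so data/who_tb_data.tsv is left alone
workdir=$(mktemp -d)
tsv_file="$workdir/example.tsv"

# > creates (or overwrites) the file with the header row
printf 'country\tyear\ttype\tcount\n' > "$tsv_file"

# >> appends rows without removing what is already there
printf 'Afghanistan\t1999\tcases\t745\n' >> "$tsv_file"
printf 'Afghanistan\t1999\tpopulation\t19987071\n' >> "$tsv_file"

# cut splits on tabs by default, so -f2 returns the year column
cut -f2 "$tsv_file"
# year
# 1999
# 1999
```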
View
cat
cat concatenates and displays file contents. We can use this to view the entire data/who_tb_data.tsv file we just created:
cat data/who_tb_data.tsv
# country year type count
# Afghanistan 1999 cases 745
# Afghanistan 1999 population 19987071
# Afghanistan 2000 cases 2666
# Afghanistan 2000 population 20595360
# Brazil 1999 cases 37737
# Brazil 1999 population 172006362
# Brazil 2000 cases 80488
# Brazil 2000 population 174504898
# China 1999 cases 212258
# China 1999 population 1272915272
# China 2000 cases 213766
# China 2000 population 1280428583
more & less
less and more let you skim through a file on your computer, moving forwards and backwards as you please. These commands are helpful for larger files, like the Video Game Hall of Fame data stored in the data/vg_hof.tsv file:2
more data/vg_hof.tsv
[Figure: more scroll]
less data/vg_hof.tsv
[Figure: less scroll]
head & tail
The head and tail commands let us view the tops and bottoms of files (the -n3 specifies three rows from data/vg_hof.tsv).
head -n3 data/vg_hof.tsv
# year game developer year_released
# 2015 DOOM id Software 1993
# 2015 Pac-Man Namco 1980
tail -n3 data/vg_hof.tsv
# 2024 Tony Hawk's Pro Skater Neversoft 1999
# 2024 Ultima Richard Garriott, Origin Systems 1981
# 2024 You Don't Know Jack Jellyvision 1995
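head and tail can also be combined to pull out a specific line, and tail -n +2 is a common way to skip a header row. Below is a small sketch using a placeholder file (rows.txt is made up for illustration):

```shell
# Build a five-line placeholder file in a temporary folder
workdir=$(mktemp -d)
printf 'header\nrow1\nrow2\nrow3\nrow4\n' > "$workdir/rows.txt"

# tail -n +2 starts printing at line 2 (everything except the header)
tail -n +2 "$workdir/rows.txt"
# row1
# row2
# row3
# row4

# head keeps the first three lines, then tail keeps the last of those
third_line=$(head -n 3 "$workdir/rows.txt" | tail -n 1)
echo "$third_line"
# row2
```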
Search
These commands can help you search inside files and search for files (and directories).
grep
grep searches files for lines matching a pattern. We’ll use it to search for a specific video game title in data/vg_hof.tsv:
grep "The Oregon Trail" data/vg_hof.tsv
# 2015 The Oregon Trail Don Rawitsch, Bill Heinemann, and Paul Dillenberger 1971
# 2016 The Oregon Trail Don Rawitsch, Bill Heinemann, and Paul Dillenberger 1971
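A few grep options are worth knowing early: -c counts matching lines, -i ignores case, and -n prefixes each match with its line number. The sketch below uses a small stand-in file (games.tsv is a placeholder holding a handful of rows shaped like the Hall of Fame data):

```shell
# Small stand-in file in a temporary folder
workdir=$(mktemp -d)
printf '2015\tDOOM\n2015\tPac-Man\n2015\tThe Oregon Trail\n2016\tThe Oregon Trail\n' \
  > "$workdir/games.tsv"

# -c prints the number of matching lines instead of the lines themselves
match_count=$(grep -c "Oregon" "$workdir/games.tsv")
echo "$match_count"
# 2

# -i ignores case and -n adds the line number of each match
matched_line=$(grep -n -i "pac-man" "$workdir/games.tsv" | cut -d: -f1)
echo "$matched_line"
# 2
```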
find
find is used to search for files and directories in a directory hierarchy based on various criteria such as name, size, file type, and modification time.
The .psv extension is used for pipe-separated (|) files. We’ll use find to locate any .psv files in data/:
find data -name "*.psv"
# data/wu_tang.psv
find can be very specific, too. For example, the command below looks in the data directory for tab-delimited files (i.e., with a .tsv extension) modified in the last day.
find data -name "*.tsv" -mtime -1
# data/who_tb_data.tsv
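Because -name matches directories as well as files, adding -type f restricts the results to regular files. The sketch below sets up a placeholder folder (including a directory deliberately named old.tsv) to show the difference; all names are made up for illustration:

```shell
# Placeholder folder tree in a temporary location
workdir=$(mktemp -d)
mkdir -p "$workdir/data/sub" "$workdir/data/old.tsv"   # old.tsv is a directory
touch "$workdir/data/a.tsv" "$workdir/data/sub/b.tsv" "$workdir/data/c.psv"

# -name alone matches the old.tsv directory too
(cd "$workdir" && find data -name "*.tsv" | sort)
# data/a.tsv
# data/old.tsv
# data/sub/b.tsv

# -type f keeps only regular files
tsv_file_count=$(cd "$workdir" && find data -type f -name "*.tsv" | wc -l | tr -d ' ')
echo "$tsv_file_count"
# 2
```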
locate
locate finds files by name quickly using a database.3
locate who_tb_data data | head -n3
# /Users/mjfrigaard/projects/books/fm-unix/data/csv/who_tb_data.csv
# /Users/mjfrigaard/projects/books/fm-unix/data/tsv/who_tb_data.tsv
# /Users/mjfrigaard/projects/books/fm-unix/data/who_tb_data.csv
Manage
The commands below can be used for copying, moving, renaming, and creating links to files. Assume we want to create backups of the delimiter-separated data files in the data/ folder. We’ll store these backups in folders according to the file extension.
First we need to create folders for each type of delimiter (.tsv and .psv):
mkdir data/tsv
mkdir data/psv
Confirm these new folders with tree -d:
tree data -d
# data
# ├── psv
# ├── raw
# └── tsv
#
# 4 directories
cp
We’ll copy the files into their respective folders based on their extension using cp and *. For reference, here is a look at the data/ folder before and after copying the .tsv files:
cp data/*.tsv data/tsv/
Before copying .tsv files into data/tsv/:
tree data -L 2 -P '*.tsv' --filesfirst
# data
# ├── music_vids.tsv
# ├── pwrds.tsv
# ├── trees.tsv
# ├── vg_hof.tsv
# ├── who_tb_data.tsv
# ├── wu_tang.tsv
# ├── psv
# ├── raw
# └── tsv
#
# 4 directories, 6 files
After copying .tsv files into data/tsv/:4
tree data -L 2 -P '*.tsv' --filesfirst
# data
# ├── music_vids.tsv
# ├── pwrds.tsv
# ├── trees.tsv
# ├── vg_hof.tsv
# ├── who_tb_data.tsv
# ├── wu_tang.tsv
# ├── psv
# ├── raw
# └── tsv
#     ├── music_vids.tsv
#     ├── pwrds.tsv
#     ├── trees.tsv
#     ├── vg_hof.tsv
#     ├── who_tb_data.tsv
#     └── wu_tang.tsv
#
# 4 directories, 12 files
Note that the number of .tsv files doubled (from six files to twelve files). We’ll do the same for the .psv files in data/:
cp data/*.psv data/psv/
Confirm the .psv files were copied:
tree data/psv/
# data/psv/
# └── wu_tang.psv
#
# 1 directory, 1 file
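A file created with cp is fully independent of its source, so editing the backup leaves the original untouched. Here is a minimal sketch using placeholder files in a temporary folder (a.tsv, b.tsv, and the backup folder are made up for illustration):

```shell
# Placeholder files in a temporary folder
workdir=$(mktemp -d)
mkdir "$workdir/backup"
echo "original" > "$workdir/a.tsv"
echo "original" > "$workdir/b.tsv"

# The * wildcard expands to every .tsv file in the folder
cp "$workdir"/*.tsv "$workdir/backup/"

# Editing the copy does not change the source file
echo "edited" >> "$workdir/backup/a.tsv"
cat "$workdir/a.tsv"
# original
```

This is worth keeping in mind when we get to hard links below, which behave differently.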
mv
The commands below create a .psv version of the who_tb_data.tsv file we created above:
touch data/who_tb_data.psv
echo "| country | year | type | count |
|-------------|------|------------|------------|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |" > data/who_tb_data.psv
Oops, we created who_tb_data.psv in the data folder and not the data/psv folder:5
tree data -L 2 -P '*who_tb_data.psv'
# data
# ├── psv
# ├── raw
# ├── tsv
# └── who_tb_data.psv
#
# 4 directories, 1 file
We’ll use mv to move the data/who_tb_data.psv file into data/psv/ and confirm with tree:6
mv data/who_tb_data.psv data/psv/who_tb_data.psv
tree data/psv
# data/psv
# ├── who_tb_data.psv
# └── wu_tang.psv
#
# 1 directory, 2 files
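mv is also how files are renamed: moving a file to a new name in the same folder is a rename. The sketch below shows both uses with placeholder names (draft.psv and final.psv are made up for illustration):

```shell
# Placeholder file and folder in a temporary location
workdir=$(mktemp -d)
mkdir "$workdir/psv"
echo "a|b" > "$workdir/draft.psv"

# Same folder, new name: this is a rename
mv "$workdir/draft.psv" "$workdir/final.psv"

# Different folder, trailing slash: the file keeps its name
mv "$workdir/final.psv" "$workdir/psv/"

ls "$workdir/psv"
# final.psv
```

Unlike cp, mv leaves nothing behind at the old path.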
ln
ln is used to create links between files. It can create two types of links: hard links and symbolic (or soft) links.
A hard link is an additional name for an existing file on the same file system, and is effectively an additional directory entry for the file. In Linux file systems, all file names are technically hard links.
A symbolic link (often called a symlink) is a file that points to another file or directory, and it contains a path to another entry somewhere in the file system.
With the ln command, you need to specify the target file first (the original file) and then the name of the new link:
ln original_file.txt new_link.txt
We’ll use ln to create data/who_tb_hardlink.psv, a hard link for the data file in data/psv/who_tb_data.psv:
ln data/psv/who_tb_data.psv data/who_tb_hardlink.psv
If we check the data folder with tree, we see the new who_tb_hardlink.psv file looks like any other file:
tree data -P '*.psv'
# data
# ├── csv
# ├── psv
# │ ├── who_tb_data.psv
# │ └── wu_tang.psv
# ├── tsv
# ├── who_tb_hardlink.psv
# └── wu_tang.psv
#
# 4 directories, 4 files
Hard links can look like copies, but they are not: both names refer to the same data on disk, so changes made through one name are reflected in the other.
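Now we’ll use ln -s to create data/who_tb_symlink.csv, a symlink for the data file in data/raw/who_tb_data.csv. Because the target is given as a relative path (raw/who_tb_data.csv), it is resolved relative to the folder containing the link (data/), not the current working directory.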
ln -s raw/who_tb_data.csv data/who_tb_symlink.csv
When we look at the folder with tree now, we see the symlink is listed with a special pointer (->) to the original file:
tree data -P '*.csv'
# data
# ├── psv
# ├── raw
# │ ├── music_vids.csv
# │ ├── pwrds.csv
# │ ├── trees.csv
# │ ├── vg_hof.csv
# │ ├── who_tb_data.csv
# │ └── wu_tang.csv
# ├── tsv
# └── who_tb_symlink.csv -> raw/who_tb_data.csv
#
# 4 directories, 7 files
The symbolic link only references the actual file, but doesn’t store the data itself.
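To see the difference between the two link types in action, the sketch below builds a hard link and a symlink to the same placeholder file (original.csv and the link names are made up for illustration), then edits the file through the hard link:

```shell
# Placeholder file in a temporary folder
workdir=$(mktemp -d)
mkdir "$workdir/raw"
echo "country,year" > "$workdir/raw/original.csv"

# Hard link: a second name for the same data
ln "$workdir/raw/original.csv" "$workdir/hardlink.csv"

# Symlink: the relative target is resolved from the link's own folder
ln -s raw/original.csv "$workdir/symlink.csv"

# Appending through the hard link changes the one underlying file
echo "Brazil,1999" >> "$workdir/hardlink.csv"
cat "$workdir/raw/original.csv"
# country,year
# Brazil,1999

# ls -i prints the inode number; both hard-linked names share it
ls -i "$workdir/raw/original.csv" "$workdir/hardlink.csv"

# readlink shows the path stored inside the symlink
readlink "$workdir/symlink.csv"
# raw/original.csv
```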
Info
The commands below return different types of information from a file (or files).
ls
The ls command lists the contents of a directory. When given the path to a file, it lists that file:
ls data/who_tb_hardlink.psv
# data/who_tb_hardlink.psv
Adding -l returns the contents in a detailed long format. The information below is from the original .psv file.
ls -l data/psv/who_tb_data.psv
# -rw-r--r--@ 2 username staff 686 May 16 07:47 data/psv/who_tb_data.psv
-l will add information like file permissions, number of links, owner, group, file size, and last modification date for each item.
If we check the hardlink for the .psv file:
ls -l data/who_tb_hardlink.psv
# -rw-r--r--@ 2 username staff 686 May 16 07:47 data/who_tb_hardlink.psv
Both ls -l commands return the same permissions, link count, size, and time stamp for the two .psv paths, indicating they’re not treated as separate files, but rather two paths to the same file.
Compare this to the symlink we created and its original file:
ls -l data/who_tb_symlink.csv
# lrwxr-xr-x@ 1 username staff 19 May 16 07:50 data/who_tb_symlink.csv -> raw/who_tb_data.csv
The ls -l output for data/who_tb_symlink.csv (the symlink) returns lrwxr-xr-x (the leading l marks a symbolic link). Compare this with the original file:
ls -l data/raw/who_tb_data.csv
# -rw-r--r--@ 1 username staff 381 May 15 21:55 data/raw/who_tb_data.csv
The who_tb_data.csv file in data/raw returns -rw-r--r-- (standard file permissions).
File link info (@)
The 2 that follows the permissions in the ls -l data/who_tb_hardlink.psv output is the link count. It indicates the file has two hard links, meaning who_tb_data.psv is physically located in one place but can be accessed from two locations: data/who_tb_hardlink.psv and data/psv/who_tb_data.psv. The 1 from ls -l data/raw/who_tb_data.csv indicates a single hard link (a newly created symbolic link also has a link count of one). The @ itself is not part of the link count: on macOS, it marks a file that has extended attributes.
We’ll cover this output more in the Permissions section below.
diff
diff compares the contents of two files line-by-line. We’ll use diff to compare the pipe-separated values file (data/wu_tang.psv) to the comma-separated values file (data/wu_tang.csv):
diff data/wu_tang.psv data/wu_tang.csv
# 1,11c1,11
The first line of the output indicates that lines 1 through 11 in the first file (data/wu_tang.psv) have been changed compared to lines 1 through 11 in the second file (data/wu_tang.csv).
Lines starting with < indicate the content from the first file (data/wu_tang.psv). These entries are separated by pipes or spaces (as commonly used in PSV files).
# < |Member |Name |
# < |RZA |Robert Diggs |
# < |GZA |Gary Grice |
# < |Method Man |Clifford Smith |
# < |Raekwon the Chef |Corey Woods |
# < |Ghostface Killah |Dennis Coles |
# < |Inspectah Deck |Jason Hunter |
# < |U-God |Lamont Hawkins |
# < |Masta Killa |Jamel Irief |
# < |Cappadonna |Darryl Hill |
# < |Ol Dirty Bastard |Russell Tyrone Jones |
# > Member,Name
# > RZA,Robert Diggs
# > GZA,Gary Grice
# > Method Man,Clifford Smith
# > Raekwon the Chef,Corey Woods
# > Ghostface Killah,Dennis Coles
# > Inspectah Deck,Jason Hunter
# > U-God,Lamont Hawkins
# > Masta Killa,Jamel Irief
# > Cappadonna,Darryl Hill
# > Ol Dirty Bastard,Russell Tyrone Jones
Lines starting with > show the content from the second file (data/wu_tang.csv). These entries are separated by commas, as is typical for CSV files.
There is no difference in the actual data (Member or Name): both files contain the same information, so the only difference is in the formatting. The PSV (pipe-separated values) file uses vertical bars (|) and spaces to separate data fields, whereas the CSV (comma-separated values) file uses commas (,).7
What happens when we compare the symlink (data/who_tb_symlink.csv) and its original file (data/raw/who_tb_data.csv) with diff?
diff data/who_tb_symlink.csv data/raw/who_tb_data.csv
diff follows the symlink to the original file, so the contents are identical. When there are no differences between the two files, diff doesn’t produce any output.
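Since diff prints nothing for identical files, its exit status is the reliable way to check the result in a script: 0 means no differences and 1 means the files differ. The sketch below uses placeholder files (a.csv, b.csv, and c.csv are made up, with rows borrowed from the Wu-Tang data above):

```shell
# Placeholder files in a temporary folder
workdir=$(mktemp -d)
printf 'Member,Name\nRZA,Robert Diggs\n' > "$workdir/a.csv"
cp "$workdir/a.csv" "$workdir/b.csv"
printf 'Member,Name\nGZA,Gary Grice\n' > "$workdir/c.csv"

# An if statement branches on diff's exit status
if diff "$workdir/a.csv" "$workdir/b.csv" > /dev/null; then
  copy_result="identical"
else
  copy_result="different"
fi

if diff "$workdir/a.csv" "$workdir/c.csv" > /dev/null; then
  other_result="identical"
else
  other_result="different"
fi

echo "$copy_result $other_result"
# identical different
```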
Deleting, moving, or renaming the original file does not affect the integrity of a hard link:
# remove original file
rm data/psv/who_tb_data.psv
# check hard link
cat data/who_tb_hardlink.psv
# | country | year | type | count |
# |-------------|------|------------|------------|
# | Afghanistan | 1999 | cases | 745 |
# | Afghanistan | 1999 | population | 19987071 |
# | Afghanistan | 2000 | cases | 2666 |
# | Afghanistan | 2000 | population | 20595360 |
# | Brazil | 1999 | cases | 37737 |
# | Brazil | 1999 | population | 172006362 |
# | Brazil | 2000 | cases | 80488 |
# | Brazil | 2000 | population | 174504898 |
# | China | 1999 | cases | 212258 |
# | China | 1999 | population | 1272915272 |
# | China | 2000 | cases | 213766 |
# | China | 2000 | population | 1280428583 |
However, if the original file for a symlink is deleted, moved, or renamed, the symbolic link breaks and typically becomes a ‘dangling’ link that points to a non-existent path.
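A dangling symlink can be detected with two shell tests: -L is true when the path is a symlink, and -e is false when the path it points to doesn't exist. The sketch below removes a placeholder original file (all names are made up for illustration) and checks both kinds of link:

```shell
# Placeholder file with one hard link and one symlink
workdir=$(mktemp -d)
echo "data" > "$workdir/original.csv"
ln "$workdir/original.csv" "$workdir/hardlink.csv"
ln -s original.csv "$workdir/symlink.csv"

# Remove the original name
rm "$workdir/original.csv"

# The hard link still reaches the data
cat "$workdir/hardlink.csv"
# data

# The symlink exists, but its target does not
if [ -L "$workdir/symlink.csv" ] && [ ! -e "$workdir/symlink.csv" ]; then
  link_state="dangling"
else
  link_state="ok"
fi
echo "$link_state"
# dangling
```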
file
file gives us a summary of what a file is or what it contains, like telling us what’s in data/who_tb_symlink.csv.
file data/who_tb_symlink.csv
# data/who_tb_symlink.csv: CSV text
On macOS, the -i option will tell us if this is a regular file (on Linux, file -i prints the MIME type instead):
file -i data/who_tb_symlink.csv
# data/who_tb_symlink.csv: regular file
readlink
readlink displays the target of a symbolic link.
readlink data/who_tb_symlink.csv
# raw/who_tb_data.csv
The -f option provides the target’s absolute path.
readlink -f data/who_tb_symlink.csv
# path/to/data/raw/who_tb_data.csv
wc
wc (word count) counts the number of lines, words, and bytes in the given input (for plain ASCII text, bytes and characters are the same). If a file name is provided, it performs the count on the file; otherwise, it reads from the standard input.
wc data/who_tb_symlink.csv
# 13 13 381 data/who_tb_symlink.csv
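Each count can also be requested on its own with -l (lines), -w (words), or -c (bytes). Redirecting the file into wc with < makes it read from standard input, which drops the file name from the output. A small sketch with a placeholder file (small.csv is made up for illustration):

```shell
# Three-line placeholder file in a temporary folder
workdir=$(mktemp -d)
printf 'country,year\nBrazil,1999\nChina,2000\n' > "$workdir/small.csv"

# Reading from standard input prints only the number
# (tr removes the padding some versions of wc add)
line_count=$(wc -l < "$workdir/small.csv" | tr -d ' ')
byte_count=$(wc -c < "$workdir/small.csv" | tr -d ' ')

echo "$line_count lines, $byte_count bytes"
# 3 lines, 36 bytes
```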
stat
stat displays detailed information about files.
stat data/who_tb_symlink.csv
# 16777221 317774438 lrwxr-xr-x 1 username staff 0 19 "May 13 13:10:34 2024" \
# "May 13 13:10:34 2024" "May 13 13:10:34 2024" "May 13 13:10:34 2024" \
# 4096 0 0 data/who_tb_symlink.csv
Adding -l (available in the macOS/BSD version of stat) displays the output in a format similar to ls -l, which includes the link to the original file.
stat -l data/who_tb_symlink.csv
# lrwxr-xr-x 1 username staff 19 May 13 13:10:34 2024 \
# data/who_tb_symlink.csv -> csv/who_tb_data.csv
du
du estimates file space usage.
du data/raw/pwrds.csv
# 24 data/raw/pwrds.csv
The -h option makes the output human readable (on macOS, the 24 above is a count of 512-byte blocks, which is the same 12K):
du -h data/raw/pwrds.csv
# 12K data/raw/pwrds.csv
If we pass the original file and symlink of who_tb_data.csv to du, we see the symlink doesn’t contain any actual data:
du -h data/raw/who_tb_data.csv
# 4.0K data/raw/who_tb_data.csv
du -h data/who_tb_symlink.csv
# 0B data/who_tb_symlink.csv
Permissions
We’ll go over file permissions in-depth in the Permissions chapter, but I’ll quickly summarize two common uses of chmod below. The file permissions are printed with the ls -l output:
ls -l data/README.md
# -rw-r--r--@ 1 username staff 8834 May 14 09:33 data/README.md
chmod
chmod changes file permissions using either symbolic or octal (numeric) notation. Below are some simple examples using symbolic notation.8
To grant the permissions above (i.e., -rw-r--r--) using symbolic notation with chmod, we could use:
chmod u=rw,g=r,o=r data/README.md
u=rw: sets the user (owner) permissions to read and write.
g=r: sets the group permissions to read.
o=r: sets the permissions for others to read.
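Symbolic notation can also add (+) or remove (-) individual permissions without restating the rest. The sketch below starts from the same -rw-r--r-- permissions on a placeholder file (script.sh is made up for illustration), then adjusts them one at a time:

```shell
# Placeholder file in a temporary folder
workdir=$(mktemp -d)
touch "$workdir/script.sh"

# Start from the permissions shown above: -rw-r--r--
chmod u=rw,g=r,o=r "$workdir/script.sh"

# + adds a permission: let the owner execute the file
chmod u+x "$workdir/script.sh"

# - removes a permission: others can no longer read it
chmod o-r "$workdir/script.sh"

# The first ten characters of ls -l hold the file type and permissions
permissions=$(ls -l "$workdir/script.sh" | cut -c1-10)
echo "$permissions"
# -rwxr-----
```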
Recap
1. data/who_tb_data.tsv comes from the WHO global tuberculosis programme.
2. data/vg_hof.tsv is the Video Game Hall of Fame data.
3. locate sometimes requires that the search database be generated or updated. Read more here.
4. The -L 2 option limits tree to two levels (the data folder and the contents of its immediate subfolders), -P '*.tsv' matches the .tsv files, and --filesfirst lists the files first, then the new directories.
5. The -L 2 option limits tree to two levels (the data folder and the contents of its immediate subfolders) and -P '*who_tb_data.psv' matches the who_tb_data.psv file.
6. cp and mv also work with directories.
7. This type of difference is significant if the format impacts how data is parsed or used. For example, a software program expecting data in CSV format might not correctly parse a PSV file, and vice versa.
8. We’ll cover octal (numeric) notation in the Permissions chapter.