Files
The following commands can be used for creating, managing, and manipulating files. Some of these commands also work on directories (which we covered in the previous chapter).
Create
The commands below can be used to create new files or update the time stamp of an existing file.
touch
We'll start by creating a new empty file (data/who_tb_data.tsv) with touch:
touch data/who_tb_data.tsv
To confirm the new who_tb_data.tsv file was created, we'll use the tree command to check the data folder:
tree -P who_tb_data.tsv data
# data
# └── who_tb_data.tsv
#
# 1 directory, 1 file
The -P option lets us specify a pattern to search for in the data folder, which we'll cover more in Special Characters, Wildcards, Regular Expressions, and Manipulating Text.
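The two behaviors of touch described above (creating an empty file, and updating the time stamp of an existing one without changing its contents) can be sketched as follows. This is a minimal illustration that works in a temporary directory; the tmp_dir and demo_file names are hypothetical and not part of the book's data folder.

```shell
# Work in a throwaway directory so nothing in data/ is modified.
tmp_dir=$(mktemp -d)
demo_file="$tmp_dir/demo.tsv"   # hypothetical file name

# First call: the file does not exist yet, so touch creates it empty.
touch "$demo_file"
size_after_create=$(wc -c < "$demo_file" | tr -d ' ')
echo "size after create: $size_after_create"

# Add a line, then touch again: only the time stamp is updated.
echo "country year" > "$demo_file"
touch "$demo_file"
contents_after_touch=$(cat "$demo_file")
echo "contents after second touch: $contents_after_touch"
```

Running touch on a file that already has data is safe: the contents are left alone.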
echo
We can add some contents to the data/who_tb_data.tsv file using echo and the > operator.1
echo "country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583" > data/who_tb_data.tsv
View
cat
cat concatenates and displays file contents. We can use this to view the entire data/who_tb_data.tsv file we just created:
cat data/who_tb_data.tsv
# country year type count
# Afghanistan 1999 cases 745
# Afghanistan 1999 population 19987071
# Afghanistan 2000 cases 2666
# Afghanistan 2000 population 20595360
# Brazil 1999 cases 37737
# Brazil 1999 population 172006362
# Brazil 2000 cases 80488
# Brazil 2000 population 174504898
# China 1999 cases 212258
# China 1999 population 1272915272
# China 2000 cases 213766
# China 2000 population 1280428583
more & less
less and more let you skim through a file, moving forwards and backwards as you please. These commands are helpful for larger files, like the Video Game Hall of Fame data stored in the data/vg_hof.tsv file:2
more data/vg_hof.tsv
less data/vg_hof.tsv
head & tail
The head and tail commands let us view the tops and bottoms of files (the -n3 specifies three rows from data/vg_hof.tsv).
head -n3 data/vg_hof.tsv
# year game developer year_released
# 2015 DOOM id Software 1993
# 2015 Pac-Man Namco 1980
tail -n3 data/vg_hof.tsv
# 2024 Tony Hawk's Pro Skater Neversoft 1999
# 2024 Ultima Richard Garriott, Origin Systems 1981
# 2024 You Don't Know Jack Jellyvision 1995
Search
These commands can help you search for files (and directories), or for text inside them.
grep
grep searches files for lines matching a pattern. We'll use it to search for a specific video game title in data/vg_hof.tsv:
grep "The Oregon Trail" data/vg_hof.tsv
# 2015 The Oregon Trail Don Rawitsch, Bill Heinemann, and Paul Dillenberger 1971
# 2016 The Oregon Trail Don Rawitsch, Bill Heinemann, and Paul Dillenberger 1971
find
find is used to search for files and directories in a directory hierarchy based on various criteria such as name, size, file type, and modification time.
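As a rough sketch of how find combines criteria, the example below builds a small temporary folder and filters it by name, type, and modification time. The folder layout and file names are made up for illustration; only the -type, -name, and -mtime tests are used.

```shell
# Build a small, hypothetical folder tree to search.
tmp_dir=$(mktemp -d)
mkdir "$tmp_dir/sub"
touch "$tmp_dir/a.tsv" "$tmp_dir/sub/b.tsv" "$tmp_dir/c.psv"

# -type f keeps regular files only; -name matches on the file name.
tsv_count=$(find "$tmp_dir" -type f -name "*.tsv" | wc -l | tr -d ' ')
echo "tsv files: $tsv_count"

# -type d keeps directories only (this includes $tmp_dir itself).
dir_count=$(find "$tmp_dir" -type d | wc -l | tr -d ' ')
echo "directories: $dir_count"

# -mtime -1 keeps files modified less than one day ago.
recent_count=$(find "$tmp_dir" -type f -mtime -1 | wc -l | tr -d ' ')
echo "recently modified files: $recent_count"
```

Tests placed side by side are combined with a logical AND, so every condition has to match for a path to be printed.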
The .psv extension is used for pipe-separated files (|). We'll use find to locate any .psv files in data/:
find data -name "*.psv"
# data/wu_tang.psv
find can be very specific, too. For example, the command below looks in the data directory for tab-delimited files (i.e., with a .tsv extension) modified in the last day.
find data -name "*.tsv" -mtime -1
# data/who_tb_data.tsv
locate
locate finds files by name quickly using a database.3
locate who_tb_data data | head -n3
# /Users/mjfrigaard/projects/books/fm-linux/data/csv/who_tb_data.csv
# /Users/mjfrigaard/projects/books/fm-linux/data/tsv/who_tb_data.tsv
# /Users/mjfrigaard/projects/books/fm-linux/data/who_tb_data.csv
Manage
The commands below can be used for copying, moving, renaming, and creating links to files. Assume we want to create backups of the delimiter-separated data files in the data/ folder. We'll store these backups in folders according to the file extension.
First we need to create folders for each type of delimiter (.tsv and .psv):
mkdir data/tsv
mkdir data/psv
Confirm these new folders with tree -d:
tree data -d
# data
# ├── psv
# ├── raw
# └── tsv
#
# 4 directories
cp
We'll copy the files into their respective folders based on their extension using cp and the * wildcard. For reference, here is a look at the data/ folder before and after copying the .tsv files:
cp data/*.tsv data/tsv/
Before copying .tsv files into data/tsv/:4
tree data -L 2 -P '*.tsv' --filesfirst
# data
# ├── music_vids.tsv
# ├── pwrds.tsv
# ├── trees.tsv
# ├── vg_hof.tsv
# ├── who_tb_data.tsv
# ├── wu_tang.tsv
# ├── psv
# ├── raw
# └── tsv
#
# 4 directories, 6 files
After copying .tsv files into data/tsv/:
tree data -L 2 -P '*.tsv' --filesfirst
# data
# ├── music_vids.tsv
# ├── pwrds.tsv
# ├── trees.tsv
# ├── vg_hof.tsv
# ├── who_tb_data.tsv
# ├── wu_tang.tsv
# ├── psv
# ├── raw
# └── tsv
#     ├── music_vids.tsv
#     ├── pwrds.tsv
#     ├── trees.tsv
#     ├── vg_hof.tsv
#     ├── who_tb_data.tsv
#     └── wu_tang.tsv
#
# 4 directories, 12 files
Note that the number of .tsv files doubled (from six files to twelve files). We'll do the same for the .psv files in data/:
cp data/*.psv data/psv/
Confirm the .psv files were copied:
tree data/psv/
# data/psv/
# └── wu_tang.psv
#
# 1 directory, 1 file
mv
The commands below create a .psv version of the who_tb_data.tsv we created above:
touch data/who_tb_data.psv
echo "| country | year | type | count |
|-------------|------|------------|------------|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |" > data/who_tb_data.psv
Oops, we created who_tb_data.psv in the data folder and not the data/psv folder:5
tree data -L 2 -P '*who_tb_data.psv'
# data
# ├── psv
# ├── raw
# ├── tsv
# └── who_tb_data.psv
#
# 4 directories, 1 file
We'll use mv to move data/who_tb_data.psv into data/psv/ and confirm with tree:6
mv data/who_tb_data.psv data/psv/who_tb_data.psv
tree data/psv
# data/psv
# ├── who_tb_data.psv
# └── wu_tang.psv
#
# 1 directory, 2 files
ln
ln is used to create links between files. It can create two types of links: hard links and symbolic (or soft) links.
A hard link is an additional name for an existing file on the same file system, and is effectively an additional directory entry for the file. In Linux file systems, all file names are technically hard links.
A symbolic link (often called a symlink) is a file that points to another file or directory, and it contains a path to another entry somewhere in the file system.
With the ln command, you need to specify the target file first (the original file) and then the name of the new link:
ln original_file.txt new_link.txt
We'll use ln to create data/who_tb_hardlink.psv, a hard link for the data file in data/psv/who_tb_data.psv:
ln data/psv/who_tb_data.psv data/who_tb_hardlink.psv
If we check the data folder with tree, we see the new who_tb_hardlink.psv file looks identical to the other files:
tree data -P '*.psv'
# data
# ├── csv
# ├── psv
# │   ├── who_tb_data.psv
# │   └── wu_tang.psv
# ├── tsv
# ├── who_tb_hardlink.psv
# └── wu_tang.psv
#
# 4 directories, 4 files
Hard links look like copies, but they are two names for the same data, so changes made through one name are reflected in the other.
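To see that both names really share one set of data, here is a small sketch in a temporary directory. The original.txt and hardlink.txt names are placeholders, not files from the book's data folder.

```shell
# Hypothetical files in a throwaway directory.
tmp_dir=$(mktemp -d)
echo "first line" > "$tmp_dir/original.txt"

# Create a second name (hard link) for the same data.
ln "$tmp_dir/original.txt" "$tmp_dir/hardlink.txt"

# Append a line through the hard link name...
echo "second line" >> "$tmp_dir/hardlink.txt"

# ...and read the file back through the original name.
line_count=$(wc -l < "$tmp_dir/original.txt" | tr -d ' ')
last_line=$(tail -n1 "$tmp_dir/original.txt")
echo "lines seen via original.txt: $line_count"
echo "last line seen via original.txt: $last_line"
```

The appended line shows up under the original name because no second copy of the data was ever made.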
Now we'll use ln -s to create data/who_tb_symlink.csv, a symlink for the data file in data/raw/who_tb_data.csv.
ln -s raw/who_tb_data.csv data/who_tb_symlink.csv
When we look at the folder with tree now, we see the symlink is listed with a special pointer (->) to the original file:
tree data -P '*.csv'
# data
# ├── psv
# ├── raw
# │   ├── music_vids.csv
# │   ├── pwrds.csv
# │   ├── trees.csv
# │   ├── vg_hof.csv
# │   ├── who_tb_data.csv
# │   └── wu_tang.csv
# ├── tsv
# └── who_tb_symlink.csv -> raw/who_tb_data.csv
#
# 4 directories, 7 files
The symbolic link only references the actual file, but doesn't store the data itself.
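One detail worth noticing in the ln -s command above is that the target (raw/who_tb_data.csv) is written relative to the folder the link lives in, not relative to where the command is run. The sketch below reproduces that layout in a temporary directory; original.csv and link.csv are hypothetical names.

```shell
# Recreate a data/raw layout in a throwaway directory.
tmp_dir=$(mktemp -d)
mkdir -p "$tmp_dir/data/raw"
echo "country,year" > "$tmp_dir/data/raw/original.csv"

# The target path is stored as-is and resolved relative to data/,
# the folder that contains the link.
ln -s raw/original.csv "$tmp_dir/data/link.csv"

# readlink prints the stored path; cat follows the link to the data.
link_target=$(readlink "$tmp_dir/data/link.csv")
link_contents=$(cat "$tmp_dir/data/link.csv")
echo "stored target: $link_target"
echo "contents via link: $link_contents"
```

Because only the path is stored, a relative symlink keeps working if the whole data folder is moved, as long as the link and its target move together.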
Info
The commands below return different types of information from a file (or files).
ls
The ls command lists the contents of a directory.
ls data/who_tb_hardlink.psv
# data/who_tb_hardlink.psv
Adding -l returns the contents in a detailed long format. The information below is for the original .psv file.
ls -l data/psv/who_tb_data.psv
# -rw-r--r--@ 2 username staff 686 May 16 07:47 data/psv/who_tb_data.psv
-l adds information like file permissions, number of links, owner, group, file size, and last modification date for each item.
If we check the hard link for the .psv file:
ls -l data/who_tb_hardlink.psv
# -rw-r--r--@ 2 username staff 686 May 16 07:47 data/who_tb_hardlink.psv
Both ls -l commands return the same information for the .psv files, indicating they're not treated as separate files, but rather two paths to the same file.
Compare this to the symlink we created and its original file:
ls -l data/who_tb_symlink.csv
# lrwxr-xr-x@ 1 username staff 19 May 16 07:50 data/who_tb_symlink.csv -> raw/who_tb_data.csv
ls -l data/raw/who_tb_data.csv
# -rw-r--r--@ 1 username staff 381 May 15 21:55 data/raw/who_tb_data.csv
The ls -l output for data/who_tb_symlink.csv (the symlink) starts with lrwxr-xr-x (the leading l marks a symbolic link), while the who_tb_data.csv file in data/raw returns -rw-r--r-- (standard file permissions).
File link info (@)
The 2 that follows the permission string in the ls -l data/who_tb_hardlink.psv output is the link count: the file has two hard links, meaning the data is stored in one place but can be accessed from two locations, data/who_tb_hardlink.psv and data/psv/who_tb_data.psv. The 1 in the ls -l data/raw/who_tb_data.csv output indicates a single hard link, because a symbolic link does not add to its target's link count. The @ itself is how ls on macOS flags a file that has extended attributes.
We'll cover this output more in the Permissions section below.
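The link count can be watched changing as links are added. The sketch below is one way to do this, assuming the count is the second whitespace-separated field of ls -l (as in the output above); the file names are placeholders in a temporary directory.

```shell
# Hypothetical files in a throwaway directory.
tmp_dir=$(mktemp -d)
echo "data" > "$tmp_dir/original.txt"

# Field 2 of ls -l is the number of hard links.
links_before=$(ls -l "$tmp_dir/original.txt" | awk '{print $2}')

# A hard link adds a second name, so the count goes up.
ln "$tmp_dir/original.txt" "$tmp_dir/hardlink.txt"
links_after_hardlink=$(ls -l "$tmp_dir/original.txt" | awk '{print $2}')

# A symbolic link is a separate file, so the count is unchanged.
ln -s original.txt "$tmp_dir/symlink.txt"
links_after_symlink=$(ls -l "$tmp_dir/original.txt" | awk '{print $2}')

echo "$links_before $links_after_hardlink $links_after_symlink"
```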
diff
diff compares the contents of two files line-by-line. We'll use diff to compare the pipe-separated values file (data/wu_tang.psv) to the comma-separated values file (data/wu_tang.csv):
diff data/wu_tang.psv data/wu_tang.csv
# 1,11c1,11
The first line of the output indicates that lines 1 through 11 in the first file (data/wu_tang.psv) have been changed compared to lines 1 through 11 in the second file (data/wu_tang.csv).
Lines starting with < show the content from the first file (data/wu_tang.psv). These entries are separated by pipes and padded with spaces (as commonly used in PSV files).
# < |Member |Name |
# < |RZA |Robert Diggs |
# < |GZA |Gary Grice |
# < |Method Man |Clifford Smith |
# < |Raekwon the Chef |Corey Woods |
# < |Ghostface Killah |Dennis Coles |
# < |Inspectah Deck |Jason Hunter |
# < |U-God |Lamont Hawkins |
# < |Masta Killa |Jamel Irief |
# < |Cappadonna |Darryl Hill |
# < |Ol Dirty Bastard |Russell Tyrone Jones |
# > Member,Name
# > RZA,Robert Diggs
# > GZA,Gary Grice
# > Method Man,Clifford Smith
# > Raekwon the Chef,Corey Woods
# > Ghostface Killah,Dennis Coles
# > Inspectah Deck,Jason Hunter
# > U-God,Lamont Hawkins
# > Masta Killa,Jamel Irief
# > Cappadonna,Darryl Hill
# > Ol Dirty Bastard,Russell Tyrone Jones
Lines starting with > show the content from the second file (data/wu_tang.csv). These entries are separated by commas, as is typical for CSV files.
There is no difference in the actual data (Member or Name). Both files contain the same information, so the difference is purely in the formatting: the PSV (pipe-separated values) file uses vertical bars (|) and spaces to separate data fields, whereas the CSV (comma-separated values) file uses commas (,).7
What happens when we compare the symlink (data/who_tb_symlink.csv) and its original file (data/raw/who_tb_data.csv) with diff?
diff data/who_tb_symlink.csv data/raw/who_tb_data.csv
diff produces no output when there are no differences between the two files.
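Because silence can be hard to spot in a script, it can help to check diff's exit status instead: 0 means the files are identical and 1 means they differ. The sketch below uses three small placeholder files in a temporary directory.

```shell
# Hypothetical files: two identical, one different.
tmp_dir=$(mktemp -d)
printf 'a\nb\n' > "$tmp_dir/one.txt"
printf 'a\nb\n' > "$tmp_dir/same.txt"
printf 'a\nc\n' > "$tmp_dir/other.txt"

# Identical files: no output, exit status 0.
status_same=0
diff "$tmp_dir/one.txt" "$tmp_dir/same.txt" > /dev/null || status_same=$?

# Different files: diff prints the changes (discarded here), exit status 1.
status_diff=0
diff "$tmp_dir/one.txt" "$tmp_dir/other.txt" > /dev/null || status_diff=$?

echo "identical: $status_same, different: $status_diff"
```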
Deleting, moving, or renaming the original file does not affect the integrity of a hard link:
# remove original file
rm data/psv/who_tb_data.psv
# check hard link
cat data/who_tb_hardlink.psv
# | country | year | type | count |
# |-------------|------|------------|------------|
# | Afghanistan | 1999 | cases | 745 |
# | Afghanistan | 1999 | population | 19987071 |
# | Afghanistan | 2000 | cases | 2666 |
# | Afghanistan | 2000 | population | 20595360 |
# | Brazil | 1999 | cases | 37737 |
# | Brazil | 1999 | population | 172006362 |
# | Brazil | 2000 | cases | 80488 |
# | Brazil | 2000 | population | 174504898 |
# | China | 1999 | cases | 212258 |
# | China | 1999 | population | 1272915272 |
# | China | 2000 | cases | 213766 |
# | China | 2000 | population | 1280428583 |
However, if the original file for a symlink is deleted, moved, or renamed, the symbolic link breaks and typically becomes a "dangling" link that points to a non-existent path.
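The contrast between the two link types after the original is removed can be sketched as follows. The file names are hypothetical, and the example relies on the standard test operators -L (the path is a symbolic link) and -e (the path exists, after following links).

```shell
# Hypothetical files in a throwaway directory.
tmp_dir=$(mktemp -d)
echo "data" > "$tmp_dir/original.txt"
ln "$tmp_dir/original.txt" "$tmp_dir/hardlink.txt"
ln -s original.txt "$tmp_dir/symlink.txt"

# Remove the original name.
rm "$tmp_dir/original.txt"

# The hard link still reaches the data.
hardlink_contents=$(cat "$tmp_dir/hardlink.txt")

# The symlink still exists as a link (-L), but its target does not (-e fails).
if [ -L "$tmp_dir/symlink.txt" ] && [ ! -e "$tmp_dir/symlink.txt" ]; then
  symlink_state="dangling"
else
  symlink_state="ok"
fi

echo "hard link: $hardlink_contents, symlink: $symlink_state"
```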
file
file gives us a summary of what a file is or what it contains, like telling us what's in data/who_tb_symlink.csv.
file data/who_tb_symlink.csv
# data/who_tb_symlink.csv: CSV text
On macOS, the -i option will tell us if this is a regular file (with GNU file on Linux, -i prints the MIME type instead):
file -i data/who_tb_symlink.csv
# data/who_tb_symlink.csv: regular file
readlink
readlink displays the target of a symbolic link.
readlink data/who_tb_symlink.csv
# raw/who_tb_data.csv
The -f option provides the target's absolute path.
readlink -f data/who_tb_symlink.csv
# path/to/data/raw/who_tb_data.csv
wc
wc (word count) counts the number of lines, words, and characters in the given input. If a file name is provided, it performs the count on the file; otherwise, it reads from the standard input.
wc data/who_tb_symlink.csv
# 13 13 381 data/who_tb_symlink.csv
stat
stat displays detailed information about files.
stat data/who_tb_symlink.csv
# 16777221 317774438 lrwxr-xr-x 1 username staff 0 19 "May 13 13:10:34 2024" \
# "May 13 13:10:34 2024" "May 13 13:10:34 2024" "May 13 13:10:34 2024" \
# 4096 0 0 data/who_tb_symlink.csv
Adding the -l option includes the path the symbolic link points to.
stat -l data/who_tb_symlink.csv
# lrwxr-xr-x 1 username staff 19 May 13 13:10:34 2024 \
# data/who_tb_symlink.csv -> csv/who_tb_data.csv
du
du estimates file space usage.
du data/raw/pwrds.csv
# 24 data/raw/pwrds.csv
The -h option makes the output human readable (the 24 above is a count of 512-byte blocks, the default unit on macOS).
du -h data/raw/pwrds.csv
# 12K data/raw/pwrds.csv
If we pass the original file and the symlink for who_tb_data.csv to du, we see the symlink doesn't contain any actual data:
du -h data/raw/who_tb_data.csv
# 4.0K data/raw/who_tb_data.csv
du -h data/who_tb_symlink.csv
# 0B data/who_tb_symlink.csv
Permissions
We'll go over file permissions in-depth in the Permissions chapter, but I'll quickly summarize a common use of chmod below. The file permissions are printed with the ls -l output:
ls -l data/README.md
# -rw-r--r--@ 1 username staff 8834 May 14 09:33 data/README.md
chmod
chmod changes file permissions using either symbolic or octal (numeric) notation. Below is a simple example using symbolic notation.8
To grant the permissions above (i.e., -rw-r--r--) using symbolic notation with chmod, we could use:
chmod u=rw,g=r,o=r data/README.md
u=rw: sets the user (owner) permissions to read and write.
g=r: sets the group permissions to read.
o=r: sets the permissions for others to read.
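To check the effect of symbolic notation without touching data/README.md, the sketch below applies the same command to a placeholder file in a temporary directory, then removes the owner's write permission with the - operator. Reading the permission string from the first ten characters of ls -l is an assumption about the output layout shown above.

```shell
# Hypothetical file in a throwaway directory.
tmp_dir=$(mktemp -d)
demo_file="$tmp_dir/README.md"
touch "$demo_file"

# = sets each class to exactly the listed permissions.
chmod u=rw,g=r,o=r "$demo_file"
perms_set=$(ls -l "$demo_file" | cut -c1-10)
echo "after u=rw,g=r,o=r: $perms_set"

# - removes a permission and leaves the others unchanged.
chmod u-w "$demo_file"
perms_readonly=$(ls -l "$demo_file" | cut -c1-10)
echo "after u-w: $perms_readonly"
```

The = form is convenient when the final state should be the same regardless of what the permissions were before, while + and - only adjust the named bits.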
Recap
1. data/who_tb_data.tsv comes from the WHO global tuberculosis programme.
2. data/vg_hof.tsv is the Video Game Hall of Fame data.
3. locate sometimes requires that its search database be generated or updated first.
4. The -L 2 option limits tree to two levels (the data folder and the contents of its immediate subfolders), -P '*.tsv' matches the .tsv files, and --filesfirst lists the files first, then the directories.
5. The -L 2 option limits tree to two levels, and -P '*who_tb_data.psv' matches the who_tb_data.psv file.
6. cp and mv also work with directories.
7. This type of difference is significant if the format impacts how data is parsed or used. For example, a software program expecting data in CSV format might not correctly parse a PSV file, and vice versa.
8. We'll cover octal (numeric) notation in the Permissions chapter.
