This article reviews common text manipulation utility programs available within the Linux operating system. The article is chock full of discussions paired with simple examples to demonstrate the usefulness of several text utility programs available to Linux users.
To aid in the demonstrations, I retrieved a plain-text copy of the famous book A Tale of Two Cities, by Charles Dickens, from Project Gutenberg. I also copied the first two paragraphs into separate files, as shown in the directory listing below.
$ ls -l
total 796
-rw-rw-r-- 1 tci tci 615 Apr 22 19:04 first-paragraph.txt
-rw-rw-r-- 1 tci tci 333 Apr 22 19:05 second-paragraph.txt
-rw-rw-r-- 1 tci tci 804335 Mar 19 2018 tail-of-two-cities.txt
I will begin by introducing the cat command, a simple utility that concatenates the files passed to it and sends their contents to standard output.
$ cat first-paragraph.txt
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
A useful enhancement to the cat command is the -n option, which numbers the lines of the output.
$ cat -n first-paragraph.txt
1 It was the best of times,
2 it was the worst of times,
3 it was the age of wisdom,
4 it was the age of foolishness,
5 it was the epoch of belief,
6 it was the epoch of incredulity,
7 it was the season of Light,
8 it was the season of Darkness,
9 it was the spring of hope,
10 it was the winter of despair,
11 we had everything before us,
12 we had nothing before us,
13 we were all going direct to Heaven,
14 we were all going direct the other way--
15 in short, the period was so far like the present period, that some of
16 its noisiest authorities insisted on its being received, for good or for
17 evil, in the superlative degree of comparison only.
To combine multiple files and output them together, just specify a list of space-separated file names. For example, combining the first and second paragraph files is as simple as the following.
$ cat first-paragraph.txt second-paragraph.txt
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
There were a king with a large jaw and a queen with a plain face, on the
throne of England; there were a king with a large jaw and a queen with
a fair face, on the throne of France. In both countries it was clearer
than crystal to the lords of the State preserves of loaves and fishes,
that things in general were settled for ever.
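As a small sketch of how the two features above interact, the -n option numbers the combined output continuously rather than restarting for each file. The temporary directory and the file names a.txt and b.txt are made up for illustration and are not part of the demo files.

```shell
# Hypothetical scratch files, created in a throwaway directory
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'alpha\nbeta\n' > "$workdir/a.txt"
printf 'gamma\n' > "$workdir/b.txt"

# Concatenate both files and number the combined lines 1 through 3
numbered=$(cat -n "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$numbered"
```

The line from b.txt is numbered 3, which shows that cat treats its arguments as one continuous stream.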
The cat command works well for viewing the contents of relatively small files, but when inspecting larger files that span several screen-sized pages of content you are better off with either the more or the less command.
The more command is followed by the name of a file. It fills your screen with the first page of the file, then lets you move down the document a page at a time using the space bar or a line at a time using the enter key.
$ more tail-of-two-cities.txt
This results in the output shown below.
The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: A Tale of Two Cities
A Story of the French Revolution
Author: Charles Dickens
Release Date: January, 1994 [EBook #98]
Posting Date: November 28, 2009
Last Updated: March 4, 2018
Language: English
Character set encoding: UTF-8
*** START OF THIS PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***
Produced by Judith Boss
A TALE OF TWO CITIES
A STORY OF THE FRENCH REVOLUTION
By Charles Dickens
CONTENTS
Book the First--Recalled to Life
--More--(0%)
The less command is used the same way: follow it with the name of a file to inspect. It is actually more feature-rich than the more command because it allows scrolling both up and down a file using the arrow keys. Moving down a page with the space bar or down a line with the enter key still works. Both the more and less programs also offer the ability to search for text within a document. To do this, type / followed by the text you want to search for and press enter to jump to the next match; pressing n then repeats the search for subsequent matches. In less, replacing the / with a ? searches backwards.
The Linux shell has a wonderfully powerful construct known as a pipe, written with the | character, which chains utility programs together by sending one program's output to the next program as its input. For example, I can use a pipe to send the output of cat on the full tail-of-two-cities.txt document to the less command for better navigation like so.
$ cat tail-of-two-cities.txt | less
This is obviously a pretty strange thing to do, since I could have just passed tail-of-two-cities.txt directly to less to accomplish the same thing, but the power of pipes will become more evident in later examples.
Similar in concept to a pipe capturing the output of a prior command, the > and >> redirection operators, each followed by the name of a file, send a command's output to that file. The > operator creates the file if it does not exist and overwrites its contents if it does, whereas the >> operator appends to the end of the file (and also creates it if it is missing).
For example, I can create a file named first-two-paragraphs.txt by passing first-paragraph.txt and second-paragraph.txt to cat and redirecting the combined output to the new file.
$ cat first-paragraph.txt second-paragraph.txt > first-two-paragraphs.txt
$ cat first-two-paragraphs.txt
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
There were a king with a large jaw and a queen with a plain face, on the
throne of England; there were a king with a large jaw and a queen with
a fair face, on the throne of France. In both countries it was clearer
than crystal to the lords of the State preserves of loaves and fishes,
that things in general were settled for ever.
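The example above only exercises the > operator, so here is a minimal sketch contrasting it with >>. The file name notes.txt and the temporary directory are assumptions made for illustration, and the sketch assumes the shell's noclobber option is not set.

```shell
# Hypothetical scratch file in a throwaway directory
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
out="$workdir/notes.txt"

echo 'first line' > "$out"     # > creates the file
echo 'second line' >> "$out"   # >> appends to the existing file
appended=$(cat "$out")

echo 'replacement' > "$out"    # > on an existing file discards the old contents
overwritten=$(cat "$out")

printf '%s\n---\n%s\n' "$appended" "$overwritten"
```

After the final redirection only the line 'replacement' remains, which is why >> is the safer choice when adding to a file you want to keep.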
In this section I shift gears to focus on programs that manipulate text, as opposed to the simple viewing utilities discussed thus far. Again, to aid in the discussion I have created another text file to use in the examples, which contains fictitious test scores for a group of people over the years 2018 and 2019, as shown below.
$ cat test-scores.txt
Year Name Score
2018 Jim 90
2018 Sally 98
2018 Bob 76
2018 Tim 87
2019 Jim 95
2019 Sally 98
2019 Bob 88
2019 Tim 92
The first command I would like to introduce is the cut command, which is commonly used to cut specific fields (columns), bytes, or characters out of each line of text. It can extract the list of people in the second column of the test-scores.txt file using the -f2 option, which selects the second field, like so.
$ cat test-scores.txt | cut -f2
Name
Jim
Sally
Bob
Tim
Jim
Sally
Bob
Tim
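Beyond a single field, cut can also select several fields at once or work on character positions with the -c option. The following is a minimal sketch using a shortened, inline copy of the tab-delimited data rather than the real test-scores.txt file; the variable names are my own.

```shell
# Inline, tab-delimited sample mirroring the layout of test-scores.txt
scores=$(printf 'Year\tName\tScore\n2018\tJim\t90\n2018\tSally\t98\n')

# -c1-4 keeps character positions 1 through 4 of every line (the year column here)
years=$(printf '%s\n' "$scores" | cut -c1-4)

# -f2,3 keeps the second and third tab-delimited fields
name_score=$(printf '%s\n' "$scores" | cut -f2,3)

printf '%s\n' "$years"
printf '%s\n' "$name_score"
```

Selecting by character position only works here because every value in the first column happens to be four characters wide; field selection with -f is usually the more robust choice.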
By default, fields are assumed to be delimited by tabs, which is the case in my test-scores.txt example file. However, you can also pass cut the --output-delimiter option (provided by GNU cut, the version shipped with most Linux distributions) to write the selected fields with a different delimiter. Doing so, I could issue the following to transform my test-scores.txt file from a tab-delimited file to a comma-delimited file named test-scores.csv.
$ cat test-scores.txt | cut -f1-3 --output-delimiter=',' > test-scores.csv
$ cat test-scores.csv
Year,Name,Score
2018,Jim,90
2018,Sally,98
2018,Bob,76
2018,Tim,87
2019,Jim,95
2019,Sally,98
2019,Bob,88
2019,Tim,92
The above command reads the test-scores.txt file, selects all three fields, replaces the tab delimiter with a comma, and redirects the result to a new file. With a CSV version of the text data I can show how the -d option specifies a comma as the input delimiter and again select just the names.
$ cat test-scores.csv | cut -f2 -d ','
Name
Jim
Sally
Bob
Tim
Jim
Sally
Bob
Tim
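One behavior worth knowing when selecting several fields with -d and -f is that cut always writes fields in the order they appear in the input, regardless of the order given to -f. A short sketch with an inline CSV sample (not the real test-scores.csv file) illustrates this.

```shell
# Inline CSV sample mirroring the layout of test-scores.csv
csv=$(printf 'Year,Name,Score\n2018,Jim,90\n2019,Jim,95\n')

# Select the first and third comma-delimited fields
year_score=$(printf '%s\n' "$csv" | cut -d ',' -f1,3)

# Asking for 3,1 does not reorder the columns: the output is identical
same_order=$(printf '%s\n' "$csv" | cut -d ',' -f3,1)

printf '%s\n' "$year_score"
```

If the columns genuinely need to be reordered, a different tool is required, since cut only selects.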
Building on this example I can add in the sort command to sort the list of names like so.
$ cat test-scores.csv | cut -f2 -d ',' | sort
Bob
Bob
Jim
Jim
Name
Sally
Sally
Tim
Tim
The sort command has a number of useful options, but probably the most used is -r, which sorts in reverse order.
$ cat test-scores.csv | cut -f2 -d ',' | sort -r
Tim
Tim
Sally
Sally
Name
Jim
Jim
Bob
Bob
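Among the other sort options, -t, -k, and -n are handy for delimited data like the test scores. The sketch below sorts an inline sample by the numeric score in the third comma-separated field; the header row is left out of the sample on purpose so that it does not get sorted in with the data.

```shell
# Inline sample of the 2018 rows, without the header line
csv=$(printf '2018,Jim,90\n2018,Sally,98\n2018,Bob,76\n2018,Tim,87\n')

# -t sets the field separator, -k3,3 restricts the sort key to field 3,
# and the n modifier compares that key numerically
by_score=$(printf '%s\n' "$csv" | sort -t ',' -k3,3n)

# Adding the r modifier reverses the order, highest score first
by_score_desc=$(printf '%s\n' "$csv" | sort -t ',' -k3,3nr)

printf '%s\n' "$by_score"
printf '%s\n' "$by_score_desc"
```

Without the numeric modifier sort compares the key as text, which gives surprising results once the numbers have different lengths.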
A natural next manipulation would be to remove the duplicates from the list of names with the uniq command.
$ cat test-scores.csv | cut -f2 -d ',' | sort | uniq
Bob
Jim
Name
Sally
Tim
It's worth noting that the uniq command only removes duplicate lines that are adjacent to one another, which is why the list is passed through sort first.
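To make the point about sequential duplicates concrete, here is a minimal sketch using an inline list of names in which no two identical names are next to each other. It also shows the -c option, which prefixes each line with the number of times it occurred.

```shell
# Names in their original, unsorted order: no duplicates are adjacent
names=$(printf 'Jim\nSally\nBob\nJim\nSally\nBob\n')

# uniq alone removes nothing because identical lines never touch
unsorted=$(printf '%s\n' "$names" | uniq)

# Sorting first groups identical lines so uniq can collapse them
deduped=$(printf '%s\n' "$names" | sort | uniq)

# -c adds a count in front of each distinct line
counts=$(printf '%s\n' "$names" | sort | uniq -c)

printf '%s\n' "$deduped"
printf '%s\n' "$counts"
```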
In this final section I introduce the grep text utility program. The grep program is used for searching and parsing out lines containing patterns within text and is especially useful when paired with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression).
The basic syntax for the grep command is the grep keyword, followed by a pattern to match against, followed by one or more files to search. For example, if I search the first-paragraph.txt file from A Tale of Two Cities for the word 'epoch' I get the following lines returned in the output.
$ grep 'epoch' first-paragraph.txt
it was the epoch of belief,
it was the epoch of incredulity,
Similar to the other commands I've covered, grep can be used in conjunction with the pipe operator to feed the result of an earlier command into the grep program. In the previous example I could have used the cat command to read the first-paragraph.txt file and pipe its output into grep then search for the word epoch like so.
$ cat first-paragraph.txt | grep 'epoch'
it was the epoch of belief,
it was the epoch of incredulity,
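A few standard grep options build on this basic usage. The sketch below applies -n (prefix matches with their line number), -c (print only the number of matching lines), and -i (ignore case) to three lines borrowed from the first paragraph and held in a shell variable rather than read from the file.

```shell
# Three lines taken from the first paragraph, held inline
text=$(printf 'It was the best of times,\nit was the epoch of belief,\nit was the epoch of incredulity,\n')

# -n prefixes each matching line with its line number and a colon
with_numbers=$(printf '%s\n' "$text" | grep -n 'epoch')

# -c prints only the count of matching lines
match_count=$(printf '%s\n' "$text" | grep -c 'epoch')

# -i matches regardless of case, so lowercase 'it' finds the capitalized 'It'
ignore_case=$(printf '%s\n' "$text" | grep -i 'it was the best')

printf '%s\n' "$with_numbers"
printf '%s\n' "$match_count"
printf '%s\n' "$ignore_case"
```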
As mentioned previously, the grep program is compatible with regular expression pattern matching, which significantly enhances the power and flexibility of its searching capabilities. Regular expressions are an incredibly large and fairly involved topic, so my goal is only to scratch the surface enough for the reader to become comfortable exploring the topic further at a later date.
The first regular expression topics to cover are what are known as the anchor symbols ^ and $. The ^ symbol is used to indicate that a line should start with a particular pattern and the $ symbol means to search for lines ending with a particular pattern.
For example, to find all lines in the first-paragraph.txt file starting with 'we' I use the following.
$ grep '^we' first-paragraph.txt
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
Conversely, to find all lines ending with 'us,' I use the following.
$ grep 'us,$' first-paragraph.txt
we had everything before us,
we had nothing before us,
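Using both anchors together pins a pattern to the entire line. One common application, sketched below on made-up inline text, is the pattern '^$', which matches only empty lines; combined with the -c option it counts them, and combined with the -v option (print the lines that do not match) it strips them out.

```shell
# Made-up sample text containing three empty lines
text=$(printf 'first paragraph\n\nsecond paragraph\n\n\nthird paragraph\n')

# '^$' matches lines with nothing between the start and the end
blank_count=$(printf '%s\n' "$text" | grep -c '^$')

# -v inverts the match, keeping only the non-empty lines
non_blank=$(printf '%s\n' "$text" | grep -v '^$')

printf '%s\n' "$blank_count"
printf '%s\n' "$non_blank"
```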
Another symbol that is heavily used in the world of regular expressions is the wildcard symbol '.', which matches any single character. It is especially useful when combined with a quantifier such as *, which means match the preceding expression zero or more times. A demonstration will likely help here, so if I wanted to find all lines that start with 'we' and end with a comma ',' I could use the following.
$ grep '^we.*,$' first-paragraph.txt
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
The above syntax of '^we.*,$' means search for lines starting with 'we', followed by any character '.' occurring zero or more times '*', and ending with a comma ',$'.
The last thing I want to cover in the context of regular expressions is the bracket expression, also known as a character class. Brackets [...] match any single character from the set or range listed inside them, and they can be followed by a quantifier to match a run of such characters. If I wanted to search first-paragraph.txt for all lines that start with an uppercase letter I could use the pattern ^[A-Z] as shown below.
$ grep '^[A-Z]' first-paragraph.txt
It was the best of times,
Now let's say I wanted to find all lines that start with a word of exactly three lowercase letters. To accomplish this I would start with the [a-z] range specifier, append a quantifier of \{3\}, then end with \b, the word-boundary symbol, which requires the three letters to make up a complete word.
$ grep '^[a-z]\{3\}\b' first-paragraph.txt
its noisiest authorities insisted on its being received, for good or for
Note the use of lowercase a-z as compared to the previous uppercase range specifier A-Z.
Similarly, if I wanted to find all lines in first-paragraph.txt that start with three or more lowercase letters I could use '^[a-z]\{3,\}\b' like so. Leaving the slot after the comma empty means there is no upper limit on the number of occurrences.
$ grep '^[a-z]\{3,\}\b' first-paragraph.txt
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
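The backslashes in front of the braces are required by grep's default basic regular expression syntax. As a sketch of an alternative, the -E option switches grep to extended regular expressions, where the braces are written without backslashes. The sample lines are held inline, and note that \b is, as far as I am aware, an extension provided by GNU grep rather than part of the POSIX syntax, so this assumes the GNU version found on most Linux systems.

```shell
# Three lines adapted from the first paragraph, held inline
lines=$(printf 'it was the age of wisdom,\nits noisiest authorities\nevil, in the superlative degree\n')

# Basic syntax: the interval braces must be escaped
basic=$(printf '%s\n' "$lines" | grep '^[a-z]\{3,\}\b')

# Extended syntax (-E): the same interval without backslashes
extended=$(printf '%s\n' "$lines" | grep -E '^[a-z]{3,}\b')

printf '%s\n' "$extended"
```

Both commands select the same two lines, so the choice between them is mostly a matter of which syntax you find easier to read.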
In this tutorial I have covered around a half dozen common text viewing and manipulation utility programs used by Linux users. The use cases for these programs are vast, and once the pipe operator is used to string multiple operations together the possibilities are enormous. Through the examples and explanations above, I hope readers feel comfortable performing similar tasks and investigating further on their own.
As always, thanks for reading and don't be shy about commenting or critiquing below.