8. Finding Things

In the same way that many of us now use “Google” as a verb meaning “to find”, Unix programmers often use the word “grep”. “grep” is a contraction of “global/regular expression/print”, a common sequence of operations in early Unix text editors. It is also the name of a very useful command-line program.

grep finds and prints lines in files that match a pattern. For our examples, we will use a file that contains three haikus taken from a 1998 competition in Salon magazine. For this set of examples, we’re going to be working in the writing subdirectory:

Forever, or Five Years

We haven’t linked to the original haikus because they don’t appear to be on Salon’s site any longer. As Jeff Rothenberg said, “Digital information lasts forever — or five years, whichever comes first.” Luckily, popular content often has backups.

Let’s find lines that contain the word “not”:

Here, not is the pattern we are searching for. The grep command searches through the file, looking for matches to the pattern specified. To use it type grep, then the pattern we’re searching for and finally the name of the file (or files) we’re searching in.

The output is the three lines in the file that contain the letters “not”.

By default, grep searches for a pattern in a case-sensitive way. In addition, the search pattern we have selected does not have to form a complete word, as we will see in the next example.

Let us search for the pattern: “The”.

This time, two lines that include the letters “The” are outputted, one of which contained our search pattern within a larger word, “Thesis”.

To restrict matches to lines containing the word “The” on its own, we can give grep with the -w option. This will limit matches to word boundaries.

Later in this lesson, we will also see how we can change the search behavior of grep with respect to its case sensitivity.

Note that a “word boundary” includes the start and end of a line, so not just letters surrounded by spaces. Sometimes we do not want to search for a single word, but a phrase. This is also easy to do with grep by putting the phrase in quotes.

We have now seen that you do not have to have quotes around single words, but it is useful to use quotes when searching for multiple words. It also helps to make it easier to distinguish between the search term or phrase and the file being searched. We will use quotes in the remaining examples.

Another useful option is -n, which numbers the lines that match:

Here, we can see that lines 5, 9, and 10 contain the letters “it”.

We can combine options (i.e. flags) as we do with other Unix commands. For example, let us find the lines that contain the word “the”. We can combine the option -w to find the lines that contain the word “the” and -n to number the lines that match:

Now we want to use the option -i to make our search case-insensitive:

Now, we want to use the option -v to invert our search, i.e., we want to output the lines that do not contain the word “the”.

grep has lots of other options. To find out what they are, we can type:

Wildcards

grep’s real power doesn’t come from its options, though; it comes from the fact that patterns can include wildcards. (The technical name for these is regular expressions, which is what the “re” in “grep” stands for.) Regular expressions are both complex and powerful; if you want to do complex searches, please look at the lesson on our website. As a taster, we can find lines that have an ‘o’ in the second position like this:

We use the -E option and put the pattern in quotes to prevent the shell from trying to interpret it. (If the pattern contained a *, for example, the shell would try to expand it before running grep.) The ^ in the pattern anchors the match to the start of the line. The . matches a single character (just like ? in the shell), while the o matches an actual ‘o’.

Tracking a Species

Leah has several hundred data files saved in one directory, each of which is formatted like this:

She wants to write a shell script that takes a species as the first command-line argument and a directory as the second argument. The script should return one file called species.txt containing a list of dates and the number of that species seen on each date. For example using the data shown above, rabbit.txt would contain:

The find command

While grep finds lines in files, the find command finds files themselves. Again, it has a lot of options; to show how the simplest ones work, we’ll use the directory tree shown below.

/Node%20anatomy

Nelle’s writing directory contains one file called haiku.txt and three subdirectories: thesis (which contains a sadly empty file, empty-draft.md); data (which contains three files LittleWomen.txt, one.txt and two.txt); and a tools directory that contains the programs format and stats, and a subdirectory called old, with a file oldtool.

For our first command, let’s run find . (remember to run this command from the data-shell/writing folder).

As always, the . on its own means the current working directory, which is where we want our search to start. find’s output is the names of every file and directory under the current working directory. This can seem useless at first but find has many options to filter the output and in this lesson we will discover some of them.

The first option in our list is -type d that means “things that are directories”. Sure enough, find’s output is the names of the five directories in our little tree (including .):

Notice that the objects find finds are not listed in any particular order. If we change -type d to -type f, we get a listing of all the files instead:

Now let’s try matching by name:

We expected it to find all the text files, but it only prints out ./haiku.txt. The problem is that the shell expands wildcard characters like * before commands run. Since *.txt in the current directory expands to haiku.txt, the command we actually ran was:

find did what we asked; we just asked for the wrong thing.

To get what we want, let’s do what we did with grep: put *.txt in single quotes to prevent the shell from expanding the * wildcard. This way, find actually gets the pattern *.txt, not the expanded filename haiku.txt:

Listing vs. Finding

ls and find can be made to do similar things given the right options, but under normal circumstances,

ls lists everything it can, while find searches for things with certain properties and shows them.

As we said earlier, the command line’s power lies in combining tools. We’ve seen how to do that with pipes; let’s look at another technique. As we just saw, find . -name '*.txt' gives us a list of all text files in or below the current directory. How can we combine that with wc -l to count the lines in all those files?

The simplest way is to put the find command inside $():

When the shell executes this command, the first thing it does is run whatever is inside the $(). It then replaces the $() expression with that command’s output. Since the output of find is the four filenames ./data/one.txt, ./data/LittleWomen.txt, ./data/two.txt, and ./haiku.txt, the shell constructs the command:

which is what we wanted. This expansion is exactly what the shell does when it expands wildcards like * and ?, but lets us use any command we want as our own “wildcard”.

It’s very common to use find and grep together. The first finds files that match a pattern; the second looks for lines inside those files that match another pattern. Here, for example, we can find PDB files that contain iron atoms by looking for the string “FE” in all the .pdb files above the current directory:

Matching and Subtracting

find Pipeline Reading Comprehension

Finding Files With Different Properties