When working with big data, there are thousands of housekeeping tasks to do. As you will soon discover, moving data around, munging it, transforming it, and other mundane tasks will take up a depressingly inordinate amount of your time. A proper knowledge of some useful tools and tricks will make these tasks doable, and in many cases, easy.
In this chapter we discuss a variety of Unix command-line tools that are essential for shaping, transforming, and munging text. All of these commands are covered elsewhere, and covered more completely, but we’ve focused on their applications for big data, specifically their use within Hadoop.
If you’re already familiar with Unix pipes and chaining commands together, feel free to skip the first few sections and jump straight into the tour of useful commands.
One of the key aspects of the Unix philosophy is that it is built on simple tools that can be combined in nearly infinite ways to create complex behavior. Command-line tools are linked together through 'pipes', which take output from one tool and feed it into the input of another. For example, the cat command reads a file and outputs its contents. You can pipe this output into another command to perform useful transformations. For example, to select only the first field (delimited by tabs) of each line in a file you could:
cat somefile.txt | cut -f1
The vertical pipe character, '|', represents the pipe between the cat and cut commands. You can chain as many commands as you like, and it’s common to construct chains of 5 commands or more.
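For example, extending the cut example above into a five-command chain that counts how often each value of the first field appears and prints the ten most common (sort, uniq, and head are all covered later in this chapter):
cat somefile.txt | cut -f1 | sort | uniq -c | sort -rn | head -10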
In addition to redirecting a command’s output into another command, you can also redirect output into a file with the > operator. For example:
echo 'foobar' > stuff.txt
writes the text 'foobar' to stuff.txt. stuff.txt is created if it doesn’t exist, and is overwritten if it does.
If you’d like to append to a file instead of overwriting it, the '>>' operator will come in handy. Instead of creating or overwriting the specified file, it creates the file if it doesn’t exist, and appends to the file if it does.
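For example, to tack another line onto the file from the previous example:
echo 'more stuff' >> stuff.txt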
As a side note, the Hadoop streaming API is built using pipes. Hadoop sends each record in the map and reduce phases to your map and reduce scripts' stdin. Your scripts print the results of their processing to stdout, which Hadoop collects.
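To make this concrete, here is roughly what a raw Hadoop streaming invocation looks like; the jar location and HDFS paths below are placeholders that vary from cluster to cluster:
hadoop jar /path/to/hadoop-streaming.jar \
  -input /data/input \
  -output /data/output \
  -mapper 'cut -f1' \
  -reducer 'uniq -c'
Both the mapper and the reducer here are ordinary Unix commands that read records from stdin and write results to stdout.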
Each Unix command has 3 input/output streams: standard input, standard output, and standard error, commonly referred to by the more concise stdin, stdout, and stderr. Commands accept input via stdin and generate output through stdout. When we said that pipes take output from one tool and feed it into the input of another, what we really meant was that pipes feed one command’s stdout stream into another command’s stdin stream.
The third stream, stderr, is generally used for error messages, progress information, or other text that is not strictly 'output'. Stderr is especially useful because it allows you to see messages from a command even if you have redirected the command’s stdout. For example, if you wanted to run a command and redirect its output into a file, you could still see any errors generated via stderr. curl, a command used to make network requests, behaves this way: it writes the content it fetches to stdout while printing its progress meter and any error messages to stderr.
curl http://example.com/data.txt > data.txt
Even though the downloaded content is redirected into data.txt, curl's progress meter and any errors still show up in your terminal, because they are written to stderr.
It’s occasionally useful to be able to redirect these streams independently or into each other. For example, if you’re running a command and want to log its output as well as any errors generated, you can send stdout to a file and then redirect stderr into stdout:
yourcommand > output.log 2>&1
Note that the 2>&1 comes after the file redirection; if you put it first, stderr will still end up on your terminal.
Alternatively, you could redirect stderr and stdout into separate files:
yourcommand > output.log 2> error.log
You might also want to suppress stderr if the command you’re using gets too chatty. You can do that by redirecting stderr to /dev/null, which is a special file that discards everything you hand it.
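For example, to throw away everything a chatty command writes to stderr:
yourcommand 2> /dev/null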
Now that you understand the basics of pipes and output redirection, let’s get on to the fun part: munging data!
cat reads the content of a file and prints it to stdout. It can accept multiple files, like so, and will print them in order:
cat foo.txt bar.txt bat.txt
cat is generally used to examine the contents of a file or as the start of a chain of commands:
cat foo.txt | sort | uniq > bar.txt
In addition to examining and piping around files, cat is also useful as an 'identity mapper', a mapper which does nothing. If your data already has a key that you would like to group on, you can specify cat as your mapper and each record will pass untouched through the map phase to the sort phase. Then, the sort and shuffle will group all records with the same key at the proper reducer, where you can perform further manipulations.
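Using the wu-mapred wrapper that appears in the examples later in this chapter, an identity mapper might look like:
wu-mapred --mapper='cat'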
echo is very similar to cat, except it prints the supplied text to stdout. For example:
echo foo bar baz bat > foo.txt
will result in foo.txt holding 'foo bar baz bat', followed by a newline. If you don’t want the newline you can give echo the -n option.
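For example, to write the same text without the trailing newline:
echo -n 'foo bar baz bat' > foo.txt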
The cut command allows you to cut certain pieces from each line, leaving only the interesting bits. The -f option means 'keep only these fields', and takes a comma-delimited list of numbers and ranges. So, to select the first three fields and the fifth field of a tsv file you could use:
cat somefile.txt | cut -f 1-3,5
Watch out: the field numbering is one-indexed. By default cut assumes that fields are tab-delimited, but delimiters are configurable with the -d option.
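For example, to pull the same fields out of a comma-delimited file instead:
cat somefile.csv | cut -d',' -f 1-3,5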
cut is especially useful if you have tsv output on the HDFS and want to filter it down to only a handful of fields. You can create a Hadoop streaming job to do this like so:
wu-mapred --mapper='cut -f 1-3,5'
cut is great if you know the indices of the columns you want to keep, but if your data is schema-less or nearly so (like unstructured text), things get slightly more complicated. For example, if you want to select the last field from each of your records, but the number of fields varies from record to record, you can combine cut with the rev command, which reverses text:
cat foobar.txt | rev | cut -f1 | rev
This reverses each line, selects the first field in the reversed line (which is really the last field), and then reverses the text again before outputting it.
cut also has a -c (for 'character') option that allows you to select ranges of characters. This is useful for quickly verifying the output of a job with long lines. For example, in the Regional Flavor exploration, many of the jobs output wordbags, which are just giant JSON blobs, one line of which would overflow your entire terminal. If you want to quickly verify that the output looks sane, you could use:
wu-cat /data/results/wikipedia/wordbags.tsv | cut -c 1-100
Cut’s -c option, like many other Unix text-manipulation tools, requires a little forethought when working with different character encodings, because each encoding can use a different number of bits per character. If cut thinks that it is reading ASCII (7 bits per character) when it is really reading UTF-8 (a variable number of bytes per character), it will split characters apart and produce meaningless gibberish. Our recommendation is to get your data into UTF-8 as soon as possible and keep it that way, but the fact of the matter is that sometimes you have to deal with other encodings.
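When you do have to convert, the iconv command will transcode text from one encoding to another. For example, to turn a Latin-1 file into UTF-8 (the file names here are just for illustration):
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt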
Unix’s solution to this problem is the LC_* environment variables. LC stands for 'locale', and lets you specify your preferred language and character encoding for various types of data.
LC_CTYPE (locale character type) sets the default character encoding used system-wide. In the absence of LC_CTYPE, LANG is used as the default, and LC_ALL can be used to override all other locale settings. If you’re not sure whether your locale settings are having their intended effect, check the man page of the tool you are using and make sure that it obeys the LC_* variables.
You can view your current locale settings with the locale command. Operating systems differ in how they represent languages and character encodings, but on my machine en_US.UTF-8 represents English, encoded in UTF-8.
Remember that if you’re using these commands as Hadoop mappers or reducers, you must set these environment variables across your entire cluster, or set them at the top of your script.
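For example, a streaming mapper script might pin the locale down before doing any text handling (the locale string is whatever your systems use for UTF-8, as discussed above):
#!/bin/bash
# force a UTF-8 locale before any text processing
export LC_ALL=en_US.UTF-8
cut -f 1-3,5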
While cut is used to select columns of output, head and tail are used to select lines of output. head selects lines at the beginning of its input, while tail selects lines at the end. For example, to view only the first 10 lines of a file, you could use head like so:
head -10 foobar.txt
head is especially useful for sanity-checking the output of a Hadoop job without overflowing your terminal. head and cut make a killer combination:
wu-cat /data/results/foobar | head -10 | cut -c 1-100
tail works almost identically to head. Viewing the last ten lines of a file is easy:
tail -10 foobar.txt
tail also lets you specify the selection in relation to the beginning of the file by passing -n a line number prefixed with '+'. So, to select every line from the 10th line on:
tail -n +10 foobar.txt
What if you just finished uploading 5,000 small files to the HDFS and realized that you left a header on every one of them? No worries, just use tail as a mapper to remove the header:
wu-mapred --mapper='tail -n +2'
This outputs every line but the first one.
tail is also useful for watching files as they are written to. For example, if you have a log file that you want to watch for errors or information, you can 'tail' it with the -f option:
tail -f yourlogs.log
This outputs the end of the log to your terminal and waits for new content, updating the output as more is written to yourlogs.log.
grep is a tool for finding patterns in text. You can give it a word, and it will diligently search its input, printing only the lines that contain that word:
grep foobar somefile.txt
grep has many options, and accepts regular expressions as well as words and word sequences:
grep -E 'foo(bar|baz)' somefile.txt
This prints every line that contains either 'foobar' or 'foobaz'.
The -i option is very useful to make grep ignore case:
grep -i foobar somefile.txt
So is zgrep, a variant of grep that decompresses gzipped text before grepping through it (some versions of grep offer the same behavior through a -Z or --decompress option). This can be tremendously useful if you keep files on your HDFS in a compressed form to save space.
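For example, to grep a gzipped file without decompressing it to disk first:
zgrep foobar somefile.txt.gz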
When using grep in Hadoop jobs, beware its non-standard exit statuses. grep returns 0 if it finds matching lines, 1 if it doesn’t find any matching lines, and a number greater than 1 if there was an error. Because Hadoop interprets any exit code greater than 0 as an error, any map task that doesn’t find a matching line will be considered 'failed' by Hadoop, which will retry those tasks without success. To fix this, we have to swallow grep’s exit status like so:
(grep foobar || true)
This ensures that Hadoop doesn’t erroneously kill your jobs.
As you might expect, sort sorts lines. By default it sorts alphabetically, considering the whole line:
sort foobar.txt
You can also tell it to sort numerically with the -n option, but -n only handles integers and simple decimals. To sort numbers in scientific notation properly, use the -g option instead:
sort -g numbers.txt
You can reverse the sort order with -r:
sort -r foobar.txt
You can also specify a column to sort on with the -k option:
sort -k2 foobar.txt
By default the column delimiter is a non-blank to blank transition, so any stretch of non-whitespace characters followed by whitespace (tab, space, and so on) is treated as a column. This can be tricky if your data is tab-delimited but contains spaces within columns. For example, if you were trying to sort some tab-delimited data containing movie titles, you would have to tell sort to use tab as the delimiter. If you try the obvious solution, you might be disappointed with the result:
sort -t"\t"
sort: multi-character tab `\\t'
Instead we have to somehow give the -t option a literal tab. The easiest way to do this is:
sort -t$'\t'
$'<string>' is a special directive that tells your shell to expand <string> into its equivalent literal. You can do the same with other control characters, including \n, \r, and so on.
Another useful way of doing this is by inserting a literal tab manually:
sort -t' '
To insert the tab literal between the single quotes, type CTRL-V and then Tab.
If you find your sort command is taking a long time, try increasing the size of its sort buffer with the -S (--buffer-size) option. This can make things go a lot faster:
sort -S 500M foobar.txt
sort earns its keep in big data work: Hadoop itself sorts every record by key between the map and reduce phases, and sorting your own output is often the quickest way to spot duplicates, outliers, and malformed records before they cause trouble downstream.
uniq is used for working with duplicate lines: you can count them, remove them, and look for them, among other things. For example, here is how you would find the number of Oscars each actor has won, given a list of annual Oscar winners:
cat oscar_winners.txt | sort | uniq -c
Note the -c option, which prepends the output with a count of the number of duplicates. Also note that we sort the list before piping it into uniq; input to uniq must always be sorted or you will get erroneous results.
You can also print only the lines that are never repeated with the -u option:
sort foobar.txt | uniq -u
And only print duplicates with the -d option:
sort foobar.txt | uniq -d
Between -c, -u, and -d, uniq covers most of the quick deduplication and counting you’ll need when inspecting job output.
wc is a utility for counting words, lines, and characters in text. Without options, it reads its input and outputs the number of lines, words, and bytes, in that order:
wc foobar.txt
Given the -m option, wc will instead print the number of characters, as defined by the LC_CTYPE environment variable:
wc -m foobar.txt
We can use wc as a mapper to count the total number of words in all of our files on the HDFS:
wu-mapred --mapper='wc -w'
Each map task emits the word count of its own slice of the data; summing those counts gives the total.
Why not Hive? The appealing thing about Hive is that it feels a lot like SQL. The dismal thing about Hive is that it feels a lot like SQL. Similarly, the wonderful thing about Pig is that its operations more closely mirror the underlying map/reduce setup, making it easier to reason about the performance of your tasks; this, however, means more brain-bending at the outset for a traditional DBA. Lastly, Hive organizes your data, which is useful for a multi-analyst setup but a pain when using a polyglot toolset. Ultimately, Hive is better for an enterprise data warehouse experience, or if you’re already a SQL expert; all else equal, for exploratory analysis and data science, you’re better off with Pig.