Command-line flow
Things to remember about command-line flows:
- Each command should do one small thing and do it well
- Commands are meant to be chained (piped) together, feeding the output of one command into another. We can do really powerful things with this tactic.
- Text is our putty here. We read text files, filter them, mangle them, and place the output into other text files that we can then mangle again.
Example use: Extract image URLs from potato-varieties.json. For this we just need to look for occurrences of "jpg". The cat command simply outputs the contents of the file, while grep filters that output to show only the lines that match an expression, in this case "jpg":
$ cat potato-varieties.json | grep jpg
We can then use sed to strip both unwanted characters, the double quote and the comma, by replacing them with nothing. Sed is not a trivial command, but it is worth learning for this kind of text filtering.
The expression syntax for sed is s/expression_to_search/replacement/g.
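As a quick standalone illustration of that syntax (the input string here is made up):

```shell
# Replace every occurrence of "cat" with "dog"; the trailing g makes
# sed replace all matches on a line, not just the first one
echo "the cat sat on the cat mat" | sed 's/cat/dog/g'
# prints: the dog sat on the dog mat
```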
$ cat potato-varieties.json | grep jpg | sed 's/[",]//g'
The bracket expression we used, [",], matches both the double quote character (") and the comma, and the empty replacement removes them.
Now we can use that command's output in a for loop.
This feeds each URL to whatever command we want, in this case wget (which downloads a file from a URL):
$ for img in `cat potato-varieties.json | grep jpg | sed 's/[",]//g'`; do wget "$img"; done
This command is already pretty complex! It uses a bash for loop, and wraps commands in backticks so that we can substitute in their output.
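The same loop can also be written with $( ) instead of backticks, which is easier to read and to nest. Here is a sketch using a small made-up sample file and echo in place of wget, so nothing is actually downloaded:

```shell
# Create a tiny sample file (hypothetical data) to run the pipeline on
printf '  "http://example.com/russet.jpg",\n  "http://example.com/yukon.jpg",\n' > varieties.json

# $(...) substitutes the pipeline's output, just like backticks do
for img in $(cat varieties.json | grep jpg | sed 's/[",]//g'); do
  echo "$img"   # swap echo for wget to actually download each URL
done
```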
grep = personal Google
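For instance, grep can recursively search every file under a directory, like a search engine for your own notes (the directory and search term here are made up):

```shell
# -r searches recursively; -i ignores case; each matching line is
# printed along with the name of the file it was found in
grep -ri "mashed" notes/
```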
CSVkit: the poor man's pivot table
csvgrep: Get the rows with "Excellent" in the "taste" column and save them to a new file
$ cat filename.csv | csvgrep -c taste -m Excellent > excellent_varieties.csv
Note the use of ">" to send the command output to a file! This is extremely handy for filtering and narrowing large datasets into smaller subsets for analysis or visualization.
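Redirection works with any command, not just csvkit. A minimal sketch with plain grep and made-up data:

```shell
# Build a tiny CSV (hypothetical data)
printf 'variety,taste\nRusset,Excellent\nYukon,Fair\n' > varieties.csv

# Keep only the matching rows and write them to a new file
grep Excellent varieties.csv > excellent.csv
cat excellent.csv
# prints: Russet,Excellent
```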
Try typing "csvgrep --help" to see how the command can be used.
csvcut, csvsort, csvlook: Get a CSV with only the columns you need from the full dataset; sort it by one of those columns; show it nicely in the command line
$ cat filename.csv | csvcut -c column-name,othercolumn-name | csvsort -c column-name | csvlook | less
The less command allows you to scroll through the output.
Now go look at the wonderful csvkit documentation! http://csvkit.readthedocs.org/
Other notes
Pythonpy is a great tool for people like me who like Python more than bash.
https://github.com/Russell91/pythonpy has a fantastic introduction that might very well convince you to give it a try.
Pythonpy allows you to use Python one-liners to filter and process shell output. Here's a simple example where all the a's in the output are replaced by u's:
$ cat filename.csv | csvcut -c column-name,othercolumn-name | csvsort -c column-name | py -x 'x.replace("a", "u")'
Note how we call a Python method on the variable x, which holds each line of the output in turn.
We could have done the same thing with sed (explained above), but it is often simpler to resort to Python than to look up less friendly commands like awk or tr.
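For comparison, here is the same a-to-u replacement done with sed (on a made-up input string):

```shell
# Replace every "a" with "u", exactly what the py one-liner did
echo "potato salad" | sed 's/a/u/g'
# prints: potuto sulud
```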