We often find ourselves in a situation where we want to extract information from a text and present it in the form of a report. To do so, we usually rely on a set of utilities, such as word processors and spreadsheets, to search and replace text, create tables, summarize, and so on.
The Linux utilities allow us to do this in a few lines on the command line, and even to parameterize and script it so that it can be repeated with different inputs.
In this post, we will explore some useful commands for filtering information from a text.
There are groups of commands that filter rows, others that work on columns, and others that do both. We also have commands to count, sort, combine, split, and replace text.
The examples below show one way to obtain the expected result. It is not the only way: several commands produce similar outputs and can be combined in different ways while producing the same result.
The commands were executed in bash 4.3, but most of the examples also work in earlier versions, in other shells, and on other Unix-like operating systems.
Sample text
For the examples, we will use the following text, included in the sampletext.txt file. From now until the end of the post, we will assume that the file contains exactly the text shown:
Basic commands
I will explain how to use the basic commands for text filtering.
Filtering lines: grep
This is the most commonly used command to filter lines in a text. It shows the lines that match a pattern and discards the rest.
It allows us to get rid of everything that is not useful and keep only the lines of the text that interest us.
It has several flags and variants that make it more flexible, and it allows the use of regular expressions to extend its filtering capabilities.
In our sample text, suppose we want to show only the lines in which the string Intraway appears. We use:
$ grep Intraway sampletext.txt
(If the string contains blank spaces, you can wrap it in single or double quotation marks.)
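For example, to search for a phrase containing a space (the phrase appears in the line quoted later in the cut section):
$ grep "Service Provider" sampletext.txt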
If we want to show the lines containing both the strings Intraway AND solution, we use:
$ grep Intraway sampletext.txt | grep solution
(Remember that matching is case sensitive. To ignore the difference between uppercase and lowercase, we have to use the flag -i.)
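For instance, a case-insensitive version of the previous search might look like this (the exact matches depend on the sample file's contents):
$ grep -i intraway sampletext.txt | grep -i solution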
Now, if we want to show the lines that contain the string Intraway OR broadband, we use:
$ egrep "Intraway|broadband" sampletext.txt
To show every line that does not contain the character ‘a’, ignoring case, we use:
$ grep -v -i "a" sampletext.txt
If we want to remove the blank lines, we can use grep -v ^$.
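Applied to the sample file, this prints every line except the empty ones:
$ grep -v ^$ sampletext.txt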
Filtering columns: cut
To separate a line into columns and then show only some of those columns, we may use the cut command. It splits the line using a given character as the delimiter.
For example, given the following line:
Intraway orchestrates service delivery and administration while providing northbound interfaces and process guidelines to integrate OSS/BSS systems and achieve the full scope of business process automation essential to a Service Provider.
We want to split on blank spaces and show the first column and the third to the sixth columns. We use:
$ echo "Intraway orchestrates service delivery and administration while providing northbound interfaces and process guidelines to integrate OSS/BSS systems and achieve the full scope of business process automation essential to a Service Provider." | cut -d ' ' -f 1,3-6
The command can be used to filter a file as well:
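For instance, the same field selection can be applied to every line of the file (the output naturally depends on each line's contents):
$ cut -d ' ' -f 1,3-6 sampletext.txt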
Replacing characters: tr
In the cases where we need to replace a single character with another one, tr is the simplest way to do this.
For example, if we need to replace the letter ‘e’ with ‘a’ in a sentence, we use:
$ echo "Hello" | tr 'e' 'a' Hallo
We can also use tr to change the case of a character, for example:
$ echo "Hello, this is an example" | tr '[:lower:]' '[:upper:]' HELLO, THIS IS AN EXAMPLE
We can also use it to replace digits or alphabetic characters with another character:
$ echo "Hello, this is an example" | tr '[:alpha:]' '*' *****, **** ** ** *******
$ echo "Hello, this is a number 123456" | tr '[:digit:]' 'X' Hello, this is a number XXXXXX
Command Substitution
Often we need to execute a command so that the result is used as an argument to another command.
In bash, we can use the substitution operators `COMMANDS` and $(COMMANDS), which can be combined with each other.
The output of whatever is executed within the operator is inserted into the surrounding command line.
For example:
$ date '+%H:%M'
15:13
If we want to insert this into another command, we could use either of these variants:
$ echo "It's $(date '+%H:%M')" It's 15:13
or
$ echo "It's `date '+%H:%M'`" It's 15:13
Putting it all together
We will see an example that uses everything we have learned so far.
Given the sample file sampletext.txt, suppose we want to get all the text from a given string onward: skip the first lines, show everything from the first occurrence of the string to the end, and then convert it all to lowercase.
Let’s use the tail -n command to show the last N lines of the file, starting at that string. The formula is:
N = FILE_ROWS - STRING_ROW + 1
To obtain the number of lines of the file (FILE_ROWS), we use wc -l as follows:
$ cat sampletext.txt | wc -l
29
Then we want to know on which line the first occurrence of the string “Solution Highlight” is (STRING_ROW). We use grep -n -m 1:
$ grep -n -m 1 "Solution Highlight" sampletext.txt
20:Solution Highlights:
Now we just want to keep the line number, so we use cut with a colon as the delimiter:
$ grep -n -m 1 "Solution Highlight" sampletext.txt | cut -d ':' -f 1
20
The command should be something like this:
$ tail -n $(expr `cat sampletext.txt | wc -l` - `grep -n -m 1 "Solution Highlight" sampletext.txt | cut -d : -f 1` + 1 ) sampletext.txt | tr '[:upper:]' '[:lower:]'
Explanation:
First, the two commands within backticks are executed: `cat sampletext.txt | wc -l` (which returns 29) and `grep -n -m 1 "Solution Highlight" sampletext.txt | cut -d : -f 1` (which returns 20).
After that substitution, the line looks like this:
$ tail -n $(expr 29 - 20 + 1 ) sampletext.txt | tr '[:upper:]' '[:lower:]'
Then, the command within $( ) is executed, and the final command looks like this:
$ tail -n 10 sampletext.txt | tr '[:upper:]' '[:lower:]'
After the execution of tail, the following intermediate output is sent to the tr command:
Solution Highlights:
Quick time-to-market for new offerings
Real time service activation, CDR generation and monitoring
Assists Cable MSOs to increase market penetration among distinct demographic groups
Captive portal, debt management and marketing campaign portal
Diverse business models
Customer Self-Care and Self-Management via the portal, SMS or IVR
CSR Portal for assistance via the Call Center
The final output is:
solution highlights:
quick time-to-market for new offerings
real time service activation, cdr generation and monitoring
assists cable msos to increase market penetration among distinct demographic groups
captive portal, debt management and marketing campaign portal
diverse business models
customer self-care and self-management via the portal, sms or ivr
csr portal for assistance via the call center
Advanced commands: sed and awk
These commands are very powerful; here we will only show a first approach to them.
Replacing text: sed
Sed is used to replace a portion of text with other text, following a pattern that is usually a regular expression.
For example, if we want to replace the text “more than 50 commercial operations” with “up to 100 commercial operations” in the sample file, we use sed as follows:
$ grep "commercial operations" sampletext.txt | sed 's/more than 50 commercial operations/up to 100 commercial operations/g'
The command accepts regular expressions, so it offers more flexibility than just searching for and replacing a fixed string. We will not explain the use of regular expressions here, leaving them for another post.
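Just as a glimpse, a regular expression lets us match any number instead of a fixed string (a small sketch on a plain echo, not on the sample file):
$ echo "more than 50 commercial operations" | sed 's/[0-9][0-9]*/100/'
more than 100 commercial operations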
Filtering, reordering and formatting columns: awk
The awk command is a text processor in itself. It allows us to filter, select, summarize, combine, and even rewrite an entire text.
We will see a short example of this, using the result of the last example in the basic commands section.
If we need to print a table with the first two words of each line after the “Solution Highlights” string, we may use awk as follows:
$ tail -n 9 sampletext.txt | grep -v ^$ | tr ',' ' ' | awk '{ print "|" $1 "\t|" $2 "\t|" }'
Explanation:
The first three commands of the pipeline extract the last nine lines of the file, remove the blank lines, and replace the commas with spaces.
The result is sent to awk, which processes the text, takes each column (using groups of spaces as column separators) and puts them into variables named $1, $2, ... $n. We can print these columns in whatever order we want in the expression that follows the print statement, and we can add extra characters between them, such as spaces, tabs ( \t ) or newlines ( \n ).
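As a tiny illustration of reordering columns (a sketch on a plain echo, not tied to the sample file):
$ echo "one two three" | awk '{ print $3, $1 }'
three one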
Conclusion
We have seen several commands to parse text. There are many more, with many different flags and ways to combine them, all on the same command line. With just a few resources, we can process and extract information and create custom reports.
Intraway.com: http://www.intraway.com/w02/products/service-fulfillment-assurance/docsis-packetcable-management.html
The Linux man pages: https://www.kernel.org/doc/man-pages/