How to do a "Control Break" (Algorithm)?

A large number of the problems asked in "Shell Programming and Scripting" can be traced back to a basic algorithm called a control break. Every programmer (and script writers are programmers too) should recognize problems of this sort immediately and know how to deal with them. We will first discuss the problem in theory, then implement a shell script that works on an example data set to show the ropes.

The Single Control Break

The most basic form is the single control break. It occurs when some record-oriented data is sortable by a key and all the records with an identical key value are to be processed somehow.

Too complicated? Perhaps, but in fact it is really easy: suppose you have a file of customers (the key) and their purchase values. The goal is to get the purchase total for every customer. You build a sum (the processing) over all entries with the same customer ID (the identical key values). Let's see:

Code:

Alan      75
Bill      50
Charlie   75
Bill      40
Charlie   55
Alan      25
Bill      30

The first thing we have to do is identify our key, the part we use to tell the customers apart (here, their names), and the data we need for our processing, in this case the purchase values. This is easy in our example, but can be quite tricky in real-world applications.

By the way: do not confuse "key" and "key value". Key is the part of the line we sort on. Here it is the first word. Key value is the value the key part holds in each record. The key value for line 1 is "Alan", while for line 5 it is "Charlie":

Code:

<Key>     <data>
Alan      75
Bill      50
Charlie   75
Bill      40
Charlie   55
Alan      25
Bill      30

The next step is really easy for shell programmers, because there is a genuine UNIX command for it: we need to sort our input data set on the key we have identified. In our case this simply means: sort, without any options.
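Plain sort compares whole lines, which works here because the key is the first thing on every line. When the key sits elsewhere, or when you want to say explicitly what you sort on, the standard -k option names the key field. A minimal sketch, assuming whitespace-separated fields as in the example (the function name sort_by_customer is made up for this illustration):

```shell
#!/bin/sh
# Sort records on the first field only (the key).
# Records with equal keys are then ordered by the rest of the line,
# because sort falls back to comparing whole lines on ties.
sort_by_customer() {
    sort -k 1,1
}

sort_by_customer <<'EOF'
Alan      75
Bill      50
Charlie   75
Bill      40
Charlie   55
Alan      25
Bill      30
EOF
```

For this data set the result is the same as with plain sort.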

Data after applying "sort"

Code:

Alan      25
Alan      75
Bill      30
Bill      40
Bill      50
Charlie   55
Charlie   75

Suppose we read this, line by line. Because it is sorted we can rely on all the identical key values coming right one after the other. That is, once we read a line with "Bill", there will be no "Alan"s any more. Keep this in mind while we set up a simple read-loop for the file:

Script.version1.sh:

Code:

#! /bin/ksh

typeset    customer=""
typeset -i value=0
typeset    infile="./input"

sort "$infile" |\
while read customer value ; do
     print - "Customer is: $customer\t\t Purchase is: $value"
done

exit 0
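How read splits a record is what separates key and data here: the first word goes into the first variable, and whatever is left of the line goes into the last one. A small sketch in portable sh rather than ksh (so printf stands in for print; the function name split_record and the second sample record are made up for the illustration):

```shell
#!/bin/sh
# Split each input record into key (first word) and data (the rest).
# -r keeps backslashes in the input from being treated as escapes.
split_record() {
    while read -r key data; do
        printf 'key=<%s> data=<%s>\n' "$key" "$data"
    done
}

split_record <<'EOF'
Alan      75
Bill      50 paid in cash
EOF
# prints:
# key=<Alan> data=<75>
# key=<Bill> data=<50 paid in cash>
```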

OK, the input is read correctly, as the output of the script shows. So let us come back to our last idea: every time the value of the key changes, one "group" (one certain customer) is finished and a new one begins.

Every time the key value changes from one to the next line we need to output the total. Let us see if we get the line where this happens correctly - we will just mark it, do nothing else:

Script.version2.sh:

Code:

#! /bin/ksh

typeset    customer=""
typeset -i value=0
typeset    infile="./input"
typeset    lastcustomer=""

sort "$infile" |\
while read customer value ; do
     if [ "$lastcustomer" != "$customer" ] ; then
          print - "Here needs to be a total."
     fi
     print - "Customer is: $customer\t\t Purchase is: $value"

     lastcustomer="$customer"
done

exit 0
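To see what this version does without running it against the full file, here is a sketch of the same loop in portable sh, fed with a shortened, already sorted sample (printf instead of print is my substitution; the function name mark_breaks is made up). The output is shown in the comments:

```shell
#!/bin/sh
# Print a marker whenever the key value differs from the previous record.
# Expects input that is already sorted on the key.
mark_breaks() {
    last=""
    while read -r customer value; do
        if [ "$last" != "$customer" ]; then
            printf '%s\n' "Here needs to be a total."
        fi
        printf '%s %s\n' "$customer" "$value"
        last=$customer
    done
}

mark_breaks <<'EOF'
Alan 25
Alan 75
Bill 30
EOF
# prints:
# Here needs to be a total.
# Alan 25
# Alan 75
# Here needs to be a total.
# Bill 30
```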

Well - almost, yes? Two things aren't quite right.

First, a total is requested before the very first line, which is nonsense. It happens because "lastcustomer" is pre-set to "", which is of course different from the first customer name we actually read. Still, there should be no total there.

Second, the last total, for "Charlie", is missing altogether. The reason is that after the last record, which ends the group of "Charlie"s, the loop is simply left. So let us fix the code to take care of these two problems:

Script.version3.sh:

Code:

#! /bin/ksh

typeset    customer=""
typeset -i value=0
typeset    lastcustomer=""
typeset    infile="./input"

sort "$infile" |\
while read customer value ; do
     if [ "$lastcustomer" != "$customer" -a "$lastcustomer" != "" ] ; then
          print - "Here needs to be a total"
     fi
     print - "Customer is: $customer\t\t Purchase is: $value"

     lastcustomer="$customer"
done
print - "Here needs to be a total"

exit 0

Very well! Now let us implement the total: we already know where we need to output it. We add up every value we read in a sum variable. When the "control break" happens, that is, when the key value changes, we output the sum, reset the sum variable to zero and continue. Let's do it:

Code:

#! /bin/ksh

typeset    customer=""
typeset -i value=0
typeset    lastcustomer=""
typeset    infile="./input"
typeset -i sum=0

sort "$infile" |\
while read customer value ; do
     if [ "$lastcustomer" != "$customer" -a "$lastcustomer" != "" ] ; then
          print - "--- Total for $lastcustomer is $sum"
          (( sum = 0 ))
     fi
     print - "Customer is: $customer\t\t Purchase is: $value"
     (( sum = sum + value ))

     lastcustomer="$customer"
done
print - "--- Total for $lastcustomer is $sum"

exit 0
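One portability caveat: the total printed after the loop relies on the while loop running in the current shell, so that lastcustomer and sum are still set after "done". The AT&T ksh (ksh88, ksh93) behaves that way, but bash (by default), dash and, as far as I know, pdksh and mksh run the last part of a pipeline in a subshell, and there the final total would come out empty. A minimal sketch that does not depend on this: it groups the loop and the final output in braces (portable sh; the function name totals and the terse output format are my own choices):

```shell
#!/bin/sh
# Single control break: print one "key total" line per customer.
# The braces keep the loop and the final total in the same process,
# whichever shell runs the right-hand side of the pipe.
totals() {
    sort -k 1,1 | {
        last="" sum=0
        while read -r customer value; do
            if [ -n "$last" ] && [ "$last" != "$customer" ]; then
                printf '%s %d\n' "$last" "$sum"    # control break
                sum=0
            fi
            sum=$((sum + value))
            last=$customer
        done
        # close the last group, unless there was no input at all
        if [ -n "$last" ]; then
            printf '%s %d\n' "$last" "$sum"
        fi
    }
}

totals <<'EOF'
Alan      75
Bill      50
Charlie   75
Bill      40
Charlie   55
Alan      25
Bill      30
EOF
# prints:
# Alan 100
# Bill 120
# Charlie 130
```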

That was really easy, wasn't it? In fact, that was all - we solved the problem! But suppose we had to calculate the average of the purchases instead of the total for each customer. You surely know by now how this works, no?

OK, don't read any further! Instead, do it yourself and compare your solution to mine:

Code:

#! /bin/ksh

typeset    customer=""
typeset -i value=0
typeset    lastcustomer=""
typeset    infile="./input"
typeset -i sum=0             # total of the purchases
typeset -i avg=0             # average purchase
typeset -i num=0             # number of purchases

sort "$infile" |\
while read customer value ; do
     if [ "$lastcustomer" != "$customer" -a "$lastcustomer" != "" ] ; then
          (( avg = sum / num ))        # calculate average
          print - "--- Average purchase of $lastcustomer is $avg"
          (( sum = 0 ))                # clear counters
          (( num = 0 ))
          (( avg = 0 ))
     fi
     print - "Customer is: $customer\t\t Purchase is: $value"
     (( sum = sum + value ))
     (( num = num + 1 ))

     lastcustomer="$customer"
done

(( avg = sum / num ))
print - "--- Average purchase of $lastcustomer is $avg"

exit 0
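Two details of this version deserve a second look: (( avg = sum / num )) is an integer division, so any fraction is cut off, and with an empty input file num is still 0 when the final average is calculated, which is a division by zero. A sketch of one way to handle both, assuming non-negative values and that two decimal places are enough (portable sh; the function name average is hypothetical):

```shell
#!/bin/sh
# Print sum/num with two decimals using integer arithmetic only.
# Returns 1 and prints nothing when there are no records.
average() {
    sum=$1 num=$2
    if [ "$num" -eq 0 ]; then
        return 1
    fi
    cents=$(( sum * 100 / num ))       # average scaled by 100, truncated
    printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

average 100 2                       # prints 50.00
average 100 3                       # prints 33.33
average 0 0 || echo "no purchases"  # prints no purchases
```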

Very well! But you see, when the "end processing" gets more and more complicated there is more and more redundant code to be written: once inside the main loop, once after it. It is therefore a good idea - at least for anything less trivial than summation - to move the end processing of each group to a function you can call:

Code:

#! /bin/ksh

# "function name" syntax, so typeset keeps these variables local in ksh93 too
function pEndProcessing
{
     typeset    cust="$1"
     typeset -i sum=$2
     typeset -i num=$3
     typeset -i avg=$(( sum / num ))

     print - "--- Average purchase of $cust is $avg"
     print - ""                   # insert extra line feed for easier reading

     return 0
}


# main ()
typeset    customer=""
typeset -i value=0
typeset    lastcustomer=""
typeset    infile="./input"
typeset -i sum=0             # total of the purchases
typeset -i num=0             # number of purchases

sort "$infile" |\
while read customer value ; do
     if [ "$lastcustomer" != "$customer" -a "$lastcustomer" != "" ] ; then
          pEndProcessing "$lastcustomer" $sum $num
          (( sum = 0 ))                # clear counters
          (( num = 0 ))
     fi
     print - "Customer is: $customer\t\t Purchase is: $value"
     (( sum = sum + value ))
     (( num = num + 1 ))

     lastcustomer="$customer"
done

pEndProcessing "$lastcustomer" $sum $num

exit 0

Perfect! This was our final transformation, and I promise you will be able to solve all kinds of simple control break type problems by adapting this blueprint. That's all!

The Multiple Control Break

You might already have sensed it from the innocent word "single": this wasn't really all there is. Where there is "single" there is "multiple" and the same is true here. So here is the multiple control break, which is of course more complex than the single one. But don't panic! The solution is really easy to grasp for experts of the single control break - you!

Suppose that every group you have to process consists of several subgroups you have to process too. Sounds complicated again? Well, an example will clear that up. This is a list of the cars our customers purchased:

file input2

Code:

Bell    Dodge   Charger
Craig   VW      Touareg
Graham  VW      Golf
Jones   Dodge   Dart
Leslie  Dodge   Dart
Myers   Dodge   Avenger
Rock    Dodge   Avenger
Loman   VW      Beetle
Smith   Dodge   Dart
Smyth   VW      Beetle

You see, we have several models as well as several manufacturers. In the end I not only want to know how many of every model we sold, but also how the manufacturers fared. So we need a sum over all the "Dodge"s and the "VW"s, but also a sub-sum for the Dodge Charger, one for the Dodge Dart, etc.

We start again by identifying the key(s): now every key consists of a main key and a sub-key. If the main key value changes, we have a "primary control break"; if only the sub-key value changes, we have a "secondary control break". Instead of the single-layer control break of our first example we now have a two-layer hierarchy. Anything with more than a single layer (instead of the 2 levels here there could be several) means executing a multiple control break.

This time the sorting process is way more complex. I suggest you consult the man page of sort if you are not absolutely sure what the following command does:

Code:

# sort -bk 2 input2
Myers   Dodge   Avenger
Rock    Dodge   Avenger
Bell    Dodge   Charger
Jones   Dodge   Dart
Leslie  Dodge   Dart
Smith   Dodge   Dart
Loman   VW      Beetle
Smyth   VW      Beetle
Graham  VW      Golf
Craig   VW      Touareg
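For readers who would rather not dig through the man page: -k 2 makes the sort key start at field 2 and run to the end of the line, and -b ignores the leading blanks of the key. A sketch of a more explicit spelling that names the main key and the sub-key separately (an alternative I am suggesting, not the command used in this article; the function name sort_cars is made up, and the sample is a shortened version of input2):

```shell
#!/bin/sh
# Sort on manufacturer (field 2) first, then on model (field 3).
# -b ignores the blanks in front of each key field.
sort_cars() {
    sort -b -k 2,2 -k 3,3
}

sort_cars <<'EOF'
Bell    Dodge   Charger
Craig   VW      Touareg
Graham  VW      Golf
Jones   Dodge   Dart
Myers   Dodge   Avenger
Smyth   VW      Beetle
EOF
```

Spelled this way, the comparison no longer depends on how many blanks separate manufacturer and model.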

As you have already gotten this far, I am sure the following code will be obvious to you. I changed the totalling function a bit to do either a "big" total (for the manufacturers) or a "small" total (for the models). But beware: I have introduced a very subtle weakness into the program and you might want to try to find it. Spoilsports will find it below, but if you want to try your debugging expertise - be my guest.

And now, without further ado, here is the "double control break":

Code:

#! /bin/ksh

# "function name" syntax, so typeset keeps these variables local in ksh93 too
function pEndProcessing
{
     typeset    type="$1"         # "small" (model) or "big" (manufacturer)
     typeset    manu="$2"
     typeset    mod="$3"
     typeset -i num=$4

     case "$type" in
          small)
               print - "Total sold $manu $mod's: $num"
               ;;

          big)
               print - "-------- Total sold $manu's: $num\n"
               ;;

          *)
               print -u2 - "Error: I cannot handle mode $type"
               ;;
     esac

     return 0
}


# main ()
typeset    customer=""
typeset    manufacturer=""
typeset    model=""
typeset    lastmanufacturer=""
typeset    lastmodel=""
typeset -i nummanu=0
typeset -i nummodel=0
typeset    infile="./input2"

sort -bk 2 "$infile" |\
while read customer manufacturer model ; do
     if [ "$lastmodel" != "$model" -a "$lastmodel" != "" ] ; then
          pEndProcessing small "$lastmanufacturer" "$lastmodel" $nummodel
          if [ "$lastmanufacturer" != "$manufacturer" ] ; then
               pEndProcessing big "$lastmanufacturer" "$lastmodel" $nummanu
               (( nummanu = 0 ))
          fi
          (( nummodel = 0 ))
     fi

     (( nummanu = nummanu + 1 ))
     (( nummodel = nummodel + 1 ))
     lastmodel="$model"
     lastmanufacturer="$manufacturer"
done

pEndProcessing small "$lastmanufacturer" "$lastmodel" $nummodel
pEndProcessing big "$lastmanufacturer" "$lastmodel" $nummanu

exit 0

Have you found the weakness? You probably need a hint. OK, the problem is in this line:

Code:

if [ "$lastmodel" != "$model" -a "$lastmodel" != "" ] ; then

Still not sure? OK, here is the last hint: replace the input file "input2" with the following "input3" and let it run again:

Code:

Bell    MaA     A
Craig   MaB     C
Graham  MaB     D
Jones   MaA     A
Leslie  MaA     B
Myers   MaA     C
Rock    MaA     A
Loman   MaB     D
Smith   MaA     C
Smyth   MaB     E

Solution follows:

When you look at the sorted file, you will notice that the "last" model of "MaA" has the same name as the "first" model of "MaB". Because of this the "big" control break for the manufacturer is not executed. The weakness of the line is that it only tests the model name: it assumes that every "big" group change also shows up as a "small" group change. This does not necessarily have to be the case. Therefore modify the line like this:

Code:

     if [ \( "$lastmodel" != "$model" \
             -o "$lastmanufacturer" != "$manufacturer" \
          \) -a "$lastmodel" != "" ] ; then

and you will see the program can even process the "input3" file.
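The corrected condition can also be read as a general rule: a break on a higher level always forces a break on every level below it. A sketch of the double control break written around that rule, in portable sh (the function name car_totals, the "level" variable, the printf calls and the compact output format are my own; the logic is meant to match the corrected script):

```shell
#!/bin/sh
# Double control break: count cars per model and per manufacturer.
# Input: "customer manufacturer model" records in any order.
car_totals() {
    sort -b -k 2 | {
        lastmanu="" lastmodel="" nummanu=0 nummodel=0
        while read -r customer manu model; do
            level=0                                     # 0 = no break
            [ "$lastmodel" != "$model" ] && level=1     # sub-key changed
            [ "$lastmanu" != "$manu" ] && level=2       # main key changed
            if [ -n "$lastmanu" ] && [ "$level" -ge 1 ]; then
                printf '%s %s: %d\n' "$lastmanu" "$lastmodel" "$nummodel"
                nummodel=0
                if [ "$level" -ge 2 ]; then
                    printf '%s total: %d\n' "$lastmanu" "$nummanu"
                    nummanu=0
                fi
            fi
            nummanu=$((nummanu + 1))
            nummodel=$((nummodel + 1))
            lastmanu=$manu lastmodel=$model
        done
        # close the last model and the last manufacturer, if any
        if [ -n "$lastmanu" ]; then
            printf '%s %s: %d\n' "$lastmanu" "$lastmodel" "$nummodel"
            printf '%s total: %d\n' "$lastmanu" "$nummanu"
        fi
    }
}

car_totals <<'EOF'
Bell    MaA     A
Craig   MaB     C
Graham  MaB     D
Jones   MaA     A
Leslie  MaA     B
Myers   MaA     C
Rock    MaA     A
Loman   MaB     D
Smith   MaA     C
Smyth   MaB     E
EOF
# prints:
# MaA A: 3
# MaA B: 1
# MaA C: 2
# MaA total: 6
# MaB C: 1
# MaB D: 2
# MaB E: 1
# MaB total: 4
```

With a third key level the same pattern extends naturally: test the levels from the lowest to the highest, and let the highest one that changed win.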

Happy shell programming.

bakunin
