The Universe of Discourse
           
Wed, 15 Feb 2012

Insane calculations in bash
A few weeks ago I wrote an article about various methods of arithmetic calculation in shell scripts and in bash in particular, but it was all leading up to today's article, which I think is more interesting technically.

A while back, Zach Holman (who I hadn't heard of before, but who is apparently a bigwig at GitHub) implemented a kind of cute little hack, called "spark". It's a little shell utility, spark, which gets a list of numbers as its input and uses Unicode block characters to print a little bar graph of the numbers on the output. For example, the invocation:

  spark 2,4,6,8
will print out something like:

  ▃▄▆▇
To do this in one of the 'P' languages (Perl, Python, PHP, Puby, or maybe Pickle) takes something like four lines of code. But M. Holman decided to implement it in bash for maximum portability, so it took 72 lines, not counting comments, whitespace, etc.

Let's begin by discussing the (very simple) mathematics that underlies drawing bar graphs. Suppose you want to generate a set of bars for the numbers $1, $9, $20. And suppose you can actually generate bars of integer heights only, say integers from 0–7:

  0    1 ▁  2 ▂  3 ▃  4 ▄  5 ▅  6 ▆  7 ▇
(M. Holman 's original program did this, even though a height-8 bar █ is available. But the mathematics is the same either way.)

Absolute scaling

The first step is to scale the input numbers onto the range of the bars. To do this, we find a scale factor f that maps dollars onto bar heights, say that f bar units = $1.

A reasonable thing to try is to say that since your largest number is $20, we will set 7 bar units = $20. Then 0.35 bar units = $1, and 3.45 bar units = $9. We'll call these the "natural heights" for the bars.

Unfortunately we can't render the bars at their natural heights; we can only render them at integer heights, so we have to round off. 0.35 bar units rounds off to 0, so we will represent $1 as no bar at all. 3.45 bar units rounds off, badly, to 3, but that's the way it goes; if you try to squeeze the numbers from 1 to 20 into the range 0 to 7, something has to give. Anyway, this gives

     (1,9,20) → ( ▃▇)
The formula is: Let max be the largest input number (here, 20) and let n be the size of the largest possible bar (here, 7). Then an input number x becomes a bar of size n·x / max:

$$x\rightarrow {n\cdot x \over max } $$

Note that this maps max itself to n, and 0 to 0.

I'll call this method "absolute scaling", because big numbers turn into big bars. (It fails for negative numbers, but we'll assume that the numbers are non-negative.)

     (0…20) → (  ▁▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▆▇▇)
There are a couple of variations we might want to apply. First, maybe we don't like that $1 mapped to no bar at all; it's too hard to see, depending on the context. Perhaps we would like to guarantee that only 0 maps to 0. One way to ensure that is to round everything up, instead of rounding to the nearest integer:

     (0…20) → ( ▁▁▂▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇▇)
     (1,9,20)      → (▁▄▇)
Another benefit of always rounding up is that it uses the bars equally. Suppose we're mapping numbers in the range 1–100 to bars of heights 1–7. If we round off to the nearest integer, each bar represents 14 or 15 different numbers, except that the tallest bar only represents the 8 numbers 93–100. This is a typical situation. If we always round up, each bar corresponds to a nearly equal range of numbers. (Another way to adjust this is to replace n with n+½ in the formula.)

Relative scaling

Now consider the numbers $18, $19, $20. Under the absolute scaling method, we get:

     (18,19,20) → (▆▇▇)
or, if you're rounding up,

     (18,19,20) → (▇▇▇)
which obscures the difference between the numbers. There's only an 11% difference between the tallest and shortest bar, and that doesn't show up at this resolution. Depending on your application, this might be what you want, but we might also want to avail ourselves of the old trick of adjusting the baseline. Instead of the bottom of the bar being 0, we can say it represents 17. This effectively reduces every bar by 17 before scaling it, so that the the number x is now represented by a bar with natural height n·(x−17) / (max−17). Then we get these bars:

     (18,19,20) → (▃▅▇)
Whether this "relative scaling" is a better representation than ▇▇▇ depends on the application. It emphasizes different properties of the data.

In general, if we put the baseline at b, the natural height for a bar representing number x is:

$$x\rightarrow {n\cdot (x-b) \over (max-b) } $$

That is the same formula as before, except that everything has been shifted down by b.

A reasonable choice of b would be the minimum input value, or perhaps a bit less than the minimum input value.

The shell sucks

But anyway, what I really wanted to talk about was how to fix this program, because I think my solution was fun and interesting. There is a tricky problem, which is that you need to calculate values like (n-b)/(x-b), which so you might like to do some division, but as I wrote earlier, bash has no facilities for doing fractional arithmetic. The original program used $((…)) everywhere, which throws away fractions. You can work around that, because you don't actually the fractional part of (n-b)/(x-b); you only need the greatest integer part. But the inputs to the program might themselves be fractional numbers, like say 3.5, and $((…)) barfs if you try to operate on such a number:

	$ x=3.5; echo $((x + 1))
	bash: 3.5: syntax error: invalid arithmetic operator (error token is ".5")
and you seemingly cannot work around that.

My first response to this was to replace all the uses of $((…)) with bc, which, as I explained in the previous article, does not share this problem. M. Holman rejected this, saying that calling out to bc all the time made the program too slow. And there is something to be said for this. M. Holman also said that bc is non-portable, which I find astounding, since it has been in Unix since 1974, but sadly plausible.

So supposing that you take this complaint seriously, what can you do? Are you just doomed? No, I found a solution to the problem that solves all the problems. It is portable, efficient, and correct. It is also slightly insane.

Portable fractions in bash

We cannot use decimal numbers:

	$ x=3.5; echo $((x + 1))
	bash: 3.5: syntax error: invalid arithmetic operator (error token is ".5")
But we can use fractions:

	$ x_n=7; x_d=2; echo $((x_n + x_d))/$((x_d))
        9/2
And we can convert decimal inputs to fractions without arithmetic:

        # given an input number which might be a decimal, convert it to
        # a rational number; set n and d to its numerator and
        # denominator.  For example, 3.3 becomes n=33 and d=10;
        # 17 becomes n=17 and d=1.
        to_rational() {
          # Crapulent bash can't handle decimal numbers, so we will convert
          # the input number to a rational
          if [[ $1 =~ (.*)\.(.*) ]] ; then
              i_part=${BASH_REMATCH[1]}
              f_part=${BASH_REMATCH[2]}
              n="$i_part$f_part";
              d=$(( 10 ** ${#f_part} ))
          else
              n=$1
              d=1
          fi
        }
This processes a number like 35.17 in a purely lexical way, extracting the 37 and the 17, and turning them into the numerator 3517 and the denominator 100. If the input number contains no decimal point, our task is trivial: 23 has a numerator of 23 and a denominator of 1.

Now we can rewrite all the shell arithmetic in terms of rational numbers. If a_n and a_d are the numerator and denominator of a, and b_n and b_d are the numerator and denominator of b, then addition, subtraction, multiplication, and even division of a and b are fast, easy, and even portable:

        # a + b
        sum_n = $((a_n * b_d + a_d * b_n))
        sum_d = $((a_d * b_d))

        # a - b
        diff_n = $((a_n * b_d - a_d * b_n))
        diff_d = $((a_d * b_d))

        # a * b
        prod_n = $((a_n * b_n))
        prod_d = $((a_d * b_d))

        # a / b
        quot_n = $((a_n * b_d))
        quot_d = $((a_d * b_n))
We can easily truncate a number to produce an integer, because the built-in division does this for us:

        greatest_int = $((a_n / a_d))
And we can round to the nearest integer by adding 1/2 before truncating:

        nearest_int = $(( (a_n * 2 + a_d) / (a_d * 2) ))
(Since n/d + 1/2 = (2n+d)/2d.)

For complicated calculations, you can work the thing out as several steps, or you can solve it on paper and then just embed a big rational expression. For example, suppose you want to calculate ((x-minnumber_of_tiers)/range, where number_of_tiers is known to be an integer. You could do each operation in a separate step, or you could use instead:

  tick_index_n=$(( ( x_n * min_d - min_n * x_d ) * number_of_tiers * range_d ))
  tick_index_d=$(( range_n * x_d * min_d ))
Should you need to convert to decimals for output, the following is a proof-of-concept converter:

	function to_dec {
	  n=$1
	  d=$2
          maxit=$(( 1 + ${3:-10} ))
	  while [ $n != 0 -a $maxit -gt -1 ]; do
	    next=$((n/d))
	    if [ "$r" = "" ]; then r="$next."; else r="$r$next"; fi
	    n=$(( (n - d * next) * 10 ))
	    maxit=$(( maxit - 1 ))
	  done
	  r=${r:-'0.'}
	}
For example, to_dec 13 8 sets r to 1.625, and to_dec 13 7 sets r to 1.857142857. The optional third argument controls the maximum number of digits after the decimal point, and defaults to 10. The principal defect is that it doesn't properly round off; frac2dec 19 10 0 yields 1. instead of 2., but this could be fixed without much trouble. Extending it to convert to arbitrary base output is quite easy as well.

Coming next month, libraries in bash for computing with continued fractions using Gosper's algorithms. Ha ha, just kidding. The obvious next step is to implement base-10 floating-point numbers in bash like this:

  prod_mantissa=$((a_mantissa * b_mantissa))
  prod_exponent=$((a_exponent + b_exponent))
[ Addendum 20120306: David Jones corrects a number of portability problems in my implementation. ]


[Other articles in category /prog] permanent link