Robust fitting

However, the square of the vertical distance is only one
of many reasonable ways of measuring the distance from the data points to
the line *mx* + *b*. For example, it is perfectly reasonable to consider instead
the expression
1 + (*mx*_{i} + *b* - *y*_{i})^{2}. If (*x*_{i}, *y*_{i}) is on the line, this
value would be equal to 1, and its logarithm would then be zero. Thus, the
function

There is a priori no particular reason to prefer one error function over
another. And for each choice of such we could end up with a problem whose
solution exists and produces a line that best fit the data.
Then we could compute the value of the error function for the best fit line,
value that depends upon the choice of function made. A *canonical* way of
choosing the best error function could be that for which this number is the
smallest. But finding such a function is quite a difficult, if not
impossible,
problem to solve. If we limit our attention to *linear estimators*,
in a sense to be explained in the last section of this chapter, the solution
obtained in §3 is optimal.
In here, we pursue further the problem of finding the best fit according to
the two error functions introduced above.

Notice that if
= *mx*_{i} + *b* - *y*_{i}, data with large
errors
excert a huge influence on the original error function
of §3 (because we use
). The two error
functions mentioned above grow linearly (or sub-linearly)
with
||, causing the tails to count much less heavily. We will
see this when comparing the different results.

In order to solve the problem now at hand, we first generate some ``noisy''
data. We load the routines in `lsq_data.txt`

:

> read(`lsq_data.txt`);

The data and its graphical visualization can then be obtained by:

> pts:=bad_line_pts():

> plot(pts,style=point,symbol=box);

You should notice that while most of the data lies fairly near a line, there is one point which is extremely far away from the rest of the data.

In order to compare results, let us first find the line given by least
squares. First, we define a function , which gives the signed
vertical distance between a point pt and a line of slope
m and intercept b

> epsilon:= (pt,m,b) -> (m*pt[1]+b-pt[2]);

Now we compute the regular least squares fit.

> H:=(m,b,pts)->sum(epsilon(pts[i],m,b))^2,i=1..nops(pts)): sol:=solve(diff(H(m,b,pts),m)=0, diff(H(m,b,pts),b)=0,m,b);

> display(plot(pts,style=point,symbol=box),plot(subs(sol,m*x+b),x=-10..10), view=-20..20);

In the above, we restricted our attention to the region
-20 < *y* < 20,
where the line and the majority of the data lies. As you can see, the
fit to the majority of the data is very poor, because of the influence
exerted by the anomalous data point.

Now, let's try again, this time minimizing ln(1 + /2), which behaves like for small errors, but grows very slowly for large ones.

> R:=(m,b,pts)->sum(ln(1+epsilon(pts[i],m,b)^2/2),i=1..nops(pts));

Here's what the functional we want to minimize looks like:

> plot3d(R(m,b,pts),m=-5..0,b=-20..0,style=patchcontour, axes=boxed);

The equations we want to solve are not linear. In fact, they're messy enough that Maple needs some help to find the minimum. By examining the plot and zooming in on the minimum, we can choose an appropriate region.

> plot3d(R(m,b,pts),m=-2..-1.5,b=3..5,view=38..50, axes=boxed);

Once we have an appropriate region, we can ask Maple to look for the solution numerically using fsolve.

> rs:=fsolve( diff(R(m,b,pts),m)=0, diff(R(m,b,pts),b)=0, m,b,m=-2..-1.5, b=3..5);

And here we can compare the two lines found.

> p := plot(pts,style=point): l := plot(subs(sol,m*x+b),x=-10..10): s := plot(subs(rs,m*x+b),x=-10..10,color=blue): plots[display](p,l,s,view=-20..20);

Let us now try to minimize
||. Since this function is
not even differentiable near
= 0, we have to work harder
to get less.

> A:=(m,b,pts)->sum(abs(epsilon(pts[i],m,b)),i=1..nops(pts));

Displaying the graph of this function gives us an idea about where its minimum lies:

> plot3d(A(m,b,pts),m=-5..0,b=-10..10,style=patchcontour, axes=boxed);

> plot3d(A(m,b,pts),m=-1.0875..-1.0872,b=3.817..3.8178, axes=boxed);

This leads us to a choice of
*m* = - 1.80733, *b* = 3.8174 as our ``best''
line. This is almost (but not quite) the same line found using our distance
*R*(*m*, *b*).

2002-08-29