Scan for Lines

The following exchange appeared in 1998 on the usenet group comp.lang.apl (reformatted for readability).

The problem

From Leonard Howell <leonard.howell@msfc.nasa.gov

Here is a interesting problem that perhaps someone has solved before or has an idea of how to program the solution. Given a square array (NxN) filled with zeros and ones (in practice, there are many more zeros than ones), what is the distribution of patterns of size 1, 2, 3, ...., where a pattern is defined any string of adjacent 1's, either horizontally or /and vertically distributed, and the number of "singlets" is also of interest To simply the problem, I'm ignoring diagonal connections. For example, in the following array, there are: 5 ones, 1 two, 1 three, 2 fours, and 1 six.

0 0 0 0 0 0 1 0
1 0 0 1 1 0 1 1
0 1 1 0 1 1 0 0
1 1 1 0 0 0 1 0
0 0 1 0 0 0 0 1
0 1 0 0 0 0 0 1
0 0 1 1 1 0 0 0
1 0 1 0 0 0 0 1

APL and FORTRAN solutions preferred, but any language appreciated.

Bill Huber's algorithm

From whuber <whuber@home.com>

Let's generalize and suppose there are M rows and N columns, with M >= N (you can always transpose the array to make this happen). There is a scanning algorithm that requires O(N) space and O(M*N) time, which is as good as it gets. The proof is by induction on the number of rows, M. The algorithm requires a data structure that maintains information about the distribution of patterns within the M X N array together with information about the patterns that touch its bottom row. For the induction step, one updates this structure according to what's in the next row.

Rather than be formal, I will proceed by example. In the 8 X 8 array presented, after scanning the first row I will represent the data structure thus:

0 0 0 0 0 0 [1] 0

where the brackets enclose the tip of the lone identified pattern. After scanning the second row, the data structure looks like this:

[1] 0 0 [2 2] 0 [3 3]

which echos the pattern of ones on the second row of the array, but also remembers the size of each pattern in which those ones participate. You should be able to see that creating this updated structure requires only the previous structure together with the new row, and that you only have to scan the new row two or three times (or once, if you're clever with pointers). At the third step we get:

0 [2 2] 0 [4 4] 0 0	| 1, 3

where now, on the right, we begun to output the complete patterns that have been identified. The full algorithm for this pattern then looks like:

0 0 0 0 0 0 [1] 0	
[1] 0 0 [2 2] 0 [3 3]	|
0 [2 2] 0 [4 4] 0 0	| 1, 3
[5 5 5] 0 0 0 [1] 0	| 4
0 0 [6] 0 0 0 0 [1]	| 1
0 [1] 0 0 0 0 0 [2]	| 6
0 0 [3 3 3] 0 0 0	| 1, 2
[1] 0 [4] 0 0 0 0 [1]	|
			| 1, 4, 1

The cumulative output, shown on the right, gives the desired result. (To get the last bit of output, consider processing an additional row of zeros.)

Actually, one of the most interesting aspects of the updating is not used in this example, so let's look at the following array:

1 0 1
1 1 1
1 0 0

The algorithm proceeds as follows:

[1] 0 [1]
[5 5 5]		|
[6] 0 0		|
		| 6

The interesting thing happened at the second step, where a row of three ones overlapped two patterns. They were merged. Overall, though, it should still be easy to see that the updating can occur in one pass, left to right across each row, and that the data structure can't be any bigger than a constant times the row size.

Here's another amusing example:

which is processed in this way:

[4 4 4 4]
[6 0 0 6]	|
[9 9 0 9]	|
[10] 0 [1] 0	|
		| 10, 1

Here, in the second row, separate runs of 1s were connected to the same pattern. In general, adjacent pattern tips can merge during the updating, pattern tips can disappear (and cause some output), and new patterns can appear.

The trickiest part of all this is demonstrating that the data structure can be represented succinctly as a disjoint series of intervals: in other words, that you can't have any interleaving of the tips of two patterns. But this is clear, by a discrete analog of the Jordan curve theorem, that states if you find a 1 that is the tip of one pattern, and another 1 further on in the same row that is the tip of the same pattern, then any intervening 1s must belong to the same pattern, even if they are separated along that row by 0s. The previous example illustrates this.

This scanning algorithm clearly lends itself to a FORTRAN implementation (which would actually scan left to right) rather than an APL one. I leave the details to an intrepid programmer. Of course, since I've been so informal (perhaps cryptic), it's also possible I've been wrong... there's nothing like the rigor of writing a proof (or a working program).

Implementation

Here's a K solution:

There are two patterns here: a singleton at a[3;2], and a 10-ton around the left-top-right. We don't count diagonal connections in this
problem.

  span:{:[0=#b:&x;b;(&1,~{y=-1+x}':b)_ b]}
  blob:{{(x;#x)}'span x}
  meet:{|/x[0]_lin*y}
  fuse:{(x[0],*y;x[1]+y 1)}

  pair:{
   v:*x;x:x 1
   r:(x;y)step/!#x
   x:*r;y:r 1
   if[0<#x;v[x[&x[;1]>0;1]]+:1]
   (v;y)}

  step:{
   if[0<#j:&x[0;y]meet/:x 1
    x[1]:_di[x 1;j],,@[fuse/x[1;j];1;+;x[0;y;1]]
    x[0;y;1]:0]
   x}

  count:{tally@*pair/@[blob'x,,&#*x;0;{(y;x)};&*/^x]}

  tally:{+(i;x i:&x>0)}

  count a
(1 1
 10 1)

That is, one 1, one 10.

Explanation

The 'count' function is: tally the first of pair over: amend the blob-lists at 0 with the pair (tally-vector;blob-list 0)

The blobs in a boolean vector are represented as a list of pairs. For example,

  blob 1 0 0 1 1 1 0 1 1
((,0;1)
 (3 4 5;3)
 (7 8;;2))

The first item in each blob is a vector of the indices spanned by the blob; the second item is the count of that vector. This seems redundant, but wait.

Two blobs meet when their extents overlap, and a pair of blobs are merged by joining their extents and summing their counts (try it).

Our goal is to scan the matrix and report the frequencies of different size blobs. So the arguments to 'pair' are:

  (v;x)		y

That is, the first argument is initially:

  (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;(0 1 2 3;4))

where the first item is a vector of the tallies for blobs up to the maximum size, which is */^matrix, and the second item is the blob-list for the first row of the matrix.

'pair' takes (v;x) and y, and returns (v';y'), where v' has accumulated information about which patterns have terminated in x, and y' is the result of applying the blobs in x to those in y.

So that's the first problem: because of the way that / works, everything has to be crammed into the first argument, which then requires 'pair' to pick it apart.

Within 'pair', we encounter something similar. The idea is that we will step through the x blobs, finding for each x blob those y blobs which it meets. We take those y blobs and fuse them, adding the new blob to the y-list, and removing its children from the y-list. Again, due to the structure of /, we invoke 'step' this way:

  (x;y)step/!#x

The result-argument of 'step' is a pair, and we process it #x times, using indices !#x to pick out blobs in the x list. We also modify the x list in the course of processing: when an x blob meets some y blob, it has met every y blob it can meet, so we zero out its count. (After stepping through all x blobs, the ones with count > 0 have terminated -- they meet no y blobs -- so we tally those in the 'pair' function).

An argument list containing patterns would significantly simplify this code:

pair:{[(v;x);y] ...}
step:{[(x;y);z] ...}

Redux

Turn each row of the matrix X (with a row of 0s appended) into a list of blobs.

Each blob is initialized to a pair consisting of a span (the indices of the row 
to which it belongs) and a count (the length of the span).

Initialize the tally vector T to M x N 0s.

Then do:

For each row X i, 0<=i<#X do:{

  For each blob b in X i do:{

     Find the blobs c ... d in X i+1 which b meets.

     Join c ... d into a new blob e by joining the spans of c ... d,
     and set the count of e to count(b)++/count EACH c .. d.

     Eliminate c ... d from the blob list for X i+1.

     Append e to the blob list for X i+1

     Set the count of b to 0.}

  For each blob bb in b whose count is > 0, increment T[count bb] by 1}

The algorithm operates on overlapping pairs of blob lists, accumulating information in two places: the second blob list and the tally vector.

So you need to define a two-place predicate

blob1 meet blob2

and a two-place function

blob:blob1 join blob2.

The rest is simply a control framework for processing pairs of rows and the tally vector.

Allow for diagonal connections between blobs by modifying the 'meet' predicate (e.g. in my implemenation, by extending the left blob by one bit around every span-segment in the blob).

Questions: What range of pattern-detection problems can be solved by modifying only the meet and join operations, and the accumulating object T and its logic? What range allowing for two passes?

A solution based on UNION-FIND

Here's a solution developed jointly by Greg Heil and Arthur Whitney based on the UNION-FIND algorithm:

u:{@[x;y;:;&/y:x/y]}  			/ connect tops
v:{@[x;y;:;&/*|y:x\y]}			/ connect all nodes
w:{@[x;y;:;&/(*|:)'y:x\'y]}		/ v, but do the paths separately

ab:{(0;#*x)+/:&,/&':x}			/ above
lf:{-1 0+/:&,/(0&':)'x}			/ left
eq:{-1+(+\,/x)lf[x],ab x}		/ renumber equations 
h:{@[&1+|/x;x;+;1]}			/ histogram

g:{[f;x]1_ h h{x/x}(!+//x)f/eq x}	/ apply algorithm f to x

m:(0 0 0 0 0 0 1 0
   1 0 0 1 1 0 1 1
   0 1 1 0 1 1 0 0
   1 1 1 0 0 0 1 0
   0 0 1 0 0 0 0 1
   0 1 0 0 0 0 0 1
   0 0 1 1 1 0 0 0
   1 0 1 0 0 0 0 1)

g[u]m
g[v]m
g[w]m