The Phonecodes Challenge

Problem

See http://www.ipd.uka.de/~prechelt/phonecode/ for the task description.

I've mirrored Lutz Prechelt's paper (PDF alert!), which reports the results of the experiment, here.

A recursive K solution

My final recursive K solution is here. Previous versions were either buggy or failed to turn in acceptable performance (approximately 3 minutes to process the w1000 test data on a Pentium 2, 300 mhz machine.)

The current version requires a few seconds of setup-time for the benchmark datasets, but succeeds in processing the w1000 test data in 2.1 seconds, an 80-fold improvement over my first efforts.

Run the script by saying:

k phone

The function 'codes' is applied to the small test data-set, so you should then see the result in the console:

d:\k>k phone
K 2.95t 2003-06-25 Copyright (C) 1993-2003 Kx Systems

5624-82: mir Tor
5624-82: Mix Tor
4824: fort
4824: Tor 4
4824: Torf
10/783--5: je Bo" da
10/783--5: je bo"s 5
10/783--5: neu o"d 5
381482: so 1 Tor
04824: 0 fort
04824: 0 Tor 4
04824: 0 Torf

To run the script on the w1000 data, download and unzip this file to the same directory containing the phone.k script and execute the following lines in the console:

W:0:"w.txt"
N:0:"n.txt"
O:0:"o.txt"
setup[]
\t R:,/codes'N

The algorithm

There are three input data-structures:

W is a list of words:

  W
("an"
 "Bo\""
 "da"
 "je"
 "o\"d"
 "so"
 "blau"
 "Boot"
 "fern"
 "Fest"
 "fort"
 "Name"
 "Torf"
 "bo\"s"
 "Fee"
 "mir"
 "Mix"
 "neu"
 "Ort"
 "Tor")
 "jemand"
 "Wasser"
 "Mixer")

N is a list of phone codes:

  N
("112"
 "5624-82"
 "4824"
 "0721/608-4067"
 "10/783--5"
 "1078-913-5"
 "381482"
 "04824")

M maps characters to integers:

M:("ejnqrwxdsyftamcivbkulopghz";"01112223334455666777888999")

The 'setup' function

setup:{D@:p:=C::#:'D::digit'W;W@:p;C?:;P::=:'D;H::_hash'D::D@'P[;;0]}
digit:{M[1;M[0]?/:_dv[(_ci@[!256;65+!26;+;32])@_ic x;"\""]]}

partitions W by count, and derives "working" variables C, D, P, and H:

  W
(("an"
  "Bo\""
  "da"
  "je"
  "o\"d"
  "so")
 ("blau"
  "Boot"
  "fern"
  "Fest"
  "fort"
  "Name"
  "Torf")
 ("bo\"s"
  "Fee"
  "mir"
  "Mix"
  "neu"
  "Ort"
  "Tor")
 ("jemand"
  "Wasser")
 ,"Mixer")

D is a list whose elements are strings of digits which code corresponding elements of the count-partitioned word-list W. Since W i may contain words which map to the same digit-string, each partition of D is "uniqued", removing duplicates:

   D
(("51"
  "78"
  "35"
  "10"
  "83"
  "38")
 ("7857"
  "7884"
  "4021"
  "4034"
  "4824"
  "1550")
 ("783"
  "400"
  "562"
  "107"
  "824"
  "482")
 ("105513"
  "253302")
 ,"56202")

and a partition vector P is created, which maps each digit-string in D i to one or more words in W i:

  P
((,0
  ,1
  ,2
  ,3
  ,4
  ,5)
 (,0
  ,1
  ,2
  ,3
  4 6
  ,5)
 (,0
  ,1
  2 3
  ,4
  ,5
  ,6)
 (,0
  ,1)
 ,,0)

C is a vector of counts: C i = the count of each word in D i.

  C
2 4 3 6 5

H is a list of hashes into lists of D:

  H
((..)
 (..)
 (..)
 (..)
 (..))

The function 'codes' is applied to each element of N. The result is a list, each of whose results is a string of the form "phone: code":

codes:{x,/:match[1;":";x _dvl"/-"]}
match:{[b;m;n]:[~#n;,m;#r:,/find[m;n]'[D;W;C;P;H];r;b;_f[0;m," ",*n;1_ n];r]}
find:{[m;n;d;w;c;p;h]:[c>#n;();(#d)>i:(d;h)?c#n;,/w[p i]{m," ",x,y}/:\:match[1;"";c _ n];()]}

Here's how it works.

Say x is "10/783--5". In 'codes', we apply 'match', whose arguments are b, m, and n. b is either 0 or 1, and indicates whether we are allowed to substitute a digit in case we find no matches; m is the prefix so far accumulated in the scan; and n is the remainder of the phone number to be processed. Initially, we delete "-" and "/" from the input phone number ("107835"), initialize the prefix to ":", and set b to 1.

The 'match' function will recurse until either the input phone number is empty, in which case we return m (success) or no matches have been found, in which case we return the empty list () (failure).

The 'find' function is called on each paritition of D, W, C, P, and H. It tries to find an initial substring of n of length k in that partition of D consisting of length k strings. If it finds one (and it will find either none or exactly one), it calls 'match', setting b to 1, appending the corresponding word in W to m, dropping k leading characters from n. If it does not find a match, it returns ().

If no match is found, then, if b is 1, we append n[0] to the prefix and continue hunting matches on 1_ n, else we return r, which is ().

But is the benchmark result correct?

None of my K solutions succeeded in matching the benchmark output data. Specifically,

  W:0:"w.txt"
  N:0:"n.txt"
  O:0:"o.txt"
  setup[]
  \t R:,/codes'N
2213
  #O
262
  #R  
413
  &/O _lin R
1

i.e. the K solution finds mappings which are not in the benchmark result (but not vice versa.) For example, the mapping:

  *R _dvl O
"1556/0: 1 Mai 0"

In fact, the benchark data contains no encodings for the phone number 1556/0:

  |/(7#'O)~\:"1556/0:"
0

Yet, as we can confirm manually, there is one:

  codes"1556/0"
,"1556/0: 1 Mai 0"

Is there an array solution?

Can we solve this problem non-recursively? Perhaps.

First, sort the globals by C:

C@:i:<C;D@:i;W@:i;H@:i;P@:i

Next, for a given input phone number n, generate all substrings of length k = 2 ... #n:

sstr:{[n;k]1_*+(1+(#n)-k){(k#;1_)@\:x 1}\("";n)}
x:N 4
k:#s:x sstr/:2+!-1+#x:x _dvl"/-"
  x
"107835"
  s
(("10"
  "07"
  "78"
  "83"
  "35")
 ("107"
  "078"
  "783"
  "835")
 ("1078"
  "0783"
  "7835")
 ("10783"
  "07835")
 ,"107835")

Next, find the index of each match in the corresponding D list:

j:&:'(#:'D)>i:{(x;y)?/:z}'[k#D;k#H;s]
  i
(3 6 1 4 2
 3 6 0 6
 6 6 6
 1 1
 ,2)
  j
(0 2 3 4
 0 2
 !0
 !0
 !0)

Matches found:

  s@'j
(("10"
  "78"
  "83"
  "35")
 ("107"
  "783")
 ()
 ()
 ())

This takes care of all the necessary "heavy lifting".

We know that the three encodings for 10783 are:

je Bo" da = 10 78 35
je bo"s 5 = 10 783 5
neu o"d 5 = 107 83 5

where the digits in red are drawn from the input number.

Note

After completing the work described above, I discovered this paper by Lisp programmer Peter Norvig, whose algorithm closely resembles the one I implement. It appears that Norvig managed to complete his implementation in under three hours (comparing implementation time across languages was one focus of Prechelt's investigation), whereas I required five or six hours (and a failed initial design) over the course of several days.

The Lisp code is impressively terse, and has good performance, whereas this Haskell solution is surprisingly verbose.

Also, see how a C++ programmer takes up the challenge after studying Norvig's Lisp code: http://userpages.umbc.edu/~bcorfm1/C++-vs-Lisp.html.