Vector Excel


Note

The work described in this paper (and the associated .k script) is several years old, and could certainly profit from a rewrite. In particular, the standalone precedence parser should replace the earlier version used in the script.

Introduction

We envision the analyst using Excel to develop a calculation model, where the data in the spreadsheet at any point is inessential to the model. The spreadsheet is a repository for formulas, relations between formulas, and relations between formulas and data-cells. A separate database will function as a primary source for data-values (typically, many of them) with which the calculation model will be populated at run-time. We will think of the spreadsheet as a function, and the procedure, conceptually, will consist of repeatedly populating the data-cells of the sheet with scalar data read from a database, calculating the formula-cells, and then extracting the results and writing them out to the database:

	i:0
	do[n
	 d:read[i]
	 s:populate[s;d]
	 s:calculate[s]
	 r:extract[s]
	 write[r]
	 i+:1]

Instead, imagine that each data-cell of the spreadsheet could be populated with a vector of values, and the formulas interpreted to apply to vectors or scalars. Then our algorithm would be:

	d:read[]
	s:populate[s;d]
	s:calculate[s]
	r:extract[s]
	write[r]

Details

From K's point of view, an Excel spreadsheet is a matrix, every cell of which is either null, a value, or a formula. Suppose we had the following sheet S:

  A B C
1 100 200 =A1+B1
2 300 400 =A2+B2
3 =A1+A2 =B1+B2 =C1+C2

The C column contains row totals, and row 3 contains column totals. C3 is the grand total.

Now suppose that we had, in a database somewhere, a table T with the following contents:

Sheet Col Row Value
1 A 1 100
1 B 1 200
1 A 2 300
1 B 2 400
2 A 2 101
: : : :

We imagine writing a loop over the distinct values of T.Sheet. On each iteration, select four rows from T and plug the T.Values into the appropriate T.Rows and T.Cols of S. Then let S recalculate. Then retrieve the results from S and insert them into the table R:

Sheet Col Row Value
1 A 3 400
1 B 3 600
1 C 1 300
1 C 2 700
1 C 3 1000
: : : :

Suppose that T contains the input data for a large number of sheets, say a million. Then clearly this procedure is very inefficient. Instead, imagine the input-cells of S to be vectors. So, for example, A1 is a vector of length one million, beginning 100 101 102 103 .. .

Now, suppose that we could do the following: represent S as a matrix M, each cell of which is either null, a vector, a string, or a K vector expression which translates the corresponding excel scalar formula. Then we could use the following algorithm to calculate all million sheets in parallel: Calculate each formula-cell of M.

What is involved in calculating a formula-cell?

Every formula in M "bottoms out" referring to constants or to other cells. For example, A3 contains a formula which refers to A1 and A2, which are input-cells. But C3 contains a formula which refers to other formula-cells, C1 and C2. So, when calculating a cell c, if an uncalculated formula-cell d is reached, then calculate d, store the result in d, and return the result.

The result is a matrix, where each cell is either null or a vector.

Analysis

Every formula in M has the form

= ... expression ...

where expression is built from cell-references and constants by a small number of basic operators:

+ - * / ^ = < > <> <= >= AND OR MIN MAX NOT SUM AVG IF

Cell-references have one of the forms:

RxCy 				/ row x, col y
RxCy:RzCw 			/ range: row x, col y through row z, col w
sheet!cell-reference 		/ cell(s) in some other sheet

Square brackets are relative-references:

R[10]C[-5] 			/ 10 rows down, 5 columns left

'R' and 'C' by themselves refer to current row and column:

RC3 				/ this row, column 3
R 				/ this row

Additionally, function-operators such as AND have fixed-valence, their arguments separated by commas:

AND(1,0)

Within the scope of list-operators such as SUM, comma and space are the operators UNION and INTERSECTION, which take cells or cell-ranges as arguments, and return cell-ranges as results:

SUM(R1C1:R5C5,R20C20:R25C25)

The conditional IF is result-returning (like :[] in K), and has three arguments:

IF(boolean-expression,true-expression,false-expression)

Implementation

The parser/evaluator is driven by data contained in the following global variables:

 D 		digits
 F		primitive functions
 C 		token classes
 M 		finite state machine (used to tokenize formulas)
 I 		map from machine-state classes to token-classes
 O 		operator table
 E 		error table
 G 		functions vs aggregators (e.g. AND vs SUM)
 V 		operator valences (e.g. '+' is valence 2)
 P 		parse table (stack vs. input)
 K 		operator translations from Excel to K
 R 		reference operators (e.g. RANGE)
 T 		trace of parse (transient, showable)
.W		dictionary of spreadsheet matrices (read from Excel)

The entry point is the function:

calc 		evaluate each formula cell in each spreadsheet in .W.

Other functions:

token 		convert a formula string into a list of tokens
parse 		convert a token list into a parse tree

To tokenize a formula, I use M/ (index-over = pointer-chasing) and a small state-transition table. I think it is possible to do tokenizing using ONLY a state-transition table, but this requires too many states. Consequently, I use this method to bust up a formula into alphanumeric constants, strings of operator symbols, and sequences within quotes and brackets. I then perform several post- processing steps on the resulting list of strings to find unary minus, negative constants, and multi-token symbolic operators (e.g. '<=').

Although K supports several standard iterators (do and while), I've required them in only one place; namely, to resolve the ambiguity of Excel's comma. Within the scope of an aggregator, comma means Union; for example, Sum(x,y,z) means: Sum the values in the union of the ranges x, y, and z. Sum has valence 1. But, within the scope of a function, comma operates as an argument separator; for example, If(x,y,z) has valence 3, and means: if x is true, then y, else z. So far, I have been unable to find a non-looping method for determining which commas are which. (The loop occurs in the function 'sep', at tokenizing-time.) Otherwise, tokenizing, parsing, and evaluation are computed using only array primitives.

Example

Given an Excel workbook with sheets 'one' and 'two', the following VB module loads the excel.k script, constructs dictionaries .W.one and .W.two, and parses and evaluates the Excel formula-cells:

Declare Function k Lib "k20" (s, ParamArray a())
Sub kload()
    k "\l excel.k"
    k "set", "one", Worksheets("one").UsedRange.FormulaR1C1Local, Worksheets("one").UsedRange
    k "set", "two", Worksheets("two").UsedRange.FormulaR1C1Local, Worksheets("two").UsedRange
    k "calc[]"
End Sub

Before parsing and evaluation, .W might look like this:

  .W
.((`one
   ((10.0;20.0;30.0;"=SUM(RC[-3]:RC[-1])")
    (40.0;50.0;60.0;"=SUM(RC[-3]:RC[-1])")
    ("=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"))
   )
  (`two
   ((10.0;20.0;30.0;"=SUM(RC[-3]:RC[-1])")
    (40.0;50.0;60.0;"=SUM(RC[-3]:RC[-1])")
    ("=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"))
   ))

Now we run 'calc' and examine .W:

  calc[]
  .W
.((`one
   (10 20 30 60.0
    40 50 60 150.0
    50 70 90 210.0)
   )
  (`two
   (10 20 30 60.0
    40 50 60 150.0
    50 70 90 210.0)
   ))

The function 'test' replaces the value of each numeric cell with a randomly-generated vector of count n:

  test 3
  .W
.((`one
   ((83 73 40.0
     38 14 48.0
     38 4 81.0
     "=SUM(RC[-3]:RC[-1])")
    (30 82 38.0
     43 33 11.0
     96 82 81.0
     "=SUM(RC[-3]:RC[-1])")
    ("=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"))
   )
  (`two
   ((81 9 99.0
     46 67 59.0
     46 85 81.0
     "=SUM(RC[-3]:RC[-1])")
    (78 71 37.0
     65 66 3.0
     42 34 71.0
     "=SUM(RC[-3]:RC[-1])")
    ("=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"
     "=SUM(R[-2]C:R[-1]C)"))
   ))

Running 'calc' now produces:

  calc[]
  .W
.((`one
   ((83 73 40.0
     38 14 48.0
     38 4 81.0
     159 91 169.0)
    (30 82 38.0
     43 33 11.0
     96 82 81.0
     169 197 130.0)
    (113 155 78.0
     81 47 59.0
     134 86 162.0
     328 288 299.0))
   )
  (`two
   ((81 9 99.0
     46 67 59.0
     46 85 81.0
     173 161 239.0)
    (78 71 37.0
     65 66 3.0
     42 34 71.0
     185 171 111.0)
    (159 80 136.0
     111 133 62.0
     88 119 152.0
     358 332 350.0))
   ))