# Vector Excel

## Note

The work described in this paper (and the associated .k script) is several years old, and could certainly profit from a rewrite. In particular, the standalone precedence parser should replace the earlier version used in the script.

## Introduction

We envision the analyst using Excel to develop a calculation model, where the data in the spreadsheet at any point is inessential to the model. The spreadsheet is a repository for formulas, relations between formulas, and relations between formulas and data-cells. A separate database will function as a primary source for data-values (typically, many of them) with which the calculation model will be populated at run-time. We will think of the spreadsheet as a function, and the procedure, conceptually, will consist of repeatedly populating the data-cells of the sheet with scalar data read from a database, calculating the formula-cells, and then extracting the results and writing them out to the database:

```	i:0
do[n
s:populate[s;d]
s:calculate[s]
r:extract[s]
write[r]
i+:1]```

Instead, imagine that each data-cell of the spreadsheet could be populated with a vector of values, and the formulas interpreted to apply to vectors or scalars. Then our algorithm would be:

```	d:read[]
s:populate[s;d]
s:calculate[s]
r:extract[s]
write[r]```

## Details

From K's point of view, an Excel spreadsheet is a matrix, every cell of which is either null, a value, or a formula. Suppose we had the following sheet S:

 A B C 1 100 200 =A1+B1 2 300 400 =A2+B2 3 =A1+A2 =B1+B2 =C1+C2

The C column contains row totals, and row 3 contains column totals. C3 is the grand total.

Now suppose that we had, in a database somewhere, a table T with the following contents:

 Sheet Col Row Value 1 A 1 100 1 B 1 200 1 A 2 300 1 B 2 400 2 A 2 101 : : : :

We imagine writing a loop over the distinct values of T.Sheet. On each iteration, select four rows from T and plug the T.Values into the appropriate T.Rows and T.Cols of S. Then let S recalculate. Then retrieve the results from S and insert them into the table R:

 Sheet Col Row Value 1 A 3 400 1 B 3 600 1 C 1 300 1 C 2 700 1 C 3 1000 : : : :

Suppose that T contains the input data for a large number of sheets, say a million. Then clearly this procedure is very inefficient. Instead, imagine the input-cells of S to be vectors. So, for example, A1 is a vector of length one million, beginning 100 101 102 103 .. .

Now, suppose that we could do the following: represent S as a matrix M, each cell of which is either null, a vector, a string, or a K vector expression which translates the corresponding excel scalar formula. Then we could use the following algorithm to calculate all million sheets in parallel: Calculate each formula-cell of M.

What is involved in calculating a formula-cell?

Every formula in M "bottoms out" referring to constants or to other cells. For example, A3 contains a formula which refers to A1 and A2, which are input-cells. But C3 contains a formula which refers to other formula-cells, C1 and C2. So, when calculating a cell c, if an uncalculated formula-cell d is reached, then calculate d, store the result in d, and return the result.

The result is a matrix, where each cell is either null or a vector.

## Analysis

Every formula in M has the form

`= ... expression ...`

where expression is built from cell-references and constants by a small number of basic operators:

`+ - * / ^ = < > <> <= >= AND OR MIN MAX NOT SUM AVG IF`

Cell-references have one of the forms:

```RxCy 				/ row x, col y
RxCy:RzCw 			/ range: row x, col y through row z, col w
sheet!cell-reference 		/ cell(s) in some other sheet```

Square brackets are relative-references:

`RC[-5] 			/ 10 rows down, 5 columns left`

'R' and 'C' by themselves refer to current row and column:

```RC3 				/ this row, column 3
R 				/ this row```

Additionally, function-operators such as AND have fixed-valence, their arguments separated by commas:

`AND(1,0)`

Within the scope of list-operators such as SUM, comma and space are the operators UNION and INTERSECTION, which take cells or cell-ranges as arguments, and return cell-ranges as results:

`SUM(R1C1:R5C5,R20C20:R25C25)`

The conditional IF is result-returning (like :[] in K), and has three arguments:

`IF(boolean-expression,true-expression,false-expression)`

## Implementation

The parser/evaluator is driven by data contained in the following global variables:

``` D 		digits
F		primitive functions
C 		token classes
M 		finite state machine (used to tokenize formulas)
I 		map from machine-state classes to token-classes
O 		operator table
E 		error table
G 		functions vs aggregators (e.g. AND vs SUM)
V 		operator valences (e.g. '+' is valence 2)
P 		parse table (stack vs. input)
K 		operator translations from Excel to K
R 		reference operators (e.g. RANGE)
T 		trace of parse (transient, showable)```
`.W		dictionary of spreadsheet matrices (read from Excel)`

The entry point is the function:

`calc 		evaluate each formula cell in each spreadsheet in .W.`

Other functions:

```token 		convert a formula string into a list of tokens
parse 		convert a token list into a parse tree```

To tokenize a formula, I use M/ (index-over = pointer-chasing) and a small state-transition table. I think it is possible to do tokenizing using ONLY a state-transition table, but this requires too many states. Consequently, I use this method to bust up a formula into alphanumeric constants, strings of operator symbols, and sequences within quotes and brackets. I then perform several post- processing steps on the resulting list of strings to find unary minus, negative constants, and multi-token symbolic operators (e.g. '<=').

Although K supports several standard iterators (do and while), I've required them in only one place; namely, to resolve the ambiguity of Excel's comma. Within the scope of an aggregator, comma means Union; for example, Sum(x,y,z) means: Sum the values in the union of the ranges x, y, and z. Sum has valence 1. But, within the scope of a function, comma operates as an argument separator; for example, If(x,y,z) has valence 3, and means: if x is true, then y, else z. So far, I have been unable to find a non-looping method for determining which commas are which. (The loop occurs in the function 'sep', at tokenizing-time.) Otherwise, tokenizing, parsing, and evaluation are computed using only array primitives.

## Example

Given an Excel workbook with sheets 'one' and 'two', the following VB module loads the excel.k script, constructs dictionaries .W.one and .W.two, and parses and evaluates the Excel formula-cells:

`Declare Function k Lib "k20" (s, ParamArray a())`
```Sub kload()
k "\l excel.k"
k "set", "one", Worksheets("one").UsedRange.FormulaR1C1Local, Worksheets("one").UsedRange
k "set", "two", Worksheets("two").UsedRange.FormulaR1C1Local, Worksheets("two").UsedRange
k "calc[]"
End Sub```

Before parsing and evaluation, .W might look like this:

```  .W
.((`one
((10.0;20.0;30.0;"=SUM(RC[-3]:RC[-1])")
(40.0;50.0;60.0;"=SUM(RC[-3]:RC[-1])")
("=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"))
)
(`two
((10.0;20.0;30.0;"=SUM(RC[-3]:RC[-1])")
(40.0;50.0;60.0;"=SUM(RC[-3]:RC[-1])")
("=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"))
))```

Now we run 'calc' and examine .W:

```  calc[]
.W
.((`one
(10 20 30 60.0
40 50 60 150.0
50 70 90 210.0)
)
(`two
(10 20 30 60.0
40 50 60 150.0
50 70 90 210.0)
))```

The function 'test' replaces the value of each numeric cell with a randomly-generated vector of count n:

```  test 3
.W
.((`one
((83 73 40.0
38 14 48.0
38 4 81.0
"=SUM(RC[-3]:RC[-1])")
(30 82 38.0
43 33 11.0
96 82 81.0
"=SUM(RC[-3]:RC[-1])")
("=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"))
)
(`two
((81 9 99.0
46 67 59.0
46 85 81.0
"=SUM(RC[-3]:RC[-1])")
(78 71 37.0
65 66 3.0
42 34 71.0
"=SUM(RC[-3]:RC[-1])")
("=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"
"=SUM(R[-2]C:R[-1]C)"))
))```

Running 'calc' now produces:

```  calc[]
.W
.((`one
((83 73 40.0
38 14 48.0
38 4 81.0
159 91 169.0)
(30 82 38.0
43 33 11.0
96 82 81.0
169 197 130.0)
(113 155 78.0
81 47 59.0
134 86 162.0
328 288 299.0))
)
(`two
((81 9 99.0
46 67 59.0
46 85 81.0
173 161 239.0)
(78 71 37.0
65 66 3.0
42 34 71.0
185 171 111.0)
(159 80 136.0
111 133 62.0
88 119 152.0
358 332 350.0))
))```