PROGRAMMING
The Writing of Opus
by Doug Harrison
Last month, ST-Log contained a full-featured spreadsheet program named Opus. The article that follows is based on that program and explains some of the programming techniques used in the development of the program. Because of the immense size of Opus's source code, it is impossible for us to offer it on the monthly disk. If you would like a copy of the source code, you may download it from the ST SIG on DELPHI. Those of you without access to DELPHI may obtain a copy by sending your original September or October ST-Log disk (make a copy of the disk first; it'll be returned to you with only the Opus source code on it) to: ST-Log, P.O. Box 1413-M.O., Manchester, CT 06040-1413. Note: You must include a self-addressed, stamped mailer with your request!
As you would expect, Opus is a very large and somewhat complex program, consisting of 24 source files totaling over 500,000 bytes and 10,000 lines of code. If you should like to dabble about with the program, you will need Personal Pascal v. 2.01, a megabyte of memory—and probably a hard drive to make compilation practical. You will also need a resource construction program to alter the resource file and hence the menu and dialogs.
In the course of what follows, I will cite procedures by name. Note also that I have implemented many BIOS, XBIOS, GEMDOS, VDI and AES calls not found in "straight" Personal Pascal; you will need references to these, and I've listed mine at the end of the article. Space restrictions preclude any detailed discussion of the inner workings of Opus, so I'll just consider several of the more interesting routines and techniques that could also have implications for other projects.
General
Like most properly written GEM applications, Opus is event-driven; that is, the program essentially consists of a loop containing a single Get__Event call (in C, evnt__multi), and specifically, it waits for keypresses, messages and mouse clicks. The main module, OPUS.PAS, contains a loop that calls window__input, a general purpose minitext editor containing the Get__Event call. Window__input echoes typed information to the edit area and exits upon certain keypresses, messages or mouse-clicks, returning inp__code. Next, evaluate__input is called, and this routine sorts through the various inp__code possibilities and calls the appropriate procedure, be it assign, handle__message, mouse, evaluate__formula or whatever, and once the designated action is completed, control is returned to window__input, and the cycle begins anew.
Window__input also contains the code necessary to control and animate the function menu; briefly, I have used the AES Wind__Set call with the value 14, WF__NewDesk, to redefine the default object tree for the Desktop to draw as its background. My control panel is part of this object tree, and the menu pulldowns have their obj__flags field set to Hide__Tree; so they will not be drawn and the AES will not register mouse clicks on the objects.
When the user clicks on the mouse, Obj__Find is called to determine if he clicked on one of the menu items. If so, I mask out the relevant pulldown's Hide_Tree flag, get the size of the pulldown via Obj__Size, blit the background to my 32K blit buffer, and call Obj_Draw, the AES routine which draws object trees (which is precisely what a dialog is and what Do__Dialog calls). I then enter a loop where Graf_MKState (another AES call) tracks the mouse position, and Obj__Find returns the object currently under the mouse pointer, which is limited to those in the pulldown.
If Obj__Find returns a number other than -1, I know I need to use Obj_SetState to set the "Selected" bit in the obj__state field of the object, and I may also need to clear this bit in a previously selected object. And when Graf_MKState indicates the mouse button is up, the loop terminates, and the chosen item is the one under the mouse pointer at that time. To "clean up," I simply reset the Hide__Tree flag and blit the background back. Then it is a simple matter to enter a CASE construct where I check the chosen object and enter its character representation into the edit area.
The nice thing about redefining the Desktop's background is that the Desktop redraws the tree for me whenever it receives a redraw message. (One less thing to worry about!) One problem with this that is not documented in any of my references is the apparent lack of a GEM call to inform the Desktop it needs to use its own, original background when the program terminates. If you don't do this, GEM thinks your object tree is still available (and it's not), but as long as you return to the Desktop, all is well.
However, should you run such a program from a shell, such as the Personal Pascal manager, the Desktop does not get informed of the change when your program ends, and running another program tends to crash the machine as the Desktop attempts to draw a tree whose data structures have been corrupted. The solution? Just before terminating your program, call Wind__Set again with WF__NewDesk, but pass it a longword zero for an address, rather than an explicit object tree address. This tells Desktop to draw the standard background and keeps everyone happy (my thanks to Charles F. Johnson for this tip).
FIGURE 1 TYPE DepPtr = ^DependentCells; Dependent Cells = PACKED RECORD r, c : INTEGER; { row and column } next : DepPtr; END; CellPtr = ^CellType; CellType = PACKED RECORD c : INTEGER; { column } format : INTEGER; class : ClassType; status : StatusType; num : REAL; str : ^LorFstr; { label or formula } sub : DepPtr; next : CellPtr; END; DataTable = ARRAY [0..n_rows] OF CellPtr; VAR data : DataTable; |
Data structures
I could have used a simple two-dimensional array of records as a data structure, which would have been exactly analogous to the spreadsheet itself, but this would have severely limited the spreadsheet size and would have made impossible the ability to take advantage of larger than standard memory configurations, as the array dimensions would have been fixed at some small size. In order to provide the large apparent spreadsheet size, I employed a sparse matrix design in which cells containing no data do not "exist" and thus consume no memory.
Essentially, a sparse matrix approach is appropriate when you need an array of very large dimensions that typically is very empty. A sparse matrix emulates such an array through the use of linked lists, a dynamic data structure (one that can grow and shrink) implemented through pointers. Examine Figure 1, which contains the TYPE declarations for the sparse matrix.
The variable data, then, is the main variable and it is a 0..999 element array of pointers to records defined by CellType, one element per spreadsheet row (row 0 is used as a buffer by various operations, such as MOVE, COPY, SORT, etc). Pointers contain memory addresses and are always four bytes long, so data requires only 4,000 bytes when the program is started. A record is created by calling the standard Pascal procedure NEW, as in NEW(data[l]), and this allocates space for the pointer type, in this case the record defined by CellType.
Counting up the memory required by each field of CellType shows that each allocated cell consumes 26 bytes. The new element is referenced as data[l] ⋀ .field. Now, when Opus detects data entry, it supplies the row and column numbers to new__cell, which in turn calls locate__cell, which returns the address of the cell if it already exists or NIL. New__cell returns the address of the cell, which is easy in the case of the former; if the cell did not exist, it allocates the cell with NEW if enough memory is available and then calls init__cell or returns NIL. Init__cell initializes the fields of the new cell, and it is very important to make sure any pointer fields contain either a valid address of the value NIL.
Note the "next"field in CellType is itself of type CellPtr; NEWing as in NEW(data[l] ⋀ .next) allocates a cell whose address is found in data[l] ⋀ .next and whose elements are referenced as data[l] ⋀ .next ⋀ field. This can be extrapolated indefinitely, and a linked list is created such that Element 1 points to Element 2 which points to Element 3 and so on.
Since we have a linked list for each row, we store the column numbers for each list member, and we insert cells so that the list is in order by column, we have in effect emulated a two-dimensional array. The end of a list is indicated by a NIL next. The only disadvantage is that we can't refer explicitly to the array elements, as in data[l,l]; rather, we must perform a list traversal to locate a cell, as follows:
FUNCTION LOCATE_CELL ( row, col : INTEGER ) : CellPtr; VAR found, passed : BOOLEAN; ptr : CellPtr; BEGIN ptr : = data[row]; { start at beginning of the list } found : = FALSE; passed : = FALSE; WHILE (ptr <> NIL) AND (NOT found) AND (NOT passed) DO IF ptr^.c = col THEN found : = TRUE { yea! } ELSE IF ptr^.c > col THEN { if we passed it up, we never } passed : = TRUE { will find it, since we store } ELSE { elements in order } ptr : = ptr^.next; { on to next element } IF found THEN locate_cell : = ptr ELSE locate_cell : = NIL END: { LOCATE_CELL }
Note that found is TRUE if and only if the cell exists. Of course, you must also write procedures to delete elements from the list, and you call the standard procedure DISPOSE to return the memory to the system, so that it may be reused.
Now, take a look at the field called Sub in CellType, which stands for subordinate. Sub is itself the first element in a linked list of dependent cells—that is, cells holding formulas that reference the cell. After entering a formula and evaluating it, all__lists is invoked to update all pertinent dependency lists, and it begins by calling scan__for__cells, which returns the position(s) of cell references within the formula. All__lists then calls translate__cell, which converts the cell reference to row and column numbers, and then it supplies list__insert with these values as well as the row and column number of the cell containing the formula.
List__insert then calls new__cell, supplying the row and column for the referenced cell, and that procedure creates the cell and returns its address or returns NIL, as explained previously. Now, list__insert (provided new__cell did not return NIL) either adds the row and column of the cell containing the formula to the referenced cell's dependent cell list or creates the list. Each entry requires eight bytes, as can be seen by summing the data sizes for the fields in DependentCells.
Now, after changing the cell's value, it is simple to traverse the dependent cell list and recalculate just those cells that are affected by the change, as Opus knows these cells by their row and column numbers. These cells may themselves have dependent cell lists, which also will be traversed. Hence, a change to a single cell may cause recalculation of any number of cells, but just those cells that need to be recalculated. Again, specific procedures are required to locate, create and delete list members and also find the end of the list.
The cell format is stored in the CellType field format, which is a word where the status of various bits indicate the various format options. For example, to test whether the cell is to be displayed in bold type, I say:
IF ptr^.format & bold_mask <> 0 THEN display in bold type.
The values for the various masks are found in GCTV.INC (Global Constants, Types and Variables), and some pertinent routines that illustrate their use are change__format, find__prec, find_just and draw__cell.
Now, since we know the amount of memory required by cells in various states and by dependent cell list entries, it is rather simple to keep track of available memory, so that we avoid the dreaded crash. Thus, Opus checks the global variable working__memory before performing a NEW action. After performing a DISPOSE, the amount of memory released is added back to working__memory. Free memory in Pascal is organized into the stack and the heap, and these two entities grow towards one another as local variables are allocated from the stack and dynamic or pointer variables are allocated from the heap. If these collide, you have a "stack overruns heap" crash, and so the importance of keeping track of memory allocation is obvious in a dynamic environment.
The function MemAvail is used to get the initial amount of free heap/stack space, and its use is only valid before allocation of any variables other than the global ones; I've experienced this, and OSS has verified it. My working__memory then is computed as follows:
original_memory : = MemAvail*2; { *2 to get bytes } working_memory : = original_memory-20000;
In effect, this reserves 20,000 bytes for the stack, as Opus will never allocate memory such that working__memory becomes negative. This is important, because some procedures need several Kilobytes for local variables and also because the expression evaluator uses recursion, which may mean "stacking" many local variables. Twenty kilobytes should be adequate almost all the time, but under some circumstances it isn't.
For example, consider a spreadsheet where cell Al has a dependent A2 which has a dependent A3 and so on for 1,000 cells. If automatic recalculation is in effect, evaluate__formula will call itself 1,000 times before it starts "returning" and thus deallocating space for all the local variables, 20 to 30 bytes per call. The recalculation in and of itself requires 20 to 30K just to operate, and if this much memory is not available, the ol' Fl key should be relabeled "Stack overuns heap—CRASH!" because that is precisely the effect pressing it will have.
It would be possible (but extremely tedious) to check memory before doing a recursive call, much like I do before allocating cells. However, it would be hard to say where to draw the line at checking procedure calls, since many are recursive, but some are not, and since this would so rarely be a problem as you would basically have to create a very contrived worksheet to provoke this kind of crash. Thus, I don't do it. However, if you really need to create a worksheet with the characteristics described above, or you simply are not sure, you can always turn off both automatic recalculation and natural order and preclude the possibility of this type of problem.
In summary, the main data structure is an array of linked lists, the members of which are cells of record type CellPtr, and each cell may contain a linked list for dependent cells. Sounds complicated, and it is, especially when it comes to deleting cells. For more information on pointers and linked lists, see the bibliography at the end of this article.
The expression evaluator
A common problem in programs such as Opus is the need to evaluate or parse mathematical expressions. We typically write these expressions in a notation known as infix; that is, the operators (+,-, etc) occur between the operands (numbers, etc). The computer, however, prefers postfix notation, where the operators follow the operands (1 + 2 infix = 1 2 + postfix). The evaluator must also provide for operator precedence, as discussed previously, so that 1 + 2*3 = 7 and not 9. Several parsing techniques have been devised, and I originally used one similar to that used by 8-bit Atari BASIC, described in the COMPUTE! book, The Atari BASIC Source Book.
However, over the course of several months, I became more and more dissatisfied with my parser, as the syntax checking and error recovery were less than ideal. Plus, it was quite difficult to understand, so much so that if I went more than a few days without looking at it, I would completely forget how it worked. So, I began searching through my old computer magazines and encountered an article in Computer Language, called "Parsing an Equation without Recursion" (June 1987). The author described a method quite similar to mine, and it was just as hard to understand. Fortunately, he had done some prior research and cited one reference to an article in BYTE.
A trip to the local library proved most profitable, as I discovered the method I would ultimately use in an article in the August 1985, BYTE, written by Jonathan Amsterdam and entitled "Context-Free Parsing of Arithmetic Expressions." Browsing through subsequent issues led me to, of all things, an article by the same author entitled "Build a Spreadsheet Program" (July 1986, BYTE), which was of further assistance, especially since he provided Modula 2 source code for a very minimal parser, which I found on BIX, the BYTE information network.
In these articles, Mr. Amsterdam described a recursive-descent parser based on a formal grammar, stated in Backus-Naur notation. This technique is also used by many compilers. You may also recall the syntax definitions in the original Personal Pascal manual were presented in this Backus-Naur formalism. I was immediately impressed with the simple elegance of the method and began work adapting it for my program.
Examine Figure 2, which contains my grammar definitions. As an example, the first line is read as: "Full expression is defined as being either a value expression or a boolean expression." Defining the grammar in this fashion both provides the desired operator precedence and suggests the algorithm. We create procedures entitled full__expr, val__expr, term and factor, all local to the main procedure evaluate__formula, and these procedures are more or less direct implementations of the grammar.
FIGURE 2 A. The terms on the left have cooresponding procedures <FullExpr> :: = <valexpr> | <boolexpr> <ValExpr> :: = <term> { <addop> <term> } <Term> :: = <factor> { <mulop> <factor> } <Factor> :: = real | <cellref> | <function call> | ( <fullexpr> ) | -<factor> { ^<factor> } <BoolExpr> :: = <valexpr> <boolop> <valexpr> B. Definitions <CellRange> :: = <cellref : cellref> <CellRef> :: = {$}A{$}1..{$}IU{$}999 <AddOp> :: = + | - <MulOp> :: = * | / <BoolOp> :: = > | < | > = | <= | <> | = ...plus the various functions with their argument requirements already stated. :: = means "is defined as." | means "or." { } means "optional, and nay be repeated". |
The Writing of Opus
Some of the important variables are str, the string to be parsed, str__pos, the position we are currently at within the string, and len, the length of the string. Obviously, we can parse as long as str__pos < = len, and I check for that before any attempt to access a character, typically as "str[str__pos]".
Now, evaluate__formula begins by calling full__expr, and full__expr immediately calls val__expr. Val__expr calls term, and term calls factor. Factor then decides whether the substring beginning at str__pos fits the pattern for a real number, a cell reference, etc., and calls the appropriate routine to convert a string to a real, decode a cell reference, and so on. If a potential function name is encountered, it calls a routine to check the function list (via binary search) to see if there is a match.
If an open parentheses is found, it calls full__expr, and immediately upon return checks for the required close parentheses. If factor successfully terminates, the variable result will hold the real-valued result of the evaluation. Then, as for any subprogram, control returns to the routine that called it, namely Term, and str__pos now indicates the position immediately following the factor. term then enters a loop where it checks for a MulOp (remember a Term is defined as a factor followed optionally by one or several "MulOp factor" pairs).
If it finds one, it stores the type of MulOp encountered and again calls factor, and when factor is done, term either multiplies or divides the two results. The loop keeps repeating until no MulOp is found or an error condition occurs. Control returns to val__expr, which enters a similar loop, except that it checks for an AddOp. When val__expr terminates, full__expr checks for the presence of a BoolOp. If one is found, it stores result in a local variable and again calls full__expr to evaluate the right side of the expression. This ensures the boolean operators have the lowest precedence, and as you may have noted, correct precedence for the AddOps and MulOps has also been established.
In a nutshell, this is how the expression evaluator works. Rather than provide a step by step walkthrough of an evaluation, I ask that you consult the references I cited or better yet, examine my source code. Note that these routines are all mutually recursive, hence the FORWARD declarations.
Lagniappe
Opus contains far more interesting considerations and routines than I have space (or time!) to describe. I have extensively commented some you may find of general use, including my__exp and my__ln, which are accurate replacement functions for the library procedures that ensure calculation of the financial functions accurate between nine to ten digits, which would otherwise be accurate between only five to six digits. These use a combination of techniques, including one for fast calculation of integer powers found in "Computer Approximations" in the April 1986, BYTE, and my own implementation of power series approximations optimized by taking advantage of some logarithmic and exponential properties.
Other useful language extensions include my routines for real number and string conversions, aptly named real__to__string and string__to__real. Real__to__string is very flexible and allows you to specify the number of decimal places and whether scientific notation is to be used.
If you're using a resource construction program, which I highly recommend, don't use Set__DText Set__DEdit and Get__DEdit. Rather, avoid redundancy, add elegance, and save a few kilobytes by using Set__Text and Get__Text. Use Map__Tree to deselect all selectable items within a dialog with a single call, rather than calling Obj__SetState for each one. Also use Map__Tree to return the one selected item from a group of selectable items, rather than explicit "IF Obj__State(tree,item) & Selected< > 0 THEN..." constructs for every possibility. My acknowlegements to Tim Oren and his "ProGem" series for these latter routines.
I designed Opus to be expandable through the use of desk accessories (DAs). Essentially, a DA may communicate with Opus, requesting certain information, and Opus will respond by writing a message containing the desired information to the DA. In the future, I may use this feature to add graphing, which won't require a new version of Opus itself.
Epilogue
Thus concludes our tour of Opus. I hope I anticipated most of your questions, and I further hope you find Opus a truly useful, enjoyable tool. Should you have any questions, comments, or criticisms, I may be reached at:
DELPHI: MEDSTUD
CompuServe: 72277,2315
GEnie: DougH.
I would also be most pleased to hear about your application of my program, and I would enjoy seeing any of your templates, so upload them for all to use.
Bibliography:
Aho, Alfred V, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and Algorithms. Reading, Mass.: Addison-Wesley, 1985.
Amsterdam, Jonathan. "Context-Free Parsing of Arithmetic Expressions." BYTE, August 1985, pp. 139-144.
Amsterdam, Jonathan. "Build a Spreadsheet Program." BYTE, July 1986, pp. 97-108.
Magruder, Scott. "Parsing an Equation without Recursion." Computer Language, June 1987, pp. 45-48.
Moshier, Stephen L. "Computer Approximations." BYTE, April 1986, pp. 161-178.
Wilkinson, Bill, Kathleen O'Brien, and Paul Laughton. The Atari BASIC Source Book. Greensborough, N.C.: Compute! Books, 1983.
Atari ST references used:
Oren, Tim. ProGem. Available on DELPHI, CompuServe and GEnie.
Leemon, Sheldon. Atari ST: VDI. Greensboro, NC: Compute! Publications, Inc., 1987.
Leemon, Sheldon. Atari ST: AES. Greensboro, NC: Compute! Publications, Inc., 1987.
Gerits, K., L. Englisch and R. Bruckman. Atari ST Internals. Grand Rapids, MI: Abacus Software, Inc., 1985.