MKS YACC and TBASIC for Windows. (compiler writing) (Column)
by Tom Campbell
To those who enjoy it, it's an unforgettable pleasure worth any number of return trips.
If you've read this column more than once, you're probably a pretty inquisitive person. At one point or another, you've probably asked yourself what it takes to (gulp!) create your own programming language.
Compiler writing is often thought to be one of the last black arts in the programming trade. Even though anyone who gets a degree in computer science suffers through a semester or two of compiler writing, most programmers endeavor to forget as quickly as possible what they supposed to have learned in those classes.
Creating your own language is like drinking port. It's an acquired taste, but to those who enjoy it, it's an unforgettable pleasure worth any number of return trips. I'm one of the nut cases who enjoy writing compilers and interpreters, and I've read literally every book in print on the subject, not to mention many more now out of print.
Having done this for a few years now, I can safely assert that almost all the books are way off the mark. Most of them require that you write the whole compiler by hand, omitting such necessary tools as YACC and Lex, and few of them give you the pleasure of presenting the source code to a complete compiler. Usually, the really good stuff is left, in a phrase I've come to dread, as "an exercise for the reader."
Not that printing the source code to a compiler is easy to do. For one thing, it can take up a lot of valuable editorial pages that the publisher would rather see devoted to more "substantial" topics such as theory and exercises left to the reader. For another, there's always a big problem in deciding what the target machine is. Make it a PC, and you leave out a lot of college students who woek on Suns and VAXs.
The truth is that when I had to get the job done on my first compiler, I learned more from MKS Lex and YACC (Mortice Kern Systems, 35 King Street North, Waterloo, Ontario, Canada N2J 2W9; (519) 884-2251). A compiler is made up roughly of these parts. The scanner takes the individual characters that make up the program's keyboards and other "lexical" elements (its words), such as numbers, math operators, and variable names. It converts each item into an integer identifier (such as the number 257 for PRINT [underscore] and 256 for STRING [underscore] in the case of the BASIC line PRINT "Hello, world."). It also associates a value with that identifier, such as the string "Hello, world" in the case of STRING [underscore]. (Often, as with PRINT, no such value is needed.) The scanner normally has little knowledge of the language and will cheerfully accept PRINT CLS WHILE + SELECT SELECT as valid input. The parser takes these lexical elements (viewing them as their integer identifiers) and sees that they're in the right order. This is the heart of the compiler and, normally, the most difficult to write. The parser then adds identifiers such as variables and subroutines to the symbol table, and it generates code or returns an errot if the parser deems the input program incorrect.
What do the oddly named YACC and Lex have to do with this? It turns out that while the parser is incredibly tricky to get right--without automated help you will almost always create a langauge with invalid contracts--it is also quite amendable to being automated. This is where most textbooks get it completely wrong. They'll spend two or three chapters on archaic parsing theory, then provide little or no real code to illustrate their outmoded idead. What YACC lets you do is create a description of the language using an extended version of the Backus-Naur Form you often see in language syntax charts, and it creates a C function called yyparse() that does all the hard work for you. Oh--you don't like C? Then use the -LP option to create a Turbo Pascal routine! Writing scanners is normally less tricky, but that also can be automated using Lex. Lex generates a C (or Turbo Pascal) routine called yylex() that's normally called by yyparse().
The MKS Lex and YACC manual is simply the best applied introduction to compiler writing I've ever seen. Starting with a simple command line calculator, it proceeds to a C-like language with functions, procedures, local and global variables, and a rich set of control structures. It has both an interpreter and a compiler. The only thing I don't like is that the assembler code the compiler generates isn't for a PC, but for an imaginery Motorola-like processor. Using YACC and Lex isn't easy, but this tutorial is great. If you're willing to spend a few months studying, you can come out with the beginnings of your own homemade language, in Turbo Pascal or ANSI C!
I've posted a grammer for an interpreter called Tiny Basic for Windows in the file TBA-SIC?? .ZIP, where ?? stands for a version number such as 10 for version 1.0 or 21 for version 2.1. It has the C and Turbo Pascal source code for everything you need to get your own language going.