Compiled, interpreted, whatever

Every now and then, people talk about "compiled"
versus "interpreted" languages, and how they are different. Whilst
there are obvious technical differences between the two, the degree to which
the end-user / developer can tell the difference is mostly a function of the
language design and implementation. This topic came up yet again on the JScript
.NET newsgroup
(note: the indentation / colour coding on the thread is not
entirely correct...) which I now unfortunately frequent quite rarely, and
rather than send a long reply there, I thought I'd send it here ;-)

Now, I'm not strictly speaking a compiler guy, and I never
took compiler courses in college nor had any deep interest in all the
theoretical stuff, so sue me if I get something wrong (on second thoughts,
please don't!). Nevertheless, I am interested in language (human or otherwise)
and was the PM for JScript / JScript .NET at Microsoft for several years, where
I had the privilege of working with some very talented people on the
compiler and runtime, which blurred the distinction between the two concepts. Tyson would know much much much
more about this stuff, since he actually worked on a team at Melbourne University that implemented a
really weird language <g> called Mercury, which even runs on the CLR!


So what's the difference between compiled and interpreted
languages? JScript .NET is 100% compiled and yet has an "error
recovery" mode that can deal with most compile time errors (including
horrible syntax errors) and still produce correct output, for some reasonably
low values of "correct". Check the VsaDemo
sample on GotDotNet and play with it
for a bit. Paste in some VB code, check the "Always run" box, and see
what you get out of it ;-). (Note: you may need to move the
"txtWhatever.MultiLine = true" statements around in the source files
and recompile if you are running on the "Everett" version of the CLR...)


Anyway, what most people tend to be talking about, I think,
is essentially how strict the language is. Interpreted languages tend to be
less strict, often out of necessity (you can't detect that a program is well
formed if you haven't even 'seen' the whole program yet) and often by
design (trading performance and robustness for developer flexibility).
Compiled languages tend to be more strict, mostly because people want them that
way, but also because it is much easier to write a strict language compiler
than it is to write a loose language compiler / runtime. Sometimes other things
come into play, for example speed of compilation. When you are building large
programs, the time it takes to compile them actually matters. You can write a
much faster compiler if you fail early and fail often. For example, the C#
compiler did (and may still do; not sure) very little error recovery. If you
had something outside of a class -- anything -- it gave you a generic
"only using statements and class declarations are allowed" error. If you
missed a semi-colon or closing brace, it got very upset. It takes time and
effort to take care of these things, and that costs CPU cycles.

Another thing people confuse is syntax errors ("compiler
errors") and type errors. Generally speaking, a compiler error is
something that is wrong with the grammar of the language -- caused by things such as leaving out a brace,
forgetting to close a comment block, or copying-and-pasting code from Office
after it has "Smart"-erised your quotes -- that stops the compiler
from even figuring out the rest of the program. If it doesn't know where one
statement ends and the next begins, or if it sees a weird character that it has
not been told how to handle, it gets very confused indeed.
Muchasifitypethiswholesentenceinonewordandyoucanttellwhereonewordendsandthenextstartsexceptforthefactthatyouareaverycleverhumanbeingandcanactuallyreadthissentencewithlittleornotroubleatall.
(Word says "Fragment (consider revising)" hehehe).
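
To make that a bit more concrete, here are a couple of entirely made-up JScript lines that will stop the parser dead before it gets anywhere near the rest of your program:

  function greet()
  {
    var msg = “Hello”     // “smart” quotes pasted in from Office -- not real string delimiters
    return msg
                          // oops, and the closing brace for greet() has gone missing too

Neither of those is a deep semantic problem; the compiler simply can't tell where things begin and end any more.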

Most errors though are actually type errors -- the developer
is trying to use a value in a way that is not appropriate for values of a given
type, for example trying to multiply a string by a boolean (egad!). Whilst some
classes of compiler errors will break even interpreters -- if you feed complete
gobbledegook (wow, that's in Word's dictionary!) into them, they'll get
confused -- other classes of compiler errors may or may not break interpreters
depending on how they work. For example, if I'm interpreting a program
line-by-line, it doesn't really matter what the next line looks like until I
get there, and depending on the instructions in the current line, I may never
get to the next line. Conversely, if I hit an if statement and it evaluates to false, even an interpreter must start looking forward to find the else or end if statement, although it may be able to look only at (eg) new
lines and therefore treat the contents of the lines themselves as opaque blobs,
thereby bypassing any syntactic errors in the lines themselves.
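
JScript classic isn't quite that kind of line-by-line interpreter (it parses the whole program up front, so genuine syntax errors are still caught), but it does put off the question "is this operation actually valid?" until a line is executed. A made-up snippet to illustrate:

  function neverCalled()
  {
    var n = 42
    return n.toUpperCase()    // numbers don't have a toUpperCase method...
  }

  if (false)
  {
    neverCalled()             // ...but nobody notices, because this branch is never taken
  }

The bad call would blow up at runtime, yet the program as a whole runs to completion without a single complaint.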

Type errors, on the other hand, don't stop the compiler from
looking at the rest of the program, but they do generally cause some kind of
warning (or error) because, being a helpful bit of software, the compiler likes
to tell you when you are about to shoot yourself in the foot. The compiler generally
knows what operations are valid on any given variable, and if you try and
perform an invalid operation it will let you know with some kind of "type
mismatch" or "no such member" or "operation not
allowed" error. In general, programmers want type errors to be caught as
soon as possible (ie, at compile time, or even at "authoring time" if
you have a cool enough editor)
because this makes them more productive and cuts down entire classes of bugs
that are tricky to find at runtime (just ask any script programmer!). But they
also make it tedious for programmers writing simple programs, because they must
think upfront and tell the computer exactly what they are going to use a value
for ("name is a string",
"count is a number") even when
these things really should just be obvious, right? It also means you can't use the same
variable to hold a string, a number, and a date at different times in the
program; you must have three separate variables to satisfy the compiler.
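
In JScript .NET terms (the annotations below are just illustrative), "telling the computer what you are going to use a value for" looks something like this:

  var name : String = "Jane"       // name will only ever hold a string
  var count : int = 3              // count will only ever hold a number
  var when : Date = new Date()     // when will only ever hold a date

  // count.getFullYear()           // ...so the compiler can reject this sort of thing
                                   // up front, rather than letting it fail at runtime

In classic JScript you would just write var name, var count, var when -- or reuse one variable for all three -- and the compiler would have nothing to go on.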

Since JScript classic has no way for the developer to
express the desired type of a variable, it is impossible for the compiler to
determine at compile-time whether the operations you want to perform on
variables are valid. Furthermore, because JScript allows the dynamic
modification of types via prototype chains, it wouldn't be possible to do this
(in general) even if the user could give type information. For example:

   "Hello, world".getFullYear()

looks like it should cause a compiler error, since "Hello, world" is obviously a
string, and getFullYear is a method
on the Date object, not the String object, but in fact it will succeed if
somewhere else in the program I have something like:

  function String.prototype.getFullYear()
  {
    return "But I'm not a Date object!"
  }

Couple this with the fact that JScript almost certainly (90%
sure, but I didn't write the code) generalises operations on literals to be the
same as operations on arbitrary expressions, and the fact that it will perform
conversions between numbers, strings, etc. quite liberally at runtime, and the
upshot is that you can't give decent errors at compile time with JScript.
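
For a flavour of just how liberal those runtime conversions are (this is standard JScript behaviour, nothing exotic):

  var a = "2" * "3"     // 6    -- both strings are converted to numbers for *
  var b = "2" + 3       // "23" -- the number is converted to a string for +
  var c = "5" - true    // 4    -- true becomes 1, "5" becomes 5

With almost any combination of types happy to mash together like that at runtime, there is precious little for the compiler to reject up front.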

JScript .NET does much better at this, even in the absence
of type information, due to its type inferencing capabilities. For example, it
will give a compiler error for this code:

  function foo()
  {
    var s = "Hello"
    print(s.bar())
  }

because it knows that s
must contain a string (it is a local variable that is only ever assigned
to once), and it knows that strings do not have a bar method. (Unlike in the earlier getFullYear example, JScript .NET
supports a "fast" mode that disallows such modifications of the
object model, precisely so the compiler can reason more usefully about the
program and give you such errors.)

As a broad rule of thumb, "interpreted" languages tend
to be better for writing small programs, whilst "compiled" languages tend
to be better for writing large programs. Developing with
"interpreted" languages tends to be faster for small projects, or
projects where you might be building and testing in small increments. For
example, if I have a 100 line program but I'm only interested in running
20 lines of it, it shouldn't matter whether the remaining 80 lines all have
"compiler errors" or not -- I should just be able to run those 20
lines. "Interpreted" languages let you do this, whilst
"compiled" languages do not. But you'd never write Windows or Office
in an interpreted language, not only for performance reasons, but also because
it would be impossible to debug. (I use the words in "quotes"
because, as I mentioned above, these are generalisations, and languages such as
JScript .NET are fully compiled and yet still have properties such as the one
I've just described).

Anyway, I've talked enough rubbish for one day. I need to go
to sleep so I can get up for another day of filming tomorrow [looks at watch]
errrr today. Yup, I like to crew on independent films in my spare time (and was
even roped into some bit-part acting today - wow, that was unexpected!).