zero or one based collection?


moo asks:


Zero based collections or 1 based?

Since programming languages are a bridge between the human concept of a solution and we naturally think the first element is in position 1, why was this not so on the actual language? Why are we made to think like a machine when infact we are not? The infamous “Off by one bug” is there because of the inherant design. So popular its even got a name yet we do nothing to prevent this happening at the language level, as for those who say its easier for the computer, thats why we have smart compilers. Im not a compiler dammit.


****


One of my hobbies is rewriting lines in TV and films, and I am therefore impelled to comment that moo’s last line should be:


I’m a programmer, not a compiler, Jim


To cover this subject, I’m going to have to set the way-back-machine for the 1980s or earlier, back when real men wrote in assembly and our bytes only had 7 bits. (aside – How many of you – and be honest here – have ever heard of machines where the word size wasn’t a multiple of 8? They really did exist).


Anyway, in those ancient times, processors were fairly glacial in their arithmetic speed, though much faster than the early calculators. Saving cycles was very important, so when arrays were first considered, the implementers looked at the code they wrote:


a[x]


translates to


address = base_address + (x – 1) * sizeof(x)


They actually didn’t write it that way because they didn’t have multiplication in those days, but that gives you the idea. Then somebody noticed that if your array starts with zero, you could write it as:


address = base_address + x * sizeof(x)


Thereby saving you a single decrement operation, which was important in those day.


Therefore early programmers got used to zero-based arrays, and the path was set, and it has stayed that way for many years for the majority of languages.


But why? Isn’t it simple enough to change?


It’s obviously trivially easy to change, and Moore’s law has made the efficiency inconsequential in the majority of scenarios. The issue isn’t around technological limitations, but rather human ones.


Understanding how zero-based indexing works is the secret handshake of the programming world. We all started not knowing the secret handshake, but over time we learned and even began to like the secret handshake, and now we don’t know any other way to shake hands.


We’re not going to try to change our brain wiring just because some young whippersnapper is having trouble remembering that the first index is zero.


Or, to put it another way, developers have a huge investment in hardwired things like this, and changing them will not make your customer happy.


[Update: Jack wrote:


Why cant we either have 1 based or user definable array bounds?


The CLR does support this kind of construct (I had a hard time not using the term “travesty“ here…), but there aren’t, to my knowledge, any languages that have built-in syntax to do that.


Which is a very good thing. If you go down that route, rather than having to remember a single rule (C# arrays are zero-based), you have to remember that every array could be either zero-based, so all your loops become:


for (int index = arr.GetLowerBound(); index <= arr.GetUpperBound(); index++)
{
}


If you get this wrong, you get code that works fine for your test cases, but breaks for the people who like 3-based arrays.


Yuck. Many times, “make it an option“ is the worst choice.


]


<Update>


Ferris adds:


Why bother creating a new language if you are still hugging your legacy pillow at night. Waste of time.


One of our main goals was to create a language that was comfortable to C/C++ programmers. We could have taken a different tack, and designed a new language from scratch, and perhaps done some new and exciting things.


But if you look at all the languages out there, you’ll find that having a comfortable syntax is well correlated with language success. Even if you’ve never written C# code before, if you have experience with a C-style language, you’ll be able to read C# code.


</Update>


You can find another discussion of this issue here.

Comments (42)

  1. Curt Hagenlocher says:

    <i>How many of you – and be honest here – have ever heard of machines where the word size wasn’t a multiple of 8?</i>

    Many of the machines in DECs PDP family had 36 bit words. Is there a prize for this? 😉

  2. It should come as no surprise to anyone that Perl lets you define your own array base on a dynamic basis by setting the special $[ variable. It’s, um, not recommended, though.

  3. An obvious reason to keep things as they are is when dealing with coordinate plains. If you use a 2-D array to plot points for example, it would suck if you used a 1 based array. The distance to the point 1,1 would be counterintuitively 0, not sqrt(2).

    I think it helps to think of arrays as a distance. So instead of thinking about the 1st element of an array, think of an element at distance 0.

  4. The first machine I ever did significant programming on was a Decsystem-20 – 36 bit words, segmented into 2 18 bit halfwords.

    Stop by my office sometime and I’ll show you the reference manual :)

  5. pb says:

    The processor used in the Chromatic Mpact of a few years ago used 9-bit bytes. It was great for the days when others had 16-bit color (565) and we had 18 bit color (666). Writing to it from the PC host was painful but hey…

  6. Ferris Beuller says:

    Yes the typical argument of zero based arrays is the underlaying processor, well hey this isnt assembly, get over it, how you implement this in the runtime or compiler I couldnt give a rats butt about, I say, walk over and get me the first item from that collection on the bench, you dont pick the zeroth item, you pick the first. Then you go on to mumble about legacy minesets, well hey, time for a change. I guess people are afraid of change, well I for one arn’t and welcome such changes. Why bother creating a new language if you are still hugging your legacy pillow at night. Waste of time.

  7. Ferris Beuller says:

    Maybe we need a new designer on the C# team with NEW ideas and concepts and one that isnt afraid of change.

  8. travesty is a good word Eric. I always hated that about VBA. Some collections were base 1 others zero. It was nuts. Forcing to base zero was a great move.

  9. Ferris Beuller says:

    Forcing base 1 is a good move.

  10. Ferris Beuller says:

    *takes notes* "Steve Chase is an arse licker"

  11. Dan Crevier says:

    I have a feeling lots of code is going to be ported from C/C++ to C#. If arrays were 1 based in C#, it would lead to tons of errors.

    Also, I think the 0 based array thing made a lot of sense with the tight relationship between pointers and arrays in C. It would be really weird if p[n] = *(p+n-1).

  12. Beam me up! says:

    That should read:

    "Damn it Jim I’m a programmer, not a compiler."

  13. Anand says:

    VB used to support the dynamic lower bounds. But they removed it in VB.NET. There was also support for a 0 based or 1 based array using the Option Base statement. That did give you a lot of flexibility…

  14. Jens Samson says:

    ‘Why cant we either have 1 based or user definable array bounds?

    The CLR does support this kind of construct (I had a hard time not using the term “travesty“ here…), but there aren’t, to my knowledge, any languages that have built-in syntax to do that.’

    Even Pascal lets you determine your own Array bounds. It goes even further, you could construct a set [Monday, Tuesday, …, Friday] (Have forgotten exact syntax) and use that as Array indexes. Some great functionality that other languages sadly never bothered to implement.

  15. Stuart Dootson says:

    Eric: The CLR does support this kind of construct (I had a hard time not using the term “travesty“ here…), but there aren’t, to my knowledge, any languages that have built-in syntax to do that.

    Ada has that built-in..and more…

    type SomeTypeName is array(4..10) of Integer;

    or, for truth tables

    type TruthTable is array(Boolean, Boolean, Boolean) of SomeResultType;

    (I have a love-hate relationship with Ada – it does have some nice features, like named associations, but I’d rather use C++ :-)

    Also on starting numbering at zero – Dijkstra’s argumnet for numbering from zero (see http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF) is of some interest.

  16. Shane King says:

    I’m not sure the argument is necessarily efficiency vs logic.

    Mathematically, it does make good sense to start at 0. Like it was mentioned, when you’re representing geometry, it’s good to have a 0 co-ordinate.

    Additionally, I think 0 addressing means that arrays and lists have more in common conceptually. It makes sense that 0 = head, and length = end.

    Finally, it makes no sense to change. I’ve never met anyone who’s got any natural aptitude at programming what so ever, and also can’t grasp 0 based collections. You say it’s a secret handshake, I say it’s just the way life is. If you can’t grasp it, chances are you lack the aptitude at programming anyway, so why make programs less efficient for the sake of people who aren’t going to be programmers?

  17. JasonM says:

    I’ve never used them, but wouldn’t 1 based arrays screw up "looping around" arrays using modulo arithmetic?

    Maybe it’s just me, but I do that _all_ the time.

  18. Ferris Beuller says:

    I have a feeling lots of code is going to be ported from C/C++ to C#. If arrays were 1 based in C#, it would lead to tons of errors.

    Also, I think the 0 based array thing made a lot of sense with the tight relationship between pointers and arrays in C. It would be really weird if p[n] = *(p+n-1).

    well, if you are copy n paste porting, I have only one thing to say; ROFLMAO.

  19. Philip Panyukov says:

    There already was a heated discussion on the subj here: http://discuss.fogcreek.com/newyork/default.asp?cmd=show&ixPost=1822&ixReplies=8

    I personally find it extremely painful to use APIs that mix 1 and 0-based collections/arrays in VB.

  20. That code for looping through an array looks like all of the VB code I use for arrays that I don’t have control over. I never know what the bounds are, and have to code around it.

    Travesty is right.

  21. Dave Mackersie says:

    Your argument doesn’t hold water !

    The expression:

    address = base_address + (x – 1) * sizeof(x)

    can be re-written as:

    address = (base_address – sizeof(x)) + x * sizeof(x)

    The expression (base_address – sizeof(x)) is a constant that can be determined at compile time, so there is no need for a run time decrement !

    The real reason for zero based collections goes back to the original design of the "C" programming language, in which the designers wanted array operations and pointer arithmetic to be equivalent, so that a[x] would be the same as writing *(a + x). Since then, all languages that followed in the tradition of "C" have used zero based indicies: C++, Java, and now C#.

  22. Don Smolen says:

    I was a systems programmer on the Control Data 6000 series. 60 bit words (12 bits in Peripheral Processors)

  23. Doug says:

    <RE>Your argument doesn’t hold water !

    The expression:

    address = base_address + (x – 1) * sizeof(x)

    can be re-written as:

    address = (base_address – sizeof(x)) + x * sizeof(x)

    The expression (base_address – sizeof(x)) is a constant that can be determined at compile time, so there is no need for a run time decrement !</RE>

    Eric was talking about a time before C, a time before optimizing compilers. If you re-read what he wrote, he was actually talking about a time before COMPILERS! Gee I feel old!

    Sorry. Zero rules.

  24. No one else is claiming programming on the 12 bit word PDP-8. Now there was a system for real progammers. :-) None of those frills like multiply, divide and subtract.

    I have to say though that I miss the ability to set different lower bounds for arrays in VB. It let you write things that were intuitive like having arrays that went from 1900 to 2000 for a set of years. Sure you can get around it but it adds complexity that many would just as soon let the compiler worry about.

  25. HateToAdmitIt says:

    Fortran.NET has built in non-zero based arrays…

  26. Luke Stevens says:

    Back in the <em>old</em> old days, everything was 1-based, because zero was not yet considered a number. So you had <small>AD</small> 1 immediately following 1 <small>BC</small>, and most years in the second century began with a 1 rather than a 2. To the Romans, December was the third month after October, and a newborn was in his first year; so we can consider ourselves more sophisticated now that we say, more sensibly, that December is two months later than October and a newborn is of age zero. Unlike centuries, we now start decades (like "the 60’s") on years that end with 0 instead of 1, though we still have no good name for our current decade. The day starts not at 1 <small>AM</small> but at 12! It even sounds downright strange to say that we can’t vote until our nineteenth year. So now that mankind has finally gotten used to the idea that zero is a number in its own right, numbering things from zero is seeming ever more sensible.

  27. Luke Stevens says:

    Oops, no HTML in comments, huh?

  28. Ben Hutchings says:

    Eric, you don’t know your history. FORTRAN, which is one of the oldest high-level languages, uses 1-based arrays.

  29. MBA says:

    Helpful For MBA Fans.

  30. One of the changes from VB6 to VB.NET was the removal of non-zero lower bounded arrays… a concept discussed by Eric Gunnerson