Puzzling through Erasure II


Bruce Eckel has some further thoughts about the use of erasure in Java generics.

Erasure is the way that Java gets generic types without adding VM support. When the developer writes something like:

List<Integer>

in Java generics, the compiler knows that this type can only contain an integer, but when the bytecodes are generated, the type is really:

List<Object>

This means that Java generics doesn’t give you the performance benefit that .NET generics do – not only do you not have the ability to create generics on value types (not that surprising given that Java has previously not had boxing), but you still have typecasts in your generated code.

Advantages? Well, the advantage I always heard was that erasure meant that you could run on downlevel VMs, but it turns out that that isn’t true. It does mean that it’s much easier to write a 5.0 VM than it would be if an approach such as the .NET one was used.

The other advantage is that because the underlying object is really a List<Object>, it’s possible to implement a wildcard feature in which you can write something like (and forgive me if my Java syntax is wrong):

List<?>

which means a List of anything. You can get away with this in erasure because everything’s the same type under the covers, but it would be much more challenging in the .NET implementation because List<double> is very different than List<string>.

My personal opinion is that it’s far more important to be able to have generics over value types and performant implementations than to support wildcards. In VS2003/C# V1.1, programmers sometimes need to ask the following question:

“Hmm, I need a list of integers here. Should I use an ArrayList, which is easy to write, or should I use my own IntegerList class, which is a bit harder to write, but performs better. Or should I use an array”.

The fact that users need to consider that is bad, as is the fact that code contains all three options, and it’s often hard to know why a specific option was chosen.

Comments (23)

  1. Samboy Lims says:

    Well said. Thank’s for the important information. One more high score for .NET

  2. Justin Pitts says:

    pragmatism vs academia

  3. RichB says:

    On a related note, my personal opinion is that all generic classes should be abstract. ie

    List<string> mylist=new List<string>;

    should be dissallowed because List<string> should be an abstract type. This would force all programmers to do the following:

    internal class StringList : List<string> {}

    StringList mylist=new StringList();

    which hides the details of the list of strings behind a class which can be easily extended with additional functionality at a later point. These additional changes would then be localized to the StringList class, rather than having to modify many instance creation statements across the entire codebase.

    For some strange reason, House of Straw, House of Sticks and House of Bricks come to mind…

  4. Pete says:

    Doesn’t the .NET approach involve JITting the collection code for every new type it’s instantiated with (like C++ templates, but at run-time)?

    If this is the case then I imagine the Java approach scales down to PDA’s (and other such small devices, like phones) much better. It should provide lower memory requirements and could be faster code (I would assume that JITting is quite slow on these devices).

  5. Bunnz says:

    I really prefer the .NET way of generics and I would have been really disappointed if left the boxing in the generated code.

    However, there is still some work to do, to support generic operators and stuff like that 😉

  6. Eric says:

    Pete, IIRC, the jitting is handled on a per-appdomain basis.

    All reference types share the same jitted code. For each value type, the code is jitted the first time such a type is created, and then shared.

  7. Pete says:

    So… where does the performance gain come from (in the .NET version of generics)?

    If a List<T> is only jitted once for all reference types of T, surely the jitted code must be no different from List<Object>? If so, how is this different from the Java approach (for ref types)?

  8. Nicholas Allen says:

    > when the bytecodes are generated, the type is really: List<Object>

    List<Object> is another generic type. Erasure leaves you with the concrete type which is List.

    > This means that Java generics doesn’t give you the performance benefit that .NET generics do

    That doesn’t really follow. The reason why is that this statement is looking at the wrong thing:

    > you still have typecasts in your generated code

    Performance is not related to the intermediate code generated by the compiler. This is true in both .NET and Java. The platform specifications guarantee very little about what work is actually done. You cannot benchmark byte code or IL directly. It must always be taken in context with the translation to native code. The same code generated by the 1.0 Java and .NET compilers has different (generally better) performance characteristics when run today.

    What’s probably safe to say is that next week the Java implementation is not going to run generic code faster than non-generic code. But it’s generally not a good idea to base language design decisions on how your compiler works this week.

    > Well, the advantage I always heard was that erasure meant that you could run on downlevel VMs, but it turns out that that isn’t true.

    Officially, that isn’t a supported scenario. Due to the implementation obviously you can round trip to a previous platform.

    > It does mean that it’s much easier to write a 5.0 VM than it would be if an approach such as the .NET one was used.

    It’s certainly true that resources may impact these kinds of decisions.

    > List<?> which means a List of anything

    List means a list of anything. List<?> means any list of something. The difference is important due to the following old problem with generics:

    If T is a subclass of U, then T is promotable to U. Is List<T> promotable to List<U>?

    Answer: no. A List<T> allows us to remove things of type T. A List<U> allows us to add things of type U. If we could treat a List<T> as a List<U> it would punch a hole in our type system by allowing us to a put a U in and expect a T to come out.

    This can be a big problem with API usability. If someone asks for a List<Object> you have to give them a List<Object>. If you happen to have a List<String> you can’t pass it directly. If they ask for a List<? extends Object> you can give them a List<Object> or a List<String> and both will work fine.

    > You can get away with this in erasure because everything’s the same type under the covers, but it would be much more challenging in the .NET implementation because List<double> is very different than List<string>.

    I think you’re selling yourself short. From a compiler perspective wildcards are hard because of the need to do some significant type inferencing. An implementation should be doable. Semantically a List<? extends T> is very similar to an UnmodifiableList<T>…

    > My personal opinion is that it’s far more important to be able to have generics over value types and performant implementations than to support wildcards.

    I don’t really see a conflict between those goals. Because of the history of .NET, it’s relatively cheap to support generics over value types. Because of the history of Java, it’s relatively cheap to support wildcard generics. I see both platforms supporting all of the above in the next few years.

    One advantage you missed is that because a List<T> is really just a List under the covers, you can use generics and still pass a List<T> to library code that was written before generics were even a possibility. Of course, it would be up to you to verify the type safeness of this decision (or be prepared to accept a failed checked cast in the future).

  9. "One advantage you missed is that because a List<T> is really just a List under the covers, you can use generics and still pass a List<T> to library code that was written before generics were even a possibility. Of course, it would be up to you to verify the type safeness of this decision (or be prepared to accept a failed checked cast in the future)."

    That’s what IList is for, and other untyped collection interfaces. Correct? So what is the advantage again of wild cards? I see absolutely none.

  10. Nicholas Allen says:

    Frank-

    As far as I’m aware, an IList<T> is not an IList and the two interfaces don’t directly match up. You’d have to thunk all your calls and make a wrapper going through each time. Also, this is a property of erasure, not wild cards. Wild cards are a particular generic feature; erasure is a way to implement generics. It’s possible to have one without the other (in both directions).

  11. Eric says:

    Nicholas,

    You have found a few cases where I wasn’t precise in what I wrote, which is good. I have a few comments on other things that you wrote.

    When I said, "there are still typecasts in the generated code", I meant "the final result is as if you still had typecasts in the code". With erasure, you still need to have the runtime type checking in the native code. With the .NET implementation, that’s not true – there is no need for runtime type checking, so the effect is as if there are no casts any more. Which improves performance, though the improvement due to removing casts is much smaller than the improvement from getting rid of boxing.

    I’m not an expert on our generics implementation, but I’m not terribly optimistic about the possibility of supporting wildcards. If you define a supertype of both List<double> and List<string>, that type needs to have Add(<some type>).

    But there’s no type that is both a reference type and an unboxed value type. If you use object, you’re back to boxing again.

    List<T> implements both ILIst<T> and IList, so if you want to pass a List<int> to a routine that expects an IList, it will work fine. You just have to know that you’ll be getting boxing on each item.

  12. Nicholas Allen says:

    Eric, that is good news about IList<T> : IList! I had complained about that months ago and got wontfixed. Too bad about the boxing, but that’s a hard case.

    > I meant "the final result is as if you still had typecasts in the code"

    Right, exactly. But there’s some subtlety here. If we can perform an escape analysis on a generic object and see that it’s never accessible to non-generic code (trickier than it sounds), then we can prove cases where the typecasts are unnecessary. I have no idea what Sun has planned, but it’s very similar tech to their virtual method analysis…

    It’s definitely true that eliminating boxing is a bigger win than this in most real world programs though.

    > If you define a supertype of both List<double> and List<string>

    I don’t think it’s necessary to define such a supertype though. The <?> captures types but the implementation doesn’t care what that results in. It’s all just type theoretical bookkeeping on the compiler’s end, like varargs. At some point you have to make a concrete call and pass in either the List<double> or List<string>. Admittedly it would be a bit gooky if the compiler had to emit separate methods for List<? extends object> and List<? extends System.ValueType> but I hope it could be done smoother than that.

    40 characters by 8 lines is a tough interface to explain things in. A big feature like this would take a lot of careful thought. I think it would be great if it happened though.

  13. Pete says:

    I still don’t see how you avoid run-time casts:

    Say I have a class, MyClass, that is a reference type. You have said that all List<RefType>s will share the same code (only jitted once), so how can I get a MyClass object back from the List<MyClass>, without it doing a run-time cast?

  14. > Eric, that is good news about IList<T> : IList! I had complained about that months ago and got wontfixed. Too bad about the boxing, but that’s a hard case.

    That’s not what Eric said. You’re implying this:

    interface IList<T> : IList {…}

    Eric said this:

    class List<T> : IList<T>, IList {…}

    This is probably why your bug got marked as wontfix.

    Regardless, you can still pass List<T> to an IList method:

    void OldMethod (IList list);

    List<int> n = new List<int>();

    n.Add (42);

    OldMethod (n);

  15. > Say I have a class, MyClass, that is a reference type. You have said that all List<RefType>s will share the same code (only jitted once), so how can I get a MyClass object back from the List<MyClass>, without it doing a run-time cast?

    This is possible because the JIT operates at a lower level. The compiler + JIT type check all method boundaries, ensuring that everything is type checked.

    The JIT generated code doesn’t need to redo the type checking; it was already done during the JIT. Consequently, the generated code can just stay in the realm hardware, of pointers. Since Reference Types are really just pointers, they all have the same size, so the generated code doesn’t need to worry about the actual types involved — no fixup or change would actually be required.

    It’s akin to just using a C++ reinterpret_cast everywhere an object reference is used — the cast doesn’t *do* anything, it’s just present to satisfy compiler type safety:

    int *i = new int;

    // doesn’t do anything on the actual hardware; just pleases the compiler

    char *c = reinterpret_cast<char*>(i);

    // which can be bad if you’re not careful…

    printf ("%sn", c); // undefined

    In the context of .NET, "generated reinterpret_casts" (really just a register copy) look terribly unsafe, they’re not — all the type checking has already been done, so there is no loss in type safety (as long as there aren’t any bugs in the type analysis code).

  16. Pete says:

    "This is possible because the JIT operates at a lower level. The compiler + JIT type check all method boundaries, ensuring that everything is type checked."

    So the jitter compiles the code for List<T> the first time it hits one, but checks all future declarations/usages separately when it reaches them? Or can this type-checking be done entirely by the compiler? I guess it must be, because otherwise it would be a possible run-time exception for an invalidly declared container.

    Interesting.. I assume this is why constraints are necessary? Because the code is jitted once, all operations must be valid at that time. I guess this is the biggest restriction of implementing generics this way (unless I’m getting confused).

    In that case, Java has really lost out. They don’t have the power of C++ templates, and they don’t have the speed of .NET generics. I guess C++/CLI wins all round (since it supports both).

  17. "In that case, Java has really lost out."

    Yes, I cannot see why the Java version is even called "generics" — it doesn’t seem to have advantages of generics.

    It looks like the Java implementers are now more concerned with justifying early poor design decisions instead of biting the bullet and throwing out the poor design.

  18. > So the jitter compiles the code for List<T> the first time it hits one, but checks all future declarations/usages separately when it reaches them? Or can this type-checking be done entirely by the compiler?

    Not exactly. The JIT *always* checks declarations and usage, not just the first time, to ensure consistency and safety. Of course, this is at JIT time, so you only pay once (per AppDomain).

    This type checking CANNOT be done ONLY by the compiler, as the compiler may be buggy, or the program executed may be corrupted/trojaned, or ILASM is your compiler, or…

    > I assume this is why constraints are necessary?

    Absolutely. Without these constraints, the JITted code *would* need to perform the runtime check (or some similar mechanism) to ensure that the method dispatch would be correct/safe. This would hurt performance.

    Alas, this implementation also means that a generic class can’t inherit from one of its arguments, which is possible in C++:

    template <class T> class Helper : public T {/* … */}

    Such functionality would be useful for "kludging" multiple-inheritance-like behavior, but obviously isn’t possible under the current Generics mechanism.

  19. Nicholas Allen says:

    Jonathan- I’m not implying anything. What I asked for is exactly what Eric said is happening. You’re reading things into posts that aren’t there. I am glad though that the problem got solved even it if does involve boxing.

  20. Eric Gunnerson says:

    Jonathan,

    I agree that’s it’s possible to build analysis into the system so that you can know that the casts really aren’t possible. It’s a fairly involved process with the prospect of dynamically loaded code and IL generation on the fly (don’t remember if Java has this), but it’s possible.

    I’d prefer, however, a design in which you didn’t have to do this, as it adds a bunch of complexity on the runtime side.

    Constraints are very necessary for the compilers to give a good programming experience. Unlike the C++ model (anything goes, we’ll check at instantiation time), the .NET model tightly constrains based on the constraints of the generic type. That means that if I constrain a type parameter to implement IList<T> (for example), I can depend on it having those members.

    To do it the C++ way would require that there be a validator that checked whether the generic type IL could work correctly for every different type it was instantiated on. There’s a bit of checking in the JIT, but not really of this sort, and you’d have to do it for every type (ie both List<Employee> and List<Address>).

  21. Pete says:

    Jonathon:

    "The JIT *always* checks declarations and usage"

    So if the compiler cannot do all the checking.. what happens if the jitter catches an error? Does an exception get thrown?

    "Alas, this implementation also means that a generic class can’t inherit from one of its arguments, which is possible in C++"

    Bummer, that’s one of C++’s best features.

    Eric:

    "Constraints are very necessary for the compilers to give a good programming experience"

    I don’t like the way constraints are being sold as a good feature. They seem like the biggest limitation of the system to me.

    "To do it the C++ way would require that there be a validator that checked whether the generic type IL could work correctly for every different type it was instantiated on"

    That doesn’t seem like an enormous problem… you’d still only have to do the checking once (per instantiated type), and even then it’s not like jitting the container code each time.

    The biggest downside to this is that you would have to throw an exception instead of catching such errors at compile-time. But then, if you already do some checking how does it deal with errors?

    As a side note, is it not possible to go with a hybrid approach? i.e. allowing someone using a container to use it as-is, or do something like C++’s explicit instantiation (which would generate the code for that type at compile time, like C++):

    // Generate a separate MyClass version of the container, which

    // is generated at compile-time, so should give improved performance:

    generic<> List<MyClass>; // C++/CLI form.

    But I’m not sure how much benefit there’d be.

  22. Eric Gunnerson says:

    Jonathan,

    I meant to comment on this last time…

    While we don’t currently allow it, it is possible for the .NET implementation to be extended to allow

    class Extend<T>: T

    We don’t currently allow it, and it would break some invariants that currently make things simpler in the runtime, but it’s possible. I don’t know if/when we’ll do this, but we do understand the utility of being able to do so.

    Pete,

    If you chose to do that, you would have to have that checking both at the instantiation time *and* at compilation time. You would also lose some of your design-time utility. Right now, Intellisense knows exactly what operations are valid on type parameter T. Without this being limited by constraints, it would have no idea what to present – the valid set would become every field and operation on every type.

    We do agree that the current constraint syntax is less expressive than where we’d like to end up.