Enterprise grade Java.
You'll read about Conferences, Java User Groups, Java, Integration, Reactive, Microservices and other technologies.

Friday, July 29, 2011

Don't Use Java 7? Are you kidding me?

19:55 Friday, July 29, 2011 Posted by Unknown 16 comments:
Java 7 was released yesterday, and some folks from the Apache Lucene and Apache Solr community quickly came up with a couple of issues which led them to actively reject Java 7 and advise everybody else to do likewise. Even a general warning was issued by Apache Lucene PMC member Uwe Schindler. But what exactly is wrong with Java 7, and why shouldn't you use it after waiting nearly five years for it? Let's look at this.

It's not about Java 7 but about the JVM
First of all, it's not about Java 7 in general but about the HotSpot JVM. The GA release contains three bugs (7070134, 7044738 and 7068051) which could hit users with either JVM crashes or wrong calculations.

Hotspot crashes with sigsegv from PorterStemmer
The first one is about a faulty compiler optimization that changed the loop optimizations. The problem is that this JVM feature is on by default, so you have to disable it explicitly by adding -XX:-UseLoopPredicate as an argument. If you want to try this on your own, grab Stemmer.java and a reasonably large word list (there are some out there), compile it, and run it against the text file. What you will see is that your JVM crashes with a fatal error.

# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00000000026536da, pid=5432, tid=6568
#
# JRE version: 7.0-b135
# Java VM: Java HotSpot(TM) 64-Bit Server VM (21.0-b05 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# J  Stemmer.step4()V

The crash happens directly during code execution, and you will not experience it with a default JDK 1.6 setup. Lucene in particular has some more recent work going on around a more flexible indexing mechanism based on an algorithm called PulsingCodec, and that work is heavily affected by the bug.
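
If you rely on the -XX:-UseLoopPredicate workaround mentioned above, it can be handy to let the application tell you at startup whether the flag was really passed to the JVM. The following is only a minimal sketch: the class name LoopPredicateCheck and the helper hasFlag are made up for illustration, and the check simply looks for the exact flag string in the JVM's input arguments.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class LoopPredicateCheck {

    // The workaround flag named in the post.
    static final String WORKAROUND = "-XX:-UseLoopPredicate";

    // Hypothetical helper: true if the JVM arguments contain the flag verbatim.
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (hasFlag(jvmArgs, WORKAROUND)) {
            System.out.println("Workaround active: " + WORKAROUND);
        } else {
            System.out.println("Workaround not set; consider starting the JVM with " + WORKAROUND);
        }
    }
}
```

Run it once with and once without the flag (for example java -XX:-UseLoopPredicate LoopPredicateCheck) to see both messages.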

Loop unroll optimization causes incorrect result
This bug refers to the "wrong calculations" part of my introductory section. In very rare situations, when OSR (On-Stack Replacement) compilation is done for nested loops, the control flow breaks and the memory dependencies are not taken into account. That leads to duplicated clones which alter results. (If you'd like to know more about the compilation details, have a look at this older overview (PDF).)
A minimal workaround is to add -XX:LoopUnrollLimit=1 as an argument.
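
To see which value the running JVM actually uses for LoopUnrollLimit, you can ask the HotSpot diagnostic MXBean. This is a hedged sketch, not an official recipe: the class name LoopUnrollLimitCheck and the helper vmOption are hypothetical, and the code assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean. On a JVM that does not know the option, the helper just returns null.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class LoopUnrollLimitCheck {

    // Hypothetical helper: returns the effective value of a HotSpot option,
    // or null if this JVM does not know the option or does not expose the bean.
    static String vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean == null ? null : bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the option (or the bean) does not exist on this JVM.
            return null;
        }
    }

    public static void main(String[] args) {
        String value = vmOption("LoopUnrollLimit");
        if (value == null) {
            System.out.println("LoopUnrollLimit is not available on this JVM");
        } else if ("1".equals(value)) {
            System.out.println("Workaround active: LoopUnrollLimit=1");
        } else {
            System.out.println("LoopUnrollLimit=" + value
                    + "; the workaround from the post would be -XX:LoopUnrollLimit=1");
        }
    }
}
```

Unlike scanning the command line, this reports the effective value, so it also shows the default when nothing was passed.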

Clone loop predicate during loop unswitch
This is a bug which relates to an older feature request. Its introduction finally led to a new bug: invalidated JVM stats lead to a JVM crash, again in connection with loop optimizations.

Bottom Line
You could be affected, but at the moment basically only if parts of your software make heavy use of the optimizations which are broken. For the average use case this will not affect you. In general this also affects Java 6 users, but only if they use the optimization options which are on by default with Java 7 (-XX:+OptimizeStringConcat or -XX:+AggressiveOpts). These problems were detected only 5 days before the official Java 7 release, so Oracle had no time to fix those bugs. At the moment it seems as if they are trying to get the fixes into either the next or the second service release. And last but not least, the source code is open, so anyone stubborn enough to dig into it can make a fix.
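
The rule of thumb above can be written down as a small startup check. Treat this as a rough sketch under stated assumptions, not as a reliable detector: the class name Java7GaRiskCheck and the method possiblyAffected are made up, I assume that the GA build reports exactly "1.7.0" as java.version, and I assume that looking for "Server VM" in java.vm.name (as in the crash log above) is good enough to spot the server compiler. It does not check whether any workaround flags are set.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class Java7GaRiskCheck {

    // Hypothetical heuristic following the bottom line of the post:
    // Java 7 GA on a Server VM, or Java 6 with the opt-in optimization flags.
    static boolean possiblyAffected(String javaVersion, String vmName, List<String> jvmArgs) {
        // Assumption: only the Server VM is of interest here.
        if (!vmName.contains("Server VM")) {
            return false;
        }
        // Assumption: the GA release reports exactly "1.7.0".
        if (javaVersion.equals("1.7.0")) {
            return true;
        }
        // Java 6 only with the optimization options named in the post.
        if (javaVersion.startsWith("1.6.")) {
            return jvmArgs.contains("-XX:+OptimizeStringConcat")
                    || jvmArgs.contains("-XX:+AggressiveOpts");
        }
        return false;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        String vmName = System.getProperty("java.vm.name");
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println(version + " / " + vmName + " -> possibly affected: "
                + possiblyAffected(version, vmName, jvmArgs));
    }
}
```

Keeping the decision in a plain method with string inputs makes it easy to adjust once the fixed service release is known.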

16 comments:

  1. Another important note, the loop optimization bugs only affect HotSpot Server, so it's only relevant for server apps that run with the Server VM. These use the JDK, which does not update automatically (well, not on Windows at least; in OSes like Linux, I guess the powers that be will wait until this bug is fixed in order to push JDK 7 as a system update).

  2. One of the bugs has been around since May, so Mark Reinhold loses credibility when he says JDK7 passed all its tests with flying colors.

  3. Heath,

    it's true that the bug has been around since then, but let's face it: it seems it has not been caught by the tests at all. So Mark is still right that OpenJDK 7 passed all its tests.

    - Markus

  4. Don't forget to read the _real story behind the Java 7 GA bugs_ by Uwe:
    http://bit.ly/pMXstW

  5. I don't understand the gist of this post. "But what exactly is wrong with Java 7 and why shouldn't you use it after waiting nearly five years for it?" What's wrong is that it can incorrectly execute your code. Sure, it's unlikely, but it's happened to a real, widely-deployed application. How do I know it won't happen to mine? And what has "waiting 5 years for it" got to do with it? Does it work properly or not?

    Apparently the JDK passed all its tests. Clearly it doesn't have enough tests yet.

    "Oracle didn't have time to fix the bugs". They had time to change the defaults so these optimizations are off by default. Wouldn't that have been a better course? Or they could have delayed the release until it worked properly out of the box.

    This reminds me of the Intel floating-point division debacle of years ago. Intel tried to convince everyone it was no big deal, and even offered to review people's applications to see if they needed a fixed chip or not. The users revolted against that arrogance, and Intel replaced all the chips.

    JDK 7 has a serious problem, plain and simple. Anything else is hand-waving.

  6. Ned,

    thanks for your comment. The comparison with the Pentium bug is the nearest I have seen. I am not denying the problem overall, but as you might have seen on the net, there haven't been many people out there able to reproduce the bug. So deactivating the JVM optimizations is fine until we get the first fixpack. In the meantime, I am going to use all of the new Java 7 features and forget about the HotSpot bugs ...

    Thanks,
    -M

  7. S and S media is picking up on the story.
    English:
    http://jaxenter.com/java-7-causes-headaches-for-lucene-and-solr-users-37195.html

    German:
    http://it-republik.de/jaxenter/news/Wie-gravierend-sind-die-Bugs-in-JDK7-wirklich-059938.html

    http://it-republik.de/jaxenter/artikel/JDK7-Bugs-im-Fokus-Wenn-die-Fehler-auftreten-sind-sie-schwerwiegend-3981.html

    http://it-republik.de/jaxenter/artikel/JDK7-Bugs-im-Fokus-Ich-gehe-davon-aus-dass-aktuell-nicht-viele-Anwender-betroffen-sein-werden-3982.html

  8. So Apache should just let everybody download it and have it crash?

    Sure, people should not use a new JVM fresh from the oven, but if some people did try it, can Apache say "tough luck guys, you shouldn't have tried that"?

    Heck, if Apache hadn't raised this, would Oracle be scrambling to fix it right now? Probably not.

    And how would anybody tell whether the application servers, libraries and frameworks they are using are actually safe from this bug? I understand, we are responsible for the choices we make in libraries, frameworks and application servers, so now let us go through all the source code and see if this has any effect. Good luck to those who are not using open source frameworks/libraries/application servers.

    Of course, the right way could be to just use it and see if it hits. It's like a reverse lottery. And if it happens we can all just blame Oracle, right? Well, it's "Oracle", who would've thought they'd be this sloppy. Good luck explaining that to your boss or your angry users.

    Especially when Apache has already warned.

  9. Oracle has known about the wrong results bug since May 13, so it's not an issue of just 5 days; they had *plenty* of time to fix this bug and prevent wrong results and possible data corruption:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738

    I don't know how you cannot take this seriously: "for average use cases this will not affect you"... Seriously? What kind of programs are you writing? Toys? Anyone who cares about the results of their calculations (e.g. money) or cares about their data should take this stuff very seriously.

  10. Fadzian, Robert,

    Thanks for your comments. I guess you misinterpreted this a bit. The only thing I'm trying to say is that this probably isn't as big a deal as everybody is trying to make it.
    First of all, the biggest mistake in enterprise software is to use an x.0 release. But if you already did this, you are probably committed to catching up with every patch and workaround available, and you are prepared for "things happening". I don't think this is such a big issue because there ARE workarounds available. As always, I would have loved to see Oracle being a bit more transparent about which problems still exist. But regarding the switch of default behavior, I could also imagine that this got a bit out of sight, so nobody had it on the radar at all. Yes, I agree: that's not what we all expect to see from a mature product. But this would have happened to Sun as well, in the same way, except that no dispute between them and Apache would have led to such big press around it. Have you read some of the comments in the bug tracker regarding e.g. the change in class loader architecture? Man... Oracle would have closed those bugs and made them invisible. Sun accepted the kick in the butt and apologized for it. Yes, that probably changed with Oracle. They are more silent and act differently. And I believe we have to challenge them to get better with community communications in general.
    But back to the bugs: add the workarounds, wait for the fix. If you have professional support, call them and tell them you're not going to pay until they have fixed it. Fine, that's it.
    Nobody forces you to use the bleeding edge stuff. I have been running some older stuff on top of OpenJDK and none of the tests failed. So unless you run intensive calculations, you probably would not notice. And let's recall that you would never ever think about using an x.0 release in mission-critical areas, right? I wouldn't....

    Thanks,
    Markus

  11. This workaround does not help when someone silently and permanently corrupts their index/database/whatever, and only finds out after the fact!

    Because of this, Java 7 is basically POISON to your data: you might have to completely start over and rebuild everything if you get hit by this bug.

    That's why we issued a warning: if it was just a crash it wouldn't be a big deal.

  12. Robert,

    Thanks again. I completely respect the special situation with your index issues and the effects it could have on your users. And I believe that it was not wrong to issue a warning. Seeing what the press makes out of it, I believe that it could have been issued with a slightly different focus.
    But as always, a messy situation is not a single person's or organization's fault. And the so-called summer slump was probably adding its two cents as well ...

    M

  13. Markus,

    I agree that the press has gotten out of hand with this, being particularly hard on Oracle (deservedly or not is debatable).

    The point I was trying to get across is that, despite that, I think it is fair for Apache to issue a warning, since they need to take care of their users.

    Now, there is also the side effect of what using Java 7 would mean. Plus another reminder not to run *.0 software in production.

  14. I've worked on several major systems using Lucene. Breaking loop compilation, with faulty optimizations, is completely not trivial.

    This sounds like a whole lot of shonkiness, which Oracle hasn't had proper test cases (hint: three above are mentioned) to ensure reliability for, and are then trying to foist -- in an impolite & irresponsible manner -- onto users.

    Maybe that's acceptable for Microsoft. Not for serious enterprise software. You & Oracle should be ashamed.

    Replies
    1. Hi Thomas,

      Thanks for your thoughts. I think that all has been said here. I respect your comment even if I don't know why I should be ashamed. I agree that both parties could have done better. Earlier testing would have been the key. I believe that there are test cases now.

      -m
