Sunday, November 3, 2024

Debugging a JVM Crash for LinkedIn


Introduction

Welcome to Part 3 of our investigation into a JVM crash for LinkedIn. This blog post concludes the investigation we began in Debugging a JVM Crash for LinkedIn – Part 1 and continued in Debugging a JVM Crash for LinkedIn – Part 2. In Part 2, we analyzed the core dump and the instruction where the JVM crashed to find clues as to the cause.

As a reminder, this series is broken down as follows:

Looking for a Fix

From our previous investigations, it’s clear this is a Just-In-Time (JIT) compiler bug. It’s a memory addressing error that was introduced by the JIT compiler; it’s not something a Java programmer could have introduced. But is the compiler calculating the wrong address for the read, or is it reading too much?

The next step in an investigation like this is to check whether there’s an existing JBS bug that’s related to this issue. The OpenJDK community is full of experts dedicated to ensuring that Java has the best, most reliable runtime on the planet. Someone may have already seen this, and potentially even fixed it!

We therefore head over to the JDK Bug System and search for vpxor. What do you know? This issue pops up right away: c2 loop unrolling by 8 results in reading memory past array.

Reading the description, it looks *exactly* like the behaviour we’re seeing:

  • The error is a SIGSEGV
  • It’s occurring on a vpxor instruction
  • It’s occurring very infrequently, and only when the access is at the end of a memory region
  • It only happens when C2 is performing an optimization called loop unrolling

The bug description shows two versions of the problem. One with compressed ordinary object pointers (oops):

vmovq 0x10(%r8,%rdi,1),%xmm0 <- read 8 bytes from byteArray1 (r8)
vpxor 0x10(%r11,%rdi,1),%xmm0,%xmm0 <- read 16 bytes from byteArray2 (r11) and xor them with xmm0

And one without:

vmovq 0x18(%rcx, %r10, 1), %xmm0
vpxor 0x18(%rbp, %r10, 1), %xmm0, %xmm0 <- reading 16 bytes results in reading past mapped memory region

It looks like the 0x18 we’re seeing in the LinkedIn code is the size of the object header when compressed oops is disabled. It’s unclear why it’s being used twice in the address calculation, but our assumption at this point is that it’s not relevant to the problem at hand.
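As a rough check on where 0x10 and 0x18 could come from, the sketch below adds up an array header under an assumed 64-bit HotSpot layout: an 8-byte mark word, a 4- or 8-byte class pointer, and a 4-byte array length, rounded up to 8 bytes. The class and method names are hypothetical, and the layout (including the idea that disabling compressed oops on JDK 11 also widens the class pointer) is an assumption for illustration, not something taken from the bug report.

```java
final class ArrayHeaderSketch {
    /**
     * Offset of the first array element under the ASSUMED layout:
     * mark word (8) + class pointer (4 or 8) + array length (4),
     * rounded up to the next multiple of 8.
     */
    static int arrayBaseOffset(boolean compressedClassPointers) {
        int markWord = 8;
        int classPointer = compressedClassPointers ? 4 : 8;
        int arrayLength = 4;
        int unaligned = markWord + classPointer + arrayLength;
        return (unaligned + 7) & ~7; // round up to 8-byte alignment
    }

    public static void main(String[] args) {
        System.out.println("compressed:   0x" + Integer.toHexString(arrayBaseOffset(true)));
        System.out.println("uncompressed: 0x" + Integer.toHexString(arrayBaseOffset(false)));
    }
}
```

Under these assumptions the two results line up with the 0x10 and 0x18 displacements in the two listings from the bug description.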

Here’s where the LinkedIn code crashes (now including the instruction right before the crash):

0x7ffb860d7058:      vmovq  0x18(%rcx,%r9,1),%xmm0
0x7ffb860d705f:      vpxor  0x18(%rdi,%r9,1),%xmm0,%xmm0

Reading the bug description and comments further, the problem is that, although vmovq operates on 8 bytes, vpxor reads 16 bytes (see MOVQ and PXOR). This is fine during the vectorized main loop if the remaining length of the vector is >= 16 bytes. However, if it’s less than that, we get this erroneous read. The suggested fix is to only allow vpxor to be used when there are >= 16 bytes remaining. Otherwise, the remaining bytes are processed in the unvectorized post loop. The bug here is that the wrong instruction is being selected under these circumstances.
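The main-loop and post-loop split described above can be modelled at the Java level. This is only an illustration of the guard condition, not what C2 actually emits: the class name, the WIDE constant, and the overrun helper are all hypothetical. The main loop runs only while a full 16-byte read stays in bounds, and the post loop handles whatever is left one byte at a time.

```java
import java.util.Arrays;

final class XorTailSketch {
    // Assumed width of the wide read (vpxor with a memory operand): 16 bytes.
    static final int WIDE = 16;

    /** Returns a XOR b; both arrays must have the same length. */
    static byte[] xor(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        byte[] out = new byte[a.length];
        int i = 0;
        // Main loop: entered only while at least 16 bytes remain.
        for (; a.length - i >= WIDE; i += WIDE) {
            for (int j = i; j < i + WIDE; j++) {
                out[j] = (byte) (a[j] ^ b[j]);
            }
        }
        // Post loop: fewer than 16 bytes remain, so go byte by byte.
        for (; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    /** How many bytes a 16-byte read at 'offset' would run past the end. */
    static int overrun(int length, int offset) {
        return Math.max(0, offset + WIDE - length);
    }

    public static void main(String[] args) {
        byte[] a = new byte[24];
        byte[] b = new byte[24];
        Arrays.fill(a, (byte) 0x0f);
        Arrays.fill(b, (byte) 0xf0);
        byte[] out = xor(a, b);
        System.out.println(out.length + " bytes XORed, last = " + (out[23] & 0xff));
        System.out.println("overrun of a wide read at offset 16: " + overrun(a.length, 16));
    }
}
```

With a 24-byte array, a 16-byte read starting at offset 16 would touch 8 bytes beyond the data. If the array happens to sit at the very end of a mapped region, that is the kind of access that can fault.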

Now that we understand the bug and are confident that it’s what LinkedIn is seeing, what do we do next?

Well, when we initially encountered this issue we looked at the bug status and saw that it had not only already been fixed in “tip” (the latest JDK repo), but that the fix had also been backported to JDK 11 and would be released as part of the 11.0.14 Patch Set Update (PSU) in January. However, we encountered this before the 11.0.14 release date! We wanted to mitigate the issue right away, so we implemented a workaround.

Workaround

Since this bug was encountered before the release of 11.0.14, we needed a workaround while waiting for the fix to land.

In this case, since it’s a JIT compilation bug, the workaround is to simply stop compiling the offending method. All we need to do is tell the JVM to exclude the method from compilation, and it should then only execute it in interpreted mode. Since we’re speculating that the issue could be coming from a method inlined in the compilation of initSecContext, we should disable the compilation of that method and of any method it calls that might have included the location of the crash.

To disable the compilation of a method, one approach is to use the CompileCommand flag when we launch the JVM. We found several ways to specify this exclusion that work, as follows:

-XX:CompileCommand=exclude,sun/security/jgss/krb5/Krb5Context.initSecContext
-XX:CompileCommand=exclude,sun/security/jgss/krb5/Krb5Context.initSecContext()
-XX:CompileCommand=exclude,sun/security/jgss/krb5/Krb5Context,initSecContext
-XX:CompileCommand=exclude,sun.security.jgss.krb5.Krb5Context::initSecContext
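When a flag like this is rolled out across many services, it can help to confirm at startup that the running JVM actually received it. The sketch below is one possible way to do that, using the standard RuntimeMXBean to read the JVM input arguments. The class and helper names are hypothetical, and the string match is deliberately loose because several spellings of the option are accepted.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class CompileCommandCheck {
    /** Builds an exclude option in the slash-separated class, dot, method form. */
    static String excludeOption(String className, String methodName) {
        return "-XX:CompileCommand=exclude,"
                + className.replace('.', '/') + "." + methodName;
    }

    /** True if any argument is a CompileCommand exclude that mentions the method. */
    static boolean hasExclude(List<String> jvmArgs, String methodName) {
        return jvmArgs.stream().anyMatch(arg ->
                arg.startsWith("-XX:CompileCommand=exclude,")
                        && arg.contains(methodName));
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with, as reported by the runtime bean.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("expected option: "
                + excludeOption("sun.security.jgss.krb5.Krb5Context", "initSecContext"));
        System.out.println("exclude present: " + hasExclude(jvmArgs, "initSecContext"));
    }
}
```

This only checks that the option was passed; it does not prove the method is running interpreted.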

Note that a workaround like this can affect performance, as the method will no longer be compiled; it will run in interpreted mode. If it’s an expensive method that is called often, performance can suffer. However, correctness generally trumps performance, so this kind of workaround is usually essential for stability until a fix for the issue lands.

We were lucky in this case that the method in question was not seen to be performance-critical for LinkedIn. As a result, the workaround was a good stopgap until the fix became available in an OpenJDK update.

The Fix

When LinkedIn upgraded to version 11.0.14 of the Microsoft Build of OpenJDK, they got the fix described in the JBS issue and this crash went away. Once again, the OpenJDK community came to the rescue to find the root cause of a nasty bug and implement a fix!

Conclusion

This was an interesting effort that showed some of the work that Microsoft’s Java Engineering Group does for our customers on a regular basis, and that delivered a good result for LinkedIn. It also served as a reminder of how the generosity and expertise of many JVM engineers around the world help make Java the most stable runtime on the planet!

For further information on using HotSpot error logs to debug JVM crashes, I recommend that you check out Fatal Error Log – Troubleshooting Guide for HotSpot VM and Andrei Pangin – JVM crash dump analysis.

Thanks for reading!
