Tuesday, April 16, 2024

Developers disassemble! Use Java and hsdis to see it all.


Use the HotSpot Disassembler to see what’s happening to your code.

Figure 1 below is what you might see when you ask your Java Virtual Machine (JVM) to show you the output of a just-in-time (JIT) compilation performed by the HotSpot JVM after it optimized your program to take advantage of the powerful features in your CPU.

Figure 1. An image of disassembled Java native code

This output comes from a library called hsdis (HotSpot Disassembler). It’s a plugin for the HotSpot JVM, Oracle’s default JVM for Java, which provides disassembly of JIT-compiled native code back into human-readable assembly language code.

There’s a lot to unpack here, so let’s start with a refresher on how HotSpot executes Java programs.

Once you have written a program in Java, and before you can run it, you must compile the program’s Java source code into bytecode, the language of the JVM. This compilation can happen in your IDE, via your build tools (such as Maven or Gradle), or via command-line tools such as javac.

The bytecode is written into class files. When you run the program, the class files are loaded into the JVM, and then the bytecode is executed by the JVM’s bytecode interpreter.

While the JVM is running your program, the JVM also profiles the application in real time in order to decide which parts are executed a lot (the hot spots) and might benefit from being compiled from bytecode into native code for the CPU on which your JVM is running.

This transformation from interpreted bytecode into native execution on your CPU is performed at runtime and is known as JIT compilation, in contrast to ahead-of-time (AOT) compilation used by languages such as C/C++. One of the advantages of JIT compilation over AOT compilation is the ability to optimize based on observed behavior at runtime, which is known as profile-guided optimization (PGO).

HotSpot JIT compilation isn’t your only choice, of course. Java programs can also be AOT-compiled using the GraalVM Native Image technology, which might be of interest to you if faster startup time and lower runtime memory overhead are important to you, such as in a serverless microservices environment.

Because JIT compilation happens at runtime, the compilation process consumes CPU and memory resources that might otherwise be used by the JVM to run your application. That means there’s a performance cost that isn’t present with AOT-compiled binaries.
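To get a rough feel for that cost, the standard java.lang.management API exposes the accumulated JIT compilation time. The following is a minimal sketch, not a benchmark: the class name JitTimeProbe and the workload method sumOfSquares are made up for illustration, and the numbers will vary by JVM and machine.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

final class JitTimeProbe {

    // A deliberately hot method so the JIT compilers have something to do.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Returns the accumulated JIT compilation time in milliseconds,
    // or -1 if this JVM has no compiler or does not monitor the time.
    static long totalCompilationMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }

    public static void main(String[] args) {
        long before = totalCompilationMillis();
        long checksum = 0;
        for (int i = 0; i < 20_000; i++) {
            checksum += sumOfSquares(1_000);
        }
        long after = totalCompilationMillis();
        System.out.println("checksum = " + checksum);
        System.out.println("JIT time before = " + before + " ms, after = " + after + " ms");
    }
}
```

The difference between the two readings approximates the time the compiler threads spent while the workload was running; it includes compilation of JDK classes as well as your own code.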

For the rest of this article, I’ll discuss JIT compilation using HotSpot, examining exactly how that process works.

Java client and Java server

HotSpot contains two separate JIT compilers. The first is C1 (often called the client compiler); the other is C2 (the server compiler):

◉ The C1 compiler begins working quickly and uses fast, simple optimizations to help boost application startup time. In other words, your program starts up faster.

◉ The C2 compiler spends longer collecting the profiling information needed to support more-advanced optimization techniques. Thus, your program takes a bit longer to start up but will usually run faster once it has started.

The “client” and “server” parts of the names were assigned during a time in Java’s history when the performance characteristics of a typical end user device (such as a PC or laptop) and a server were very different. For example, back in the days of Java 5, the 32-bit Windows Java Runtime Environment (JRE) distribution contained only the C1 compiler because a typical desktop PC might have only one or two CPU threads. In that era, slowing the startup of a desktop program’s execution to perform advanced C2 optimizations would have a negative effect on the end user’s experience.

Today, end user devices are closer to servers in their processing power. Recently, the default behavior of the JVM has been to combine the strengths of both C1 and C2 in a mode called tiered compilation, which provides the benefits of the faster optimizations of C1 and the higher peak performance of C2.

Tiered compilation can be controlled using the -XX:+TieredCompilation and -XX:-TieredCompilation switches.

How the HotSpot JIT compilers work

The basic unit of JIT compilation is the method, and the JVM will use invocation counters to discover which methods are being called most frequently. These are termed the hot methods.

The JIT system can operate at a unit smaller than a whole method when it observes a loop with many back branches, meaning the loop reaches the end of its body and decides it has not finished and needs to jump back to the start for another iteration. When the number of back branches (called the loop back-edge counter) reaches a threshold, the JIT system can replace the interpreted loop bytecode with a JIT-compiled version. This is known as an on-stack replacement (OSR) compilation.
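The two triggers can be provoked with a small program. This is only a sketch: the class HotSpotTriggers and its methods are invented for illustration, and whether and when each compilation happens depends on your JVM version and flags.

```java
final class HotSpotTriggers {

    // Small method invoked many times: a candidate for a standard
    // (whole-method) compilation driven by its invocation counter.
    static int addOne(int value) {
        return value + 1;
    }

    // Invoked once, but its loop takes many back branches: a candidate
    // for an on-stack replacement (OSR) compilation.
    static long longRunningLoop(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += i % 7;
        }
        return total;
    }

    public static void main(String[] args) {
        int counter = 0;
        for (int i = 0; i < 100_000; i++) {
            counter = addOne(counter);
        }
        System.out.println("counter = " + counter);
        System.out.println("total = " + longRunningLoop(50_000_000));
    }
}
```

If you run it with the -XX:+PrintCompilation switch, the JVM logs each compilation as it happens; OSR compilations are marked with a % character in that log. The loop in main is itself long-running, so it may show up as an OSR compilation too.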

The code cache. Native code produced by the JIT compilers is stored in a memory region of the JVM called the code cache. Prior to JDK 9, the code cache was a single contiguous piece of memory where the following three main types of native code found in the JVM were stored together:

◉ Profiled code

◉ Nonprofiled (fully optimized) code

◉ JVM internal code

The nonsegmented code cache size was controlled by the -XX:ReservedCodeCacheSize=n switch.

Beginning with JDK 9, the code cache layout was improved by JEP 197 (Segmented code cache). This JEP splits the cache into three regions, depending on the three native code types, to reduce fragmentation and better manage the native code footprints. The segmented code cache sizes are controlled by the following three switches:

◉ -XX:NonProfiledCodeHeapSize=n sets the size in bytes of the code heap containing nonprofiled methods.

◉ -XX:ProfiledCodeHeapSize=n sets the size in bytes of the code heap containing profiled methods.

◉ -XX:NonMethodCodeHeapSize=n sets the size in bytes of the code heap containing nonmethod code.
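You can see how the code cache is laid out in a running JVM through the standard memory pool MXBeans. The sketch below is illustrative only: the class name CodeCachePools is made up, and matching on the strings "CodeHeap" and "CodeCache" is an assumption about how HotSpot usually names these pools, so other JVMs may report nothing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class CodeCachePools {

    // Collects the names of non-heap memory pools that look like code cache
    // regions. The name matching is an assumption, not a documented contract.
    static List<String> codeCachePoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            boolean looksLikeCode = name.contains("CodeHeap") || name.contains("CodeCache");
            if (pool.getType() == MemoryType.NON_HEAP && looksLikeCode) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = codeCachePoolNames();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null for an invalid pool
            if (usage != null && names.contains(pool.getName())) {
                System.out.println(pool.getName()
                        + ": used=" + usage.getUsed() + " bytes"
                        + ", max=" + usage.getMax() + " bytes"); // max is -1 if undefined
            }
        }
    }
}
```

On a JVM with a segmented code cache you would expect several pools to be listed; with a single contiguous cache you would expect one.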

JEP draft 8279184 (named “Instruction issue cache hardware accommodation”) aims to improve the code cache performance even further.

JIT compilation without tiered compilation. Table 1 shows the thresholds for triggering method compilation on x86 systems if tiered compilation is disabled.

Table 1. Triggering thresholds

Compiler    Invocations
C1          1,500
C2          10,000

The invocation threshold can be controlled using the -XX:CompileThreshold=n switch. If you wish to control the thresholds for OSR compilation, you can express the back-edge trigger as a percentage of the CompileThreshold value using the -XX:OnStackReplacePercentage=n switch.

JIT with tiered compilation. When tiered compilation is enabled (it’s been the default since JDK 8), the JIT system will use the five tiers of optimization shown in Table 2. A method may end up being JIT-compiled multiple times at different tiers as the JVM better understands the method’s usage through profiling.

Table 2. Tiers of optimization

The trigger thresholds at each tier on a typical Linux x86 system are as follows:

◉ The tier 2 compile threshold is 0.

◉ The tier 3 invocation threshold is 200.

◉ The tier 3 compile threshold is 2,000.

◉ The tier 4 invocation threshold is 5,000.

◉ The tier 4 compile threshold is 15,000.
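To check the values on your own JVM rather than rely on the typical ones, you can query the flags at runtime. This sketch uses HotSpotDiagnosticMXBean, which is specific to HotSpot-based JDKs. The flag names in the FLAGS array are my assumption about which HotSpot options correspond to the thresholds listed above, and the class name TierThresholds is made up; any flag your JVM does not recognize is reported as not available.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class TierThresholds {

    static final String NOT_AVAILABLE = "<not available>";

    // Assumed flag names for the thresholds discussed in the text.
    static final String[] FLAGS = {
        "TieredCompilation",
        "Tier2CompileThreshold",
        "Tier3InvocationThreshold",
        "Tier3CompileThreshold",
        "Tier4InvocationThreshold",
        "Tier4CompileThreshold"
    };

    // Returns the current value of a HotSpot flag as a string, or
    // NOT_AVAILABLE if this JVM does not know the flag.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return NOT_AVAILABLE;
            }
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the flag (or the MXBean) does not exist on this JVM.
            return NOT_AVAILABLE;
        }
    }

    public static void main(String[] args) {
        for (String flag : FLAGS) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```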

Some typical compilation sequences are shown in Table 3.

Table 3. Typical compilation sequences

Compiler threads. The JVM allows you to control the number of JIT compiler threads with the -XX:CICompilerCount=n switch. Each compiler thread contains a queue, and when a method or loop reaches a compilation threshold, a compilation task will be created and inserted into one of the compiler queues. When a compilation task is removed from the queue, it’s passed to the JIT compiler for transformation into optimized native code that’s stored in the code cache. See Figure 2 for details.

Figure 2. Compilation tasks in the JIT compiler queues

The role of the HotSpot Disassembler

The OpenJDK HotSpot creators added a feature that allows developers to inspect the native code that’s created by the JIT compilers and stored in the code cache. However, this compiled code is in binary format ready for execution by the CPU, so it’s not human-readable code.

Fortunately, you can use hsdis, the HotSpot Disassembler, to turn that native code back into human-readable assembly language code.

When the JVM starts up, it checks for the presence of the hsdis library and, if it’s found, the JVM will allow you to use additional switches to control the disassembly output for various types of native code. These hsdis switches are all classified as diagnostic, so you must first unlock them using the -XX:+UnlockDiagnosticVMOptions switch.

Once they’re unlocked, you can request disassembly output by using the switches shown in Table 4.

Table 4. Switches for requesting disassembly output

Note that when it’s enabled, hsdis is invoked after each blob of native code is inserted into the code cache. The process of disassembling the native code (which can be quite sizable if extensive inlining occurred) instruction by instruction into human-readable assembly language code (a large chunk of text) and writing that text to the console or a log file is a fairly expensive operation best done in your development environment and not in production.

By the way, if you wish to see only the disassembly of specific methods, you can control the output via the -XX:CompileCommand switch. For example, to output the assembly language code for the method length() in java.lang.String you would use

java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,java/lang/String.length

Building hsdis. The hsdis source files found in the OpenJDK repository are wrapper code. The actual disassembly is performed by external libraries found within the GNU Binutils project.

Let’s say you want to build hsdis using OpenJDK 17 and binutils version 2.37. The hsdis plugin must be compiled for each operating system on which you wish to use it. The hsdis build process will produce a file in the dynamic library format for each operating system, as shown in Table 5.

Table 5. File formats according to OS

These instructions assume that you have a working UNIX-like build environment available. If you would prefer to use a prebuilt hsdis plugin as a binary for your operating system and architecture, I’ve built them for your convenience. Download them here, and see Figure 3 for the options.

Figure 3. The prebuilt hsdis binaries

Want to build your own? Here’s what to do.

◉ Download binutils-2.37 here.

◉ Unpack the binutils download using the tar -xzf binutils-2.37.tar.gz command. The build examples below assume you’re unpacking into your home directory.

◉ Clone the OpenJDK 17 source code from GitHub by using git clone.

Note that on recent JDKs, the hsdis files are in the folder <jdk>/src/utils/hsdis.

Below are some specific builds.

Building on Linux (amd64). Perform the following steps:

cd jdk17/src/utils/hsdis
make BINUTILS=~/binutils-2.37 ARCH=amd64
# Produces build/linux-amd64/hsdis-amd64.so

Building on Linux (32-bit ARM such as the Raspberry Pi v1 or v2). Perform the following steps:

cd jdk17/src/utils/hsdis
make BINUTILS=~/binutils-2.37 ARCH=arm
# Produces build/linux-arm/hsdis-arm.so

Building on Linux (64-bit ARM such as the Raspberry Pi v3, v4, or 400). Perform the following steps:

cd jdk17/src/utils/hsdis
make BINUTILS=~/binutils-2.37 ARCH=aarch64
# Produces build/linux-aarch64/hsdis-aarch64.so

Building on macOS (amd64). Perform the following steps:

cd jdk17/src/utils/hsdis
make BINUTILS=~/binutils-2.37 ARCH=amd64
# Produces build/macosx-amd64/hsdis-amd64.dylib

Building on macOS (ARM M1). Perform the following steps:

cd jdk17/src/utils/hsdis
make BINUTILS=~/binutils-2.37 ARCH=aarch64
# Produces build/macosx-aarch64/hsdis-aarch64.dylib

Building on Windows (using Cygwin and MinGW). Building hsdis on Windows is more involved and uses the Cygwin tools, which provide a Linux-like build environment. The following steps have been tested on Windows 10.

First, download and install Cygwin using the installer found here.

Then, install the following additional packages:

gcc-core                   11.2.0-1

mingw64-x86_64-gcc-core    11.2.0-1

mingw64-x86_64-gcc-g++     11.2.0-1

make                          4.3-1

If you didn’t select these packages at installation time, rerun the installer to select additional packages.

Then to build a Windows 64-bit DLL file, perform the following step:

make OS=Linux MINGW=x86_64-w64-mingw32 BINUTILS=~/binutils-2.37/ ARCH=amd64
# Produces build/Linux-amd64/hsdis-amd64.dll

Alternatively, to build a Windows 32-bit DLL file, perform the following step:

make OS=Linux MINGW=x86_64-w64-mingw3 BINUTILS=~/binutils-2.37/ ARCH=i386
# Produces build/Linux-i586/hsdis-i386.dll

Installing hsdis

Now that you have built or downloaded hsdis, you must put it in a place where the JVM can find it. The following JDK paths are searched:

<JDK_HOME>/lib/server/hsdis-<arch>.<extension>

<JDK_HOME>/lib/hsdis-<arch>.<extension>

Additionally, the paths shown in Table 6 are searched on each operating system.

Table 6. Additional search paths

Hint: I recommend using an environment variable to point to hsdis. This saves you from having to copy the binary into every JDK you use.
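If you are unsure whether the file ended up in the right place, a small program can build the two JDK paths listed above for the running JVM and report whether anything is there. This is a sketch under a few assumptions: the class name HsdisLocator is made up, the file extensions are taken from the build outputs shown earlier (.so, .dylib, .dll), and treating an os.arch value of x86_64 as amd64 is a guess for Intel Macs. It does not cover the operating system specific paths from Table 6.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;

final class HsdisLocator {

    // Maps an os.name value to the dynamic library extension.
    static String extensionFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) {
            return "dylib";
        }
        if (os.contains("windows")) {
            return "dll";
        }
        return "so";
    }

    // Maps an os.arch value to the architecture suffix used in the file name.
    // Treating "x86_64" as "amd64" is an assumption.
    static String archFor(String osArch) {
        return osArch.equals("x86_64") ? "amd64" : osArch;
    }

    // Builds the two JDK-relative candidate paths described in the text.
    static List<Path> candidatePaths(String javaHome, String osName, String osArch) {
        String file = "hsdis-" + archFor(osArch) + "." + extensionFor(osName);
        return List.of(
                Paths.get(javaHome, "lib", "server", file),
                Paths.get(javaHome, "lib", file));
    }

    public static void main(String[] args) {
        List<Path> paths = candidatePaths(
                System.getProperty("java.home"),
                System.getProperty("os.name"),
                System.getProperty("os.arch"));
        for (Path path : paths) {
            System.out.println((Files.exists(path) ? "found    " : "missing  ") + path);
        }
    }
}
```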

Experimenting in the JITWatch sandbox

If you have experience with C++, you may be familiar with the Compiler Explorer by Matt Godbolt, which lets you test snippets of C++ code using various AOT compilers and view the native code that’s produced.

In the Java world you can do something similar with a tool I’ve written called JITWatch. JITWatch processes the JIT compilation logs that are output by the JVM and explains the optimization decisions made by the JIT compilers.

JITWatch has a sandbox mode in which you can experiment with Java programs in a built-in editor, and then you can click a single button to compile and execute your programs and inspect the JIT behavior in the main JITWatch user interface. I’ll use the sandbox to illustrate and explain the output of hsdis. Figure 4 shows the rapid feedback loop with JITWatch.

Figure 4. Rapid feedback loop for JIT experimentation with JITWatch

Installing JITWatch. Download the latest release (at the time of writing, it’s version 1.4.7) from my GitHub repo or build from source code using the following:

git clone https://github.com/AdoptOpenJDK/jitwatch.git
cd jitwatch
mvn clean package
# Produces an executable jar in ui/target/jitwatch-ui-shaded.jar

Run JITWatch with the following command:

java -jar <path to JITWatch jar>

The sandbox comes with a set of examples that exercise various JIT optimizations.

An assembly language primer

Before looking at the assembly language code from an example Java program in JITWatch, here’s a short primer in Intel x86-64 assembly language. For this platform, each disassembled instruction in assembly language takes the following form:

<address> <instruction mnemonic> <operands>

◉ The address is shown for identifying the target of a jump instruction.

◉ The mnemonic is the short name for the instruction.

◉ The operands can be registers, constants, addresses, or a mixture (in the case of offset addressing).

Note that by default hsdis outputs a format called AT&T assembly, which orders the operands of each instruction as follows:

<mnemonic> <src> <dst>

But you can switch to the following Intel format by using the JVM’s -XX:PrintAssemblyOptions=intel switch:

<mnemonic> <dst> <src>

Table 7. Commonly used registers

Note that the rax, rbx, rcx, and rdx registers can be accessed in 32-, 16-, and 8-bit modes by referring to them as shown in Figure 5. When it’s dealing with the Java int type (32 bits), the accumulator will be accessed as eax, and when it’s dealing with the Java long type (64 bits), the accumulator will be accessed as rax.

Figure 5. The rax and eax registers

Registers r8 to r15 can also be accessed in 32-, 16-, and 8-bit modes using a suffix; see Figure 6.

Figure 6. The r8 register
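If the idea of narrower views onto the same register is unfamiliar, Java’s narrowing casts give a close analogy: each cast keeps only the low bits of a 64-bit value, much as eax, ax, and al expose the low 32, 16, and 8 bits of rax. The class RegisterViews below is purely illustrative and only models reading the low bits, not how writes to a partial register behave.

```java
final class RegisterViews {

    // Low 32 bits: the view that eax gives of rax.
    static int low32(long value) {
        return (int) value;
    }

    // Low 16 bits: the view that ax gives of rax.
    static short low16(long value) {
        return (short) value;
    }

    // Low 8 bits: the view that al gives of rax.
    static byte low8(long value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        long rax = 0x1122334455667788L;
        System.out.printf("64-bit: %016x%n", rax);
        System.out.printf("32-bit: %08x%n", low32(rax));
        System.out.printf("16-bit: %04x%n", low16(rax));
        System.out.printf(" 8-bit: %02x%n", low8(rax));
    }
}
```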

Common instructions. Table 8 shows some commonly encountered assembly language instructions and their meaning. This information is in the Intel format.

Table 8. Common assembly language instructions

There is an excellent post on Stack Overflow explaining the benefits behind several nonobvious assembly language idioms.

Using JITWatch to inspect native code

It’s time to run some Java code and inspect the disassembled JIT output. The following example is a simple program that tests whether the JIT compilers can optimize a simple loop that modifies an array. The code creates an array of 1,024 int elements and uses the incrementArray(int[] array, int constant) method to access each element of the array in order, adding a constant to the value of each array element and storing it back in the array.

Here is the source code.

public class DoesItVectorise
{
    public DoesItVectorise()
    {
        int[] array = new int[1024];

        for (int i = 0; i < 1_000_000; i++)
        {
            incrementArray(array, 1);
        }

        for (int i = 0; i < array.length; i++)
        {
            System.out.println(array[i]);
        }
    }

    public void incrementArray(int[] array, int constant)
    {
        int length = array.length;

        for (int i = 0; i < length; i++)
        {
            array[i] += constant;
        }
    }

    public static void main(String[] args)
    {
        new DoesItVectorise();
    }
}

The Java bytecode for the incrementArray method is the following:

 0: aload_1                     // load the reference of 'array'
 1: arraylength                 // call the 'arraylength' instruction to get the length of the array
 2: istore_3                    // store the array length into local variable 3 'length'
 3: iconst_0                    // push int 0 onto the stack
 4: istore          4           // store into local variable 4 'i'
 6: iload           4           // load local variable 4 'i' and push onto the stack
 8: iload_3                     // load local variable 3 'length' and push onto the stack
 9: if_icmpge       26          // if (i >= length) jump to BCI 26 (return)
12: aload_1                     // else load the reference of 'array' and push onto the stack
13: iload           4           // load local variable 4 'i' and push onto the stack
15: dup2                        // duplicate the top 2 values on the stack
16: iaload                      // load the value of array[i] and push onto the stack
17: iload_2                     // load local variable 2 'constant' and push onto the stack
18: iadd                        // add array[i] and 'constant' and push result onto stack
19: iastore                     // store the result back into array[i]
20: iinc            4, 1        // increment local variable 4 'i' by 1
23: goto            6           // jump back to BCI 6
26: return

The bytecode is a faithful representation of the Java source code with no optimizations performed.

The class to be tested is called DoesItVectorise, and it asks whether the JIT can identify an opportunity to use the features of a modern CPU to vectorize the program so that it can update more than one array element per loop iteration using the wide SIMD registers.

These registers can pack multiple 32-bit int elements into a single register and modify them all with a single CPU instruction, thereby completing the array update with fewer loop iterations.

The incrementArray method is called 1 million times, which should be enough for the JIT compiler to obtain a good profile of how the method behaves.

I’ll load this program into the JITWatch sandbox and see what happens on my Intel i7-8700 CPU. This chip supports the Intel SSE4.2 and AVX2 instruction sets, which means the xmm 128-bit and ymm 256-bit registers should be available for a vectorization optimization.

Step 1: Start JITWatch and open the sandbox, as shown in Figure 7.

Figure 7. The main JITWatch interface

Step 2: Open DoesItVectorise.java from the samples and click Run. Note that if hsdis is not detected, JITWatch will offer to download it for the current OS and architecture. See Figure 8.

Figure 8. The JITWatch sandbox interface showing the code example

Step 3: After execution is complete, locate the JITWatch main window and inspect the compilations of the incrementArray method. On my computer, that method was JIT-compiled four times before reaching its final state of optimization, and that all happened in the first 100 milliseconds of the program’s execution. See Figure 9.

Figure 9. The method was compiled four times.

Step 4: Switch to the three-view screen and ensure that the final (fourth, on my system) compilation is selected. If hsdis is correctly installed on your system, you will see the disassembly output for the incrementArray method, as shown in Figure 10.

Figure 10. The JITWatch three-view screen showing source, bytecode, and assembly language code

Understanding the JITWatch output

When the HotSpot JIT compiler creates the native code, it includes comments that help someone reading the disassembled code to relate it back to the original program. These comments include bytecode index (BCI) references, which allow JITWatch to relate assembly language instructions to the bytecode of the program. The Java class file format contains a LineNumberTable data structure that JITWatch uses to map the bytecode instructions back to the Java source code.

On entry, register rdx points to the int[] object named array, while rcx contains the int value named constant, as follows:

# {method} {0x00007f92cb000410} 'incrementArray' '([II)V' in 'DoesItVectorise'
# this:     rsi:rsi   = 'DoesItVectorise'
# parm0:    rdx:rdx   = '[I'
# parm1:    rcx       = int

The native code performs a stack bang test to ensure sufficient stack space is available; it does this by attempting to store the contents of eax at a fixed offset from the stack pointer rsp. (If this address falls within a guard page, a StackOverflowError will be thrown.)

[Verified Entry Point]

0x00007f92f523b8c0: mov DWORD PTR [rsp-0x14000],eax

The code then loads the array length (an int) into register ebx. With the -XX:+UseCompressedOops switch enabled, the array object header consists of a 64-bit mark word and a compressed 32-bit klass word, so the array length field is found after these two values at offset [rdx + 12 (0xc) bytes]. (A klass word in an object header points to the internal type metadata for the object.)

0x00007f92f523b8cc: mov ebx,DWORD PTR [rdx+0xc]  ; implicit exception: dispatches to 0x00007f92f523ba71

                                                 ;*arraylength {reexecute=0 rethrow=0 return_oop=0}

                                                 ; – DoesItVectorise::incrementArray@1 (line 20)
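The offset arithmetic is simple enough to check by hand. The following sketch just adds up the header field sizes described above; the class name ArrayLayout and its constants are illustrative, and the figures apply only to the header layout described in the text (64-bit mark word plus compressed 32-bit klass word), not to every JVM configuration.

```java
final class ArrayLayout {

    static final int MARK_WORD_BYTES = 8;       // 64-bit mark word
    static final int KLASS_WORD_BYTES = 4;      // compressed 32-bit klass word
    static final int LENGTH_FIELD_BYTES = 4;    // int array length field
    static final int INT_BYTES = Integer.BYTES; // 4 bytes per int element

    // Offset of the array length field from the start of the object.
    static int lengthOffset() {
        return MARK_WORD_BYTES + KLASS_WORD_BYTES;
    }

    // Offset of the first element, directly after the length field.
    static int firstElementOffset() {
        return lengthOffset() + LENGTH_FIELD_BYTES;
    }

    // Offset of element 'index' from the start of the object.
    static int elementOffset(int index) {
        return firstElementOffset() + index * INT_BYTES;
    }

    public static void main(String[] args) {
        System.out.printf("length field at offset %d (0x%x)%n", lengthOffset(), lengthOffset());
        System.out.printf("element 0 at offset %d (0x%x)%n", elementOffset(0), elementOffset(0));
        System.out.printf("element 8 at offset %d (0x%x)%n", elementOffset(8), elementOffset(8));
    }
}
```

The length field lands at 0xc, matching the mov instruction above, and the first element lands at 0x10, which is the displacement used by the array accesses in the listings that follow.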

The next code tests whether the array length in ebx is 0. If it is, execution jumps to the end of this procedure to clean up and return.

0x00007f92f523b8cf: test ebx,ebx

0x00007f92f523b8d1: jbe L0007  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}

                               ; – DoesItVectorise::incrementArray@9 (line 22)

What you will see next is the assembly language code representing the HotSpot loop unrolling optimization. To reduce array bounds checking, the loop is split into the following three parts:

◉ A preloop sequence

◉ A main unrolled loop performing most of the iterations with no bounds checking

◉ A postloop to complete any remaining iterations

The next section shows the setup and preloop sequence.

0x00007f92f523b8d7: mov ebp,ebx

0x00007f92f523b8d9: dec ebp

0x00007f92f523b8db: cmp ebp,ebx

0x00007f92f523b8dd: data16 xchg ax,ax

0x00007f92f523b8e0: jae L0008

0x00007f92f523b8e6: mov r11d,edx

0x00007f92f523b8e9: shr r11d,0x2

0x00007f92f523b8ed: and r11d,0x7

0x00007f92f523b8f1: mov r10d,0x3

0x00007f92f523b8f7: sub r10d,r11d

0x00007f92f523b8fa: and r10d,0x7

0x00007f92f523b8fe: inc r10d

0x00007f92f523b901: cmp r10d,ebx

0x00007f92f523b904: cmovg r10d,ebx

0x00007f92f523b908: xor esi,esi

0x00007f92f523b90a: xor eax,eax  ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}

                                 ; – DoesItVectorise::incrementArray@12 (line 24)

             L0000: add DWORD PTR [rdx+rax*4+0x10],ecx  ;*iastore {reexecute=0 rethrow=0 return_oop=0}

                                                        ; – DoesItVectorise::incrementArray@19 (line 24)

0x00007f92f523b910: mov r9d,eax

0x00007f92f523b913: inc r9d  ;*iinc {reexecute=0 rethrow=0 return_oop=0}

                             ; – DoesItVectorise::incrementArray@20 (line 22)

0x00007f92f523b916: cmp r9d,r10d

0x00007f92f523b919: jge L0001  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}

                               ; – DoesItVectorise::incrementArray@9 (line 22)

0x00007f92f523b91b: mov eax,r9d

0x00007f92f523b91e: xchg ax,ax

0x00007f92f523b920: jmp L0000

             L0001: mov r10d,ebx

0x00007f92f523b925: add r10d,0xffffffc1

0x00007f92f523b929: mov r11d,0x80000000

0x00007f92f523b92f: cmp ebp,r10d

0x00007f92f523b932: cmovl r10d,r11d

0x00007f92f523b936: cmp r9d,r10d

0x00007f92f523b939: jge L0009

The next two instructions show the loop is going to be vectorized, as follows:

◉ The value of the 32-bit int constant in general purpose register ecx is copied into the low 32 bits of the 128-bit SIMD register xmm0 using the vmovd instruction.

◉ Then, using the vpbroadcastd instruction, that 32-bit value in xmm0 is broadcast into 256-bit SIMD register ymm0, which now contains eight copies of the value of constant.

0x00007f92f523b93f: vmovd xmm0,ecx

0x00007f92f523b943: vpbroadcastd ymm0,xmm0

More unrolling setup is next.

0x00007f92f523b948: inc eax

0x00007f92f523b94a: mov r8d,0xfa00

             L0002: mov edi,r10d

0x00007f92f523b953: sub edi,eax

0x00007f92f523b955: cmp r10d,eax

0x00007f92f523b958: cmovl edi,esi

0x00007f92f523b95b: cmp edi,0xfa00

0x00007f92f523b961: cmova edi,r8d

0x00007f92f523b965: add edi,eax

0x00007f92f523b967: nop WORD PTR [rax+rax*1+0x0]  ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}

                                                  ; – DoesItVectorise::incrementArray@12 (line 24)

Here is the unrolled part of the loop, which performs vectorized array addition using the vpaddd and vmovdqu instructions.

◉ vpaddd performs packed integer addition between ymm0 (containing eight copies of constant) and eight int values read from the array, storing the result in 256-bit SIMD register ymm1.

◉ vmovdqu stores the eight incremented packed integers in ymm1 back to their locations in the array.

Between label L0003 and the jl back branch, this pair of instructions appears eight times. So through vectorization and loop unrolling, an impressive 64 array elements are updated per iteration of the main loop section.

The instruction add eax,0x40 confirms that the array offset in eax is incremented by 64 (0x40) at the end of each unrolled loop iteration.

L0003: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0x10]

0x00007f92f523b976: vmovdqu YMMWORD PTR [rdx+rax*4+0x10],ymm1

0x00007f92f523b97c: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0x30]

0x00007f92f523b982: vmovdqu YMMWORD PTR [rdx+rax*4+0x30],ymm1

0x00007f92f523b988: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0x50]

0x00007f92f523b98e: vmovdqu YMMWORD PTR [rdx+rax*4+0x50],ymm1

0x00007f92f523b994: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0x70]

0x00007f92f523b99a: vmovdqu YMMWORD PTR [rdx+rax*4+0x70],ymm1

0x00007f92f523b9a0: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0x90]

0x00007f92f523b9a9: vmovdqu YMMWORD PTR [rdx+rax*4+0x90],ymm1

0x00007f92f523b9b2: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0xb0]

0x00007f92f523b9bb: vmovdqu YMMWORD PTR [rdx+rax*4+0xb0],ymm1

0x00007f92f523b9c4: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0xd0]

0x00007f92f523b9cd: vmovdqu YMMWORD PTR [rdx+rax*4+0xd0],ymm1

0x00007f92f523b9d6: vpaddd ymm1,ymm0,YMMWORD PTR [rdx+rax*4+0xf0]

0x00007f92f523b9df: vmovdqu YMMWORD PTR [rdx+rax*4+0xf0],ymm1  ;*iastore {reexecute=0 rethrow=0 return_oop=0}

                                                               ; – DoesItVectorise::incrementArray@19 (line 24)

0x00007f92f523b9e8: add eax,0x40  ;*iinc {reexecute=0 rethrow=0 return_oop=0}

                                  ; – DoesItVectorise::incrementArray@20 (line 22)

0x00007f92f523b9eb: cmp eax,edi

0x00007f92f523b9ed: jl L0003  ;*goto {reexecute=0 rethrow=0 return_oop=0}

                              ; – DoesItVectorise::incrementArray@23 (line 22)
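The overall shape of the compiled loop can be imitated in plain Java, which may make the listing easier to follow. This is a hand-written sketch, not what HotSpot generates: the class UnrolledLoopSketch is made up, the innermost loop over eight lanes stands in for a single vpaddd and vmovdqu pair, and preCount is a hypothetical parameter replacing the preloop iteration count that the real code computes at runtime.

```java
import java.util.Arrays;

final class UnrolledLoopSketch {

    static final int LANES = 8;            // eight 32-bit ints per 256-bit register
    static final int VECTORS_PER_PASS = 8; // eight add/store pairs per main-loop pass
    static final int BLOCK = LANES * VECTORS_PER_PASS; // 64 elements (0x40)

    // Reference version: the loop as written in the Java source.
    static void simple(int[] array, int constant) {
        for (int i = 0; i < array.length; i++) {
            array[i] += constant;
        }
    }

    // Imitation of the preloop / main loop / postloop split.
    static void split(int[] array, int constant, int preCount) {
        int length = array.length;
        int i = 0;

        // Preloop: one element at a time.
        int preLimit = Math.min(Math.max(preCount, 0), length);
        for (; i < preLimit; i++) {
            array[i] += constant;
        }

        // Main loop: 64 elements per pass, as eight groups of eight lanes.
        for (; i <= length - BLOCK; i += BLOCK) {
            for (int v = 0; v < VECTORS_PER_PASS; v++) {
                int base = i + v * LANES;
                for (int lane = 0; lane < LANES; lane++) {
                    array[base + lane] += constant;
                }
            }
        }

        // Postloop: whatever is left over.
        for (; i < length; i++) {
            array[i] += constant;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[1024];
        int[] b = new int[1024];
        simple(a, 1);
        split(b, 1, 4);
        System.out.println("same result: " + Arrays.equals(a, b)); // prints "same result: true"
    }
}
```

Both versions touch every element exactly once, so they produce the same array; the split version simply does most of its work in blocks of 64.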

The rest of the assembly language code (which I’ve omitted for conciseness) contains the postloop iterations and the assembly language epilogue used to clean up the stack and return. Note that there is no return value from the incrementArray method.

Supply: oracle.com
