Any software developer who has worked on Java-based enterprise backend applications will have run into this infamous error, reported by a customer or filed as an issue by a QA engineer: java.lang.OutOfMemoryError: Java heap space. To understand it, we have to go back to a computer science fundamental: the complexity of algorithms, specifically "space" complexity. Every application has a worst-case memory footprint. When that footprint is unpredictable, or spikes beyond the memory allocated to the application, the allocated heap is exhausted and an out-of-memory condition results. The worst part of this condition is that the application cannot recover and will crash, and restarting the application with a larger maximum heap (the -Xmx option) is not a long-term solution. Without understanding what caused the heap usage to inflate or spike, neither memory stability nor application stability can be guaranteed. So what is a more methodical approach to understanding the programming problem behind the memory problem? The answer lies in examining the application's heap and its distribution at the moment the out-of-memory error occurs. With this prelude, we will focus on the following:
- Getting a heap dump from a Java process when it runs out of memory.
- Understanding the type of Memory issue the application is suffering from.
- Analyzing out-of-memory issues with a heap analyzer, specifically the excellent open source project Eclipse MAT (https://eclipse.org/mat/).
Setting up the application to generate a heap dump for analysis
- Non-deterministic, sporadic problems like an out-of-memory error are hard to post-mortem. The best way to handle OOMs is therefore to have the JVM dump the state of its heap to a file at the moment it runs out of memory.
- The Sun HotSpot JVM can be instructed to dump its heap state to a file when it runs out of memory; the standard format is .hprof. To enable this feature, add -XX:+HeapDumpOnOutOfMemoryError to the JVM startup options. Adding this option is essential on production systems, since an out-of-memory condition can take a long time to occur. The flag adds little or no performance overhead to the application.
- If the heap dump (.hprof) file has to be written to a specific file system location, pass the directory path to -XX:HeapDumpPath. Just make sure the application has write permission for that directory.
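Putting the two flags together, a startup command might look like the following (the application jar, heap size, and dump directory are placeholders, not from any particular setup):

```shell
# Example JVM startup combining the options above
# (jar name, -Xmx value, and dump path are placeholders):
java -Xmx2g \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/log/myapp/dumps \
     -jar myapp.jar
```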
101 – Know the nature of the out-of-memory error
The first thing to understand when assessing an out-of-memory error is the application's memory growth characteristics, and which of the following possibilities applies:
- Spikes in usage: This type of out of memory can be drastic, depending on the load. An application can run well within the memory allocated to the JVM for 20 users, but a spike to 100 users may trigger a memory surge that leads to the out-of-memory error. There are two ways to tackle this cause: provision the heap for the peak load, or smooth out the load itself.
- Leaks: This type is one where memory usage increases over time due to a programming error.
A leak chart: memory usage climbs over time after an initially healthy GC collection pattern. Note the healthy sawtooth pattern at the start.
A healthy graph with healthy GC
Memory graph with a memory spike.
Once we understand the nature of the memory issue that caused usage to surge, the following methodology can be applied to avoid hitting the OOM error, based on what the heap analysis reveals:
- Heap Analysis
- We will explore in detail below how to analyze a heap dump using a heap analysis tool. In our case, we will be using Eclipse MAT.
- Fixing a memory issue
- Fix the OOM causing code
- A leaking object reference – if the application added objects incrementally over time without ever clearing their references (from the object graph of the running application), the programming error has to be fixed. For instance, this could be a hash table into which business objects were inserted incrementally but never removed after the business logic and transaction completed.
- Increase the maximum memory as a fix – after understanding the runtime memory characteristics and the heap, the maximum heap allocation may simply have to be increased to avoid further OOM errors, because the configured maximum was not enough for stable operation. In that case the application has to be run with a higher -Xmx value, based on the assessment made from the heap analysis.
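To make the "leaking object reference" case concrete, here is a minimal, hypothetical sketch of the pattern (the class and method names are illustrative, not from any real application): a long-lived static map gains an entry per transaction, but the entries are never removed, so they stay reachable from a GC root and can never be collected.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative leak: a static (GC-root-reachable) map that only ever grows.
public class TransactionCache {
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static void process(String txId) {
        CACHE.put(txId, new byte[1024]); // per-transaction business object
        // ... business logic runs ...
        // FIX: release the reference once the transaction completes:
        // CACHE.remove(txId);
    }

    public static int cacheSize() {
        return CACHE.size();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            process("tx-" + i);
        }
        // Without remove(), every entry is still strongly referenced.
        System.out.println("entries still referenced: " + cacheSize());
    }
}
```

The fix is the commented-out remove() call: clearing the reference when the transaction completes lets the GC reclaim the entries.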
Heap Analysis using MAT
Now we get to the main focus of this article: a deep dive into heap analysis. We will go through a sequence of steps that explore the different features and views of MAT on an example OOM heap dump, and think through the analysis.
1. Open the heap dump (.hprof) generated when the out-of-memory error happened. Make sure to copy the dump file to a dedicated folder first, since MAT creates lots of index files.
- File -> open
2. MAT opens the dump with options for a Leak Suspects report and a Component report. Choose to run the Leak Suspects report. When the report opens, the pie chart in the overview pane shows the distribution of retained memory on a per-object basis: the biggest objects in memory (objects with high retained memory, i.e. the memory accumulated by the object itself and by the objects it references).
The pie chart above shows three problem suspects, found by aggregating the objects that hold the largest amounts of memory (shallow and retained).
Let us look at them one at a time and assess:
454,570 instances of “java.lang.ref.Finalizer”, loaded by “<system class loader>” occupy 790,205,576 (47.96%) bytes.
This tells us that 454,570 JVM finalizer instances occupy almost 50% of the allocated application memory. Oops! To understand what that implies, the reader needs to know what Java finalizers do. Read here: http://stackoverflow.com/questions/2860121/why-do-finalizers-have-a-severe-performance-penalty
Essentially, finalizers are custom finalize() methods written by developers to release resources held by an instance before it is reclaimed. Instances with finalizers are collected outside the scope of the normal JVM GC algorithms, via a separate finalization queue, which is a much longer path to reclamation. So now we are at the point of trying to understand: what is being finalized by all these finalizers?
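For illustration, this is roughly what such a custom finalizer looks like (the class name is hypothetical). Any class that overrides finalize() causes the JVM to register each new instance with java.lang.ref.Finalizer at construction time; once unreachable, the instance sits on the finalization queue until the finalizer thread has run finalize() on it, and only then can its memory be reclaimed.

```java
// Hypothetical class with a custom finalizer. Overriding finalize() is what
// puts instances of this class on the slower, queue-based collection path.
public class LegacyResource {
    @Override
    @SuppressWarnings("deprecation")
    protected void finalize() throws Throwable {
        try {
            // release native handles / OS resources held by this instance
        } finally {
            super.finalize();
        }
    }
}
```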
Potentially Suspect 2, sun.security.ssl.SSLSocketImpl, which occupies 20% of the memory. Can we confirm that these are the instances waiting to be cleared by the finalizers?
3. Now, let us open the Dominator Tree view from the toolbar button at the top of MAT. In it we see all the instances on the heap dump, parsed by MAT and listed by class name.
4. Next, in the Dominator Tree view, we will try to understand the relationship between java.lang.ref.Finalizer and sun.security.ssl.SSLSocketImpl. Right-click on the sun.security.ssl.SSLSocketImpl row and open Path to GC Roots -> exclude soft/weak references.
MAT will now compute the memory graph and show the paths to the GC roots from which this instance is referenced. The result appears on another page, showing the reference chain as below:
As the reference chain above shows, the SSLSocketImpl instance is held by a reference from java.lang.ref.Finalizer, which accounts for about 88 KB of retained heap at its level. We can also notice that the finalizer chain is a linked-list data structure with next pointers.
INFERENCE: At this point we have a clear hint that the Java finalizer is trying to collect SSLSocketImpl objects. To explain why so many of them are not being collected, we go to the code.
5. Inspect code
Code inspection is needed at this point to see whether sockets and I/O streams are closed in finally clauses. In this case, it revealed that all I/O-related streams were in fact correctly closed. That shifted suspicion to the JVM itself, and indeed that was the case: there was a bug in the GC collection code of OpenJDK 6.0.XX.
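As a sketch of what that inspection verifies, here is the close-in-finally idiom of that era alongside the modern try-with-resources equivalent. A stand-in AutoCloseable is used instead of a real SSLSocket so the example is self-contained; the names are illustrative.

```java
// What the code inspection checks: the resource is closed on every path,
// including when write() throws.
public class ResourceClose {
    static class TrackedStream implements AutoCloseable {
        boolean closed = false;
        void write() { /* pretend to do I/O */ }
        @Override public void close() { closed = true; }
    }

    // Pre-Java-7 idiom from the era of this bug: close in finally.
    public static TrackedStream useWithFinally() {
        TrackedStream s = new TrackedStream();
        try {
            s.write();
        } finally {
            s.close(); // runs even if write() throws
        }
        return s;
    }

    // Modern equivalent: try-with-resources closes automatically.
    public static TrackedStream useWithTryWithResources() {
        TrackedStream s = new TrackedStream();
        try (TrackedStream r = s) {
            r.write();
        }
        return s;
    }
}
```

Code where close() can be skipped on an exception path leaves the object reachable longer and, for finalizable classes like SSLSocketImpl, pushes even more work onto the finalizer queue.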
I hope this article provides a model for analyzing heap dumps and inferring root causes in Java applications. There is light at the end of the tunnel. Happy heap analysis!
Shallow vs. Retained Heap
Shallow heap is the memory consumed by one object. An object needs 32 or 64 bits (depending on the OS architecture) per reference, 4 bytes per Integer, 8 bytes per Long, etc. Depending on the heap dump format, the size may be adjusted (e.g. aligned to 8 bytes) to better model the real consumption of the VM.
Retained set of X is the set of objects which would be removed by GC when X is garbage collected.
Retained heap of X is the sum of shallow sizes of all objects in the retained set of X, i.e. memory kept alive by X.
Generally speaking, shallow heap of an object is its size in the heap and retained size of the same object is the amount of heap memory that will be freed when the object is garbage collected.
The retained set for a leading set of objects, such as all objects of a particular class, all objects of all classes loaded by a particular class loader, or simply a bunch of arbitrary objects, is the set of objects that is released if all objects of that leading set become inaccessible. The retained set includes these objects as well as all other objects accessible only through them. The retained size is the total heap size of all objects contained in the retained set.
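A small, hypothetical example may help make these definitions concrete (class name is illustrative; sizes are approximate and JVM-dependent):

```java
// Numeric illustration of shallow vs. retained heap.
public class RetainedExample {
    static class Holder {
        int[] data = new int[1_000_000]; // ~4,000,000 bytes of int data
    }

    public static void main(String[] args) {
        Holder h = new Holder();
        // Shallow heap of h: object header + one reference field (~16 bytes).
        // Retained set of h: {h, h.data}, since the array is reachable only
        // through h.
        // Retained heap of h: ~16 bytes + ~4 MB, i.e. the memory the GC
        // would free once h becomes unreachable and is collected.
        System.out.println(h.data.length); // 1000000
    }
}
```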