In this post, we’d like to share our real-world experience optimizing a Java application that controlled robots in a warehouse. This application would instruct the robots in the warehouse about the actions to be performed. Based on these instructions, the robots perform their work in the warehouse. At times this app would slow down and not give instructions to the robots. If the robots did not receive instructions from the application, they would start making decisions on their own, which would cause them to behave degenerately, which in turn would affect delivery and shipments in the warehouse.
A long break in garbage collection
The best way to start troubleshooting a Java application’s performance is to study its Garbage Collection performance. This is even more true when the application suffers from slowdowns. We took this application’s garbage collection log file and loaded it into the GCeasy tool. (Note: The garbage collection log contains vital statistics that most monitoring tools do not display, and adds almost no overhead to your application. Therefore, it is a best practice to enable garbage collection logging on all your production instances). The tool analyzed the garbage collection log and immediately generated this insightful GC log analysis report.
The tool displayed various interesting metrics and graphs. The garbage collection pause time graph in the report is of most interest to our discussion. Below is that chart:
Whenever the Garbage Collection event is fired, it pauses the entire application. During this pause period, no user transactions will be processed. All transactions in progress will be stopped. From the graph above, you can notice that at 11:35 a.m., the peak traffic time, the Garbage Collection event paused the entire application for 329 seconds (ie 5 minutes and 29 seconds). This means that during the entire 5+ minute window, all robots would not receive any instructions from this application. Decisions would be made autonomously, causing disruptions in business.
What causes a long GC pause?
Two main reasons caused such a long pause in garbage collection:
- Large stack size: The application is configured to work with a heap size of 126 GB. Typically, when the pile is large, the garbage collection time will also be longer. Because a lot of objects would accumulate and it would take a long time to exorcise them from memory.
- CMS (Concurrent Mark and Sweep) algorithm: The CMS GC algorithm works well and is suitable for several applications, however, its main drawback is the occasional long GC pause. The CMS GC algorithm does not always cause long GC pauses, most of its GC pauses are in the acceptable range, but it occasionally causes long GC pauses due to heap fragmentation, which can last for several seconds (sometimes even minutes) as in this situation.
Potential solutions to reduce long GC pauses
This is a blog post highlighting potential solutions to reduce long garbage collection pauses. We have considered several solutions to solve this long GC pause.
1. Reducing the heap size
Reducing the application heap size is a potential solution. However, the object creation rate of this monolithic application was very high, and reducing the heap size could potentially affect the application’s responsiveness. The heap size can only be reduced if the memory consumption of the application can be reduced, which warrants refactoring the application code. The re-architecture of this monolithic application was already underway. This monolithic application began to be disassembled and rewritten as microservices with a much smaller heap size. However, this re-architecture of the application should become live in 6 – 9 months later. Therefore, the customer was reluctant to reduce the size of the pile until then.
2. Switching from CMS to G1 GC algorithm
Another solution was to migrate from the CMS GC algorithm. Regardless of this GC performance, this CMS GC algorithm is deprecated as of Java 9 and will be permanently removed from Java 14. If we want to move away from the CMS GC algorithm, what alternatives do we have? Below are the alternative GC algorithms available in OpenJDK:
- Serial GC
- Parallel GC
- G1 GC
- ZGC
- Shenandoah GC
The serial GC algorithm is only useful for single-threaded desktop applications. Since this application has multiple concurrent threads with a very high object creation rate, we have eliminated the Serial GC algorithm. Since this application ran on Java 8, we excluded ZGC and Shenandoah GC as they are all stable only from Java 17+. So we are left with the choice of either using Parallel GC or G1 GC.
We simulated production traffic volumes in the performance lab and experimented with Parallel GC and G1 GC algorithm settings based on GC tuning best practices. We found that the pause time of the parallel GC was not as bad as the CMS GC, but it was no better than the G1 GC. So we ended up switching from the CMS GC algorithm to the G1 GC algorithm. Here is a report on the GC log analysis of this robotic application in the performance lab when using the G1 GC algorithm. Below is a graph of GC pause duration when using the G1 GC algorithm:
You can notice in the graph that the maximum GC pause time was 2.17 seconds. This is a phenomenal improvement over 5 minutes and 29 seconds. Also, the average GC pause time was only 198 ms; much better than the CMS GC algorithm for this application.
Conclusion
After switching to the G1 GC algorithm, random application slowdowns stopped completely. So, without major architectural changes, without code refactoring, without any JDK/infrastructure upgrades, just by tweaking the GC arguments in the JVM, we were able to bring this significant optimization to the performance of this robot application.