Application Server unresponsive or stuck? Take a deeper look!

Markus Eisele
If you work with Java EE projects, you have probably seen this before. You have a big machine. Some GB of RAM, a couple of cores. And a "normal" application. Nothing to worry about. Profiling is unremarkable and everything works fine on your dev and test environments. This can change once you do your first load test runs.
What the hell is happening? To figure this out, you have to take a deeper look at all parts of the system. Here is a very brief overview of what to look at and where. Don't get me wrong. This is not a handbook or a guide. You have to attend more than a couple of optimizing sessions to be able to fully track down the problems and solve them. Anyway, this is a good overview and could be extended into some kind of checklist.

Your infrastructure
The first thing is to make a map of your infrastructure. If you are doing load tests and you experience throughput that is too low or even unresponsive applications, you have to make sure you know everything about your setup. Ask yourself questions like:
- What does the network structure look like? How fast is the network? How high is its load?
- Which components are in between your load agents and your server? (switches, routers, dispatcher, httpd, etc.)
- What about the database? (Separate machine? Separate network? How high is its load?)
- What about your appserver? (How many cores? How much RAM? How many hard disks? How many Ethernet cards?)
- How is the cluster set up? (Load balancing? Failover?)
There is probably much more information you should have. Try to figure out as much as possible. Even the slightest piece of information is valuable.

Before taking a deeper look at the details, I strongly advise you to request full access to the systems you are going to examine. You should, for example, have transparent access to any ports (JMX/Debug), the shell, the httpd status monitor, the appserver management utilities and so on. Without being able to gather all required information during the runs, you will not be able to find a solution. Working in such a setting is stupid and could even lead you in the wrong direction.
This should be no problem for systems up to the integration stage. If you have to solve problems occurring in systems that are already in production, you should think about different approaches. Being on-site and working with operations is probably the best approach there. But let's stick to the test environment here.

Reproduce the situation
If you know everything about your infrastructure and have access to every relevant component in it, you should put some effort into reproducing the situation. I have seen cases where this was not too easy. Try to play around with the load scenarios. Try different combinations of use cases, different load, try shorter ramp-up times, try to overload the system. Without a reproducible situation, you will not be able to solve the issue.

Collect metrics
Once you have finally reproduced your situation, you can start collecting the relevant metrics. The first approach is to do this without any changes to the system. Depending on the infrastructure, there are a couple of things to look at:
- Appserver console/monitoring (JVM, DB Pools, Thread Usage, Pending Requests, and more)
- Apache mod_status, mod_proxy (Thread Usage, Dispatching Behaviour, and more)
- Database Monitoring (Connections, Usage, Load, Performance, and more)
- System monitoring (I/O, Network, HDD, CPU, and more)

The No. 1 suspects are always external resources of any kind. So you should look at the DB connections first. After that, look at the system resources: heap, memory, CPU and so on. Depending on your findings, you may already be able to eliminate the bottleneck.
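
If your appserver console does not show heap, thread and CPU figures in one place, the JVM can report them itself through the standard java.lang.management API. What follows is only a minimal sketch, not a monitoring solution: the class name BasicJvmMetrics and the output format are made up for illustration, and it only sees the JVM it runs in, so you would have to call it from inside the application you are examining.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (name and output format are assumptions for this sketch).
class BasicJvmMetrics {

    // Used heap as a percentage of the maximum heap, or -1 if no maximum is defined.
    static double heapUsedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        return max <= 0 ? -1.0 : 100.0 * heap.getUsed() / max;
    }

    // One-line snapshot of heap, thread and CPU related figures.
    static String snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return String.format(
            "heap=%.1f%% threads=%d peakThreads=%d cores=%d loadAvg=%.2f",
            heapUsedPercent(),
            threads.getThreadCount(),
            threads.getPeakThreadCount(),
            os.getAvailableProcessors(),
            os.getSystemLoadAverage()); // negative if the platform cannot report it
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

Logging such a snapshot every few seconds during a load test run gives you a rough timeline to compare against the load profile.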

Extended metrics
If the basic metrics did not show any problems, you have to dig deeper. This is the point where you start enabling external monitoring and extended tracing.
- Enable JMX Management Agent and connect via JConsole or your favorite JMX monitor
- Enable verbose GC output
- Enable extended diagnostics in your appserver (e.g. Oracle WebLogic WLDF)
- Use other visualizing/tracing tools available
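
On a HotSpot JVM, verbose GC output is typically switched on with the -verbose:gc start parameter, and the JMX agent with the com.sun.management.jmxremote system properties. Check the documentation of your JVM and appserver for the exact settings. The numbers JConsole shows for garbage collection come from the platform MXBeans, and you can read them yourself. The sketch below assumes it runs inside the JVM you want to look at; the class name GcOverview is invented for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (name is an assumption for this sketch).
class GcOverview {

    // One line per garbage collector: number of collections and accumulated time.
    static List<String> collectorStats() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                + ": collections=" + gc.getCollectionCount()
                + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : collectorStats()) {
            System.out.println(line);
        }
    }
}
```

If the accumulated collection time grows much faster than the wall clock time of your load test suggests it should, the heap settings are a good next place to look.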

Set screws
Once you have all your metrics, you are basically on your own. There is nothing like a cookbook for solving your problems. But you did not really expect one, right? :)
Anyway, there are a couple of things to do. The first is to identify the resource that is causing the trouble. You do this by watching out for any hint of full or close to full resource usage. This could be a connection pool or the JVM heap. The simplest approach is to experiment with increasing its size.
Some of the extended metrics help you identify more specific situations (e.g. stuck threads, memory leaks).
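
For the stuck thread case, a thread dump is usually the first thing to look at. A rough programmatic equivalent, again only a sketch using the standard ThreadMXBean, counts the threads per state and asks the JVM for deadlocked threads. The class name ThreadStateCheck is made up, and findDeadlockedThreads() is assumed to be supported by your JVM (it is on HotSpot; other JVMs may throw an UnsupportedOperationException).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (name is an assumption for this sketch).
class ThreadStateCheck {

    // Number of live threads per state (RUNNABLE, BLOCKED, WAITING, ...).
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    // Number of threads the JVM reports as deadlocked (0 if none).
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("states=" + countByState());
        System.out.println("deadlocked=" + deadlockedThreadCount());
    }
}
```

A large and growing number of BLOCKED threads under load hints at lock contention or an exhausted pool, even if the JVM does not report a real deadlock.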
If none of the above works, you are going to become a specialist in optimizing or performance tuning for your environment. This means you have to look at the product documentation and whatever other information is around to find the things to change.

Time-consuming team game
Anyway, this is a team game. A time-consuming one. You have to work closely with operations, the dev team and the guys doing the load test. It is not too unusual that it takes some time. A typical load test lasts about 60 minutes. Including ramp-up and ramp-down, analysis, configuration changes and redeployment, one run can take 2 hours. Given an eight-hour work day, this gives you time for four runs. Not too many, if you do not have a clue where to look.
