Troubleshooting, Locating, and Fixing 100% CPU Usage in a Java Program

The Alert

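Today I received a warning email from Alibaba Cloud saying that the CPU load on my instance was too high. That seemed odd, so I logged in to the server to check the CPU and system load. The CPU was indeed at 100%, and the process responsible was a Java application I had deployed myself, so I started tracking down the cause.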

Locating the Problem

  • Use jps or lsof -i:{port} to get the PID of the Java process
  • Run ps -mp <PID> -o THREAD,tid,time to print the state of every thread in that process
  • Pick out the tid of the thread(s) with the highest CPU usage
  • Run printf "%x\n" tid to convert the thread ID to hexadecimal
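  • Run jstack <PID> > xxx.txt to dump the thread stacks of the Java process to a file, then use the hexadecimal thread ID from the previous step (it appears in the dump as nid=0x...) to find the specific thread and analyze it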
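The last two steps (converting the tid to hexadecimal and finding it in the dump) can also be scripted. The following is only a minimal sketch: the class JstackNidLookup and its method names are made up for illustration, and it assumes the dump has already been read into a string and that each thread header line contains a whitespace-separated nid=0x... token, as jstack output normally does.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper, not part of the JDK or of jstack itself.
final class JstackNidLookup {

    // Converts a decimal native thread ID (as shown by ps) into the
    // "nid=0x..." token that appears on thread header lines in a jstack dump.
    static String toNid(long tid) {
        return "nid=0x" + Long.toHexString(tid);
    }

    // Returns the first line of the dump that carries the nid token for this tid.
    // The token is matched as a whole word so that 0x1a2 does not match 0x1a2b.
    static Optional<String> findThreadHeader(String dump, long tid) {
        String token = toNid(tid);
        return Arrays.stream(dump.split("\\R"))
                .filter(line -> Arrays.asList(line.split("\\s+")).contains(token))
                .findFirst();
    }
}
```

For a tid of 6699, toNid returns nid=0x1a2b, and findThreadHeader returns the header line of the matching thread if the dump contains one. The stack frames printed under that header are where the analysis starts.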

Locating the Code

The threads named HBase_MQ_THREAD-* turned out to be the ones with high CPU usage. Looking at the code, the cause was a poorly chosen wait strategy for the Disruptor:

private Disruptor<Message> disruptor(int ringBufferSize) {
    EventFactory<Message> factory = Message::new;
    return new Disruptor<>(factory, ringBufferSize, Schedule.MQ, ProducerType.SINGLE, new YieldingWaitStrategy());
}

The Javadoc shows that YieldingWaitStrategy spends CPU cycles spinning in order to minimize latency, which drives CPU usage to 100%:

This strategy will use 100% CPU, but will more readily give up the CPU than a busy spin strategy if other threads require CPU resource.

The source code of the strategy is attached:

public final class YieldingWaitStrategy implements WaitStrategy {
    private static final int SPIN_TRIES = 100;

    @Override
    public long waitFor(final long sequence, Sequence cursor, final Sequence dependentSequence, final SequenceBarrier barrier) throws AlertException, InterruptedException {
        long availableSequence;
        int counter = SPIN_TRIES;

        while ((availableSequence = dependentSequence.get()) < sequence) {
            counter = applyWaitMethod(barrier, counter);
        }
        return availableSequence;
    }

    @Override
    public void signalAllWhenBlocking(){}

    private int applyWaitMethod(final SequenceBarrier barrier, int counter) throws AlertException {
        barrier.checkAlert();

        if (0 == counter) {
            Thread.yield();
        } else {
            --counter;
        }
        return counter;
    }
}

So I swapped the strategy, replacing YieldingWaitStrategy with the default strategy, BlockingWaitStrategy.
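To illustrate why a blocking wait does not keep the CPU busy, here is a simplified sketch that uses only the JDK. It is not the Disruptor's actual BlockingWaitStrategy code; the class BlockingSequence and its methods are hypothetical names chosen for this example. The consumer parks on a condition until the producer signals it, whereas the yielding strategy keeps looping and calling Thread.yield().

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of a blocking wait; not taken from the Disruptor library.
final class BlockingSequence {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition published = lock.newCondition();
    private long value = -1; // guarded by lock

    // Producer side: advance the sequence and wake up any waiting consumers.
    void publish(long next) {
        lock.lock();
        try {
            value = next;
            published.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: sleep until the sequence reaches the requested value.
    long waitFor(long sequence) throws InterruptedException {
        lock.lock();
        try {
            while (value < sequence) {
                // await() releases the lock and parks the thread,
                // so an idle consumer does not spin on the CPU.
                published.await();
            }
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

The trade-off is latency: waking a parked thread costs more than noticing a change from inside a spin loop, which is why the yielding strategy is aimed at latency-sensitive workloads rather than a mostly idle consumer like the one in this case.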