系统频繁重启问题排查

问题描述

某日下午,测试反馈测试环境系统不能使用,经查看服务监控发现服务不断重启数十次。

排查过程

使用 Arthas 对系统进行分析。首先,查看所有运行中的线程。

thread --state RUNNABLE

发现有几个定时任务执行CPU使用率过高,合计占用70%+。

打印显示CPU使用率过高的线程的运行栈。结合代码分析,异常原因。初步判断,可能是执行批量更新过程出现问题。

但是结合业务数据量,正常情况下数据量应该不多。因为系统反复重启无法直接通过系统查看数据量,直接连接数据库查询,发现数据量异常。正常情况下批量执行数据小于100,发现当前实际执行10w数据更新。到这里可以确定问题点就在这里。

分析最近开发的功能需求,最终定位到问题:拉取数据时,过滤逻辑和原来不一样,没有过滤原有数据,每次定时任务执行时,都会复制现有的记录,导致数据不断膨胀。

thread 130

线程运行栈内容具体如下:

    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.PrintStream.write(PrintStream.java:480)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
    at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
    at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
    at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
    at java.io.PrintStream.write(PrintStream.java:527)
    at java.io.PrintStream.print(PrintStream.java:669)
    at java.io.PrintStream.println(PrintStream.java:806)
    at com.baomidou.mybatisplus.extension.p6spy.StdoutLogger.logText(StdoutLogger.java:31)
    at com.p6spy.engine.spy.appender.FormattedLogger.logSQL(FormattedLogger.java:35)
    at com.p6spy.engine.common.P6LogQuery.doLog(P6LogQuery.java:121)
    at com.p6spy.engine.common.P6LogQuery.doLogElapsed(P6LogQuery.java:91)
    at com.p6spy.engine.common.P6LogQuery.logElapsed(P6LogQuery.java:203)
    at com.p6spy.engine.logging.LoggingEventListener.logElapsed(LoggingEventListener.java:107)
    at com.p6spy.engine.logging.LoggingEventListener.onAfterAnyAddBatch(LoggingEventListener.java:64)
    at com.p6spy.engine.event.SimpleJdbcEventListener.onAfterAddBatch(SimpleJdbcEventListener.java:151)
    at com.p6spy.engine.event.CompoundJdbcEventListener.onAfterAddBatch(CompoundJdbcEventListener.java:103)
    at com.p6spy.engine.wrapper.PreparedStatementWrapper.addBatch(PreparedStatementWrapper.java:382)
    at sun.reflect.GeneratedMethodAccessor704.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.ibatis.logging.jdbc.PreparedStatementLogger.invoke(PreparedStatementLogger.java:59)
    at com.sun.proxy.$Proxy504.addBatch(Unknown Source)
    at org.apache.ibatis.executor.statement.PreparedStatementHandler.batch(PreparedStatementHandler.java:58)
    at org.apache.ibatis.executor.statement.RoutingStatementHandler.batch(RoutingStatementHandler.java:69)
    at sun.reflect.GeneratedMethodAccessor703.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63)
    at com.sun.proxy.$Proxy490.batch(Unknown Source)
    at org.apache.ibatis.executor.BatchExecutor.doUpdate(BatchExecutor.java:77)
    at org.apache.ibatis.executor.BaseExecutor.update(BaseExecutor.java:117)
    at sun.reflect.GeneratedMethodAccessor684.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49)
    at com.baomidou.mybatisplus.extension.plugins.MybatisPlusInterceptor.intercept(MybatisPlusInterceptor.java:106)
    at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61)
    at com.sun.proxy.$Proxy489.update(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor684.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63)
    at com.sun.proxy.$Proxy489.update(Unknown Source)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:197)
    at com.baomidou.mybatisplus.extension.service.impl.ServiceImpl.lambda$updateBatchById$3(ServiceImpl.java:194)
    at com.baomidou.mybatisplus.extension.service.impl.ServiceImpl$$Lambda$2191/703036573.accept(Unknown Source)
    at com.baomidou.mybatisplus.extension.toolkit.SqlHelper.lambda$executeBatch$0(SqlHelper.java:211)
    at com.baomidou.mybatisplus.extension.toolkit.SqlHelper$$Lambda$2188/793575531.accept(Unknown Source)
    at com.baomidou.mybatisplus.extension.toolkit.SqlHelper.executeBatch(SqlHelper.java:175)
    at com.baomidou.mybatisplus.extension.toolkit.SqlHelper.executeBatch(SqlHelper.java:207)
    at com.baomidou.mybatisplus.extension.service.impl.ServiceImpl.executeBatch(ServiceImpl.java:239)
    at com.baomidou.mybatisplus.extension.service.impl.ServiceImpl.updateBatchById(ServiceImpl.java:191)
    at com.baomidou.mybatisplus.extension.service.IService.updateBatchById(IService.java:177)
    at com.baomidou.mybatisplus.extension.service.IService$$FastClassBySpringCGLIB$$f8525d18.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
    at org.springframework.transaction.interceptor.TransactionInterceptor$$Lambda$2113/1586949210.proceedWithInvocation(Unknown Source)
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:367)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:118)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
    at com.baomidou.dynamic.datasource.aop.DynamicDataSourceAnnotationInterceptor.invoke(DynamicDataSourceAnnotationInterceptor.java:51)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691)
// 省略其他日志

调整代码后,验证结论是否正常。如图,左侧为系统反复重启CPU与内存变化;右侧为问题修复后,系统CPU与内存变化。

服务修复,前后对比

复盘过程

排查思路:

  1. 打印正在运行的线程情况,并打印CPU使用率高的线程栈。
  2. 线程运行堆栈结合代码,分析异常可能存在问题的代码。
  3. 异常代码结合业务需求,分析可能出现问题原因。
  4. 修复问题,并验证效果。

优化流程:

可以直接,如下(命令),一键展示****当前最忙的前 N 个线程并打印堆栈

thread -n 3