YARN: a detailed analysis of the exception logs behind a NodeManager CPU spike, and how to fix it
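Problem description:
The CPU load on NodeManager1 spikes. The process stays alive but stops sending heartbeats to the ResourceManager and keeps repeating the action described in step 2 below. After the heartbeats have been missing for a while, the node reconnects to the RM, but CPU usage stays high; some time later the heartbeats stop again, and the cycle repeats.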
NodeManager log analysis
1. Localizing: the container starts downloading a resource from HDFS, and the resource's state changes from INIT to DOWNLOADING.
2018-08-25 16:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:Resource hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar transitioned from INIT to DOWNLOADING
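2. The container is stopped or killed while localizing, so the resource stays in the DOWNLOADING state.
Despite the wording "non-zero refcount", the reference list in this log entry is empty ([]), meaning no other container is using the resource. The same message is logged whenever the resource is still DOWNLOADING (see the condition in the source code further down), and in that state the resource cannot be removed.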
2018-08-25 19:15:38,592 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar, 1448139497492, FILE, null },pending,[],920074451410033,DOWNLOADING} with non-zero refcount
3. The task has already been killed, so a CancellationException is reported.
2018-08-25 19:25:34,592 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: {...}failed;
java.util.concurrent.CancellationException
4. After a while the state changes from DOWNLOADING to FAILED, and the resource can now be removed.
2018-08-25 20:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:Resource hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar(->/data/nm-local-dir/usercache/hadoop/filecache/5432524/avro-mapred-hadoop2.jar) transitioned from DOWNLOADING to FAILED
5. The locally cached file (which may be corrupted) is removed.
2018-08-25 19:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl:Removed /data/nm-local-dir/usercache/hadoop/filecache/5432524/avro-mapred-hadoop2.jar from localized cache
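6. The container sends a RELEASE event for a resource that is no longer in the cache; the resource will be requested again.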
2018-08-25 19:15:38,592 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl:Container container_152345432_4324_3_4324234 sent RELEASE event on a resource request {hdfs://mycluster/user/hdfs/.staging/application_1444990016246_29569/libjars/avro-mapred-hadoop2.jar,,,} not present in cache
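The state transitions in steps 1 and 4 can be checked mechanically. The following is a minimal sketch, not a NodeManager tool: the class name StuckDownloads and the method findStuck are hypothetical, and the regular expression assumes the "Resource ... transitioned from X to Y" wording shown in the excerpts above. It reports every resource whose last observed transition ended in DOWNLOADING, assuming the log lines are read in chronological order.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans NodeManager log lines for LocalizedResource
// state transitions and lists resources that never left DOWNLOADING.
public class StuckDownloads {
    // group(1) = resource URL (without the optional "(->local path)" suffix),
    // group(2) = old state, group(3) = new state.
    private static final Pattern TRANSITION = Pattern.compile(
        "Resource (\\S+?)(?:\\(->[^)]*\\))? transitioned from (\\w+) to (\\w+)");

    public static Set<String> findStuck(List<String> logLines) {
        // Remember only the most recent state seen for each resource.
        Map<String, String> lastState = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = TRANSITION.matcher(line);
            if (m.find()) {
                lastState.put(m.group(1), m.group(3));
            }
        }
        Set<String> stuck = new TreeSet<>();
        for (Map.Entry<String, String> e : lastState.entrySet()) {
            if ("DOWNLOADING".equals(e.getValue())) {
                stuck.add(e.getKey());
            }
        }
        return stuck;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StuckDownloads <nodemanager-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        findStuck(lines).forEach(System.out::println);
    }
}
```

A long list that does not shrink between two runs would match the situation described here, where resources stay in DOWNLOADING instead of moving on to FAILED.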
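Root cause
The container was stopped, possibly because an RPC to an external component failed, because someone killed the task manually, or because of some other abnormal condition. The resource is left in an abnormal state and cannot be removed, yet removal is attempted over and over.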
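Solutions
1. The crude way: manually delete the files in HDFS that cannot be removed (hard to do in practice: you do not know which files to delete, and when there are many of them it is tedious).
2. The sophisticated way: patch the code at the place where the error is raised
LocalResourcesTrackerImpl (line 339)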
  public boolean remove(LocalizedResource rem, DeletionService delService) {
    // current synchronization guaranteed by crude RLS event for cleanup
    LocalizedResource rsrc = localrsrc.get(rem.getRequest());
    if (null == rsrc) {
      LOG.error("Attempt to remove absent resource: " + rem.getRequest()
          + " from " + getUser());
      return true;
    }
    if (rsrc.getRefCount() > 0
        || ResourceState.DOWNLOADING.equals(rsrc.getState()) || rsrc != rem) {
      // internal error
      LOG.error("Attempt to remove resource: " + rsrc
          + " with non-zero refcount");
      return false;
    } else { // ResourceState is LOCALIZED or INIT
      localrsrc.remove(rem.getRequest());
      if (ResourceState.LOCALIZED.equals(rsrc.getState())) {
        delService.delete(getUser(), getPathToDelete(rsrc.getLocalPath()));
      }
      decrementFileCountForLocalCacheDirectory(rem.getRequest(), rsrc);
      LOG.info("Removed " + rsrc.getLocalPath() + " from localized cache");
      return true;
    }
  }
ResourceState.DOWNLOADING.equals(rsrc.getState())
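When the resource's state is DOWNLOADING, this condition makes remove() log the error and return false; the condition can be removed from the source code.
For reference, see the following patch: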
https://issues.apache.org/jira/browse/YARN-2902
https://issues.apache.org/jira/secure/attachment/12685562/YARN-2902.patch
3. The unbeatable way: restart. Restart the NodeManager; Spark and similar jobs fail over automatically, so production workloads are not affected.
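To make the effect of the DOWNLOADING check in option 2 concrete, here is a simplified, self-contained model of the guard in remove(). It is only a sketch: RemoveGuardSketch, its nested types and the refuseDownloading flag are hypothetical stand-ins, not Hadoop classes, and the sketch models the suggestion made above (dropping the condition), not the contents of the YARN-2902 patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, stripped-down model of the guard in
// LocalResourcesTrackerImpl.remove(); not the real Hadoop code.
public class RemoveGuardSketch {
    public enum State { INIT, DOWNLOADING, LOCALIZED, FAILED }

    public static final class Resource {
        final String request;
        final State state;
        final int refCount;

        public Resource(String request, State state, int refCount) {
            this.request = request;
            this.state = state;
            this.refCount = refCount;
        }
    }

    private final Map<String, Resource> cache = new HashMap<>();
    // true  = behave like the snippet above (DOWNLOADING blocks removal)
    // false = behave as if the DOWNLOADING condition had been deleted
    private final boolean refuseDownloading;

    public RemoveGuardSketch(boolean refuseDownloading) {
        this.refuseDownloading = refuseDownloading;
    }

    public void track(Resource r) {
        cache.put(r.request, r);
    }

    public boolean isTracked(String request) {
        return cache.containsKey(request);
    }

    public boolean remove(Resource rem) {
        Resource rsrc = cache.get(rem.request);
        if (rsrc == null) {
            return true; // absent resource: nothing left to remove
        }
        boolean blockedByState =
            refuseDownloading && rsrc.state == State.DOWNLOADING;
        if (rsrc.refCount > 0 || blockedByState || rsrc != rem) {
            return false; // the branch that logs "... with non-zero refcount"
        }
        cache.remove(rem.request);
        return true;
    }
}
```

With refuseDownloading set to true, a resource with refcount 0 that is stuck in DOWNLOADING is refused on every attempt and stays tracked, which is the loop seen in the logs. With the flag set to false the same resource is dropped on the first attempt. Whether removing the check is safe in a real NodeManager depends on what else relies on it, so treat this as an illustration of the control flow only.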
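Summary: this problem has nothing to do with resource allocation or with the resources a container consumes, because it is the NodeManager's CPU that spikes, not a container's. It arises right after a job is submitted: the container begins initializing and starts pulling its dependencies from HDFS to local disk, and at that moment the job or the container dies (killed manually, timed out, and so on). If no other container is using the resource at that point, it stays in the DOWNLOADING state and the error in step 2 above is logged. Normally this error can be ignored: after a while the state changes from DOWNLOADING to FAILED and the resource is removed. In my case, however, there were too many resources in the DOWNLOADING state; the transitions were very slow, and some never completed, so the resources could not be removed. The log filled with errors like the one in step 2, CPU usage climbed very high, and the NodeManager occasionally appeared to hang. In the end the fix was a restart.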
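To judge whether a node is in this state, it can help to count how often the step 2 error appears. The sketch below is an assumption-laden helper, not part of YARN: the class name RemoveErrorRate is made up, and it assumes every log line starts with a "yyyy-MM-dd HH:mm:ss,SSS" timestamp as in the excerpts above. It groups the "Attempt to remove resource ... with non-zero refcount" lines by minute.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: counts the "non-zero refcount" removal errors
// per minute, keyed by the first 16 characters of the line's timestamp.
public class RemoveErrorRate {
    private static final String PREFIX = "Attempt to remove resource";
    private static final String SUFFIX = "with non-zero refcount";
    private static final int MINUTE_KEY_LENGTH = "yyyy-MM-dd HH:mm".length();

    public static SortedMap<String, Integer> perMinute(List<String> logLines) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            if (line.length() >= MINUTE_KEY_LENGTH
                    && line.contains(PREFIX)
                    && line.contains(SUFFIX)) {
                String minute = line.substring(0, MINUTE_KEY_LENGTH);
                counts.merge(minute, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

An occasional entry matches the harmless case described above; a count that stays high minute after minute points to the loop that drove the CPU up here.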
Part of the original series 《每日五分钟搞定大数据》 (Big Data in Five Minutes a Day), updated irregularly.