
A Collection of Spark Problems

Lately I have been running Spark programs a lot, mainly for distributed machine learning and distributed deep learning. Because the models are often large (VGG, for example) and the cluster rarely has many idle nodes, jobs can struggle, and I have hit quite a few problems along the way. I am collecting them here for later reference.

Error Collection

ClosedChannelException

1 ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FINISHED!
2 ERROR SparkContext:91 - Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
3 ERROR TransportClient:245 - Failed to send RPC 7202466410763583466 to /xx.xx.xx.xx:54864: java.nio.channels.ClosedChannelException
4 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map()) to AM was unsuccessful

The errors above are usually thrown together.

[Cause Analysis]

The memory allocated per node may be too small. By default Spark launches two executors with 1 GB of memory each; if the data is too large, YARN kills the executors outright and their IO channels are closed along with them, which is where the ClosedChannelException comes from.

Error 1 may also be caused by Java 8's excessive memory allocation strategy.
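
To check whether YARN really killed the executor for exceeding its memory limit, one option is to look at the aggregated container logs. A minimal sketch, assuming YARN log aggregation is enabled; the application ID is a placeholder:

# application_XXXXXXXXXXXXX_XXXX is a placeholder for your application ID.
# Look for messages such as "is running beyond physical memory limits"
# or "Killing container" in the output.
yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX | grep -i -A 3 "memory limits"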

[Solution]

Add the following configuration to yarn-site.xml to disable YARN's physical and virtual memory checks, so the NodeManager will not kill containers that exceed them:

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

Alternatively, pass --driver-memory 5g --executor-memory 5g on the command line when submitting, explicitly increasing the memory available to the job.
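
For example, a complete submission could look like the sketch below; com.example.Train and my-spark-job.jar are placeholders, and the sizes should match what your cluster can actually allocate:

# Placeholder main class and jar; adjust the memory sizes to your cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 5g \
  --executor-memory 5g \
  --class com.example.Train \
  my-spark-job.jar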

Or add the following properties to spark/conf/spark-defaults.conf:

spark.driver.memory              5g
spark.executor.memory            5g

You can even go further and add the following properties as well:

spark.yarn.executor.memoryOverhead          4096
spark.yarn.driver.memoryOverhead            8192
spark.akka.frameSize                        700
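
Note that the executor container requested from YARN is roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead, so with the values above each executor asks YARN for about 5 GB + 4 GB = 9 GB; yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb have to accommodate that. If you prefer not to touch spark-defaults.conf, the same properties can also be passed per job with --conf, as in the sketch below (placeholder class and jar names again; spark.akka.frameSize is only read by older, Akka-based Spark releases):

# Per-job equivalents of the properties above, passed on the command line.
# spark.akka.frameSize only applies to pre-2.0 (Akka-based) Spark versions.
spark-submit \
  --master yarn \
  --driver-memory 5g \
  --executor-memory 5g \
  --conf spark.yarn.driver.memoryOverhead=8192 \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.akka.frameSize=700 \
  --class com.example.Train \
  my-spark-job.jar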
 

Lost Executors et al.

5. ERROR YarnScheduler:70 - Lost executor 3 on simple23: Container marked as failed: container_1490797147995_0000004 on host: simple23. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

[Stage 16:===========================================>              (6 + 2) / 8]
6. ERROR TaskSetManager:70 - Task stage 17.2 failed 4 times; aborting job
7. ERROR DistriOptimizer$:655 - Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task age 17.2 failed 4 times, most recent failure: Lost task 0.3 in stage 17.2 (TID 90, simple21, executor 4): java.util.concurrent.EnException: 

[Stage 23:>                                                         (0 + 3) / 3]
8. ERROR YarnScheduler:70 - Lost executor 4 on simple21: Container marked as failed: container_1490797147995_0004_01_000005 on host: simple21. Exit status: 143. Diagn Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

[Stage 23:>                                                         (0 + 3) / 3]
9. ERROR TransportResponseHandl- Still have 1 requests outstanding when connection from /xx.xx.xx.22:51442 is closed
 

[Cause Analysis]

The error messages show that YARN lost the executors, almost certainly because they were killed again (exit code 143 is 128 + 15, i.e. the container received SIGTERM), so check whether your driver-memory and executor-memory are large enough.

[Solution]

Same as for the previous error: increase driver-memory and executor-memory, and if necessary the memory overhead settings shown above.

Original article: http://whatbeg.com/2017/05/28/sparkerror.html
