whitelilis / whitelilis.github.io


Job submission to the cluster fails, data writes are slow #9

Open whitelilis opened 6 years ago

whitelilis commented 6 years ago

Some colleagues reported that job submissions to the cluster were failing. The log looks like this: image

Or:

18/01/19 14:52:19 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1515)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1318)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1271)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:525)

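Pick a datanode and take a look: image

Writes really are stalling: image

The trace result: image

Something must be writing heavily to the cluster, so look at the HDFS audit log to see what is being written: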

tail -F hdfs-audit.log | grep --color create

The bulk of the writes turn out to come from a single hive user. Kill the corresponding job: image
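
The same "who is creating files" question can be answered with counts instead of eyeballing the grep output. Below is a minimal sketch, not the method used in this issue. It assumes the usual audit line layout with `ugi=<user>` and `cmd=<operation>` fields; the function name `count_creates_by_user` is made up for illustration, and the patterns may need adjusting if your log format differs.

```python
import re
from collections import Counter

# Assumed field layout of an HDFS audit line: "... ugi=<user> ... cmd=<op> ..."
UGI_RE = re.compile(r"\bugi=(\S+)")
CMD_RE = re.compile(r"\bcmd=(\S+)")


def count_creates_by_user(lines):
    """Count cmd=create audit entries per user (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        cmd = CMD_RE.search(line)
        ugi = UGI_RE.search(line)
        # Skip lines that lack either field or record another operation.
        if cmd and ugi and cmd.group(1) == "create":
            counts[ugi.group(1)] += 1
    return counts


# Example use against a log file (path is an assumption):
# with open("hdfs-audit.log", errors="replace") as f:
#     for user, n in count_creates_by_user(f).most_common(10):
#         print(n, user)
```

Sorting by count makes a single dominant writer, like the hive user here, stand out immediately.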

Back on the datanode, tt no longer finds any timed-out calls, and the output of the monitor command looks healthy again: image