Hadoop data volume failures and solution – Cloudera

I stumbled upon the following error in Cloudera: Hadoop data volume failures took a DataNode down, and with it the HDFS service in Cloudera went down too.

I had no clue where to look for errors in this case, as comprehending and managing the Cloudera 8-node cluster (1 master + 7 datanodes) is complex: many components are involved (Hadoop roles, HDFS, YARN, Hive, Spark), plus the intricacies of how each of them interacts.

Since the metadata lives on the NameNode and a heartbeat handshake happens between the NameNode and each DataNode, the NameNode or DataNode logs should contain this information and show the reasons for the data volume failures.

So I checked the NameNode and DataNode logs of the Cloudera setup on the NameNode machine and followed the steps below to fix the issue.
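A quick way to scan a DataNode log for volume errors is a case-insensitive grep. This is a hedged sketch: the log path varies by install (Cloudera Manager may place role logs elsewhere), and /var/log/hadoop-hdfs is only a common default, not confirmed for this cluster.

```shell
# Sketch: count volume-failure messages in the DataNode logs.
# The path is an assumed common default, not necessarily this cluster's.
grep -icE 'volume failure|failed volume' /var/log/hadoop-hdfs/*.log
```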

Problem: HDFS service down because a DataNode is down due to data volume failures.

[Screenshot: dead datanodes]

I checked which datanode volume might be the problem with the hdfs report below; one entry in it shows DFS Used% at 100.00%.

hdfs dfsadmin -report

HDFS Disk capacity explained:

https://community.cloudera.com/t5/Community-Articles/Details-of-the-output-hdfs-dfsadmin-report/ta-p/245505

Master Node: bdw21-13

[root@bdw21-13 logs]# hdfs dfsadmin -report

Configured Capacity: 107099623178240 (97.41 TB)

Present Capacity: 101596650946560 (92.40 TB)

DFS Remaining: 98975373618999 (90.02 TB)

DFS Used: 2621277327561 (2.38 TB)

DFS Used%: 2.58%

Under replicated blocks: 0

Blocks with corrupt replicas: 0

Missing blocks: 0

Missing blocks (with replication factor 1): 0

-------------------------------------------------

Live datanodes (7):

Name: 1.1.21.14:50010 (bdw21-14)

Hostname: bdw21-14

Rack: /default

Decommission Status : Normal

Configured Capacity: 15578127007744 (14.17 TB)

DFS Used: 371912425472 (346.37 GB)

Non DFS Used: 0 (0 B)

DFS Remaining: 14405782257664 (13.10 TB)

DFS Used%: 2.39%

DFS Remaining%: 92.47%

Configured Cache Capacity: 4294967296 (4 GB)

Cache Used: 0 (0 B)

Cache Remaining: 4294967296 (4 GB)

Cache Used%: 0.00%

Cache Remaining%: 100.00%

Xceivers: 2

Last contact: Wed Aug 21 01:21:31 CDT 2019

Name: 1.1.21.15:50010 (bdw21-15)

Hostname: bdw21-15

Rack: /default

Decommission Status : Normal

Configured Capacity: 15578127007744 (14.17 TB)

DFS Used: 344836427776 (321.15 GB)

Non DFS Used: 0 (0 B)

DFS Remaining: 14432858255360 (13.13 TB)

DFS Used%: 2.21%

DFS Remaining%: 92.65%

Configured Cache Capacity: 4294967296 (4 GB)

Cache Used: 0 (0 B)

Cache Remaining: 4294967296 (4 GB)

Cache Used%: 0.00%

Cache Remaining%: 100.00%

Xceivers: 2

Last contact: Wed Aug 21 01:21:32 CDT 2019

Name: 1.1.21.16:50010 (bdw21-16)

Hostname: bdw21-16

Rack: /default

Decommission Status : Normal

Configured Capacity: 15578127007744 (14.17 TB)

DFS Used: 386013941827 (359.50 GB)

Non DFS Used: 0 (0 B)

DFS Remaining: 14391680741309 (13.09 TB)

DFS Used%: 2.48%

DFS Remaining%: 92.38%

Configured Cache Capacity: 4294967296 (4 GB)

Cache Used: 0 (0 B)

Cache Remaining: 4294967296 (4 GB)

Cache Used%: 0.00%

Cache Remaining%: 100.00%

Xceivers: 2

Last contact: Wed Aug 21 01:21:31 CDT 2019

Name: 1.1.21.17:50010 (bdw21-17)

Hostname: bdw21-17

Rack: /default

Decommission Status : Normal

Configured Capacity: 14604494069760 (13.28 TB)

DFS Used: 433112637440 (403.37 GB)

Non DFS Used: 0 (0 B)

DFS Remaining: 13420976128000 (12.21 TB)

DFS Used%: 2.97%

DFS Remaining%: 91.90%

Configured Cache Capacity: 4294967296 (4 GB)

Cache Used: 0 (0 B)

Cache Remaining: 4294967296 (4 GB)

Cache Used%: 0.00%

Cache Remaining%: 100.00%

Xceivers: 2

Last contact: Wed Aug 21 01:21:33 CDT 2019

Name: 1.1.21.18:50010 (bdw21-18)

Hostname: bdw21-18

Rack: /default

Decommission Status : Normal

Configured Capacity: 14604494069760 (13.28 TB)

DFS Used: 207213023299 (192.98 GB)

Non DFS Used: 0 (0 B)

DFS Remaining: 13646875742141 (12.41 TB)

DFS Used%: 1.42%

DFS Remaining%: 93.44%

Configured Cache Capacity: 4294967296 (4 GB)

Cache Used: 0 (0 B)

Cache Remaining: 4294967296 (4 GB)

Cache Used%: 0.00%

Cache Remaining%: 100.00%

Xceivers: 3

Last contact: Wed Aug 21 01:21:31 CDT 2019

Name: 1.1.21.20:50010 (bdw21-20)

Hostname: bdw21-20

Rack: /default

Decommission Status : Normal

Configured Capacity: 15578127007744 (14.17 TB)

DFS Used: 421409095747 (392.47 GB)

Non DFS Used: 0 (0 B)

DFS Remaining: 14356285587389 (13.06 TB)

DFS Used%: 2.71%

DFS Remaining%: 92.16%

Configured Cache Capacity: 4294967296 (4 GB)

Cache Used: 0 (0 B)

Cache Remaining: 4294967296 (4 GB)

Cache Used%: 0.00%

Cache Remaining%: 100.00%

Xceivers: 2

Last contact: Wed Aug 21 01:21:31 CDT 2019

Name: 1.1.21.21:50010 (bdw21-21)

Hostname: bdw21-21

Rack: /default

Decommission Status : Normal

Configured Capacity: 15578127007744 (14.17 TB)

DFS Used: 456779776000 (425.41 GB)

Non DFS Used: 0 (0 B)

DFS Remaining: 14320914907136 (13.02 TB)

DFS Used%: 2.93%

DFS Remaining%: 91.93%

Configured Cache Capacity: 4294967296 (4 GB)

Cache Used: 0 (0 B)

Cache Remaining: 4294967296 (4 GB)

Cache Used%: 0.00%

Cache Remaining%: 100.00%

Xceivers: 2

Last contact: Wed Aug 21 01:21:31 CDT 2019

[root@bdw21-13 logs]#

[root@bdw21-13 logs]# hdfs dfsadmin -report > /home/tpc/hdfs-space-21Aug2019.txt

[root@bdw21-13 logs]# grep -i "DFS Used" /home/tpc/hdfs-space-21Aug2019.txt

DFS Used: 2621277425664 (2.38 TB)

DFS Used%: 2.58%

DFS Used: 371912437760 (346.37 GB)

Non DFS Used: 0 (0 B)

DFS Used%: 2.39%

DFS Used: 344836440064 (321.15 GB)

Non DFS Used: 0 (0 B)

DFS Used%: 2.21%

DFS Used: 386013954048 (359.50 GB)

Non DFS Used: 0 (0 B)

DFS Used%: 2.48%

DFS Used: 433112653824 (403.37 GB)

Non DFS Used: 0 (0 B)

DFS Used%: 2.97%

DFS Used: 207213031424 (192.98 GB)

Non DFS Used: 0 (0 B)

DFS Used%: 1.42%

DFS Used: 421409116160 (392.47 GB)

Non DFS Used: 0 (0 B)

DFS Used%: 2.71%

DFS Used: 456779792384 (425.41 GB)

Non DFS Used: 0 (0 B)

DFS Used%: 2.93%

DFS Used: 0 (0 B)

Non DFS Used: 0 (0 B)

DFS Used%: 100.00%

[root@bdw21-13 logs]# grep -i datanodes /home/tpc/hdfs-space-21Aug2019.txt

Live datanodes (7):

[root@bdw21-13 logs]#
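To make the offending volume stand out in such a dump, each Hostname line can be paired with the DFS Used% that follows it. A small sketch, assuming the report was saved to a file named report.txt (any saved report, like the hdfs-space-21Aug2019.txt above, works the same):

```shell
# Pair each datanode hostname with its DFS Used% from a saved
# `hdfs dfsadmin -report` dump; the cluster summary's own DFS Used%
# line (which appears before any Hostname) is skipped via the host guard.
awk -F': ' '
  /^Hostname:/          { host = $2 }
  /^DFS Used%:/ && host { print host, $2 }
' report.txt
```

A node with a full or failed volume then shows up immediately in the listing.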

References:

https://community.cloudera.com/t5/Support-Questions/quot-No-host-heartbeat-CDH-versions-cannot-be-verified-quot/td-p/5283

https://community.cloudera.com/t5/Support-Questions/Volume-failure-reported-while-disks-seem-fine/m-p/22706

Solution:

[Screenshot: NameNode log location]

Error seen:

[Screenshot: DataNode volume failure error]

So I removed the /data/2/df/dn volume from the HDFS configuration in CDH and started the DataNode on node-17, which was the one reporting the volume failure error.
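In Cloudera Manager this corresponds to the DataNode Data Directory list (dfs.datanode.data.dir). As a plain hdfs-site.xml sketch of the same change, the failed /data/2/df/dn is simply dropped from the comma-separated list; the surviving paths below are illustrative placeholders, not this cluster's actual mounts.

```xml
<!-- Sketch only: keep every healthy mount and drop the failed /data/2/df/dn.
     /data/1/df/dn and /data/3/df/dn are placeholder paths for illustration. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/df/dn,/data/3/df/dn</value>
</property>
```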

Caveat: Before trying the steps above, I experimented with dfs.datanode.failed.volumes.tolerated so that volume failures would not take the DataNode down. (Note this property is the number of failed volumes a DataNode tolerates before it stops offering service; its default is 0, meaning any volume failure shuts the DataNode down, so raising the value is what keeps the node alive.) During this time the cluster had under-replicated blocks, as shown below (Fig 1), so I rebalanced the cluster to redistribute the blocks among the 7 datanodes. But after resolving the volume failure by removing the failed volume from the HDFS configuration, the cluster showed corrupt blocks (Fig 2); initially they were reported as missing.

Fig 1:

[Screenshot: under-replicated blocks]

Fig 2:

[Screenshot: corrupt blocks]

I fixed the above with the steps below:

  1. Found the corrupt blocks with the command below:

[root@bdw21-13 hadoop-conf]# hdfs fsck / | egrep -v '^\.+$' | grep -v replica | grep -v Replica

Connecting to namenode via http://bdw21-13:50070/fsck?ugi=root&path=%2F

FSCK started by root (auth:SIMPLE) from /1.1.21.13 for path / at Thu Aug 22 15:50:45 CDT 2019

/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/store_sales/store_sales_206.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1079657961

/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/store_sales/store_sales_206.dat: MISSING 1 blocks of total size 199777 B……………….

/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/web_returns/web_returns_156.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1079658663

/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/web_returns/web_returns_156.dat: MISSING 1 blocks of total size 14152 B………………………………………………………………………………..

/home/tpc/Big-Data-Benchmark-for-Big-Bench/data/reason/reason_023.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1077955383

/home/tpc/Big-Data-Benchmark-for-Big-Bench/data/reason/reason_023.dat: MISSING 1 blocks of total size 80 B……………………………………

/home/tpc/Big-Data-Benchmark-for-Big-Bench/data_refresh/store_sales/store_sales_172.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1077956851

/home/tpc/Big-Data-Benchmark-for-Big-Bench/data_refresh/store_sales/store_sales_172.dat: MISSING 1 blocks of total size 382385 B…………….

/user/hive/warehouse/bigbench10tb.db/store_returns/000172_0: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1075509573

/user/hive/warehouse/bigbench10tb.db/store_returns/000172_0: MISSING 1 blocks of total size 45693754 B………..

…………………………………………………………………………………Status: CORRUPT

Total size:    863019635593 B

Total dirs:    3930

Total files:   44593

Total symlinks:                0

Total blocks (validated):      38733 (avg. block size 22281249 B)

********************************

UNDER MIN REPL’D BLOCKS:      5 (0.012908889 %)

CORRUPT FILES:        5

MISSING BLOCKS:       5

MISSING SIZE:         46290148 B

CORRUPT BLOCKS:       5

********************************

Corrupt blocks:                5

Number of data-nodes:          7

Number of racks:               1

FSCK ended at Thu Aug 22 15:50:45 CDT 2019 in 479 milliseconds

The filesystem under path ‘/’ is CORRUPT

[root@bdw21-13 hadoop-conf]#

  2. So the corrupt files/blocks are the following (5 matches):

/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/store_sales/store_sales_206.dat

/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/web_returns/web_returns_156.dat

/home/tpc/Big-Data-Benchmark-for-Big-Bench/data/reason/reason_023.dat

/home/tpc/Big-Data-Benchmark-for-Big-Bench/data_refresh/store_sales/store_sales_172.dat

/user/hive/warehouse/bigbench10tb.db/store_returns/000172_0

  3. Removed the corrupt files (the removed files still exist in the trash).
  4. Now the corrupt files shown are in the Trash (the files removed above), so I removed them permanently.

/user/root/.Trash/Current/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/store_sales/store_sales_206.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1079657961

/user/root/.Trash/Current/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/store_sales/store_sales_206.dat: MISSING 1 blocks of total size 199777 B..

/user/root/.Trash/Current/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/web_returns/web_returns_156.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1079658663

/user/root/.Trash/Current/home/mukund/Big-Data-Benchmark-for-Big-Bench/data/web_returns/web_returns_156.dat: MISSING 1 blocks of total size 14152 B..

/user/root/.Trash/Current/home/tpc/Big-Data-Benchmark-for-Big-Bench/data/reason/reason_023.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1077955383

/user/root/.Trash/Current/home/tpc/Big-Data-Benchmark-for-Big-Bench/data/reason/reason_023.dat: MISSING 1 blocks of total size 80 B..

/user/root/.Trash/Current/home/tpc/Big-Data-Benchmark-for-Big-Bench/data_refresh/store_sales/store_sales_172.dat: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1077956851

/user/root/.Trash/Current/home/tpc/Big-Data-Benchmark-for-Big-Bench/data_refresh/store_sales/store_sales_172.dat: MISSING 1 blocks of total size 382385 B..

/user/root/.Trash/Current/user/hive/warehouse/bigbench10tb.db/store_returns/000172_0: CORRUPT blockpool BP-1704754621-1.1.21.13-1558413910334 block blk_1075509573

/user/root/.Trash/Current/user/hive/warehouse/bigbench10tb.db/store_returns/000172_0: MISSING 1 blocks of total size 45693754 B…………………………………………………………………………..

……………………………………………………………………………………….

……………………………………………………………………………………….

…………………………………………………………………………………Status: CORRUPT

Total size:    863019635593 B

Total dirs:    3900

Total files:   44593

Total symlinks:                0

Total blocks (validated):      38733 (avg. block size 22281249 B)

********************************

UNDER MIN REPL’D BLOCKS:      5 (0.012908889 %)

dfs.namenode.replication.min: 1

CORRUPT FILES:        5

MISSING BLOCKS:       5

MISSING SIZE:         46290148 B

CORRUPT BLOCKS:       5

********************************

Minimally replicated blocks:   38728 (99.98709 %)

Over-replicated blocks:        0 (0.0 %)

Under-replicated blocks:       0 (0.0 %)

Mis-replicated blocks:         0 (0.0 %)

Default replication factor:    3

Average block replication:     2.9996128

Corrupt blocks:                5

Missing replicas:              0 (0.0 %)

Number of data-nodes:          7

Number of racks:               1

FSCK ended at Thu Aug 22 16:11:30 CDT 2019 in 1118 milliseconds

The filesystem under path ‘/’ is CORRUPT

[root@bdw21-13 hadoop-conf]#
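The corrupt-file removal above can also be scripted from saved fsck output. A sketch, where fsck.txt is an assumed dump of `hdfs fsck /`: the echo keeps it a dry run (drop it, or pipe the output to sh, to actually delete), and -skipTrash deletes permanently, avoiding the round trip through .Trash seen above.

```shell
# Sketch: extract the corrupt file paths from a saved `hdfs fsck /` dump
# and print the matching delete commands as a dry run.
grep ': CORRUPT blockpool' fsck.txt \
  | cut -d: -f1 | sort -u \
  | while read -r f; do
      echo hdfs dfs -rm -skipTrash "$f"
    done
```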

So there are no more corrupt files now, and the HDFS service is up with all 7 datanodes.

Please comment if the above helps you or if you need any clarification.
