1. Filesystem counters
| File system | Counters |
|---|---|
| Local file system | FILE_BYTES_READ, FILE_BYTES_WRITTEN |
| HDFS | HDFS_BYTES_READ, HDFS_BYTES_WRITTEN |
FILE_BYTES_WRITTEN has two contributors. The first is the mappers: every mapper spills intermediate output to local disk, and all of those bytes are counted in FILE_BYTES_WRITTEN. The second is the reducers: during the shuffle phase, reducers fetch intermediate data from the mappers, then merge and spill it to reducer-side disks, and those bytes are counted in FILE_BYTES_WRITTEN as well.
HDFS_BYTES_READ denotes the bytes mappers read from HDFS when the job starts. It includes not only the contents of the source files but also metadata about the splits.
HDFS_BYTES_WRITTEN denotes the bytes written to HDFS, i.e., the size of the job's final output.
Note that because HDFS and the local file system are different file systems, the counters for the two never overlap.
2. Comparison of compression in three places
1) No compression
| Counter | Map | Reduce | Total |
|---|---|---|---|
| FILE_BYTES_READ | 0 | 4,579,057,545 | 4,579,057,545 |
| FILE_BYTES_WRITTEN | 6,645,450,502 | 6,110,995,198 | 12,756,445,700 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 4,584,208,671 | 4,584,208,671 |
2) Only compressing input
| Counter | Map | Reduce | Total |
|---|---|---|---|
| FILE_BYTES_READ | 11,989,663,380 | 6,645,432,358 | 18,635,095,738 |
| FILE_BYTES_WRITTEN | 18,633,824,537 | 6,645,432,358 | 25,279,256,895 |
| HDFS_BYTES_READ | 1,049,256,588 | 0 | 1,049,256,588 |
| HDFS_BYTES_WRITTEN | 0 | 6,653,296,922 | 6,653,296,922 |
We can see that HDFS_BYTES_READ is significantly reduced: with compressed input, the mappers read far fewer bytes from HDFS.
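Compressing input needs no job configuration: you simply gzip the source files before uploading them to HDFS, and Hadoop infers the codec from the `.gz` suffix. A minimal sketch of the pre-compression step (file names and contents are illustrative, not from the job above):

```python
# Sketch: pre-compressing job input. In practice you would gzip the raw
# logs before putting them on HDFS; Hadoop detects the codec from the
# .gz suffix, so the job itself needs no extra configuration.
import gzip
import os

# Stand-in for a raw log file (illustrative content).
with open("logs.txt", "wb") as f:
    f.write(b"127.0.0.1 - GET /index.html 200\n" * 100_000)

# Recompress it with gzip, as you would before `hdfs dfs -put`.
with open("logs.txt", "rb") as src, gzip.open("logs.txt.gz", "wb") as dst:
    dst.write(src.read())

print(os.path.getsize("logs.txt"), "->", os.path.getsize("logs.txt.gz"))
```

Note that gzip files are not splittable, so each compressed file is processed by a single mapper; formats like LZO (with an index) avoid this limitation.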
3) Only compressing intermediate map output
| Counter | Map | Reduce | Total |
|---|---|---|---|
| FILE_BYTES_READ | 0 | 1,020,775,438 | 1,020,775,438 |
| FILE_BYTES_WRITTEN | 990,084,052 | 1,020,775,438 | 2,010,859,490 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 6,653,296,922 | 6,653,296,922 |
We can see that FILE_BYTES_READ and FILE_BYTES_WRITTEN are significantly reduced. This means the data transferred through the node-local file systems during the shuffle is significantly reduced.
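Intermediate map output compression is switched on in the job configuration. A sketch of the relevant properties (these are the Hadoop 2.x names; older releases use `mapred.compress.map.output` and `mapred.map.output.compression.codec`, and the codec choice here is illustrative):

```xml
<!-- mapred-site.xml or per-job configuration: compress the
     intermediate map output before it is spilled and shuffled -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Because intermediate output is written and read only by the framework, splittability does not matter here, so fast codecs such as Snappy or LZO are common choices.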
4) Only compressing output
| Counter | Map | Reduce | Total |
|---|---|---|---|
| FILE_BYTES_READ | 0 | 6,645,432,490 | 6,645,432,490 |
| FILE_BYTES_WRITTEN | 6,645,450,502 | 6,645,432,490 | 13,290,882,992 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 997,479,368 | 997,479,368 |
We can see that HDFS_BYTES_WRITTEN is significantly reduced: the final output written to HDFS is much smaller.
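Output compression is likewise a job-configuration setting. A sketch of the relevant properties (Hadoop 2.x names; older releases use `mapred.output.compress` and `mapred.output.compression.codec`, and the codec choice is illustrative):

```xml
<!-- per-job configuration: compress the final job output on HDFS -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```

If the output will feed another MapReduce job, prefer a splittable format for the same reason as with input compression.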
3. Comparison of different compression formats: gzip, lzo
| Compression | File | Size (GB) | Compression time (s) | Decompression time (s) |
|---|---|---|---|---|
| None | logs | 8.0 | - | - |
| Gzip | logs.gz | 1.3 | 241 | 72 |
| LZO | logs.lzo | 2.0 | 55 | 35 |
| Snappy | - | 4.2 | 40 | 27 |
We can also see that the Snappy file is larger than the corresponding LZO file, but still about half the size of the original. In addition, Snappy compresses and decompresses even faster than LZO. In sum, Snappy is the fastest to compress and decompress but the least efficient in terms of compression ratio.
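The speed-versus-ratio trade-off in the table can be reproduced on a small scale. LZO and Snappy bindings are not in Python's standard library, so this sketch uses zlib (the algorithm behind gzip) at level 1 versus level 9 as a stand-in for the same trade-off: a faster setting yields a larger output, a slower one a smaller output. The sample data is illustrative.

```python
# Sketch: measure compression ratio and time, as in the table above.
# zlib level 1 (fast, larger) vs level 9 (slow, smaller) illustrates
# the same trade-off as Snappy/LZO vs gzip.
import time
import zlib

data = b"timestamp=2013-01-01 level=INFO msg=request served\n" * 200_000

for level in (1, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"level={level} ratio={ratio:.1f}x time={elapsed:.3f}s")
```

On real cluster workloads the usual guidance follows directly from this trade-off: fast codecs for intermediate data, high-ratio codecs for cold data on HDFS.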