1. Setting up Amazon EMR
Tutorial on how to set up Amazon Elastic MapReduce
2. Getting SSH on Windows
3. Using compression in MapReduce
As described in the introduction section, if the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. This is input compression.
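To make the extension-based lookup concrete, here is a minimal, stdlib-only sketch of the idea. It is not Hadoop's implementation (in Hadoop this job is done by CompressionCodecFactory); the class name CodecByExtension and the method codecFor are made up for illustration. The extensions listed are the default ones for these codecs, and the LZO entry assumes the third-party hadoop-lzo library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of choosing a codec from a filename extension.
// Hypothetical helper, not part of Hadoop.
class CodecByExtension {
    // Default filename extensions of some common codecs.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        // Assumption: provided by the separate hadoop-lzo library.
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
    }

    // Returns the codec class name for a file, or null when the extension
    // is unknown, in which case the file would be read as plain data.
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("logs/part-00000.gz"));
        System.out.println(codecFor("logs/part-00000.txt"));
    }
}
```

A practical consequence is that a compressed input file with a missing or wrong extension will not be decompressed, so it is worth keeping the extensions intact when uploading data.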
The snippets below show how to enable output compression in Hadoop for several common compression formats.
Gzip
For final output, we can use the static convenience methods on FileOutputFormat to set the properties.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
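For map output, set the properties on the Configuration before creating the Job. Hadoop 2 and later name these properties mapreduce.map.output.compress and mapreduce.map.output.compress.codec; the older names shown here are deprecated aliases.
// Compress the intermediate map output with gzip.
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Job job = new Job(conf);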
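To get a feel for what gzip does to typical map output, the following sketch compresses a repetitive stream of key/value records and checks that it round-trips. It uses java.util.zip from the JDK rather than Hadoop's GzipCodec, and the class GzipSizeDemo and the sample records are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Stdlib-only illustration of gzip on record-like data (hypothetical helper).
class GzipSizeDemo {
    // Builds n tab-separated "word<TAB>1" lines, similar to word-count map output.
    static byte[] sampleRecords(int n) {
        String[] words = {"hadoop", "mapreduce", "compression", "codec"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses a byte array into gzip format.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompresses gzip data back into the original bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleRecords(1000);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes: " + raw.length + ", gzipped bytes: " + packed.length);
    }
}
```

The saving depends heavily on the data; the sample here is deliberately repetitive, so real jobs should be measured rather than assumed to behave the same way.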
LZO
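For final output
// LzoCodec is not bundled with Hadoop; it comes from the separate hadoop-lzo library.
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);
In addition, LZO files are not splittable on their own. To make them splittable, we need to build an LZO index file for each output file, for example with the indexer that ships with the hadoop-lzo library.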
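The reason an index helps can be sketched as follows. Roughly speaking, the index records the byte offset at which each compressed block starts, so a split that would begin in the middle of a block can be moved forward to the next block boundary. The code below is a simplified illustration under that assumption, not the hadoop-lzo implementation; the class LzoIndexSketch, the method alignToBlock, and the offsets are all made up.

```java
import java.util.Arrays;

// Simplified illustration of aligning a split start to a block boundary
// using an index of block offsets (hypothetical helper and data).
class LzoIndexSketch {
    // Returns the first block offset that is >= pos, or -1 if pos lies
    // beyond the start of the last block. blockOffsets must be ascending.
    static long alignToBlock(long[] blockOffsets, long pos) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // convert "not found" result into the insertion point
        }
        return i < blockOffsets.length ? blockOffsets[i] : -1L;
    }

    public static void main(String[] args) {
        // Made-up offsets of compressed blocks inside one .lzo file.
        long[] blockOffsets = {0L, 260_000L, 515_000L, 770_000L};
        // A split that nominally starts at byte 300000 begins reading at the next block.
        System.out.println(alignToBlock(blockOffsets, 300_000L));
    }
}
```

Without such an index, a reader dropped at an arbitrary byte offset has no way to find the next block start, which is why an unindexed LZO file has to be processed by a single mapper.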
Snappy
For final output
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(conf, true);
conf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");
For map output
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");