Data Platform/Systems/Hive/Compression
Uncompressed vs. Snappy compressed Sequence Files
I just ran some rough comparisons of data sizes and Hive query times for webrequest data stored in HDFS uncompressed vs. as Snappy-compressed Sequence Files.
Uncompressed
Size
In the first nine hours of January 7th, 2014 (hours 00–08), uncompressed JSON-formatted mobile webrequest logs imported into HDFS via Kafka totaled 91.1 GB. Each hourly import was between 8 and 12 GB.
hdfs dfs -du -s -h /wmf/data/external/webrequest_mobile/hourly/2014/01/07/{00..08}
10.6 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/00
10.9 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/01
11.3 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/02
11.7 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/03
10.6 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/04
9.9 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/05
9.1 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/06
8.6 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/07
8.4 G /wmf/data/external/webrequest_mobile/hourly/2014/01/07/08
91.1 G
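As a sanity check on the total above, the per-hour sizes from the listing can be summed with awk (the figures are copied from the `hdfs dfs -du` output; only the sum is computed here):

```shell
# Sum the per-hour sizes (in GB) reported above for hours 00-08.
printf '%s\n' 10.6 10.9 11.3 11.7 10.6 9.9 9.1 8.6 8.4 \
  | awk '{sum += $1} END {printf "%.1f G\n", sum}'
# → 91.1 G
```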
Query Time
Running a select count(*) query on a single hour took 44.276 seconds and launched 42 mappers. Running the same query on all nine hours of data (00–08) took 158.627 seconds and launched 343 mappers.
select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour=00;
...
MapReduce Total cumulative CPU time: 16 minutes 11 seconds 670 msec
Ended Job = job_1387838787660_0365
MapReduce Jobs Launched:
Job 0: Map: 42  Reduce: 1  Cumulative CPU: 971.67 sec  HDFS Read: 11422909543  HDFS Write: 9  SUCCESS
Total MapReduce CPU Time Spent: 16 minutes 11 seconds 670 msec
OK
_c0
16641115
Time taken: 44.276 seconds

select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour between 00 and 08;
...
MapReduce Total cumulative CPU time: 0 days 4 hours 16 minutes 12 seconds 420 msec
Ended Job = job_1387838787660_0363
MapReduce Jobs Launched:
Job 0: Map: 343  Reduce: 1  Cumulative CPU: 15372.42 sec  HDFS Read: 98055786272  HDFS Write: 10  SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 16 minutes 12 seconds 420 msec
OK
_c0
143199253
Time taken: 158.627 seconds
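The mapper counts line up with input splits: roughly one map task per HDFS block of input. Dividing the total input size by the mapper count gives about 272 MB per mapper, which would be consistent with a ~256 MB block size — an assumption here, not something verified on the cluster:

```shell
# Rough input size per mapper for the nine-hour query: 91.1 GB across 343 maps.
awk 'BEGIN {printf "%.0f MB per mapper\n", 91.1 * 1024 / 343}'
# → 272 MB per mapper
```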
Snappy compressed Sequence Files
I recently got SequenceFileRecordWriterProvider.java merged upstream in LinkedIn's Camus. Using it rather than StringRecordWriterProvider.java writes the same data out as Snappy-compressed Hadoop Sequence Files.
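Switching the writer is a matter of Camus job properties. The property names and class path below are a sketch from memory and may differ by Camus version — verify against the Camus release in use:

```properties
# Assumed Camus job properties (names unverified -- check your Camus version):
# write Snappy-compressed SequenceFiles instead of plain text.
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
```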
Size
JSON data imported for the same nine-hour period and Snappy compressed was 21.9 GB, 24% of the original size.
hdfs dfs -du -s -h /user/otto/data/compressed/webrequest_mobile/hourly/2014/01/07/{00..08}
2.5 G data/compressed/webrequest_mobile/hourly/2014/01/07/00
2.6 G data/compressed/webrequest_mobile/hourly/2014/01/07/01
2.7 G data/compressed/webrequest_mobile/hourly/2014/01/07/02
2.8 G data/compressed/webrequest_mobile/hourly/2014/01/07/03
2.5 G data/compressed/webrequest_mobile/hourly/2014/01/07/04
2.4 G data/compressed/webrequest_mobile/hourly/2014/01/07/05
2.2 G data/compressed/webrequest_mobile/hourly/2014/01/07/06
2.1 G data/compressed/webrequest_mobile/hourly/2014/01/07/07
2.1 G data/compressed/webrequest_mobile/hourly/2014/01/07/08
21.9 G
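The 24% figure above follows directly from the two totals:

```shell
# Compressed total over uncompressed total, from the two listings on this page.
awk 'BEGIN {printf "%.0f%%\n", 21.9 / 91.1 * 100}'
# → 24%
```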
Query Time
The same select count(*) query on a single hour of compressed data took 86.232 seconds, about twice as long as on uncompressed data. Running the query on the same nine hours of compressed data took 158.627 seconds, only about 8% longer than the run on uncompressed data. The number of mappers launched was the same as in the uncompressed case.
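The "about twice as long" figure for the single-hour case works out like this, using the two timings quoted above:

```shell
# Single-hour slowdown factor: compressed query time over uncompressed.
awk 'BEGIN {printf "%.2fx\n", 86.232 / 44.276}'
# → 1.95x
```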
Summary
Using Snappy to compress the JSON webrequest logs yields significant space savings, with only a slight performance penalty for large queries. Smaller data sets take a proportionally larger query-time hit. I will run another test once I have more data to compare (a month), but if the results are approximately the same I will not update this page.
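Back-of-envelope, for mobile webrequest logs alone, the savings extrapolate to roughly 185 GB of HDFS space per day — assuming the nine-hour sample is representative of a full day, which it isn't exactly, since traffic varies by hour:

```shell
# Projected daily savings: (uncompressed - compressed) over 9 hours, scaled to 24.
awk 'BEGIN {printf "%.0f GB/day\n", (91.1 - 21.9) / 9 * 24}'
# → 185 GB/day
```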
Recommendation: use Snappy compression for all webrequest imports.