
Data Platform/Systems/Hive/Compression

From Wikitech

Uncompressed vs. Snappy-compressed Sequence Files

I ran some rough comparisons of data sizes and Hive query times for webrequest data stored in HDFS uncompressed vs. as Snappy-compressed Sequence Files.

Uncompressed

Size

In the first nine hours (00–08) of January 7th, 2014, uncompressed JSON-formatted mobile webrequest logs imported into HDFS via Kafka totaled 91.1 GB, with each hourly import between roughly 8 and 12 GB.

hdfs dfs -du -s -h /wmf/data/external/webrequest_mobile/hourly/2014/01/07/{00..08}
10.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/00
10.9 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/01
11.3 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/02
11.7 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/03
10.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/04
9.9 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/05
9.1 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/06
8.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/07
8.4 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/08

Total: 91.1 G
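As a quick sanity check, the hourly sizes can be summed to confirm the total (a minimal sketch; the values are copied from the listing above):

```python
# Hourly uncompressed sizes in GB, copied from the `hdfs dfs -du` listing above.
hourly_gb = [10.6, 10.9, 11.3, 11.7, 10.6, 9.9, 9.1, 8.6, 8.4]

total_gb = sum(hourly_gb)
print(f"{total_gb:.1f} G")  # matches the 91.1 G total above
```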


Query Time

Running a select count(*) query on a single hour took 44.276 seconds and launched 42 mappers. Running the same query on nine hours of data (00–08) took 158.627 seconds and launched 343 mappers.

select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour=00;
...
MapReduce Total cumulative CPU time: 16 minutes 11 seconds 670 msec
Ended Job = job_1387838787660_0365
MapReduce Jobs Launched:
Job 0: Map: 42  Reduce: 1   Cumulative CPU: 971.67 sec   HDFS Read: 11422909543 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 16 minutes 11 seconds 670 msec
OK
_c0
16641115
Time taken: 44.276 seconds

select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour between 00 and 08;
...
MapReduce Total cumulative CPU time: 0 days 4 hours 16 minutes 12 seconds 420 msec
Ended Job = job_1387838787660_0363
MapReduce Jobs Launched:
Job 0: Map: 343  Reduce: 1   Cumulative CPU: 15372.42 sec   HDFS Read: 98055786272 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 16 minutes 12 seconds 420 msec
OK
_c0
143199253
Time taken: 158.627 seconds

Snappy-compressed Sequence Files

I recently got SequenceFileRecordWriterProvider.java merged upstream into LinkedIn's Camus. Using it instead of StringRecordWriterProvider.java writes the same data out as Snappy-compressed Hadoop Sequence Files.
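For reference, switching a Camus job over is a matter of pointing the record writer provider at the new class and enabling Snappy output compression. A sketch of the relevant camus.properties entries — the class package path and exact key names are assumptions that may differ by Camus and Hadoop version, so verify them against your deployment:

```properties
# Use the SequenceFile writer instead of StringRecordWriterProvider
# (class package path is an assumption; check your Camus build)
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.SequenceFileRecordWriterProvider

# Standard Hadoop output compression settings, picked up by the writer
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
```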

Size

JSON data imported for the same nine-hour period (00–08) and Snappy compressed was 21.9 GB, 24% of the original size.

hdfs dfs -du -s -h /user/otto/data/compressed/webrequest_mobile/hourly/2014/01/07/{00..08}

2.5 G  data/compressed/webrequest_mobile/hourly/2014/01/07/00
2.6 G  data/compressed/webrequest_mobile/hourly/2014/01/07/01
2.7 G  data/compressed/webrequest_mobile/hourly/2014/01/07/02
2.8 G  data/compressed/webrequest_mobile/hourly/2014/01/07/03
2.5 G  data/compressed/webrequest_mobile/hourly/2014/01/07/04
2.4 G  data/compressed/webrequest_mobile/hourly/2014/01/07/05
2.2 G  data/compressed/webrequest_mobile/hourly/2014/01/07/06
2.1 G  data/compressed/webrequest_mobile/hourly/2014/01/07/07
2.1 G  data/compressed/webrequest_mobile/hourly/2014/01/07/08

Total: 21.9 G
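The 24% figure follows directly from the two totals (quick arithmetic, using the sums from the listings above):

```python
uncompressed_gb = 91.1  # total from the uncompressed listing
compressed_gb = 21.9    # total from the Snappy-compressed listing

ratio = compressed_gb / uncompressed_gb
print(f"{ratio:.0%} of the original size")  # roughly 24%
print(f"{1 - ratio:.0%} space savings")     # roughly 76%
```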

Query Time

The same select count(*) query on a single hour of compressed data took 86.232 seconds, about twice as long as on uncompressed data. Running the query on nine hours' worth of compressed data took only about 8% longer than the uncompressed run. The number of mappers launched was the same as in the uncompressed case.

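The single-hour slowdown quoted above reduces to a simple ratio of the two wall-clock times (a quick check, using the figures from the runs described above):

```python
# Wall-clock times in seconds for the single-hour count(*) query.
uncompressed_1h = 44.276  # uncompressed JSON
compressed_1h = 86.232    # Snappy-compressed Sequence Files

slowdown = compressed_1h / uncompressed_1h
print(f"single-hour query: {slowdown:.2f}x slower on compressed data")  # about 1.95x
```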

Summary

Using Snappy to compress the JSON webrequest logs yields significant space savings (the compressed data is about 24% of the original size) with only a slight performance penalty for large queries; queries over smaller data sets are affected more. I will run another test once I have a month of data to compare; if the results are approximately the same, I will not update this page.

Recommendation: use Snappy compression for all webrequest imports.