Wednesday, October 10, 2012

Creating and inserting into bucketed table

The clause for bucketing is:

[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.

According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables, with hive.enforce.bucketing=true, we don't need CLUSTER BY clause in the insert query. This is NOT correct. We still need CLUSTER BY in insert query.

What hive.enforce.bucketing does is "DISTRIBUTE BY". In order to do sorting automatically, we need "hive.enforce.sorting=true" or have CLUSTER BY in the insert query.

So what "CLUSTERED BY" in the table definition means is "DISTRIBUTE BY" and hive.enforce.bucketing only enforces "DISTRIBUTE BY".

Wednesday, October 03, 2012

Memory units

Different units for size are used in different contexts.

context           unit
ps                1024 bytes
cat /proc/meminfo 1024 bytes
free              1024 bytes
du                1024 bytes
ls -l             1 byte