Wei's tips: 2012

Wednesday, December 12, 2012

Compile octave

The latest version of Octave needs texinfo and a newer version of gcc.

1. Install gcc 4.7.1. gcc needs three additional components (gmp, mpfr and mpc), which have to be downloaded separately:

ftp://ftp.gnu.org/gnu/gmp/gmp-4.3.2.tar.gz
http://www.mpfr.org/mpfr-2.4.2/mpfr-2.4.2.tar.gz

http://www.multiprecision.org/mpc/download/mpc-0.8.1.tar.gz

Unpack them and copy them into directories named gmp, mpfr and mpc inside the gcc-4.7.1 source dir.

mkdir ~/gcc-4.7.1

cd ~/gcc-4.7.1

~/downloads/gcc-4.7.1/configure --enable-languages=all

make

sudo make install
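
The unpack-and-copy part of step 1 can be sketched as a small shell function. This is only a sketch: the function name unpack_gcc_prereqs and its two arguments are made up for illustration, and it assumes each tarball unpacks into a top-level directory named after the tarball (gmp-4.3.2, mpfr-2.4.2, mpc-0.8.1).

```shell
# Hypothetical helper: unpack the three prerequisite tarballs into the gcc
# source tree and rename them to the short names gmp, mpfr and mpc.
# Usage: unpack_gcc_prereqs <dir-with-tarballs> <gcc-source-dir>
unpack_gcc_prereqs() {
    tarball_dir=$1
    gcc_src=$2
    for pkg in gmp-4.3.2 mpfr-2.4.2 mpc-0.8.1; do
        name=${pkg%%-*}    # strip the version: gmp-4.3.2 -> gmp
        # Assumption: the tarball contains a single top-level directory $pkg.
        tar -xzf "$tarball_dir/$pkg.tar.gz" -C "$gcc_src" || return 1
        mv "$gcc_src/$pkg" "$gcc_src/$name" || return 1
    done
}
```

With the downloads in ~/downloads, the call would be something like: unpack_gcc_prereqs ~/downloads ~/downloads/gcc-4.7.1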

2. Install texinfo

wget http://ftp.gnu.org/gnu/texinfo/texinfo-4.13.tar.gz

tar xzf texinfo-4.13.tar.gz

cd texinfo-4.13

./configure

make

sudo make install

3. Install blas and lapack. In my environment, this is done by:

yum install blas

yum install lapack

I also had to manually create the symlinks libblas.so and liblapack.so, because Octave's configure script cannot find the versioned .so files.

sudo ln -s libblas.so.3.0.3 /usr/lib64/libblas.so

sudo ln -s liblapack.so.3.0.3 /usr/lib64/liblapack.so
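
The version suffix (3.0.3 here) will differ between systems, so the two ln commands above can be generalized. The following is a rough sketch, not a tested recipe: the function name link_unversioned_so is made up, and it simply links to the first versioned file the shell glob finds.

```shell
# Hypothetical helper: create <libdir>/<name>.so as a relative symlink to an
# existing versioned file (<name>.so.*) when the unversioned name is missing.
# Usage: link_unversioned_so <libdir> <name>     e.g. /usr/lib64 libblas
link_unversioned_so() {
    libdir=$1
    name=$2
    # Nothing to do if the unversioned name already exists.
    [ -e "$libdir/$name.so" ] && return 0
    for candidate in "$libdir/$name".so.*; do
        [ -e "$candidate" ] || continue    # glob matched nothing
        # Relative target, same as the manual ln -s commands above.
        ln -s "$(basename "$candidate")" "$libdir/$name.so"
        return 0
    done
    echo "no versioned $name.so.* found in $libdir" >&2
    return 1
}
```

For /usr/lib64 this has to run from a root shell, since writing there needs the same privileges as the sudo ln commands.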

4. Install gnuplot-4.4

5. Install octave

cd octave-3.6.2

LD_RUN_PATH=/usr/local/lib64 LDFLAGS=-L/usr/local/lib64 ./configure --without-curl

LD_RUN_PATH=/usr/local/lib64 make -j 8

You may need to use directories other than /usr/local/lib64, depending on where your blas/lapack/gfortran libraries are located. If an error occurs, look at config.log for more detail.
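
config.log is long, so it helps to pull out only the compiler and linker error lines. A minimal sketch, assuming the failures show up as lines containing "error:" (the function name config_log_errors is made up):

```shell
# Hypothetical helper: print the last N lines of a config.log that contain
# "error:", prefixed with their line numbers.
# Usage: config_log_errors [file] [count]
config_log_errors() {
    log=${1:-config.log}
    count=${2:-20}
    grep -n 'error:' "$log" | tail -n "$count"
}
```

The line numbers make it easy to jump to the surrounding test in config.log and see which library or header the configure script was looking for.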

Wednesday, October 10, 2012

Creating and inserting into bucketed table

The clause for bucketing is:

[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
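
A manual insert along these lines could look like the sketch below. It is written as a shell function that prints the HiveQL, so it can be inspected before being handed to hive -e. The function name and the table and column names in the usage line are made up for illustration.

```shell
# Hypothetical helper: print the HiveQL for a manual insert into a bucketed
# table. The number of reducers is set to the number of buckets and the rows
# are clustered by the bucketing column.
# Usage: bucketed_insert_sql <target> <source> <bucket-column> <num-buckets>
bucketed_insert_sql() {
    target=$1
    source_table=$2
    bucket_col=$3
    num_buckets=$4
    cat <<EOF
SET mapred.reduce.tasks=$num_buckets;
INSERT OVERWRITE TABLE $target
SELECT * FROM $source_table
CLUSTER BY $bucket_col;
EOF
}
```

For example: hive -e "$(bucketed_insert_sql user_info_bucketed user_info user_id 32)"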

According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables, with hive.enforce.bucketing=true we don't need a CLUSTER BY clause in the insert query. This is NOT correct when the data also has to be sorted: we still need CLUSTER BY in the insert query.

What hive.enforce.bucketing does is the equivalent of "DISTRIBUTE BY". To get the sorting done automatically as well, we need "hive.enforce.sorting=true", or CLUSTER BY in the insert query.

So what "CLUSTERED BY" in the table definition means is "DISTRIBUTE BY" and hive.enforce.bucketing only enforces "DISTRIBUTE BY".
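
Following that reading of the two settings, an insert that leaves both distribution and sorting to Hive could be sketched as below. The function prints HiveQL for hive -e; its name and the table names in the usage line are made up, and the behaviour of the two settings is taken from the notes above rather than verified here.

```shell
# Hypothetical helper: print the HiveQL for an insert that relies on the two
# enforce settings instead of an explicit CLUSTER BY in the query.
# Usage: enforced_insert_sql <target> <source>
enforced_insert_sql() {
    target=$1
    source_table=$2
    cat <<EOF
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
INSERT OVERWRITE TABLE $target
SELECT * FROM $source_table;
EOF
}
```

For example: hive -e "$(enforced_insert_sql user_info_bucketed user_info)"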

Wednesday, October 03, 2012

Memory units

Different units for size are used in different contexts.

context              unit
ps                   1024 bytes
cat /proc/meminfo    1024 bytes
free                 1024 bytes
du                   1024 bytes
ls -l                1 byte
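
To compare a value from one of the 1024-byte tools with ls -l output, it has to be multiplied by 1024 first. A small sketch (both function names are made up; the optional file argument exists only so the function can be tried on a copy of /proc/meminfo):

```shell
# Convert a size given in 1024-byte units to bytes.
kib_to_bytes() {
    echo $(( $1 * 1024 ))
}

# Hypothetical helper: read one field (e.g. MemTotal) from a meminfo-style
# file and print its value in bytes.
# Usage: meminfo_bytes <field> [file]
meminfo_bytes() {
    field=$1
    file=${2:-/proc/meminfo}
    # Lines look like "MemTotal:       16384 kB"; the second column is the value.
    kib=$(awk -v key="$field:" '$1 == key { print $2 }' "$file")
    [ -n "$kib" ] || return 1
    kib_to_bytes "$kib"
}
```

For example, meminfo_bytes MemTotal prints the total memory in bytes, the same unit ls -l uses for file sizes.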