Big Data Hadoop Hive Practice Session

Hi Friends… This blog is about Hive, one of the main standard tools in the Hadoop framework ecosystem…

HIVE Practice:

Hive Lab1 : Inner tables and External Tables

hive> show databases;

-- lists the existing databases

hive> create table mysamp(line string);
OK
Time taken: 0.543 seconds

hive> load data local inpath ‘file1’
> into table mysamp;

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mysamp
Found 1 items
-rw-r–r–   1 training supergroup         50 2016-06-15 07:21 /user/hive/warehouse/mysamp/file1

Whenever a table is created under the default database, one directory with the table's name is created in HDFS.

When you load a file into the table,
the file is copied into the table's backend directory.

Location of the default database:
/user/hive/warehouse

hive> select * from mysamp;
OK
aaaaaaaaaaa
aaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
Time taken: 0.166 seconds
hive>

When a select statement is submitted, Hive reads from the table's backend directory (all files of the directory).

[training@localhost ~]$ cat > file2
bbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbb

hive> load data local inpath ‘file2’
>  into table mysamp;

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mysamp
Found 2 items
-rw-r–r–   1 training supergroup         50 2016-06-15 07:21 /user/hive/warehouse/mysamp/file1
-rw-r–r–   1 training supergroup         55 2016-06-15 07:28 /user/hive/warehouse/mysamp/file2
[training@localhost ~]$

hive> select * from mysamp;
OK
aaaaaaaaaaa
aaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbb
Time taken: 0.065 seconds
hive>

-- here, Hive is reading from all files of the backend directory.

hive> drop table mysamp;

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mysamp
ls: Cannot access /user/hive/warehouse/mysamp: No such file or directory.

-- as the above operation shows,
when the table is dropped,
its backend directory in HDFS is also deleted.

-- this means both metadata and data are dropped.
This is the behavior of an "inner table" (managed table).

two types of tables.
____________________
i) inner table. (default)
-- if the table is dropped, both metadata and data are deleted:
from Hive, the table is deleted;
from HDFS, the backend directory is deleted.

ii) external table.
-- if the table is dropped, only the metadata is deleted.

From Hive, the table alone is deleted, but the backend directory is still available with its data files,

so that we can reuse it.
creating external table:

hive> create external table ursamp(line  string);

hive> load data local inpath ‘file1’
>  into table ursamp;

hive> load data local inpath ‘file2’
>  into table ursamp;

hive>

[training@localhost ~]$ hadoop  fs -ls /user/hive/warehouse/ursamp
Found 2 items
-rw-r–r–   1 training supergroup         50 2016-06-15 07:40 /user/hive/warehouse/ursamp/file1
-rw-r–r–   1 training supergroup         55 2016-06-15 07:40 /user/hive/warehouse/ursamp/file2
[training@localhost ~]$

hive> select * from ursamp;
OK
aaaaaaaaaaa
aaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbb
Time taken: 0.081 seconds
hive>

hive> drop table ursamp;

[training@localhost ~]$ hadoop  fs -ls /user/hive/warehouse/ursamp
Found 2 items
-rw-r–r–   1 training supergroup         50 2016-06-15 07:40 /user/hive/warehouse/ursamp/file1
-rw-r–r–   1 training supergroup         55 2016-06-15 07:40 /user/hive/warehouse/ursamp/file2
[training@localhost ~]$

From Hive, ursamp is deleted,
but the ursamp directory with its data files is still safely available in the warehouse location.

Advantage of an external table: reuse of the data.

___________________________________

how to reuse?

hive> create table ursamp(line string);

-- create behavior:

if the table's directory (ursamp) is already available, it will be reused;
if not available, a new directory will be created.
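A quick check of the reuse (the old file1 and file2 should still be read from the existing ursamp directory):

hive> select * from ursamp;
-- expected: the aaaa... and bbbb... rows loaded before the drop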

_____________________________

creating databases.

hive> create database mydb;
OK
Time taken: 0.029 seconds
hive> use mydb;
OK
Time taken: 0.013 seconds
hive> create table urtab(line string);
OK
Time taken: 0.04 seconds
hive>

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mydb.db
Found 1 items
drwxr-xr-x   – training supergroup          0 2016-06-15 07:56 /user/hive/warehouse/mydb.db/urtab
[training@localhost ~]$

In HDFS, under the warehouse location, one directory is created with the database name plus a .db extension:

mydb.db

hive> load data local inpath ‘file1’
>   into table urtab;

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mydb.db/urtab
Found 1 items
-rw-r–r–   1 training supergroup         50 2016-06-15 07:59 /user/hive/warehouse/mydb.db/urtab/file1

________________________________

Creating a table's directory under a custom location:

hive> create table mytab(line string)
>  location ‘/user/myloc’;
OK
Time taken: 0.034 seconds
hive> load data local inpath ‘file1’ into table mytab;
Copying data from file:/home/training/file1
Copying file: file:/home/training/file1
Loading data to table default.mytab
OK
Time taken: 0.131 seconds
hive>

[training@localhost ~]$ hadoop fs -ls /user/myloc
Found 1 items
-rw-r–r–   1 training supergroup         50 2016-06-15 08:02 /user/myloc/file1
[training@localhost ~]$

-- the directory /user/myloc is created (or reused) for the mytab table

 

 

Hive Lab2 :

Temperature data Handling and Other operations

 

[training@localhost ~]$ gedit tmpr.txt
[training@localhost ~]$ cat tmpr.txt
xxxxx2006xxx34xxx
xxxxx2006xxx35xxx
xxxxx2006xxx32xxx
xxxxx2007xxx24xxx
xxxxx2007xxx21xxx
xxxxx2008xxx37xxx
xxxxx2008xxx39xxx
[training@localhost ~]$

hive> create database mytmpr;
OK
Time taken: 1.756 seconds
hive> use mytmpr;
OK
Time taken: 0.032 seconds
hive> create table raw(line string);
OK
Time taken: 0.345 seconds
hive>

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mytmpr.db
Found 1 items
drwxr-xr-x   – training supergroup          0 2016-06-23 20:31 /user/hive/warehouse/mytmpr.db/raw
[training@localhost ~]$

-- when a database is created, one directory is created in the warehouse location with the database name and a .db extension.

/user/hive/warehouse/mytmpr.db

hive> load data local inpath  ‘tmpr.txt’
>  into table raw;

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mytmpr.db/raw
Found 1 items
-rw-r–r–   1 training supergroup        126 2016-06-23 20:35 /user/hive/warehouse/mytmpr.db/raw/tmpr.txt
[training@localhost ~]$
hive> select * from raw;
OK
xxxxx2006xxx34xxx
xxxxx2006xxx35xxx
xxxxx2006xxx32xxx
xxxxx2007xxx24xxx
xxxxx2007xxx21xxx
xxxxx2008xxx37xxx
xxxxx2008xxx39xxx
Time taken: 0.186 seconds
hive>

hive> create table tmpr(y int, t int);
OK
Time taken: 0.058 seconds
hive> insert overwrite table tmpr
>   select substr(line,6,4),
>     substr(line,13,2) from raw;

hive> select * from tmpr;
OK
2006    34
2006    35
2006    32
2007    24
2007    21
2008    37
2008    39
Time taken: 0.057 seconds
hive> describe tmpr;
OK
y       int
t       int
Time taken: 0.065 seconds
hive>
hive> create table results(y int,
max int, min int);
hive> insert overwrite table results
select y, max(t), min(t)
from tmpr group by y;

hive> select * from results;
OK
2006    35      32
2007    24      21
2008    39      37
Time taken: 0.058 seconds
hive>

[training@localhost ~]$ hadoop fs -cat /user/hive/warehouse/mytmpr.db/tmpr/000000_0
2006 34
2006 35
2006 32
2007 24
2007 21
2008 37
2008 39
[training@localhost ~]$

The default field delimiter for a Hive table is \001 (Ctrl-A).

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/mytmpr.db/results
Found 1 items
-rw-r–r–   1 training supergroup         33 2016-06-23 20:48 /user/hive/warehouse/mytmpr.db/results/000000_0
[training@localhost ~]$ hadoop fs -cat /user/hive/warehouse/mytmpr.db/results/000000_0
2006 35 32
2007 24 21
2008 39 37
[training@localhost ~]$
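The backend files above use that \001 delimiter. If you want them human-readable, you can declare a delimiter when creating the table; a minimal sketch (results_csv is a hypothetical table name, not part of the lab):

hive> create table results_csv(y int, max int, min int)
    >  row format delimited
    >   fields terminated by ',';
hive> insert overwrite table results_csv
    >   select * from results;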
______________________________

If the temperature data contains both positive and negative values:

[training@localhost ~]$ cat tmpr2.txt
xxxxx2006xxx34xxx
xxxxx2006xxx-35xxx
xxxxx2006xxx-14xxx
xxxxx2006xxx32xxx
xxxxx2007xxx-24xxx
xxxxx2007xxx-21xxx
xxxxx2007xxx35xxx
xxxxx2008xxx-37xxx
xxxxx2008xxx39xxx
[training@localhost ~]$
hive> create database urtmpr;
OK
Time taken: 0.022 seconds
hive> use urtmpr;
OK
Time taken: 0.011 seconds
hive> create table raw(line string);
OK
Time taken: 0.032 seconds
hive> load data local inpath ‘tmpr2.txt’
>  into table raw;

hive> create table tmpr(y int, t int);
OK
Time taken: 0.035 seconds
hive> insert overwrite table tmpr
>  select * from (
>    select substr(line,6,4),
>      substr(line,13,2) from raw
>   where substr(line,13,1)!=’-‘
>      union all
>    select substr(line,6,4),
>      substr(line,13,3) from raw
>   where substr(line,13,1)=’-‘) x;

-- in Hive, only "union all" is available; it allows duplicates.
-- a Hive union must be placed in a subquery.
-- the subquery must have an alias.
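If you need UNION (distinct) semantics, one common workaround is to wrap the UNION ALL in a subquery and de-duplicate with GROUP BY; a sketch on the tables above:

hive> select y, t from (
    >   select y, t from tmpr
    >    union all
    >   select y, t from tmpr ) u
    >  group by y, t;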

hive> select * from tmpr;
OK
2006    34
2006    -35
2006    -14
2006    32
2007    -24
2007    -21
2007    35
2008    -37
2008    39
Time taken: 0.056 seconds
hive>

hive> create table results(y int,
>   max int, min int);
OK
Time taken: 0.047 seconds
hive> insert overwrite table results
>   select y, max(t) , min(t)
>   from tmpr group by y;

hive> select * from results;
OK
2006    34      -35
2007    35      -24
2008    39      -37
Time taken: 0.054 seconds
hive>

_____________________________

Both inner and external tables can use custom locations.

hive> create table hvtab1(line string)
location ‘/user/ursloc’;
hive> load data local inpath ‘file1’
into table hvtab1;

[training@localhost ~]$ hadoop fs -ls /user/ursloc
Found 1 items
-rw-r–r–   1 training supergroup         61 2016-06-23 21:15 /user/ursloc/file1
[training@localhost ~]$

A single location can be shared by multiple tables.

hive> create table hvtab2(line string)
location '/user/ursloc';

Now both tables use the same data and the same location.

_____________

how to reuse:

hive> create external table
>   xtab(line string);
OK
Time taken: 0.055 seconds
hive> load data local inpath ‘file1’
>  into table xtab;

hive> select * from xtab;
OK
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa
Time taken: 0.059 seconds
hive> drop table xtab;
OK
Time taken: 0.062 seconds
hive> show tables;
OK
raw
results
tmpr
Time taken: 0.042 seconds
hive>

hive> create table xtab(line string);
OK
Time taken: 0.031 seconds
hive> select * from xtab;
OK
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa
Time taken: 0.06 seconds
hive>

-- create behavior:
if the directory already exists, it is reused; if it does not exist, a new directory is created.

_________________

What happens to the table if its backend directory is deleted?

The table is not deleted; its metadata remains, but a select now returns no rows.

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/urtmpr.db/raw
Found 1 items
-rw-r–r–   1 training supergroup        167 2016-06-23 21:00 /user/hive/warehouse/urtmpr.db/raw/tmpr2.txt
[training@localhost ~]$ hadoop fs -rmr /user/hive/warehouse/urtmpr.db/raw
Deleted hdfs://localhost/user/hive/warehouse/urtmpr.db/raw

hive> show tables;
OK
raw
results
tmpr
xtab
Time taken: 0.045 seconds
hive> select * from raw;
OK
Time taken: 0.036 seconds
hive>
________________

 

 

Hive Lab4 : Bucketing Tables

hive> create database bucks;
OK
Time taken: 1.775 seconds
hive> use bucks;
OK
Time taken: 0.022 seconds
hive> create table sales(pid string, pr int)
> row format delimited
>   fields terminated by ‘,’;
OK
Time taken: 0.364 seconds
hive> load data local inpath ‘sales’
>   into table sales;
Copying data from file:/home/training/sales
Copying file: file:/home/training/sales
Loading data to table bucks.sales
OK
Time taken: 0.187 seconds
hive>

hive> create table buckstab(pid string,
>   pr int)
>  clustered by (pid)
> into 3 buckets;
OK
Time taken: 0.051 seconds
hive>

loading data into buckets.

hive> set hive.enforce.bucketing=true;
hive> insert overwrite table buckstab
>   select * from sales;

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/bucks.db/sales
Found 1 items
-rw-r–r–   1 training supergroup        185 2016-06-21 07:35 /user/hive/warehouse/bucks.db/sales/sales
[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/bucks.db/buckstab
Found 3 items
-rw-r–r–   1 training supergroup         62 2016-06-21 07:40 /user/hive/warehouse/bucks.db/buckstab/000000_0
-rw-r–r–   1 training supergroup         35 2016-06-21 07:40 /user/hive/warehouse/bucks.db/buckstab/000001_0
-rw-r–r–   1 training supergroup         88 2016-06-21 07:40 /user/hive/warehouse/bucks.db/buckstab/000002_0
[training@localhost ~]$
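One benefit of bucketing is sampling: TABLESAMPLE can read a single bucket file instead of scanning the whole table. A minimal sketch on the table above (the rows returned depend on how the pid values hash):

hive> select * from buckstab
    >   tablesample(bucket 1 out of 3 on pid);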

 

_____________________________________________________________________

 

Hive Lab5 :

xml data processing

[training@localhost ~]$ cat sampxml1
<rec>
<name>Ravi</name>
<age>25</age>
<sex>m</sex>
</rec>
<rec>
<name>Rani</name>
<sex>f</sex>
<city>Hyd</city>
</rec>
[training@localhost ~]$

hive> create table samp(line string);
OK
Time taken: 0.309 seconds
hive> load data local inpath ‘sampxml1’
>  into table samp;
Copying data from file:/home/training/sampxml1
Copying file: file:/home/training/sampxml1
Loading data to table ourxml.samp
OK
Time taken: 0.202 seconds
hive>

hive> select * from samp;
OK
<rec>
<name>Ravi</name>
<age>25</age>
<sex>m</sex>
</rec>
<rec>
<name>Rani</name>
<sex>f</sex>
<city>Hyd</city>
</rec>
Time taken: 0.184 seconds
hive>

The above xml file has 2 records.
When this file is loaded into a Hive table, Hive treats each line as a record:
as per xml == two records;
as per Hive --> 10 records (rows).

This data is not ready for Hive;
Hive expects the whole xml record on a single line,
so a data preparation step is required.

Prepare a MapReduce job which can convert each xml record into a single line.

(We will see this later in the MapReduce session.)
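Outside of MapReduce, for small files like this one, a shell one-liner can also put each record on a single line (a sketch assuming GNU tr/sed on the training VM; xml1_flat is just an example output name):

[training@localhost ~]$ tr -d '\n' < sampxml1 | sed 's#</rec>#</rec>\n#g' > xml1_flat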

[training@localhost ~]$ cat xml1
<rec><name>Ravi</name><age>25</age><city>hyd</city></rec>
<rec><name>Rani</name><age>24</age><sex>f</sex></rec>
<rec><name>Sampath</name><sex>m</sex><city>Del</city></rec>
[training@localhost ~]$

hive> create table raw(line string);
OK
Time taken: 0.05 seconds
hive> load data local inpath ‘xml1′
>  into table raw;
Copying data from file:/home/training/xml1
Copying file: file:/home/training/xml1
Loading data to table ourxml.raw
OK
Time taken: 0.123 seconds
hive>

hive> select * from raw;
OK
<rec><name>Ravi</name><age>25</age><city>hyd</city></rec>
<rec><name>Rani</name><age>24</age><sex>f</sex></rec>
<rec><name>Sampath</name><sex>m</sex><city>Del</city></rec>
Time taken: 0.066 seconds
hive> select count(*) from raw;
3
— 3 rows.

hive> select xpath_string(line,’rec/name’) from raw;
Ravi
Rani
Sampath

hive> select
>  xpath_string(line,’rec/name’),
>  xpath_int(line,’rec/age’),
>  xpath_string(line,’rec/sex’),
>  xpath_string(line,’rec/city’)
> from raw;

Ravi    25              hyd
Rani    24      f
Sampath 0       m       Del

If a string field is missing, it returns an empty string; if a numeric field is missing, it returns 0.
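If you prefer a marker instead of the empty string for a missing tag, the xpath call can be wrapped, for example:

hive> select xpath_string(line,'rec/name'),
    >   if(xpath_string(line,'rec/city')='', 'NA',
    >      xpath_string(line,'rec/city'))
    > from raw;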

_________________

xml with nested tags.

[training@localhost ~]$ cat xml2
<rec><name><fname>Ravi</fname><lname>kumar</lname></name><age>25</age><contact><email><personal>ravi@gmail.com</personal><official>ravi@infy.com</official></email><phone><mobile>12345</mobile><office>12346</office><residence>12347</residence></phone></contact></rec>
[training@localhost ~]$

hive> create table xraw(line string);
OK
Time taken: 0.075 seconds
hive> load data local inpath ‘xml2′
>  into table xraw;

hive> select * from xraw;
OK
<rec><name><fname>Ravi</fname><lname>kumar</lname></name><age>25</age><contact><email><personal>ravi@gmail.com</personal><official>ravi@infy.com</official></email><phone><mobile>12345</mobile><office>12346</office><residence>12347</residence></phone></contact></rec>
Time taken: 0.064 seconds
hive>

hive> create table info(fname string,
>   lname string,
> age int, personal_email string,
>  official_email string,
>  mobile string, office string,
>  residence string);
OK
Time taken: 0.042 seconds
hive>

hive> insert overwrite table info
>   select
> xpath_string(line,’rec/name/fname’),
> xpath_string(line,’rec/name/lname’),
> xpath_int(line,’rec/age’),
> xpath_string(line,’rec/contact/email/personal’),
> xpath_string(line,’rec/contact/email/official’),
> xpath_string(line,’rec/contact/phone/mobile’),
> xpath_string(line,’rec/contact/phone/office’),
> xpath_string(line,’rec/contact/phone/residence’)
>  from xraw;

hive> select * from info;
OK
Ravi    kumar   25      ravi@gmail.com  ravi@infy.com     12345   12346   12347
Time taken: 0.064 seconds
hive>
_____________

xml with collections.

[training@localhost ~]$ cat xml3
<rec><name>Amar</name><qual>Btech</qual><qual>Mtech</qual></rec>
<rec><name>Amala</name><qual>Bsc</qual><qual>Msc</qual><qual>Mtech</qual></rec>
<rec><name>Akash</name><qual>Btech</qual><qual>Mba</qual></rec>
[training@localhost ~]$

hive> create table yraw(line string);
OK
Time taken: 0.043 seconds
hive> load data local inpath ‘xml3′
>  into table yraw;

hive> select * from yraw;
<rec><name>Amar</name><qual>Btech</qual><qual>Mtech</qual></rec>
<rec><name>Amala</name><qual>Bsc</qual><qual>Msc</qual><qual>Mtech</qual></rec>
<rec><name>Akash</name><qual>Btech</qual><qual>Mba</qual></rec>

hive>
hive> create table raw2(name string,
>    qual  array<string>);
OK
Time taken: 0.039 seconds
hive> insert overwrite table raw2
>   select xpath_string(line,’rec/name’),
>  xpath(line,’rec/qual/text()’)
>   from yraw;

hive> select * from raw2;
OK
Amar    [“Btech”,”Mtech”]
Amala   [“Bsc”,”Msc”,”Mtech”]
Akash   [“Btech”,”Mba”]
Time taken: 0.061 seconds
hive>

hive> select name, size(qual) from raw2;

Amar    2
Amala   3
Akash   2

How to access array elements: by using index numbers; indexing starts from 0.

hive> select qual[0], qual[1],
qual[2] from raw2;
Btech   Mtech   NULL
Bsc     Msc     Mtech
Btech   Mba     NULL

Searching for elements within an array:

hive> select * from raw2
>  where array_contains(qual,’Mtech’);

Amar    [“Btech”,”Mtech”]
Amala   [“Bsc”,”Msc”,”Mtech”]

_______________
Flattening Array elements:

hive> select explode(qual) as q
from raw2;

Btech
Mtech
Bsc
Msc
Mtech
Btech
Mba

hive> select name, explode(qual) as q from raw2;
FAILED: Error in semantic analysis: UDTF’s are not supported outside the SELECT clause, nor nested in expressions
hive>

-- the above statement is invalid,
because a UDTF cannot be combined with other column expressions; use a lateral view instead (as below).

hive> create table  yinfo(name string,
>      qual string);
OK
Time taken: 0.035 seconds
hive> insert overwrite table yinfo
>   select name, myq from raw2
>    lateral view explode(qual) q as myq;

hive> select * from yinfo;
OK
Amar    Btech
Amar    Mtech
Amala   Bsc
Amala   Msc
Amala   Mtech
Akash   Btech
Akash   Mba
Time taken: 0.055 seconds

hive> select * from yinfo
>   where qual in (‘Msc’,’Mtech’);

Amar    Mtech
Amala   Msc
Amala   Mtech

hive> create table yres(qual string, cnt int);
OK
Time taken: 0.036 seconds
hive> insert overwrite table yres
>   select qual, count(*) from yinfo
>    group by qual;

hive> select * from yres;
OK
Bsc     1
Btech   2
Mba     1
Msc     1
Mtech   2
Time taken: 0.051 seconds
hive>

____________________

Assignment:

[training@localhost ~]$ cat xml4
<tr><cid>101</cid><pr>1000</pr><pr>3000</pr></tr>
<tr><cid>102</cid><pr>1200</pr><pr>2000</pr><pr>5000</pr></tr>
<tr><cid>101</cid><pr>4000</pr></tr>
[training@localhost ~]$

hive> create database sales;
hive> use sales;
hive> create table raw(line string);
hive> load data local inpath ‘xml4′
into table raw;
hive> create table raw2(cid string,
pr array<string>);

hive> insert overwrite table raw2
select xpath_string(line,’tr/cid’),
xpath(line,’tr/pr/text()’)
from raw;
hive> create table raw3(cid string,
pr int);
hive> insert overwrite table raw3
select cid, mypr from  raw2
lateral view explode(pr) p as mypr;

hive> create table results(cid string,
totbill int);
hive> insert overwrite table results
select cid, sum(pr)
from raw3
group by cid;

hive> select * from results;

-- my students are really super and energetic --
_______________________________________

Hive Lab6 : xml data processing 2

Hive xml2
______________________

hive> create database xxmls;
OK
Time taken: 0.182 seconds
hive> use xxmls;
OK
Time taken: 0.017 seconds

[training@localhost ~]$ cat xml5
<tr><cid>101</cid><pr>3000</pr><qnt>3</qnt><pr>5000</pr><qnt>2</qnt><pr>1000</pr><qnt>5</qnt></tr>
<tr><cid>102</cid><pr>2000</pr><qnt>5</qnt><pr>5000</pr><qnt>2</qnt></tr>
<tr><cid>101</cid><pr>6000</pr><qnt>5</qnt></tr>
[training@localhost ~]$

hive> create table raw(line string);
OK
Time taken: 0.252 seconds
hive> load data local inpath ‘xml5′
>  into table raw;
Copying data from file:/home/training/xml5
Copying file: file:/home/training/xml5
Loading data to table xxmls.raw
OK
Time taken: 0.178 seconds
hive> select * from raw;
OK
<tr><cid>101</cid><pr>3000</pr><qnt>3</qnt><pr>5000</pr><qnt>2</qnt><pr>1000</pr><qnt>5</qnt></tr>
<tr><cid>102</cid><pr>2000</pr><qnt>5</qnt><pr>5000</pr><qnt>2</qnt></tr>
<tr><cid>101</cid><pr>6000</pr><qnt>5</qnt></tr>
Time taken: 0.155 seconds
hive>

hive> create table raw2(cid int,
> pr array<string>, qnt array<string>);
OK
Time taken: 0.069 seconds
hive> insert overwrite table raw2
>   select xpath_int(line,’tr/cid’),
>     xpath(line,’tr/pr/text()’),
>     xpath(line,’tr/qnt/text()’)
> from raw;

hive> create table cidpr(cid int , pr int);
hive> insert overwrite table cidpr
>   select cid, mypr from raw2
>    lateral view explode(pr) p as mypr;

hive> create table cidqnt(cid int, qnt int);

hive> insert overwrite table cidqnt
>    select cid, myqnt from raw2
>   lateral view explode(qnt) q as myqnt;

hive> select * from cidpr;
OK
101     3000
101     5000
101     1000
102     2000
102     5000
101     6000
Time taken: 0.06 seconds
hive> select * from cidqnt;
OK
101     3
101     2
101     5
102     5
102     2
101     5
Time taken: 0.046 seconds
hive>

hive> add jar Desktop/hivejars.jar;
hive> create temporary function auto
>  as ‘hive.analytics.SeqNumber’;

hive> alter table cidpr add columns(n int);
hive> insert overwrite table cidpr
select cid, pr, auto() from cidpr;

hive> alter table cidqnt add columns(n int);
hive> insert overwrite table cidqnt
select cid, qnt, auto() from cidqnt;

hive> create table trans(cid int, pr int,
qnt int, bill int);
hive> insert overwrite table trans
select l.cid, pr, qnt, pr*qnt
from cidpr l join cidqnt r
on (l.n=r.n);

hive> create table report(cid int, totbill int);
hive> insert overwrite table report
select cid, sum(bill) from trans
group by cid;
hive> select * from report;
________________________

hive.analytics.SeqNumber.java
______________________
package hive.analytics;

import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

// Sequence-number UDF: every call returns the next integer (1, 2, 3, ...).
// Note: the counter lives inside one UDF instance, so the numbering forms a
// single continuous sequence only when the query runs in a single task,
// as in these small labs.
public class SeqNumber extends UDF
{
    int cnt = 0;

    public IntWritable evaluate() throws IOException
    {
        cnt++;
        return new IntWritable(cnt);
    }
}
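To build Desktop/hivejars.jar from this source, a typical compile-and-package sequence looks like the one below (the jar path and the hive/analytics/SeqNumber.java location are assumptions for the CDH training VM; adjust them to your install):

[training@localhost ~]$ mkdir -p classes
[training@localhost ~]$ javac -cp /usr/lib/hive/lib/hive-exec.jar:$(hadoop classpath) -d classes hive/analytics/SeqNumber.java
[training@localhost ~]$ jar cf Desktop/hivejars.jar -C classes .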

______________________________

Keep all of the above statements in a script file:

hscript1.hql
________________

create database urheroes;
use urheroes;
create table raw(line string);
load data local inpath ‘xml5′ into table raw;
create table raw2(cid int, pr array<string>, qnt array<string>);
insert overwrite table raw2
select xpath_int(line,’tr/cid’),
xpath(line,’tr/pr/text()’),
xpath(line,’tr/qnt/text()’) from raw;
create table cidpr(cid int, pr int);
insert overwrite table cidpr
select cid, mypr from raw2
lateral view explode(pr) p as mypr;
create table cidqnt(cid int, qnt int);
insert overwrite table cidqnt
select cid, myq from raw2
lateral view explode(qnt) q as myq;
alter table cidpr add columns(n int);
alter table cidqnt add columns(n int);
add jar Desktop/hivejars.jar;
create temporary function auto
as ‘hive.analytics.SeqNumber’;
insert overwrite table cidpr
select cid, pr, auto() from cidpr;
insert overwrite table cidqnt
select cid, qnt, auto() from cidqnt;
create table trans(cid int, pr int, qnt int,
bill int);
insert overwrite table trans
select l.cid, pr, qnt, pr*qnt
from cidpr l join cidqnt r
on (l.n=r.n);
create table report(cid int, totbill int);
insert overwrite table report
select cid, sum(bill) from trans
group by cid;
select * from trans;
select * from report;
drop table raw;
drop table raw2;
drop table cidpr;
drop table cidqnt;
________________________

how to submit script:

$ hive -f hscript1.hql
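For one-off statements you can also pass the query inline instead of a script file:

$ hive -e "use urheroes; select * from report;"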
_________________________________________

 

Hive Lab7 : Hive Partitioned Tables (A simple Case)

[training@localhost ~]$ cat trans
01/01/2010,30000
01/07/2010,40000
01/08/2010,30000
01/26/2010,40000
02/01/2010,30000
02/07/2010,40000
02/09/2010,30000
02/28/2010,40000
03/01/2010,80000
03/07/2010,40000
04/01/2010,30000
04/17/2010,70000
04/18/2010,80000
04/26/2010,90000
05/01/2010,10000
05/07/2010,20000
05/08/2010,30000
05/26/2010,60000
06/01/2010,30000
06/17/2010,50000
07/01/2010,30000
07/07/2010,40000
08/08/2010,10000
08/26/2010,20000
09/01/2010,90000
09/07/2010,40000
09/18/2010,30000
09/30/2010,40000
10/01/2010,30000
10/07/2010,40000
10/08/2010,30000
11/26/2010,40000
11/01/2010,30000
12/07/2010,40000
12/08/2010,30000
12/26/2010,40000
[training@localhost ~]$

hive> create table sales(dt string,
>  amt int)
>  row format delimited
>   fields terminated by ',';
OK
Time taken: 0.046 seconds
hive> load data local inpath 'trans'
>  into table sales;
Copying data from file:/home/training/trans
Copying file: file:/home/training/trans
Loading data to table cases.sales
OK
Time taken: 0.194 seconds
hive>

-- sales2 (same schema as sales) must exist before the insert:
hive> create table sales2 like sales;

insert overwrite table sales2
select * from (
select dt, amt from sales
union all
select concat(substr(dt,1,9),"1") as dt,
amt+1000 as amt from sales
union all
select concat(substr(dt,1,9),"2") as dt,
amt+3000 as amt from sales
union all
select concat(substr(dt,1,9),"3") as dt,
amt+5000 as amt from sales
union all
select concat(substr(dt,1,9),"4") as dt,
amt+10000 as amt from sales ) x;
hive> create table sales02(dt array<string>,
>     amt int);
OK
Time taken: 0.042 seconds
hive> insert overwrite table sales02
>   select split(dt,'/'), amt from sales2;

hive> select * from sales02 limit 5;
OK
[“01″,”01″,”2010”]      30000
[“01″,”01″,”2013”]      35000
[“01″,”01″,”2011”]      31000
[“01″,”01″,”2014”]      40000
[“01″,”01″,”2012”]      33000
Time taken: 0.083 seconds
hive>
hive> create table sales3 like sales;
hive> insert overwrite table sales3
>   select concat(dt[2],'-',dt[0],'-',dt[1]),
>  amt from sales02;

hive> create table salespart(dt string,
>    amt int)
> partitioned by (y int, m int, d int)
> row format delimited
>   fields terminated by ',';
OK
Time taken: 0.034 seconds
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=10000;
hive> set hive.exec.max.dynamic.partitions=100000;

hive> insert overwrite table salespart
>   partition(y,m,d)
>   select dt, amt, year(dt), month(dt),
>   day(dt) from sales3;

-- to access a particular day

hive> select * from salespart
where y=2010 and m=1 and d=8;   --> 1 partition

-- to access a particular month.
hive> select * from salespart
where y=2010 and m=1;   --> 31 partitions

-- to access a year.

hive> select * from salespart
where y=2014;   --> 366 partitions

--> range of dates.
hive> select * from salespart
where (y=2002 and m>=5)
or
(y>2002 and y<2009)
or
(y=2009 and m<=6);
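To verify what the dynamic insert produced, list the partitions and look at the directory layout in HDFS (the warehouse path below is a sketch; use the database this lab's salespart actually lives in):

hive> show partitions salespart;

[training@localhost ~]$ hadoop fs -ls /user/hive/warehouse/cases.db/salespart/y=2010/m=1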

 

 

Hive Lab8 : Xml data processing 3

Hive xml3
______________________

xml tags(nodes) with parameters.
__________________________

[training@localhost ~]$ cat xml6
<tr><cid>101</cid><pr pid=”p1″>1000</pr><pr pid=”p3″>3000</pr></tr>
<tr><cid>102</cid><pr pid=”p1″>1200</pr><pr pid=”p2″>2000</pr></tr>
<tr><cid>103</cid><pr pid=”p2″>5000</pr><pr pid=”p3″>3000</pr></tr>
[training@localhost ~]$

[training@localhost ~]$ cat customers
<rec><cid>101</cid><name>Mani</name><city>Hyd</city></rec>
<rec><cid>102</cid><name>Vani</name><city>Pune</city></rec>
<rec><cid>103</cid><name>Veni</name><city>Hyd</city></rec>
[training@localhost ~]$

[training@localhost ~]$ cat products
<rec><pid>p1</pid><brand>Samsung</brand><pname>Mobile</pname><mrp>5000</mrp></rec>
<rec><pid>p2</pid><brand>Lg</brand><pname>Mobile</pname><mrp>7000</mrp></rec>
<rec><pid>p3</pid><brand>Lg</brand><pname>Laptop</pname><mrp>6000</mrp></rec>
[training@localhost ~]$

hive> create database xml10;
OK
Time taken: 0.034 seconds
hive> use xml10;
OK
Time taken: 0.027 seconds
hive> create table transraw(line string);
OK
Time taken: 0.057 seconds
hive> create table custraw(line string);
OK
Time taken: 0.034 seconds
hive> create table prodraw(line string);
OK
Time taken: 0.038 seconds
hive>

hive> load data local inpath ‘customers’ into table custraw;
Copying data from file:/home/training/customers
Copying file: file:/home/training/customers
Loading data to table xml10.custraw
OK
Time taken: 0.155 seconds
hive> select * from custraw;
OK
<rec><cid>101</cid><name>Mani</name><city>Hyd</city></rec>
<rec><cid>102</cid><name>Vani</name><city>Pune</city></rec>
<rec><cid>103</cid><name>Veni</name><city>Hyd</city></rec>
Time taken: 0.065 seconds
hive>

hive> create table customers(cid string,
>   name string, city string);
OK
Time taken: 0.043 seconds
hive> insert overwrite table customers
>   select xpath_string(line,’rec/cid’),
>    xpath_string(line,’rec/name’),
>    xpath_string(line,’rec/city’)
> from custraw;

hive> select * from customers;
OK
101     Mani    Hyd
102     Vani    Pune
103     Veni    Hyd
Time taken: 0.059 seconds
hive>

hive> load data local inpath ‘products’
>  into table prodraw;
Copying data from file:/home/training/products
Copying file: file:/home/training/products
Loading data to table xml10.prodraw
OK
Time taken: 0.109 seconds
hive> select * from prodraw;
OK
<rec><pid>p1</pid><brand>Samsung</brand><pname>Mobile</pname><mrp>5000</mrp></rec>
<rec><pid>p2</pid><brand>Lg</brand><pname>Mobile</pname><mrp>7000</mrp></rec>
<rec><pid>p3</pid><brand>Lg</brand><pname>Laptop</pname><mrp>6000</mrp></rec>
Time taken: 0.057 seconds
hive>
hive> create table products(pid string,
>  pname string, brand string,
>   mrp int);
OK
Time taken: 0.036 seconds
hive> insert overwrite table products
>   select xpath_string(line,’rec/pid’),
>    xpath_string(line,’rec/pname’),
>    xpath_string(line,’rec/brand’),
>    xpath_int(line,’rec/mrp’)
> from prodraw;

hive> select * from products;
OK
p1      Mobile  Samsung 5000
p2      Mobile  Lg      7000
p3      Laptop  Lg      6000
Time taken: 0.06 seconds
hive>

hive> select * from transraw;
OK
Time taken: 0.052 seconds
hive> load data local inpath ‘xml6′
>  into table transraw;
Copying data from file:/home/training/xml6
Copying file: file:/home/training/xml6
Loading data to table xml10.transraw
OK
Time taken: 0.094 seconds
hive> select * from transraw;
OK
<tr><cid>101</cid><pr pid=”p1″>1000</pr><pr pid=”p3″>3000</pr></tr>
<tr><cid>102</cid><pr pid=”p1″>1200</pr><pr pid=”p2″>2000</pr></tr>
<tr><cid>103</cid><pr pid=”p2″>5000</pr><pr pid=”p3″>3000</pr></tr>
Time taken: 0.056 seconds
hive>

hive> create table transraw2(cid string,
>   pid array<string>,
>  pr array<string>);
OK
Time taken: 0.034 seconds
hive> insert overwrite table transraw2
>   select xpath_string(line,’tr/cid’),
>  xpath(line,’tr/pr/@pid’),
>  xpath(line,’tr/pr/text()’) from transraw;

hive> select * from transraw2;
OK
101     [“p1″,”p3”]     [“1000″,”3000”]
102     [“p1″,”p2”]     [“1200″,”2000”]
103     [“p2″,”p3”]     [“5000″,”3000”]
Time taken: 0.058 seconds
hive> create table cidpid(cid string, pid string);
OK
Time taken: 0.041 seconds
hive> insert overwrite table cidpid
>   select cid, mypid from transraw2
>    lateral view explode(pid) p as mypid;

hive> select * from cidpid;
OK
101     p1
101     p3
102     p1
102     p2
103     p2
103     p3
Time taken: 0.054 seconds
hive> create table cidpr(cid string, pr int);
OK
Time taken: 0.072 seconds
hive> insert overwrite table cidpr
>   select cid, mypy from transraw2
>     lateral view explode(pr) p as  mypy;

hive> select * from cidpr;
OK
101     1000
101     3000
102     1200
102     2000
103     5000
103     3000
Time taken: 0.069 seconds
hive> alter table cidpid add columns(n int);
OK
Time taken: 0.056 seconds
hive>

add jar Desktop/hivejars.jar;
create temporary function auto
as ‘hive.analytics.SeqNumber’;

hive> insert overwrite table cidpid
>   select cid, pid, auto() from cidpid;

hive> select * from cidpid;
OK
101     p1      1
101     p3      2
102     p1      3
102     p2      4
103     p2      5
103     p3      6
Time taken: 0.066 seconds
hive>

hive> alter table cidpr add columns(n int);
OK
Time taken: 0.057 seconds
hive> insert overwrite table cidpr
>   select cid, pr, auto() from cidpr;

hive> select * from cidpr;
OK
101     1000    1
101     3000    2
102     1200    3
102     2000    4
103     5000    5
103     3000    6
Time taken: 0.061 seconds
hive>

hive> create table trans(cid string, pid string,
>  pr int);
OK
Time taken: 0.031 seconds
hive> insert overwrite table trans
>   select l.cid, pid, pr
>   from
>  cidpid l join cidpr r
>  on (l.n=r.n);

hive> select * from trans;
OK
101     p1      1000
101     p3      3000
102     p1      1200
102     p2      2000
103     p2      5000
103     p3      3000
Time taken: 0.063 seconds
hive>

hive> select * from products;
OK
p1      Mobile  Samsung 5000
p2      Mobile  Lg      7000
p3      Laptop  Lg      6000
Time taken: 0.052 seconds
hive> create table  info1(cid string,
>   pid string, pname string,
>   brand string, mrp int, pr int);
OK
Time taken: 0.075 seconds
hive> insert overwrite table info1
>    select cid, l.pid, pname, brand,
>    mrp , pr
>   from trans l join products r
>  on (l.pid = r.pid);
hive> select * from info1;
OK
101     p1      Mobile  Samsung 5000    1000
102     p1      Mobile  Samsung 5000    1200
102     p2      Mobile  Lg      7000    2000
103     p2      Mobile  Lg      7000    5000
101     p3      Laptop  Lg      6000    3000
103     p3      Laptop  Lg      6000    3000
Time taken: 0.057 seconds
hive> select * from customers;
OK
101     Mani    Hyd
102     Vani    Pune
103     Veni    Hyd
Time taken: 0.059 seconds
hive>
hive> create table info(cid string,
>  name string, pid string,
>  pname string, brand string,
>  mrp int, pr int, city string)
> row format delimited
>   fields terminated by ‘,’;
OK
Time taken: 0.082 seconds
hive>

hive> insert overwrite table info
>   select l.cid, name, pid, pname, brand,
>  mrp , pr , city
> from info1 l join customers r
>  on (l.cid = r.cid);

hive> select * from info;
OK
101     Mani    p1      Mobile  Samsung 5000    1000      Hyd
101     Mani    p3      Laptop  Lg      6000    3000      Hyd
102     Vani    p1      Mobile  Samsung 5000    1200      Pune
102     Vani    p2      Mobile  Lg      7000    2000      Pune
103     Veni    p2      Mobile  Lg      7000    5000      Hyd
103     Veni    p3      Laptop  Lg      6000    3000      Hyd
Time taken: 0.053 seconds

 

 

Hive Lab9: Url Data Processing

$ cat  urls
http://yoursite.com/bigdata/mongodb?name=Raj&age=25&city=hyd
http://yoursite.com/bigdata/spark?name=Swapna&city=del&desig=se
http://yoursite.com/bigdata/spark?name=Satwik&age=26&city=hyd
http://yoursite.com/oops/java?name=Manoj&sex=m&age=35

A Url contains 3 parts of information.

i) Host name:

yoursite.com

ii) Path:

ex:    bigdata/mongodb

iii) Query string:

name=Manoj&sex=m&age=35

Hive has predefined parsers
to extract the basic entities of a url.

i) parse_url() —> udf
ii) parse_url_tuple() —> udtf.

hive> create database urls;
OK
Time taken: 1.727 seconds
hive> use urls;
OK
Time taken: 0.019 seconds
hive> create table raw(line string);
OK
Time taken: 0.349 seconds
hive> load data local inpath 'urls'
>  into table raw;
Copying data from file:/home/training/urls
Copying file: file:/home/training/urls
Loading data to table urls.raw
OK
Time taken: 0.2 seconds
hive>

hive> select * from raw;
OK
http://yoursite.com/bigdata/mongodb?name=Raj&age=25&city=hyd
http://yoursite.com/bigdata/spark?name=Swapna&city=del&desig=se
http://yoursite.com/bigdata/spark?name=Satwik&age=26&city=hyd
http://yoursite.com/oops/java?name=Manoj&sex=m&age=35
Time taken: 0.064 seconds
hive>

hive> select
>     parse_url(line,’HOST’),
>     parse_url(line,’PATH’),
>     parse_url(line,’QUERY’)
> from raw;

hive> create table raw2(host string,
>  path string,
>  query string);
OK
Time taken: 0.039 seconds
hive> insert overwrite table raw2
>   select x.* from raw
>  lateral view
> parse_url_tuple(line,'HOST','PATH','QUERY')
>  x as h,p,q;

hive> select * from raw2;
OK
yoursite.com    /bigdata/mongodb        name=Raj&age=25&city=hyd
yoursite.com    /bigdata/spark  name=Swapna&city=del&desig=se
yoursite.com    /bigdata/spark  name=Satwik&age=26&city=hyd
yoursite.com    /oops/java      name=Manoj&sex=m&age=35
Time taken: 0.063 seconds
hive>

hive> create table raw3(host string,
>  path array<string>,
>  qmap map<string,string>);

hive> insert overwrite table raw3
>  select host,
>   split(path,’/’),
>  str_to_map(query,’&’,’=’)
> from raw2;

hive> select * from raw3;
OK
yoursite.com    ["","bigdata","mongodb"]        {"age":"25","name":"Raj","city":"hyd"}
yoursite.com    ["","bigdata","spark"]  {"desig":"se","name":"Swapna","city":"del"}
yoursite.com    ["","bigdata","spark"]  {"age":"26","name":"Satwik","city":"hyd"}
yoursite.com    ["","oops","java"]      {"sex":"m","age":"35","name":"Manoj"}
Time taken: 0.064 seconds
hive>

hive> create table info(host string,
>  category string,  course string,
>  name string, age int,
>  sex string, city string, desig string)
> row format delimited
>   fields terminated by ',';
OK
Time taken: 0.038 seconds
hive> insert overwrite table info
>   select host, path[1], path[2],
>  qmap['name'], qmap['age'],
>  qmap['sex'], qmap['city'],
>  qmap['desig'] from  raw3;

hive> select * from info;
OK
yoursite.com    bigdata mongodb Raj     25      NULL    hyd     NULL
yoursite.com    bigdata spark   Swapna  NULL    NULL    del     se
yoursite.com    bigdata spark   Satwik  26      NULL    hyd     NULL
yoursite.com    oops    java    Manoj   35      m       NULL    NULL
Time taken: 0.055 seconds

___________________________________

to extract all keys / values of a map collection:

hive> select map_keys(qmap) from raw3;

["age","name","city"]
["desig","name","city"]
["age","name","city"]
["sex","age","name"]

hive> select map_values(qmap) from raw3;

["25","Raj","hyd"]
["se","Swapna","del"]
["26","Satwik","hyd"]
["m","35","Manoj"]

____________________________

hive> create table raw4(host string,
>   k   array<string>,
>  v array<string>);
OK
Time taken: 0.037 seconds
hive> insert overwrite table raw4
>   select host,
>     map_keys(qmap), map_values(qmap)
>   from raw3;

hive> select * from raw4;
OK
yoursite.com    ["age","name","city"]   ["25","Raj","hyd"]
yoursite.com    ["desig","name","city"] ["se","Swapna","del"]
yoursite.com    ["age","name","city"]   ["26","Satwik","hyd"]
yoursite.com    ["sex","age","name"]    ["m","35","Manoj"]
Time taken: 0.061 seconds
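Instead of map_keys()/map_values(), the map itself can be exploded into key/value rows with a lateral view (a sketch; assumes a Hive version that supports exploding maps):

hive> select host, k, v from raw3
    >   lateral view explode(qmap) q as k, v;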

 

 

Hive Lab10 : json data processing using hive

Json Data Processing Using Hive.

[training@localhost ~]$ cat json1
{“name”:”Sankar”,”age”:25,”sex”:”m”}
{“name”:”Sankari”,”age”:24,”sex”:”f”,”city”:”hyd”}
[training@localhost ~]$

hive> create database jsonlab;
OK
Time taken: 1.694 seconds
hive> use jsonlab;
OK
Time taken: 0.018 seconds
hive> create table raw(line string);
OK
Time taken: 0.219 seconds
hive> load data local inpath ‘json1′ into table raw;
hive> select * from raw;
OK
{“name”:”Sankar”,”age”:25,”sex”:”m”}
{“name”:”Sankari”,”age”:24,”sex”:”f”,”city”:”hyd”}
Time taken: 0.146 seconds
hive>
hive> select get_json_object(line,’$.name’) from raw;
OK
Sankar
Sankari
hive> select
>   get_json_object(line,’$.name’) ,
>   get_json_object(line,’$.age’) ,
>   get_json_object(line,’$.sex’) ,
>   get_json_object(line,’$.city’)
> from raw;
OK
Sankar  25      m       NULL
Sankari 24      f       hyd

hive> create table info(name string, age int,
>   sex string, city string);
OK
Time taken: 0.04 seconds
hive> insert overwrite table info
>   select x.* from raw
>    lateral view json_tuple(line,'name','age','sex','city') x
>   as n,a,s,c;

hive> select * from info;

OK
Sankar  25      m       NULL
Sankari 24      f       hyd
Time taken: 0.063 seconds
hive>

[training@localhost ~]$ cat json2
{"name":"ravi","age":25,"wife":{"name":"banu","age":24},"city":"hyd"}
{"name":"kavi","age":25,"wife":{"name":"kavitha","age":21},"city":"del"}
[training@localhost ~]$

hive> create table jraw(line string);
OK
Time taken: 0.048 seconds
hive> load data local inpath ‘json2′ into table jraw;
Copying data from file:/home/training/json2
Copying file: file:/home/training/json2
Loading data to table jsonlab.jraw
OK
Time taken: 0.127 seconds
hive> select * from jraw;
OK
{"name":"ravi","age":25,"wife":{"name":"banu","age":24},"city":"hyd"}
{"name":"kavi","age":25,"wife":{"name":"kavitha","age":21},"city":"del"}
Time taken: 0.06 seconds
hive>
hive> create table raw2(name string, age int, wife string, city string);
OK
Time taken: 0.033 seconds
hive> insert overwrite table raw2
>   select x.* from jraw
>  lateral view
>  json_tuple(line,'name','age','wife','city') x as n,a,w,c;

hive> select * from raw2;
OK
ravi    25      {“age”:24,”name”:”banu”}        hyd
kavi    25      {“age”:21,”name”:”kavitha”}     del
Time taken: 0.061 seconds
hive> create table jinfo(hname string, wname string,
>    hage int, wage int, city string);

hive> insert overwrite table jinfo
>   select name, get_json_object(wife,’$.name’),
>   age , get_json_object(wife,’$.age’),
>  city from raw2;

hive> select * from jinfo;
OK
ravi    banu    25      24      hyd
kavi    kavitha 25      21      del
Time taken: 0.079 seconds

 

 

Hive Lab 11 : Json Array Processing

[training@localhost ~]$ cat json3
{“name”:”Ravi”,”qual”:[“btech”,”mtech”]}
{“name”:”avani”,”qual”:[“btech”,”mtech”,”mba”]}
{“name”:”avinash”,”qual”:[“btech”,”mba”]}
[training@localhost ~]$
hive> create database jsons;
OK
Time taken: 1.747 seconds
hive> use jsons;
OK
Time taken: 0.032 seconds
hive> create table raw(line string);
OK
Time taken: 0.382 seconds
hive>

hive>  load data local inpath ‘json3′
>  into table raw;

hive> select * from raw;
OK
{“name”:”Ravi”,”qual”:[“btech”,”mtech”]}
{“name”:”avani”,”qual”:[“btech”,”mtech”,”mba”]}
{“name”:”avinash”,”qual”:[“btech”,”mba”]}
Time taken: 0.233 seconds
hive>

hive> create table raw2(name string,
>   qual  string);
OK
Time taken: 0.047 seconds
hive> insert overwrite table raw2
>   select x.* from raw
> lateral view
>  json_tuple(line,’name’,’qual’) x
> as n,q;

hive> select * from raw2;
OK
Ravi    [“btech”,”mtech”]
avani   [“btech”,”mtech”,”mba”]
avinash [“btech”,”mba”]
Time taken: 0.064 seconds
hive> create table raw3(name string,
>   qual  array<string>);
OK
Time taken: 0.037 seconds
hive> insert overwrite table raw3
>   select name, split(qual,’,’) from raw2;

hive> select * from raw3;
OK
Ravi    [“[\”btech\””,”\”mtech\”]”]
avani   [“[\”btech\””,”\”mtech\””,”\”mba\”]”]
avinash [“[\”btech\””,”\”mba\”]”]
Time taken: 0.063 seconds
hive> create table raw4(name string, qual string);
OK

hive> insert overwrite table raw4
>   select name, myq from raw3
> lateral view explode(qual) q as myq;

> create table info(name string,
>    qual string);
OK
Time taken: 0.039 seconds
hive> insert overwrite table info
>   select name, split(qual,'”‘)[1]
>  from raw4;

hive> select * from info;
OK
Ravi    btech
Ravi    mtech
avani   btech
avani   mtech
avani   mba
avinash btech
avinash mba
Time taken: 0.065 seconds

 

 

Hive Lab 12 : Loading Array, Struct, Map collection Items

[training@localhost ~]$ cat profile1
101,Venu,btech#mtech,25,hyd
102,Veni,bsc#msc#mtech,26,pune
[training@localhost ~]$

hive> create database pract;
OK
Time taken: 1.798 seconds
hive> use pract;
OK
Time taken: 0.027 seconds
hive> create table profile1(
>   id int, name string,
>     qual  array<string>, age int,
>     city string)
> row format delimited
>     fields terminated by ‘,’
>     collection items terminated by ‘#’;

hive> load data local inpath ‘profile1’
>   into table profile1;
Copying data from file:/home/training/profile1
Copying file: file:/home/training/profile1
Loading data to table pract.profile1
OK
Time taken: 0.208 seconds
hive> select * from profile1;
OK
101     Venu    [“btech”,”mtech”]       25      hyd
102     Veni    [“bsc”,”msc”,”mtech”]   26      pune
Time taken: 0.199 seconds
hive>

__________________
[training@localhost ~]$ cat profile2
101,Venu,Vani#25#btech,hyd
102,Varun,Varuna#24#mba,pune
[training@localhost ~]$

hive> create table profile2(
>  id int, name string,
>   wife struct<name:string,age:int,qual:string>, city string)
>   row format delimited
>     fields terminated by ‘,’
>     collection items terminated by ‘#’;
OK
Time taken: 0.044 seconds
hive> load data local inpath ‘profile2’
>   into table profile2;
Copying data from file:/home/training/profile2
Copying file: file:/home/training/profile2
Loading data to table pract.profile2
OK
Time taken: 0.1 seconds
hive>
hive> select * from profile2;
OK
101     Venu    {“name”:”Vani”,”age”:25,”qual”:”btech”}   hyd
102     Varun   {“name”:”Varuna”,”age”:24,”qual”:”mba”}   pune
Time taken: 0.066 seconds
hive> select name, wife.name from profile2;

Venu    Vani
Varun   Varuna

hive> create table info(id int, name string,
>   wname string, wage int, wqual string,
>   city string)
> ;
OK
Time taken: 0.039 seconds
hive> insert overwrite table info
>   select id, name, wife.name, wife.age,
>   wife.qual , city  from profile2;

hive> select * from info;
OK
101     Venu    Vani    25      btech   hyd
102     Varun   Varuna  24      mba     pune
Time taken: 0.066 seconds
hive>

____________________________________
[training@localhost ~]$ cat profile3
101,p1#1000$p3#2000$p7#4000,hyd
102,p1#1200$p2#5000,del
[training@localhost ~]$

hive> create table  trans(cid int,
>   plist map<string,int>,
>  city string)
> row format delimited
>    fields terminated by ‘,’
>    collection items terminated by ‘$’
>   map keys terminated by ‘#’;
OK
Time taken: 0.048 seconds
hive> load data local inpath ‘profile3’
>   into table trans;
Copying data from file:/home/training/profile3
Copying file: file:/home/training/profile3
Loading data to table pract.trans
OK
Time taken: 0.103 seconds

hive> select cid, plist[‘p1’] , city from trans;

hive>
101     1000    hyd
102     1200    del

hive> create table sales(cid int,
>    prid  array<string>,
>   pr array<int>);
OK
Time taken: 0.05 seconds
hive> insert overwrite table sales
>   select cid, map_keys(plist),
>    map_values(plist)  from trans;

hive> select * from sales;
OK
101     [“p1″,”p3″,”p7”]        [1000,2000,4000]
102     [“p1″,”p2”]     [1200,5000]
Time taken: 0.056 seconds
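Accessing a key that is not present in the map returns NULL, which can be defaulted with if(); for example:

hive> select cid, if(plist['p2'] is null, 0, plist['p2']), city
    >  from trans;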

 

 

Hive Lab 13 : Eliminating Duplicates and Unions (merging)

[training@localhost ~]$ cat dupes
101,aaa,10000
101,bbb,20000
101,aaa,10000
101,aaa,10000
101,aaa,10000
102,bbb,40000
103,cccc,50000
102,bbb,40000
102,bbb,40000
[training@localhost ~]$

hive> create database hdp;
OK
Time taken: 1.702 seconds
hive> use hdp;
OK
Time taken: 0.018 seconds
hive> create table info(id int, name string,
>   sal int)
>  row format delimited fields terminated
>  by ‘,’;
OK
Time taken: 0.439 seconds
hive>

hive> load data local inpath ‘dupes’
>  into table info;
Copying data from file:/home/training/dupes
Copying file: file:/home/training/dupes
Loading data to table hdp.info
OK
Time taken: 0.209 seconds
hive> select * from info;
OK
101     aaa     10000
101     bbb     20000
101     aaa     10000
101     aaa     10000
101     aaa     10000
102     bbb     40000
103     cccc    50000
102     bbb     40000
102     bbb     40000
Time taken: 0.201 seconds
hive>
hive> select distinct(id),name,sal
>   from info;

101     aaa     10000
101     bbb     20000
102     bbb     40000
103     cccc    50000

way2)
hive> select id,name,sal
>   from info
>  group by id,name,sal;

101     aaa     10000
101     bbb     20000
102     bbb     40000
103     cccc    50000

Way 3)
if the table has many columns:

hive> create table dummy(line string);
hive> load data inpath
‘/user/hive/warehouse/hdp.db/info/dupes’
into table dummy;
hive> insert overwrite table dummy
select line from dummy group by line;

hive> load data inpath
‘/user/hive/warehouse/hdp.db/dummy/000000_0’
into table info;

hive> drop table dummy;
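An alternative that skips the dummy table, assuming your Hive version accepts DISTINCT with *:

hive> insert overwrite table info
>   select distinct * from info;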

________________________________________

hive> select * from emp;
OK
101     vino    26000   m       11
102     Sri     25000   f       11
103     mohan   13000   m       13
104     lokitha 8000    f       12
105     naga    6000    m       13
101     janaki  10000   f       12
201     aaa     30000   m       12
202     bbbb    50000   f       13
203     ccc     10000   f       13
204     ddddd   50000   m       13
304     xxx     70000   m       14
305     yyy     80000   f       15
Time taken: 0.053 seconds
hive>

hive> create table todayemp(id int, name string,
>  sal int, sex string, dno int)
> row format delimited
>   fields terminated by ‘,’;
OK
Time taken: 0.031 seconds
hive> load data local inpath ’emp2′
>  into table todayemp;
Copying data from file:/home/training/emp2
Copying file: file:/home/training/emp2
Loading data to table hdp.todayemp
OK
Time taken: 0.095 seconds
hive> select * from todayemp;
OK
401     aaa     50000   m       11
402     bbbbbb  60000   f       12
403     cc      90000   m       11
Time taken: 0.051 seconds
hive>

Appending one table's data to another table's data:

way1)
insert overwrite table emp
select * from (
select * from emp
union all
select * from todayemp) e;

-- if the first table has 1 lakh (100,000) rows and the second has only 10, then just to append those 10 rows Hive scans 100,000+10 rows.  [ bad ]

way2).

if the delimiters of both tables are the same, simply copy the data file in HDFS:

hadoop fs -cp /user/hive/warehouse/hdp.db/todayemp/emp2   /user/hive/warehouse/hdp.db/emp

_______________________________________

union all:

if the column order of the two tables is different:

tab1 –> id name  sal   sex dno
tab2  —> id name dno sex sal

select * from (
select id, name,sal,sex,dno from tab1
union all
select id, name,sal,sex,dno from tab2 ) t;

________________

if tables have different columns.

tab1 –> id, name,  sal, sex , dno , desig

tab2 –> id, name, income, gender, city

select * from (
select id, name, sal, sex, dno, desig, ‘neevuru’ as city from  tab1
union all
select id,name,income as sal, gender as sex,
0 as dno , ‘nopani’ as desig, city
from tab2 ) t;

 

 

Hive Lab 14: Joins and Cartesian Product

Joins
______________

used to collect data from two or more tables.

joins are of two types:

i) Inner Joins
ii) Outer Joins

Inner Joins:
only the rows matching the given join condition.

ex:
tab1 —> name,city
___________
A,hyd
B,del
c,hyd
d,del
e,pune
f,mumbai

tab2 —> city, dept
_________
hyd,production
del,hr
pune,accounts
chennai,admin
_______________

inner join –> based on city.

a,hyd,hyd,production
B,del,del,hr
c,hyd,hyd,production
d,del,del,hr
e,pune,pune,accounts

outer join –>
matching records and non-matching records.

outer joins are 3 types.
i) left outer join.
ii) right outer join
iii) full outer join.

left outer join:

matches + non-matches of the left side.
( complete presence of the left side table).

tabx:–> id, dno
____________
101,11
102,14
___________

taby —> dno,loc
_____________
11,hyd
13,del
_____________–

tabx left outer join taby --> based on dno.

101,11,11,hyd
102,14,null,null

right outer join –>
matchings and non matchings of right side table.

(complete presence from right side table ).

tabx right outer join taby –>based on dno.

101,11,11,hyd
null,null,13,del

full outer join —>
matches, plus non-matches of the left side,
plus non-matches of the right side table.

tabx full outer join taby —> based on dno.

101,11,11,hyd
102,14,null,null
null,null,13,del

___________________________________

[training@localhost ~]$ cat emp
101,vino,26000,m,11
102,Sri,25000,f,11
103,mohan,13000,m,13
104,lokitha,8000,f,12
105,naga,6000,m,13
101,janaki,10000,f,12
201,aaa,30000,m,12
202,bbbb,50000,f,13
203,ccc,10000,f,13
204,ddddd,50000,m,13
304,xxx,70000,m,14
305,yyy,80000,f,15
[training@localhost ~]$ cat dept
11,marketing,hyd
12,hr,del
13,finance,hyd
20,prod,hyd
21,admin,chennai
[training@localhost ~]$

hive> create database joinsdb;
OK
Time taken: 1.761 seconds
hive> use joinsdb;
OK
Time taken: 0.027 seconds
hive> create table emp(id int, name string,
>   sal int, sex string, dno int)
> row format delimited
>   fields terminated by ‘,’;
OK
Time taken: 0.297 seconds
hive> load data local inpath ’emp’ into table emp;
Copying data from file:/home/training/emp
Copying file: file:/home/training/emp
Loading data to table joinsdb.emp
OK
Time taken: 0.207 seconds
hive> select * from emp;
OK
101     vino    26000   m       11
102     Sri     25000   f       11
103     mohan   13000   m       13
104     lokitha 8000    f       12
105     naga    6000    m       13
101     janaki  10000   f       12
201     aaa     30000   m       12
202     bbbb    50000   f       13
203     ccc     10000   f       13
204     ddddd   50000   m       13
304     xxx     70000   m       14
305     yyy     80000   f       15
Time taken: 0.194 seconds
hive>

-- dept must be created first (schema inferred from the file and the later queries):
hive> create table dept(dno int, dname string,
>   loc string)
> row format delimited
>   fields terminated by ',';
hive> load data local inpath 'dept'
>  into table dept;
Copying data from file:/home/training/dept
Copying file: file:/home/training/dept
Loading data to table joinsdb.dept
OK
Time taken: 0.12 seconds
hive> select * from dept;
OK
11      marketing       hyd
12      hr      del
13      finance hyd
20      prod    hyd
21      admin   chennai
Time taken: 0.068 seconds
hive>

Inner Join:

hive> select id,name,sal,sex,l.dno,r.dno,
>    dname,loc
>  from  emp l join dept r
>  on (l.dno=r.dno);

101     vino    26000   m       11      11      marketing hyd
102     Sri     25000   f       11      11      marketing hyd
104     lokitha 8000    f       12      12      hr      del
101     janaki  10000   f       12      12      hr      del
201     aaa     30000   m       12      12      hr      del
103     mohan   13000   m       13      13      finance   hyd
105     naga    6000    m       13      13      finance   hyd
202     bbbb    50000   f       13      13      finance   hyd
203     ccc     10000   f       13      13      finance   hyd
204     ddddd   50000   m       13      13      finance   hyd
Time taken: 14.11 seconds
hive>

Left outer Join :

hive> select id,name,sal,sex,l.dno,r.dno,
>    dname,loc
>  from  emp l left outer join dept r
>  on (l.dno=r.dno);

101     vino    26000   m       11      11      marketing hyd
102     Sri     25000   f       11      11      marketing hyd
104     lokitha 8000    f       12      12      hr      del
101     janaki  10000   f       12      12      hr      del
201     aaa     30000   m       12      12      hr      del
103     mohan   13000   m       13      13      finance   hyd
105     naga    6000    m       13      13      finance   hyd
202     bbbb    50000   f       13      13      finance   hyd
203     ccc     10000   f       13      13      finance   hyd
204     ddddd   50000   m       13      13      finance   hyd
304     xxx     70000   m       14      NULL    NULL      NULL
305     yyy     80000   f       15      NULL    NULL      NULL
Time taken: 12.786 seconds
hive>

Right outer Join:

hive> select id,name,sal,sex,l.dno,r.dno,
>    dname,loc
>  from  emp l right outer join dept r
>  on (l.dno=r.dno);

101     vino    26000   m       11      11      marketing hyd
102     Sri     25000   f       11      11      marketing hyd
104     lokitha 8000    f       12      12      hr      del
101     janaki  10000   f       12      12      hr      del
201     aaa     30000   m       12      12      hr      del
103     mohan   13000   m       13      13      finance   hyd
105     naga    6000    m       13      13      finance   hyd
202     bbbb    50000   f       13      13      finance   hyd
203     ccc     10000   f       13      13      finance   hyd
204     ddddd   50000   m       13      13      finance   hyd
NULL    NULL    NULL    NULL    NULL    20      prod      hyd
NULL    NULL    NULL    NULL    NULL    21      admin     chennai
Time taken: 12.429 seconds
hive>

Full outer Join;

hive> select id,name,sal,sex,l.dno,r.dno,
>    dname,loc
>  from  emp l full outer join dept r
>  on (l.dno=r.dno);

Denormalizing:

hive> create table info(id int, name string,
>   sal int, sex string , dname string,
>   loc string);
OK
Time taken: 0.04 seconds
hive> insert overwrite table info
>   select id, name, sal, sex, dname, loc
>  from emp l full outer join dept r
>  on (l.dno=r.dno);

hive> select * from info;
OK
101     vino    26000   m       marketing       hyd
102     Sri     25000   f       marketing       hyd
104     lokitha 8000    f       hr      del
101     janaki  10000   f       hr      del
201     aaa     30000   m       hr      del
103     mohan   13000   m       finance hyd
105     naga    6000    m       finance hyd
202     bbbb    50000   f       finance hyd
203     ccc     10000   f       finance hyd
204     ddddd   50000   m       finance hyd
304     xxx     70000   m       NULL    NULL
305     yyy     80000   f       NULL    NULL
NULL    NULL    NULL    NULL    prod    hyd
NULL    NULL    NULL    NULL    admin   chennai
Time taken: 0.071 seconds
hive>
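
Once the data is denormalized like this, department-level reports no longer need a join. A minimal sketch against the info table above:

select dname, sum(sal), count(*)
 from info
 group by dname;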

_________________

Task:

hive> create table projects(dno1 int, dno2 int,
>   sal int);
OK
Time taken: 0.072 seconds
hive> insert overwrite table projects
>   select l.dno, r.dno, sal from
>   emp l full outer join dept r
>   on (l.dno=r.dno);

hive> create table proj(stat string, sal int);
hive> insert overwrite table proj
>  select
> if(dno1 is not null and dno2 is not null,
> 'Working',
>  if(dno2 is null,'BenchTeam','BenchProj')),
>  sal from projects;
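
The nested if() above can also be written with a CASE expression, which some readers find easier to follow. A minimal sketch, assuming the same projects table:

insert overwrite table proj
 select
   case
     when dno1 is not null and dno2 is not null then 'Working'
     when dno2 is null then 'BenchTeam'
     else 'BenchProj'
   end,
   sal
 from projects;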

hive> select * from proj;
OK
Working 26000
Working 25000
Working 8000
Working 10000
Working 30000
Working 13000
Working 6000
Working 50000
Working 10000
Working 50000
BenchTeam       70000
BenchTeam       80000
BenchProj       NULL
BenchProj       NULL
Time taken: 0.087 seconds
hive>

hive> insert overwrite table proj
>  select stat, if(sal is null,0,sal)
>  from proj;

hive> create table summary(stat string,
>    totsal int, cnt int);
OK
Time taken: 0.042 seconds
hive> insert overwrite table summary
>   select stat, sum(sal), count(*)
>  from proj
>    group by stat;

hive> select * from summary;
OK
BenchProj       0       2
BenchTeam       150000  2
Working 228000  10
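
For reference, the same summary can be produced in a single statement straight from projects, without the intermediate proj table. A sketch, assuming (as above) that NULL salaries should count as 0; note that older Hive versions require repeating the expression in the GROUP BY rather than reusing the column alias:

select
  if(dno1 is not null and dno2 is not null, 'Working',
     if(dno2 is null, 'BenchTeam', 'BenchProj')) as stat,
  sum(if(sal is null, 0, sal)) as totsal,
  count(*) as cnt
 from projects
 group by
  if(dno1 is not null and dno2 is not null, 'Working',
     if(dno2 is null, 'BenchTeam', 'BenchProj'));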

___________________________

[training@localhost ~]$ cat trans
01/01/2010,30000
01/07/2010,40000
01/08/2010,30000
01/26/2010,40000
02/01/2010,30000
02/07/2010,40000
02/09/2010,30000
02/28/2010,40000
03/01/2010,80000
03/07/2010,40000
04/01/2010,30000
04/17/2010,70000
04/18/2010,80000
04/26/2010,90000
05/01/2010,10000
05/07/2010,20000
05/08/2010,30000
05/26/2010,60000
06/01/2010,30000
06/17/2010,50000
07/01/2010,30000
07/07/2010,40000
08/08/2010,10000
08/26/2010,20000
09/01/2010,90000
09/07/2010,40000
09/18/2010,30000
09/30/2010,40000
10/01/2010,30000
10/07/2010,40000
10/08/2010,30000
11/26/2010,40000
11/01/2010,30000
12/07/2010,40000
12/08/2010,30000
12/26/2010,40000
[training@localhost ~]$

hive> create table rawsales(dt string,
>  amt int)
row format delimited
fields terminated by ',';
OK
Time taken: 0.057 seconds
hive> load data local inpath 'trans'
>  into table rawsales;
Copying data from file:/home/training/trans
Copying file: file:/home/training/trans
Loading data to table joinsdb.rawsales
OK
Time taken: 0.514 seconds
hive>

create table raw2(dt array<string> , amt int);

insert overwrite table raw2
select split(dt,'/') , amt from rawsales;

hive> select * from raw2 limit 3;
OK
["01","01","2010"]      30000
["01","07","2010"]      40000
["01","08","2010"]      30000
Time taken: 0.061 seconds
hive> insert overwrite table sales
>   select
> concat(dt[2],'-',dt[0],'-',dt[1]), amt
>  from raw2;
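
The split-and-concat route works on any Hive version. Where unix_timestamp() and from_unixtime() are available, the same yyyy-MM-dd reformatting can be done in one step directly from rawsales, skipping the raw2 table. A sketch, assuming sales has a string date column and an int amount as used below:

insert overwrite table sales
 select from_unixtime(unix_timestamp(dt, 'MM/dd/yyyy'), 'yyyy-MM-dd'), amt
 from rawsales;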

hive> create table sales2(mon int, amt int);
OK
Time taken: 0.044 seconds
hive> insert overwrite table sales2
>   select month(dt), amt from sales;

hive> create table rep1(mon int, tot int);
OK
Time taken: 0.034 seconds
hive> insert overwrite table rep1
>   select mon, sum(amt) from sales2
>    group by mon;

hive> create table carts(m1 int, m2 int,
>   tot1 int, tot2 int);
OK
Time taken: 0.044 seconds
hive> insert overwrite table carts
>   select l.mon , r.mon , l.tot, r.tot
>  from rep1 l join rep1 r;

hive> insert overwrite table carts
>  select * from carts
>    where (m1-m2)=1;

hive> select * from carts;
OK
12      11      110000  70000
7       6       70000   80000
2       1       140000  140000
8       7       30000   70000
4       3       270000  120000
9       8       200000  30000
10      9       100000  200000
5       4       120000  270000
11      10      70000   100000
3       2       120000  140000
6       5       80000   120000

hive> alter table carts add columns(per int);
OK
Time taken: 0.067 seconds
hive> insert overwrite table carts
>   select m1,m2,tot1,tot2,
>     ((tot1-tot2)*100)/tot2 from carts;

hive> select * from carts;
OK
12      11      110000  70000   57
7       6       70000   80000   -12
2       1       140000  140000  0
8       7       30000   70000   -57
4       3       270000  120000  125
9       8       200000  30000   566
10      9       100000  200000  -50
5       4       120000  270000  -55
11      10      70000   100000  -30
3       2       120000  140000  -14
6       5       80000   120000  -33
Time taken: 0.061 seconds
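
Because per is declared int, the fractional part of the percentage is dropped when the result is stored (for example (110000-70000)*100/70000 = 57.14... is saved as 57); declaring per as double would keep the fraction. Also, on Hive 0.11 and later the same month-over-month report can be written with the lag() window function instead of the self-join on rep1. A sketch, assuming the rep1 table above:

select mon, tot, prev_tot,
       ((tot - prev_tot) * 100) / prev_tot as per
 from (select mon, tot,
              lag(tot) over (order by mon) as prev_tot
         from rep1) t;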

 

_________________________________________

History of Hadoop and Map Reduce

Hi Guys!

This post introduces Hadoop and Map Reduce. In our previous post, we discussed what Big Data is, what its types are, and the reasons to learn Hadoop.

Let's see how Hadoop came into existence.

History of Hadoop

  • Even though the word “Hadoop” may be new to you, it is already about 10 years old.
  • Like every technology, Hadoop has a history of its own.
  • The story begins on a sunny afternoon in 1997, when Doug Cutting started writing the first version of Lucene.

What is Lucene ?

  • Lucene is a text-search library designed by Doug Cutting. The library was used for faster searching of web pages.
  • After a few years he experienced “Dead Code Syndrome”, so, looking for a better home for the project, he open-sourced it on SourceForge.
  • In 2001 it became Apache Lucene, and the focus then shifted to indexing web pages.
  • Mike Cafarella, a graduate of the University of Washington, joined him to index the entire web.
  • This combined effort yielded a new Lucene sub-project called Apache Nutch.
  • An important algorithm used to rank web pages by their relative importance is called PageRank, after Larry Page, who came up with it.
  • It is a simple and brilliant algorithm, which basically counts how many links from other pages on the web point to a page. The page with the highest count is ranked highest (shown on top of search results). Of course, that is not the only method of determining page importance, but it is certainly the most relevant one.

Origin of HDFS

  • During this period, Cutting and Cafarella identified four requirements that the existing file systems could not meet:

[1] Schema less (no tables and columns)

[2] Durable (data once written should never be lost)

[3] Capability of handling component failure ( CPU, Memory, Network)

[4] Automatically re-balanced (disk space consumption)

Google’s Solution

  • In 2003, Google published the GFS paper. Cutting and Cafarella were astonished to see solutions to the very difficulties they were facing at the time.
  • Using the GFS paper as a blueprint, they implemented their own file system in Java, called NDFS (Nutch Distributed File System).
  • But the problems of durability and fault tolerance were still not fully solved.
  • So they divided the file system into 64 MB chunks and stored each chunk on three different nodes, with the replication factor set to 3 by default.

Time for Map Reduce

  • Now they needed a processing model for NDFS, because they wanted parallel processing, that is, running the computation on multiple nodes at the same time.
  • Thus, in 2004, Google published a paper called “MapReduce: Simplified Data Processing on Large Clusters”.
  • This algorithm solved problems such as

[1] Parallelization

[2] Distribution

[3] Fault-tolerance

Rise of Hadoop

  • In 2005, Cutting reported that Map Reduce had been integrated into Nutch.
  • In 2006, he pulled the distributed file system and Map Reduce code out of the Nutch codebase and named the new project Hadoop.
  • Hadoop included Hadoop Common (the core libraries), HDFS and Map Reduce.
  • Yahoo! was facing the same problems at scale, so it employed Cutting to help move its infrastructure to Hadoop, which served Yahoo! very well.

Facebook, Twitter, LinkedIn…

  • Later, companies like Facebook, Twitter and LinkedIn started using Hadoop.
  • In 2008, Hadoop was still a sub-project of Lucene, so Cutting made it a separate project under the Apache Software Foundation.
  • Other companies started noticing problems with their own file systems, began experimenting with Hadoop, and created sub-projects such as Hive, Pig, HBase and ZooKeeper.

This is all about how Hadoop and Map Reduce came into existence. People say Hadoop is a new technology but it is already 10 years old.

Introduction to Hadoop – HDFS and Map Reduce

Hi Friends,

In the last post, we went through the history of Hadoop. In this post we will look at what Hadoop is, what it consists of, and where it is used.

  • The Hadoop platform consists of two key services: a reliable, distributed file system called Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce.
  • Hadoop was created by Doug Cutting and named after his son’s toy elephant. Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM and Amazon.

Data Distribution

  • Hadoop distributes data across machines and processes it in parallel; the file system used for this is a distributed file system.

Advantages of Distributed File Systems are

[1] I/O Speed

[2] Less processing time

Imagine a single machine processing 1 TB of data. Within some time it will finish. But what if the data is larger, say 500 TB?


If it takes about 45 minutes to process 1 TB of data using a traditional database, how long will 500 TB take?


It will take a very long time on a single machine, so to overcome this problem we go with a DFS (Distributed File System).
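
As a rough worked example with the numbers above: at 45 minutes per terabyte on one machine, 500 TB would take about 500 x 45 = 22,500 minutes, roughly 15 to 16 days of continuous processing. Spreading the same 500 TB across 500 machines in a distributed file system brings the elapsed time back toward the original 45 minutes, ignoring coordination overhead.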

High Level Architecture

Hadoop Architecture mainly consists of HDFS and Map Reduce

[1] Hadoop Distributed File System

[2] Map Reduce

HDFS is used for storage and Map Reduce is used for processing the large data sets.

Hadoop Distributed File System

Hadoop follows a master-slave architecture and has three master daemons and two slave daemons.

HDFS Architecture

  • Master daemon:

[1] Name Node

[2] Secondary Name node

[3] Job tracker

  • Slave daemon:

[1] Data Node

[2] Task tracker

  • Why HDFS?

[1] Highly fault tolerant (Power failures)

[2] High throughput (Reduce the processing time)

[3] Suitable for applications with large datasets

[4] Streaming access to file system data (write once, use many times)

[5] Can be built out of commodity hardware

  • Design of HDFS

File system designed for storing very large files, with streaming data access pattern, running clusters on commodity hardware.

  • Where HDFS is not used?

[1] Low latency data access

[2] Lots of small files

  • Daemon:

[1] It is a service, process running in the background.

[2] It is a term used in UNIX technology

  • Name node:

[1] Masters the system

[2] Maintains and manages the blocks which are present on data nodes

  • Data node:

[1] Slaves which are deployed on each machine and provide actual storage

[2] Responsible for sending read/write requests for the clients


  • Secondary Name node:

[1] The name node keeps the file-system metadata (the namespace) in RAM

[2] The secondary name node periodically checkpoints that metadata to the local file system, e.g. HDD

  • Edit log:

[1] The edit log records every change made to the file-system metadata; the secondary name node merges it with the fsimage checkpoint so that the name node can recover quickly

Map Reduce

Map Reduce is the programming model used in the Hadoop framework for processing large datasets.

In order to understand Map Reduce and how it works, we need to look at a few terms used in processing the data.

[1] Data

[2] Input Splitter

[3] Record Reader

[4] Mapper

[5] Intermediate Generator

[6] Reducer

[7] Record Writer

[8] HDFS

  • So far we have seen the architecture of HDFS; now we will discuss the terms used in the Map Reduce algorithm.
  • As we are processing large datasets, the first term used in this process is DATA: the input data we want to process.
  • Once we have the data in our file system, i.e. HDFS, it is divided/split into blocks.
  • These blocks are also called chunks. The default block size is 64 MB, but it can be raised to 128 MB to make processing faster.
  • This means the input data is split into blocks of the size we want, by the Input Splitter.
  • Next, the Record Reader maps the data present in each block to a mapper. Remember that the number of mappers equals the number of blocks: one mapper is created per block or chunk, and this wiring is done by the Record Reader.
  • The Mapper holds and processes the data present in its block. In implementation terms it is a class (specifically a Java class) in which the per-record processing logic is written.
  • The Intermediate Generator collects all the output from the mappers and sends it to the reducer for processing. Processing may include retrieving, inserting, deleting or any other kind of function or calculation.
  • When the mapper output contains duplicate or repeated keys, this intermediate stage performs the sort and shuffle: records with the same key are grouped together and then sent to the Reducer.
  • The Reducer is also a Java class which processes the data based on the code we write in it. Once the data is processed it is written back to the file system, i.e. HDFS, using the Record Writer. (A small word-count sketch after this list shows how these phases line up in practice.)
  • So, what is HDFS?
  • HDFS is the Hadoop Distributed File System discussed above, and it just stores the data; the processing is done by Map Reduce.
  • Finally, we can say that HDFS and Map Reduce together make up Hadoop, for storing and processing data.
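
As a concrete illustration of the map and reduce phases, here is the classic word-count expressed in HiveQL rather than as a hand-written Java job (Hive compiles such queries down to MapReduce). This is only a sketch and assumes a hypothetical table docs(line string) holding one line of text per row: splitting each line into words is the map step, grouping identical words is the shuffle, and counting them is the reduce step.

select word, count(*) as cnt
 from (select explode(split(line, ' ')) as word
        from docs) w
 group by word;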

George’s Visit to Hyderabad – Participation Goals Event

George Roter, Head of Core Contributors, Mozilla Participation, visited the Hyderabad Mozilla Community.
It was an extraordinary event and an amazing session by #George, explaining the Mozilla vision and mission to all #Mozillians and supporters of Mozilla: Mozilla products, applications, innovative programs and many more initiatives aimed at making India free of web illiteracy, Mozilla's primary goal, thereby taking technology into the hands of people from all walks of life across the globe.
Very glad and happy to be a small part of Mozilla, and eagerly waiting to contribute more to Mozilla with my best knowledge and acquired skills in the near future.

5 Reasons to Learn Hadoop

Hi Guys!

This blog post covers the top 5 reasons to learn Hadoop. Let's go through them one by one.


Big Data and Hadoop skills could mean the difference between having your dream career and getting left behind.
Dice has quoted, “Technology professionals should be volunteering for Big Data projects, which makes them more valuable to their current employer and more marketable to other employers.”

Career with Hadoop

According to ‘The Big Data Executive Survey 2013’ conducted by NewVantage Partners LLC among 90 Fortune 1000 senior business and technology executives, 90% of the organizations surveyed are already doing something with Big Data.

Hadoop skills are in demand, and this is an undeniable fact! Hence, there is an urgent need for IT professionals to keep themselves up to date with Hadoop and Big Data technologies. The info graph above shows how many organizations are influenced by Big Data and are looking to implement it, if they haven't already.

Big Data Implementation

[1]

More Job Opportunities with Apache Hadoop

  • Looking at the Big Data market forecast, it looks promising and the upward trend will keep progressing with time.
  • Therefore the job trend or Market is not a short lived phenomenon as Big Data and its technologies are here to stay.
  • Hadoop has the potential to improve job prospects whether you are a fresher or an experienced professional.

[2]

  • The Indian Big Data industry is predicted to grow five-fold from the current level of $200 million to $1 billion by 2015, which is 4% of the expected global share.
  • At the same time, Gartner has predicted that there is going to be a significant gap between job openings and candidates with Big Data skills.
  • This is the right time to take advantage of this opportunity. The skill gap in Big Data can be bridged through comprehensive learning of Apache Hadoop, which enables professionals and freshers alike to add valuable Big Data skills to their profile.

Filled Jobs vs Unfilled Jobs


Look who is employing?

[3] Where to search for Jobs in Hadoop?

  • LinkedIn is the best place to get information on the number of existing Hadoop professionals.
  • The info graph above shows the top companies employing Hadoop professionals and which of them is leading.
  • Yahoo! happens to be leading in this race.

According to Dice,

  • Dice has stated that “Tech salaries saw nearly a 3% bump last year, and IT pros with expertise in big data-related languages, databases and skills enjoyed some of the largest pay checks.”
  • Dice has also quoted, “Technology professionals should be volunteering for Big Data projects, which makes them more valuable to their current employer and more marketable to other employers.”

Big Data and Hadoop equal bucks!

[4] Top Hadoop Technology Companies

[5]

What is Big Data ?

Hi Guys!

This blog post is about understanding Big Data. We will look at what exactly Big Data is, what its types are, its characteristics, and the use-cases of Big Data.

Word Cloud

So, What is Big Data ?

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.

During the 1990s, when IT organizations were evolving, it was mainly the employees of those organizations who generated data.

Before 2000

Later, in the 2000s, when social networking sites and e-commerce websites came into existence, users also started generating data.

after 2000

Now, after 2010, due to emerging smartphone technologies and motion-sensor techniques, even devices have started generating data.

How much data is generated per minute ?

  • Facebook users share nearly 2.5 million pieces of content.
  • Twitter users tweet nearly 300,000 times.
  • Instagram users post nearly 220,000 new photos.
  • YouTube users upload 72 hours of new video content.
  • Apple users download nearly 50,000 apps.
  • Email users send over 200 million messages.
  • Amazon generates over $80,000 in online sales.

Data is generated from almost everywhere!

Data is generated from Healthcare, Multi-channel Sales, Finance, Log Analysis, Homeland Security, Traffic Control, Telecom, Search Quality, Manufacturing, Trading Analytics, Fraud and Risks, and Retail.


Data generated by Hadron Collider

Types of Data:

[1] Structured Data

[2] Unstructured Data

[3] Semi Structured Data

[1] Structured Data:

  • Your current data warehouse contains structured data and only structured data.
  • It’s structured because when you placed it in your relational database system a structure was enforced on it, so we know where it is, what it means, and how it relates to other pieces of data in there.
  • It may be text (a person’s name) or numerical (their age) but we know that the age value goes with a specific person, hence structured.

[2] Unstructured Data:

  • Essentially everything else that has not been specifically structured is considered unstructured.
  • The list of truly unstructured data includes free text such as documents produced in your company, images and videos, audio files, and some types of social media.
  • If the object to be stored carries no tags (metadata about the data) and has no established schema, ontology, glossary, or consistent organization it is unstructured.
  • However, in the same category as unstructured data there are many types of data that do have at least some organization.

[3] Semi Structured Data:

  • The line between unstructured data and semi-structured is a little fuzzy.
  • If the data has some organizational structure (a known schema) or carries a tag (like XML, the extensible markup language used for documents on the web), then it is somewhat easier to organize and analyze, and because it is more accessible for analysis it may also be more valuable.

Example: Text ( XML, Email), Web Server logs and server patterns, sensor data

Characterization of Big Data:


3V’s of Big Data:


Some call it the 4V’s.

Applications and Use-cases of Big Data:


Popular Use-cases :

[1] A 360 degree view of the customer :

  • This use is most popular, according to Gallivan. Online retailers want to find out what shoppers are doing on their sites — what pages they visit, where they linger, how long they stay, and when they leave.
  • “That’s all unstructured clickstream data,” said Gallivan. “Pentaho takes that and blends it with transaction data, which is very structured data that sits in our customers’ ERP [business management] system that says what the customers actually bought.”

[2] Internet of Things :

  • The second most popular use case involves IoT-connected devices managed by hardware, sensor, and information security companies. “These devices are sitting in their customers’ environment, and they phone home with information about the use, health, or security of the device,” said Gallivan.
  • Storage manufacturer NetApp, for instance, uses Pentaho software to collect and organize “tens of millions of messages a week” that arrive from NetApp devices deployed at its customers’ sites. This unstructured machine data is then structured, put into Hadoop, and then pulled out for analysis by NetApp.

[3] Data warehouse optimization :

  • This is an “IT-efficiency play,” Gallivan said. A large company, hoping to boost the efficiency of its enterprise data warehouse, will look for unstructured or “active” archive data that might be stored more cost effectively on a Hadoop platform. “We help customers determine what data is better suited for a lower-cost computing platform.”

[4] Big data service refinery :

  • This means using big-data technologies to break down silos across data stores and sources to increase corporate efficiency.
  • A large global financial institution, for instance, wanted to move from next-day to same-day balance reporting for its corporate banking customers. It brought in Pentaho to take data from multiple sources, process and store it in Hadoop, and then pull it out again. This allowed the bank’s marketing department to examine the data “more on an intra-day than a longer-frequency basis,” Gallivan told us.
  • “It was about driving an efficiency gain that they couldn’t get with their existing relational data infrastructure. They needed big-data technologies to collect this information and change the business process.”

[5] Information security :

  • This last use case involves large enterprises with sophisticated information security architectures, as well as security vendors looking for more efficient ways to store petabytes of event or machine data.
  • In the past, these companies would store this information in relational databases. “These traditional systems weren’t scaling, both from a performance and cost standpoint,” said Gallivan, adding that Hadoop is a better option for storing machine data.

Traditional Databases :

  • The relational database management system (or RDBMS) had been the one solution for all database needs. Oracle, IBM (IBM), and Microsoft (MSFT) are the leading players of RDBMS.
  •  RDBMS uses structured query language (or SQL) to define, query, and update the database.
  • However, the volume and velocity of business data has changed dramatically in the last couple of years. It’s skyrocketing every day.
  • Limitations of RDBMS to support “big data” :
  • First, the data size has increased tremendously to the range of petabytes—one petabyte = 1,024 terabytes. RDBMS finds it challenging to handle such huge data volumes.
  • To address this, RDBMS added more central processing units (or CPUs) or more memory to the database management system to scale up vertically.
  • Second, the majority of the data comes in a semi-structured or unstructured format from social media, audio, video, texts, and emails.
  • However, the second problem related to unstructured data is outside the purview of RDBMS because relational databases just can’t categorize unstructured data.
  • They’re designed and structured to accommodate structured data such as weblog sensor and financial data.
  • Also, “big data” is generated at a very high velocity. RDBMS lacks in high velocity because it’s designed for steady data retention rather than rapid growth.
  • Even if RDBMS is used to handle and store “big data,”  it will turn out to be very expensive.
  • As a result, the inability of relational databases to handle “big data” led to the emergence of new technologies.