Thursday, May 29, 2014

Pivot and Unpivot operations in Oracle 11g Release 2

This article shows how to use the new PIVOT and UNPIVOT operators in 11g release 2.

PIVOT
For starting, we need some data
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
create table sales
(
    customer_id varchar2(10),   
    product_id varchar2(10),
    month number,
    amount number
);

insert into sales values('Adam','Product 1',1,134);
insert into sales values('Adam','Product 2',2,250);
insert into sales values('Adam','Product 3',3,190);
insert into sales values('Jones','Product 1',1,200);
insert into sales values('Jones','Product 3',2,126);
insert into sales values('Kanes','Product 2',1,240);
insert into sales values('Kanes','Product 2',2,210);
insert into sales values('Kanes','Product 3',3,250);
commit;


As all we know, relational tables are tabular - that is, they are presented in a column-value pair. Now the simple select query is below:
1
2
SELECT a.customer_id, a.product_id, a.month, a.amount
  FROM sales a;

And this is the output:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
CUSTOMER_ID PRODUCT_ID MONTH  AMOUNT    
----------- ---------- ------ --------- 
Adam        Product 1       1       134
Adam        Product 2       2       250
Adam        Product 3       3       190
Jones       Product 1       1       200
Jones       Product 3       2       126
Kanes       Product 2       1       240
Kanes       Product 2       2       210
Kanes       Product 3       3       250

Suppose if we want to transpose the query output to row-value instead of column-value, it is called Pivoting (Rows to Columns Conversion) in SQL. The output will be transposed as
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
CUSTOMER_ID MONTH  PRODUCT_1    PRODUCT_2    PRODUCT_3    
----------- ------ ------------ ------------ ------------ 
Adam             1          134                          
Adam             2                       250             
Adam             3                                    190
Jones            1          200                          
Jones            2                                    126
Kanes            1                       240             
Kanes            2                       210             
Kanes            3                                    250

Prior to Oracle Database 11g, you would do that via some sort of a DECODE function for each value and write each distinct value as a separate column (in this article we do not mention about DECODE function). The technique is quite nonintuitive however.

Fortunately, you now have a great new feature called PIVOT for presenting any query in the crosstab format using a new operator, appropriately named PIVOT. Here is how you write the query:
1
2
3
4
5
6
7
SELECT   *
    FROM sales PIVOT (SUM (amount)
               FOR product_id
               IN  ('Product 1' AS Product_1,
                   'Product 2' AS Product_2,
                   'Product 3' AS Product_3))
ORDER BY customer_id, month;

With the SELECT statement above, we notice that we have some NULL or empty values. To replace NULL values with another value like zero value we can use COALESCE function like this:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT   customer_id,
         month,
         COALESCE (Product_1, 0) Product_1,
         COALESCE (Product_2, 0) Product_2,
         COALESCE (Product_3, 0) Product_3
    FROM sales PIVOT (SUM (amount)
               FOR product_id
               IN  ('Product 1' AS Product_1,
                   'Product 2' AS Product_2,
                   'Product 3' AS Product_3))
ORDER BY customer_id, month;

Next, suppose that we have n products, how can we list these n products not product by product? In Oracle 11g, we can do this by using PIVOT plus XML clause:
1
2
3
4
5
SELECT   *
    FROM sales PIVOT XML (SUM (amount)
               FOR product_id
               IN (SELECT DISTINCT product_id FROM sales))
ORDER BY customer_id, month;

And here is the result:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
CUSTOMER_ID MONTH  PRODUCT_ID_XML 
----------- ------ -------------- 
Adam             1 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)">134</column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)"></column></item></PivotSet>
Adam             2 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)">250</column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)"></column></item></PivotSet>
Adam             3 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)">190</column></item></PivotSet>
Jones            1 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)">200</column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)"></column></item></PivotSet>
Jones            2 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)">126</column></item></PivotSet>
Kanes            1 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)">240</column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)"></column></item></PivotSet>
Kanes            2 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)">210</column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)"></column></item></PivotSet>
Kanes            3 <PivotSet><item><column name = "PRODUCT_ID">Product 1</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 2</column><column name = "SUM(AMOUNT)"></column></item><item><column name = "PRODUCT_ID">Product 3</column><column name = "SUM(AMOUNT)">250</column></item></PivotSet>

We see that the PRODUCT_ID_XML column returns XML data type and it is very hard to see the result.
Now let's get the first XML value to see.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
<PivotSet>
 <item>
  <column name = "PRODUCT_ID">Product 1</column>
  <column name = "SUM(AMOUNT)">134</column>
 </item>
 <item>
  <column name = "PRODUCT_ID">Product 2</column>
  <column name = "SUM(AMOUNT)"/>
 </item>
 <item>
  <column name = "PRODUCT_ID">Product 3</column>
  <column name = "SUM(AMOUNT)"/>
 </item>
</PivotSet>

With XML data type, we can use EXTRACTVALUE function with XPath structure to get the value that we want. We can do like this:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT customer_id,
       month,
       EXTRACTVALUE (product_id_xml, '/PivotSet/item[1]/column[2]') Product_1,
       EXTRACTVALUE (product_id_xml, '/PivotSet/item[2]/column[2]') Product_2,
       EXTRACTVALUE (product_id_xml, '/PivotSet/item[3]/column[2]') Product_3
  FROM (SELECT   *
            FROM sales PIVOT XML (SUM (amount)
                       FOR product_id
                       IN (SELECT DISTINCT product_id FROM sales))
        ORDER BY customer_id, month);

And here is what we want to see:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
CUSTOMER_ID MONTH  PRODUCT_1 PRODUCT_2 PRODUCT_3
----------- ------ --------- --------- ---------
Adam             1 134                 
Adam             2           250       
Adam             3                     190
Jones            1 200                 
Jones            2                     126
Kanes            1           240       
Kanes            2           210       
Kanes            3                     250

UNPIVOT
For starting, we need some data
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
create table sales_month
(
    customer_id varchar2(10),  
    product_id varchar2(10),
    january number,
    february number,
    march number
);
 
insert into sales_month values('Kanes','Product 1',0,0,1);
insert into sales_month values('Jones','Product 1',0,1,1);
insert into sales_month values('Kanes','Product 3',0,1,0);
insert into sales_month values('Kanes','Product 2',1,0,0);
insert into sales_month values('Adam','Product 2',0,1,0);
insert into sales_month values('Adam','Product 1',1,0,0);
insert into sales_month values('Adam','Product 3',0,0,1);
commit;


The UNPIVOT operator converts column-based data into separate rows. See the simple query before going to converting column-based data.
1
select * from sales_month;

And the result is:
1
2
3
4
5
6
7
8
9
CUSTOMER_ID PRODUCT_ID JANUARY  FEBRUARY  MARCH  
----------- ---------- -------- --------- ------ 
Kanes       Product 1         0         0      1
Jones       Product 1         0         1      1
Kanes       Product 3         0         1      0
Kanes       Product 2         1         0      0
Adam        Product 2         0         1      0
Adam        Product 1         1         0      0
Adam        Product 3         0         0      1

Prior to 11g, we can use the DECODE function and a pivot table with the correct number of rowsby using the CONNECT BY clause in a query from dual to generate the correct number of rows for the unpivot operation. But in this article we do not mention about this.

So we will use UNPIVOT function to convert column-based data into separate rows:
1
2
3
4
5
SELECT   *
    FROM sales_month UNPIVOT (yes
                     FOR month
                     IN ("JANUARY", "FEBRUARY", "MARCH"))
ORDER BY customer_id, product_id;

Here is the result:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
CUSTOMER_ID PRODUCT_ID MONTH    YES  
----------- ---------- -------- ---- 
Adam        Product 1  JANUARY     1
Adam        Product 1  FEBRUARY    0
Adam        Product 1  MARCH       0
Adam        Product 2  FEBRUARY    1
Adam        Product 2  JANUARY     0
Adam        Product 2  MARCH       0
Adam        Product 3  FEBRUARY    0
Adam        Product 3  JANUARY     0
Adam        Product 3  MARCH       1
Jones       Product 1  FEBRUARY    1
Jones       Product 1  JANUARY     0
Jones       Product 1  MARCH       1
Kanes       Product 1  MARCH       1
Kanes       Product 1  JANUARY     0
Kanes       Product 1  FEBRUARY    0
Kanes       Product 2  FEBRUARY    0
Kanes       Product 2  JANUARY     1
Kanes       Product 2  MARCH       0
Kanes       Product 3  MARCH       0
Kanes       Product 3  FEBRUARY    1
Kanes       Product 3  JANUARY     0

With the result above, we want to sort the month column in the order. How can we do that?
We use ORDER BY clause and DECODE function:
1
2
3
4
5
6
7
SELECT   *
    FROM sales_month UNPIVOT (yes
                     FOR month
                     IN ("JANUARY", "FEBRUARY", "MARCH"))
ORDER BY customer_id,
         product_id,
         DECODE (month,  'JANUARY', 1,  'FEBRUARY', 2,  'MARCH', 3);

And here:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
CUSTOMER_ID PRODUCT_ID MONTH    YES  
----------- ---------- -------- ---- 
Adam        Product 1  JANUARY     1
Adam        Product 1  FEBRUARY    0
Adam        Product 1  MARCH       0
Adam        Product 2  JANUARY     0
Adam        Product 2  FEBRUARY    1
Adam        Product 2  MARCH       0
Adam        Product 3  JANUARY     0
Adam        Product 3  FEBRUARY    0
Adam        Product 3  MARCH       1
Jones       Product 1  JANUARY     0
Jones       Product 1  FEBRUARY    1
Jones       Product 1  MARCH       1
Kanes       Product 1  JANUARY     0
Kanes       Product 1  FEBRUARY    0
Kanes       Product 1  MARCH       1
Kanes       Product 2  JANUARY     1
Kanes       Product 2  FEBRUARY    0
Kanes       Product 2  MARCH       0
Kanes       Product 3  JANUARY     0
Kanes       Product 3  FEBRUARY    1
Kanes       Product 3  MARCH       0

Wednesday, May 14, 2014

Ubuntu Server 14.04 - Install Sqoop 1.99.3

Following steps to install Apache Sqoop 1.99.3
Prerequisites:

1. Download sqoop tar
1
hduser@ubuntu:~$ wget http://www.trieuvan.com/apache/sqoop/1.99.3/sqoop-1.99.3-bin-hadoop200.tar.gz

2. Extract tar file
1
hduser@ubuntu:~$ tar xzvf sqoop-1.99.3-bin-hadoop200.tar.gz

3. Rename folder
1
hduser@ubuntu:~$ mv sqoop-1.99.3-bin-hadoop200 sqoop

4. Move folder
1
root@ubuntu:~# mv /home/hduser/sqoop /usr/local/

5. Modify environment
1
hduser@ubuntu:~$ vi ~/.bashrc

Insert code below
1
2
3
export HADOOP_HOME=/usr/local/hadoop 
export SCOOP_HOME=/usr/local/sqoop 
export PATH=$PATH:$HADOOP_HOME/bin/:$SCOOP_HOME/bin/

6. Reboot the system to apply new configuration
1
root@ubuntu:~# reboot

7. Start Sqoop server
Important: make sure Hadoop is started (See more about Hadoop at Install Hadoop)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
hduser@ubuntu:/usr/local/sqoop$ sqoop.sh server start
Sqoop home directory: /usr/local/sqoop
Setting SQOOP_HTTP_PORT:     12000
Setting SQOOP_ADMIN_PORT:     12001
Using   CATALINA_OPTS:
Adding to CATALINA_OPTS:    -Dsqoop.http.port=12000 -Dsqoop.admin.port=12001
Using CATALINA_BASE:   /usr/local/sqoop/server
Using CATALINA_HOME:   /usr/local/sqoop/server
Using CATALINA_TMPDIR: /usr/local/sqoop/server/temp
Using JRE_HOME:        /usr/local/java/jdk1.7.0_55/
Using CLASSPATH:       /usr/local/sqoop/server/bin/bootstrap.jar
hduser@ubuntu:/usr/local/sqoop$

8. Open the browser to test sqoop server is started
Sqoop
By default the Sqoop server is run at port 12000

OR
We can login Sqoop shell
1
2
3
4
5
hduser@ubuntu:~$ sqoop.sh client
Sqoop home directory: /usr/local/sqoop
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>

9. Stop Sqoop server
1
hduser@ubuntu:/usr/local/sqoop$ sqoop.sh server start

Ubuntu Server 14.04 - Install Hive 0.13.0

Following steps to install Hive 0.13.0

1. Download hive tar
1
hduser@ubuntu:~$ wget http://mirror.tcpdiag.net/apache/hive/hive-0.13.0/apache-hive-0.13.0-bin.tar.gz

2. Extract tar file
1
hduser@ubuntu:~$ tar -xzvf apache-hive-0.13.0-bin.tar.gz

3. Rename folder
1
hduser@ubuntu:~$ mv apache-hive-0.13.0-bin hive

4. Move hive folder
1
root@ubuntu:~# mv /home/hduser/hive/ /usr/local/

5. Update environment
1
hduser@ubuntu:~$ vi ~/.bashrc

Insert code below to the end of file
1
2
3
4
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin

6. Reboot the system to apply new change
1
root@ubuntu:~$ reboot

7. Login to Hive shell
1
2
3
4
hduser@ubuntu:~$ hive

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-0.13.0.jar!/hive-log4j.properties
hive>