Sunday, November 8, 2015

1.6 How to transform Data while loading using COPY statement.

How to transform Data while loading using COPY statement.

    1. Purpose
Transforming data during loads is useful for computing values to insert into a target database column from other columns in the source database.
    1. Syntax
To transform data during load, use the following syntax to specify the target column for which you want to compute values, as an expression:
 
COPY [ [database-name.]schema-name.]table [([Column as Expression] / column[FORMAT 'format']
     [ ,...])]
FROM ...
    1. Requirement
When transforming data during loads, the COPY statement must contain at least one parsed column.

Note that the parsed column can be a FILLER column. FILLER are fields from file that we are not loading to any target column.
Such fields must be marked as FILLER. However FILLER data could be used for extracting values for other column.
Transforming data while loading is one of the advantages of using FILLER.

Note that when we extract data from another column(non filler), then that another column is also loaded to the table.
4
The expression you use in a COPY statement can be as simple as a single column or as complex as a case expression for multiple columns.

You can specify multiple columns in a COPY expression, and have multiple COPY expressions refer to the same parsed column.

You can specify COPY expressions for columns of all supported data types.
    1. Allowed Functions
COPY expressions can use many HP Vertica-supported SQL functions, operators, constants, NULLs, and comments, including these functions:
 

    1. Not allowed.
COPY expressions cannot use SQL meta functions (HP Vertica-specific), analytic functions, aggregate functions, or computed columns.
7
The return data type of the expression must be coercible to that of the target column. 
    1. How copy handles transformation errors
Errors that occur in COPY expressions are treated as SQL errors, not parse errors. When a parse errors occur, COPY rejects the row and adds a copy of the row to the rejected data file. COPY also adds a message to the exceptions file describing why the row was rejected.

 
COPY expression errors are treated as SQL errors and cause the entire load to rollback.
The HP Vertica-specific log file will include the SQL error message, but the reason for the rollback is not obvious without researching the log.

Example 1: Computing column values from other columns.

Below example uses a column JOINING_DATE to extract value for another column GIFT_ELIGIBLE.

Before running COPY statement we need to set some variables.

<> /home/sukul1 $ export LANG=en_US.UTF-8
<> /home/sukul1 $ export PAGER=less

We need to add the location of the vsql utility to the PATH variable, before running the below command.

<> /home/sukul1 $ vsql -U sukul1 -w bored2d# -h gone077.xxxx.lll.com
Welcome to vsql, the Vertica Analytic Database interactive terminal.

Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit

Bad terminal type: "xterm". Will assume vt100.

Following is the file we would load.
Note the command  \! Allows us to run shell commands from within the vsql.

sukul1=> \! cat /home/sukul1/employeedet.dat
SUKUL|28|2010-12-20|10000
RUTUJA|24|2005-12-24|20000

Following is the command used to create the table USER_30_DAY_TABLES.EMPLOYEE_1

sukul1=> CREATE TABLE USER_30_DAY_TABLES.EMPLOYEE_1
sukul1-> (
sukul1(> NAME VARCHAR(30),
sukul1(> AGE INTEGER,
sukul1(> JOINING_DATE DATE,
sukul1(> SALARY DECIMAL(18,0),
sukul1(> GIFT_ELIGIBLE CHAR(1)
sukul1(> );
CREATE TABLE

Following shows how to load the table using COPY statement.
Note that the file contains no value for the column GIFT_ELIGIBLE.
We used the "COLUMN AS EXPRESSION" syntax to extract value for the column GIFT_ELIGIBLE.

sukul1=> COPY USER_30_DAY_TABLES.EMPLOYEE_1
sukul1-> (
sukul1(> NAME,
sukul1(> AGE,
sukul1(> JOINING_DATE,
sukul1(> SALARY,
sukul1(> GIFT_ELIGIBLE AS CASE WHEN JOINING_DATE > TO_DATE('2012-02-02','YYYY-MM-DD') THEN 'N' ELSE 'Y' END
sukul1(> )
sukul1-> FROM LOCAL '/home/sukul1/employeedet.dat'
sukul1-> DELIMITER '|'
sukul1-> ABORT ON ERROR
sukul1-> DIRECT;
 Rows Loaded
-------------
           2
(1 row)

sukul1=> select * from USER_30_DAY_TABLES.EMPLOYEE_1;
  NAME  | AGE | JOINING_DATE | SALARY | GIFT_ELIGIBLE
--------+-----+--------------+--------+---------------
 RUTUJA |  24 | 2005-12-24   |  20000 | Y
 SUKUL  |  28 | 2010-12-20   |  10000 | Y
(2 rows)


Example 2: Computing column values from FILLER columns.

FILLER columns are used when we wont be loading the data fields as it is.
We could use filler to ignore fields in the file or use them to transform data and generate values for other columns.

Following shows how we generate values for the column JOINING_YEAR from JOINING_DATE.
Since we are not loading the JOINING_DATE column we mark it as FILLER.
But then use it to extract values for JOINING_YEAR.

sukul1=> \! cat /home/sukul1/employeedet.dat
SUKUL|28|2010-12-20|10000
RUTUJA|24|2005-12-24|20000

sukul1=> CREATE TABLE USER_30_DAY_TABLES.EMPLOYEE_1
sukul1-> (
sukul1(> NAME VARCHAR(30),
sukul1(> AGE INTEGER,
sukul1(> JOINING_YEAR INTEGER,
sukul1(> SALARY DECIMAL(18,0)
sukul1(> );
CREATE TABLE
sukul1=> COPY USER_30_DAY_TABLES.EMPLOYEE_1
sukul1-> (
sukul1(> NAME,
sukul1(> AGE,
sukul1(> JOINING_DATE FILLER VARCHAR(10),
sukul1(> JOINING_YEAR AS TO_NUMBER(TO_CHAR(TO_DATE(JOINING_DATE,'YYYY-MM-DD'),'YYYY')),
sukul1(> SALARY
sukul1(> )
sukul1-> FROM LOCAL '/home/sukul1/employeedet.dat'
sukul1-> DELIMITER '|'
sukul1-> ABORT ON ERROR
sukul1-> DIRECT;
 Rows Loaded
-------------
           2
(1 row)

sukul1=> select * from USER_30_DAY_TABLES.EMPLOYEE_1;
  NAME  | AGE | JOINING_YEAR | SALARY
--------+-----+--------------+--------
 SUKUL  |  28 |         2010 |  10000
 RUTUJA |  24 |         2005 |  20000

(2 rows)

No comments:

Post a Comment