COPY statement Syntax
Takeaways(Frequently asked questions):
- How to transform data before loading ?
Answer:
Transforming data before loading could mean changing datatype, length, extracting portions of the data read from the file.
Transforming data before loading could mean changing datatype, length, extracting portions of the data read from the file.
To perform transformation we need to use the
"column-as-expression"
In the column list we use the column Name followed by keyword 'AS' , followed by the expression that transforms the data.
In the column list we use the column Name followed by keyword 'AS' , followed by the expression that transforms the data.
- What would we do if we want to load only few table columns from the file? Example: If the file we are loading has only 3 columns and target table as 6 columns and we want to load only specific 3 columns from the file.
Answer:
We need to use the column list. Column list is put down after the table name within round brackets.
We need to use the column list. Column list is put down after the table name within round brackets.
Column Names restricts
the load to one or more specified columns in the table.
If you do not specify any columns, COPY assumes that
load is supposed to happen for all columns.
Table columns that you do not specify in the column list are
assigned their default values. If a column had no defined default
value, COPY inserts NULL.
The data file must contain
the same number of columns as the COPY command's column list.
This is a very important concept. If we are specifying Column
names, then file layout should exactly match the column list.(This would
slightly differ if we are using fillers, we will discuss about that in
examples)
- What are two basic functionalities of FILLER?
Answer:
- Used to
omit data that we dont want to transfer into the table.
Say the input file has 5 fields and we only want to load 2 columns in the table. In this case we define other 3 fields as fillers.
- Used to transform source data and load transformed data into column tables.
Say we want to load data from one field in file to multiple target
columns, then we would define that field as filler and then use column
expresssions to extract individual values from the filler.
- What are the different locations where the input
files could be?
Answer:
- Files could be on a Shared storage that all nodes can access.
- Files could be on the remote client from where we are running the vsql utility.
- Files could be on each of the nodes.
- Why do we specify the Node names?
Answer:
We specify the nodes on
which the data to copy resides and the node that should parse the load file.
If you omit nodename, the location of the input file defaults
to the initiator node for the COPY statement.
- What is the purpose of using "ON ANY
NODE" option ?
Answer:
ON ANY NODE specifies that the source file to load is on all of the nodes, so COPY opens the file and parses it from any node in the cluster.
ON ANY NODE specifies that the source file to load is on all of the nodes, so COPY opens the file and parses it from any node in the cluster.
Very imp option with using ON ANY NODE is the use of wildcard or
glob to load multiple files.
Using a glob results in COPY distributing the list of
files to all cluster nodes and spreading the workload.
- How Do we specify
that the files to be loaded are on the remote client?
Answer:
The LOCAL option indicates that the files are on the remote client machine and not on nodes or a shared storage facility.
The LOCAL option indicates that the files are on the remote client machine and not on nodes or a shared storage facility.
- How to read input
data from Standard input(keyboard) instead of from files?
Answer:
The "FROM STDIN" option indicates that the data is to be read from standard input.
The "FROM STDIN" option indicates that the data is to be read from standard input.
Note that we cannot use
LOCAL or STDIN when specifying nodes names of "ON ANY NODE" option.
- What are the file
formats that COPY can load?
Answer:
COPY can load
BZIP,GZIP,LZO and UNCOMPRESSED.
Default is UNCOMPRESSED.(.i.e files without any compression techniques applied to them)
Default is UNCOMPRESSED.(.i.e files without any compression techniques applied to them)
- What are different
parsers that we can use with COPY?
Answer:
DELIMITED
NATIVE
DELIMITED
NATIVE
NATIVE VARCHAR
FIXED WIDTH
ORC.
Default is DELIMITED. We
should not explicilty specify DELIMITED.
NATIVE AND NATIVE
VARCHAR are not available when using COPY LOCAL.
- What are important
options specify Load Metadata?
Answer:
4 important options that
specify Load metadata as 'NULL','ESACPE','ENCLOSED
BY' and 'DELIMITER'.
I like to remember them
as 'NEED'.
- What is the default DELIMITER and NULL character?
Answer:
Default Delimiter is |
Default NULL character
is emppty string ''. So if | is delimiter, two consecutive | would mean the
column is NULL.
For both NULL and
DELIMITER chatacter we can use any ASCII value in the range
E'\000' to E'\177', inclusive.
E'' is octal
representation.
- What would we do to skip 1st few records while loading a file?
Answer:
We would use the SKIP option to skip the number of records.
We would use the SKIP option to skip the number of records.
- What all options can be used only with FIXEDWIDTH parser?
Answer:
SKIP,TRIM and COLSIZES
are the options that can be only used with FIXEDWIDTH parser.
- What does REJECTED DATA file and EXCEPTION file contain?
Answer:
REJECTED FILE contains
the actual record that was rejected.
EXCEPTION file contains
the reason why each record was rejected.
- Can we store rejected data in a table
Answer:
We can store the
REJECTED DATA AS TABLE option to store the rejected rows in a table.
When we use this option
we do not specify the EXCEPTIONS option.
This is because the
reason for exceptoin is also stored in the same REJECTED DATA TABLE.
Column Name is 'REJECTED
REASON'.
- How to apply limit on the maximum number of rejection allowed.
Answer:
We can use the REJECTMAX
option to specify the number of allowable rejects.
Beyond that limit the
job fails.
- What happens when the input field length is greater than the target column length.
Answer:
For numeric and date
type columns, the row will be rejected.
But for character type
columns, by default COPY will truncate the data and load.
However if truncation is
not acceptable we can use the ENFORCELENGTH option to reject such rows as well.
- What are the available load methods. Which one is default?
Answer:
3 available load methods
are AUTO,DIRECT and TRICKLE.
AUTO is default. This
load data into WOS and then tuple mover would move it to ROS.
DIRECT loads directly to
ROS. This is the most commonly used option.
TRICKLE loads only into
WOS.
- How do we fail the job even if a single record is rejected?
Answer:
Normally if records get
rejected, they are written to the reject file and job continues.
However if that is not
an option then we can use the 'ABORT ON ERROR' to fail the job even if a single
record gets rejected.
- Why do we use stream Names? What happens if we dont use the stream Name.
Answer:
Stream Name is provided
just to identify the COPY job in the LOAD_STREAMS system table.
If we dont provide the
stream Name, COPY statement creates one stream for each file with syntax
TABLENAME-FileNAME
No comments:
Post a Comment