Vertica Learning: 1.3 COPY statement Syntax Takeaways(Frequently asked questions)

COPY statement Syntax Takeaways(Frequently asked questions):

How to transform data before loading ?

Answer:
Transforming data before loading could mean changing datatype, length, extracting portions of the data read from the file.

To perform transformation we need to use the "column-as-expression"
In the column list we use the column Name followed by keyword 'AS' , followed by the expression that transforms the data.

What would we do if we want to load only few table columns from the file? Example: If the file we are loading has only 3 columns and target table as 6 columns and we want to load only specific 3 columns from the file.

Answer:
We need to use the column list. Column list is put down after the table name within round brackets.

Column Names restricts the load to one or more specified columns in the table.

If you do not specify any columns, COPY assumes that load is supposed to happen for all columns.

Table columns that you do not specify in the column list are assigned their default values. If a column had no defined default value, COPY inserts NULL.

The data file must contain the same number of columns as the COPY command's column list.

This is a very important concept. If we are specifying Column names, then file layout should exactly match the column list.(This would slightly differ if we are using fillers, we will discuss about that in examples)

What are two basic functionalities of FILLER?

Answer:

Used to omit data that we dont want to transfer into the table.
Say the input file has 5 fields and we only want to load 2 columns in the table. In this case we define other 3 fields as fillers.

Used to transform source data and load transformed data into column tables.

Say we want to load data from one field in file to multiple target columns, then we would define that field as filler and then use column expresssions to extract individual values from the filler.

What are the different locations where the input files could be?

Answer:

Files could be on a Shared storage that all nodes can access.

Files could be on the remote client from where we are running the vsql utility.

Files could be on each of the nodes.

Why do we specify the Node names?

Answer:

We specify the nodes on which the data to copy resides and the node that should parse the load file.

If you omit nodename, the location of the input file defaults to the initiator node for the COPY statement.

What is the purpose of using "ON ANY NODE" option ?

Answer:
ON ANY NODE specifies that the source file to load is on all of the nodes, so COPY opens the file and parses it from any node in the cluster.

Very imp option with using ON ANY NODE is the use of wildcard or glob to load multiple files.

Using a glob results in COPY distributing the list of files to all cluster nodes and spreading the workload.

How Do we specify that the files to be loaded are on the remote client?

Answer:
The LOCAL option indicates that the files are on the remote client machine and not on nodes or a shared storage facility.

How to read input data from Standard input(keyboard) instead of from files?

Answer:
The "FROM STDIN" option indicates that the data is to be read from standard input.

Note that we cannot use LOCAL or STDIN when specifying nodes names of "ON ANY NODE" option.

What are the file formats that COPY can load?

Answer:

COPY can load BZIP,GZIP,LZO and UNCOMPRESSED.
Default is UNCOMPRESSED.(.i.e files without any compression techniques applied to them)

What are different parsers that we can use with COPY?

Answer:
DELIMITED
NATIVE

NATIVE VARCHAR

FIXED WIDTH

ORC.

Default is DELIMITED. We should not explicilty specify DELIMITED.

NATIVE AND NATIVE VARCHAR are not available when using COPY LOCAL.

What are important options specify Load Metadata?

Answer:

4 important options that specify Load metadata as 'NULL','ESACPE','ENCLOSED BY' and 'DELIMITER'.

I like to remember them as 'NEED'.

What is the default DELIMITER and NULL character?

Answer:

Default Delimiter is |

Default NULL character is emppty string ''. So if | is delimiter, two consecutive | would mean the column is NULL.

For both NULL and DELIMITER chatacter we can use any ASCII value in the range E'\000' to E'\177', inclusive.

E'' is octal representation.

What would we do to skip 1st few records while loading a file?

Answer:
We would use the SKIP option to skip the number of records.

What all options can be used only with FIXEDWIDTH parser?

Answer:

SKIP,TRIM and COLSIZES are the options that can be only used with FIXEDWIDTH parser.

What does REJECTED DATA file and EXCEPTION file contain?

Answer:

REJECTED FILE contains the actual record that was rejected.

EXCEPTION file contains the reason why each record was rejected.

Can we store rejected data in a table

Answer:

We can store the REJECTED DATA AS TABLE option to store the rejected rows in a table.

When we use this option we do not specify the EXCEPTIONS option.

This is because the reason for exceptoin is also stored in the same REJECTED DATA TABLE.

Column Name is 'REJECTED REASON'.

How to apply limit on the maximum number of rejection allowed.

Answer:

We can use the REJECTMAX option to specify the number of allowable rejects.

Beyond that limit the job fails.

What happens when the input field length is greater than the target column length.

Answer:

For numeric and date type columns, the row will be rejected.

But for character type columns, by default COPY will truncate the data and load.

However if truncation is not acceptable we can use the ENFORCELENGTH option to reject such rows as well.

What are the available load methods. Which one is default?

Answer:

3 available load methods are AUTO,DIRECT and TRICKLE.

AUTO is default. This load data into WOS and then tuple mover would move it to ROS.

DIRECT loads directly to ROS. This is the most commonly used option.

TRICKLE loads only into WOS.

How do we fail the job even if a single record is rejected?

Answer:

Normally if records get rejected, they are written to the reject file and job continues.

However if that is not an option then we can use the 'ABORT ON ERROR' to fail the job even if a single record gets rejected.

Why do we use stream Names? What happens if we dont use the stream Name.

Answer:

Stream Name is provided just to identify the COPY job in the LOAD_STREAMS system table.

If we dont provide the stream Name, COPY statement creates one stream for each file with syntax TABLENAME-FileNAME

Vertica Learning

Pages

Saturday, November 7, 2015

1.3 COPY statement Syntax Takeaways(Frequently asked questions)

No comments:

Post a Comment