Friday, November 6, 2015

1.2 COPY Statement Syntax

We will first look at the syntax of the COPY statement as given in the Vertica manuals.
Later we will go into detail on each of the options, with examples.


Syntax:

COPY [ [db-name.]schema-name.]target-table
... [ ( { column-as-expression | column }
......    [ FILLER datatype ]
......    [ FORMAT 'format' ]
......    [ ENCLOSED [ BY ] 'char' ]
......    [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
......    [ NULL [ AS ] 'string' ]
......    [ TRIM 'byte' ]
......    [ DELIMITER [ AS ] 'char' ]
... [, ... ] ) ]
... [ COLUMN OPTION ( column
......    [ FORMAT 'format' ]
......    [ ENCLOSED [ BY ] 'char' ]
......    [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
......    [ NULL [ AS ] 'string' ]
......    [ DELIMITER [ AS ] 'char' ]
... [, ... ] ) ]
FROM { STDIN
......    [ BZIP | GZIP | LZO | UNCOMPRESSED ]
...| 'pathToData' [ ON nodename | ON ANY NODE ]
......    [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
...| LOCAL { STDIN | 'pathToData' }
......    [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
}
...[ NATIVE | NATIVE VARCHAR | FIXEDWIDTH COLSIZES (integer [, ...]) | ORC ]
...[ WITH, AS ]
...[ WITH [ SOURCE source([arg=value [,...] ]) ] [ FILTER filter([arg=value [,...] ]) ] [ PARSER parser([arg=value [,...] ]) ] ]
...[ DELIMITER [ AS ] 'char' ]
...[ TRAILING NULLCOLS ]
...[ NULL [ AS ] 'string' ]
...[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
...[ ENCLOSED [ BY ] 'char' ]
...[ RECORD TERMINATOR 'string' ]
...[ SKIP records ]
...[ SKIP BYTES integer ]
...[ TRIM 'byte' ]
...[ REJECTMAX integer ]
...[ REJECTED DATA { 'path' [ ON nodename ] [, ...] | AS TABLE 'reject-table' } ]
...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]
...[ ENFORCELENGTH ]
...[ ABORT ON ERROR ]
...[ load-method ]
...[ STREAM NAME 'streamName' ]
...[ NO COMMIT ]

[[db-name.]schema-name.]target-table

Specifies the name of the columnar or flexible table into which you are loading new data.

You can optionally specify a schema (not a projection), optionally preceded by a database name.

When using more than one schema, specify the schema that contains the table.

COPY ignores the db-name and schema-name options when used as part of a CREATE EXTERNAL TABLE... or CREATE FLEX EXTERNAL TABLE... statement.
column-as-expression

Specifies the expression used to compute values for the target column. For example:

COPY t(year AS TO_CHAR(k, 'YYYY')) FROM 'myfile.dat'
Use this option to transform data when it is loaded into the target database.

Note the syntax for transforming data: the column name, followed by the keyword AS, followed by the expression that transforms the data.
Note that it is not mandatory to use any transformation. As shown by the next option, we can simply list the column names to be loaded.

If a column name is not included in the list, that column is not loaded. However, if no column list is specified at all, all the columns are loaded from the file (i.e., the file layout is the same as that of the table).
Column

Restricts the load to one or more specified columns in the table.

If you do not specify any columns, COPY loads all columns by default. We would use this when the file layout is exactly the same as that of the table. However, if the file has values for only a few of the table columns, we need to use the column names to indicate which columns the file data should go to.

Table columns that you do not specify in the column list are assigned their default values. If a column has no defined default value, COPY inserts NULL.

If you leave the column parameter blank to load all columns in the table, you can use the optional parameter COLUMN OPTION to specify parsing options for specific columns.

The data file must contain the same number of columns as the COPY command's column list.
This is a very important concept: if we are specifying column names, the file layout must exactly match the column list. (This differs slightly when we are using fillers, which we will discuss in the examples.)

If we don't specify the column list, the file layout and table layout are assumed to be identical and all columns in the table are loaded.
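
For example, a minimal sketch (the table customers(id, name, city) and the file path are hypothetical): the column list directs the file fields into id and name, while city is assigned its default value, or NULL if it has none.

=> COPY customers (id, name) FROM '/data/customers.dat' DELIMITER '|';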
FILLER datatype

Specifies not to load the column and its fields into the destination table. Use this option to omit columns that you do not want to transfer into a table.
This option is useful if the input file has more fields than the columns we intend to load into the table. For the fields that we don't wish to load, we use fillers.

This parameter also transforms data from a source column and loads the transformed data to the destination table, rather than loading the original, untransformed source column (parsed column).
Transforming data is another advantage of FILLER. Say we want to load data from one field in the file into multiple target columns; we would define that field as a filler and then use column expressions to extract the individual values from it, as in the sketch below.

If datatype is VARCHAR, be sure to set the VARCHAR length, VARCHAR(n), so that the combined length of all FILLER source fields does not exceed the target column's defined length; otherwise, the COPY command is liable to return an error.
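
As a hedged sketch of both ideas (the table people(first_name, last_name), the file path, and its single "First Last" field are all hypothetical): the source field is declared as a FILLER so it is not loaded directly, and column expressions split it into the two target columns with SPLIT_PART.

=> COPY people (
       full_name FILLER VARCHAR(64),                 -- parsed but not loaded
       first_name AS SPLIT_PART(full_name, ' ', 1),
       last_name  AS SPLIT_PART(full_name, ' ', 2)
   ) FROM '/data/people.dat';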
FORMAT 'format'

Specifies the input format to use when loading date/time and binary columns.

These are the valid input formats when loading binary columns:
  • octal
  • hex
  • bitstream

When loading date/time columns, using FORMAT significantly improves load performance. COPY supports the same formats as the TO_DATE function.
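
For example, a minimal sketch (hypothetical table and file) of loading a date column whose file representation is day-month-year:

=> COPY people (dob FORMAT 'DD-MM-YYYY') FROM '/data/people.dat' DELIMITER '|';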
pathToData

Specifies the absolute path of the file containing the data, which can be from multiple input sources.

If the path resolves to a storage location, and the user invoking COPY is not a superuser, the following privileges apply:
  • The storage location must have been created with the USER option
  • The user must already have been granted READ access to the file storage location

pathToData can optionally contain wildcards to match more than one file. The file or files must be accessible to the local client or the host on which the COPY statement runs.

If the file is an ORC file stored on a remote HDFS node, pathToData is formatted: 'webhdfs://hostname:port/path/to/file'.

You can use variables to construct the pathname.
Nodename

[Optional]

Specifies the node on which the data to copy resides and the node that should parse the load file.

You can use nodename to COPY and parse a load file from a node other than the initiator node of the COPY statement. If you omit nodename, the location of the input file defaults to the initiator node for the COPY statement.

Note: You cannot specify nodename with either STDIN or LOCAL, because STDIN is read on the initiator node only and LOCAL indicates a client node.
To read STDIN from a local client machine, use "LOCAL STDIN".
ON ANY NODE

[Optional]

Specifies that the source file to load is on all of the nodes, so COPY opens the file and parses it from any node in the cluster. Make sure that the source file is available and accessible on each cluster node.

You can use a wildcard or glob (such as *.dat) to load multiple input files, combined with the ON ANY NODE clause. Using a glob results in COPY distributing the list of files to all cluster nodes and spreading the workload.
So if we have a shared storage location to load files from, we should use a wildcard. This causes the list of matching files from the storage location to be distributed across all the nodes.
This is the technique normally used: instead of storing data files on each of the nodes, we keep the files on NAS storage and use the "/path/filename* ON ANY NODE" syntax.

You cannot specify ON ANY NODE with either STDIN or LOCAL. Specifying STDIN can only use the initiator node, and LOCAL indicates a client node.
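
A minimal sketch of this pattern (the table, the NAS mount point, and the file names are hypothetical; the mount must be visible at the same path on every node):

=> COPY sales FROM '/nas/loads/sales_*.dat' ON ANY NODE DELIMITER '|';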
STDIN

Reads data from standard input instead of from a file.
STDIN takes one input source only and is read on the initiator node. To load multiple input sources, use pathToData.

The user must have INSERT privilege on the table and USAGE privilege on the schema.
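
A minimal sketch (the table name is hypothetical); the data is typically piped into vsql from the client shell (e.g. cat data.txt | vsql -c "..."), and COPY reads rows from standard input until end-of-input:

=> COPY mytable FROM STDIN DELIMITER '|';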
LOCAL

Specifies that all paths for the COPY statement are on the client system and that all COPY variants are initiated from a client.

This option is also used frequently when the file to be loaded is on a remote client machine (and not on one of the nodes or on a shared location).

You can use the LOCAL and STDIN parameters together.
You can use LOCAL with pathToData (as described earlier) to use a path relative to the client.
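
A minimal sketch (the table and the client-side relative path are hypothetical); the file is read on the client and streamed to the initiator node:

=> COPY mytable FROM LOCAL 'data/extract.csv' DELIMITER ',';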
BZIP | GZIP | LZO | UNCOMPRESSED

Specifies the compression format of the input file.

The default value is UNCOMPRESSED, and input files can be of any format.

If you use wildcards, all qualifying input files must be in the same format. To load different file formats, specify the format types specifically.

When using concatenated BZIP or GZIP files, verify that all source files terminate with a record terminator before concatenating them.

Concatenated BZIP and GZIP files are not supported for NATIVE (binary) and NATIVE VARCHAR formats.
LZO files are assumed to have been compressed with lzop. 
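
Different compression formats can also be mixed in one statement by tagging each input file, as in this hypothetical sketch:

=> COPY mytable FROM '/data/part1.dat.gz' GZIP,
                     '/data/part2.dat.bz2' BZIP
   DELIMITER '|';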
NATIVE | NATIVE VARCHAR | FIXEDWIDTH | ORC

Specifies the parser to use when bulk loading columnar tables.

These parsers are not applicable when loading flexible tables. The ORC (Optimized Row Columnar) parser is for use with Hadoop files in that format, and is an alternative to reading them through webHDFS.

By default, COPY uses the DELIMITER parser for UTF-8 format, delimited text input data.
You do not name the delimited parser explicitly; COPY always uses it unless you specify another parser.

COPY LOCAL does not support the NATIVE and NATIVE VARCHAR parsers.
So when using COPY LOCAL we can only use FIXEDWIDTH or the default delimited parser (which is okay :P).

FIXEDWIDTH and the default delimited parser are the ones we would normally use.
WITH, AS

Improve readability of the statement. These parameters have no effect on the actions performed by the statement.
[WITH [ SOURCE source(arg=value)]
[FILTER filter(arg=value)]
[PARSER parser(param=value)]]

Directs COPY to optionally use one or more User-Defined Load functions. You can specify up to one source, zero or more filters, and up to one parser, each with zero or more arguments.
To load flexible tables, use the PARSER parameter followed by one of the flex table parsers: favroparser, fcefparser, fcsvparser, fdelimitedparser, fdelimitedpairparser, fjsonparser, or fregexparser.
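
For example, a minimal sketch of loading JSON into a flex table with one of these parsers (the table and file are hypothetical):

=> COPY flex_events FROM '/data/events.json' PARSER fjsonparser();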
COLUMN OPTION
Specifies load metadata for one or more columns declared in the table column list.

For example, you can specify that a column has its own DELIMITER, ENCLOSED BY, NULL as 'NULL' expression, and so on. You do not have to specify every column name explicitly in the COLUMN OPTION list, but each column you specify must correspond to a column in the table column list.

Column-level load metadata includes DELIMITER, ENCLOSED BY, NULL, and ESCAPE.
There are three places in the COPY statement where we can specify these: at the statement level, per column in the column list, and in the COLUMN OPTION list.
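
A minimal sketch (hypothetical table and file): all columns are loaded, but one column gets its own NULL marker through COLUMN OPTION while the rest use the statement-level defaults:

=> COPY results COLUMN OPTION (score NULL 'N/A')
   FROM '/data/results.dat' DELIMITER '|';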

COLSIZES (integer [,...])
Specifies column widths when loading fixed-width data. 
COPY requires that you specify COLSIZES when using the FIXEDWIDTH parser.

COLSIZES and the list of integers must correspond to the columns listed in the table column list.
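
A minimal sketch (the table and the widths are hypothetical): each integer gives the byte width of the corresponding column in the fixed-width file:

=> COPY employees FROM '/data/emp.fw' FIXEDWIDTH COLSIZES (5, 20, 8);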
DELIMITER
Provides Load Metadata

Indicates the single ASCII character used to separate columns within each record of a file.
The default in HP Vertica is a vertical bar (|).
You can use any ASCII value in the range E'\000' to E'\177', inclusive.
You cannot use the same character for both the DELIMITER and NULL parameters.
TRAILING NULLCOLS
Specifies that if HP Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULLs. 
ESCAPE [AS]

When reading in data, COPY interprets only the characters defined by these options as special values:
  • ESCAPE [AS]
  • DELIMITER
  • ENCLOSED [BY]
  • RECORD TERMINATOR

ESCAPE [AS] sets the escape character, indicating that the following character should be interpreted literally rather than as a special character.

You can define an escape character using any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000').
NO ESCAPE

Eliminates escape character handling. Use this option if you do not need any escape character and you want to prevent characters in your data from being interpreted as escape sequences.
ENCLOSED [BY]

Sets the quote character within which to enclose data, allowing delimiter characters to be embedded in string values.

You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').

By default, ENCLOSED BY has no value, meaning data is not enclosed by any sort of quote character.
NULL

The string representing a null value.

The default is an empty string (''). You can specify a null value as any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').

You cannot use the same character for both the DELIMITER and NULL options.
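
Putting several of these statement-level options together in one hedged sketch (table and file hypothetical): commas separate columns, values may be wrapped in double quotes, and the literal string 'NA' loads as SQL NULL:

=> COPY mytable FROM '/data/extract.csv' DELIMITER ',' ENCLOSED BY '"' NULL 'NA';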
RECORD TERMINATOR

Specifies the literal character string indicating the end of a data file record.
SKIP records

Indicates the number (integer) of records to skip in a load file.
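
A typical use is skipping a header line, as in this hypothetical sketch:

=> COPY mytable FROM '/data/extract.csv' DELIMITER ',' SKIP 1;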
SKIP BYTES total

Indicates the total number (integer) of bytes in a record to skip.

This option is only available when loading fixed-width data.
TRIM

Trims the number of bytes you specify from a column.

This option is only available when loading fixed-width data.
REJECTMAX

Specifies a maximum number of logical records that can be rejected before a load fails.

REJECTED DATA { 'path'
[ ON nodename ] [, ...] |
AS TABLE reject-table }

Specifies the file name or absolute path of the file in which to write rejected rows.
The rejected data consists of each row that failed to load due to a parsing error.

Use the REJECTED DATA clause with the EXCEPTIONS clause.
The load exceptions describe why each corresponding row was rejected.

The optional ON nodename clause moves any existing rejected-data files on nodename to the specified path on the same node.

Alternatively, use the REJECTED DATA AS TABLE reject-table clause to save rejected rows to a columnar table. Saving rejected data to a table also saves the reason for the rejected row. You can then query the table to access rejected data information.

If you use both the NO COMMIT and REJECTED DATA AS TABLE clauses in your COPY statement, and the reject-table does not already exist, Vertica Analytic Database saves the rejected data table as a LOCAL TEMP table and returns a message that a LOCAL TEMP table is being created.
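
A minimal sketch of the table variant (both table names are hypothetical); the reject table keeps each failed row together with the reason it was rejected:

=> COPY sales FROM '/data/sales.dat' DELIMITER '|' REJECTED DATA AS TABLE sales_rejects;
=> SELECT rejected_data, rejected_reason FROM sales_rejects;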
EXCEPTIONS 'path'[ ON nodename ] [, ...]

Specifies the file name or absolute path of the file in which to write exceptions.
Exceptions are descriptive messages explaining why each rejected row was rejected.

Each exception describes the corresponding record in the file specified by the REJECTED DATA option.

If path resolves to a storage location, and the user invoking COPY is not a superuser, the following privileges are required:
  • The storage location must have been created with the USER option
  • The user must already have been granted READ access to the storage location where the files exist, as described in GRANT

The optional ON nodename clause moves any existing exceptions files on nodename to the indicated path on the same node. 

Specifying an exceptions file name is incompatible with using the REJECTED DATA AS TABLE clause; in that case, the exception messages appear in the reject table's rejected_reason column instead.
ENFORCELENGTH
Determines whether COPY truncates or rejects data rows of type CHAR, VARCHAR, BINARY, and VARBINARY when they do not fit the target table.

By default, COPY truncates offending values of these data types but does not reject the rows; specifying ENFORCELENGTH causes such rows to be rejected instead.
ABORT ON ERROR

Specifies that COPY stops if any row is rejected.

The statement is rolled back and no data is loaded.
load-method

Specifies how to load data into the database, one of the following:

  • AUTO (default): Loads data into WOS. Use this method for smaller bulk loads.
  • DIRECT: Loads data directly into ROS containers. Use this method for large bulk loads (100 MB or more).
  • TRICKLE: Loads data only into WOS. Use for frequent incremental loads, after the initial bulk load is complete.
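
For instance, a large bulk load would typically go straight to ROS, as in this hypothetical sketch:

=> COPY big_table FROM '/data/big_extract.dat' DELIMITER '|' DIRECT;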

STREAM NAME
[Optional]

Supplies a COPY load stream identifier. Using a stream name helps to quickly identify a particular load.
The STREAM NAME value that you supply in the load statement appears in the stream column of the LOAD_STREAMS system table.

By default, HP Vertica names streams by table and file name. For example, if you are loading two files (f1, f2) into TableA, their default stream names are TableA-f1, TableA-f2, respectively.

To name a stream:
=> COPY mytable FROM 'myfile' DELIMITER '|' DIRECT STREAM NAME 'My stream name';

NO COMMIT
Prevents the COPY statement from committing its transaction automatically when it finishes copying data.
 
COPY ignores this option when used as part of a CREATE EXTERNAL TABLE statement.
If you use both the NO COMMIT and REJECTED DATA AS TABLE clauses in your COPY statement, and the reject-table does not already exist, Vertica Analytic Database saves the rejected data table as a LOCAL TEMP table and returns a message that a LOCAL TEMP table is being created.
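
A minimal sketch (table and path hypothetical): the load is held uncommitted so it can be validated before being made permanent:

=> COPY mytable FROM '/data/extract.dat' DELIMITER '|' NO COMMIT;
=> COMMIT;   -- or ROLLBACK; to discard the load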

