COPY Statement Syntax:
We will first look at the syntax of the COPY statement as given in the Vertica manuals, and later go into the details of each option with examples.
Syntax:
COPY [ [db-name.]schema-name.]target-table
... [ ( { column-as-expression | column }
...... [ FILLER datatype ]
...... [ FORMAT 'format' ]
...... [ ENCLOSED [ BY ] 'char' ]
...... [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
...... [ NULL [ AS ] 'string' ]
...... [ TRIM 'byte' ]
...... [ DELIMITER [ AS ] 'char' ]
... [, ... ] ) ]
... [ COLUMN OPTION ( column
...... [ FORMAT 'format' ]
...... [ ENCLOSED [ BY ] 'char' ]
...... [ ESCAPE [ AS ] 'char' | NO ESCAPE ]
...... [ NULL [ AS ] 'string' ]
...... [ DELIMITER [ AS ] 'char' ]
... [, ... ] ) ]
FROM { STDIN
...... [ BZIP | GZIP | LZO | UNCOMPRESSED ]
...| 'pathToData' [ ON nodename | ON ANY NODE ]
...... [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
...| LOCAL {STDIN | 'pathToData'}
...... [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
}
...[ NATIVE | NATIVE VARCHAR | FIXEDWIDTH COLSIZES (integer [, ...]) | ORC]
...[ WITH, AS ]
...[ WITH [ SOURCE source([arg=value [,...] ])] [ FILTER filter([arg=value [,...] ]) ] [ PARSER parser([arg=value [,...] ]) ] ]
...[ DELIMITER [ AS ] 'char' ]
...[ TRAILING NULLCOLS ]
...[ NULL [ AS ] 'string' ]
...[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
...[ ENCLOSED [ BY ] 'char' ]
...[ RECORD TERMINATOR 'string' ]
...[ SKIP records ]
...[ SKIP BYTES integer ]
...[ TRIM 'byte' ]
...[ REJECTMAX integer ]
...[ REJECTED DATA {'path' [ ON nodename ] [, ...] | AS TABLE 'reject-table'} ]
...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]
...[ ENFORCELENGTH ]
...[ ABORT ON ERROR ]
...[ load-method ]
...[ STREAM NAME 'streamName']
...[ NO COMMIT ]

Parameters:

[[db-name.]schema-name.]target-table
Specifies the name of the columnar or flexible table into which you are loading new data. You can optionally specify a schema (not a projection), optionally preceded by a database name. When using more than one schema, specify the schema that contains the table. COPY ignores the db-name and schema-name options when used as part of a CREATE EXTERNAL TABLE or CREATE FLEX EXTERNAL TABLE statement.

column-as-expression
Specifies the expression used to compute values for the target column. For example:
COPY t (year AS TO_CHAR(k, 'YYYY')) FROM 'myfile.dat'
Use this option to transform data as it is loaded into the target database. Note the syntax for transforming data: the column name, followed by the keyword AS, followed by the expression that transforms the data.
Transformations are not mandatory; as the next option shows, we can simply list the column names to be loaded. If a column is not named in the column list, it is assumed that the column is not loaded from the file. However, if no column list is specified at all, then all the columns are loaded from the file (i.e., the file layout is the same as that of the table).

column
Restricts the load to one or more specified columns in the table. If you do not specify any columns, COPY loads all columns by default. We would omit the list when the file layout is exactly the same as that of the table. However, if the file has values for only a few of the table columns, then we need to use the column names to indicate which columns the file data should go to (see the sketch just below).
Table columns that you do not specify in the column list are assigned their default values. If a column has no defined default value, COPY inserts NULL.
If you leave the column parameter blank to load all columns in the table, you can use the optional COLUMN OPTION parameter to specify parsing options for specific columns. The data file must contain the same number of columns as the COPY statement's column list.
This is a very important concept: if we are specifying column names, the file layout should exactly match the column list (this differs slightly if we are using fillers, which we will discuss in the examples). If we don't specify the column list, it means the file layout and table layout are identical and all columns in the table are to be loaded.
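
A minimal sketch of a partial column list, assuming a hypothetical table t1(a, b, c) and a file that supplies values only for a and c (column b receives its default, or NULL):
=> COPY t1 (a, c) FROM '/tmp/data.txt' DELIMITER ',';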

FILLER datatype
Specifies not to load the column and its fields into the destination table. Use this option to omit columns that you do not want to transfer into a table. This option is useful when the input file has more columns than we intend to load into the table: for the fields we do not wish to load, we use fillers.
This parameter can also transform data from a source column and load the transformed data to the destination table, rather than loading the original, untransformed source column (parsed column). Transforming data is another advantage of FILLER: say we want to load data from one field in the file into multiple target columns; we would define that field as a filler and then use column expressions to extract the individual values from it.
If the datatype is VARCHAR, be sure to set the VARCHAR length, VARCHAR(n), so that the combined length of all FILLER source fields does not exceed the target column's defined length; otherwise, the COPY command is liable to return an error.
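
A minimal sketch of the transformation use, assuming a hypothetical table names with a single full_name column; the two file fields are declared as fillers and concatenated into the target column:
=> COPY names (first FILLER VARCHAR(20), last FILLER VARCHAR(20), full_name AS first || ' ' || last) FROM STDIN;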

FORMAT 'format'
Specifies the input format to use when loading date/time and binary columns. For date/time columns, format is a template string of the kind accepted by the TO_DATE family of functions (for example, 'YYYY-MM-DD'). The valid input formats when loading binary columns are octal, hex, and bitstream.
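
A minimal sketch, assuming a hypothetical table t1 whose order_date column arrives in the file as eight digits:
=> COPY t1 (order_date FORMAT 'YYYYMMDD') FROM '/tmp/orders.txt' DELIMITER '|';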

pathToData
Specifies the absolute path of the file containing the data, which can be from multiple input sources. If the path resolves to a storage location and the user invoking COPY is not a superuser, the storage location must have been created with the USER usage type and the user must have been granted READ access to it.
pathToData can optionally contain wildcards to match more than one file. The file or files must be accessible to the local client or the host on which the COPY statement runs. If the file is an ORC file stored on a remote HDFS node, pathToData is formatted as 'webhdfs://hostname:port/path/to/file'. You can use variables to construct the pathname.

nodename
[Optional] Specifies the node on which the data to copy resides and the node that should parse the load file. You can use nodename to COPY and parse a load file from a node other than the initiator node of the COPY statement. If you omit nodename, the location of the input file defaults to the initiator node of the COPY statement.
Note: You cannot specify nodename with either STDIN or LOCAL, because STDIN is read on the initiator node only and LOCAL indicates a client node. To read STDIN from a local client machine, use "LOCAL STDIN".

ON ANY NODE
[Optional] Specifies that the source file to load is available on all of the nodes, so COPY can open the file and parse it from any node in the cluster. Make sure that the source file is available and accessible on each cluster node.
You can use a wildcard or glob (such as *.dat) to load multiple input files, combined with the ON ANY NODE clause. Using a glob results in COPY distributing the list of files to all cluster nodes and spreading the workload. So if we have a shared storage location to load files from, we should use a wildcard; this causes the files from the storage location to be distributed across all nodes. This is the technique normally used: instead of storing data files on each of the nodes, we keep the files on NAS storage and use the "/path/filename* ON ANY NODE" syntax, as in the sketch below.
You cannot specify ON ANY NODE with either STDIN or LOCAL: STDIN can only use the initiator node, and LOCAL indicates a client node.
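
A minimal sketch, assuming a hypothetical NAS mount /nas/load that is visible from every node:
=> COPY sales FROM '/nas/load/sales*.dat' ON ANY NODE DELIMITER '|';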

STDIN
Reads from standard input instead of a file. STDIN takes one input source only and is read on the initiator node. To load multiple input sources, use pathToData. The user must have INSERT privilege on the table and USAGE privilege on the schema.
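
A minimal sketch in vsql, assuming a hypothetical table t1; rows are typed (or piped) in and the stream is ended with a backslash-period line:
=> COPY t1 FROM STDIN DELIMITER '|';
1|apple
2|banana
\.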

LOCAL
Specifies that all paths for the COPY statement are on the client system and that all COPY variants are initiated from a client. This option is used frequently when the file to be loaded is on a remote machine (and not on one of the nodes or a shared location).
You can use the LOCAL and STDIN parameters together. You can also use LOCAL with pathToData (as described earlier) to use a relative path.
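
A minimal sketch, assuming a hypothetical file sitting on the client machine rather than on a cluster node:
=> COPY t1 FROM LOCAL '/home/user/data.txt' DELIMITER '|';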

BZIP | GZIP | LZO | UNCOMPRESSED
Specifies the input file format. The default value is UNCOMPRESSED, and input files can be of any format. If you use wildcards, all qualifying input files must be in the same format. To load files of different formats, specify each format type explicitly.
When using concatenated BZIP or GZIP files, verify that all source files terminate with a record terminator before concatenating them. Concatenated BZIP and GZIP files are not supported for the NATIVE (binary) and NATIVE VARCHAR formats. LZO files are assumed to have been compressed with lzop.
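
A minimal sketch, assuming a hypothetical gzip-compressed load file:
=> COPY t1 FROM '/data/load/file01.txt.gz' GZIP DELIMITER '|';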

NATIVE | NATIVE VARCHAR | FIXEDWIDTH | ORC
Specifies the parser to use when bulk loading columnar tables. These parsers are not applicable when loading flexible tables. The ORC (Optimized Row Columnar) parser is for use with Hadoop files in that format; it is an alternative to reading through webHDFS.
By default, COPY uses the DELIMITER parser for UTF-8 format, delimited text input data. Do not specify DELIMITER as a parser; COPY always uses the default parser unless you specify another.
COPY LOCAL does not support the NATIVE and NATIVE VARCHAR parsers, so when using COPY LOCAL we can only use FIXEDWIDTH or delimited input (which is fine, since FIXEDWIDTH and delimited are the ones we would normally use anyway).

WITH, AS
Improve the readability of the statement. These keywords have no effect on the actions performed by the statement.

WITH [ SOURCE source(arg=value) ] [ FILTER filter(arg=value) ] [ PARSER parser(param=value) ]
Directs COPY to optionally use one or more User Defined Load functions. You can specify up to one source, zero or more filters, and up to one parser, each taking zero or more arguments.
To load flexible tables, use the PARSER parameter followed by one of the flex table parsers: favroparser, fcefparser, fcsvparser, fdelimitedparser, fdelimitedpairparser, fjsonparser, or fregexparser.
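
A minimal sketch, assuming a hypothetical flex table flex_events and a JSON input file:
=> COPY flex_events FROM '/data/events.json' PARSER fjsonparser();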

COLUMN OPTION
Specifies load metadata for one or more columns declared in the table column list. For example, you can specify that a column has its own DELIMITER, ENCLOSED BY, NULL AS 'string' expression, and so on. You do not have to specify every column name explicitly in the COLUMN OPTION list, but each column you specify must correspond to a column in the table column list.
Column metadata includes DELIMITER, ENCLOSED BY, NULL, and ESCAPE. There are three places in the COPY statement where we can specify these, and COLUMN OPTION is one of them; values given here apply to the named columns instead of the statement-level settings.
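
A minimal sketch, assuming a hypothetical table t1 whose d column arrives in a non-default date layout while the rest of the file uses the statement-level settings:
=> COPY t1 COLUMN OPTION (d FORMAT 'DD/MM/YYYY') FROM '/tmp/data.txt' DELIMITER ',';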

COLSIZES (integer [,...])
Specifies column widths when loading fixed-width data. COPY requires that you specify COLSIZES when using the FIXEDWIDTH parser. COLSIZES and its list of integers must correspond to the columns listed in the table column list.
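
A minimal sketch, assuming a hypothetical three-column table whose fixed-width file carries fields of 4, 10, and 8 bytes:
=> COPY t1 FROM '/tmp/fixed.dat' FIXEDWIDTH COLSIZES (4, 10, 8);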

DELIMITER
Provides load metadata: indicates the single ASCII character used to separate columns within each record of a file. The default in HP Vertica is a vertical bar (|). You can use any ASCII value in the range E'\000' to E'\177', inclusive. You cannot use the same character for both the DELIMITER and NULL parameters.

TRAILING NULLCOLS
Specifies that if HP Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULLs.
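
A minimal sketch, assuming a hypothetical file whose trailing columns are sometimes absent:
=> COPY t1 FROM '/tmp/short_rows.txt' DELIMITER '|' TRAILING NULLCOLS;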

ESCAPE [AS]
When reading in data, COPY interprets only the characters defined by these options as special values. ESCAPE [AS] sets the escape character, which indicates that the following character should be interpreted literally rather than as a special character. You can define an escape character using any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000').

NO ESCAPE
Eliminates escape character handling. Use this option if you do not need any escape character and you want to prevent characters in your data from being interpreted as escape sequences.

ENCLOSED [BY]
Sets the quote character within which to enclose data, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000'). By default, ENCLOSED BY has no value, meaning data is not enclosed by any sort of quote character.
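
A minimal sketch, assuming a hypothetical CSV file whose string values are double-quoted so they may contain commas:
=> COPY t1 FROM '/tmp/quoted.csv' DELIMITER ',' ENCLOSED BY '"';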

NULL
The string representing a null value. The default is an empty string (''). You can specify a null value as any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000'). You cannot use the same character for both the DELIMITER and NULL options.
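
A minimal sketch, assuming a hypothetical comma-separated file in which missing values are written as NA:
=> COPY t1 FROM '/tmp/data.csv' DELIMITER ',' NULL 'NA';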

RECORD TERMINATOR
Specifies the literal character string indicating the end of a data file record.

SKIP records
Indicates the number (integer) of records to skip in a load file.
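
A minimal sketch, assuming a hypothetical CSV file with a single header row to discard:
=> COPY t1 FROM '/tmp/with_header.csv' DELIMITER ',' SKIP 1;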

SKIP BYTES total
Indicates the total number (integer) of bytes in a record to skip. This option is only available when loading fixed-width data.

TRIM
Trims the number of bytes you specify from a column. This option is only available when loading fixed-width data.

REJECTMAX
Specifies the maximum number of logical records that can be rejected before a load fails.

REJECTED DATA { 'path' [ ON nodename ] [, ...] | AS TABLE reject-table }
Specifies the file name or absolute path of the file in which to write rejected rows. The rejected data consists of each row that failed to load due to a parsing error. Use the REJECTED DATA clause with the EXCEPTIONS clause; the load exceptions describe why each corresponding row was rejected. The optional ON nodename clause moves any existing rejected data files on nodename to path on the same node.
Alternatively, use the REJECTED DATA AS TABLE reject-table clause to save rejected rows to a columnar table. Saving rejected data to a table also saves the reason each row was rejected; you can then query the table to access the rejected data information.
If you use both the NO COMMIT and REJECTED DATA AS TABLE clauses in your COPY statement, and the reject-table does not already exist, Vertica Analytic Database saves the rejected data table as a LOCAL TEMP table and returns a message that a LOCAL TEMP table is being created.
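
A minimal sketch that lands rejections in a queryable table (the table name is hypothetical):
=> COPY t1 FROM '/tmp/data.txt' DELIMITER '|' REJECTED DATA AS TABLE t1_rejects;
=> SELECT rejected_data, rejected_reason FROM t1_rejects;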

EXCEPTIONS 'path' [ ON nodename ] [, ...]
Specifies the file name or absolute path of the file in which to write exceptions. Exceptions are descriptive messages explaining why each rejected row was rejected; each exception describes the corresponding record in the file specified by the REJECTED DATA option. If the path resolves to a storage location and the user invoking COPY is not a superuser, the storage location must have been created with the USER usage type and the user must have been granted access to it.
The optional ON nodename clause moves any existing exceptions files on nodename to the indicated path on the same node.
Specifying an exceptions file name is incompatible with using the REJECTED DATA AS TABLE clause; in that case, the exceptions are listed in the table's rejected_reason column.
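
A minimal sketch pairing EXCEPTIONS with a REJECTED DATA file (both paths are hypothetical):
=> COPY t1 FROM '/tmp/data.txt' DELIMITER '|' REJECTED DATA '/tmp/rejected.txt' EXCEPTIONS '/tmp/exceptions.txt';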

ENFORCELENGTH
Determines whether COPY truncates or rejects data rows of type char, varchar, binary, and varbinary if they do not fit the target table. By default, COPY truncates offending rows of these data types but does not reject them.

ABORT ON ERROR
Specifies that COPY stops if any row is rejected. The statement is rolled back and no data is loaded.

load-method
Specifies how to load data into the database, one of the following: AUTO (the default), DIRECT, or TRICKLE. AUTO loads into WOS first and spills to ROS when the WOS is full, DIRECT writes straight to ROS, and TRICKLE loads only into the WOS.

STREAM NAME
[Optional] Supplies a COPY load stream identifier. Using a stream name helps to quickly identify a particular load. The STREAM NAME value that you supply in the load statement appears in the stream column of the LOAD_STREAMS system table.
By default, HP Vertica names streams by table and file name. For example, if you are loading two files (f1, f2) into TableA, their default stream names are TableA-f1 and TableA-f2, respectively.
To name a stream:
=> COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';

NO COMMIT
Prevents the COPY statement from committing its transaction automatically when it finishes copying data. COPY ignores this option when used as part of a CREATE EXTERNAL TABLE statement.
If you use both the NO COMMIT and REJECTED DATA AS TABLE clauses in your COPY statement, and the reject-table does not already exist, Vertica Analytic Database saves the rejected data table as a LOCAL TEMP table and returns a message that a LOCAL TEMP table is being created.
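
A minimal sketch: the load stays uncommitted so it can be inspected and then committed or rolled back (table and path are hypothetical):
=> COPY t1 FROM '/tmp/data.txt' DELIMITER '|' NO COMMIT;
=> SELECT COUNT(*) FROM t1; -- verify the load
=> COMMIT;                  -- or ROLLBACK to discard it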