What is the relation between COPY
statement and UTF-8 Encoding format.
|
You can load ASCII data, which is
UTF-8 compatible. (This is
what we normally would do)
Character sets like ISO 8859-1
(Latin1) are incompatible with UTF-8 and are not supported.
|
|
UTF-8 is an abbreviation for Unicode
Transformation Format-8 (where 8 equals 8-bit) and is a variable-length
character encoding for
Unicode created by Ken Thompson and Rob Pike.
UTF-8 can represent any universal
character in the Unicode standard, yet the initial encoding of byte codes and
character assignments for UTF-8 is coincident with ASCII (requiring little or
no change for software that handles ASCII but preserves other values).
All input data received by the
database server is expected to be in UTF-8, and all data output by HP
Vertica is in UTF-8.
The client drivers automatically
convert data to and from UTF-8 when sending to and receiving data
from HP Vertica using API calls.
The drivers do not transform data
loaded by executing a COPY or COPY LOCAL
statement. SO its important to have data in the UTF 8 format.
|
|
To make sure that the file can be loaded
properly we can use the 'file' unix command to determine the type of
the file.
The file command reports the
encoding of any text files.
For example:
$ file Date_Dimension.tbl
Date_Dimension.tbl: ASCII text
Then to check for multibyte
characters in an ASCII file, use the wc command. For example:
$ wc Date_Dimension.tbl
1828 5484 221822 Date_Dimension.tbl
If the wc command returns
an error such as Invalid or incomplete multibyte or wide character, the
data file is using an incompatible character set.
|
|
Following shows example of files
that are not in UTF-8(or compatible ASCII format).
So these cannot be loaded using the
COPY statement.
$ file data*
data1.txt: Little-endian UTF-16 Unicode text data2.txt: ISO-8859 text |
|
To convert files before loading
them into HP Vertica, use the 'iconv' UNIX command.
iconv -f ISO88599 -t utf-8
data2.txt > data2-utf8.txt
To list the supported encoding
techniques use the command 'iconv -l'.
|
|
After loading data, use the ISUTF8 function to verify that all of
the string-based data in the table is in UTF-8 format.
=> SELECT name FROM nametable
WHERE ISUTF8(name) = FALSE;
If all of the strings are in UTF-8
format, the query should not return any rows.
|
Bottom
Line: Files to be loaded using COPY should be in UTF-8 format OR UTF-compatible
ASCII format.
No comments:
Post a Comment