Saturday, November 7, 2015

1.4 What is the relation Between COPY statement and UTF-8 Encoding format.

What is the relation between COPY statement and UTF-8 Encoding format.


    1.  
HP Vertica supports loading data files in the Unicode UTF-8 format.
You can load ASCII data, which is UTF-8 compatible. (This is what we normally would do)

Character sets like ISO 8859-1 (Latin1) are incompatible with UTF-8 and are not supported.
    1.  
UTF-8 is an abbreviation for Unicode Transformation Format-8 (where 8 equals 8-bit) and is a variable-length character encoding for Unicode created by Ken Thompson and Rob Pike.

UTF-8 can represent any universal character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is coincident with ASCII (requiring little or no change for software that handles ASCII but preserves other values).

All input data received by the database server is expected to be in UTF-8, and all data output by HP Vertica is in UTF-8.

The client drivers automatically convert data to and from UTF-8 when sending to and receiving data from HP Vertica using API calls.

The drivers do not transform data loaded by executing a COPY or COPY LOCAL statement. SO its important to have data in the UTF 8 format.
    1.  
To make sure that the file can be loaded properly we can use the 'file' unix command to determine the type of the file.

The file command reports the encoding of any text files.  For example:

          $ file Date_Dimension.tbl
          Date_Dimension.tbl: ASCII text

Then to check for multibyte characters in an ASCII file, use the wc command. For example:

          $ wc Date_Dimension.tbl
            1828   5484 221822 Date_Dimension.tbl

If the wc command returns an error such as Invalid or incomplete multibyte or wide character, the data file is using an incompatible character set.

    1.  
Following shows example of files that are not in UTF-8(or compatible ASCII format).
So these cannot be loaded using the COPY statement.

          $ file data*
            data1.txt: Little-endian UTF-16 Unicode text
            data2.txt: ISO-8859 text
    1.  
To convert files before loading them into HP Vertica, use the 'iconv' UNIX command. 

          iconv -f ISO88599 -t utf-8 data2.txt > data2-utf8.txt

To list the supported encoding techniques use the command 'iconv -l'.
    1.  
After loading data, use the ISUTF8 function to verify that all of the string-based data in the table is in UTF-8 format. 

          => SELECT name FROM nametable WHERE ISUTF8(name) = FALSE;

If all of the strings are in UTF-8 format, the query should not return any rows.


Bottom Line: Files to be loaded using COPY should be in UTF-8 format OR UTF-compatible ASCII format.

No comments:

Post a Comment