Thursday, November 19, 2015

2.5 How to Manually purge data.

How to Manually purge data.

Manually purging deleted data consists of the following series of steps:

  1. Determine the point in time to which you want to purge deleted data.

  1. Set the Ancient History Mark (AHM) to this point in time using one of the following SQL functions (described in the SQL Reference Manual):

  • SET_AHM_TIME() sets the AHM to the epoch that includes the specified TIMESTAMP value on the initiator node.
     
  • SET_AHM_EPOCH() sets the AHM to the specified epoch.
     
  • GET_AHM_TIME() returns a TIMESTAMP value representing the AHM.
     
  • GET_AHM_EPOCH() returns the number of the epoch in which the AHM is located.
     
  • MAKE_AHM_NOW() sets the AHM to the greatest allowable value (now), and lets you drop pre-existing projections. This purges all deleted data.

What is AHM?
AHM stands for Ancient History Mark.
AHM the epoch prior to which historical data can be purged from physical storage. 

When you use SET_AHM_TIME or GET_AHM_TIME, keep in mind that the timestamp you specify is mapped to an epoch, which has (by default) a three-minute granularity. Thus, if you specify an AHM time of '2008-01-01 00:00:00.00' the resulting purge could permanently remove as much as the first three minutes of 2008, or could fail to remove the last three minutes of 2007.

  1. Manually initiate a purge using one of the following SQL functions:

  1. The Tuple Mover performs a mergeout operation to purge the data.


Following are the important functions.
Note that all are functions and we need to execute them using SELECT.

SET_AHM_TIME()
Sets the Ancient History Mark (AHM) to the epoch corresponding to the specified time on the initiator node. This function allows historical data up to and including the AHM epoch to be purged from physical storage.

SET_AHM_TIME ( time , [ true ] )

  1. "time" Is a TIMESTAMP value that is automatically converted to the appropriate epoch number.
     
  2. True [Optional] Allows the AHM to advance when nodes are down.

SET_AHM_TIME returns a TIMESTAMP WITH TIME ZONE value representing the end point of the AHM epoch.

You cannot change the AHM when any node in the cluster is down, except by using the optional true parameter

Epochs depend on a configured epoch advancement interval. If an epoch includes a three-minute range of time, the purge operation is accurate only to within minus three minutes of the specified timestamp:

=> SELECT SET_AHM_TIME('2008-02-27 18:13');
           set_ahm_time
------------------------------------
 AHM set to '2008-02-27 18:11:50-05'
(1 row)

In the previous example, the actual AHM epoch ends at 18:11:50, roughly one minute before the specified timestamp. This is because SET_AHM_TIME selects the epoch that ends at or before the specified timestamp. It does not select the epoch that ends after the specified timestamp because that would purge data deleted as much as three minutes after the AHM.

For example, using only hours and minutes, suppose that epoch 9000 runs from 08:50 to 11:50 and epoch 9001 runs from 11:50 to 15:50. SET_AHM_TIME('11:51')chooses epoch 9000 because it ends roughly one minute before the specified timestamp.

    Sets the Ancient History Mark (AHM) to the specified epoch. This function allows deleted data up to and including the AHM epoch to be purged from physical storage.

     
    SET_AHM_EPOCH ( epoch, [ true ]  )

    Epoch: The number of the epoch in which to set the AHM
    True : Optionally allows the AHM to advance when nodes are down.

    If you use SET_AHM_EPOCH , the number of the specified epoch must be:
    • Greater than the current AHM epoch
    • Less than the current epoch
    • Less than or equal to the cluster last good epoch (the minimum of the last good epochs of the individual nodes in the cluster)

    Use the SYSTEM table to see current values of various epochs related to the AHM;

    sukul1=> select * from system;
    -[ RECORD 1 ]------------+--------------
    current_epoch            | 4158884
    ahm_epoch                | 4158880
    last_good_epoch          | 4158883
    refresh_epoch            | -1
    designed_fault_tolerance | 1
    node_count               | 54
    node_down_count          | 0
    current_fault_tolerance  | 1
    catalog_revision_number  | 5434550
    wos_used_bytes           | 0
    wos_row_count            | 0
    ros_used_bytes           | 2499756575124
    ros_row_count            | 205470844516
    total_used_bytes         | 2499756575124
    total_row_count          | 205470844516


    The following command sets the AHM to a specified epoch of 12:

    => SELECT SET_AHM_EPOCH(12);

    The following command sets the AHM to a specified epoch of 2 and allows the AHM to advance despite a failed node:

    => SELECT SET_AHM_EPOCH(2, true);

Returns a TIMESTAMP value representing the Ancient History Mark.

SELECT GET_AHM_TIME();
                  GET_AHM_TIME
-------------------------------------------------
 Current AHM Time: 2010-05-13 12:48:10.532332-04
(1 row)

Returns the number of the epoch in which the Ancient History Mark is located.

SELECT GET_AHM_EPOCH();
    GET_AHM_EPOCH
----------------------
 Current AHM epoch: 0
(1 row)

Sets the Ancient History Mark (AHM) to the greatest allowable value, and lets you drop any projections that existed before the issue occurred.

This function is intended for use by Administrators only.

MAKE_AHM_NOW ( [ true ] )

True: [Optional] Allows AHM to advance when nodes are down. 


Permanently removes deleted data from physical storage so that the disk space can be reused. You can purge historical data up to and including the epoch in which the Ancient History Mark is contained.

Note that only data that is deleted and is beyond the AHM will be purged.

Purges the specified projection.

PURGE_PROJECTION ( '[[db-name.]schema.]projection-name' )

=> CREATE TABLE tbl (x INT , y INT);
CREATE TABLE
=> CREATE PROJECTION tbl_p AS SELECT x FROM tbl;
CREATE PROJECTION
=> SELECT PURGE_PROJECTION ('tbl_p');
purge_projection
-------------------
Projection purged
(1 row)
(projection Name is in single quotes).

Purges a table partition of deleted rows. Similar to PURGE() and PURGE_PROJECTION(), this function removes deleted data from physical storage so you can reuse the disk space. PURGE_PARTITION() removes data from the AHM epoch and earlier only.

PURGE_PARTITION ( '[[db_name.]schema_name.]table_name', partition-key )

tablename: The partitioned table to purge.
partition
key : The key of the partition to purge.

=> SELECT PURGE_PARTITION('t',5); -- Purge partition with key 5.
                            purge_partition
------------------------------------------------------------------------
 Task: merge partitions
(Table: public.t) (Projection: public.t_super)
(1 row)

Permanently removes deleted data from physical storage so that the disk space can be reused.

Purges all projections of the specified table. You cannot use this function to purge temporary tables.

 
PURGE_TABLE ( '[[db-name.]schema.]table-name' )

Permanently removes deleted data from physical storage so that the disk space can be reused. You can purge historical data up to and including the epoch in which the Ancient History Mark is contained.

Purges all projections in the physical schema. PURGE does not delete temporary tables.

PURGE()

PURGE() was formerly named PURGE_ALL_PROJECTIONS. HP Vertica supports both function calls.



Monday, November 16, 2015

CONCEPTS: Tuple Mover, Mergeout, Moveout

Tuple Mover:

The Tuple Mover (TM) is the HP Vertica database optimizer component that
  • moves data from memory (WOS) to disk (ROS). (called moveout operation)
  • combines small ROS containers into larger ones, (called mergeout operation)
  • and purges deleted data. (called mergeout operation)

During moveout operations, the TM is also responsible for adhering to any storage policies that are in effect for the storage location. The Tuple Mover runs in the background, performing some tasks automatically (ATM) at time intervals determined by its configuration parameters. 


Mergeout:
A mergeout is the
  • process of consolidating ROS containers(Smaller, scattered ROS containers are grouped to larger ones) and
  • purging deleted records. (Deleted Records older than AHM)

Moveout

Moveout operations move data from memory (WOS) into a new ROS container. A moveout "flushes" all historical data from the WOS to the ROS.

2.4 How to Set a purge policy:

How to Set a purge policy:

  1. The preferred method for purging data is to establish a policy that determines which deleted data is eligible to be purged.
    Eligible data is automatically purged when the Tuple Mover performs mergeout operations.

  1. Two methods for determining when deleted data is eligible to be purged:
     
    • Specifying the time for which delete data is saved
       
    • Specifying the number of epochs that are saved


  1. Specifying the time for which delete data is saved is the preferred method for determining which deleted data can be purged.

To change the specified time for saving deleted data, use the HistoryRetentionTime configuration parameter:

=> ALTER DATABASE mydb SET HistoryRetentionTime = {seconds | -1};

In the above syntax:
  • seconds is the amount of time (in seconds) for which to save deleted data.
  • -1 indicates that you do not want to use the HistoryRetentionTime configuration parameter to determine which deleted data is eligible to be purged. Use this setting if you prefer to use the other method (HistoryRetentionEpochs) for determining which deleted data can be purged.

Unless you have a reason to limit the number of epochs, HP recommends that you specify the time over which delete data is saved.

  1. To specify the number of historical epoch to save through the HistoryRetentionEpochs configuration parameter:

Turn off the HistoryRetentionTime configuration parameter:

=> ALTER DATABASE mydb SET HistoryRetentionTime = -1;

Set the history epoch retention level through the HistoryRetentionEpochs configuration parameter:

=> ALTER DATABASE mydb SET HistoryRetentionEpochs = {num_epochs | -1};


In above Syntax
  • num_epochs is the number of historical epochs to save.
  • -1 indicates that you do not want to use the HistoryRetentionEpochs configuration parameter to trim historical epochs from the epoch map. By default, HistoryRetentionEpochs is set to -1.

If both HistoryRetentionTime and HistoryRetentionEpochs are specified, HistoryRetentionTime takes precedence.

  1. If you want to preserve all historical data, set the value of both historical epoch retention parameters to -1, as follows:

=> ALTER DABABASE mydb SET HistoryRetentionTime = -1;
=> ALTER DATABASE mydb SET HistoryRetentionEpochs = -1;


2.3 What does purging Data mean?

What does purging Data mean?


  1. In HP Vertica, delete operations do not remove rows from physical storage.
    The 
    DELETE command in HP Vertica marks rows as deleted so that they remain available to historical queries

    Retention of historical data also applies to the 
    UPDATE command, which is actually a combined DELETE and INSERT operation.

  1. The cost of retaining deleted data in physical storage can be measured in terms of:

  • Disk space for the deleted rows and delete markers
  • A performance penalty for reading and skipping over deleted data

  1. Purging : Permanently removes deleted data from physical storage so that the disk space can be reused.

  1. We can control how much deleted data is retained in the physical storage used by your database by performing a purge operation using one of the following techniques:


Purpose of both these operations is to set the Ancient History Mark (AHM), which is an epoch that represents the time until which history is retained. History older than the AHM are eligible for purge.
AHM Is the epoch prior to which historical data can be purged from physical storage.

2.2 Optimizing updates and Deletes , EVALUATE_DELETE_PERFORMANCE

Best Practices for Update and Delete:

  1. HP Vertica is optimized for query-intensive workloads, so DELETE and UPDATE queries might not achieve the same level of performance as other queries.

  1. DELETE and UPDATE operations go to the WOS by default, but if the data is sufficiently large and would not fit in memory, HP Vertica automatically switches to using the ROS

  1. HP Vertica does not remove deleted data immediately but keeps it as history for the purposes of historical query . A large amount of history can result in slower query performance. 

  1. A large number of (unpurged) deleted rows can negatively affect query performance.
    To eliminate rows that have been deleted from the result, a query must do extra processing.

    If 10% or more of the total rows in a table have been deleted, the performance of a query on the table degrades. 

  1. If a table has a large number of deleted rows, consider purging those rows to improve performance. 

  1. Recovery is the action required for a cluster to restore K-safety after a crash. Large numbers of deleted records can degrade the performance of a recovery. To improve recovery performance, purge the deleted rows. 

Optimizing DELETEs and UPDATEs for Performance
The following sections describe several ways to improve projection design and improve DELETE and UPDATE queries to significantly increase DELETE and UPDATE performance.

  1. When all columns required by the DELETE or UPDATE predicate are present in a projection, the projection is optimized for DELETEs and UPDATEs. DELETE and UPDATE operations on such projections are significantly faster than on non-optimized projections.
    Both simple and pre-join projections can be optimized.

    For example, consider the following table and projections:

    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);CREATE PROJECTION p1 (a, b, c) AS SELECT * FROM t ORDER BY a;CREATE PROJECTION p2 (a, c) AS SELECT a, c FROM t ORDER BY c, a;

    In the following query, both
     p1 and p2 are eligible for DELETE and UPDATE optimization because column a is available:
    DELETE from t WHERE a = 1;

    In the following example, only projection
     p1 is eligible for DELETE and UPDATE optimization because the b column is not available in p2:
    DELETE from t WHERE b = 1;





Optimized DELETEs are not supported under the following conditions:
 
  • With pre-join projections on nodes that are down
     
  • With replicated and pre-join projections if subqueries reference the target table. For example, the following syntax is not supported:
    DELETE FROM tb1 WHERE tb1.a IN (SELECT e FROM tb2, tb2 WHERE tb2.e = tb1.e);
     
  • With subqueries that do not return multiple rows. For example, the following syntax is not supported:
    DELETE FROM tb1 WHERE tb1.a = (SELECT k from tb2);



  1. Design your projections so that frequently-used DELETE or UPDATE predicate columns appear in the sort order of all projections for large DELETEs and UPDATEs.
    For example, suppose most of the DELETE queries you perform on a projection look like the following:DELETE from t where time_key < '1-1-2007'
    To optimize the DELETEs, make time_key appear in the ORDER BY clause of all your projections. This schema design results in better performance of the DELETE operation.


  1. To analyze projections for sort order issues, use the EVALUATE_DELETE_PERFORMANCE function.(evaluates delete and update performance issues)
    This function Evaluates projections for potential
     DELETE performance issues. If there are issues found, a warning message is displayed. 

    This function
    uses data sampling to determine whether there are any issues with a projection. Therefore, it does not generate false-positives warnings, but it can miss some cases where there are performance issues.
    Optimizing for delete performance is the same as optimizing for update performance.
    So, you can use this function to help optimize a projection for updates as well as deletes.
    Syntax:

    EVALUATE_DELETE_PERFORMANCE ( 'target' )

target
The name of a projection or table. If you supply the name of a projection, only that projection is evaluated for DELETE performance issues. If you supply the name of a table, then all of the projections anchored to the table will be evaluated for issues.

If you do not provide a projection or table name, EVALUATE_DELETE_PERFORMANCE examines all of the projections that you can access for DELETE performance issues. Depending on the size you your database, this may take a long time.

When evaluating multiple projections, EVALUATE_DELETE_PERFORMANCE reports up to ten projections that have issues, and refers you to a table that contains the full list of issues it has found.

The following example demonstrates how you can use EVALUATE_DELETE_PERFORMANCE to evaluate your projections for slow DELETE performance.


=> create table example (A int, B int,C int);
CREATE TABLE
=> create projection one_sort (A,B,C) as (select A,B,C from example) order by A;
CREATE PROJECTION
=> create projection two_sort (A,B,C) as (select A,B,C from example) order by A,B;
CREATE PROJECTION
=> select evaluate_delete_performance('one_sort');
            evaluate_delete_performance           
---------------------------------------------------
 No projection delete performance concerns found.
(1 row)
=> select evaluate_delete_performance('two_sort');
            evaluate_delete_performance           
---------------------------------------------------
 No projection delete performance concerns found.
(1 row)

Sunday, November 15, 2015

2.1 4 main methods of deleting data in bulk from tables.

4 main methods of deleting data in bulk from tables.

Command
Description
Permanently removes a table and its definition. Optionally removes associated views and projections as well.
Marks rows with delete vectors and stores them so data can be rolled back to a previous epoch.

The data must eventually be purged before the database can reclaim disk space.
Removes all storage and history associated with a table. The table structure is preserved for future use.

The results of this command cannot be rolled back.
Removes one partition from a partitioned table. Each partition contains a related subset of data in the table.

Partitioned data can be dropped efficiently, and provides query performance benefits.


Choosing the Right Technique for Deleting Data :

  • To delete both table data and definitions and start from scratch, use the DROP TABLE [CASCADE] command.

  • To drop data, while preserving table definitions so that you can quickly and easily reload data, use TRUNCATE TABLE. Note that unlike DELETE, TRUNCATE does not have to mark each row with delete vectors, so it runs much more quickly. Explains why TRUNCATE is faster.

  • To perform bulk delete operations on a regular basis, HP Vertica recommends using Partitioning.

  • To perform occasional small deletes or updates with the option to roll back or review history, use DELETE FROM TABLE.



A] DROP TABLE

  1. Used to remove tables and optionally related projections.

  1. Syntax:

     
DROP TABLE [ IF EXISTS ] [[db-name.]schema.]table [, ...] [ CASCADE ]

             Notice the IF EXISTS clause. It is optional.
             If specified, DROP TABLE does not report an error if one or more of the tables to be dropped does not exist. This clause is useful in SQL scripts where you want to drop a table if it exists before recreating it.

              Also notice the CASCADE option. This is also optional.
              Drops all projections that include the table. We need to use this option if the table to be dropped has projections.

              Following shows an example of using DELETE on a table with projections on it.

              In 1st statement we create unsegmented projections on the table.                     Note that we have created unsegmented projection on all nodes.

sukul1=> CREATE PROJECTION USER_30_DAY_TABLES.p1_emp AS select FIRSTNAME FROM USER_30_DAY_TABLES.EMPTABLE UNSEGMENTED ALL NODES;
WARNING 4468:  Projection <USER_30_DAY_TABLES.p1_emp_node0001> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
          The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
WARNING 4468:  Projection <USER_30_DAY_TABLES.p1_emp_node0002> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
          The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
WARNING 4468:  Projection <USER_30_DAY_TABLES.p1_emp_node0003> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
          The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh

        In next statement we can see that the drop fails because of number of projections dependent on the table.

sukul1=> DROP TABLE USER_30_DAY_TABLES.EMPTABLE;
NOTICE 4927:  The Projection p1_emp_node0001 depends on Table EMPTABLE
NOTICE 4927:  The Projection p1_emp_node0002 depends on Table EMPTABLE
NOTICE 4927:  The Projection p1_emp_node0003 depends on Table EMPTABLE
ROLLBACK 3128:  DROP failed due to dependencies
DETAIL:  Cannot drop Table EMPTABLE because other objects depend on it
HINT:  Use DROP ... CASCADE to drop the dependent objects too

       In next statement we can see how using CASCADE option works.

sukul1=> DROP TABLE USER_30_DAY_TABLES.EMPTABLE CASCADE;
DROP TABLE
sukul1=>


B] DELETE


  1. Marks tuples(records) as no longer valid in the current epoch, marking the records for deletion in the WOS, rather than deleting data from disk storage.

  1. By default, delete uses the WOS and if the WOS fills up, overflows to the ROS.

  1. You cannot delete records from a projection.

  1. Syntax

              DELETE [ /*+ DIRECT */ ] [ /*+ LABEL(label-name)*/ ]
              ... FROM [[
db-name.]schema.]table
              ...
WHERE Clause

/*+ DIRECT */
Writes the data directly to disk (ROS) bypassing memory (WOS).
Note: If you delete using the direct hint, you still need to issue a COMMIT or ROLLBACK command to finish the transaction.

  1. DELETE statements support subqueries and joins, which is useful for deleting values in a table based on values that are stored in other tables.

  1. To remove all rows from a temporary table, use a DELETE statement with no WHERE clause. In this case, the rows are not stored in the system, which improves performance.(Similar to truncate)

  1. If you include a WHERE clause with a DELETE statement on a temporary table, DELETE behaves the same as for anchor tables, marking all delete vectors for storage, and you lose any performance benefits.

  1. To truncate a temporary table, without ending the transaction, use DELETE FROM temp_table.
    Truncate table is a DDL and will commit the current transaction.


C] TRUNCATE TABLE

  1. Removes all storage associated with a table, while preserving the table definitions.

  1. TRUNCATE TABLE auto-commits the current transaction after statement execution and cannot be rolled back. TRUNCATE TABLE is a DDL.

  1. Syntax
    TRUNCATE TABLE [[
    db-name.]schema.]table

  1. To truncate an ON COMMIT DELETE ROWS temporary table without ending the transaction, use DELETE FROM temp_table syntax.
    Note: The effect of DELETE FROM depends on the table type. If the table is specified as ON COMMIT DELETE ROWS, then DELETE FROM works like TRUNCATE TABLE;
    otherwise it behaves like a normal delete in that it does not truncate the table.

  1. After truncate operations complete, the data recovers from that current epoch onward. Because TRUNCATE TABLE removes table history, AT EPOCH queries return nothing.


D] DROP_PARTITION

  1. Forces the partition of projections (if needed) and then drops the specified partition.
    REMEMBER DROP PARTITION is not a statement, its a function. Called using SELECT.

  1. Syntax
    DROP_PARTITION (
    table_name , partition_value [ , ignore_moveout_errors, reorganize_data ])

table-name
Specifies the name of the table
partition_value
The key of the partition to drop. For example: DROP_PARTITION('trade', 2006);
ignore_moveout_errors
    Optional Boolean, defaults to false.
    • true—Ignores any WOS moveout errors and forces the operation to continue. Set this parameter to true only if there is no WOS data for the partition.
    • false (or omit)—Displays any moveout errors and aborts the operation on error.
    Note: If you set this parameter to true and the WOS includes data for the partition in WOS, partition data in WOS is not dropped.
reorganize_data
    Optional Boolean, defaults to false.
    • true—Reorganizes the data if it is not organized, and then drops the partition.
    • false—Does not attempt to reorganize the data before dropping the partition. If this parameter is false and the function encounters a ROS without partition keys, an error occurs.

  1. The results of a DROP_PARTITION call go into effect immediately. If you drop a partition using DROP_PARTITION and then try to add data to a partition with the same name, HP Vertica creates a new partition.

  1. In general, if a ROS container has data that belongs to n+1 partitions and you want to drop a specific partition, the DROP_PARTITION operation:
    1. Forces the partition of data into two containers where
      • One container holds the data that belongs to the partition that is to be dropped.
      • Another container holds the remaining n partitions.
    1. Drops the specified partition.
      • DROP_PARTITION forces a moveout if there is data in the WOS (WOS is not partition aware).
      • DROP_PARTITION acquires an exclusive lock on the table to prevent DELETE | UPDATE | INSERT | COPY statements from affecting the table, as well as any SELECT statements issued at SERIALIZABLE isolation level.

  1. You cannot perform a DROP_PARTITION operation on tables with projections that are not up to date

  1. Example:
    SELECT DROP_PARTITION('trade', 2009);DROP_PARTITION-------------------Partition dropped(1 row)
    Note that the table needs to be partitioned in the 1st place for the DROP_PARTITION to work.