Datameer's housekeeping is a background service that improves processing by deleting obsolete data from HDFS, removing old entries from job history, and removing unsaved workbooks.

Housekeeping retention is defined by count, where data objects above a given number are marked for deletion (e.g. keep 10 data objects; when object 11 arrives, object 1 is deleted), by time (e.g. at the end of every business day), or by a combination of both. Data entities that are to be deleted are first marked for deletion. Deletion follows the site's retention policy, which the Datameer X administrator can configure in the default.properties file.
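
The count-based and time-based rules can be illustrated with a minimal Python sketch. It is not Datameer's implementation: `DataObject`, `mark_obsolete`, `keep_count`, and `max_age` are hypothetical names, and the sketch assumes that in the combined case an object is marked as soon as either rule applies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DataObject:
    # Hypothetical stand-in for a data entity; not a Datameer class.
    name: str
    created: datetime
    marked_for_deletion: bool = False


def mark_obsolete(objects: List[DataObject],
                  keep_count: Optional[int] = None,
                  max_age: Optional[timedelta] = None,
                  now: Optional[datetime] = None) -> List[DataObject]:
    """Mark objects that exceed the count limit or the age limit.

    Objects are only marked here; physical deletion is a separate, later step.
    Assumption: when both limits are given, either one is enough to mark.
    """
    now = now or datetime.now()
    # Newest first, so positions >= keep_count are the oldest objects.
    newest_first = sorted(objects, key=lambda o: o.created, reverse=True)
    for position, obj in enumerate(newest_first):
        over_count = keep_count is not None and position >= keep_count
        too_old = max_age is not None and now - obj.created > max_age
        if over_count or too_old:
            obj.marked_for_deletion = True
    return [o for o in objects if o.marked_for_deletion]
```

With `keep_count=10` and eleven objects, only the oldest object is marked, which matches the "keep 10, delete object 1" example above.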

A property that delays file deletion from HDFS for a period longer than the standard backup interval can be used to configure failover behavior. Similarly, a property that delays data deletion after a Datameer X upgrade can be used to ensure that a rollback succeeds if one is needed.

...

  • data entities from the database that are in status 'DELETED'
  • If a deletion fails, the entry's column value is set to 'DELETION_FAILED'

Physical Data Deletion

The database table 'filesystem_artifact_to_delete' tracks physical (HDFS/distributed storage) artifact deletion.

Example:

The status column entries are:

  • '0' stands for 'WAITING_FOR_DELETION': artifacts are scheduled for deletion the next time the Housekeeping Service runs
  • '1' stands for 'DELETION_FAILED': artifacts whose deletion from distributed storage failed during a Housekeeping Service run
  • '2' stands for 'DATA_DELETION_DISABLED': artifacts are scheduled for deletion, but the distributed storage artifact deletion has been disabled via the flag 'housekeeping.data.deletion.disabled=true'
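
When inspecting rows from 'filesystem_artifact_to_delete', the numeric status codes listed above can be translated into their names with a small helper. The following is only a sketch: `ArtifactStatus` and `describe_status` are hypothetical names introduced for illustration, and only the three code/name pairs come from this page.

```python
from enum import IntEnum


class ArtifactStatus(IntEnum):
    # Code/name pairs as listed for the status column above.
    WAITING_FOR_DELETION = 0
    DELETION_FAILED = 1
    DATA_DELETION_DISABLED = 2


def describe_status(code: int) -> str:
    """Return the status name for a numeric status column value.

    Raises ValueError for codes that are not listed on this page.
    """
    return ArtifactStatus(code).name
```

For example, `describe_status(1)` returns `'DELETION_FAILED'`, which identifies an artifact whose deletion from distributed storage failed.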


Housekeeping Does Not Delete

  • data not marked for deletion (configurable for failover and rollback reasons)
  • data in status 'MARKED_FOR_DELETION' that is used in a running job
  • data referenced by an active workbook snapshot
  • data of a sheet that is linked in a different workbook and marked as kept; in this case the data is copied instead of being referenced

...