Datameer's housekeeping service runs in the background and keeps the software running smoothly by deleting unnecessary data. The service consists of different operations, such as deleting physical data on the cluster or deleting data entities from the database. All data entities that are set to be deleted must first be marked for deletion. When a data entity is deleted depends on your individual retention policy. The policy is defined by count (e.g., keep 10 data objects; when number 11 comes in, delete the first one), by time (e.g., after every business day), or both.
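The count-based and time-based retention rules described above can be illustrated with a minimal Python sketch. It is not Datameer code: the DataObject class, the select_for_deletion function, and its parameters are hypothetical names, and treating "both" as the union of the two rules is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class DataObject:
    # Hypothetical stand-in for a data entity; not a Datameer class.
    object_id: int
    created_at: datetime


def select_for_deletion(
    objects: List[DataObject],
    now: datetime,
    max_count: Optional[int] = None,
    max_age: Optional[timedelta] = None,
) -> List[DataObject]:
    """Return the objects a count- and/or time-based policy would mark."""
    newest_first = sorted(objects, key=lambda o: o.created_at, reverse=True)
    marked: Dict[int, DataObject] = {}
    if max_count is not None:
        # Everything beyond the newest max_count objects is surplus.
        for obj in newest_first[max_count:]:
            marked[obj.object_id] = obj
    if max_age is not None:
        # Anything older than max_age is outdated.
        for obj in newest_first:
            if now - obj.created_at > max_age:
                marked[obj.object_id] = obj
    # Assumption: with both rules set, an object is marked if either applies.
    return sorted(marked.values(), key=lambda o: o.created_at)
```

With eleven objects and max_count=10, only the oldest object is returned, which matches the "number 11 comes in, delete the first one" example.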
What Does Datameer's Housekeeping Service Do?
- Deletes obsolete data from HDFS
- Removes old entries from the job history
- Removes unsaved workbooks
Housekeeping deletes
- Data in status "marked for deletion" on the Hadoop cluster after 30 minutes.
After the data status has been set to MARKED_FOR_DELETION, filesystem artifacts (e.g., job logs) are deleted in two steps. First, the database entry is deleted and a new database entry containing the path to the HDFS object is created in the FilesystemArtifactToDelete table. The state column of this entry is set to WAITING_FOR_DELETION. Second, the housekeeping service collects all entries in the FilesystemArtifactToDelete table that are in state WAITING_FOR_DELETION and tries to delete their paths from HDFS. If this succeeds, the entry is removed from the table.
As of Datameer 6.3
If a deletion fails, the entry's column value is set to DELETION_FAILED.
- Data entities from the database that are in status "deleted".
- Job history after 28 days.
- Unsaved workbooks after 3 days.
All of these can be configured by the administrator to best fit their individual needs in the default properties.
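The second step of the two-step artifact deletion described in this section can be sketched as a small state machine. This is an illustrative model only: ArtifactEntry, process_waiting_entries, and the delete_path callback are hypothetical names, and the in-memory list merely stands in for the FilesystemArtifactToDelete table.

```python
from dataclasses import dataclass
from typing import Callable, List

WAITING_FOR_DELETION = "WAITING_FOR_DELETION"
DELETION_FAILED = "DELETION_FAILED"


@dataclass
class ArtifactEntry:
    # Simplified model of one row in the FilesystemArtifactToDelete table.
    path: str
    state: str = WAITING_FOR_DELETION


def process_waiting_entries(
    table: List[ArtifactEntry],
    delete_path: Callable[[str], bool],
) -> List[ArtifactEntry]:
    """Run step two once and return the rows that remain in the table."""
    remaining = []
    for entry in table:
        if entry.state != WAITING_FOR_DELETION:
            # Only entries in state WAITING_FOR_DELETION are collected.
            remaining.append(entry)
            continue
        try:
            deleted = delete_path(entry.path)
        except OSError:
            deleted = False
        if not deleted:
            # Behavior described for Datameer 6.3 and above.
            entry.state = DELETION_FAILED
            remaining.append(entry)
        # On success the row is dropped from the table.
    return remaining
```

Passing the delete operation in as a callback keeps the sketch independent of any particular HDFS client.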
Housekeeping doesn't delete
- Data that is not marked for deletion. (Configurable for failover reasons)
- Data in status "marked for deletion" that is used in a running job.
- Data referenced by an active workbook snapshot.
As of Datameer 6.3
In Datameer version 6.3 and above, data is copied instead of being referenced when a sheet is linked in a different workbook and is marked as kept.
Configuring Housekeeping
The Housekeeping configuration settings can be found in the default.properties file in Datameer.
################################################################################################
## Housekeeping configuration
################################################################################################
# As of Datameer v6.3
housekeeping.enabled=true

# Define the maximum number of days job executions are saved in the job history, after a job has been completed.
housekeeping.execution.max-age=28d

# Maximum number of out-dated executions that should be deleted per housekeeping run
housekeeping.run.delete.outdated-job-executions=50

# To allow for better failover due to a crashed database, deleted data should be kept longer than the configured
# frequency of database backups.
housekeeping.keep-deleted-data=30m

# Maximum number of out-dated data objects that should be marked for deletion per housekeeping run
housekeeping.run.mark-for-deletion.outdated-data-objects=200

# Maximum number of out-dated data objects that should be deleted per housekeeping run
housekeeping.run.delete.outdated-data-objects=25

# Maximum number out-dated data artifacts that should be deleted from HDFS per housekeeping run
housekeeping.run.delete.outdated-data-artifacts=100

# Define the maximum number of days unsaved workbooks are stored in the database.
housekeeping.temporary-files.max-age=3d

# Minimum time to keep files in temporary folder after last access.
housekeeping.temporary-folder-files.max-age=30d

# Maximum number of out-dated temporary conductor files that should be deleted per housekeeping run
housekeeping.run.delete.outdated-temporary-files=50

# The time that the housekeeping service falls asleep after each cycle
housekeeping.sleep-time=1h
################################################################################################
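When auditing these settings, it can help to read the duration values programmatically. The following Python sketch parses simple key=value lines and converts values such as 28d, 30m, and 1h into time spans. The helper names are hypothetical, and the unit meanings (m = minutes, h = hours, d = days) are an assumption inferred from the defaults described on this page. It is not how Datameer itself reads the file.

```python
import re
from datetime import timedelta
from typing import Dict

# Assumed unit suffixes: m = minutes, h = hours, d = days.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}
_DURATION = re.compile(r"^(\d+)([mhd])$")


def parse_duration(value: str) -> timedelta:
    """Convert a value such as '28d' or '30m' into a timedelta."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def parse_properties(text: str) -> Dict[str, str]:
    """Read simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props
```

For example, parse_duration(props["housekeeping.keep-deleted-data"]) can be compared against your own database backup interval before changing the setting.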
The default configuration works well for most enterprise environments and limits the number of artifacts that the HousekeepingService touches during one transaction. The Housekeeping service itself runs as long as there are objects to delete and performs multiple transactions as necessary. By default, the service strives for the largest possible delete count. Unless you want to keep artifacts longer or change the number of transactions per run, there is no need to change the default configuration.
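The relationship between a per-transaction limit and the number of transactions in a run can be sketched as a simple loop. This is a conceptual illustration only: run_in_batches, fetch_batch, and delete_batch are hypothetical names, not part of Datameer.

```python
from typing import Callable, List


def run_in_batches(
    fetch_batch: Callable[[int], List[str]],
    delete_batch: Callable[[List[str]], None],
    batch_size: int,
) -> int:
    """Delete until nothing is left, at most batch_size items per transaction.

    Returns the number of transactions performed. The caller's delete_batch
    must actually remove the items, otherwise the loop never ends.
    """
    transactions = 0
    while True:
        batch = fetch_batch(batch_size)
        if not batch:
            return transactions
        delete_batch(batch)
        transactions += 1
```

With 60 pending objects and a limit of 25 per transaction, this loop performs three transactions. A smaller limit means more, shorter transactions for the same amount of work.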
Adjusting for failover protection
To protect against a database crash, delay deleting files from HDFS for a period longer than your database backup interval by adding this property in conf/live.properties:
# To allow for better failover due to a crashed database, deleted data should be kept longer than the configured
# frequency of database backups.
housekeeping.keep-deleted-data=30m
Read more information on Configuring a Server for Datameer Failover.
Viewing Housekeeping Service Log Files
Housekeeping service information is stored separately from the conductor.log.
The housekeeping log files can be found in <Datameer>/logs/housekeeping.log*
By default, the Housekeeping service writes to a log file until the file reaches 1MB. After 1MB, a new file is created, up to 10 times, for a total of 11 Housekeeping log files. After that, each new file replaces the oldest file.
Review all files matching <Datameer>/logs/housekeeping.log* to get a complete picture of Housekeeping activity.
See Log4j for more information and for how to change the default behavior of the file appender.
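Because the log is spread across several housekeeping.log* files, a small script can read them as one stream. The sketch below is a convenience helper, not a Datameer tool. The function names are hypothetical, and ordering files by modification time is an assumption that holds only if the rotated files have not been touched since they were written.

```python
from pathlib import Path
from typing import Iterator, List


def housekeeping_log_files(log_dir: Path) -> List[Path]:
    """Return housekeeping.log* files, oldest first by modification time."""
    files = [p for p in log_dir.glob("housekeeping.log*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def iter_log_lines(log_dir: Path) -> Iterator[str]:
    """Yield every line from all housekeeping log files, oldest file first."""
    for path in housekeeping_log_files(log_dir):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Call iter_log_lines with the path to your <Datameer>/logs directory to iterate over the complete history that is still on disk.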
Reading Housekeeping Service Log Files
The log contains the following entries:
Service run start
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:155) - Starting housekeeping
...
Example: Service run
...
[system] INFO [<timestamp>] [HousekeepingService thread-1] (JobExecutionService.java:461) - Deleted 45 executions: <id>, <id>, ... , <id>
...
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:410) - Deleting artifact temp/job-<jobExecutionID>.
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:413) - Deleting <fs>:///<path>/job-<jobExecutionID>
...
Info messages
Artifacts which Housekeeping doesn't delete yet are logged in the following way:
...
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:447) - Not deleting WorkbookData[id=<configID>,status=MARKED_FOR_DELETION], because it is at least referenced in the kept sheet '<sheetName>' of workbook '/<DatameerPath>/<workbookName>.wbk'.
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:447) - Not deleting DataSourceData[id=<ID>,status=MARKED_FOR_DELETION], because it is at least referenced in the kept sheet '<sheetName>' of workbook '/<DatameerPath>/<workbookName>.wbk'.
...
Based on this information, you might review the data retention settings of referenced workbooks and worksheets.
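To find which workbooks to review, the "Not deleting" messages can be collected with a regular expression. The following Python sketch assumes the message wording shown in the example above. The KeptReference type and the find_kept_references function are hypothetical names, and the pattern may need adjusting if your Datameer version words the message differently.

```python
import re
from typing import Iterable, List, NamedTuple


class KeptReference(NamedTuple):
    data_type: str   # e.g. WorkbookData or DataSourceData
    data_id: str
    sheet: str
    workbook: str


# Pattern based on the example log lines on this page.
_NOT_DELETING = re.compile(
    r"Not deleting (?P<type>\w+)\[id=(?P<id>[^,\]]+),status=MARKED_FOR_DELETION\], "
    r"because it is at least referenced in the kept sheet '(?P<sheet>.*?)' "
    r"of workbook '(?P<workbook>.*?)'\.?$"
)


def find_kept_references(lines: Iterable[str]) -> List[KeptReference]:
    """Extract data type, data ID, sheet, and workbook from matching lines."""
    found = []
    for line in lines:
        match = _NOT_DELETING.search(line.strip())
        if match:
            found.append(
                KeptReference(
                    match.group("type"),
                    match.group("id"),
                    match.group("sheet"),
                    match.group("workbook"),
                )
            )
    return found
```

Grouping the results by workbook gives a list of workbooks whose kept sheets are holding data back from deletion.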
Example: Clean up
...
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:358) - Setting data status of (WorkbookData:<dataID>) to DELETED
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:362) - Deleting <fs>:/<path>/workbooks/<configID>/<jobID>
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:358) - Setting data status of (DataSourceData:<dataID>) to DELETED
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:362) - Deleting <fs>:/<path>/importlinks/<configID>/<jobID>
...
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:109) - Next check for physical data to delete will start at offset 109.
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:499) - Deleted data WorkbookData[id=<dataID>,status=DELETED] created by DapJobExecution{id=<jobID>, type=NORMAL, status=COMPLETED}
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:499) - Deleted data DataSourceData[id=<dataID>,status=DELETED] created by DapJobExecution{id=<jobID>, type=NORMAL, status=COMPLETED}
...
Example: Summary after a successful run
...
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:116) - Next check for physical data to delete will start at the beginning again.
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:225) - Housekeep physically deleted data entities: 1.04% (62ms)
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:225) - Housekeep outdated job executions : 49.92% (2986ms)
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:225) - Housekeep obsolete artifacts from cluster : 0.69% (41ms)
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:225) - Housekeep outdated temporary datameer files: 0.02% (1ms)
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:225) - Housekeep outdated temporary HDFS folder and files: 0.20% (12ms)
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:225) - Housekeep physical data : 48.01% (2872ms)
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:228) - Business: 0.2%
[system] INFO [<timestamp>] [HousekeepingService thread-1] (HousekeepingService.java:205) - Housekeeping done. took 5 sec
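The per-task summary lines lend themselves to quick analysis, for example to see which task dominates a run. The Python sketch below extracts the task name, percentage, and duration from lines shaped like the summary example above. The function names are hypothetical, and the pattern assumes the "Housekeep <task>: <percent>% (<n>ms)" layout shown here.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches e.g. "- Housekeep outdated job executions : 49.92% (2986ms)".
_SUMMARY = re.compile(
    r"- Housekeep (?P<task>.+?)\s*: (?P<pct>\d+(?:\.\d+)?)% \((?P<ms>\d+)ms\)"
)


def parse_summary(lines: Iterable[str]) -> Dict[str, Tuple[float, int]]:
    """Map each task name to (share of run time in percent, duration in ms)."""
    stats = {}
    for line in lines:
        match = _SUMMARY.search(line)
        if match:
            stats[match.group("task")] = (
                float(match.group("pct")),
                int(match.group("ms")),
            )
    return stats


def slowest_task(stats: Dict[str, Tuple[float, int]]) -> str:
    """Return the task with the longest duration."""
    return max(stats, key=lambda task: stats[task][1])
```

Applied to the example above, the slowest task is "outdated job executions" at 2986ms, closely followed by "physical data" at 2872ms.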