When you import data, you import it into a connection, which is a collection of data from different sources such as various types of files and databases (see Configuring a Connection to learn more). Set up the connection first, and then create a workbook that uses data from multiple sources (for example, a user list from a database and an Apache error log file). See Types of Data Supported for information about the types of databases and files that you can import into Datameer.

Once the connection is set up, you create import jobs to import the data you want to use. You can also edit, rename, create a copy, run, view the full data, view the details and information, or delete an existing import job.

See Define File Path Range to learn how to use a date range to limit which files get imported.

Table of Contents

About Import Jobs

You can import each type of data from a connection and then create a workbook that uses data from multiple sources, for example, a user list from a database and an Apache error log file.

See Types of Data Supported for information about the types of files that you can import into Datameer.

Note

Obfuscation is available with Datameer's Advanced Governance module.

Creating an Import Job

To create an import job:

  1. Click the + (plus) button and select Import Job or right-click in the browser and select Create new > Import job.
  2. Click Select Connections, choose the connection and click Select, then select the file type and click Next. Click New Connection to add a new connection if needed.
  3. Specify the file and folder location and click Next. You can use wildcard characters.
  4. The fields on the Data Details page depend on the type of the file; however, several fields are common to all types. In the Encryption section, enter all columns to be obfuscated, with a space between the names. Note that when a column is obfuscated, that data is never pulled into Datameer.
    See the following sections for additional details about importing each of the file types:  
    1. Apache log: specify the file or folder and the log format. See the samples provided in the dialog box for details.
    2. CSV/TSV files:
      • Specify the delimiter, such as "\t" for tab, comma ",", or semicolon ";".
      • Specify if column headers should be made from the first non-ignored row.
      • Specify the escape character in Advanced Settings. The character that follows the escape character is not processed and is shown as is.
      • Specify the quote character in Advanced Settings. If Enable strict quoting is selected, characters outside the quotes are ignored.
    3. Fixed width: specify the file or folder, whether any of the first lines in the data should be skipped, and whether column headers should be made from the first non-ignored row.
    4. JSON: specify the file or folder and other parameters about what to parse within the JSON structure.
    5. Mbox: specify the file or folder. This is a format used for collections of electronic mail messages.
    6. Parquet: columnar storage format available in Hadoop.
    7. Regex Parsable Text files: specify the file or folder, a Regex pattern for processing the data (see note below), whether any of the first lines in the data should be skipped, and whether column headers should be made from the first non-ignored row.
    8. Twitter data: specify the file or folder.
    9. XML data: specify the file or folder, the root element, container element, and XPath expressions for the fields you would

      Note: Custom Headers

       For all types, you can:

      • Specify that the column headers are in a separate file by selecting Provide custom schema. Indicate where the schema is located and the delimiter characters. The first row of that file is assumed to be column headers. 
      • Specify a number of lines to be ignored on import from the source data, including the header file, using Ignore first n lines under Advanced.
  5. View a sample of the data set to confirm this is the data source you want to use. Use the checkboxes to select which fields to import into Datameer. The accept empty checkbox allows you to specify if NULL and empty values are used or dropped upon import. Verify or select a data type for each column from the drop-down menu. You can specify the format for date type fields. Click the question mark icon in the data format field to see a complete list of supported date pattern formats.

    The Raw Records section shows how your data is viewed by Datameer before the import.

    The Empty value placeholders section lets you assign specific values as NULL. Values added here aren't imported into Datameer.

    The How to handle invalid data? section lets you decide how to proceed if part of a record doesn't fit with the defined schema during import.
    Selecting the option to drop the record removes the entire record from the import job. The option to abort the job stops the import job when an invalid record is detected.

    You can partition your data using date parameters. When this data is loaded into a workbook, you can choose to run your calculations on all or just a part of your data. Also, if you decide to export data, you can choose to export all or just a part of your data. Learn more about time-based partitions.

    Click Next.
  6. Define the schedule details.

    In the Loading section, select Manually to update the data only when you rerun the import job yourself, or On a schedule to run the update at a specified time.
    In the Data Retention Policy section, choose whether to replace the existing data with the updated data or to append the updated data to the existing data when updating an import job. You also have the option to choose Append with sliding time window to define when the data expires and how many results to keep.

    Note: Adding or deleting columns from an appended import job

    Appended import jobs can contain more or fewer columns than the previously run import job. When the schema is rescanned, Datameer notices that columns have changed, so no data is lost.

    However, data can be lost due to append changes in these circumstances:

    - If a column has been renamed, the schema uses the column with the new name and deletes data from the old column name.

    - If the data type of a column has been changed (for example, the column was an integer and was relabeled as a string).

    - If a change in the partition schema is made, the data resets starting from the new change.

    Add a description, name the file, click the checkbox to start the import immediately if desired, and click Save. You can also specify notification emails to be sent for any error messages received and when a job has completed successfully. Use a comma to separate multiple email addresses. The maximum character count in these fields is 255 characters.

    Note: Sample Regex pattern for importing data: (\S+) (\S+) (\S+) (\S+) (\S+). See Importing with Regular Expressions to learn more.
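To illustrate how the sample Regex pattern in the note above splits a record into columns, here is a minimal sketch. It uses Python's re module rather than Datameer's own regex engine, so details may differ. The input line and the function name parse_record are made up for illustration, and requiring the whole line to match is an assumption. Each capture group corresponds to one column.

```python
import re

# Sample pattern from the note above: five fields separated by single spaces.
PATTERN = re.compile(r"(\S+) (\S+) (\S+) (\S+) (\S+)")

def parse_record(line):
    """Return the five captured fields as a list, or None if the line does not match.

    Assumption: the entire line must match the pattern (fullmatch).
    """
    match = PATTERN.fullmatch(line.strip())
    return list(match.groups()) if match else None

# Hypothetical input line, not taken from any real data source.
sample = "2024-01-15 10:32:07 alice GET /index.html"
print(parse_record(sample))            # one list entry per capture group
print(parse_record("too few fields"))  # a line that does not fit the pattern
```

A line that does not match the pattern yields no columns, which is the kind of record the invalid data settings are meant to handle.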

...

  • Integer columns can be imported as date by interpreting the integer value as a UNIX (epoch) timestamp.
  • Date columns can be converted to integer; the converted columns are shown as epoch timestamps.
  • Strings can be converted to Boolean, where "false", "no", "f", "n" and "0" are converted to false and "true", "yes", "t", "y" and "1" are converted to true.
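The conversions listed above can be sketched in Python as follows. This is only an illustration of the rules, not Datameer's implementation. The function names are hypothetical, and three details are assumptions: the timestamp unit is seconds, dates are in UTC, and string matching is case-insensitive with unrecognized strings returning None.

```python
from datetime import datetime, timezone

# String values from the list above.
FALSE_STRINGS = {"false", "no", "f", "n", "0"}
TRUE_STRINGS = {"true", "yes", "t", "y", "1"}

def int_to_date(seconds):
    """Interpret an integer as an epoch timestamp (assumed: seconds, UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def date_to_int(value):
    """Convert a timezone-aware date back to an epoch timestamp in seconds."""
    return int(value.timestamp())

def string_to_bool(text):
    """Map the listed strings to True or False; return None for anything else (assumption)."""
    lowered = text.strip().lower()
    if lowered in TRUE_STRINGS:
        return True
    if lowered in FALSE_STRINGS:
        return False
    return None

print(int_to_date(86400))               # one day after the epoch
print(date_to_int(int_to_date(86400)))  # round trip back to the integer
print(string_to_bool("yes"), string_to_bool("0"))
```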

Raw Records

Click Raw Records to view an expanded sample of the raw data not in tabular format. Click Raw Records again to hide the raw data.

How to View and Edit the Job Settings

Some of these settings can also be accessed through the Save Workbook settings. The Save Workbook settings let you specify when jobs are run, how errors are handled, who gets notified, what data is saved with the workbook, and how much historical data (if any) is saved. See Configuring Workbook Settings to learn details about each of the settings.

To view and edit the job settings through the Import Data view:

  1. Click the File Browser tab.
  2. Click Import Jobs in the navigation window on the left side of the screen.
  3. Highlight the name of the data source you want to view.
  4. Click Configure. (As of v6.3, click Open.)
  5. Click Next to view each type of job setting. You can also make changes.
  6. The Schedule screen has settings that can also be set through the Save Workbook settings.
  7. Specify whether to replace or append data and whether to append using a sliding time window. You can then specify when the data should expire and how many results you should keep.
  8. Select which groups have view, edit, and run access permissions and specify what access permissions all users have.
  9. Click Save to save your changes. You can also click Rename to rename the import.
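The idea behind appending with a sliding time window (results expire after a set time and only a limited number are kept) can be sketched as follows. This is a conceptual illustration only, not Datameer's implementation. The function apply_sliding_window, its parameters, and the run labels are all hypothetical, and applying the age limit before the count limit is an assumption.

```python
from datetime import datetime, timedelta

def apply_sliding_window(runs, now, max_age, max_results):
    """Return the runs that survive the window, oldest first.

    runs: list of (run_time, label) tuples.
    Runs older than max_age are dropped first (assumed order),
    then only the most recent max_results runs are kept.
    """
    fresh = [run for run in runs if now - run[0] <= max_age]
    fresh.sort(key=lambda run: run[0])
    # Guard against max_results == 0, because fresh[-0:] would return everything.
    return fresh[-max_results:] if max_results > 0 else []

# Hypothetical history: one run per day from January 1 to January 9.
now = datetime(2024, 1, 10)
runs = [(datetime(2024, 1, day), f"run-{day}") for day in range(1, 10)]

kept = apply_sliding_window(runs, now, max_age=timedelta(days=5), max_results=3)
print([label for _, label in kept])
```

With a replace policy, by contrast, only the newest result would remain after each run.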

To view and edit the job settings through the workbook view:

  1. Click the File Browser tab at the top of the page.
  2. Click the name of the workbook you want to view.
  3. Click Configure. (As of v6.3, click Open.)

...

  1. Double-click the import job name.
  2. Click the recent listing ID in the History list.
  3. On the Import job run details page, view the Errors list to find out what happened.

The job details page showing Job History.

...

Use the links in the Job log file section to download the log file, download the job trace, or to report an issue to Datameer. When you click Report an issue, fill out the bug report and provide steps to recreate the error.

Scheduling Job Runtime

You have a great deal of flexibility in choosing when jobs are run. You can choose to run them manually or at an interval you specify.

  1. Create a new import job or right-click and select Open to open the import job configuration wizard.
  2. Under the Calculation settings, select Manually or Scheduled.
  3. When Scheduled is selected, specify a time for the import job to run.
  4. Click Save to save your changes.

The schedule of an import job can also be viewed or edited from the inspector in the file browser.

Schedules created with non-complex cron patterns are converted automatically in the inspector. Select Schedule Type and Custom from the drop-down menu to view or edit the schedule cron pattern.

...

To edit an import job:

  1. Click the File Browser tab.
  2. Click Import Jobs from the navigation bar on the left side of the screen.
  3. Right-click the import job you want to edit and click Configure. (As of v6.3, click Open.)
  4. Make your changes and click Next to move through the screens.
  5. Click Save when you are finished. Note that after a schema change, saving the import job deletes data from previous job runs and aborts the job currently running. To keep previous data, use Save Copy As.

If just the partitioning has changed, the Save & Migrate button simply performs a migration rather than refetching data from the source.

Create a Copy of an Import Job

...

  • Already imported data isn't copied.
  • The metadata and job definition are copied, including the information and settings on the Schedule screen.
  • To avoid unnecessary data volume consumption, Datameer recommends reviewing the configuration after copying an import job.

To create a copy of an existing import job:

  1. Click the File Browser tab.
  2. Click Import Jobs from the navigation bar on the left side of the screen.
  3. Right-click the import job you want to copy.
  4. Click Duplicate.

The copy is created and is named "copy of" plus the name of the original import job.

...

To run an import job manually:

  1. Click the File Browser tab.
  2. Click Import Jobs from the navigation bar on the left side of the screen.
  3. Right-click the import job you want to run and click Run.

Depending on the amount of data, this might take some time.

Delete an Import Job

Note that this deletes the import job, not the original data. The imported data within HDFS is also deleted; it is removed by the Housekeeping service. Deleting an import job can't be undone. If a workbook is referencing the imported data, the workbook might not work.

To delete an import job:

  1. Click the File Browser tab.
  2. Click Import Jobs from the navigation bar on the left side of the screen.
  3. Right-click the import job you want to delete and click Delete.
  4. Click OK, then confirm the deletion.

...

To link data to a new workbook:

  1. Click the File Browser tab.
  2. Click Import Jobs from the navigation bar on the left side of the screen.
  3. Double-click the import job name to link into a new workbook.
  4. Click Link Data in New Workbook.
  5. The data is loaded into a new workbook.

...

To view the processed bytes for a single job execution and the totals for that import job configuration:

  1. Click the File Browser tab.
  2. Click Import Jobs from the navigation bar on the left side of the screen.
  3. The size of the last job run is displayed first and the total for that job configuration is displayed to the right in parentheses.

...

  1. Complete the import job configuration until reaching the Save section.
  2. Review the note box detailing the changes this configuration save has on the previous save and which workbooks are affected.
  3. Check the box to email users of the workbooks that are affected by the new schema changes.

...