Sequence Files with Metadata

Overview

Datameer X provides a plug-in (plugin-sequencefile-v6) for importing data stored in a sequence file, provided the following conditions are met:

  • The value writable of the sequence file is org.apache.hadoop.io.Text.
  • The text value is a record with columns delimited by a character.
  • The schema can be determined at run time given the sequence file metadata.

Sequence files (.seq) can contain any data, text or binary. A sequence file is structured as a series of <key,value> records, with markers for block compression or decompression.

If the value contains delimited text and the key is NULL (no metadata), use the CSV file type for import or upload rather than the Sequence File with Metadata type, and set the delimiter that was used to write the file.

Setting Up the Import Job

If your sequence file data satisfies the conditions listed above, you can set up an import job to import the data into Datameer X. In the Import Job Wizard, select Sequence File with Metadata from the File Type menu.

Next, configure the import job by providing the paths to the sequence file data and specifying the delimiter character.
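To illustrate what the delimiter setting governs, the following is a minimal sketch of splitting one Text value into raw column strings. Datameer X performs this split internally, so this is only an illustration under assumptions, not the product's implementation. The class name DelimitedRecordSketch, the method splitRecord, and the sample record are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Datameer X plug-in API.
final class DelimitedRecordSketch {

    // Splits one record on a literal delimiter character.
    // Pattern.quote keeps regex metacharacters such as '|' from being interpreted,
    // and the -1 limit preserves trailing empty columns.
    static String[] splitRecord(String record, char delimiter) {
        return record.split(Pattern.quote(String.valueOf(delimiter)), -1);
    }

    public static void main(String[] args) {
        // Sample record with an empty last column (four columns in total).
        String[] rawFields = splitRecord("42|Ada Lovelace|1815-12-10|", '|');
        System.out.println(rawFields.length + " columns: " + Arrays.toString(rawFields));
    }
}
```

If the delimiter configured in the import job differs from the one used to write the file, a split like this yields a single column per record, which is a common reason a schema does not line up with the data.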

Metadata Schema Strategy

To translate your delimited records, you must provide Datameer X with an implementation of the plug-in interface MetadataSchemaStrategy. This simple interface lets developers use the provided sequence file metadata to translate a record into a Datameer X schema. Once the strategy is implemented and plugged into Datameer X, it is available for selection from the drop-down list and can be used to handle the mapping of all data contained in your sequence files.

public interface MetadataSchemaStrategy extends SingletonExtension {

    public static final String EXTENSION_POINT_ID = "datameer.das.plugin.sequencefile.importjob.MetadataSchemaStrategy";

    /**
     * Creates the Datameer X input schema based on the SequenceFile.Metadata of the file
     * currently being imported.
     *
     * @param metadata metadata pulled from the SequenceFile currently being imported
     * @return the array of Fields that Datameer X should use as a schema for this data
     */
    public Field[] createSchemaFromMetadata(SequenceFile.Metadata metadata);

    /**
     * Extracts the Field value for a given field based on the raw record which has already been split on the incoming
     * delimiter
     *
     * @param field the current Field
     * @param rawFields the array of String's resulting from splitting of the Text value
     * @return the Field's value
     */
    public Object parseFieldValue(Field field, String[] rawFields);

    /**
     * Provides the name of the strategy
     *
     * @return the strategy's name to be used in the drop down list
     */
    public String getName();
}
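The logic a strategy typically needs can be sketched without the Datameer X or Hadoop classes on the classpath. The example below is a self-contained approximation under stated assumptions: SimpleField stands in for Datameer X's Field, a Map<String, String> stands in for the entries of SequenceFile.Metadata, and the "columns" metadata key, its name:TYPE format, and the type labels INTEGER, FLOAT, and STRING are all hypothetical conventions chosen for illustration. A real strategy would implement MetadataSchemaStrategy and use the actual Field type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names belong to the Datameer X API.
final class MetadataSchemaSketch {

    // Stand-in for Datameer X's Field: a column name, a type label, and a column position.
    static final class SimpleField {
        final String name;
        final String type;
        final int index;

        SimpleField(String name, String type, int index) {
            this.name = name;
            this.type = type;
            this.index = index;
        }
    }

    // Mirrors createSchemaFromMetadata. Assumes the writer stored an entry such as
    // columns = "id:INTEGER,name:STRING"; columns without a type default to STRING.
    static SimpleField[] createSchema(Map<String, String> metadata) {
        String columns = metadata.get("columns");
        if (columns == null || columns.isEmpty()) {
            throw new IllegalArgumentException("metadata has no 'columns' entry");
        }
        List<SimpleField> fields = new ArrayList<>();
        for (String column : columns.split(",")) {
            String[] nameAndType = column.split(":", 2);
            String type = nameAndType.length == 2 ? nameAndType[1].trim() : "STRING";
            fields.add(new SimpleField(nameAndType[0].trim(), type, fields.size()));
        }
        return fields.toArray(new SimpleField[0]);
    }

    // Mirrors parseFieldValue: picks the field's column out of the already-split
    // record and converts it according to the field's type label.
    static Object parseFieldValue(SimpleField field, String[] rawFields) {
        if (field.index >= rawFields.length || rawFields[field.index].isEmpty()) {
            return null; // missing or empty column
        }
        String raw = rawFields[field.index];
        switch (field.type) {
            case "INTEGER":
                return Long.valueOf(raw);
            case "FLOAT":
                return Double.valueOf(raw);
            default:
                return raw;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new TreeMap<>();
        metadata.put("columns", "id:INTEGER,name:STRING,price:FLOAT");
        SimpleField[] schema = createSchema(metadata);

        String[] rawFields = "7|Widget|19.99".split("\\|", -1);
        for (SimpleField field : schema) {
            System.out.println(field.name + " = " + parseFieldValue(field, rawFields));
        }
    }
}
```

Keeping the column position on the field lets parseFieldValue stay a simple array lookup, which matches the interface contract that the record has already been split on the delimiter before the strategy sees it.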

Take a look at an example implementation, Native Metadata Schema Strategy.

Sequence File Schema Error Workaround

If importing or uploading a sequence file under the Sequence File with Metadata file type returns an error stating that Datameer X can't parse the input, try importing or uploading the file under the CSV/TSV file type instead.