HDFS

Configuring HDFS as a Connection

To import from HDFS, you must first create a connection.

  1. Click the + (plus) button and select Connection or right-click in the browser and select Create new > Connection.
  2. Click on the drop-down list and choose HDFS as the connection type.
  3. Click Next.
  4. Enter the host address for the HDFS server and the port number.  
  5. Add the root path prefix of the directory from which to import data.
  6. Add any custom properties needed for importing from HDFS.
  7. Add a description and click Save.
  8. Give your HDFS connection a name and click Save.
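
As a rough illustration of how the host, port, and root path prefix fields relate to an HDFS location, the following Python sketch composes them into an hdfs:// URI. How the connector combines these fields internally is not documented here, so treat this as an assumption. The function name hdfs_uri, the host namenode.example.com, the port 8020, and the paths are hypothetical example values.

```python
from typing import Optional


def hdfs_uri(host: str, port: Optional[int], root_prefix: str = "/") -> str:
    """Compose an hdfs:// URI from connection-style fields (illustrative only)."""
    # The port may be omitted, for example when the host is a logical name
    # rather than a physical server address.
    authority = host if port is None else f"{host}:{port}"
    # Normalize the prefix to exactly one leading slash and no trailing slash.
    path = "/" + root_prefix.strip("/")
    return f"hdfs://{authority}{path}"


# Hypothetical example values, not taken from a real cluster.
print(hdfs_uri("namenode.example.com", 8020, "/data/incoming"))
print(hdfs_uri("nnmain", None, "data"))
```

The second call leaves the port empty, which is the shape a logical NameNode name would take.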

HDFS High Availability

HDFS supports high availability (HA), which allows the main metadata server (the NameNode) to be failed over manually to a backup if it fails.

To set up the HDFS HA (High Availability) feature on the connector, follow the setup instructions above with these changes:

  1. Enter a logical NameNode name (the nameservice, nnmain in the example properties in step 2) as the hostname and leave the port field empty.
  2. Add the custom properties:
     dfs.nameservices=nnmain
     dfs.ha.namenodes.nnmain=nn0,nn1
     dfs.namenode.rpc-address.nnmain.nn0=ip-<host-name>:<port>
     dfs.namenode.rpc-address.nnmain.nn1=ip-<host-name>:<port>
     dfs.client.failover.proxy.provider.nnmain=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
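
Because the HA property names depend on each other (the nameservice name and the NameNode IDs recur in later keys), a typo is easy to make. The following Python sketch is one possible way to sanity-check a set of custom properties before pasting them into the connection. It is not part of Datameer or Hadoop. The helper names parse_properties and check_ha_properties and the hosts namenode0.example.com and namenode1.example.com are hypothetical, and port 8020 is only an example value.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        props[key.strip()] = value.strip()
    return props


def split_list(value: str) -> list:
    """Split a comma-separated value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_ha_properties(props: dict) -> list:
    """Return a list of problems; an empty list means the keys are consistent."""
    problems = []
    nameservices = split_list(props.get("dfs.nameservices", ""))
    if not nameservices:
        problems.append("dfs.nameservices is missing or empty")
    for ns in nameservices:
        nn_ids = split_list(props.get(f"dfs.ha.namenodes.{ns}", ""))
        if len(nn_ids) < 2:
            problems.append(f"dfs.ha.namenodes.{ns} should list at least two NameNodes")
        for nn in nn_ids:
            key = f"dfs.namenode.rpc-address.{ns}.{nn}"
            if ":" not in props.get(key, ""):
                problems.append(f"{key} is missing or lacks a host:port value")
        if f"dfs.client.failover.proxy.provider.{ns}" not in props:
            problems.append(f"dfs.client.failover.proxy.provider.{ns} is missing")
    return problems


# Hypothetical hosts and port, filled in only to make the example runnable.
EXAMPLE = """\
dfs.nameservices=nnmain
dfs.ha.namenodes.nnmain=nn0,nn1
dfs.namenode.rpc-address.nnmain.nn0=namenode0.example.com:8020
dfs.namenode.rpc-address.nnmain.nn1=namenode1.example.com:8020
dfs.client.failover.proxy.provider.nnmain=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
"""

for problem in check_ha_properties(parse_properties(EXAMPLE)):
    print(problem)
```

The check only verifies that the keys line up with each other. It does not contact the cluster or confirm that the addresses are reachable.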

Importing Data with the HDFS Connector

After configuring an HDFS connection with the wizard, you can set up one or more import jobs that access the HDFS connection.

  1. Click the + (plus) button and select Import Job or right-click in the browser and select Create new > Import job.
  2. Click Select Connection and choose the name of your HDFS connection (here, HDFS_Connector).
  3. Enter the path for a file or folder on the HDFS.

    As of Datameer X 7.4

    Datameer X has added the Remote Data Browser feature. This file browser gives you a visual interface for selecting the file to import from your HDFS.

    Use the filter box at the top of the Remote Data Browser to find folders and files within the current directory.

  4. A preview of the imported data is displayed. Review the schema.
  5. Review the schedule, data retention, and advanced properties for the job.
  6. Add a description, click Save, and name the file.