Hardware Recommendation
To use the JDBC Generic (Hive Server2) connector, Datameer recommends a Hadoop cluster with at least 16 vCPU and 30 GB RAM.
Property Recommendation
The following property changes might help to reduce problems from hardware limitations:
- das.job.map-task.memory=1024
- das.job.reduce-task.memory=1024
- das.job.application-manager.memory=1024
Configuring JDBC Generic (Hive Server2) as a Connection
In order to import from your database using Java Database Connectivity (JDBC), you must first create a connection.
- Click + (plus) button and select Connection or right-click in the browser and select Create new > Connection.
- From drop-down list, select JDBC Generic (Hive Server2) as the connection type. Click Next.
Use the drop down menu to select your implementation for your database.
The only database implementation currently available for this connector is HiveServer2.
The minimum recommended Hive version is 1.2.1 for both HDI and HDP.
Enter the IP address or hostname of the database server and port number.
- Enter any additional URL or connection properties, if needed.
- Select whether credentials should be provided or asked for. Add the required username and password.
The connection currently is only for importing data.
Select the checkbox to enable SSL and enter the TrustStore address and password.
Click Next.Helpful keystore and truststore information
In order to use SSL you need two pieces, a keystore and a truststore.
Keystore
A keystore is located on the server side only. This file contains the private key to decrypt messages. A keystore isn't delivered by the JDK. You have to create your own keystore and import the key you got from your Certificate Authority (CA) or you can import a self generated key into your keystore.
Truststore
A truststore is located on the client side (e.g., Datameer). This file contains:
- The public part of the private key from the keystore backed in a certificate (in case of self signed certificates).
- All other root certificates called CA-certificates. These certificates are used to verify if the private key from the server is valid (SSL Handshake).
You can use the truststore provided by the JDK from Oracle. This file is located under
$JAVA_HOME/jre/lib/security/cacerts
.Hive Server Side
You can turn on SSL if you want to protect the communication between Hive Sever and any other Hive Client. To do that edit your hive-site.xml and add the following lines. (The values are just examples)
<property> <name>hive.server2.use.SSL</name> <value>true</value> <description>enable/disable SSL </description> </property> <property> <name>hive.server2.keystore.path</name> <value>/home/datameer/hive/security/hive.jks</value> <description>path to keystore file</description> </property> <property> <name>hive.server2.keystore.password</name> <value>datameer</value> <description>keystore password</description> </property>
The property
hive.server2.keystore.path
is a java keystore (JKS) which contains the private key to decrypt messages.Hive Client Side
The client is using a truststore to hold certificates which are trustworthy. You can define an own truststore or you can use the default truststore provided by Oracle. This truststore is located under $JAVA_HOME/jre/lib/security/cacerts.
There are two types of a certificate you can import into a truststore, a self signed certificate or a CA signed certificate.
Self signed certificate:
These certificates aren't part of the default truststore provided by Oracle located under
$JAVA_HOME/jre/lib/security/cacerts
. You must import the self signed certificate into your truststore if you want to use self signed certificates for SSL.CA signed certificate:
You can buy a certificate by a Certificate Authority (CA). Oracle trust many different CA's. You can look into this truststore via the command
keytool -list -keystore $JAVA_HOME
/jre/lib/security/cacerts
You don't need to import your CA signed certificate into the JDK truststore, if your CA where you bought your cert is part of the truststore of the JDK.
- If required, add a description and click Save.
Importing Data from HiveServer2 with a JDBC Generic Connector
After configuring a JDBC Generic connection, using the wizard you can set up one or more import jobs (or data links) which access the connection.
- Click + and select Import Job or right-click in the browser and select Create new > Import Job .
- Click Select Connection and choose the name of your generic JDBC connection (here - Generic_JDBC_HiveServer2_Connector) then click Next.
Select to import from a Table, View, or by entering a SQL select statement.
A checkbox is available to select to enable/disable the collection of preview data displayed on the next page.If you selected the box to preview data, a preview of the imported data is displayed.
From the Define Fields page you can change the data field types and if necessary, set up date parse patterns.
The preview includes the columns within the Hive partition but not the partition values. If needed, add the partition values to the import job by marking the included box under the column name.
By default, the column names are "<tablename>_<columnname>". This feature can be turned off inhive.site.xml
with the following property:hive.resultset.use.unique.column.names=false
.
Additional advanced features are available to specify how to handle the data.
Empty value placeholders gives you the ability to define what text converts to empty cells.
Split columns enables parallel loading of a table. The selected column will have its rows segmented into unique subsets. If you don't want to split a column, selected "No Concurrent Import!!"Datameer does not control splitting from data ingested from the Hive JDBC connector. The splitting is controlled via the Hive JDBC.
When splitting columns it is possible to create a deadlock situation on your Hadoop cluster. This can be prevented by properly configuring the Limit # of Mappers to 1 on the Scheduling page under the section Advanced .
Datameer has a default split count of 1. It is recommend to not increase the splitting as it could have a negative performance impact.
Time-based partitions let you partition data by date. This features allows calculations to run on all or only specific parts of the imported data. See Partitioning Data in Datameer for more information.
Datameer jobs are compiled outside of Hive and don't have the same restrictions as Hive queries.
A filter is similar to a where clause. It restricts the results on only include results that match the requested search criteria.
Review the schedule, data retention, and advanced properties for the job.
Add a description, click Save, and name the file.