Apache Hadoop Hive parameters

The following tables describe parameters available on the tabs of an Apache Hadoop Hive On-Premise Data Source dialog:
*General tab
*Security tab
*OData tab
*Advanced tab

General tab

Required fields are marked with an asterisk.
Table 6. General tab connection parameters for Apache Hadoop Hive
Field
Description
Data Source Name
A unique name for the data source. Data source names can contain only alphanumeric characters, underscores, and dashes.
Description
A general description of the data source.
User ID
The User ID for the Apache Hive account used to establish the connection to the Apache Hive server.
Password
A password for the Apache Hive account that is used to establish the connection to your Apache Hive server.
Note: By default, the password is encrypted.
By default, the characters you type in the Password field are not shown. To display the password in clear text, click the eye icon. Click the icon again to conceal the password.
Server Name
Specifies either the server name (if your network supports named servers) or the IP address of the primary Apache Hive server machine, for example, MyHiveServer or 122.23.15.12.
Port Number
The port number of the Apache Hive server to connect to.
Database
The name of the database that is running on the database server.
Connector ID
The unique identifier of the On-Premise Connector that is to be used to access the on-premise data source. Select the Connector that you want to use from the dropdown. The identifier can be a descriptive name, the name of the machine where the Connector is installed, or the Connector ID for the Connector.
If you have not installed an On-Premise Connector, and no Connectors have been shared with you, this field and drop-down list are empty.
If you own multiple Connectors that have the same name, for example, Production, an identifier is appended to each Connector, for example, Production_dup0 and Production_dup1. If the Connectors in the dropdown were shared with you, the owner's name is appended, for example, Production(owner1) and Production(owner2).

Security tab

Table 7. Security tab connection parameters for Apache Hadoop Hive
Field
Description
Encryption Method
Determines whether data is encrypted and decrypted when transmitted over the network between the Hybrid Data Pipeline connectivity service and the database server.
Valid Values:
noEncryption | SSL
If set to noEncryption, data is not encrypted or decrypted.
If set to SSL, data is encrypted using SSL. If the database server does not support SSL, the connection fails and the Hybrid Data Pipeline connectivity service throws an exception.
*Connection hangs can occur when the Hybrid Data Pipeline connectivity service is configured for SSL and the database server does not support SSL. You may want to set a login timeout using the Login Timeout parameter to avoid problems when connecting to a server that does not support SSL.
*When SSL is enabled, the following parameters also apply:
Host Name In Certificate
Validate Server Certificate
Crypto Protocol Version
Default: noEncryption
Crypto Protocol Version
Specifies a protocol version or a comma-separated list of the protocol versions that can be used in creating an SSL connection to the data source. If the database server supports none of the specified protocol versions, the connection fails and the connectivity service returns an error.
Valid Values:
cryptographic_protocol [[, cryptographic_protocol ]...]
where:
cryptographic_protocol
is one of the following cryptographic protocols:
TLSv1 | TLSv1.1 | TLSv1.2
The client must send the highest version that it supports in the client hello.
Note: Good security practices recommend using TLSv1.2 if your data source supports that protocol version, due to known vulnerabilities in the earlier protocols.
Example
Your security environment specifies that you can use TLSv1.1 and TLSv1.2. When you enter the following values, the connectivity service sends TLSv1.2 to the server first.
TLSv1.1,TLSv1.2
Default: TLSv1, TLSv1.1, TLSv1.2
Host Name In Certificate
Specifies a host name for certificate validation when SSL encryption is enabled (Encryption Method=SSL) and validation is enabled (Validate Server Certificate=ON). This optional parameter provides additional security against man-in-the-middle (MITM) attacks by ensuring that the server that the Hybrid Data Pipeline connectivity service is connecting to is the server that was requested.
Valid Values:
host_name | #SERVERNAME#
where host_name is a valid host name.
If host_name is specified, the Hybrid Data Pipeline connectivity service compares the specified host name to the DNSName value of the SubjectAlternativeName in the certificate. If a DNSName value does not exist in the SubjectAlternativeName or if the certificate does not have a SubjectAlternativeName, the connectivity service compares the host name with the Common Name (CN) part of the certificate’s Subject name. If the values do not match, the connection fails and the connectivity service throws an exception.
If #SERVERNAME# is specified, the Hybrid Data Pipeline connectivity service compares the server name that is specified in the connection URL or data source of the connection to the DNSName value of the SubjectAlternativeName in the certificate. If a DNSName value does not exist in the SubjectAlternativeName or if the certificate does not have a SubjectAlternativeName, the connectivity service compares the host name to the CN part of the certificate’s Subject name. If the values do not match, the connection fails and the connectivity service throws an exception. If multiple CN parts are present, the connectivity service validates the host name against each CN part. If any one validation succeeds, a connection is established.
Default: Empty string
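The following minimal sketch illustrates the matching rule described above: compare the host name to any DNSName value in the SubjectAlternativeName, and fall back to the CN part(s) of the Subject name only when no DNSName is present. It is illustrative only and simplifies DN parsing; it is not the connectivity service's implementation.

import java.security.cert.CertificateParsingException;
import java.security.cert.X509Certificate;
import java.util.Collection;
import java.util.List;

public class HostNameValidation {
    // Illustrative only: simplified version of the matching rule above.
    static boolean hostNameMatches(String hostName, X509Certificate cert)
            throws CertificateParsingException {
        Collection<List<?>> sans = cert.getSubjectAlternativeNames();
        boolean sawDnsName = false;
        if (sans != null) {
            for (List<?> entry : sans) {
                if ((Integer) entry.get(0) == 2) { // type 2 = DNSName (RFC 5280)
                    sawDnsName = true;
                    if (hostName.equalsIgnoreCase((String) entry.get(1))) {
                        return true; // a DNSName matched
                    }
                }
            }
        }
        if (sawDnsName) {
            return false; // DNSName values exist but none matched: fail
        }
        // No DNSName in the SubjectAlternativeName: fall back to the CN
        // part(s) of the Subject name. (Naive DN parsing; real parsers
        // handle escaped commas.)
        for (String rdn : cert.getSubjectX500Principal().getName().split(",")) {
            String part = rdn.trim();
            if (part.regionMatches(true, 0, "CN=", 0, 3)
                    && hostName.equalsIgnoreCase(part.substring(3))) {
                return true;
            }
        }
        return false;
    }
}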
Validate Server Certificate
Determines whether the Hybrid Data Pipeline connectivity service validates the certificate that is sent by the database server when SSL encryption is enabled (Encryption Method=SSL). When using SSL server authentication, any certificate that is sent by the server must be issued by a trusted Certificate Authority (CA). Allowing the connectivity service to trust any certificate that is returned from the server even if the issuer is not a trusted CA is useful in test environments because it eliminates the need to specify truststore information on each client in the test environment.
Valid Values:
ON | OFF
If ON is selected, the Hybrid Data Pipeline connectivity service validates the certificate that is sent by the database server. Any certificate from the server must be issued by a trusted CA in the truststore file.
If OFF is selected, the Hybrid Data Pipeline connectivity service does not validate the certificate that is sent by the database server. The connectivity service ignores any truststore information that is specified by the Java system properties.
Default: ON
ImpersonateUser
Specifies the user ID used for Impersonation. When Impersonation is enabled on the server (hive.server2.enable.doAs=true), this value determines your identity and access rights to Hadoop resources when executing queries. If Impersonation is disabled, you will execute queries as the user who initiated the HiveServer2 process.

OData tab

The following table describes the controls on the OData tab. For information on using the Configure Schema editor, see Configuring data sources for OData connectivity and working with data source groups. For information on formulating OData requests, see Formulating queries with OData Version 2.
Required fields are marked with an asterisk.
Table 8. OData tab connection parameters for Apache Hadoop Hive
Field
Description
OData Version
Enables you to choose from the supported OData versions. An OData configuration made with one OData version will not work if you switch to a different version. To maintain the data source with different OData versions, you must create a separate data source for each version.
OData Access URI
Specifies the base URI for the OData feed to access your data source, for example, https://hybridpipe.operations.com/api/odata/<DataSourceName>. You can copy the URI and paste it into your application's OData configuration.
The URI contains the case-insensitive name of the data source to connect to, and the query that you want to execute. This URI is the OData Service Root URI for the OData feed. The Service Document for the data source is returned by issuing a GET request to the data source's service root.
The OData Service Document returns the names of the entities exposed by the data source's OData service. To get details such as the properties of the exposed entities, the data types of those properties, and the relationships between entities, fetch the Service Metadata Document by appending /$metadata to the service root URI.
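As a minimal sketch (assuming Java 11's java.net.http client, a hypothetical service root, a data source named MyHiveSource, and basic authentication), the Service Document and Service Metadata Document can be fetched with plain GET requests:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ODataDiscovery {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical service root; substitute your own host and data source name.
        String serviceRoot = "https://hybridpipe.operations.com/api/odata/MyHiveSource";
        String credentials = Base64.getEncoder()
                .encodeToString("myUser:myPassword".getBytes());
        HttpClient client = HttpClient.newHttpClient();

        // A GET on the service root returns the Service Document (entity names).
        HttpRequest serviceDoc = HttpRequest.newBuilder()
                .uri(URI.create(serviceRoot))
                .header("Authorization", "Basic " + credentials)
                .build();
        System.out.println(client.send(serviceDoc, HttpResponse.BodyHandlers.ofString()).body());

        // Appending /$metadata returns the Service Metadata Document
        // (entity properties, their data types, and relationships).
        HttpRequest metadata = HttpRequest.newBuilder()
                .uri(URI.create(serviceRoot + "/$metadata"))
                .header("Authorization", "Basic " + credentials)
                .build();
        System.out.println(client.send(metadata, HttpResponse.BodyHandlers.ofString()).body());
    }
}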
Schema Map
Enables OData support. If a schema map is not defined, the OData API cannot be used to access the data store using this data source definition. Use the Configure Schema editor to select the tables/columns to expose through OData.
Page Size
Determines the number of entities returned on each page for paging controlled on the server side. On the client side, requests can use the $top and $skip parameters to control paging. In most cases, server-side paging works well for large data sets. Client-side paging works best with smaller data sets, where fetching subsequent pages is less expensive. (A request sketch combining these query options follows the Top Mode entry below.)
Valid Values: 0 | n
where n is an integer from 1 to 10000.
When set to 0, the server default of 2000 is used.
Default: 0
Refresh Result
Controls what happens when you fetch the first page of a cached result when using client-side paging ($skip omitted or set to 0). You can use the cached copy of that first page, or you can re-execute the query to get a new result, discarding the previously cached result. Re-executing the query is useful when the data being fetched may change between two requests for the first page. Using the cached result is useful if you are paging back and forth through results that are not expected to change.
Valid Values:
When set to 0, the OData service caches the first page of results.
When set to 1, the OData service re-executes the query.
Default: 1
Inline Count Mode
Specifies how the connectivity service satisfies requests that include the $count parameter when it is set to true (for OData version 4) or the $inlinecount parameter when it is set to allpages (for OData version 2). These requests require the connectivity service to include the total number of entities that are defined by the OData query request. The count must be included in the first page in server-driven paging and must be included in every page when using client-driven paging.
The optimal setting depends on the data store and the size of results. The OData service can run a separate query using the count(*) aggregate to get the count, before running the query used to generate the entities. In very large results, this approach can often lead to the first page being returned faster. Alternatively, the OData service can fetch the entire result before returning the first page. This approach works well for small results and for data stores that cannot optimize the count(*) aggregate; however, it may have a longer initial response time for the first page if the result is large.
Valid Values:
When set to 1, the connectivity service runs a separate count(*) aggregate query to get the count of entities before executing the query to return results. In very large results, this approach can often lead to the first page being returned faster.
When set to 2, the connectivity service fetches all entities before returning the first page. For small results, this approach is always faster. However, the initial response time for the first page may be longer if the result is large.
Default: 1
Top Mode
Indicates how requests typically use $top and $skip for client side pagination, allowing the service to better anticipate how to process queries.
Valid Values:
Set to 0 when the application generally uses $top to limit the size of the result and rarely attempts to get additional entities by combining $top and $skip.
Set to 1 when the application uses $top as part of client-driven paging and generally combines $top and $skip to page through the result.
Default: 0
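The sketch below (hypothetical service root and entity set, Java 11 HTTP client, authentication headers omitted for brevity) shows the client-driven paging pattern that Top Mode=1 anticipates, requesting the entity count along with the first page:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ODataPaging {
    public static void main(String[] args) throws Exception {
        // Hypothetical service root and entity set name.
        String serviceRoot = "https://hybridpipe.operations.com/api/odata/MyHiveSource";
        int pageSize = 100;
        HttpClient client = HttpClient.newHttpClient();

        // Client-driven paging: walk the result with $top/$skip. The first
        // request adds $count=true (OData v4; use $inlinecount=allpages for
        // v2) so the total entity count arrives with the first page. In
        // practice you would stop when a page returns fewer than pageSize
        // entities; this sketch fetches a fixed three pages.
        for (int skip = 0; skip < 300; skip += pageSize) {
            String query = String.format("%s/Emps?$top=%d&$skip=%d%s",
                    serviceRoot, pageSize, skip,
                    skip == 0 ? "&$count=true" : "");
            HttpRequest page = HttpRequest.newBuilder().uri(URI.create(query)).build();
            HttpResponse<String> resp =
                    client.send(page, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + query);
        }
    }
}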
OData Read Only
Controls whether write operations can be performed on the OData service. Write operations generate a 405 Method Not Allowed response if this option is enabled.
Valid Values:
ON | OFF
When ON is selected, OData access is restricted to read-only mode.
When OFF is selected, write operations can be performed on the OData service.
Default: OFF
String Max Length
Controls the maximum length reported for Apache Hive String columns. Values larger than the specified value cause the String columns to be excluded from the model. Values smaller than the specified value may cause issues with some OData applications as data may be returned that exceeds the maximum length. The default value is 32768.

Advanced tab

Table 9. Advanced tab connection parameters for Apache Hadoop Hive
Field
Description
Array Fetch Size
Specifies the number of fields the data access service uses to calculate the maximum number of rows for a fetch. When executing a fetch, the service divides the Array Fetch Size value by the number of columns in a particular table to determine the number of rows to retrieve. By determining the fetch size based on the number of fields, out of memory errors may be avoided when fetching from tables containing a large number of columns while continuing to provide improved performance when fetching from tables containing a small number of columns.
Valid values:
-x | x
where:
-x is a negative integer
x is a positive integer.
If set to -x, the service overrides any settings on the statement level and uses the number of fields specified by the absolute value of -x to calculate the number of rows to retrieve.
If set to x, the service uses the number of fields specified by the value of x to calculate the number of rows to retrieve. However, the service will not override settings, such as setFetchSize(), on the statement level.
For example, if this property is set to 20000 fields and you are querying a table with 19 columns, the service divides the number of fields by the number of columns to calculate the number of rows to retrieve. In this case, approximately 1053 rows would be retrieved for each fetch.
Note: You can improve performance by increasing the value specified for this parameter. However, if the number of fields specified exceeds the available buffer memory on the server, an out of memory error will be returned. If you receive this error, decrease the value specified until fetches are successfully executed.
Default: 20000 (fields)
Array Insert Size
Specifies the number of fields the data access service uses to calculate the maximum number of rows sent in a packet when executing a multi-row insert. When executing a multi-row insert, the service divides the Array Insert Size value by the number of columns in a particular insert statement to determine the number of rows to send in a packet. By determining the packet size based on the number of fields, the service can avoid out of memory errors when executing inserts containing a large number of columns while continuing to provide improved performance when executing inserts containing a small number of columns. The default value is 20,000 fields.
In most scenarios, the default setting for Array Insert Size provides the ideal behavior; however, you may need to reduce the value specified if you encounter either of the following:
*Performance or memory issues when inserting a large number of rows that contain large values.
*The following error when inserting a large number of rows when using Apache Knox: HTTP/1.1 500 Server Error.
Default: 20000 (fields)
Batch Mechanism
Determines the mechanism that is used to execute batch operations.
Valid values:
nativeBatch | multiRowInsert
If set to nativeBatch, the Hive native batch mechanism is used to execute batch operations, and an insert statement is executed for each row contained in a parameter array.
If set to multiRowInsert, the service attempts to execute a single insert statement for all the rows contained in a parameter array. If the size of the insert statement exceeds the available buffer memory of the server, the service executes multiple statements. This behavior provides substantial performance gains for batch inserts.
Default: multiRowInsert
*Multirow inserts can only be performed on Insert statements that use parameterized arrays.
*Batch operations for parameterized arrays are not supported for updates or deletes.
*The service modifies the HQL statement to perform a multirow insert.
*This connection property can affect performance.
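As an illustrative sketch (with a placeholder JDBC URL and a hypothetical emps table), a parameterized batch insert of the kind that multiRowInsert accelerates looks like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class HiveBatchInsert {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL; multirow inserts apply only to
        // parameterized INSERT statements executed as a batch.
        try (Connection con = DriverManager.getConnection(
                "jdbc:<your-hybrid-data-pipeline-url>", "myUser", "myPassword");
             PreparedStatement ps = con.prepareStatement(
                "INSERT INTO emps (id, name) VALUES (?, ?)")) {
            for (int i = 1; i <= 1000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "emp" + i);
                ps.addBatch();   // rows accumulate in a parameter array
            }
            // With Batch Mechanism set to multiRowInsert, the service
            // rewrites the batch as one (or a few) multirow INSERT
            // statements instead of one statement per row.
            ps.executeBatch();
        }
    }
}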
Catalog Mode
Specifies whether the service uses native catalog functions to retrieve information returned by DatabaseMetaData functions.
Valid values:
mixed | native | query
If set to mixed, the service uses a combination of native catalog functions and discovered information to retrieve catalog information. Select this option for the optimal balance of performance and accuracy.
If set to native, the service uses native catalog functions to retrieve information returned by DatabaseMetaData functions. This setting provides the best performance, but at the expense of less-accurate catalog information.
If set to query, the service uses discovered information to retrieve catalog information. This option provides highly accurate catalog information, but at the expense of slower performance.
Default: mixed
Initialization String
A semicolon-delimited set of commands to be executed on the data store after Hybrid Data Pipeline has established and performed all initialization for the connection. If the execution of a SQL command fails, the connection attempt also fails and Hybrid Data Pipeline returns an error indicating which SQL commands failed.
Syntax:
command[[; command]...]
Where:
command
is a SQL command. Multiple commands must be separated by semicolons. In addition, if this property is specified in a connection URL, the entire value must be enclosed in parentheses when multiple commands are specified. For example, assuming a schema name of SFORCE:
InitializationString=(REFRESH SCHEMA SFORCE)
Login Timeout
The amount of time, in seconds, that the Hybrid Data Pipeline connectivity service waits for a connection to be established before timing out the connection request.
Valid Values:
0 | x
where x is a positive integer that represents a number of seconds.
If set to 0, the connectivity service does not time out a connection request.
If set to x, the connectivity service waits for the specified number of seconds before returning control to the application and throwing a timeout exception.
Default: 30
Max Pooled Statements
The maximum number of prepared statements to cache for this connection. If the value of this property is set to 20, the connectivity service caches the last 20 prepared statements that are created by the application.
Query Timeout
The number of seconds for the default query timeout for all statements that are created by a connection.
Valid Values:
-1 | 0 | x
If set to -1, the query timeout functionality is disabled. The Hybrid Data Pipeline connectivity service silently ignores calls to the Statement.setQueryTimeout() method.
If set to 0, the default query timeout is infinite (the query does not time out).
If set to x, the Hybrid Data Pipeline connectivity service uses the value as the default timeout for any statement that is created by the connection. To override the default timeout value set by this connection parameter, call the Statement.setQueryTimeout() method to set a timeout value for a particular statement.
Default: 0
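A brief JDBC sketch (with a hypothetical emps table) showing how Statement.setQueryTimeout() overrides the connection-level default for a single statement:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryTimeoutExample {
    // Assumes con is an open Connection to the Hive data source.
    static long countEmps(Connection con) throws SQLException {
        try (Statement stmt = con.createStatement()) {
            // Overrides the Query Timeout parameter for this statement only.
            // If Query Timeout is -1, the service silently ignores this call.
            stmt.setQueryTimeout(60); // seconds
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM emps")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}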
Transport Mode
Specifies whether binary (TCP) mode or HTTP mode is used to access Apache Hive data sources.
Valid values:
binary | http
If set to binary, Thrift RPC requests are sent directly to data sources using a binary connection (TCP mode).
If set to http, Thrift RPC requests are sent using HTTP transport (HTTP mode). HTTP mode is typically used when connecting to a proxy server, such as a gateway, for improved security, or a load balancer.
Default: binary
*The setting of this parameter corresponds to that of the hive.server2.transport.mode property in your hive-site.xml file.
*When Transport Mode is set to http, the HTTP/HTTPS end point for the Hive server must be specified using the HTTP Path parameter.
*To use HTTPS end points, set Transport Mode to http and Encryption Method to SSL.
*Apache Hive currently supports using only one protocol mode per server at a time.
HTTP Path
Specifies the path of the HTTP/HTTPS endpoint used for connections when HTTP mode is enabled (Transport Mode set to http).
Valid values:
string
where:
string
is the path of the URL endpoint. By default, the value specified must be an HTTP end point. To support HTTPS values, enable SSL by setting Encryption Method to SSL.
Enable Cookie Authentication
Determines whether the service attempts to use cookie based authentication for requests to an HTTP endpoint after the initial authentication to the server. Cookie based authentication improves response time by eliminating the need to re-authenticate with the server for each request.
Valid values:
ON | OFF
If set to ON, the service attempts to use cookie-based authentication for requests to an HTTP endpoint after the initial authentication to the server. The cookie used for authentication is specified by the Cookie Name parameter. If the name does not match, or authentication fails, the service attempts to authenticate according to the setting of the Authentication Method parameter.
If set to OFF, the service does not use cookie based authentication for HTTP requests after the initial authentication.
Default: ON
Cookie Name
Specifies the name of the cookie used for authenticating HTTP requests when HTTP mode is enabled (Transport Mode set to http) and cookie-based authentication is enabled (Enable Cookie Authentication is set to ON). When preparing an HTTP request to the server, the service will not attempt to reauthenticate if a valid cookie is present.
Valid values:
string
where:
string
is a valid cookie name.
Default: hive.server2.auth
Extended Options
Specifies a semicolon-separated list of connection options and their values. Use this configuration option to set the value of undocumented connection options that are provided by Progress DataDirect technical support. You can include any valid connection option in the Extended Options string, for example:
Database=Server1;UndocumentedOption1=value[;UndocumentedOption2=value;]
If the Extended Options string contains option values that are also set in the setup dialog, the values of the options specified in the Extended Options string take precedence.
Valid Values:
string
Default: none
Metadata Exposed Schemas
Restricts the metadata exposed by Hybrid Data Pipeline to a single schema. The metadata exposed in the SQL Editor, the Configure Schema Editor, and third party applications will be limited to the specified schema. JDBC, OData, and ODBC metadata calls will also be restricted. In addition, calls made with the Schema API will be limited to the specified schema.
Warning: This functionality should not be regarded as a security measure. While the Metadata Exposed Schemas option restricts the metadata exposed by Hybrid Data Pipeline to a single schema, it does not prevent queries against other schemas on the backend data store. As a matter of best practice, permissions should be set on the backend data store to control the ability of users to query data.
Valid Values
<schema>
Where:
<schema>
is the name of a valid schema on the backend data store.
Default: No schema is specified. Therefore, all schemas are exposed.
See the steps for:
How to create a data source in the Web UI