On October 28th, Microsoft finally released a new Azure service called “Windows Azure HDInsight Service”, our Hadoop offering in the cloud:
Along with this core service, a series of additional components has also been released to integrate the Big Data world with familiar Microsoft BI tools and on-premises technologies. The most important is a new ODBC driver that permits connection to Hadoop HIVE:
Microsoft® Hive ODBC Driver is a connector to Apache Hadoop Hive available as part of HDInsight clusters.
Microsoft® Hive ODBC Driver enables Business Intelligence, Analytics and Reporting on data in Apache Hive.
You can download the driver from the link below:
Microsoft® Hive ODBC Driver
This driver can be installed on 32-bit or 64-bit versions of Windows 7, Windows 8, Windows Server 2008 R2 and Windows Server 2012, and allows connection to “Windows Azure HDInsight Service” (v1.6 and later) and “Windows Azure HDInsight Emulator” (v1.0.0.0 and later). You should install the version that matches the bitness of the application where you will be using the ODBC driver. Both packages can be installed on the same machine if you need both versions of the driver; by default, they are installed in different paths:
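As a quick way to check which driver bitness a given application needs, here is a small illustrative Python sketch (my own, not part of the driver package): the pointer size of the running process tells you whether it is 32-bit or 64-bit, and the ODBC driver must match the application, not the operating system.

```python
import struct

# Pointer size determines whether this process is 32-bit or 64-bit;
# the Hive ODBC driver bitness must match the *application* using it.
bits = struct.calcsize("P") * 8
print(f"This process is {bits}-bit; install the {bits}-bit Hive ODBC driver.")
```

For example, a 32-bit Excel running on 64-bit Windows still needs the 32-bit driver.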
Depending on the version installed, you may need to use a different version of the ODBC Data Source Administrator; on a 64-bit machine, the 32-bit version (“odbcad32.exe”) is located in “C:\Windows\SysWOW64”. On Windows 8/Windows Server 2012, you can find the right version easily using the “Search” function:
If there is a version mismatch between SQL Server and the ODBC driver, you will receive the error message below when you try to execute any query:
OLE DB provider 'MSDASQL' for linked server 'HiveSample' returned message '[Microsoft][ODBC Driver Manager] The specified DSN contains an architecture mismatch between the Driver and Application'.
Msg 7303, Level 16, State 1, Line 1 Cannot initialize the data source object of OLE DB provider 'MSDASQL' for linked server 'HiveSample'.
The setup process is really straightforward; there is nothing to configure or change until the end:
As you can see in the second screenshot, this driver has been developed in collaboration with Simba Technologies Inc. At the end of the installation, it is recommended to go into the installation folder (by default, for 64-bit, “C:\Program Files\Microsoft Hive ODBC Driver”) and look at the following files/documents:
Includes material furnished by Simba Technologies, Inc. Used under license. Note: While Microsoft is not the author of the files below, Microsoft is offering you a license subject to the terms of the Microsoft Software License Terms for Microsoft Hive ODBC Driver (the “Microsoft Program”). Microsoft reserves all other rights. The notices below are provided for informational purposes only and are not the license terms under which Microsoft distributes these files.
Once installed, a sample data source is already pre-configured, but it needs modifications if you want to reuse it to connect to your HDInsight cluster:
Click “Configure” and adjust the parameters as shown in the screenshot below:
NOTE: If you tested a beta version of this ODBC driver with an earlier HDInsight release, be aware that the TCP port used has changed from 563 to 443.
Insert the “User Name” and “Password” you chose when you provisioned the HDInsight cluster, then click the “Test” button; if everything is correct, you should receive the following output:
Also be very careful with the “Advanced Options” button; there are some important parameters to be aware of:
Now it’s time to open SQL Server Management Studio and create a “Linked Server” definition using the TSQL statement below:
EXEC master.dbo.sp_addlinkedserver @server = N'HiveSample', @srvproduct=N'HIVE', @provider=N'MSDASQL', @datasrc=N'Sample Microsoft Hive DSN', @provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=<<user_name>>; Password=<<password>>;'
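If you script this setup, the @provstr value is just a semicolon-delimited key/value list. The sketch below is a hypothetical Python helper (the function name and layout are mine, not part of any Microsoft tool) that assembles the same provider string used in the TSQL example above:

```python
def build_provider_string(user: str, password: str) -> str:
    """Assemble the MSDASQL provider string passed to sp_addlinkedserver.
    The key/value layout mirrors the TSQL example in the post."""
    parts = {
        "Provider": "MSDASQL.1",
        "Persist Security Info": "True",
        "User ID": user,
        "Password": password,
    }
    return ";".join(f"{k}={v}" for k, v in parts.items()) + ";"

print(build_provider_string("admin", "secret"))
# Provider=MSDASQL.1;Persist Security Info=True;User ID=admin;Password=secret;
```

Generating the string this way avoids the quoting mistakes that are easy to make when editing the one-line TSQL statement by hand.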
You can now see the new “Linked Server” definition along with the sample table that exists in the HIVE data warehouse on the HDInsight cluster:
Now I want to run a simple SELECT statement to access data in the default HIVE sample table:
select * from [HiveSample].[HIVE].[default].hivesampletable
Unfortunately, something went wrong and I got the following error:
OLE DB provider 'MSDASQL' for linked server 'HiveSample' returned message 'Requested conversion is not supported.'.
Msg 7341, Level 16, State 2, Line 1 Cannot get the current row value of column '[MSDASQL].clientid' from OLE DB provider 'MSDASQL' for linked server 'HiveSample'.
This seemed very strange to me, since [clientid] in HIVE is a basic “string” type and the ODBC connector should be able to translate it. Trusting the error message above, I was able to execute the query successfully by introducing an explicit type conversion in TSQL using the CONVERT function, as shown below, and changing the “Default string column length” parameter (in the ODBC DSN advanced options) to 8000:
select convert(varchar(8000),[clientid]) from [HiveSample].[HIVE].[default].hivesampletable
The reason for the error above is that HIVE does *not* provide the maximum data length for “string” columns in the column metadata, so you need to use explicit cast/convert techniques in SQL Server TSQL to deal with it. Obviously you can use numbers lower than 8000, which is the maximum character length you can specify for VARCHAR without using the MAX qualifier; it's up to you, but be sure to set the ODBC DSN advanced options parameters accordingly.
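To make the length issue concrete, here is a small Python sketch of the behavior (an illustration in my own words, not the driver's actual code): when HIVE reports no maximum length, the consumer falls back to the configured default string column length, and longer values are cut off.

```python
def fit_to_column(value: str, default_string_column_length: int = 8000) -> str:
    """Simulate the fallback: HIVE reports no max length for string
    columns, so the DSN's 'Default string column length' applies and
    anything longer is truncated."""
    return value[:default_string_column_length]

long_value = "x" * 10000
print(len(fit_to_column(long_value)))      # truncated to 8000
print(fit_to_column("short string"))       # short values pass through unchanged
```

This is why the CONVERT length in TSQL and the DSN setting must agree: if the DSN value is smaller than your data, the truncation happens silently before SQL Server ever sees the value.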
IMPORTANT: Be very careful with the parameter values contained in the ODBC “Advanced Options” property page, as described earlier in this post, since you may retrieve incorrect data due to precision loss and/or truncation.
In addition to the four-part syntax used above, you can also use the OPENQUERY syntax, as in the example below:
select * from openquery(HiveSample, 'select sessionid from hivesampletable')
If you want to check the native HIVE data type of a specific table, you can use the following PowerShell command:
Invoke-Hive 'describe hivesampletable'
NOTE: In order to use the PowerShell cmdlets for HDInsight, you need to follow the instructions below:
Install and configure PowerShell for HDInsight
Another useful query, again submitted via PowerShell, shows all the tables present in the HIVE metastore:
Invoke-Hive 'show tables'
When you submit queries to HDInsight HIVE using the ODBC connector, be aware that every query is translated into a Hadoop Map-Reduce job, so execution time may be long. If your SQL Server installation normally uses a query timeout different from the default value of 0 (infinite wait), you may have to change it; otherwise, you will get an error before HDInsight is able to process your query/job. In the properties of the “Linked Server”, you can see and, if necessary, change the connection and query timeout values:
You may also have to change the instance-wide SQL Server “Query Timeout” value, which defaults to 600 seconds (10 minutes), using the script below as an example for infinite wait:
sp_configure 'show advanced options',1
go
reconfigure
sp_configure 'remote query timeout (s)',0
go
reconfigure
go
Finally, remember that this is v1 of this new technology, so not all HIVE aspects are covered by the driver; I would recommend that you check whether everything you need is included. This is a short list of what is and is not yet supported:
If you are a SQL Server DBA, and therefore an expert in the TSQL language, and want to know exactly which HiveQL syntax is supported, you can check the official Hadoop HIVE documentation on the Apache web site:
That’s all SQL Server folks…. Welcome to the Big data world!
The Cloudera Altus Director Azure Plugin is an implementation of the Cloudera Altus Director Service Provider Interface for the Microsoft Azure cloud platform.
Copyright and License
Copyright © 2016-2018 Cloudera. Licensed under the Apache License, Version 2.0.
Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. Amazon Web Services, the 'Powered by Amazon Web Services' logo, Amazon Elastic Compute Cloud, EC2, Amazon Relational Database Service, and RDS are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. For information about patents covering Cloudera products, see http://tiny.cloudera.com/patents.
The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document.
These instructions describe how to run unit and live tests for the Cloudera Altus Director Azure Plugin.
Prerequisites

Required environmental information
Before using this plugin you will need:
Resource limits
Per the Azure Reference Architecture (PDF), the supported instance types used in deploying a CDH cluster are large. Even with small test clusters you'll hit the default resource limits per subscription.
Increase the limits, especially core count, by filing a support ticket on the Azure portal.
Building everything
To build everything, from Launchpad's base directory run:
Unit tests
Unit tests are simple. From Launchpad's base directory run:
Live tests
Live tests are more complicated and require the following:
To run all Live Tests, from Launchpad's base directory run:
To run a specific test, from Launchpad's base directory run (note the change from -am clean install to test):
These instructions describe how to deploy the Cloudera Altus Director Azure Plugin.
Prerequisites

Required fields
Before using this plugin you will need:
Pre-deployment setup

Forward and reverse DNS
Cloudera Distribution of Hadoop (CDH) and Cloudera Altus Director require forward and reverse hostname resolution. This is not currently supported by Azure (Azure only supports forward resolution), so you must provide name resolution using your own DNS server.
More instructions.
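As a quick sanity check from a given machine, both directions can be exercised with the Python standard library. The sketch below is my own illustration (not a Cloudera tool): it forward-resolves a hostname to an IP, then reverse-resolves that IP back to a name.

```python
import socket

def check_forward_reverse(hostname: str) -> tuple[str, str]:
    """Forward-resolve the hostname to an IP, then reverse-resolve
    the IP back to a name; CDH needs both directions to succeed."""
    ip = socket.gethostbyname(hostname)                 # forward lookup
    name, _aliases, _addrs = socket.gethostbyaddr(ip)   # reverse lookup
    return ip, name

ip, name = check_forward_reverse("localhost")
print(ip, name)
```

Run it with your cluster hostnames after configuring your DNS server; if the reverse lookup raises an error or returns a different name, the DNS setup is not yet suitable for CDH.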
IMPORTANT: The Cloudera Enterprise Reference Architecture for Azure Deployments (PDF) is the authoritative document for supported deployment configurations in Azure.
There are two files that the Cloudera Altus Director Azure Plugin uses to change settings:
The files and their uses are explained below.