Thursday, January 26, 2017

Mashing RDDs in Apache Spark from an RDBMS perspective


Happy new year! This is my first post in 2017. 2016 was an amazing year for me: lots of work, projects and achievements. Looking forward to 2017.

I am writing this blog post to cover the standard techniques to work with Resilient Distributed Datasets (RDDs) to join data in Apache Spark.

I would like to share some insights on working with multiple RDDs in Spark, in the same way we work with multiple tables in relational database management systems (RDBMS).

Apache Spark supports joins on pair RDDs (RDDs of key-value pairs), so you can implement all the kinds of joins we know from RDBMS. Below I list how you would implement each of them on this platform.

Apache Spark join transformations:

1) join: This is equivalent to inner join in RDBMS. It returns a new pair RDD whose elements contain all possible pairs of values from the first and second RDDs that have the same key. For keys that exist in only one of the two RDDs, the resulting RDD will have no elements.

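2) leftOuterJoin: This is equivalent to left outer join in RDBMS. In addition to the inner join results, the resulting RDD will also contain elements for those keys that exist in the first RDD but not in the second, with an empty value (None) in place of the missing second value.

3) rightOuterJoin: This is equivalent to right outer join in RDBMS. In addition to the inner join results, the resulting RDD will also contain elements for those keys that exist in the second RDD but not in the first, with an empty value (None) in place of the missing first value.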

4) fullOuterJoin: This is equivalent to full outer join in RDBMS (not cross join; the Cartesian product is a separate transformation, cartesian). The resulting RDD will contain elements for keys that exist in either of the two RDDs.

If the RDDs contain duplicate keys, these keys will be joined multiple times: every value for a key in the first RDD is paired with every value for that key in the second RDD.
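To make the four transformations concrete, here is a minimal sketch in plain Python that mimics their semantics on small lists of (key, value) pairs. It does not use Spark itself; the helper names and the sample data (customers, orders) are made up for illustration. In PySpark the corresponding calls would be rdd1.join(rdd2), rdd1.leftOuterJoin(rdd2), rdd1.rightOuterJoin(rdd2) and rdd1.fullOuterJoin(rdd2).

```python
from collections import defaultdict


def _group(pairs):
    """Group (key, value) pairs into {key: [values]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def join(left, right, how="inner"):
    """Mimic pair-RDD join semantics; a missing side is represented by None."""
    left_groups, right_groups = _group(left), _group(right)
    if how == "inner":
        keys = left_groups.keys() & right_groups.keys()
    elif how == "left":
        keys = left_groups.keys()
    elif how == "right":
        keys = right_groups.keys()
    elif how == "full":
        keys = left_groups.keys() | right_groups.keys()
    else:
        raise ValueError(f"unknown join type: {how}")

    result = []
    for key in sorted(keys):
        # Every left value is paired with every right value for the key,
        # which is why duplicate keys multiply the output.
        for left_value in left_groups.get(key, [None]):
            for right_value in right_groups.get(key, [None]):
                result.append((key, (left_value, right_value)))
    return result


# Hypothetical sample data: key 1 appears twice in orders (duplicate key).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cid")]
orders = [(1, "book"), (1, "pen"), (4, "lamp")]

inner = join(customers, orders, "inner")
left = join(customers, orders, "left")
right = join(customers, orders, "right")
full = join(customers, orders, "full")

for name, rows in [("inner", inner), ("left", left), ("right", right), ("full", full)]:
    print(name, rows)
```

Note how customer 1 shows up twice in every result because the key 1 has two orders, while customers 2 and 3 and order key 4 only appear in the outer joins.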

Hope this helps!

Friday, October 14, 2016

Introducing Power BI Embedded talk at Cloud Summit

Hi All,

Earlier today, I presented "Introducing Power BI Embedded", a talk that covers platform capabilities and tools, at the Cloud Summit event at the Microsoft Chevy Chase office.

The session covered Power BI platform capabilities, tools and Power BI Embedded as a PaaS option in the Microsoft Cloud platform.

I got a lot of questions about Power BI data set scheduling and about working with data, including direct query vs. import options while authoring reports in Power BI Desktop. I also covered the need for the Power BI Gateway in hybrid scenarios.


Thursday, October 13, 2016

Fixing powerbi.d.ts missing modules errors in Visual Studio 2015


While working in Visual Studio 2015, I got these errors due to missing modules in the powerbi.d.ts file. I have an ASP.NET MVC project that uses Power BI Embedded, and I wanted to get the application up and running, but I kept getting these errors while building it.

These errors are due to missing TypeScript tools for Visual Studio 2015. Once you install them, you will be able to run your app and all these errors disappear.

To fix this problem, follow these steps:

  • Open Tools | Extensions and Updates.
  • Select Online in the tree on the left.
  • Search for TypeScript using the search box in the upper right.
  • Select the most current available TypeScript version.
  • Download and install the package.
  • Build your project!

Hope this helps.

Monday, September 19, 2016

Extending Product Outreach with Outlook Connectors

Hi All,

Last Saturday I presented a talk at SharePoint Detroit titled "Extending Product Outreach with Outlook Connectors". I covered, with demos, how to use Office 365 groups to extend product outreach using Outlook group connectors.

Session Description:

Office 365 Connectors is a brand new experience that delivers relevant interactive content and updates from popular apps and services to Office 365 Groups. We are now bringing this experience to you, our Office 365 customers. Whether you are tracking a Twitter feed, managing a project with Trello or watching the latest news headlines with Bing, Office 365 Connectors surfaces all the information you care about in the Office 365 Groups shared inbox, so you can easily collaborate with others and interact with the updates as they happen. The session will cover how to build your own Office 365 connectors and how to work with Microsoft to build one for your company.

Thursday, September 08, 2016

Build Intelligent Microservices Solutions using Azure

Hi All,

I had the pleasure last night of presenting at one of our local user groups about building intelligent microservices in Azure.

The session covers in detail how to build intelligent microservices solutions using Cloud Services (including web and worker roles), Azure App Service features and Service Fabric. The session was demo driven: I demonstrated how to design and provision complete end-to-end solutions using cloud services with web roles, worker roles and Service Bus in Azure.
I also covered Azure App Service capabilities that help developers scale and monitor production applications, in addition to setting up continuous deployment.

Session objectives and takeaways:

  1. Benefits of creating microservices in the cloud
  2. End-to-end use case for building a cloud service with web & worker roles and Service Bus integration
  3. Azure App Service intelligent features including troubleshooting, CI, back up, routing, scheduling & other features
  4. Azure Service Fabric microservices platform

The presentation is posted below.

Wednesday, August 31, 2016

Building Big Data Solutions in Azure Data Platform @ Data Science MD

Hi All,

Yesterday I was at Johns Hopkins University in Laurel, MD presenting how to build big data solutions in Azure. The presentation focused on the underlying technologies and tools needed to build end-to-end big data solutions in the cloud. I presented the capabilities that Azure offers out of the box, in addition to the cluster types and tiers that are available for ISVs and developers.

The session covers the following:

1) What HDInsight clusters offer in the Hadoop ecosystem technology stack.
2) HDInsight cluster tiers and types.
3) HDInsight developer tools in Visual Studio 2015.
4) Working with HBase databases and Hive View, deploying Hive apps from Visual Studio.
5) Building, Debugging and Deploying Storm Apps into Storm clusters.
6) Working with Spark clusters using Jupyter, PySpark.

Session Title: Building Big Data Solutions in Azure Data Platform

Session Details:
The session covers how to get started building big data solutions in Azure. Azure provides different cluster types for the Hadoop ecosystem. The session covers the basics of HDInsight clusters, including Apache Hadoop HDFS, HBase, Storm and Spark, and how to integrate with HDInsight in .NET using different Hadoop integration frameworks and libraries. It is a jump start for engineers and DBAs with RDBMS experience who want to begin working with and developing Hadoop solutions. The session is demo driven and will cover the basics of Hadoop open source products.

Friday, August 26, 2016

Study notes for exam 70-475: Designing and Implementing Big Data Analytics Solutions

Hi All,

Today I passed the "Designing and Implementing Big Data Analytics Solutions" Microsoft exam.

I have been preparing for this exam (70-475) for a couple of months, and I have been using Hadoop ecosystem tools and platforms for a while.

I wanted to master building big data analytics solutions on HDInsight clusters with the Hadoop ecosystem, which includes Storm, Spark, HBase, Hive and HDFS. I worked to cover any gaps in my understanding of Azure Data Lake, Python & R programming and Azure Machine Learning.

This exam primarily covers the following four technologies (from most covered to least):

1) Hadoop ecosystem: Working with HDFS, HBase, Hive, Storm, Spark and understanding Lambda Architecture. If you want to know more about Lambda Architecture, read my blog post explaining it here.

2) Azure Machine Learning: building/training models, predictive models, classification vs regression vs clustering, recommender algorithms, building custom models, executing code in R and Python. Ingesting data from Azure Event Hub & transformation in Stream Analytics.

3) Azure Data Lake: building pipelines, activities, linked services, moving, transforming and analyzing data, working with storage options in Azure (blob vs block) & tools to transform data.

4) SQL Server and Azure SQL: Security in transit and at rest, SQL Data Warehouse. Working with R in SQL Server 2016/Azure SQL.

My study notes while preparing to pass this test:

1) To protect data at rest as well as while querying in Azure SQL Database: use "Always Encrypted" to make sure sensitive data stays encrypted in transit and during queries, and use "Transparent Data Encryption" (TDE) to make sure that data at rest is encrypted. Read more about TDE here. Read more about Always Encrypted feature here.

2) If you get an "Out of memory" error when running an Azure ML experiment, here is how to fix it:
   a) Increase the memory settings for the map and reduce operations in the import module.
   b) Use Hive query to limit the amount of data being processed in the import module.

3) The easiest way to manage Hadoop clusters in Azure is to assign every HDInsight cluster to a resource group and to apply tags to all related resources.

4) In Hadoop, when the data is row-based and self-describing (it carries its schema) and you need compact binary serialization, it is recommended to use Avro.

5) Choosing a Hadoop cluster type for query and analysis batch jobs:
     a) Spark: a cluster for in-memory processing, interactive queries, and micro-batch stream processing.
     b) Storm: real-time event processing.
     c) HBase: NoSQL data storage for big data systems.

6) Tips for importing data using Python in Azure ML:
    a) Missing values are converted into NA for processing. NA will be converted back to missing values when converted back to datasets.
    b) Azure datasets are converted to Pandas data frames. The Pandas module is used to work with data in Python.
    c) Columns with numeric names are not ignored. The str() function is applied to those names.
    d) Duplicate column names are not ignored. The duplicate column names are modified to make sure they have unique names.

7) Among the Hadoop file storage formats, the only one that supports ACID transactions is Apache ORC.

8) You have three utilities you can use to move data from local storage to managed cluster blob storage. These tools are: Azure CLI, PowerShell & AzCopy.

9) How to improve Hive queries using static vs dynamic partitioning, read more here.

10) Understand when to use Filter based Feature Selection in Azure ML.

11) AzureML requires Python to store visualizations as PNG files. To use Matplotlib in AzureML, you should configure it to use the Agg backend for rendering and you should save charts as PNG files.

12) To detect potential SQL injection attempts on Azure SQL database in ADL cluster: Enable Threat Detection.

13) To create synthetic samples for classes that are underrepresented in a dataset: use the SMOTE module in AzureML.

14) D14 V2 Virtual Machines in Azure support 100 GB of in-memory processing.
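To give a rough idea of what SMOTE does (this is not the AzureML module, just a plain Python sketch of the general technique), the code below creates synthetic minority points by interpolating between an existing minority sample and one of its nearest minority neighbours. The function name, the k and seed parameters and the sample points are all assumptions for illustration.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=3)
print(new_points)
```

Each synthetic point lies on the line segment between two real minority samples, so the new data stays inside the region the minority class already occupies.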

15) You can add multiple contributors to AzureML workspace as users.

16) Understand the minimum requirements for each cluster type in HDInsight:
       a) At least 1 data node for Hadoop cluster type.
       b) At least 1 region server for HBase cluster type.
       c) Two Nimbus nodes for Storm cluster type.
       d) At least 1 worker node for Spark cluster type.

17) If you want to store a file larger than 1 TB, you need to use Azure Data Lake Store.

18) In Azure Data Factory (ADF), you can train, score and publish experiments to AzureML using:
      a) AzureML Batch execution: to train and score.
      b) AzureML Update resource activity: to update AzureML web services.

19) In Azure Data Factory (ADF), a pipeline is a logical grouping of activities. It lets you configure several activities, including their sequence and timing, and manage them as a unit.

20) Working with R models in SQL Server 2016/AzureSQL: read more here.

21) Apache Spark in HDInsight can read files from Azure blob storage (WASB) but not SQL Server.

22) Always Encrypted protects data both in transit and at rest. This feature also allows you to store the encryption keys on premises.

23) Transparent Data Encryption (TDE) secures data at rest; it does not protect data in transit, and the keys are stored in the cloud.

24) DistCp is a Hadoop tool that can copy data between HDInsight cluster blob storage and Azure Data Lake Store.

25) AdlCopy is a command line utility to copy data from Azure blob storage into an Azure Data Lake Store account.

26) AzCopy: A tool to copy data from and to Azure blob storage.

27) When working with large binary files, you can optimize the speed of an AzureML experiment as follows:
      a) Developers should write data as block blobs.
      b) The blob format should be CSV or TSV.
      c) You should NOT turn off the cached results option.
      d) You can NOT filter data using SQL; use the R language instead.

28) SQL DB contributor role allows monitoring and auditing of SQL databases without granting permissions to modify security or audit policies.

29) To process data in HDInsight clusters in Azure Data Factory (ADF):
      a) Add a new item to the pipeline in the solution explorer.
      b) Select Hive Transformation.
      c) Construct JSON to process the cluster data in an activity.
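As a sketch of step (c), here is how such a JSON definition could be assembled and printed with Python's json module. The property names follow the Data Factory pipeline JSON layout for a Hive activity as I remember it, so please verify them against the ADF documentation; every name, path and date below (pipeline, datasets, linked services, script) is hypothetical.

```python
import json

# Hypothetical pipeline with a single Hive activity that runs on an
# HDInsight cluster. All names, paths and dates are placeholders.
pipeline = {
    "name": "HiveSamplePipeline",
    "properties": {
        "description": "Runs a Hive script on an HDInsight cluster",
        "activities": [
            {
                "name": "RunSampleHiveScript",
                "type": "HDInsightHive",
                # Linked service that points at the HDInsight cluster.
                "linkedServiceName": "HDInsightLinkedService",
                "inputs": [{"name": "RawLogsDataset"}],
                "outputs": [{"name": "PartitionedLogsDataset"}],
                "typeProperties": {
                    # Location of the .hql script and the storage it lives in.
                    "scriptPath": "scripts/partition_logs.hql",
                    "scriptLinkedService": "StorageLinkedService",
                    # Values passed to the script as Hive configuration.
                    "defines": {"Year": "2016"},
                },
            }
        ],
        # Active period of the pipeline.
        "start": "2016-08-01T00:00:00Z",
        "end": "2016-08-02T00:00:00Z",
    },
}

pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

Building the definition as a Python dictionary first makes it easy to keep the activity, its linked services and its datasets consistent before pasting the JSON into the pipeline file.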

30) Understanding Tumbling vs Hopping vs Sliding Windows in Azure Stream Analytics. (link)
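To get a feel for the difference between the three window types, here is a small sketch in plain Python (not the Stream Analytics query language). Timestamps are plain integers in seconds, the function names and sample events are made up, and for simplicity tumbling and hopping windows are treated as half-open intervals [start, start + size) starting at time 0; the exact boundary handling in Azure Stream Analytics may differ.

```python
def tumbling(timestamps, size):
    """Assign each timestamp to exactly one fixed, non-overlapping window."""
    windows = {}
    for t in timestamps:
        start = (t // size) * size
        windows.setdefault((start, start + size), []).append(t)
    return windows


def hopping(timestamps, size, hop):
    """Fixed-size windows that start every `hop` seconds and may overlap."""
    windows = {}
    for t in timestamps:
        # Earliest window start (a multiple of hop) whose window still covers t.
        start = ((t - size) // hop + 1) * hop
        while start <= t:
            if start >= 0:  # simplification: no windows before time 0
                windows.setdefault((start, start + size), []).append(t)
            start += hop
    return windows


def sliding(timestamps, size):
    """For each event, the events seen in the `size` seconds ending at it."""
    ordered = sorted(timestamps)
    return {t: [u for u in ordered if t - size < u <= t] for t in ordered}


# Hypothetical event timestamps in seconds.
events = [1, 4, 6, 11]

tumbling_windows = tumbling(events, size=5)
hopping_windows = hopping(events, size=10, hop=5)
sliding_windows = sliding(events, size=5)

print("tumbling:", tumbling_windows)
print("hopping: ", hopping_windows)
print("sliding: ", sliding_windows)
```

With tumbling windows every event lands in exactly one window, with hopping windows an event can be counted in several overlapping windows, and with sliding windows the result is re-evaluated around each event rather than on a fixed grid.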

Hope this helps you get ready to pass the test, and good luck everyone!
Let's all get certified, y'all data wranglers :-)

-- ME

1) Microsoft Exam 70-475 details, skills measured and more: