Simplifying Azure Synapse Analytics
- Azure Synapse is an analytics service that brings together enterprise data warehousing and big data analytics.
- It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources at scale.
- Azure Synapse combines these two worlds with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs.
High-level Azure Synapse architecture:
Azure Synapse has 4 major components:
1. Synapse SQL:
Synapse SQL leverages a scale-out architecture to distribute computational processing of data across multiple nodes. ‘Compute’ is separate from storage, enabling you to scale compute independently of the data in your system.
For a dedicated SQL pool, the unit of scale is an abstraction of computing power known as a data warehouse unit (DWU).
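To make the scale-out idea concrete, here is a minimal Python sketch of how rows might be spread across a fixed set of distributions and how those distributions could be shared among compute nodes. A dedicated SQL pool shards data into 60 distributions; the hash function, the round-robin node assignment, and the helper names (`distribution_for`, `node_for`, `distributions_per_node`) are illustrative assumptions, not the engine's actual internals.

```python
import hashlib
from collections import Counter

# A dedicated SQL pool shards table data into 60 distributions.
NUM_DISTRIBUTIONS = 60


def distribution_for(key: str) -> int:
    """Map a distribution-key value to a distribution id.

    Assumption: SHA-256 stands in for the engine's own hash function.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DISTRIBUTIONS


def node_for(distribution: int, compute_nodes: int) -> int:
    """Assign a distribution to a compute node (assumed round-robin)."""
    if compute_nodes < 1:
        raise ValueError("compute_nodes must be at least 1")
    return distribution % compute_nodes


def distributions_per_node(compute_nodes: int) -> Counter:
    """Count how many distributions each compute node would own."""
    return Counter(
        node_for(d, compute_nodes) for d in range(NUM_DISTRIBUTIONS)
    )
```

Calling `distributions_per_node(4)` gives each of the four nodes 15 distributions. Scaling compute up or down only changes this mapping; the stored data stays where it is, which is the practical meaning of separating compute from storage.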
A serverless SQL pool scales automatically to accommodate each query's resource requirements. As the topology changes over time through added nodes, removed nodes, or failovers, it adapts so that your query has enough resources and finishes successfully. For example, the image below shows a serverless SQL pool utilizing 4 compute nodes to execute a query.
Generally, Synapse SQL pools are hosted on an Azure SQL Server instance and can be browsed using tools like SQL Server Management Studio (SSMS). Synapse SQL is also available as a serverless offering (in preview as of Sept 2020), where no fixed infrastructure capacity needs to be provisioned. Instead, Azure manages the infrastructure capacity required to meet the needs of the workloads. This is a data virtualization feature supported by Synapse SQL. The pricing model, in this case, is based on the volume of data processed instead of the number of DWUs allocated to the instance.
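The difference between the two pricing models can be sketched in a few lines of Python. The rates below are hypothetical placeholders rather than published Azure prices, the linear per-DWU-hour model is a simplifying assumption, and the names `Rates`, `serverless_cost`, and `dedicated_cost` are invented for this illustration.

```python
from dataclasses import dataclass

TB = 1024 ** 4  # bytes per terabyte (binary definition, an assumption)


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices; substitute the current rates for your region."""
    per_tb_processed: float  # serverless: charged per TB of data scanned
    per_dwu_hour: float      # dedicated: charged per DWU per hour allocated


def serverless_cost(bytes_processed: int, rates: Rates) -> float:
    """Cost depends only on how much data the queries processed."""
    return bytes_processed / TB * rates.per_tb_processed


def dedicated_cost(dwus: int, hours: float, rates: Rates) -> float:
    """Cost depends on allocated capacity and time, not on data scanned."""
    return dwus * hours * rates.per_dwu_hour


if __name__ == "__main__":
    rates = Rates(per_tb_processed=5.0, per_dwu_hour=0.01)  # placeholders
    print("serverless:", serverless_cost(2 * TB, rates))
    print("dedicated: ", dedicated_cost(100, 24, rates))
```

Under these assumptions, an idle dedicated pool still accrues cost for every hour it stays allocated, while a serverless pool costs nothing until a query scans data. That trade-off is usually what decides between the two for occasional, exploratory workloads.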
2. Apache Spark for Synapse:
This Synapse component provides a Spark runtime to perform tasks such as data loading, data processing, data preparation, ETL, and other functions related to data warehousing. Azure also offers Azure Databricks, a service based on the Spark runtime with a specific set of optimizations, which is typically used for similar purposes. One advantage of Spark for Synapse over Azure Databricks is that no additional or separate clusters need to be managed to process data, because it is an integral part of Synapse. It provides Spark-based processing with auto-scaling and supports features like .NET for Spark, SparkML algorithms, Delta Lake, Azure ML integration for Apache Spark, Jupyter-style notebooks, etc. In addition, it supports multiple languages, such as C#, PySpark, Scala, Spark SQL, Java, etc. Once a Synapse workspace is created, one can provision Apache Spark pools or Synapse SQL pools from a familiar interface.
Spark pools in Azure Synapse Analytics enable the following critical scenarios:
Data Engineering/Data Preparation:
Apache Spark includes many language features that support preparing and processing large volumes of data, making the data more valuable before it is consumed by other services within Azure Synapse Analytics. This is enabled through multiple languages (C#, Scala, PySpark, Spark SQL) and supplied libraries for processing and connectivity.
Machine Learning:
Apache Spark comes with MLlib, a machine learning library built on top of Spark that you can use from a Spark pool in Azure Synapse Analytics. Spark pools in Azure Synapse Analytics also include Anaconda, a Python distribution with various packages for data science, including machine learning. Combined with the built-in support for notebooks, this gives you an environment for creating machine learning applications.
3. Synapse Pipelines:
Different tools can be used to load data into Synapse, but having an integrated orchestration engine helps to reduce the dependency on, and management of, separate tool instances and data pipelines. This service comes with an integrated orchestration engine identical to Azure Data Factory for creating data pipelines, along with rich data transformation capabilities, within the Synapse workspace itself. Key features include support for 90+ data sources, including almost 15 Azure-based data sources, 26 open-source and cross-cloud data warehouses and databases, 6 file-based data sources, 3 NoSQL-based data sources, 28 services and apps that can serve as data providers, as well as four generic protocols like ODBC, REST, etc. that can serve data.
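As a rough illustration of what the orchestration engine does, the Python sketch below describes a pipeline as a set of activities with dependencies and derives a valid run order. The dictionary layout is loosely modeled on the JSON pipeline definitions used by Azure Data Factory, but the activity names, the `type` values, and the `execution_order` helper are hypothetical, and the real service schedules activities itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: copy raw data, transform it, then load it.
pipeline = {
    "name": "DailySalesLoad",
    "activities": [
        {"name": "CopyRawSales", "type": "Copy", "dependsOn": []},
        {
            "name": "TransformSales",
            "type": "Notebook",
            "dependsOn": [{"activity": "CopyRawSales"}],
        },
        {
            "name": "LoadWarehouse",
            "type": "StoredProcedure",
            "dependsOn": [{"activity": "TransformSales"}],
        },
    ],
}


def execution_order(pipeline: dict) -> list[str]:
    """Return activity names in an order that respects every dependency.

    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    sorter = TopologicalSorter()
    for activity in pipeline["activities"]:
        upstream = [dep["activity"] for dep in activity.get("dependsOn", [])]
        sorter.add(activity["name"], *upstream)
    return list(sorter.static_order())


if __name__ == "__main__":
    for step, name in enumerate(execution_order(pipeline), start=1):
        print(step, name)
```

Declaring dependencies instead of a fixed sequence lets an orchestrator run independent activities in parallel and rerun only the failed branch, which is the main reason pipelines are modeled as graphs.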
4. Azure Synapse Studio:
This web-based SaaS tool lets developers work with every aspect of Synapse Analytics from a single console. In an analytical solution development life cycle using Synapse, one generally starts by creating a workspace and launching this tool, which provides access to the different Synapse features: ingesting data using import mechanisms or data pipelines, developing data flows, exploring data using notebooks, analyzing data with Spark jobs or SQL scripts, and finally visualizing data for reporting and dashboarding purposes. This tool also provides features for authoring artifacts, debugging code, optimizing performance by assessing metrics, integrating with CI/CD tools, etc.
This provides a high-level understanding of what Synapse Analytics is and how it can help an organization meet and optimize its goals.
We hope you have gained some insights. Let us know your feedback and queries.