Performance impact of memory grants on data loads into Columnstore tables

Reviewed by: Dimitri Furman, Sanjay Mishra, Mike Weiner, Arvind Shyamsundar, Kun Cheng, Suresh Kandoth

Background

Some of the best practices when bulk inserting into a clustered Columnstore table are:

  • Specifying a batch size close to 1048576 rows, or at least greater than 102400 rows, so that rows land directly in compressed row groups.
  • Using concurrent bulk loads if you want to reduce the time to load.

For additional details, see the blog post titled Data Loading performance considerations with Clustered Columnstore indexes, specifically the Concurrent Loading section.

Customer Scenario

I was working on a customer engagement involving SQL Server 2017 CTP 2.0 on Linux with a Columnstore implementation, where the speed of the load process was of critical importance to the customer.

  • Data was being loaded concurrently from 4 jobs, each one loading a separate table.
  • Each job spawned 15 threads, so in total there were 60 threads concurrently bulk loading data into the database.
  • Each thread specified the commit batch size to be 1048576.

Observations

When we tested with 1 or 2 jobs, resulting in 15 or 30 concurrent threads loading, performance was great. Using the concurrent approach, we had greatly reduced the load time. However, when we increased the number of jobs to 4 jobs running concurrently, or 60 concurrent threads loading, the overall load time more than doubled.

Digging into the problem

Just like in any performance troubleshooting case, we checked physical resources, but found no bottleneck in CPU, disk I/O, or memory at the server level. CPU on the server was hovering around 30% for the 60 concurrent threads, almost the same as with 30 concurrent threads. Mid-way into job execution, we also checked DMVs such as sys.dm_exec_requests and sys.dm_os_wait_stats, and saw that INSERT BULK statements were executing, but there was no predominant wait. Periodically there was LATCH contention, which made little sense: given the ~1 million row batch sizes, data from each bulk insert session should have landed directly in its own compressed row group.

Then we spot-checked the row group physical stats DMV and observed that, despite the batch size specified, the rows were landing in the delta store, and not in compressed row groups directly, as we expected they would.

Below is an example of what we observed from sys.dm_db_column_store_row_group_physical_stats:

SELECT row_group_id, delta_store_hobt_id, state_desc, total_rows, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('MyTable');

[Query results showing row groups in the OPEN state in the delta store, despite the ~1 million row batch size]

As you may recall from the previously referenced blog, inserting into the delta store instead of directly into compressed row groups can significantly impact performance. It also explained the latch contention we saw, since many threads were inserting into the same B-tree. At first we suspected that the code was setting the batch size incorrectly, but an XEvent session showed the batch size of 1 million being specified as expected, so that wasn't a factor. We didn't know of any factor that would cause a bulk insert to revert to the delta store when it was supposed to go to compressed row groups. Hence, we collected a full set of diagnostics for a run using PSSDIAG, and did some post-analysis.

Getting closer…

We found that only at the beginning of the run was there contention on memory grants (RESOURCE_SEMAPHORE waits), for a short period of time. After that, later into the process, we saw some latch contention on regular data pages, which we didn't expect, as each thread was supposed to insert into its own row group. You can also see this same data by querying sys.dm_exec_requests live, if you catch it within the first minute of execution, as displayed below.

 Figure 1: Snapshot of sys.dm_exec_requests

Looking at the memory grant DMV sys.dm_exec_query_memory_grants, we observed that at the beginning of the data load there was memory grant contention. Interestingly, each session had a grant of ~5 GB (granted_memory_kb) but was using only ~1 GB (used_memory_kb). When loading data from a file, the optimizer has no knowledge of the number of rows in the file, so the memory grant is estimated based on the schema of the table, taking into account the maximum length of the variable-length columns defined. In this specific case, the server was commodity hardware with 240 GB of memory; memory grants of 5 GB per thread across 60 threads exceeded the total memory on the box. On a larger machine, this situation might not arise. You can also observe multiple sessions that have requested memory but have not yet been granted it (second and third rows in the snapshot in Figure 2). See additional details on memory grants here.

Figure 2: Snapshot of sys.dm_exec_query_memory_grants
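To see this yourself during a load, a query along these lines against the DMV surfaces the gap between granted and used memory per session (a minimal sketch; the column list is trimmed for readability):

-- Compare granted vs. used memory for active grants during the load
SELECT session_id, request_time, grant_time,
       requested_memory_kb, granted_memory_kb, used_memory_kb
FROM sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC;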

Root cause discovered!

We still didn't know the reason for reverting into the delta store, but armed with the knowledge that there was some kind of memory grant contention, we created an extended event session on the query_memory_grant_wait_begin and query_memory_grant_wait_end events, to see whether memory grant timeouts caused this behavior. This XE session did strike gold: we were able to see several memory grants time out after 25 seconds, and could correlate these session_ids to the same session_ids that were running the INSERT BULK commands.

Figure 3: Output of the Extended Events collection. The duration of the wait is the difference between the query_memory_grant_wait_end and query_memory_grant_wait_begin times for that specific session.
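For reference, here is a minimal sketch of an extended event session along these lines (the session name and file target are illustrative):

-- Capture memory grant waits, with session_id to correlate to the INSERT BULK sessions
CREATE EVENT SESSION MemoryGrantWaits ON SERVER
ADD EVENT sqlserver.query_memory_grant_wait_begin
    (ACTION (sqlserver.session_id, sqlserver.sql_text)),
ADD EVENT sqlserver.query_memory_grant_wait_end
    (ACTION (sqlserver.session_id, sqlserver.sql_text))
ADD TARGET package0.event_file (SET filename = N'MemoryGrantWaits.xel');
GO
ALTER EVENT SESSION MemoryGrantWaits ON SERVER STATE = START;
GO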

Collecting a stack on the query_memory_grant_wait_begin extended event, and with some source code analysis, we found the root cause of this behavior. For every bulk insert, we first determine whether it can go into a compressed row group directly, based on the batch size. If it can, we request a memory grant with a timeout of 25 seconds. If we cannot acquire the memory grant within 25 seconds, that bulk insert reverts to the delta store instead of a compressed row group.

Working around the issue

Given our prior sys.dm_exec_query_memory_grants diagnostic data, you can also observe from Figure 2 that we asked for a 5 GB grant but used only 1 GB. There was room to reduce the grant size, avoid memory grant contention, and still maintain performance. Therefore, we created a Resource Governor workload group that reduced the REQUEST_MAX_MEMORY_GRANT_PERCENT parameter to allow greater concurrency during the data load, and tied that workload group via a classifier function to just the login under which the data load jobs executed.

We first lowered the grant percentage from the default of 25% to 10%, but even at that level we couldn't sustain 60 sessions concurrently bulk loading due to RESOURCE_SEMAPHORE waits, as each memory grant requested was still 5 GB. We iterated on the grant percentage a couple of times, lowering it until we landed at 2% for this specific data load. Setting it to 2% prevents a query from getting a memory grant greater than 2% of the target_memory_kb value in the DMV sys.dm_exec_query_resource_semaphores. At 2%, the memory grant requested for each thread was around 1 GB, which allowed the level of concurrency we were looking for.

Binding the login that was used only for data load jobs to the workload group prevented this configuration from affecting the rest of the workload: only load queries ended up in the workload group with the 2% limit on memory grants, while the rest of the workload used the default workload group configuration.

-- Create a workload group for data loading
CREATE WORKLOAD GROUP DataLoading
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 2);
GO

-- If the login is DataLoad, route the session to workload group DataLoading
DROP FUNCTION IF EXISTS dbo.CLASSIFIER_LOGIN;
GO
CREATE FUNCTION dbo.CLASSIFIER_LOGIN ()
RETURNS sysname WITH SCHEMABINDING
AS
BEGIN
    DECLARE @val sysname = N'default';
    IF SUSER_SNAME() = N'DataLoad'
        SET @val = N'DataLoading';
    RETURN @val;
END
GO
-- Make the function known to the Resource Governor as its classifier
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.CLASSIFIER_LOGIN);
GO
-- Apply the new configuration
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

Note: With memory grants, you can often use the query-level hints MAX_GRANT_PERCENT and MIN_GRANT_PERCENT instead. In this case, given that it was an ETL workflow (for example, an SSIS package), there was no user-written query to which we could add the hint.
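For a hand-written query, the hint form would look like this (a hypothetical example; the table name is illustrative):

-- Cap the memory grant for a single query at 2% of target_memory_kb
SELECT *
FROM dbo.StagingTable
OPTION (MAX_GRANT_PERCENT = 2);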

Final Result

Once we did that, our 4 jobs could execute in parallel (60 threads loading data simultaneously) in the same timeframe that our prior 2 jobs took, reducing total data load time significantly. Running 4 jobs in parallel in almost the same interval of time allowed us to load twice the amount of data, increasing our data load throughput.

| Concurrent load jobs | Tables loaded | Threads loading data | RG configuration | Data load elapsed time (sec) |
|----------------------|---------------|----------------------|------------------|------------------------------|
| 2                    | 2             | 30                   | Default          | 2040                         |
| 4                    | 4             | 60                   | Default          | 4160                         |
| 4                    | 4             | 60                   | REQUEST_MAX_MEMORY_GRANT_PERCENT = 2 | 1040         |

We could drive CPU to almost 100% now, compared to 30% before the Resource Governor changes.

Figure 4: CPU Utilization Chart

Conclusion

Concurrently loading data into clustered Columnstore indexes requires attention to several factors, memory grants among them. Use the techniques outlined in this article to identify whether you are running into similar memory grant bottlenecks, and if so, use the Resource Governor to adjust granted memory and allow for higher concurrency. We hope you enjoyed reading this as much as we enjoyed bringing it to you! Feedback in the comments is welcome.

from SQL Server Customer Advisory Team https://blogs.msdn.microsoft.com/sqlcat/2017/06/02/performance-impact-of-memory-grants-on-data-loads-into-columnstore-tables/

Azure SQL Data Warehouse loading patterns and strategies

Authors: John Hoang, Joe Sack and Martin Lee

Abstract

This article provides an overview of the Microsoft Azure SQL Data Warehouse architecture. This platform-as-a-service (PaaS) offering provides independent compute and storage scaling on demand. This document provides data loading guidelines for SQL Data Warehouse. Several common loading options are described, such as SSIS, BCP, Azure Data Factory (ADF), and SQLBulkCopy, but the main focus is the PolyBase technology, the preferred and fastest loading method for ingesting data into SQL Data Warehouse. See also What is Azure SQL Data Warehouse?

Introduction

Whether you are building a data mart or a data warehouse, the three fundamentals you must implement are an extraction process, a transformation process, and a loading process—also known as extract, transform, and load (ETL). When working with smaller workloads, the general rule from the perspective of performance and scalability is to perform transformations before loading the data. In the era of big data, however, as data sizes and volumes continue to increase, processes may encounter bottlenecks from difficult-to-scale integration and transformation layers.

As workloads grow, the design paradigm is shifting. Transformations are moving to the compute resource, and workloads are distributed across multiple compute resources. In the distributed world, we call this massively parallel processing (MPP), and the order of these processes differs. You may hear it described as ELT: you extract, load, and then transform, as opposed to the traditional ETL order. The reason for this change is today's highly scalable parallel computing power, which puts multiple compute resources at your disposal, such as CPU (cores), RAM, networking, and storage, and lets you distribute a workload across them.

With SQL Data Warehouse, you can scale out your compute resources as you need them on demand to maximize power and performance of your heavier workload processes.

However, we still need to load the data before we can transform. In this article, we’ll explore several loading techniques that help you reach maximum data-loading throughput and identify the scenarios that best suit each of these techniques.

Architecture

SQL Data Warehouse uses the same logical component architecture for the MPP system as the Microsoft Analytics Platform System (APS). APS is the on-premises MPP appliance previously known as the Parallel Data Warehouse (PDW).

As you can see in the diagram below, SQL Data Warehouse has two types of components, a Control node and a Compute node:

Figure 1. Control node and Compute nodes in the SQL Data Warehouse logical architecture


The Control node is the brain and orchestrator of the MPP engine. We connect to this area when using SQL Data Warehouse to manage and query data. When you send a SQL query to SQL Data Warehouse, the Control node processes that query and converts the code to what we call a DSQL plan, or Distributed SQL plan, based on the cost-based optimization engine. After the DSQL plan has been generated, for each subsequent step, the Control node sends the command to run in each of the compute resources.

The Compute nodes are the worker nodes. They run the commands given to them from the Control node. Compute usage is measured using SQL Data Warehouse Units (DWUs). A DWU, similar to the Azure SQL Database DTU, represents the power of the database engine as a blended measure of CPU, memory, and read and write rates. The smallest compute resource (DWU 100) consists of the Control node and one Compute node. As you scale out your compute resources (by adding DWUs), you increase the number of Compute nodes.

Within the Control node and in each of the Compute resources, the Data Movement Service (DMS) component handles the movement of data between nodes—whether between the Compute nodes themselves or from Compute nodes to the Control node.

DMS also includes the PolyBase technology. An HDFS bridge is implemented within the DMS to communicate with the HDFS file system. PolyBase for SQL Data Warehouse currently supports Microsoft Azure Storage Blob and Microsoft Azure Data Lake Store.

Network and data locality

The first considerations for loading data are source-data locality and network bandwidth, utilization, and predictability of the path to the SQL Data Warehouse destination. Depending on where the data originates, network bandwidth will play a major part in your loading performance. For source data residing on your premises, network throughput performance and predictability can be enhanced with a service such as Azure Express Route. Otherwise, you must consider the current average bandwidth, utilization, predictability, and maximum capabilities of your current public Internet-facing, source-to-destination route.

Note ExpressRoute routes your data through a dedicated connection to Azure without passing through the public Internet. ExpressRoute connections offer more reliability, faster speeds, lower latencies, and higher security than typical Internet connections. For more information, see ExpressRoute.

Using PolyBase for SQL Data Warehouse loads

SQL Data Warehouse supports many loading methods, including SSIS, BCP, the SQLBulkCopy API, and Azure Data Factory (ADF). These methods all share a common pattern for data ingestion. By comparison, the PolyBase technology uses a different approach that provides better performance.

PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable, query processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store and Hadoop database platform ecosystems (APS only).

Note As a general rule, we recommend making PolyBase your first choice for loading data into SQL Data Warehouse unless your data is in a format PolyBase doesn't support. Currently, PolyBase can load data from UTF-8 and UTF-16 encoded delimited text files, as well as the popular Hadoop file formats RC File, ORC, and Parquet. PolyBase can load data from gzip, zlib, and Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-width file formats, compression formats such as WinZip, or semi-structured formats such as JSON and XML.

As the following architecture diagrams show, each HDFS bridge of the DMS service from every Compute node can connect to an external resource such as Azure Blob Storage, and then bidirectionally transfer data between SQL Data Warehouse and the external resource.

Note As of this writing, SQL Data Warehouse supports Azure Blob Storage and Azure Data Lake Store as the external data sources.

Figure 2. Data transfers between SQL Data Warehouse and an external resource


PolyBase data loading is not limited by the Control node, and so as you scale out your DWU, your data transfer throughput also increases. By mapping the external files as external tables in SQL Data Warehouse, the data files can be accessed using standard Transact-SQL commands—that is, the external tables can be referenced as standard tables in your Transact-SQL queries.

Copying data into storage

The general load process begins with migrating your data into Azure Blob Storage. Depending on your network's capabilities, reliability, and utilization, you can use AzCopy to upload your source data files to Azure Storage blobs at a rate of 80 MB/second to 120 MB/second.
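For illustration, an AzCopy (5.x-style) command to upload a local folder of delimited files might look like this (the folder, storage account, container, and key are placeholders):

AzCopy /Source:C:\data\orders /Dest:https://myid.blob.core.windows.net/data_1tb /DestKey:mysecretkey /Pattern:*.txt /NC:8

The /NC parameter controls the number of concurrent operations, which you can tune to your available bandwidth.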

Then, in SQL Data Warehouse, you configure your credentials that will be used to access Azure Blob Storage:

CREATE DATABASE SCOPED CREDENTIAL myid_credential
WITH IDENTITY = 'myid', SECRET = 'mysecretkey';

Next you define the external Azure Blob Storage data source with the previously created credential:

CREATE EXTERNAL DATA SOURCE data_1tb
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://data_1tb@myid.blob.core.windows.net',
      CREDENTIAL = myid_credential);

And for the source data, define the file format and external table definition:

CREATE EXTERNAL FILE FORMAT pipedelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|',
                      STRING_DELIMITER = '',
                      DATE_FORMAT = '',
                      USE_TYPE_DEFAULT = False)
);

CREATE EXTERNAL TABLE orders_ext (
    o_orderkey bigint NULL,
    o_custkey bigint NULL,
    o_orderstatus char(1),
    o_totalprice decimal(15, 2) NULL,
    o_orderdate date NULL,
    o_orderpriority char(15),
    o_clerk char(15),
    o_shippriority int NULL,
    o_comment varchar(79)
)
WITH (LOCATION = '/orders',
      DATA_SOURCE = data_1tb,
      FILE_FORMAT = pipedelimited,
      REJECT_TYPE = VALUE,
      REJECT_VALUE = 0
);

For more information about PolyBase, see SQL Data Warehouse documentation.

Using CTAS to load initial data

Then you can use a CTAS (CREATE TABLE AS SELECT) operation within SQL Data Warehouse to load the data from Azure Blob Storage to SQL Data Warehouse:

CREATE TABLE orders_load
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH(o_orderkey),
      PARTITION (o_orderdate RANGE RIGHT FOR VALUES
                 ('1992-01-01', '1993-01-01', '1994-01-01', '1995-01-01')))
AS SELECT * FROM orders_ext;

CTAS creates a new table. We recommend using CTAS for the initial data load. This is an all-or-nothing operation with minimal logging.

Using INSERT INTO to load incremental data

For an incremental load, use the INSERT INTO operation. This is a fully logged operation, but it has minimal effect on load performance. However, rolling back a large transaction can be expensive, so consider breaking your transaction into smaller batches.

INSERT INTO orders_load
SELECT * FROM orders_current_ext;

Note that the source uses a different external table, orders_current_ext. This is the external table defining the path for the incremental data on Azure Blob Storage; a sketch follows.
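A sketch of that external table, assuming the incremental files land in a separate folder on the same data source (the '/orders_current' location is illustrative):

CREATE EXTERNAL TABLE orders_current_ext (
    o_orderkey bigint NULL,
    o_custkey bigint NULL,
    o_orderstatus char(1),
    o_totalprice decimal(15, 2) NULL,
    o_orderdate date NULL,
    o_orderpriority char(15),
    o_clerk char(15),
    o_shippriority int NULL,
    o_comment varchar(79)
)
WITH (LOCATION = '/orders_current',   -- illustrative path for the incremental files
      DATA_SOURCE = data_1tb,
      FILE_FORMAT = pipedelimited);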

Data reader and writer considerations

SQL Data Warehouse adjusts the number of external readers and writers as you scale. As illustrated in Table 1 below, each DWU level has a specific number of readers, and as you scale out, each node gets additional readers and writers. The number of readers is an important factor in determining your load performance.

Table 1. Number of readers and writers per DWU

| DWU     | 100 | 200 | 300 | 400 | 500 | 600 | 1000 | 1200 | 1500 | 2000 | 3000 | 6000 |
|---------|-----|-----|-----|-----|-----|-----|------|------|------|------|------|------|
| Readers | 8   | 16  | 24  | 32  | 40  | 48  | 80   | 96   | 120  | 160  | 240  | 480  |
| Writers | 60  | 60  | 60  | 60  | 60  | 60  | 60   | 60   | 120  | 120  | 240  | 480  |

Best practices and considerations when using PolyBase

Here are a few more things to consider when using PolyBase for SQL Data Warehouse loads:

  • A single PolyBase load operation provides best performance.
  • The load performance scales as you increase DWUs.
  • PolyBase automatically parallelizes the data load process, so you don’t need to explicitly break the input data into multiple sources and issue concurrent loads, unlike some traditional loading practices.
  • Multiple readers will not work against compressed text files (e.g., gzip). Only a single reader is used per compressed file, since decompressing the file in the buffer is single-threaded. Alternatively, generate multiple compressed files; the number of files should be greater than or equal to the total number of readers.
  • Multiple readers will work against compressed columnar/block format files (e.g. ORC, RC) since individual blocks are compressed independently.

Known issues when working with different file formats

In addition to the UTF-8/UTF-16 encoding considerations, other known file format issues can arise when using PolyBase.

Mixed intra-file date formats

In a CREATE EXTERNAL FILE FORMAT command, the DATE_FORMAT argument specifies a single format to use for all date and time data in a delimited text file. If the DATE_FORMAT argument isn't designated, the following default formats are used:

  • DateTime: 'yyyy-MM-dd HH:mm:ss'
  • SmallDateTime: 'yyyy-MM-dd HH:mm'
  • Date: 'yyyy-MM-dd'
  • DateTime2: 'yyyy-MM-dd HH:mm:ss'
  • DateTimeOffset: 'yyyy-MM-dd HH:mm:ss'
  • Time: 'HH:mm:ss'

For source formats that don't reflect the defaults, you must explicitly specify a custom date format, as sketched below. However, if multiple non-default formats are used within one file, there is currently no way to specify multiple custom date formats within the PolyBase command.
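For example, a file format for pipe-delimited files whose dates look like 12/24/2016 might be defined as follows (the format name and date pattern are illustrative):

CREATE EXTERNAL FILE FORMAT pipedelimited_usdates
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|',
                      DATE_FORMAT = 'MM/dd/yyyy',
                      USE_TYPE_DEFAULT = False));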

Fixed-length file format not supported

Fixed-length character file formats—for example, where each column has a fixed width of 10 characters—are not supported today.

If you encounter these PolyBase restrictions, consider changing the data extract process to address the limitations, for example by formatting dates in a PolyBase-supported format or by transforming JSON files to text files. If that is not possible, your remaining option is to use one of the methods described in the next section.

Using Control-node and single-client gated load methods

In the Architecture section we mentioned that all incoming connections go through the Control node. Although you can increase and decrease the number of compute resources, there is only a single Control node. And as mentioned earlier, one reason why PolyBase provides a superior load rate is that PolyBase data transfer is not limited by the Control node. But if using PolyBase is not currently an option, the following technologies and methods can be used for loading into SQL Data Warehouse:

  • BCP
  • Bulk Insert
  • SSIS
  • SQLBulkCopy
  • Azure Data Factory (ADF)

Note By default, ADF uses the same engine as SQLBulkCopy. However, there is an option to use PolyBase so you can leverage the performance improvement.  See Copy activity and performance tuning guide for performance reference and detailed information.

For these load methods, the bottleneck is on the client machine and the single Control node. Each load uses a single core on the client machine and accesses only the single Control node. Therefore, the load does not scale if you increase DWUs for a SQL Data Warehouse instance.

Note You can, however, increase load throughput if you add parallel loads into either the same table or different tables.

When connecting via a Control-node load method such as SSIS, the single point of entry constrains the maximum throughput you can achieve with a single connection.

Figure 3. Using SSIS, a Control-node load method, for SQL Data Warehouse loading


To further maximize throughput, you can run multiple loads in parallel as the following diagram shows:

Figure 4. Using SSIS (parallel loading) for SQL Data Warehouse loading


Using multiple concurrent client executions should improve your load throughput, but only to a point: once the maximum capacity of the Control node is reached, additional parallel loads no longer improve throughput.

Best practices and considerations for single-client gated load methods

Consider the following when using SSIS, BCP, or other Control-node and client-gated loading methods:

  • Include retry logic—very important for slower methods such as BCP, SSIS, and SQLBulkCopy.
  • For SSIS, consider increasing the client/connection timeout from the default 30 seconds to 300 seconds. For more information about moving data to Azure, see SSIS for Azure and Hybrid Data Movement.
  • Don't specify the batch size with Control-node gated methods. The goal is to load all or nothing, so that the retry logic can simply restart the load. If you designate a batch size and the load encounters a failure (for example, the network or database becomes unavailable), you may need additional logic to restart from the last successful commit.

Comparing load method performance characteristics

The following table details the results of four separate Azure SQL Data Warehouse load tests using PolyBase, BCP, SQLBulkCopy/ADF, and SSIS:

Table 2. SQL Data Warehouse performance testing results

|                                             | PolyBase | BCP | SQLBulkCopy/ADF | SSIS |
|---------------------------------------------|----------|-----|-----------------|------|
| Load rate (1 = fastest, 4 = slowest)        | 1        | 2   | 3               | 4    |
| Rate increases as you increase DWU          | Yes      | No  | No              | No   |
| Rate increases as you add concurrent loads  | No       | Yes | Yes             | Yes  |

As you can see, the PolyBase method shows a significantly higher throughput rate compared to BCP, SQLBulkCopy, and SSIS Control-node client gated load methods. If PolyBase is not an option, however, BCP provides the next best load rate.

Regarding loads that improved with concurrent load (the third row in the table), keep in mind that SQL Data Warehouse supports up to 32 concurrent queries (loads). For more information about concurrency, see Concurrency and workload management in SQL Data Warehouse.

Conclusion

SQL Data Warehouse provides many options to load data, as we discussed in this article. Each method has its own advantages and disadvantages. It's easy to "lift and shift" your existing SSIS packages, BCP scripts, and other Control-node client gated methods to minimize migration effort. However, if you require higher data ingestion speeds, consider rewriting your processes to take advantage of PolyBase, with its high-throughput, highly scalable loading methodology.

from SQL Server Customer Advisory Team https://blogs.msdn.microsoft.com/sqlcat/2017/05/17/azure-sql-data-warehouse-loading-patterns-and-strategies/

Build a recommendation system with the support for graph data in SQL Server 2017 and Azure SQL DB

Authored by Arvind Shyamsundar and Shreya Verma

Reviewed by Dimitri Furman, Joe Sack, Sanjay Mishra, Denzil Ribeiro, Mike Weiner, Rajesh Setlem

Graphs are a very common way to represent networks and relationships between objects. Historically, it has not been easy to represent such data structures in relational databases like SQL Server and Azure SQL DB. To address this requirement, in November 2016 (through a private preview program for a set of early adopter customers) we introduced extensions to T-SQL which allow us to natively store and query graphs inside the database.

We recently made these features publicly available as part of the SQL Server 2017 CTP 2.0 release (note that the feature is still in private preview for Azure SQL DB at this time). Please review this related blog post for an overview of the feature. In this blog post, we look at a typical use case for graph data in SQL Server and Azure SQL DB.

Scenario

A common scenario we've seen with our early adopter customers is their interest in using graph technology to implement 'recommendation systems'. For this walkthrough, imagine we have to implement a recommendation system for songs. Specifically, let's imagine a scenario where there's a user who likes Lady Gaga's song 'Just Dance'. Our objective is to implement a recommendation system which will suggest songs similar to 'Just Dance'. So, how do we get started? First, we need data!

What data can we use?

Many approaches to implementing recommendation systems involve using two distinct sets of data: one which contains users, and the other which contains details of the entities that those users are related to.

  • In retail scenarios, these would be the products purchased by the user.
  • In our current scenario, these are the songs which those users listened to.

It so happens that there is an amazing source of such data for songs and ‘user tastes’ (which songs did each user listen to) available online. This dataset is called the Million Song Dataset (MSD), and while it has a lot of other information, the specific subset of data that is of immediate interest to us is summarized below:

  • The list of all the songs is contained in a delimited file available here. There are a million songs in this dataset.
  • On the MSD website there is a link to another dataset called the 'User Taste Profile' data, which contains (anonymized) user listening profiles and is available here. There are a million unique users, and a total of 48 million 'relationships' (each corresponding to a row in this file) in this dataset.

What algorithm?

Now that we know what data is available to us, let's think about the algorithm to be used. A standard approach called collaborative filtering can be used in conjunction with our graph data. Presented below is a simplified graphical representation of the algorithm that we will use.

As you can see from the animation, the algorithm is quite simple:

  • First, we identify the user and ‘current’ song to start with (red line)
  • Next, we identify the other users who have also listened to this song (green line)
  • Then we find the other songs which those other users have also listened to (blue, dotted line)
  • Finally, we direct the current user to the top songs from those other songs, prioritized by the number of times they were listened to (this is represented by the thick violet line.)

The algorithm above is quite simple, but as you will see it is quite effective in meeting our requirement. Now, let’s see how to actually implement this in SQL Server 2017.

Implementation

To begin, we recommend that you quickly review this feature overview video as well as the official documentation links for more details on the new functionality:

Once you have the background, it's easy to understand how to represent the scenario as 'graph tables' in SQL Server 2017. We will create two 'node' tables, one for the users and one for the songs, and then 'connect' these two node tables with an 'edge' table. Here's a quick visual summary of what we will be doing:

Importing the data

Now, let’s get to the nuts and bolts! The first step is to declare tables into which we will insert the source data. These ‘staging’ tables are ‘regular’ tables and have no ‘graph’ attributes. Here are the scripts for this:
https://gist.github.com/arvindshmicrosoft/ae25112b146ce276f375c8908d780fe8.js?file=0_CreateSourceTables.sql

The next step is to use the OPENROWSET BULK functionality to rapidly ingest the text files into their staging tables in SQL Server. Here are the scripts for this:
https://gist.github.com/arvindshmicrosoft/ae25112b146ce276f375c8908d780fe8.js?file=1_ImportSourceTables.sql

Constructing the graph

Once we have the raw data in staging tables, we can then ‘convert’ them into their Graph equivalents. Here are the table definitions; note the usage of AS NODE and AS EDGE to define the tables involved in the graph:
https://gist.github.com/arvindshmicrosoft/ae25112b146ce276f375c8908d780fe8.js?file=2_CreateGraphTables.sql
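If the embedded script doesn't render for you, here is a minimal sketch of the shape of these definitions (the schema is illustrative and abbreviated; the Likes edge table name matches the results discussed later):

-- Node tables for users and songs (illustrative, abbreviated schema)
CREATE TABLE dbo.Users (
    UserId varchar(50) NOT NULL
) AS NODE;

CREATE TABLE dbo.Songs (
    SongId varchar(50) NOT NULL,
    Title  nvarchar(250),
    Artist nvarchar(250)
) AS NODE;

-- Edge table connecting users to the songs they listened to
CREATE TABLE dbo.Likes (
    PlayCount int
) AS EDGE;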

To actually ‘convert’ the data, we use INSERT…SELECT statements as shown below:

https://gist.github.com/arvindshmicrosoft/ae25112b146ce276f375c8908d780fe8.js?file=3_ConvertData.sql
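The essence of the conversion, sketched against the illustrative schema above (the staging table and column names are assumptions), is that node rows are inserted like ordinary rows, while edge rows connect the $node_id values of their two endpoints:

-- Illustrative sketch: populate the Likes edge from a staging table of (user, song, play count)
INSERT INTO dbo.Likes ($from_id, $to_id, PlayCount)
SELECT u.$node_id, s.$node_id, st.PlayCount
FROM dbo.StagingTastes AS st
JOIN dbo.Users AS u ON u.UserId = st.UserId
JOIN dbo.Songs AS s ON s.SongId = st.SongId;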

Querying the graph

Now that we have all the data in 'graph' form, we can proceed to use the new MATCH predicate to express our query over the set of nodes and edges. The query below finds songs that are similar to Lady Gaga's song 'Just Dance'!

https://gist.github.com/arvindshmicrosoft/ae25112b146ce276f375c8908d780fe8.js?file=4_FindRelatedSongs.sql
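In spirit, the query is a collaborative-filtering pattern over the graph; a simplified sketch against the illustrative schema above (the actual script is in the linked gist) looks like this:

-- Find the top songs listened to by other users who also listened to 'Just Dance'
SELECT TOP (10) s2.Title, COUNT(*) AS strength
FROM dbo.Users u1, dbo.Likes l1, dbo.Songs s1,
     dbo.Users u2, dbo.Likes l2, dbo.Likes l3, dbo.Songs s2
WHERE MATCH(u1-(l1)->s1<-(l2)-u2-(l3)->s2)
  AND s1.Title = N'Just Dance'
  AND s2.SongId <> s1.SongId
GROUP BY s2.Title
ORDER BY strength DESC;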

Optimizing performance

The above query performs relatively quickly, in around 3 seconds on a laptop with an i7 processor. Consider that this query has to deal with a million users, a million songs, and 48 million relationships between those entities. Most of the cost is the time to scan through the tables, one row at a time, and then match them using hash joins, as you can see in the execution plan:

[Figure: execution plan for the MATCH query over heap tables, showing table scans and hash joins]

While 3 seconds is not bad, can we make this even faster? The good news is that in SQL Server 2017 CTP 2.0, graph tables support clustered columnstore indexes. While the compression offered will definitely help reduce I/O, the bigger benefit is that queries on these tables leverage 'batch-mode' execution, which allows much faster query execution. This is really useful for us, given that the above query performs large aggregations (GROUP BY). Let's proceed to create these clustered columnstore indexes:
https://gist.github.com/arvindshmicrosoft/ae25112b146ce276f375c8908d780fe8.js?file=5_ImprovePerf.sql
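For a heap such as the Likes edge table, this is essentially a one-line operation (the index name is illustrative):

-- Convert the edge table to a clustered columnstore index
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Likes ON dbo.Likes;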

Once we create these indexes, the performance actually improves substantially and reduces the query execution time to half a second, which is 6x faster than before. That’s really impressive considering the sheer amount of data that the query needs to look at to arrive at the result!
Let’s take a minute to look at the new execution plan. Observe the ‘Batch mode’ execution highlighted below:

[Figure: execution plan showing batch-mode execution and the new adaptive join]

The other interesting thing to note is the new adaptive join type highlighted above. This is great to see: queries on graph data benefit from these new query processing improvements inside SQL Server 2017!

Let’s summarize the ‘before’ and ‘after’ states:

|                        | Query execution time (seconds) | Logical reads (Likes table) | Space occupied by the Likes table on disk |
|------------------------|--------------------------------|-----------------------------|-------------------------------------------|
| Heap tables            | 3.6                            | 588,388                     | 3.4 GB                                    |
| Clustered columnstore  | 0.6                            | 174,852                     | 1.7 GB                                    |

In summary, having graph data inside SQL Server allows database administrators and developers to leverage the familiar, mature and robust query processing capabilities within SQL Server. This is crucial to reducing the learning curve and likely complexity associated with using other technologies to store and query graph data.

Visualizing graphs

While we can use external applications and tools like PowerBI to visualize graphs, the ‘icing on the cake’ is the fact that we can use R Services in SQL Server to visualize graph data. With an open-source R package called ‘igraph’ we can visualize graphs relatively easily and render them to standard image formats like PNG. Here is a code snippet showing you how that can be done:

https://gist.github.com/arvindshmicrosoft/ae25112b146ce276f375c8908d780fe8.js?file=6_VisualizationUsingR.sql

Here's a section of the generated visualization (refer to the comments in the above script to understand what the visualization represents). While it is quite basic, as you can see it is functionally very useful:

[Figure: sample section of the graph visualization rendered with igraph]

Conclusion

The support for graph data in SQL Server 2017 is an exciting new development and opens up doors to a new category of workloads which can leverage this functionality. It is one more step in bringing algorithms and intelligence closer to where the data resides.

Recommendation systems (such as the simple example presented here), fraud detection systems, content and asset management, and many other scenarios can also benefit from the integration that graph data in SQL Server 2017 offers. The support for graph data in the database will also be publicly available for Azure SQL DB in due course.

The complete code for this walkthrough is available here. Please use the Comments section below to ask questions and provide your feedback. We are eager to hear from you!

Citations

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

The Echo Nest Taste profile subset, the official user data collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/tasteprofile

from SQL Server Customer Advisory Team https://blogs.msdn.microsoft.com/sqlcat/2017/04/21/build-a-recommendation-system-with-the-support-for-graph-data-in-sql-server-2017-and-azure-sql-db/

Try and try again: not always a good idea (at least not for SSMS!)

Contributions from, and reviewed by: Ken Van Hyning, David Shiflet, Charles Gagnon and Alan Ren (SSMS dev team), Dimitri Furman, Mike Weiner and Rajesh Setlem (SQLCAT)

Background

SQL Server Management Studio (SSMS) is the most popular client used to administer and work with SQL Server and Azure SQL DB. Internally, the SSMS code uses the SqlClient class (implemented in the .NET Framework) to connect to SQL Server. Recent versions of SSMS have been compiled with .NET Framework 4.6.1. What this means is that SSMS gets to leverage many of the newer capabilities in the .NET Framework (for example, the changes for Always Encrypted). It also implies that SSMS ‘gets for free’ some underlying changes in .NET. Many of these new, ‘for free’ behaviors in SqlClient were aimed at ‘cloud applications’ connecting to Azure SQL Database.

One such change SSMS got for free is the connection resiliency logic within the SqlConnection.Open() method. To improve the default experience for clients connecting to Azure SQL Database, this method will now (in the case of initial connection errors or timeouts) retry once after sleeping for 10 seconds. These numbers are configurable through properties called ConnectRetryCount (default value 1) and ConnectRetryInterval (default value 10 seconds). Previous versions of the SqlConnection class did not automatically retry in cases of connection failure.
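For context, these are ordinary SqlClient connection string keywords; in a .NET connection string they would appear along these lines (server and database names are illustrative):

Server=myserver;Database=master;Integrated Security=true;ConnectRetryCount=1;ConnectRetryInterval=10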

In general, because of these changes, transient errors (across slow networks or when working with Azure SQL Database) are less frequent. However, considering that many SSMS users still work with 'regular' SQL Server, either in a VM in the cloud or on-premises, there is a subtle but distinct impact of these changes which may affect administrators of such databases.

Impact of these changes

Take, for example, a case where the very first user database on your server is inaccessible, for example when the database is either offline or in a recovering state as the non-readable secondary in an Availability Group. It so happens that when you expand the Databases section in SSMS for a server, SSMS enumerates all the databases and tries to read some information from the very first database in the list. When that first database is non-readable, that constitutes a 'connection error' for SqlClient. In turn, the connection resiliency logic kicks in, sleeps for 10 seconds, and retries the connection. The second connection attempt will obviously also fail, after which SSMS returns to a responsive state. Unfortunately, during those 10 seconds the user perceives a hang in the SSMS application.

Reverting to original behavior

There is a simple workaround for this situation: add the following parameter string in the 'Additional Connection Parameters' tab of the SSMS connection window. The good news is that you only need to do this once, as the property is saved for future sessions to that SQL Server (until, of course, you remove it later).

ConnectRetryCount=0

Here’s a screenshot to help as well:

Adding the ConnectRetryCount parameter in the SSMS 'Additional Connection Parameters' dialog

Do note that ConnectRetryCount is a single term (no spaces in between!). Ideally, we would want to expose this through a setting in the Connection Properties screen in SSMS. For now, you can use the above method to revert to the original behavior. If you'd like to see this setting exposed as a GUI option in SSMS, let us know by leaving your comments below. We are eager to hear from you!

from SQL Server Customer Advisory Team https://blogs.msdn.microsoft.com/sqlcat/2017/04/06/try-and-try-again-not-always-a-good-idea-at-least-not-for-ssms/

SQLCAT at SQLBits 2017

The widely popular SQLBits conference will take place April 5-8 2017 in Telford, UK. Several SQLCAT team members will be presenting a number of breakout sessions. Our topics this year are SQL Server on Linux and Azure SQL Database.

SQLCAT sessions are unique. We bring in real customer stories, and present their deployments, architectures, challenges, and lessons learned. This year, we will have four sessions at SQLBits:

Azure SQL Database best practices in resiliency design, Friday April 7, 11:00.

HADR for SQL Server on Linux, Friday April 7, 15:30.

Migrating from SQL Server to Azure SQL Database, Friday April 7, 17:00.

Troubleshooting common DBA scenarios with SQL on Linux, Saturday April 8, 12:30.

In addition to the sessions, you can also find us at the Microsoft stand, and at the “Ask the Experts” event. We hope that you can attend, and are looking forward to seeing you all in Telford in less than a week!

from SQL Server Customer Advisory Team https://blogs.msdn.microsoft.com/sqlcat/2017/03/30/sqlcat-at-sqlbits-2017/