Transforming Roles: The Evolution of SQL Database Administrators

The role of a SQL Database Administrator (DBA) has evolved significantly over the years as technology and business needs have advanced. This evolution can be traced back to the early days of databases and continues to adapt to the latest trends in data management. In this post, we will explore how the SQL Database Administrator position has changed over the years, focusing on key developments and shifts in responsibilities.

Early Days of SQL DBA (1970s-1980s)

The history of SQL databases can be traced back to the 1970s when the concept of relational databases was first introduced by Edgar F. Codd. During this era, SQL databases were primarily used by large organizations for data storage and retrieval. The role of a SQL DBA was relatively straightforward, involving tasks such as data modeling, schema design, and query optimization.

SQL DBAs in the early days were responsible for managing physical storage, ensuring data integrity, and optimizing database performance. They worked closely with developers to design efficient database schemas and tune SQL queries for better performance. However, the scope of their responsibilities was limited compared to what it would become in the future.

The Rise of Enterprise Databases (1990s)

The 1990s saw the proliferation of enterprise databases, with Microsoft SQL Server, Oracle, and IBM DB2 gaining popularity. This period marked the beginning of a significant shift in the role of SQL DBAs. As organizations increasingly relied on databases to store critical business data, SQL DBAs became more integral to the IT infrastructure.

During the 1990s, SQL DBAs were tasked with database installation, configuration, and maintenance. They had to ensure high availability and data backup strategies to prevent data loss. Additionally, security became a more prominent concern, with SQL DBAs responsible for implementing access controls and encryption to protect sensitive data.

Internet Boom and E-Commerce (Late 1990s-2000s)

The late 1990s and early 2000s witnessed the explosion of the internet and the rise of e-commerce. This had a profound impact on the role of SQL DBAs. Databases became the backbone of online applications, and uptime and scalability became paramount.

SQL DBAs had to adapt to the demands of 24/7 availability and handle large volumes of data. They were now responsible for performance tuning on a massive scale, employing techniques like indexing, caching, and partitioning to ensure fast query response times. Scaling databases horizontally and vertically to accommodate growing workloads became a common challenge.

Cloud Computing Era (2010s)

The 2010s brought about a significant transformation in the IT landscape with the advent of cloud computing. Cloud-based databases, such as Amazon RDS, Azure SQL Database, and Google Cloud SQL, became popular choices for organizations looking to reduce infrastructure costs and increase scalability.

SQL DBAs had to adapt to managing databases in the cloud, which introduced new challenges and opportunities. They had to master cloud-specific database services and learn how to optimize costs while maintaining performance and security. Automation and scripting also became crucial skills as cloud providers offered tools for infrastructure as code (IAC) and database management.

Data Explosion and Big Data (2010s-Present)

The explosion of data in the 2010s, driven by social media, IoT devices, and increased digitization, posed another major shift in the role of SQL DBAs. Traditional relational databases were no longer sufficient to handle the sheer volume of data being generated.

SQL DBAs had to adapt to the world of big data, which included technologies like Hadoop, NoSQL databases, and distributed data processing frameworks. They needed to understand when to use traditional SQL databases and when to leverage alternative solutions for specific use cases. This required a broader skill set and the ability to work with a variety of data storage and processing technologies.

Data Security and Compliance (2010s-Present)

With data breaches becoming more prevalent, data security and compliance became a top priority for organizations. SQL DBAs found themselves taking on additional responsibilities related to securing data, implementing encryption, and ensuring compliance with regulations such as GDPR and HIPAA.

SQL DBAs also had to stay updated on the latest security threats and vulnerabilities and implement best practices to protect databases from unauthorized access and data breaches. This aspect of the role required a deep understanding of cybersecurity principles and the ability to work closely with security teams.

Automation and DevOps (2010s-Present)

In recent years, automation and DevOps practices have transformed the way SQL DBAs work. DevOps principles emphasize collaboration between development and operations teams, with a focus on automating repetitive tasks and achieving continuous integration and continuous delivery (CI/CD).

SQL DBAs have embraced automation tools and scripting languages to streamline database deployment, configuration management, and monitoring. They now play a critical role in enabling the rapid release of database changes while maintaining stability and reliability. This shift has also led to a more proactive approach to database management, with DBAs actively participating in the development process.

Data Analysis and Business Intelligence (2010s-Present)

As organizations recognize the value of data-driven decision-making, SQL DBAs have expanded their roles to include data analysis and business intelligence (BI) tasks. They are now involved in creating data warehouses, designing data models for analytics, and supporting BI tools like Tableau, Power BI, and QlikView.

SQL DBAs work closely with data analysts and data scientists to ensure that data is available, accurate, and accessible for reporting and analysis. This shift highlights the need for SQL DBAs to have a broader understanding of the business context and the ability to translate data into actionable insights.

Machine Learning and AI Integration (2020s-Present)

The integration of machine learning (ML) and artificial intelligence (AI) into applications has further expanded the role of SQL DBAs. They are now tasked with managing databases that store and serve data for ML and AI models. This includes optimizing database performance for real-time inference, handling large datasets for training, and ensuring data quality for ML/AI algorithms.

SQL DBAs may also collaborate with data scientists to deploy ML models within databases and establish data pipelines that feed data to these models. This intersection of traditional database management and emerging technologies highlights the evolving nature of the role.

Conclusion

In conclusion, the role of a SQL Database Administrator has undergone significant changes over the years, reflecting advancements in technology, the growth of data, and evolving business needs. From its humble beginnings as a data storage and retrieval specialist, the SQL DBA has become a multifaceted professional responsible for ensuring the availability, performance, security, and strategic use of data within organizations.

As we move into the future, SQL DBAs will continue to adapt to emerging trends, including cloud-native databases, data analytics, AI/ML integration, and the evolving cybersecurity landscape. The ability to learn and evolve with the ever-changing technology landscape will remain a key characteristic of successful SQL DBAs, ensuring their continued relevance in the world of data management and IT infrastructure.

Enhanced patching for SQL Server on Azure VM with Azure Update Manager

With Azure Update Manager, unlike with the existing Automated Patching feature, you’ll be able to automatically install SQL Server Cumulative Updates (CUs), in addition to updates that are marked as Critical or Important.    

Azure Update Manager is a unified service that helps manage updates for all your machines. By enabling Azure Update Manager, customers will now be able to:   

  • Perform one-time updates (or maybe Patch on-demand): Schedule manual updates on demand 
  • Update management at scale: patch multiple VMs at the same time 
  • Configure schedules: configure robust schedules to patch groups of VMs based on your business needs:  
  • Periodic Assessments:  Automatically check for new updates every 24 hours and identify machines that may be out of compliance  

thumbnail image 1 of blog post titled 

							Announcing public preview of enhanced patching for SQL Server on Azure VM with Azure Update Manager

 

Read more here…

SSIS Series: How to use Conditional Split

From Microsoft, the Conditional Split transformation can route data rows to different outputs depending on the content of the data. The implementation of the Conditional Split transformation is similar to a CASE decision structure in a programming language. The transformation evaluates expressions, and based on the results, directs the data row to the specified output. This transformation also provides a default output, so that if a row matches no expression it is directed to the default output.

You can configure the Conditional Split transformation in the following ways:

  • Provide an expression that evaluates to a Boolean for each condition you want the transformation to test.
  • Specify the order in which the conditions are evaluated. Order is significant, because a row is sent to the output corresponding to the first condition that evaluates to true.
  • Specify the default output for the transformation. The transformation requires that a default output be specified.

Let’s take a look at how this transformation might be used in the real world.

Open Visual Studio and drag a Data Flow task into the design pane. Open the Data Flow task and drag in an OLE DB Source task. For this post, I’m going to use the AdventureWorks2019 database and the HumanResources.vEmployeeDepartment view.

This view has some good data to play around with, but we’re going to focus on the Department and Start Date columns. Let’s pretend the bossman needs to see all of the Employees in the Quality Assurance (QA), Production and Sales Department in a separate database table. Bonus, he needs to see all of the Production employees split up into two more tables based on who started before and after Jan 1, 2010. All other employees can go into their own table. Got it? Great! That’s 5 total tables. QA=1, Production=2, Sales=1, Leftovers=1 Let’s go.

Back in Visual Studio, drag in a Conditional Split task and connect it to our OLE DB Source.

Open the Conditional Split task editor and you’ll see a few options (from left to right, top to bottom):

  1. We can use columns and/or variables and parameters in our expressions that define how to split the data flow.
  2. We can use functions such as Date/Time, NULL and String in our expressions that define how to split the data flow.
  3. These are the conditions that define how to split the data flow. These need to be set in priority order; any rows that evaluate to true for one condition will not be available to the condition that follows.


Let’s start adding some conditions for our data. First, we’ll add a condition for all of our QA Department Employees. I’ll name the output “QA” and my condition is pretty simple whereas Department == “Quality Assurance”.

I’ll do the same for Production, Sales and Leftovers (everything else that doesn’t satisfy a condition). Since Leftovers is everything else we’ll just change the name of the Default Output name to identify it.

Let’s go ahead and add our destination tasks (except for Production since we need another condition) and link them to the appropriate condition. See below for QA as an example. When we drag our connector to our destination task we get prompted with an Input Output Selection box. Here is where we choose our Condition that will match up with our table. For the screenshot below, we’ll choose QA output for our QA destination.

Now that we have QA mapped, go ahead and map Sales and Leftovers.

Looks great!  QA, Sales and Leftovers are mapped successfully. Let’s take a look at adding another Conditional Split task for Production. Drag a Conditional Split task into the design pane and connect it to the current Conditional Split. It automatically maps to Production since it’s the only output left.

From our new Conditional split task, let’s open the editor and configure the date conditions for Production. We’ll leave the Default output name box as is since we shouldn’t have any leftover data from this split.

Now we can map the two new conditions to their appropriate destinations.

Cross fingers and hit Execute.

Yay, no red X! Let’s take a look at our SQL tables to make sure everything exported correctly.

Boom. Let’s go grab a bourbon!

Using ChatGPT with SQL Server

If you’re in the tech field you’ve probably heard the buzz around ChatGPT and Artificial Intelligence.

ChatGPT is a pre-trained language model developed by OpenAI. It uses deep learning techniques, specifically a transformer neural network architecture, to generate human-like text. It can be fine-tuned on specific tasks such as language translation, question answering, and text summarization. ChatGPT can be used to generate a wide range of text, including conversation, news articles, stories, and more. Because it is pre-trained on a large corpus of text, it can generate text that is often indistinguishable from text written by a human.

To preview ChatGPT, shoot over to https://openai.com, create a login and you can try it for free (at the time this post was written).

For this post, since I’m a DBA, I want to test how it performs when I give it questions about SQL statements.

So let’s start! I asked the following question:

“How can ChatGPT be used to learn TSQL?”

And here’s a great response:

ChatGPT can be used to learn TSQL in several ways:

  1. You can use ChatGPT to generate TSQL code snippets and queries based on natural language prompts. For example, you can ask ChatGPT to "generate a TSQL query to select all columns from a table named 'customers'" and it will respond with the appropriate TSQL code.
  2. You can also use ChatGPT to explain TSQL concepts and provide examples. For example, you can ask ChatGPT to "explain how to use the JOIN clause in TSQL" and it will provide a detailed explanation and examples of how to use the JOIN clause in TSQL.
  3. You can also train ChatGPT model on TSQL text data and fine-tune the model for generating TSQL queries, this way you can generate more accurate and complex TSQL queries.
  4. You can use ChatGPT as a code completion tool to assist you in writing TSQL code.

 

Not a bad start. Let’s see if it can create a database.

“Can you write SQL code to create a SQL database named AI?”

Look correct. Let’s test in SQL Management Studio.

OK, OK, that was too easy ChatGPT. Let’s see if it can create a table in our new database and insert some dummy data.

“Can you write SQL code to create a table called Orders in my AI database and populate with some test data?”

Let’s test the code.

Looks good. How about a basic SELECT statement?

“Can you write SQL code to give me all data from my Orders table?”

Again, looks good. It even goes as far as explaining it and giving an example by only selecting specific rows. This is great for learning.

Let’s see if it can create an aggregate query.

“Can you write SQL code to give me the Customer that has the highest TotalAmount from the Orders table?”

From my question, I was hoping for something a little different, but I can’t get mad at the results. The query uses LIMIT 1 instead of SELECT TOP 1 which would work in MySQL, but doesn’t exist in SQL Server. It was really close though and it even gives me an example and explanation of how this would be used in a JOIN.

Let’s be more specific. If I change the question to specify SQL Server:

“Can you write SQL code to give me the Customer that has the highest TotalAmount from the Orders table in SQL Server?”

Again, not really what I was looking for, but it works. ChatGPT ended up writing more code than it needed and even wrote this statement in a common table expression (CTE). I was looking for something more along the lines of this:

SELECT TOP 1 CustomerID, MAX(TotalAmount) as MaxAmount
FROM Orders
GROUP BY CustomerID
ORDER BY MaxAmount DESC

Either way, both statements work and produce the same results.

How about something a little more difficult such as creating a partition. Partitions are heavily used in other database platforms so I’m going to specify SQL Server again in this question.

“Can you write T-SQL code to partition my Orders table, OrderDate column by year on SQL Server?”

This was a little more challenging and it wrote out the Partition Function and Partition Scheme statements correctly, but it added a OrderDateRange column as an integer and then it tries to create a clustered index where OrderDate is datetime and OrderDateRange is int so the end result is a failure.

All in all, I think this is a great tool for learning basic (and even some advanced) SQL, but it still has some bugs to work out before it tries to replace me. 😉

 

SSIS Series: How to use SSIS Balanced Data Distributor

From Microsoft, the Balanced Data Distributor (BDD) transformation takes advantage of concurrent processing capability of modern CPUs. It distributes buffers of incoming rows uniformly across outputs on separate threads. By using separate threads for each output path, the BDD component improves the performance of an SSIS package on multi-core or multi-processor machines.

The Balanced Data Distributor transformation helps improve performance of a package in a scenario that satisfies the following conditions:

  1. There is large amount of data coming into the BDD transformation. If the data size is small and only one buffer can hold the data, there is no point in using the BDD transformation. If the data size is large and several buffers are required to hold the data, BDD can efficiently process buffers of data in parallel by using separate threads.
  2. The data can be read faster than the rest of the data flow can process it. In this scenario, the transformations that are performed on the data run slowly compared to the rate at which data is coming. If the bottleneck is at the destination, the destination must be parallelizable though.
  3. The data does not need to be ordered. For example, if the data needs to stay sorted, you should not split the data using the BDD transformation.

Let’s dive in and take a quick look at how the Balanced Data Distributor works.

Go ahead and open Visual Studio, it might take a minute to load.

We’re going to use AdventureWorks2019 database for this example. The SalesOrderDetail table has over 121k records so that’s a good candidate.

SELECT * FROM Sales.SalesOrderDetail

Now that we have data let’s go back over to Visual Studio and see if it’s still spinning.

Drag in a Data Flow task, double click to open and then let’s drag in an OLE DB Source and a Flat File Destination task.

I’m going to configure the source to point to my AdventureWorks2019 database and the destination to point to a .csv file on my local laptop.

Easy enough. Let’s go ahead and run this and see what happens.

We can see that our 121k rows were written to our CSV file and the CSV file ended up being 12.3 MB. That’s pretty large for an email file attachment and might crash your laptop even trying to open this file. BDD not only offers performance benefits by using multiple threads, but it can also break up large files into smaller ones. IMO, this is what makes this task great.

We need to get this file down to less than 4MB so we’ll need to break this up 4 times. With that said,  let’s add the BDD task between the source and destination tasks. There is nothing configurable *in* this task, however, there are some properties that may need to be tweaked. After adding the BDD task, we’ll need to add 3 more flat file destination tasks and 3 more flat file connection managers. End result should look something like this:

That’s really about it. Let’s fire it off and see the results of our flat files.

It worked! 121k rows were written across 4 flat files. Our flat file size is less than 4MB each and we can send these very easily via email. This is a very quick and easy way to split data between files, however, there is no order to these records so an ORDER BY is not helpful here. If the goal is to separate data by a category or condition then you’ll need to use a Conditional Split task which I wrote about in this post.

SSIS Series: How to use SSIS Aggregate Data Flow Task – Basic and Advanced

From Microsoft, the Aggregate transformation applies aggregate functions, such as Average, to column values and copies the results to the transformation output. Besides aggregate functions, the transformation provides the GROUP BY clause, which you can use to specify groups to aggregate across.

The Aggregate transformation supports the following operations.

Operation Description
Group by Divides datasets into groups. Columns of any data type can be used for grouping. For more information, see GROUP BY (Transact-SQL).
Sum Sums the values in a column. Only columns with numeric data types can be summed. For more information, see SUM (Transact-SQL).
Average Returns the average of the column values in a column. Only columns with numeric data types can be averaged. For more information, see AVG (Transact-SQL).
Count Returns the number of items in a group. For more information, see COUNT (Transact-SQL).
Count distinct Returns the number of unique nonnull values in a group.
Minimum Returns the minimum value in a group. For more information, see MIN (Transact-SQL). In contrast to the Transact-SQL MIN function, this operation can be used only with numeric, date, and time data types.
Maximum Returns the maximum value in a group. For more information, see MAX (Transact-SQL). In contrast to the Transact-SQL MAX function, this operation can be used only with numeric, date, and time data types.

This transformation could also be implemented with a SQL query after an initial load, but this task makes it nice to take care of the transformation within SSIS so that the data is ready on export. Let’s take a quick look at how it works.

Using AdventureWorks database, I’ll run the following query which gives me everyone in the Person.Person table that is located in the US.

SELECT firstname + ' ' + lastname AS PersonName, AddressLine1, AddressLine2, City
,sp.[Name] AS STATE, PostalCode, CountryRegionCode
FROM person.Person p
JOIN person.BusinessEntityAddress bea ON p.BusinessEntityID = bea.BusinessEntityID
JOIN person.Address a ON bea.AddressID = a.AddressID
JOIN person.StateProvince sp ON a.StateProvinceID = sp.StateProvinceID
WHERE CountryRegionCode = 'US'

For this post, let’s pretend our boss wants to see how many people we have in each state. So a simple COUNT(*) and GROUP BY State query would work, but we’ll use SSIS to export this into a table using the Aggregate task.

Fire up Visual Studio and let’s create a Data Flow Task, add source connection and destination connection.

Next, let’s add our Aggregate task between our Source and Destination and open the editor:

This example is pretty straightforward as we want to do a COUNT on all records so we’ll choose (*) under Input Column, give it a name under Output Alias and the only available option for * is Count All.

Next, we want to get the number of people per state, so we’ll add State to our Input Column, give it a name, and select GROUP BY for our operation:

We can use multiple options on this screen such as SUM, AVG, MIN, MAX, etc. depending on data types.

Hit OK on the screen which will take us back to our Data Flow task. Taking a look at our Destination task, you can see I have a table named EmployeesPerState with only two columns, Employee and State:

Save the package and execute and you can see that we’ve inserted the data we need. Boss is happy and gives you a $6 bonus.

For Advanced Mode, you can create more than 1 GROUP BY criteria. For example, the boss now wants to see the number of people in each state, but he also wants to see how many people per zip code. For this, we would expand the Advanced/Basic drill down and add another input for his request. We’ll name the Aggregation PostalCode and Group by PostalCode. Like we did for our State aggregation, we’ll add a Count all Operation and (*) Input.

 

B

Back on the Data Flow task, we’ll drag another Destination task into the editor and configure.

You can now see both Destination tasks with a label for each.

If we run the package again, you can see results have been inserted into our PostalCode table and the boss gives you another $6 bonus and let’s you take Sunday off.

 

Quick Backup/Restore Script

Throughout the years I’ve probably performed hundreds if not thousands of migrations or “refreshes” to lower environments. It’s pretty simple to run a quick maintenance plan or SQL job to backup all databases, but the more complex piece is creating a restore script to point to that backup location and the backup name. It can be done, but it takes a little longer, IMO. A lot of backup file names contain numeric characters such as date or time of backup.

John Morehouse created a dynamic script that simplifies everything and it can be easily modified to fit particular needs. I’ve had this script bookmarked for years and use it at least once a week. Kudos to John for sharing this with the SQL community. I’ll link the script below, but here’s a quick snippet.

DECLARE @date CHAR(8)
SET @date = (SELECT CONVERT(char(8), GETDATE(), 112))

DECLARE @path VARCHAR(125)
SET @path = '\\UNCPath\Folder\'

;WITH MoveCmdCTE ( DatabaseName, MoveCmd )
          AS ( SELECT DISTINCT
                        DB_NAME(database_id) ,
                        STUFF((SELECT   ' ' + CHAR(13)+', MOVE ''' + name + ''''
                                        + CASE Type
                                            WHEN 0 THEN ' TO ''D:\SQLData\'
                                            ELSE ' TO ''E:\SQLTLogs\'
                                          END
                                        + REVERSE(LEFT(REVERSE(physical_name),
                                                       CHARINDEX('\',
                                                              REVERSE(physical_name),
                                                              1) - 1)) + ''''
                               FROM     sys.master_files sm1
                               WHERE    sm1.database_id = sm2.database_ID
                        FOR   XML PATH('') ,
                                  TYPE).value('.', 'varchar(max)'), 1, 1, '') AS MoveCmd
               FROM     sys.master_files sm2
  )
SELECT
	'BACKUP DATABASE ' + name + ' TO DISK = ''' + @path + '' + name + '_COPY_ONLY_' + @date + '.bak'' WITH COMPRESSION, COPY_ONLY, STATS=5',
	'RESTORE DATABASE '+ name + ' FROM DISK = ''' + @path + '' + name + '_COPY_ONLY_' + @date + '.bak'' WITH RECOVERY, REPLACE, STATS=5 ' + movecmdCTE.MoveCmd
FROM sys.databases d
	INNER JOIN MoveCMDCTE ON d.name = movecmdcte.databasename
WHERE d.name LIKE '%DatabaseName%'
GO

To change the backup/restore path you can modify this section of code to fit your needs:

SET @path = '\\UNCPath\Folder\'

Another modification I perform initially is changing the where clause, for example, if I don’t want system databases I can change:

WHERE d.name LIKE '%DatabaseName%'

to

WHERE d.database_id > 4

A lot of times I don’t need to move the data and log files to a new location because I’m overwriting other databases so I just comment all of the MoveCmdCTE CTE code out:

DECLARE @date CHAR(8)
SET @date = (SELECT CONVERT(char(8), GETDATE(), 112))

DECLARE @path VARCHAR(125)
SET @path = '\\UNCPath\Folder\'

--;WITH MoveCmdCTE ( DatabaseName, MoveCmd )
--          AS ( SELECT DISTINCT
--                        DB_NAME(database_id) ,
--                        STUFF((SELECT   ' ' + CHAR(13)+', MOVE ''' + name + ''''
--                                        + CASE Type
--                                            WHEN 0 THEN ' TO ''D:\SQLData\'
--                                            ELSE ' TO ''E:\SQLTLogs\'
--                                          END
--                                        + REVERSE(LEFT(REVERSE(physical_name),
--                                                       CHARINDEX('\',
--                                                              REVERSE(physical_name),
--                                                              1) - 1)) + ''''
--                               FROM     sys.master_files sm1
--                               WHERE    sm1.database_id = sm2.database_ID
--                        FOR   XML PATH('') ,
--                                  TYPE).value('.', 'varchar(max)'), 1, 1, '') AS MoveCmd
--               FROM     sys.master_files sm2
--  )
SELECT
	'BACKUP DATABASE ' + name + ' TO DISK = ''' + @path + '' + name + '_COPY_ONLY_' + @date + '.bak'' WITH COMPRESSION, COPY_ONLY, STATS=5',
	'RESTORE DATABASE '+ name + ' FROM DISK = ''' + @path + '' + name + '_COPY_ONLY_' + @date + '.bak'' WITH RECOVERY, REPLACE, STATS=5 ' --+ movecmdCTE.MoveCmd
FROM sys.databases d
	--INNER JOIN MoveCMDCTE ON d.name = movecmdcte.databasename
WHERE d.name LIKE '%DatabaseName%'
GO

There’s a lot of option here and this script is quick and easy.

Link to script: https://gist.github.com/airtank20/a826c6f37439482edd5070e8aaeb1ee1

SQL Server 2022 is generally available!

On November 16, 2022, Microsoft announced the general availability of SQL Server 2022, the most Azure-enabled release of SQL Server yet, with continued innovation across performance, security, and availability1. This marks the latest milestone in the more than 30-year history of SQL Server.

SQL Server 2022 is a core element of the Microsoft Intelligent Data Platform. The platform seamlessly integrates operational databases, analytics, and data governance. This enables customers to adapt in real-time, add layers of intelligence to their applications, unlock fast and predictive insights, and govern their data—wherever it resides.

To learn more, click here

SQL Server 2022 in Preview

On November 2, 2021, Microsoft announced the preview of SQL Server 2022, the most Azure-enabled release of SQL Server yet, with continued innovation in performance, security, and availability.

The rise of data represents a tremendous opportunity and also poses challenges. Companies are seeing their relational and nonrelational data proliferate exponentially on-premises, in the cloud, at the edge, and in hybrid environments. The most transformative companies drive predictive insights on current data, whereas others may struggle to drive even reactive insights to their historical data. Information may be siloed across geographies and divisions.

To learn more, click here

SSIS Series: Insert/Update a SQL Server table using Merge Join in SSIS

I was asked a question awhile back on the easiest way to do an incremental load from one SQL Server table into another SQL Server table. This is common in data warehouses or reporting tables as you wouldn’t want to truncate a large table and perform a full insert. Instead, you would want to only copy the changes from the source to the destination. This could be an insert or an update. The easiest way, IMO, is SSIS. Let’s take a look.

Here’s how my environment is setup. I have two databases appropriately named Transactional (for transactional data) and Reporting (for static report data). Two tables, Source and Destination.

I also added some dummy data as you see below. I’ve highlighted what is different and will need to be changed in the destination table.

Let’s open SSIS and create a new SSIS project.
Drag and drop a Data Flow task into the design window, right click, Edit:

Next, drag and drop two OLE DB source components into the design window. I’m going to rename mine Source and Destination to match my table names.

Configure the two OLE DB sources to match the source table and the destination table. Below is a screenshot of my source connection manager.

Once your connection managers are configured let’s drag and drop two Sort components below each OLE DB Source and connect them to each source. We will be using a merge join next and the merge join component needs to have data sorted in ASC or DESC order. I always choose to sort on the Primary Key for each table.

In my example, I’ll choose to sort on my primary key, ID.

Next, drag the Merge Join transformation into the Design window and drag the data path from the Sort component to the Merge Join. When you attach the arrow to the transformation, the Input Output Selection dialog box appears, displaying two options: the Output drop-down list and the Input drop-down list. The Output drop-down list defaults to Source Output, which is what we want. From the Input drop-down list, select Merge Join Left Input, as shown below. We’ll use the other option, Merge Join Right Input, for the other connection.

Next, connect the data path from the other Sort component to the Merge Join transformation. This time, the Input Output Selection dialog box does not appear. Instead, the Input drop-down list defaults to the only remaining option: Merge Join Right Input.

Now, let’s configure the Merge Join transformation.

The first setting in the Merge Join Transformation Editor is the Join type drop-down list. From this list, you can select one of the following three join types:

  • Left outer join: Includes all rows from the left table, but only matching rows from the right table. You can use the Swap Inputs option to switch data source, effectively creating a right outer join.
  • Full outer join: Includes all rows from both tables.
  • Inner join: Includes rows only when the data matches between the two tables.

For our example, we want to include all rows from left table but only rows from the right table if there’s a match, so we’ll use the Left outer join option.

You now need to select which columns you want to include in the data set that will be outputted by the Merge Join transformation. For this exercise, we’ll include all columns. To include a column in the final result set, simply select the check box next to the column name in either data source.

Almost finished, but first let’s add a Conditional Split transformation. This will allow us to insert new records or update previous records.

In the Conditional Split Editor, I created two outputs. If (Destination) ID is NULL then the record doesn’t exist so we’ll perform an INSERT. If the ModifiedDate is different between the two tables then we know something has been updated since the last execution and we need to update the record. See below.

Since we can perform and INSERT or an UPDATE we’ll need two destinations. First, for the INSERT, we’ll simply be doing an INSERT into the table so we can drag the OLE DB Destination component into the window, choose “INSERT” in the Input Output selection window, and use the Reporting connection manager.

I’m going to check the “Keep Identity” box since the ID column is an identity column.

Next, for the UPDATE statement we’ll drag the OLE Command component into the window and configure it. Select “UPDATE” in the Input Output Selection window.

In the Connection Managers tab, assign the connection manager for Reporting.

In the Connection Managers tab, assign the connection manager for Reporting.

In the column mappings tab, assign parameters:

Final package should look like the following:

Save and execute. You can see that we updated two records and inserted one record:

Going back to our query you can see that everything matches up now: