Joining Tables in SQL Queries for Extracting Data in a Lake
Setting the framework for sustainable ETL
During a meeting with leadership, an end-user approached me about an idea for an analytic tool she had been considering. She shared with me a spreadsheet that included some manually entered data, along with her vision. I made the important move of asking her questions to understand the business purpose for this new analytic tool.
I don’t deal with mortgages, and neither does my company, but for the first several years of my career, I helped people pay for a business remodel or relocate to a new area. The mortgage data here models a business challenge that aligns with a recent project presented to me.
I now have the goal my end-user is looking to achieve, so it’s time to start evaluating the data available. In my role, I access a data mart through Microsoft SQL Server Management Studio.
From there, I go about creating the framework for my first SQL Query in my data lake by writing out the following statements:
SELECT
FROM
WHERE
GROUP BY
In the overwhelming majority of my queries, these four statements provide a framework for writing a query that I can revisit, share with others, and easily reference during a screen share.
At this point, rather than looking for specific data fields to begin populating, I evaluate the specific tables within my database, seeking those most relevant to my goal.
SELECT
FROM
    MortgageLoanListings AS ML
    JOIN MortgageRatesTable AS MRT
        ON ML.RatingIndex = MRT.RatingIndex
    INNER JOIN OfficeBridge AS OB
        ON OB.OfficeId = ML.OfficeId
    JOIN AccountInformationTable AS AC
        ON ML.AccountNumber = AC.AccountNumber
WHERE
GROUP BY
With regard to my format, take note that each SQL keyword is in ALL CAPS. Anyone reviewing your code can quickly spot the various statements to make changes and provide feedback. Also, notice that the indentation provides another tool for easier readability.
I also must mention my schema. Oftentimes you will look through a data table in the lake and find you don’t have a data element that aligns completely with any of the data elements in another table.
Be prepared to spend time seeking the correct connections for the data. Oftentimes, we can establish those connections with bridge tables. These align data across multiple chains of tables that still all interconnect, which creates the start of a snowflake schema.
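To illustrate, here is a minimal sketch of how a bridge table can resolve a key into the descriptive value we actually want. It reuses the hypothetical table and column names from this article (MortgageLoanListings, OfficeBridge, OfficeId, OfficeName); your lake’s names will differ.

```sql
-- MortgageLoanListings carries only an OfficeId key.
-- OfficeBridge resolves that key to the descriptive OfficeName
-- we will later use for filtering.
SELECT
    ML.AccountNumber,
    OB.OfficeName
FROM
    MortgageLoanListings AS ML
    INNER JOIN OfficeBridge AS OB
        ON OB.OfficeId = ML.OfficeId
```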
As the last step before adding the data fields I’d like to query, I locate the data field that will filter my data. This is an important step before running the query, as you can save yourself potentially several hours waiting for a query to return results that are irrelevant to you.
SELECT
FROM
    MortgageLoanListings AS ML
    JOIN MortgageRatesTable AS MRT
        ON ML.RatingIndex = MRT.RatingIndex
    INNER JOIN OfficeBridge AS OB
        ON OB.OfficeId = ML.OfficeId
    JOIN AccountInformationTable AS AC
        ON ML.AccountNumber = AC.AccountNumber
WHERE
    OB.OfficeName = 'Pittsburgh'
GROUP BY
Now that we’ve familiarized ourselves with the tables of data available to us, joined and filtered the correct data, we can go about selecting the data we need.
However, as I work through this project, I will be making some changes to my SELECT statement, along with the tables I join.
I may even change the main table at the center of my snowflake schema that connects all my data, if I find the keys of another table align better. But here’s the first iteration:
SELECT
    OB.OfficeName,
    MRT.MortgageRatesTable,
    AC.AccountCustomerName,
    SUM(ML.MortgageNumber)
FROM
    MortgageLoanListings AS ML
    JOIN MortgageRatesTable AS MRT
        ON ML.RatingIndex = MRT.RatingIndex
    INNER JOIN OfficeBridge AS OB
        ON OB.OfficeId = ML.OfficeId
    JOIN AccountInformationTable AS AC
        ON ML.AccountNumber = AC.AccountNumber
WHERE
    OB.OfficeName = 'Pittsburgh'
GROUP BY
    OB.OfficeName,
    MRT.MortgageRatesTable,
    AC.AccountCustomerName
With the above, you will notice I ordered my SELECT statement with a hierarchy.
Whenever you find a discrepancy in the way your data appears in your end product, it is best to be able to easily identify which data sits higher in that hierarchy. Customers can have multiple mortgages.
This could be cumbersome to an end-user that wouldn’t want to scroll through Stella Thompson’s First Mortgage, Home Equity Line of Credit, and Vacation Home Mortgage to see a total of how much money she owed.
Aggregating that data together was accomplished using SUM. From there, I repeat the same hierarchy in the GROUP BY statement, leaving the SUM out, since aggregated columns never belong in a GROUP BY.
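As a sketch of what that aggregation does, stripped down to just the customer and loan tables from this article (the individual mortgage figures in the comment are invented for illustration):

```sql
-- Without aggregation, one row per mortgage, e.g.:
--   Stella Thompson | First Mortgage             | ...
--   Stella Thompson | Home Equity Line of Credit | ...
--   Stella Thompson | Vacation Home Mortgage     | ...
-- With SUM and GROUP BY, those collapse to one total row per customer:
SELECT
    AC.AccountCustomerName,
    SUM(ML.MortgageNumber)
FROM
    MortgageLoanListings AS ML
    JOIN AccountInformationTable AS AC
        ON ML.AccountNumber = AC.AccountNumber
GROUP BY
    AC.AccountCustomerName
```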
Once I’ve generated my SQL query, I load the data into Power BI and create an interactive visualization. This is an iterative process: I meet with several end-users on a regular basis, gather feedback, go back to edit my queries multiple times, and make many versions before the end product is launched.
I want to readdress the conversation I had in that meeting with leadership and the end-user. Keeping that dialogue open-ended is an important step. Doing so doesn’t just give me the feedback I need; asking for others’ feedback creates the engagement that gets multiple teams to notice my work. For me, having that notice has given me the power to take charge of many of the ways I design my tools. I thrive in that power.