Hello! As it grows, this site will work to demonstrate my knowledge of and useful links for learning about SQL and T-SQL joins. I am totally self taught thus I will constantly be looking ways to improve my skills and will be documenting my journey here. In the long run, I hope this site helps you as much as other helped me.
A little more information about me
My data skillset markedly started to flourish at a now closed Pittsburgh company. On account of my final position there, I made operational decisions based on my own reporting. I could see the upstream and downstream impacts of every data driven decision I made. Also, I was only working with datasets not relational databases.
Next, I would move on to my first analyst job at a local health care company. I connected to relational databases primarily through Microsoft Access.
Data project management
One thing I’ve learned the hard way is that when someone with not a lot of experience enters into a contract-to-hire position as a data analyst, it’s all about ad hoc request results. This probably doesn’t surprise you if you’re reading this. I’ve had successes and failures in the past. I’m hoping this site can help both you as the reader – and me as the professional – to learn from my failures.
If you’re here, you are probably in a situation where you’re trying to complete a data project without a lot of guidance or understanding of best practices.
Make sure to you get as much information about the question being asked.
Dig into the data. What specifically is the question that needs answered? Does the project measure specific KPIs? Is the data normalized at all? If you’re working in SQL Server, where are the primary and foreign keys? Is there a data dictionary? If there a database diagram? Investigate the indexes. Use system queries to check out the columns in the tables in the database…
…and here I am getting ahead of myself.
More on data project management later.
Educational material for SQL joins
When I became an analyst for the first time, one of the first details I had to research was joins. There are a ton of resources out there…and I’m going to attempt to make this site stand out among them.
Here is a great cheat sheet I found that can be used to help you familiarize yourself with some basic syntax.
Here is another.
This is another.
One of the best ways to describe SQL joins is here on Wikipedia:
An SQL join clause – corresponding to a join operation in relational algebra – combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is. AJoin (SQL). (2021. February 9) in Wikipedia. http://bit.ly/37Yjqiu
JOINis a means for combining columns from one (self-join) or more tables by using values common to each. ANSI-standard SQL specifies five types of
CROSS. As a special case, a table (base table, view, or joined table) can
JOINto itself in a self-Join.
Aside from using Access for SQL joins, my main experience is with Transact SQL (T-SQL) for the Microsoft stack and SQL Server. I will have some posts regarding Access SQL as this blog gets bigger, but for now I’m going to be focusing on T-SQL and SQL Server.
A closer look at queries
Depending on what you experience level is, you may look at a simple select statement and assume that since *you* read it a certain way that the server will do the same. Reality couldn’t be any further from the truth.
SQL is a declarative language. That means, per WIKIpedia:
The server looks at it in a particular order. Consider this list of how clauses are read in a SQL query:
FROM ON JOIN WHERE GROUP BY WITH CUBE or WITH ROLLUP HAVING SELECT DISTINCT ORDER BY TOP
So you want to make sure you keep this in mind as you build the query.
Ask yourself questions like this: Where is the data coming from? Clean tables? Veiws? CTEs? A Stored Procedure?
Depending on how complex your query is…you may need to look into the SQL Server query optimizer. I will cover that in another post.
Some Simple Queries
Being a baseball fan, I wanted to look for some unusual data to mine to illustrate various skills.
This is the set I chose.
Once I dug in and normalized some of the historic data (more on that later) I imported (more on that process later too) it into a locally hosted SQLEXRESS instance.
Being from and currently living in Altoona, I was surprised to see this once I looked at the franchises that were listed:
Take a look at the above screenshot. It’s a very simple select query using something I haven’t mentioned before – the asterisk in the select clause. what I’m telling my database here is from the dim_team_franchise table select (*) everything. The asterisk is a quick way to do that while you’re familiarizing yourself with a database… but please don’t use it in production!
So I was initially surprised to learn my hometown had a major league baseball franchise at one point. This is where my curiosity kicked it: what else can I learn?
Further Digging into the Data
Looking further into the franchise, the primary key for some of the data is the team_id. Another table aggregated each team’s information by year.
Knowing the team existed for one year, I queried the dim_team table looking for data using the ‘ALT’ identifier.
The team played 25 games! That’s cool. So let’s check on how the players did.
It looks like the team was pretty bad. One highlight though seems to be the last player in the list. I’m now wondering what kind of career he had. Lets take a look! I’ll do so using his player_id smithge01.
Let’s see what his real name was. Looking at the dim_player table I find this:
So it looks like his name was Germany Smith. Let’s take a look at Mr. Smith’s career using the dim_batting table. How many different teams did he play for? It looks like six.
Before I move on, take a look at the select clause in the above screenshot. I’m querying for a specific team ID when Mr. Smith could have obviously played for the same team for multiple years. I used the distinct operator to tell SQL that I only want one distinct value returned per team_id. This ultimately tells me Mr. Smith played for six teams.
So I just realized this post went off the rails into a select query post and not one about SQL joins. That is easily rectified based on the queries I’ve run so far.
Here is a screenshot of a left join where I query for what teams he played for, how many teams he played for each year and how man at bats he had:
My server read this query like this:
- from dim_batting
- on the first left join’s fields
- on the second left join’s fields
- where player_id = ‘smithge01’
- order on year ascending
This post has gotten very long. I’ll have to revisit this to explain the null values that the query returned. It looks like I still have some normalization to do.