Project 3: Post Graduate College Student Salaries

Post College Salaries

Introduction

In this project I look at how much college graduates make based on the degree they studied, how far into their career they are, and their high meaning % (how much fulfillment they get from their career). The dataset comes from https://www.kaggle.com/datasets/rathoddharmendra/post-college-salaries and has 763 unique entries with 6 distinct columns. The purpose of this research is to see which college degrees correlate with the highest salaries and the highest fulfillment rates.

What is Regression?

Regression, in the context of data mining, is a supervised learning technique that predicts a continuous Dependent Variable (in this case, % High Meaning) from one or more Independent Variables (College Degree and Salaries) by modeling the relationship between them. Linear regression is the most common form, and it is the one used in this project. With three candidate Independent Variables (Starting Salary, Mid-Career Salary, and College Degree), the general form is Multiple Linear Regression. The Multiple Linear Regression formula looks like this:

Regression Visualization
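
Written out in standard notation (the symbols are the usual textbook ones, not taken from the notebook), the multiple linear regression model for three independent variables is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon
```

Here $y$ is % High Meaning, $x_1$ to $x_3$ are Starting Salary, Mid-Career Salary, and College Degree (which would have to be encoded numerically), $\beta_0$ is the intercept, each $\beta_i$ is the coefficient of $x_i$, and $\varepsilon$ is the error term.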

Understanding the Data

At first glance, the growth between Early Career Pay and Mid-Career Pay does not necessarily correlate with a high meaning %, but rather depends on what the individual studied in college. Some individuals had a salary jump as large as 100,000 dollars but a high meaning % only in the low to mid 40s, while others had a salary jump of only 30,000-50,000 dollars and a high meaning % between 70 and 90. I also saw a broad correlation of:

  • Business, marketing, and finance degrees with a very low high meaning %
  • Engineering and most science degrees with an average to mid-high high meaning %
  • Nursing, medical studies, and other health-sector degrees with the highest high meaning %

Data Table

Data Preprocessing

To start off, we need to drop the degree type column, since every entry is a Bachelor's degree:

Taking Out Column Dataset
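
A minimal pandas sketch of this step. The column names (such as "Degree Type") and the three-row frame are assumptions standing in for the Kaggle CSV; they are not copied from the notebook:

```python
import pandas as pd

# Tiny stand-in for the Kaggle CSV; column names and values are assumed for illustration.
df = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Major": ["Major A", "Major B", "Major C"],
    "Degree Type": ["Bachelors", "Bachelors", "Bachelors"],
    "Early Career Pay": ["$60,000", "$55,000", "$48,000"],
    "Mid-Career Pay": ["$110,000", "$95,000", "$70,000"],
    "% High Meaning": ["45%", "60%", "85%"],
})

# A column with a single distinct value carries no information for a model,
# so it is only dropped after confirming that it really is constant.
if df["Degree Type"].nunique() == 1:
    df = df.drop(columns=["Degree Type"])

print(df.columns.tolist())
```

Checking `nunique()` first guards against silently discarding a column that turns out to hold more than one degree type.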

We also need to check for any null values that might throw off the linear regression model and remove them:

Null Values

The dataset turns out to contain no null values, so nothing needs to be removed.
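
For reference, a null check of this kind might look like the sketch below. The frame is made up and deliberately contains one missing value so that both the count and the removal are visible; the real dataset had none:

```python
import pandas as pd

# Illustrative frame with one deliberately missing value (not real data).
df = pd.DataFrame({
    "Early Career Pay": [60000.0, None, 48000.0],
    "Mid-Career Pay": [110000.0, 95000.0, 70000.0],
})

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value.
clean = df.dropna().reset_index(drop=True)
print(len(clean))
```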

The salaries also need to be converted into decimal (numeric) format before the models can be trained:

Decimalized Dataset
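
One way to do this conversion, assuming the raw values are strings such as "$98,100" or "72%" (the exact formatting in the CSV is an assumption here, and `to_number` is a hypothetical helper name):

```python
def to_number(text: str) -> float:
    """Convert a string such as '$98,100' or '72%' to a float."""
    cleaned = text.strip().replace("$", "").replace(",", "").replace("%", "")
    return float(cleaned)


# Sample strings made up for illustration.
samples = ["$98,100", "$45,500", "72%"]
values = [to_number(s) for s in samples]
print(values)
```

With pandas, the same helper could be applied to a whole column with `df[column].map(to_number)`.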

Experiment 1: Early and Mid Career Pay

The first experiment uses a linear regression model relating Early Career Pay to Mid Career Pay to see if there is a strong relationship between the two. After training the model, the R^2 value was 0.702, which indicates a strong positive relationship between Early Career Pay and Mid Career Pay. A high R^2 describes how well one variable predicts the other; on its own it does not establish a causal effect. Still, it suggests that if you go into a field that shows high potential for career pay growth, you will more likely than not experience that growth after years of working.

Experiment 2: Mid Career Pay and High Meaning

Since we have established that there is a strong correlation between Early and Mid Career Pay, let's now see if there is a correlation between Mid Career Pay and High Meaning using the linear regression model. We got an R^2 score of 0.0686, which is very close to zero. This means that there is an extremely weak linear relationship between Mid-Career Pay and High Meaning. It suggests that people who are well into their careers do not judge the importance or meaning of their job by how much they make, but by some other, perhaps intangible, factor.

Experiment 3: Early Career Pay and High Meaning

Finally, let's see if there is any correlation between Early Career Pay and High Meaning, again using linear regression, to determine if money has any correlation with high meaning in the workplace. This time we got an R^2 score of 0.017, which is even lower than in the last experiment. Early Career Pay explains less than 2% of the variation in High Meaning, so salary is, at most, a very minor factor in how important someone considers their job.
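
To put the three scores side by side, the short calculation below converts each R^2 into the share of variance explained and the size of the implied correlation. It assumes each score comes from a one-predictor least-squares fit evaluated on the data it was fitted to, in which case |r| is the square root of R^2:

```python
import math

# R^2 scores reported in the three experiments above.
r_squared = {
    "Early vs. Mid Career Pay": 0.702,
    "Mid Career Pay vs. High Meaning": 0.0686,
    "Early Career Pay vs. High Meaning": 0.017,
}

# With one predictor, |r| = sqrt(R^2); the sign cannot be recovered from R^2 alone.
abs_r = {name: math.sqrt(score) for name, score in r_squared.items()}

for name, score in r_squared.items():
    print(f"{name}: variance explained = {score:.1%}, |r| = {abs_r[name]:.3f}")
```

Under that assumption, the implied correlations are roughly 0.84, 0.26, and 0.13, which matches the reading of one strong relationship and two weak ones.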

Impact

This project concludes that money is not the answer to all problems, especially when it comes to one's self-worth or sense of importance in a career after college. Someone can be making a lot of money and find little meaning in their job, while someone else can be making little money and find a lot of meaning in theirs, and there are many other combinations that do not indicate a strong correlation between salary and meaning. This means that something else must be affecting one's self-worth in their job. Perhaps it is the industry they work in, the company they work for, the coworkers they are surrounded by, the commute to and from work, etc. In essence, this project challenges the assumption that money is an indicator of whether someone sees their job as important or unimportant.

Conclusion

I have learned that linear regression is good at testing two different columns and determining whether or not there is a correlation between them. However, a simple linear regression cannot relate three columns at once; that would require a different model, such as multiple linear regression, which I did not use in this project. Overall, I now know how to train and use linear regression models.
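
One model that can relate three columns at once is multiple linear regression. The sketch below is not part of the notebook: it uses NumPy and made-up values to predict High Meaning from both salary columns together:

```python
import numpy as np

# Made-up values for illustration only; not rows from the dataset.
early_pay = np.array([45.0, 50.0, 55.0, 60.0, 70.0, 65.0])     # thousands of dollars
mid_pay = np.array([70.0, 85.0, 90.0, 110.0, 120.0, 100.0])    # thousands of dollars
high_meaning = np.array([80.0, 62.0, 70.0, 45.0, 50.0, 55.0])  # percent

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(early_pay), early_pay, mid_pay])

# Least-squares solution of X @ beta ~ high_meaning.
beta, *_ = np.linalg.lstsq(X, high_meaning, rcond=None)

# R^2 of the fitted model on the same data.
predicted = X @ beta
ss_res = np.sum((high_meaning - predicted) ** 2)
ss_tot = np.sum((high_meaning - high_meaning.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(beta, r_squared)
```

A categorical column such as the college degree would first have to be encoded numerically (for example, one-hot encoded) before it could be added to the design matrix.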

My Step-by-Step Code

My Kaggle Jupyter Notebook