In this research project, I collaborate with Professor Sandra Cannon to measure data literacy expectations from the ways employers describe jobs and the people they are looking for. We then assess the discriminatory power between how employers describe jobs and what the actual work on the job entails.
linkedin.py
This is the file you use to generate the DataFrame of LinkedIn postings. Results are stored in data/scraping_results (tagged with "linkedin").
indeed.py
Same as linkedin.py, but for Indeed postings. Results are stored in data/scraping_results (tagged with "indeed").
Data files:
merged_headings_df
: Contains both the LinkedIn and Indeed postings in a single DataFrame
to_wcdf
: Applies sklearn CountVectorizer
preprocess_heading_text
: Takes the Heading Text (initially intended for merged_headings_df) and applies a preprocessing pipeline to it
visualize_counts
: Takes in a Pandas series of string row entries and uses Seaborn to visualize the top n words in that corpus
visualize_seq_lengths
: Visualizes the distribution of word lengths in a sequence