I was thinking I would spend yesterday setting up a Raspberry Pi Minecraft Server for the boys, but William decided to have a meltdown over swim team practice and lost access to electronics for the weekend. So I put that project on the back burner and instead tackled some work-related programming.
My new company, Codility, helps companies screen technical candidates during the hiring process. We thought the pandemic would actually be good for business since people would need to find ways to interview candidates remotely versus in person. The thing we learned, however, is that we need to find the companies that are actually hiring during the pandemic.
I have over 150 companies on my target account list. Visiting every company’s career page is an option, but time-consuming. There are services like Burning Glass that track labor statistics, but they wanted $15,000 for an annual subscription to their service. Companies like Indeed.com provide a single source of job postings, but I would still need to search 150 companies... and vary my searches. (Just finding out who is hiring is not enough: companies like Amazon are hiring like crazy, but most hires are factory workers, and we help with technical roles like software developers and data architects.)
I found a web scraping tool, ParseHub, but quickly found that the free version was going to be very limiting. After some research, I found that Python may once again be the answer, particularly thanks to an interestingly named library, BeautifulSoup. (This article was a great starting point, with a similar use case.)
This project wound up being:
- a refresher on Python
- an introduction to scraping with BeautifulSoup
- a refresher on REGEX (regular expression matching)
- an exploration of programmatic interaction with Google Sheets
I found that there were various helper libraries to read/write Google Sheets. I got distracted when trying to address the API Key authorization process and wound up using EZsheets instead of the core Google Drive APIs.
Here is the code I hacked together (note, this assumes you have done much of the API key authorization stuff previously):
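import re

import ezsheets
import requests
from bs4 import BeautifulSoup

ss = ezsheets.Spreadsheet('[google sheet id]')
sheet = ss[0]  # assumes the data is on the first tab; cells are sheet[column, row]
Title = sheet[2, 1]  # job title to search for, stored in cell B1
i = 2  # company names start in row 2 of column A
while len(sheet[1, i]) != 0:
    Company = sheet[1, i]
    URL = 'https://www.indeed.com/jobs?q=title%3A' + '"' + Title + '"' + '%20company%3A' + '"' + Company + '"' + '&fromage=14'
    sheet[3, i] = URL  # keep the search URL in column C for reference
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    # find the element with the page counts
    CountPages = soup.find(id="searchCountPages")
    if CountPages is None:
        # trap if page has no job listings
        num = 0
    else:
        # text looks like "Page 1 of 11 jobs"; the last number is the job count
        stuff = re.findall('[0-9]+', CountPages.get_text().replace(',', ''))
        num = int(stuff[-1]) if stuff else 0
    sheet[2, i] = num  # write the job count into column B
    i += 1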
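The two fiddly bits of the script, building the search URL and pulling the job count out of text like "Page 1 of 11 jobs", can be pulled into small functions that are easy to try without touching Google Sheets or Indeed. This is only a rough sketch: the function names `build_search_url` and `parse_job_count` are made up for illustration, and the assumption that large counts show up with a thousands separator (such as "1,234") is a guess on my part, not something I verified against the site.

```python
import re
from urllib.parse import quote


def build_search_url(title, company, max_age_days=14):
    """Build an Indeed search URL for an exact job title at one company.

    Hypothetical helper: it percent-encodes the whole query so that spaces,
    quotes, and ampersands in a company name cannot break the URL.
    """
    query = f'title:"{title}" company:"{company}"'
    return f"https://www.indeed.com/jobs?q={quote(query)}&fromage={max_age_days}"


def parse_job_count(text):
    """Return the job total from text like 'Page 1 of 11 jobs', or 0 if absent."""
    # Capture the number between "of" and "job(s)"; allow commas (assumed format).
    match = re.search(r"of\s+([\d,]+)\s+jobs?", text)
    if match is None:
        return 0
    return int(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(build_search_url("Software Engineer", "Amazon"))
    print(parse_job_count("Page 1 of 11 jobs"))
```

Anchoring the pattern on "of ... jobs" rather than grabbing every number avoids accidentally picking up the page number instead of the job total.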
I am pretty happy with the results. I was able to zip through all 150 accounts and develop a prioritization/segmentation that should help me identify companies that can really benefit from our technology.