How to get a Data Science Job…using Data Science-Part I

Looking for a job is a full-time job. You have to dedicate countless hours to searching and reading job posting websites, writing cover letters, and filling in your CV over and over (even after you already uploaded it), only to later find out you weren’t a good match for the position.

But wait! We are Data Scientists; there has to be a better way to find our dream job without having to read every job post out there. What if we could build a job postings database and then find the best match for our qualifications, like seniority, programming languages, frameworks, etc.? And then rate each of the jobs so we can give priority to the ones we match best with. That would be cool, wouldn’t it?
Today we will use this real-life problem to introduce some web scraping techniques, as well as learn how to apply the Data Science way of thinking to tackle problems. So we will be using Data Science to get a Data Science job!
The plan
Before jumping into coding, let’s solve the problem on paper and break it down into steps.
- Find the data source — There are many job posting websites, but by far the most popular is LinkedIn, so this will be our data source.
- Collect the data — We will write a Python script in Jupyter to access LinkedIn, search jobs with the keywords “data science”, and then crawl the result pages while we extract the information from each post.
- Clean the collected data — This is a very important part of any ETL: put the data in the best possible shape to make it understandable and searchable. Since we will be working with unstructured data, that is, text written by humans, we will have to clean the text and remove unnecessary words for the scoring part.
- Create a scoring function — We need a function that scores how good the match is between us and each job post. Put simply, the function will take our CV and a job post, and the output will be a number between 0 and 1: f(CV, JobPost) = score. A minimal sketch of this idea follows the list.
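As a preview (Part II will build the real scoring), here is a minimal sketch of such a function, assuming we measure text similarity with TF-IDF and cosine similarity; any other text-similarity measure would fit the same signature:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score(cv_text, job_post_text):
    # Vectorize both documents with TF-IDF, then take the cosine similarity:
    # 1.0 means identical wording, 0.0 means no shared terms
    tfidf = TfidfVectorizer(stop_words='english').fit_transform([cv_text, job_post_text])
    return cosine_similarity(tfidf[0], tfidf[1])[0][0]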
The code
Login to LinkedIn
The first task of the day is installing the Selenium library. We will use this library to open and control a Google Chrome window with Python code. We will also need to download a chromedriver from:
https://chromedriver.chromium.org/downloads
Make sure that the chromedriver version matches the version of Google Chrome installed on your computer. You can check the version by opening Google Chrome, clicking on the 3 dots on the top right, then going to Help → About Google Chrome.
Once we have Selenium installed and the chromedriver in our project folder, we can start coding. We will import the libraries Pandas, Selenium, and time. The latter is part of Python’s standard library; we’ll use it to add wait times so the browser can fully load a page before the next step, which is an important consideration when doing web scraping. Then we will create a browser instance and use that instance to go to the LinkedIn login page.
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
#Create browser instance by giving the path to the chromedriver
browser = webdriver.Chrome('./chromedriver')
# Navigate to LinkedIn Login page
browser.get('https://www.linkedin.com/login/es?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin')
You will see a Google Chrome window open and automatically navigate to LinkedIn.

We will now add some lines of code to automatically insert our LinkedIn username and password. To insert the username we use the following line:
browser.find_element(By.ID,'username').send_keys('erichhohenstein@gmail.com')
Let’s break down this line of code. First we have the browser instance, which we named “browser”; then we call the method “find_element”, to which we pass how we are going to find the element, in this case By.ID, followed by the id “username”. The way we work with Selenium is by finding elements on the browser screen and then interacting with them by sending strings, clicks, and other commands. If you type “By.” and then hit tab, you can see all the possible ways to find elements. In this case we passed the “username” id. I found this id using Chrome DevTools: right click on the browser → Inspect → Select an element to inspect (this lets you click an element on the page and then shows the code behind the element).
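For reference, here are a few of the locator strategies that By exposes (the class name and XPath below are hypothetical examples, not taken from LinkedIn’s page):
browser.find_element(By.ID, 'username')                     # by HTML id
browser.find_element(By.CLASS_NAME, 'login-button')         # by CSS class (hypothetical)
browser.find_element(By.TAG_NAME, 'img')                    # by tag name
browser.find_element(By.XPATH, "//button[@type='submit']")  # by XPath (hypothetical)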
We do the same to send the password and then we use the following code to click the Login button.
browser.find_element(By.CLASS_NAME,'btn__primary--large').click()
Notice in this case I found the element using the class name and used the .click() method to click the button…and we are in!

In order to keep our code clean, we will be using functions for each specific task. So the final code for the login function is the following:
def loginLinkedIn(browser,usr,psw):
    # Go to LinkedIn Login page
    browser.get('https://www.linkedin.com/login/es?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin')
    time.sleep(4)
    browser.find_element(By.ID,'username').send_keys(usr)
    time.sleep(1)
    browser.find_element(By.ID,'password').send_keys(psw)
    time.sleep(1)
    browser.find_element(By.CLASS_NAME,'btn__primary--large').click()
    return browser
Notice I added time.sleep(1) so the program waits for 1 second before each step. This practice avoids errors caused by executing tasks too fast, like clicking a button that hasn’t been rendered yet.
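Fixed sleeps are the simplest option; a more robust alternative (not used in this article, but worth knowing) is Selenium’s explicit waits, which block only until a condition is met:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the login button to become clickable, then click it
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.CLASS_NAME, 'btn__primary--large'))
).click()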
Web Crawling
The next step is to make the job search. For this, we will first do the search manually to get the URL and understand its parts, so we can use it to crawl LinkedIn.
I searched worldwide for remote data science positions and got the following URL:
https://www.linkedin.com/jobs/search/?f_WT=2&geoId=92000000&keywords=data%20science&location=Worldwide&sortBy=R. When I go to page 2, the link changes to https://www.linkedin.com/jobs/search/?f_WT=2&geoId=92000000&keywords=data%20science&location=Worldwide&sortBy=R&start=25
Analyzing the URL, we see there are 3 important things happening here:
- The number after “f_WT=” selects on-site, remote, or hybrid, where 2 is remote.
- After “keywords=” goes what we are searching for, with spaces replaced by “%20”.
- After “start=” goes the offset of the first job post on the page; each page starts at a multiple of 25.
With this knowledge we can now use a loop to go through all the result pages. Our code to build the search URL will look like this, with a sketch of the paging loop right after:
#PARAMETERS
jobtype = '2' #Remote
keyword = 'data%20science' # "%20" = space
page = '0' #start at 0. Goes in multiples of 25
#URL
URL = 'https://www.linkedin.com/jobs/search/?f_WT='+jobtype+'&geoId=92000000&keywords='+keyword+'&location=Worldwide&sortBy=R&start='+page
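The code above builds the URL for a single page; a minimal sketch of the paging loop looks like this (the full version appears in getLinkedinJobs below):
# Request each result page in turn; LinkedIn pages in multiples of 25
for p in range(4):  # the first 4 result pages, for example
    page = str(p * 25)
    browser.get('https://www.linkedin.com/jobs/search/?f_WT=' + jobtype
                + '&geoId=92000000&keywords=' + keyword
                + '&location=Worldwide&sortBy=R&start=' + page)
    time.sleep(3)  # let the page render before scraping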
Web Scraping LinkedIn Job posts
The next thing we need to solve is what to do with each post. When we click on a job post, the whole description is shown. So we need to write a function that extracts the information within each time our program clicks on a job post. The information we want to get is: job title, company, company profile link, location, job link, poster name, poster profile link, and job description. We will now write a function that receives a Selenium browser instance and returns this information. The code is the following:
def getLinkedinJobs(browser,keyword,jobType,maxPageNumSearch):
    actions = ActionChains(browser)
    jobtype = str(jobType)  # 2 = Remote
    keyword = keyword.replace(' ','%20')  # %20 = space
    jobData = [['title','company','companyLink','location','jobLink','posterName','posterProfileLink','jobDescription']]
    for p in range(maxPageNumSearch):
        page = str(p*25)
        browser.get('https://www.linkedin.com/jobs/search/?f_WT='+jobtype+'&geoId=92000000&keywords='+keyword+'&location=Worldwide&sortBy=R&start='+page)
        time.sleep(3)
        # Search for all job postings shown on this page
        jobs = browser.find_elements(By.CLASS_NAME,'jobs-search-results__list-item')
        for j in jobs:
            time.sleep(2)
            # Scroll the posting into view, then click its image so we don't misclick a URL
            actions.move_to_element(j).perform()
            j.find_element(By.TAG_NAME,'img').click()
            jobData.append(getJobInfo(browser))
    df = pd.DataFrame(jobData[1:],columns=jobData[0])
    # Drop any duplicates
    df.drop_duplicates(subset=['jobLink'],inplace=True)
    return df
There is a lot happening here, so let’s break it down. Our function takes the following inputs: a browser instance, the keyword to be searched (e.g. “Software Developer”), a jobType (e.g. 2 for remote jobs), and maxPageNumSearch. When we make a search on LinkedIn, the results are divided into pages; this number represents up to which page we want to scrape the results. The function iterates through all the result pages (up to maxPageNumSearch), and for each page clicks on each post to display the job description on the screen. After clicking a job, it calls the function getJobInfo(browser), which extracts all the information we want and returns it as a list. Finally, all the lists are put together into a Pandas data frame, which is returned. The code for getJobInfo() is the following:
def getJobInfo(browser):
    title = ''
    company = ''
    companyLink = ''
    location = ''
    jobLink = ''
    posterName = ''
    posterProfileLink = ''
    jobDescription = ''
    try:
        title = browser.find_element(By.CLASS_NAME,'jobs-unified-top-card__content--two-pane').text.split('\n')[0]
    except:
        pass
    try:
        company = browser.find_element(By.CLASS_NAME,'jobs-unified-top-card__company-name').text
    except:
        pass
    try:
        companyLink = browser.find_element(By.CLASS_NAME,'jobs-unified-top-card__company-name').find_element(By.TAG_NAME,'a').get_attribute('href')
    except:
        pass
    try:
        location = browser.find_element(By.CLASS_NAME,'jobs-unified-top-card__bullet').text
    except:
        pass
    try:
        jobLink = browser.find_element(By.CLASS_NAME,'jobs-unified-top-card__content--two-pane').find_element(By.TAG_NAME,'a').get_attribute('href')
    except:
        pass
    try:
        posterName = browser.find_element(By.CLASS_NAME,'jobs-poster__name').text
    except:
        pass
    try:
        posterProfileLink = browser.find_element(By.CLASS_NAME,'jobs-poster__name-link').get_attribute('href')
    except:
        pass
    try:
        jobDescription = browser.find_element(By.CLASS_NAME,'jobs-box__html-content').find_element(By.TAG_NAME,'span').text.replace('\n',' ')
    except:
        pass
    return [title,company,companyLink,location,jobLink,posterName,posterProfileLink,jobDescription]
Although it looks complex, this function follows the same principles we discussed at the beginning: find an element’s name using Chrome DevTools, then use Selenium’s find_element to locate it by ID, class name, or whatever other method works for us. Notice how I enclosed each find_element call in a try/except-pass. This is a good practice when there are elements that don’t always show; for instance, sometimes the “location” element isn’t included in the job post, so we can skip it safely.
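If the repetition bothers you, the pattern could be factored into a small helper (a sketch, not part of the original code):
def safeText(browser, className):
    # Return the element's text, or '' if the element isn't on the page
    try:
        return browser.find_element(By.CLASS_NAME, className).text
    except Exception:
        return ''

# For example:
# company = safeText(browser, 'jobs-unified-top-card__company-name')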
1000 Web Scraped Jobs from LinkedIn!...Now what?
I ran the code to search for remote Data Science jobs using our function:
getLinkedinJobs(browser,'data science','2',40)
I got back a data frame with 1000 job postings and all their data.
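Since a 40-page scrape takes a while, it’s worth saving the result to disk before the analysis in Part II; a quick example (the file name is hypothetical):
df = getLinkedinJobs(browser, 'data science', '2', 40)
df.to_csv('linkedin_jobs.csv', index=False)  # hypothetical file name
print(df.shape)  # e.g. (1000, 8): one row per posting, 8 columns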

That’s a lot of job postings to read! What we need to do now is find a way to process all these job posts so we can find the jobs that are a better match for us and prioritize our application process, which is exactly what we’ll do in the next blog post. We will apply Data Science to compare each job description with our CV. We will learn how to read PDFs with Python and then how to do text comparison to score each job post against our resume and find the best jobs for us.
Code: https://github.com/erichho/LinkedIn-Job-Webscraper