Automate Your Browser: A Guided Selenium Adventure

Updated March 25, 2021. Published November 7, 2016.

Prerequisites: Have Python/Anaconda and Selenium installed. See the previous intro to Selenium if you’re not familiar with it. The full code for this post is included at the end. You might find it fun to first run the entire script and watch how it works before jumping in and following along with the post. And don’t forget to use your browser’s Developer Tools to see how XPaths were chosen. Best of luck.

One morning Todd, a loyal and industrious employee, was hard at work watching cat videos when suddenly his boss, Mr. Peabody, burst into his cubicle.

“Todd, the CEO is sick of paying everyone so much money. Figure out where in the Southwest US the wages are the lowest. We’re relocating there ASAP!”

This wasn’t the first time Todd had been asked to look up something like this. Sometimes it’s where wages are lowest in the Northeast, or the Midwest, or in individual states, and Todd is fed up with it. So many cat videos, but so little time to watch them. If only there were a way to automate it…

With a sudden flash of inspiration, Todd realizes the Selenium package for Python can solve his problem! He just needs to write up a quick script, and then he can kiss these lengthy wage look-ups goodbye.

Automating your office work with Python and Selenium

To start, Todd writes the code to import the packages he thinks he’ll need, open up a browser, and navigate to the Bureau of Labor Statistics (BLS) data page.

import re
from selenium import webdriver

## remember to input your own file path to your chrome driver here if you're using chrome
browser = webdriver.Chrome(executable_path="C:\\Users\\gstanton\\Downloads\\chromedriver.exe")
browser.maximize_window()

url = 'http://www.bls.gov/data/'
browser.get(url)
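One optional safeguard, not in the original script: the BLS pages can take a moment to load, and Selenium raises a NoSuchElementException if it looks for an element before the page is ready. A one-line implicit wait tells the driver to keep retrying lookups for up to a set number of seconds:

## optional: retry element lookups for up to 10 seconds while pages load
browser.implicitly_wait(10)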

Todd wants to use the “Multi-Screen Data Search” tool in the “Employment” section to get earnings (aka wages) data for US metro areas. To do this, he identifies a viable XPath expression for the tool’s button with Chrome’s Developer Tools (an expression like //img[@title='some title'] matches any img element whose title attribute equals the given text) and then writes the code to select and click that element.

multi_screen_data_search = browser.find_element_by_xpath("//img[@title='Multi Screen Data Search for CES State and Metro Area']")
multi_screen_data_search.click()

This brings Todd to a page with a couple of checkboxes, one to get data that is seasonally adjusted and one to get data that isn’t. Todd decides he wants the non-adjusted data. He finds the checkbox element, clicks it, then finds and clicks on the “Next form” button to proceed with the query.

## find_elements returns a list; the second 'seasonal' input is the not-seasonally-adjusted option
not_seasonally_adjusted = browser.find_elements_by_xpath("//input[@name='seasonal']")
not_seasonally_adjusted[1].click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

Todd wants the average hourly earnings of all employees in dollars, and selects the appropriate options.

average_hourly_earnings = browser.find_element_by_xpath("//select/option[@value='03']")
average_hourly_earnings.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

Now Todd needs to select the Southwest US region. He knows the region is typically defined as Arizona, California, Colorado, Nevada, New Mexico, and Utah, and selects the states accordingly.

## selected states: AZ, CA, CO, NM, NV, UT
state_values = ['04', '06', '08', '32', '35', '49']
for value in state_values:
    state_element = browser.find_element_by_xpath("//select/option[@value='{}']".format(value))
    state_element.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

Todd wants all the metros in these states, and so selects all the metro options, making sure to exclude the “Statewide” option.

all_metro_elements = browser.find_elements_by_xpath("//select[@name='area_code']/option")
for metro_element in all_metro_elements:
    metro_element.click()

## de-select the statewide option at the top, we just want metros
statewide_option = browser.find_element_by_xpath("//option[@value='00000']")
statewide_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

There are then a couple of screens where Todd specifies that he wants just private-sector wages.

total_private_option = browser.find_element_by_xpath("//option[@value='05']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05000000']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

And finally, Todd gets to retrieve his precious data.

retrieve_data = browser.find_element_by_xpath("//input[@value='Retrieve data']")
retrieve_data.click()

There’s just one problem: the format of the data. It’s displayed in tables on the page, with links (one per metro) to download spreadsheets of the wage data. Todd doesn’t want to go to the trouble of downloading all those spreadsheets, and he certainly doesn’t want to copy and paste the most recent wage numbers into a spreadsheet by hand, so instead he comes up with a way to grab the data straight from the web page and determine which metro has the lowest average earnings.

## all the most recent wage figures have a '(P)' (for 'Preliminary') after the number
## using this fact, one can easily grab all those most recent figures
recent_hourly_earnings = browser.find_elements_by_xpath("//table/tbody/tr/td[contains(text(), '(P)')]")

cleaned_recent_earnings = []
for wage in recent_hourly_earnings:
    ## use regex to grab just the wage figure, excluding the '(P)'
    ## (the dot is escaped so it matches only a literal decimal point)
    cleaned_wage = re.search(r'\d+\.\d+', wage.text).group()
    ## we want to convert to floats for finding the minimum later on
    cleaned_recent_earnings.append(float(cleaned_wage))
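To see what that pattern is doing, here’s a quick standalone check on a made-up cell value (the 25.31 figure is hypothetical; the '(P)' suffix is the format the BLS tables use):

import re

## hypothetical cell text in the format the BLS tables use
sample_cell = "25.31(P)"
print(re.search(r'\d+\.\d+', sample_cell).group())  ## prints: 25.31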

Great, now Todd has the wages. But he also wants the names of the metros to pair with those wages, so he knows which figure belongs to which metro. The only problem is that each metro name is lumped into the same element as several other pieces of info in the text above its table. But Todd sees a way to extract it.

## get all the groups of text above each table
all_table_texts = browser.find_elements_by_xpath("//table/caption/pre")

metros = []
for table_text in all_table_texts:
    ## use regex to pick out just the metro name from all the text:
    ## the name starts with a letter, followed by any combo of letters and spaces,
    ## then a comma and a space, and ends with two uppercase letters (the state abbreviation)
    metro_name = re.search(r'\w[\w ]+, [A-Z][A-Z]', table_text.text).group()
    metros.append(metro_name)
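To make the pattern concrete, here’s what it extracts from a hypothetical caption line (the exact text of real BLS captions varies, so treat this string as an illustration only):

import re

## hypothetical caption text above one of the data tables
caption = "Average Hourly Earnings of All Employees - Tucson, AZ"
print(re.search(r'\w[\w ]+, [A-Z][A-Z]', caption).group())  ## prints: Tucson, AZ

One quirk worth knowing: \w doesn’t match hyphens, so for a hyphenated metro like Phoenix-Mesa-Scottsdale, AZ this pattern would capture only the piece after the last hyphen.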

With a few final lines of code, Todd zips together the metro names and wage data into a dictionary, then takes the minimum to get the name of the metro in the US Southwest with the lowest average hourly earnings.

metro_earnings_dict = dict(zip(metros, cleaned_recent_earnings))
metro_to_move_to = min(metro_earnings_dict, key=metro_earnings_dict.get)
print(metro_to_move_to)
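That min() call is doing a bit of quiet work: iterating over a dictionary yields its keys, and key=metro_earnings_dict.get tells min() to compare those keys by their associated values. A toy example with made-up numbers:

## hypothetical wage figures, just to show the min() behavior
earnings = {'Tucson, AZ': 24.10, 'Yuma, AZ': 21.75}
print(min(earnings, key=earnings.get))  ## prints: Yuma, AZ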

Triumphantly, Todd runs the script, shoots his boss a quick email, and goes back to watching cat videos. He gets several similar requests throughout the day, but with his script’s help Todd blessedly incurs only minor disruptions to his cat-watching regimen.

For convenience, here’s the full script:

import re
from selenium import webdriver

## remember to input your own file path to your chrome driver here if you're using chrome
browser = webdriver.Chrome(executable_path="C:\\Users\\gstanton\\Downloads\\chromedriver.exe")
browser.maximize_window()
url = 'http://www.bls.gov/data/'
browser.get(url)

multi_screen_data_search = browser.find_element_by_xpath("//img[@title='Multi Screen Data Search for CES State and Metro Area']")
multi_screen_data_search.click()

## find_elements returns a list; the second 'seasonal' input is the not-seasonally-adjusted option
not_seasonally_adjusted = browser.find_elements_by_xpath("//input[@name='seasonal']")
not_seasonally_adjusted[1].click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

average_hourly_earnings = browser.find_element_by_xpath("//select/option[@value='03']")
average_hourly_earnings.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

## selected states: AZ, CA, CO, NM, NV, UT
state_values = ['04', '06', '08', '32', '35', '49']
for value in state_values:
    state_element = browser.find_element_by_xpath("//select/option[@value='{}']".format(value))
    state_element.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

all_metro_elements = browser.find_elements_by_xpath("//select[@name='area_code']/option")
for metro_element in all_metro_elements:
    metro_element.click()

## de-select the statewide option at the top, we just want metros
statewide_option = browser.find_element_by_xpath("//option[@value='00000']")
statewide_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05000000']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

retrieve_data = browser.find_element_by_xpath("//input[@value='Retrieve data']")
retrieve_data.click()

## all the most recent wage figures have a '(P)' (for 'Preliminary') after the number
## using this fact, one can easily grab all those most recent figures
recent_hourly_earnings = browser.find_elements_by_xpath("//table/tbody/tr/td[contains(text(), '(P)')]")
cleaned_recent_earnings = []
for wage in recent_hourly_earnings:
    ## use regex to grab just the wage figure, excluding the '(P)'
    cleaned_wage = re.search(r'\d+\.\d+', wage.text).group()
    ## we want to convert to floats for finding the minimum
    cleaned_recent_earnings.append(float(cleaned_wage))

## get all the groups of text above each table
all_table_texts = browser.find_elements_by_xpath("//table/caption/pre")
metros = []
for table_text in all_table_texts:
    ## use regex to pick out just the metro name:
    ## it starts with a letter, followed by any combo of letters and spaces,
    ## then a comma and a space, and ends with two uppercase letters (the state abbreviation)
    metro_name = re.search(r'\w[\w ]+, [A-Z][A-Z]', table_text.text).group()
    metros.append(metro_name)

metro_earnings_dict = dict(zip(metros, cleaned_recent_earnings))
metro_to_move_to = min(metro_earnings_dict, key=metro_earnings_dict.get)
print(metro_to_move_to)
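A caveat for anyone running this script today: it targets the Selenium 3 API. Selenium 4 removed the executable_path argument and the find_element_by_* helpers in favor of Service objects and By locators. A minimal sketch of the equivalent setup and first lookup under Selenium 4 (same XPath, just the newer calls):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

## Selenium 4 equivalents of the setup and first click above
service = Service("C:\\Users\\gstanton\\Downloads\\chromedriver.exe")
browser = webdriver.Chrome(service=service)
browser.maximize_window()
browser.get('http://www.bls.gov/data/')

multi_screen_data_search = browser.find_element(
    By.XPATH, "//img[@title='Multi Screen Data Search for CES State and Metro Area']")
multi_screen_data_search.click()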

Comments

  1. drem says

    November 8, 2016 at 8:47 pm

    This was a great guide. Thank you!

    • Grayson Stanton says

      November 8, 2016 at 10:09 pm

      My pleasure, I’m glad it was helpful.

      • Adam Siemiginowski says

        February 21, 2017 at 5:10 pm

        Are you going to continue working on this blog?

        • Grayson Stanton says

          February 21, 2017 at 5:38 pm

          Yes, absolutely. I’ve been working on something bigger for the site that I wanted to get done before I continue with regular blogging (a D&D-inspired intro to python “course”), and life has been getting in the way much more than usual, but subscribers will be hearing from me again before too long. Thanks for checking in.


