data Rebellion

Learning through Adventure


Selenium + PhantomJS Tutorial

Updated March 25, 2021. Published November 22, 2016.

WARNING: Selenium support for PhantomJS has been deprecated. Both Firefox and Chrome now support headless capabilities. Use one of those instead.
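
For readers who follow the warning above, here is a minimal sketch of launching Chrome in headless mode through Selenium instead. It assumes a reasonably recent Selenium release plus a chromedriver that Selenium can find on your PATH. The helper names `headless_chrome_args` and `launch_headless_chrome` are made up for this illustration and are not part of Selenium.

```python
def headless_chrome_args(width=1920, height=1080):
    """Command-line switches for a Chrome session with no visible window."""
    return ["--headless", "--window-size={},{}".format(width, height)]


def launch_headless_chrome():
    """Start Chrome without a window (hypothetical helper, not a Selenium API)."""
    # Imported inside the function so the helper above works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_args():
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


# Usage (needs Chrome and chromedriver installed):
# browser = launch_headless_chrome()
# browser.get("https://example.com")
# print(browser.title)
# browser.quit()
```

The rest of the scraping code in this post should work the same way with a browser created like this, since it only relies on the generic WebDriver methods.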

This post builds on the previous Selenium-based post. If you have heard of PhantomJS, would like to try it out, and are curious to see how it performs against other browsers such as Chrome, this post will help. In my experience, however, using the PhantomJS browser for web scraping has few benefits compared to using Chrome or Firefox (unless you need to run your script on a server, in which case it’s your go-to). It is faster, though not by as much as you might hope, and I’ve found it to be much less reliable: it can randomly freeze on tasks that run smoothly in Chrome, despite extensive tweaking and troubleshooting. My current opinion is that it’s more trouble than it’s worth for web scraping, but if you want to try it out for yourself, I hope you’ll find the tutorial below helpful.

[Image: a foam phantom in an espresso, fading like PhantomJS]

If you aren’t familiar with it, PhantomJS is a browser much like Chrome or Firefox but with one important difference: it’s headless. This means that using PhantomJS doesn’t require an actual browser window to be open. To install PhantomJS, go to the PhantomJS download page and choose the appropriate download (I’ll assume Windows from here on out, though the process is similar on other operating systems). Unzip the zip file, named something like “phantomjs-2.1.1-windows.zip”, and that’s it: PhantomJS is installed. If you go into the unzipped folder, and then into the bin folder, you should find a file named “phantomjs.exe”. All we need to do now is reference that file’s path in our script to launch the browser.
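
Because everything depends on pointing the script at the right file, it can help to check the path before handing it to Selenium. The sketch below assumes the folder layout described above (a bin folder containing phantomjs.exe); the function name `find_phantomjs` is hypothetical and only for illustration.

```python
from pathlib import Path


def find_phantomjs(unzipped_dir):
    """Return the path to phantomjs.exe inside an unzipped download folder."""
    exe = Path(unzipped_dir) / "bin" / "phantomjs.exe"
    if not exe.is_file():
        # Fail early with a clear message instead of a less obvious Selenium error.
        raise FileNotFoundError("phantomjs.exe not found at {}".format(exe))
    return str(exe)


# Usage, with a placeholder folder name that you would replace with your own:
# my_path = find_phantomjs("path/to/phantomjs-2.1.1-windows")
```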

Here is the start of our script from last time:

import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath) # find the elements
    if len(elements) != 4: # if any are missing, return all nan values
        return [nan] * 4
    else: # otherwise, return just the text of the element
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities', 'total_equity',
                           'net_cash_flow'])
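
To see the all-or-nothing behavior of get_elements without opening a browser, here is a small standalone sketch of the same idea. It swaps in fake elements (any object with a .text attribute) and uses float("nan") in place of numpy's nan so it has no dependencies; the name `texts_or_nan` is invented for this example.

```python
from types import SimpleNamespace

nan = float("nan")  # stand-in for numpy.nan, so the sketch needs no packages


def texts_or_nan(elements, expected=4):
    """Return each element's text, or all nan values if the count is off."""
    if len(elements) != expected:
        return [nan] * expected
    return [e.text for e in elements]


# Fake "web elements": only the .text attribute matters here.
full = [SimpleNamespace(text=t) for t in ["$1", "$2", "$3", "$4"]]
partial = full[:3]

print(texts_or_nan(full))     # the four text values
print(texts_or_nan(partial))  # four nan values, because one element is missing
```

Returning a full row of nan values keeps every column the same length, so a page with a missing figure leaves gaps in the dataframe rather than breaking the script.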

Now we launch the browser, referencing the PhantomJS executable:

my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path)

However, at least for me, simply launching the browser like this resulted in highly unreliable web scraping that would freeze at seemingly random times. To make a long story short, here is some revised code for launching the browser that I found improved performance.

## copy the default capabilities so the shared class-level dict is not modified
dcaps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
## present a regular desktop Chrome user agent instead of the PhantomJS default
dcaps["phantomjs.page.settings.userAgent"] = (
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36')
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path,
                              service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any', '--debug=true'],
                              desired_capabilities=dcaps)
## wait up to 5 seconds for elements to appear before giving up
browser.implicitly_wait(5)

Now I’ll let you compare the runtimes for PhantomJS and Chrome yourself. The script below is set to run PhantomJS. Paste it into your own IDE, and when you want to test Chrome, comment out the PhantomJS launch section and uncomment the Chrome one by removing the triple quotes around it.

import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath) # find the elements
    if len(elements) != 4: # if any are missing, return all nan values
        return [nan] * 4
    else: # otherwise, return just the text of the element
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities', 'total_equity',
                           'net_cash_flow'])

start_time = time.time()

## launch the PhantomJS browser
###############################################################################
## copy the default capabilities so the shared class-level dict is not modified
dcaps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
dcaps["phantomjs.page.settings.userAgent"] = (
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36')
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path,
                              service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any', '--debug=true'],
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
###############################################################################
"""
## launch the Chrome browser
###############################################################################
my_path = "C:\\Users\\gstanton\\Downloads\\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
###############################################################################
"""

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
    ## navigate to balance sheet quarterly page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))
    ## navigate to cash flow quarterly page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = (i * 4) + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]

## close the browser once all companies have been scraped
browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)

print(time.time() - start_time)
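
Two small pieces of the script are easy to check on their own, without a browser: how the URL and XPath templates get filled in, and how each company's four quarters map onto dataframe rows. The sketch below reuses the two template strings from the script; the helper names `build_url` and `row_index` are invented here for illustration.

```python
url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"


def build_url(symbol, statement):
    """Fill the ticker symbol and statement type into the URL template."""
    return url_form.format(symbol, statement)


def row_index(company_number, quarter_number, quarters_per_company=4):
    """Dataframe row for a given company (0-based) and quarter (0-based)."""
    return company_number * quarters_per_company + quarter_number


print(build_url("amzn", "income-statement"))
print(financials_xpath.format("Total Revenue"))
print([row_index(1, j) for j in range(4)])  # the second company fills rows 4 to 7
```

With five companies and four quarters each, the last row used is 19, which is why the dataframe is created with index=range(20).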

When I compared the browsers, I found PhantomJS was generally faster, but not by enough to make Selenium a viable web scraping option if it wasn’t one already with Chrome. Additionally, it took a fair amount of troubleshooting to get PhantomJS to the point where it would perform even semi-reliably.
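
If you want to repeat the comparison a few times, one option is to wrap each run in a small timing helper rather than reading a single printed number. This is only a sketch: `timed` and `fake_scrape` are made-up names, and `fake_scrape` just sleeps in place of a real scraping run, so the numbers it prints say nothing about either browser.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_scrape(delay):
    """Stand-in for a real scraping run; it only sleeps."""
    time.sleep(delay)
    return "done"


# Replace fake_scrape with a function that runs the full scrape per browser.
for label, delay in [("phantomjs", 0.01), ("chrome", 0.02)]:
    _, seconds = timed(fake_scrape, delay)
    print("{}: {:.3f} s".format(label, seconds))
```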

In conclusion, these two browsers are in the same general bracket of web scraping speed, and because Chrome has given me far fewer issues, I still recommend Chrome. In the future, though, I’ll explore other, more powerful ways of scraping pages with JavaScript-rendered content. Stay tuned.


Comments

  1. Michel says

    March 11, 2018 at 7:23 pm

    Good evening,
    I always get this error message with PHANTOMJS!

    UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead

    • Grayson Stanton says

      March 12, 2018 at 5:28 pm

      Thank you for pointing this out, Michel. I’ll be doing posts on headless Firefox and Chrome soon, so I’ll be sure to update this page with a warning then.
