WARNING: Selenium support for PhantomJS has been deprecated. Both Firefox and Chrome now support headless capabilities. Use one of those instead.
This post borrows from the previous Selenium-based post here. If you have heard of PhantomJS, would like to try it out, and are curious to see how it performs against other browsers such as Chrome, this post will help. In my experience, though, using PhantomJS for web scraping doesn't offer many benefits over Chrome or Firefox (unless you need to run your script on a server, in which case it's your go-to). It is faster, though not by as much as you might hope, and I've found it to be much less reliable: even after extensive tweaking and troubleshooting, it can randomly freeze on tasks that run smoothly on Chrome. My current opinion is that it's more trouble than it's worth for web scraping, but if you want to try it for yourself, I hope you'll find the tutorial below helpful.
If you aren't familiar with it, PhantomJS is a browser much like Chrome or Firefox, but with one important difference: it's headless. This means that using PhantomJS doesn't require an actual browser window to be open. To install the PhantomJS browser, go here and choose the appropriate download (I'll assume Windows from here on out, though the process is similar on other operating systems). Unzip the downloaded file, named something like "phantomjs-2.1.1-windows.zip", and that's it: PhantomJS is installed. If you go into the unzipped folder, and then into the bin folder, you should find a file named "phantomjs.exe". All we need to do now is reference that file's path in our script to launch the browser.
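Before involving Selenium at all, one quick way to sanity-check the install is to call the executable directly from Python and ask for its version. This is just a sketch; the path below is the one used later in this post, so adjust it to wherever you unzipped PhantomJS:

import subprocess

# path to phantomjs.exe inside the unzipped folder; adjust to your own location
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

# --version is a built-in PhantomJS flag; this should print something like '2.1.1'
print(subprocess.check_output([my_path, '--version']).decode().strip())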
Here is the start of our script from last time:
import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of the elements
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20), columns=['company', 'quarter', 'quarter_ending',
                                            'total_revenue', 'gross_profit', 'net_income',
                                            'total_assets', 'total_liabilities',
                                            'total_equity', 'net_cash_flow'])
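One caveat if you're following along with a recent Selenium release: this post predates Selenium 4, where the find_elements_by_xpath method used above was removed. The equivalent call uses the By class we already import:

# Selenium 4 equivalent of browser.find_elements_by_xpath(xpath)
elements = browser.find_elements(By.XPATH, xpath)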
Now we launch the browser, referencing the PhantomJS executable:
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path)
However, at least for me, simply launching the browser like this resulted in highly unreliable scraping that would freeze at seemingly random times. To make a long story short, here is some revised code for launching the browser that I found made things run more reliably.
dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path,
                              service_args=['--ignore-ssl-errors=true',
                                            '--ssl-protocol=any',
                                            '--debug=true'],
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
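One more tweak worth trying if pages still render oddly (this isn't in the script below, just a suggestion): PhantomJS starts with a small default viewport of 400x300, which can cause responsive sites to serve a mobile-style layout that breaks XPaths written against the desktop page. Forcing a desktop-sized window is cheap insurance:

# PhantomJS defaults to a tiny 400x300 viewport; force a desktop-sized one
# so responsive pages render the same layout you would see in Chrome
browser.set_window_size(1400, 1000)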
Now I'll let you compare the runtimes for PhantomJS and Chrome. The script below is set to run PhantomJS, so just paste the code into your own IDE; when you want to test Chrome, comment out the PhantomJS browser-launch section and uncomment the Chrome one.
import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of the elements
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20), columns=['company', 'quarter', 'quarter_ending',
                                            'total_revenue', 'gross_profit', 'net_income',
                                            'total_assets', 'total_liabilities',
                                            'total_equity', 'net_cash_flow'])

start_time = time.time()

## launch the PhantomJS browser
###############################################################################
dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path,
                              service_args=['--ignore-ssl-errors=true',
                                            '--ssl-protocol=any',
                                            '--debug=true'],
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
###############################################################################

"""
## launch the Chrome browser
###############################################################################
my_path = "C:\\Users\\gstanton\\Downloads\\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
###############################################################################
"""

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):

    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)

    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text

    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)

    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)

    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))

    ## navigate to balance sheet quarterly page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)

    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))

    ## navigate to cash flow quarterly page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)

    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = (i * 4) + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]

browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)

print(time.time() - start_time)
When I compared the browsers, I found PhantomJS was generally faster, but not by enough to make Selenium a viable web-scraping option if it wasn't one already with Chrome. It also took a fair amount of troubleshooting to get PhantomJS to the point where it would perform even semi-reliably.
In conclusion, these two browsers are in the same general bracket of web-scraping speed, and because Chrome has given me far fewer issues, I still recommend Chrome. In the future, though, I'll explore other, more powerful ways of scraping pages with JavaScript-rendered content. Stay tuned.
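Given the deprecation warning at the top of this post, here is a minimal sketch of what the browser launch would look like with headless Chrome instead; it assumes a Selenium 4 release (which resolves the driver for you) or a chromedriver.exe on your PATH:

from selenium import webdriver

# run Chrome without a visible window, much like PhantomJS did
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

# the rest of the scraping script is unchanged; remember to browser.quit() when done

With this launch section swapped in, everything else above (the XPaths, the waits, the dataframe) works as-is, since headless Chrome drives the same Chrome engine.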
Michel says
Good evening,
I still get this error message with PhantomJS!
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
Grayson Stanton says
Thank you for pointing this out, Michel. I'll be doing posts on headless Firefox and Chrome soon, so I'll be sure to update this page with a warning then.