WARNING: Selenium support for PhantomJS has been deprecated. Both Firefox and Chrome now support headless capabilities. Use one of those instead.
This post borrows from the previous Selenium-based post here. If you have heard of PhantomJS, would like to try it out, and are curious to see how it performs against other browsers such as Chrome, this post will help. In my experience, though, using PhantomJS for web scraping doesn't offer many benefits over Chrome or Firefox (unless you need to run your script on a server, in which case it's your go-to). It is faster, though not by as much as you might hope, and I've found it to be much less reliable: even after extensive tweaking and troubleshooting, it can randomly freeze on tasks that run smoothly on Chrome. My current opinion is that it's more trouble than it's worth for web scraping, but if you want to try it for yourself, I hope you'll find the tutorial below helpful.
If you aren't familiar with it, PhantomJS is a browser much like Chrome or Firefox, but with one important difference: it's headless. This means that using PhantomJS doesn't require an actual browser window to be open. To install the PhantomJS browser, go here and choose the appropriate download (I'll assume Windows from here on out, though the process is similar on other operating systems). Unzip the downloaded file, named something like "phantomjs-2.1.1-windows.zip", and that's it: PhantomJS is installed. If you go into the unzipped folder, and then into the bin folder, you should find a file named "phantomjs.exe". All we need to do now is reference that file's path in our script to launch the browser.
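Before involving Selenium at all, one quick way to sanity-check the install is to call the executable directly from Python and ask for its version. This is just a sketch; the path below is the one used later in this post, so adjust it to wherever you unzipped PhantomJS:

import subprocess

# path to phantomjs.exe inside the unzipped folder; adjust to your own location
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

# --version is a built-in PhantomJS flag; this should print something like '2.1.1'
print(subprocess.check_output([my_path, '--version']).decode().strip())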
Here is the start of our script from last time:
import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of the elements
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20), columns=['company', 'quarter', 'quarter_ending',
                                            'total_revenue', 'gross_profit', 'net_income',
                                            'total_assets', 'total_liabilities',
                                            'total_equity', 'net_cash_flow'])
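One caveat if you're following along with a recent Selenium release: this post predates Selenium 4, where the find_elements_by_xpath method used above was removed. The equivalent call uses the By class we already import:

# Selenium 4 equivalent of browser.find_elements_by_xpath(xpath)
elements = browser.find_elements(By.XPATH, xpath)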
Now we launch the browser, referencing the PhantomJS executable:
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path)
However, at least for me, simply launching the browser like this resulted in highly unreliable scraping that would freeze at seemingly random times. To make a long story short, here is some revised code for launching the browser that I found made things run more reliably.
dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path,
                              service_args=['--ignore-ssl-errors=true',
                                            '--ssl-protocol=any',
                                            '--debug=true'],
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
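One more tweak worth trying if pages still render oddly (this isn't in the script below, just a suggestion): PhantomJS starts with a small default viewport of 400x300, which can cause responsive sites to serve a mobile-style layout that breaks XPaths written against the desktop page. Forcing a desktop-sized window is cheap insurance:

# PhantomJS defaults to a tiny 400x300 viewport; force a desktop-sized one
# so responsive pages render the same layout you would see in Chrome
browser.set_window_size(1400, 1000)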
Now I'll let you compare the runtimes for PhantomJS and Chrome. The script below is set to run PhantomJS, so just paste the code into your own IDE; when you want to test Chrome, comment out the PhantomJS browser-launch section and uncomment the Chrome one.
import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of the elements
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20), columns=['company', 'quarter', 'quarter_ending',
                                            'total_revenue', 'gross_profit', 'net_income',
                                            'total_assets', 'total_liabilities',
                                            'total_equity', 'net_cash_flow'])

start_time = time.time()

## launch the PhantomJS browser
###############################################################################
dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
my_path = 'C:\\Users\\gstanton\\Downloads\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path,
                              service_args=['--ignore-ssl-errors=true',
                                            '--ssl-protocol=any',
                                            '--debug=true'],
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
###############################################################################

"""
## launch the Chrome browser
###############################################################################
my_path = "C:\\Users\\gstanton\\Downloads\\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
###############################################################################
"""

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):

    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)

    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text

    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)

    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)

    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))

    ## navigate to balance sheet quarterly page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)

    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))

    ## navigate to cash flow quarterly page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)

    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = (i * 4) + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]

browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)

print(time.time() - start_time)
When I compared the browsers, I found PhantomJS was generally faster, but not by enough to make Selenium a viable web-scraping option if it wasn't one already with Chrome. It also took a fair amount of troubleshooting to get PhantomJS to the point where it would perform even semi-reliably.
In conclusion, these two browsers are in the same general bracket of web-scraping speed, and because Chrome has given me far fewer issues, I still recommend Chrome. In the future, though, I'll explore other, more powerful ways of scraping pages with JavaScript-rendered content. Stay tuned.
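Given the deprecation warning at the top of this post, here is a minimal sketch of what the browser launch would look like with headless Chrome instead; it assumes a Selenium 4 release (which resolves the driver for you) or a chromedriver.exe on your PATH:

from selenium import webdriver

# run Chrome without a visible window, much like PhantomJS did
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

# the rest of the scraping script is unchanged; remember to browser.quit() when done

With this launch section swapped in, everything else above (the XPaths, the waits, the dataframe) works as-is, since headless Chrome drives the same Chrome engine.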
Michel says
Good evening,
I still get this error message with PhantomJS!
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
Grayson Stanton says
Thank you for pointing this out, Michel. I'll be doing posts on headless Firefox and Chrome soon, so I'll be sure to update this page with a warning then.