Note: The following post is a significant step up in difficulty from the previous Selenium post, Automate Your Browser: A Guided Selenium Adventure. Please see the start of that post for links on getting Selenium set up if this is your first time using it. If you really do need financial data, there are likely easier ways to obtain it than scraping Nasdaq, Yahoo, or Morningstar with Selenium: Quandl, Yahoo's finance API, or perhaps a scraper built with scrapy and splash. There are also many proprietary (and expensive) databases that provide such data. In any case, I hope this post is helpful in demonstrating a few more of the practices involved in real-life webscraping. The full script is at the end of the post for your convenience.
One fine Monday morning, Todd is sipping a hot cup of decaf green tea, gazing out the office window in a state of Zen oneness as a Selenium script does his work for him. But just as he is on the brink of enlightenment, his boss, Mr. Peabody, bursts into his cubicle and barks, “TODD, quit daydreaming. I just got word from the CEO: we need quarterly financials on some of our competitors.” “Oh? What for?” “Some competitive analysis or something. We’ll be doing it on a regular basis. In any case, we need that data TODAY or YOU’RE FIRED!”
As Mr. Peabody stomps away, Todd lets out a sigh. His morning had been going so well, but now it seems he has to actually do some work. He decides, though, that if he's going to do work, he's going to do everything in his power to make sure he never has to do that work again. Brainstorming sources of financial data, Todd figures he could get it from nasdaq.com as easily as anywhere else. He navigates to the quarterly income statement of the first company on the list, Apple (ticker symbol: AAPL).
http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly
The first thing Todd notices is that the actual financial data table is being generated via JavaScript (look for the <script> tags in the HTML). This means that Python packages such as lxml and Beautiful Soup, which don't execute JavaScript, won't be much help here. Todd knows that Selenium doesn't make for the fastest webscraper, but because he only needs data on 5 companies (Amazon, Apple, Facebook, IBM, Microsoft), he still decides to write up another quick Selenium script.
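To confirm this, one can fetch the raw page without a browser and run the same sort of xpath against it; since the figures are filled in client-side, the query should come up empty. Here is a minimal sketch using requests and lxml:

import requests
from lxml import html

url = "http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly"
tree = html.fromstring(requests.get(url).content)

## the same pattern the scraper will use to grab dollar figures;
## it finds nothing in the static HTML because the JavaScript hasn't run
cells = tree.xpath("//tbody/tr/th[text() = 'Total Revenue']/../td[contains(text(), '$')]")
print(len(cells))  # expect 0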
To start, he knows he needs to make some imports, initialize a dataframe to store his scraped data in, and launch the browser.
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities',
                           'total_equity', 'net_cash_flow'])

## launch the Chrome browser
my_path = "C:\\Users\\gstanton\\Downloads\\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
Next, Todd thinks about how he’s going to get from company page to company page. Observing the current page’s url, he sees that substituting in the company’s ticker symbol and desired financial statement at the appropriate places should allow him to navigate to all the pages he needs, no simulated-clicking required. He also sees a common pattern in the xpath for the financial data he’ll be scraping.
url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
The first thing he wants to grab is the company ticker symbol, just so he can verify he’s scraping the correct page.
for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)

    ## wait (up to 10 seconds) for the page header before scraping anything
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:
        company = nan
Notice the assignment of the company variable. The WebDriverWait tells the browser to check whether the element is present, just as it normally would. If the element isn't present, the browser checks again every half second (the default polling interval) until the specified 10 seconds are up, at which point it raises a TimeoutException. This sort of precaution can be very useful for making your scrapers more reliable.
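For reference, both the timeout and the polling interval can be set explicitly, and the timeout can be caught. A minimal sketch, reusing the browser, By, and EC names from the script above:

from selenium.common.exceptions import TimeoutException

## poll every 0.25 seconds instead of the default 0.5, for up to 10 seconds total
wait = WebDriverWait(browser, 10, poll_frequency=0.25)
try:
    header = wait.until(EC.presence_of_element_located(
        (By.XPATH, "//h1[contains(text(), 'Company Financials')]")))
except TimeoutException:
    header = None  # element never appeared; handle the miss gracefully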
Examining the xpaths for the rest of the financial info, Todd sees that he will be collecting data points in groups of 4 (one data point for each quarter). To account for the possibility that some data might be missing, and to efficiently extract the text from the web elements, Todd writes the following function to simplify the scraping code.
## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of each element
        text = []
        for e in elements:
            text.append(e.text)
        return text
Todd then finishes the code to loop through each of the company symbols and get the quarterly financial data from each of the financial statements.
## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)

    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:
        company = nan

    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)

    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)

    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))

    ## navigate to balance sheet quarterly page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)

    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))

    ## navigate to cash flow quarterly page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)

    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))
So for each iteration of the loop, Todd is collecting these data points. But he needs somewhere to store them. That’s where the pandas dataframe comes in. The following for loop ensures that the data is placed appropriately in the dataframe.
    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = (i * 4) + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
After remembering to close the browser and write his dataframe to a .csv file, Todd has his scraper. Kicking his feet back up on his desk, he breathes a sigh of relief and continues his deep meditations on the nature of being while selenium once again does his work for him.
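Those finishing touches are just two lines (they appear again at the end of the full script):

browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)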
If you enjoyed this post be sure to subscribe, and let me know if you have any other topics you’d like to see covered. Full script below.
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of each element
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities',
                           'total_equity', 'net_cash_flow'])

## launch the Chrome browser
my_path = "C:\\Users\\gstanton\\Downloads\\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)

    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:
        company = nan

    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)

    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)

    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))

    ## navigate to balance sheet quarterly page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)

    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))

    ## navigate to cash flow quarterly page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)

    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = (i * 4) + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]

browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)
Taylor says
Could you either point me to an XPath tutorial or create an XPath tutorial?
Thanks!
Grayson Stanton says
Hi there. Sure thing, here are a few. And here’s an overview of all the different options for selecting elements in selenium. I do plan to do an xpath tutorial at some point in the near future, so look for that. Best of luck!
Jay says
Just wanted to say thank you for providing such clear and helpful tutorials. I was stuck for a week on trying to understand selenium before I found your site.
Grayson Stanton says
Hi Jay. Thanks for your comment, I'm glad you found the tutorials helpful. Let me know if there are any other subjects you're stuck on; I'm always open to suggestions on what to cover.
Jay says
For some reason I'm only retrieving one quarter of data for Amazon, Apple, Facebook, and IBM. But for Microsoft, I'm getting all four quarters. I tried to play around with the code but I haven't been successful.
Grayson Stanton says
Ah, thanks for pointing that out, there was indeed a bug in the code. The last section of code, where the data is copied into the dataframe, wasn't indexing rows properly, so each company's data was being overwritten by the next company's. Microsoft's data survived only because Microsoft came last. I also noticed I made the initial dataframe too big (I think originally I was going to do 10 companies, so 40 rows were needed), but that's changed now too.
Note that there is also a popup that sometimes appears, which can result in a page's data not being scraped. To get around this, you could try clicking to close the window, or even reload the page if the expected elements aren't detected (assuming the popup isn't likely to appear twice in a row). I'm planning to write another financial-scraping post that shouldn't have this problem and should improve on this whole strategy, though hopefully this is still useful as an exercise in selenium.
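A rough sketch of that fallback, in the style of the script above (the popup's close-button xpath is purely a guess here; you'd need to inspect the actual popup to find the right selector):

from selenium.common.exceptions import NoSuchElementException, TimeoutException

def get_page_with_retry(url, check_xpath):
    ## load the page and wait for a known element; if it never appears,
    ## try to dismiss a popup and reload once
    browser.get(url)
    try:
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, check_xpath)))
    except TimeoutException:
        try:
            ## hypothetical close button; adjust to the real popup
            browser.find_element_by_xpath("//button[contains(@class, 'close')]").click()
        except NoSuchElementException:
            pass
        browser.get(url)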
Thanks again, hope this helps.
Jay says
Thank you for getting back to me, everything works now. I think I'm going to try to add on to this and eventually create a discounted cash flow model. Thanks again for the help, I appreciate it.
michel Dupuis says
Thank you, this is the best tutorial I've found on the subject!
https://www.nasdaq.com/symbol/dcix/financials
Just a small bug when I tried it with "dcix" (last Period Ending: 12/31/2016)!
Get Quarterly Data =>
There is currently no data for this symbol.
Thanks (Michel from France)
Grayson Stanton says
Hi Michel. It looks like, for whatever reason, Nasdaq is currently missing that data for DCIX. However, Yahoo, Marketwatch, and Morningstar all appear to have it, so you might check those out:
https://finance.yahoo.com/quote/DCIX/financials?p=DCIX
https://www.marketwatch.com/investing/stock/dcix/financials/income/quarter
http://financials.morningstar.com/income-statement/is.html?t=DCIX&region=usa&culture=en-US
Michel says
Morningstar is the fastest to update the data.
For example, ADXS reported March 12 AMC => updated March 14, nothing on the other sites.
You can also go directly to the source:
https://www.sec.gov/edgar/searchedgar/companysearch.html
search_bar = browser.find_element_by_xpath("//label[@for='cik']")
search_bar.send_keys('ACY')  # ACY reported March 13 AMC
Bug!!!!
Maybe you can help!
Grayson Stanton says
Hi Michel. Assuming you’re sending keys to the “Fast Search” bar, this worked for me:
search_bar = browser.find_element_by_xpath("//input[@id='cik']")
search_bar.send_keys('ACY')
If you still get an error from this, make sure you have the most up-to-date gecko/chrome driver, browser, and selenium module. Let me know how this goes for you.
Michel says
Hello,
after some trial and error:
search_bar = browser.find_element_by_xpath("//*[@id='fast-search']/fieldset/input[@type='text']")
search_bar.send_keys('ACY')
search_button = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='fast-search']/fieldset/input[@id='cik_find']"))).click()
dernier = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='interactiveDataBtn']"))).click()
F_stat = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='menu_cat2']"))).click()  # expand the menu
cons_bal = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='r2']/a"))).click()
Have a good day
Grayson Stanton says
Awesome, I’m glad you got that figured out. Best of luck!
Michel Dupuis says
I tried to do the same with Yahoo, without success…!
https://finance.yahoo.com/quote/adsk/financials?p=adsk
Could you help me?
f_xpath_yahoo = "//tbody/tr/td[text() = 'Total Revenue']/????
#copy xpath => //*[@id="Col1-1-Financials-Proxy"]/section/div[3]/table/tbody/tr[2]/td[1]/span
#copy selector => #Col1-1-Financials-Proxy > section > div.Mt\28 10px\29.Ovx\28 a\29.W\28 100\25 \29 > table > tbody > tr:nth-child(2) > td.Fz\28 s\29.H\28 35px\29.Va\28 m\29 > span
#copy element => Total Revenue
I am completely lost …!
Grayson Stanton says
Hi Michel. Here’s part of the script which I’ve revised to work on Yahoo:
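The gist of it: on Yahoo the row labels and figures are wrapped in spans, so the xpath needs to drill into those. This is just a sketch reconstructed from the devtools output you pasted; treat the exact xpath as an assumption to verify against the live page:

## sketch only: assumes each cell's text sits inside a <span>,
## as in the xpath Chrome's devtools copied above
yahoo_url = "https://finance.yahoo.com/quote/{}/financials?p={}"
f_xpath_yahoo = "//table/tbody/tr[td[1]/span[text() = '{}']]/td[position()>1]/span"

browser.get(yahoo_url.format("adsk", "adsk"))
total_revenue = [e.text for e in browser.find_elements_by_xpath(
    f_xpath_yahoo.format("Total Revenue"))]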
Does this make any sense? See if you can figure the rest out from here. If not, let me know. Best of luck!
Michel says
Thanks,
it works perfectly!
The browser windows are just a bit slow to open…
But since PhantomJS no longer works,
there's no other solution.
Thanks again
Joshua Thompson says
Hello, is there a way to adapt this code to Google Colab? Thank you very much!
Grayson Stanton says
See if this stackoverflow question helps at all: https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com
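The short version from that thread (Colab's preinstalled packages change over time, so treat this as a starting point rather than a guarantee):

## in a Colab cell: install chromium and its matching driver
!apt-get update
!apt-get install -y chromium-chromedriver

from selenium import webdriver

## Colab has no display, so Chrome must run headless
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
browser = webdriver.Chrome('chromedriver', options=options)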