
Scraping Financial Data with Selenium

Updated March 25, 2021. Published November 22, 2016.

Note: The following post is a significant step up in difficulty from the previous Selenium-based post, Automate Your Browser: A Guided Selenium Adventure. Please see the start of that post for links on getting Selenium set up if this is your first time using it. If you really do need financial data, there are likely easier ways to obtain it than scraping Nasdaq, Yahoo, or Morningstar with Selenium: for example, Quandl, Yahoo's finance API, or a scraper built with Scrapy and Splash. There are also many proprietary (and expensive) databases that provide such data. In any case, I hope this post is helpful in demonstrating a few more of the practices involved in real-life web scraping. The full script is at the end of the post for your convenience.

Financial data can be scraped with Selenium

One fine Monday morning, Todd is sipping a hot cup of decaf green tea, gazing out the office window in a state of Zen oneness as a Selenium script does his work for him. But just as he is on the brink of enlightenment, his boss, Mr. Peabody, bursts into his cubicle and barks, “TODD, quit daydreaming. I just got word from the CEO: we need quarterly financials on some of our competitors.” “Oh? What for?” “Some competitive analysis or something. We’ll be doing it on a regular basis. In any case, we need that data TODAY or YOU’RE FIRED!”

As Mr. Peabody stomps away, Todd lets out a sigh. His morning had been going so well, but now it seems he actually has to do some work. He decides, though, that if he's going to do this work, he's going to do everything in his power to make sure he never has to do it again. Brainstorming sources of financial data, Todd figures he could get it from nasdaq.com as easily as anywhere else. He navigates to the quarterly income statement of the first company on the list, Apple (ticker symbol: AAPL).

http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly

The first thing Todd notices is that the actual financial data table is generated via JavaScript (look for the script tags in the HTML). This means that Python packages such as lxml and Beautiful Soup, which don't execute JavaScript, won't be much help here. Todd knows that Selenium doesn't make for the fastest web scraper, but because he only needs data on five companies (Amazon, Apple, Facebook, IBM, Microsoft), he still decides to write up another quick Selenium script.
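A quick way to confirm this for yourself, as a sketch: fetch the raw page with the requests library (assumed installed; the page's exact markup may have changed since this post was written) and check whether the table's figures appear in the static HTML.

import requests

## fetch the raw HTML, without executing any JavaScript
url = "http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly"
html = requests.get(url).text

## the table of figures is built by JavaScript after the page loads,
## so the rendered rows shouldn't be found in the raw source
print("Gross Profit" in html)  # expect False if the table is JS-generated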

To start, he knows he needs to import a few modules, initialize a dataframe to store the scraped data, and launch the browser.

import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities', 'total_equity',
                           'net_cash_flow'])

## launch the Chrome browser
my_path = "C:\\Users\\gstanton\\Downloads\\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()

Next, Todd thinks about how he's going to get from company page to company page. Looking at the current page's URL, he sees that substituting the company's ticker symbol and the desired financial statement into the appropriate places should let him navigate to every page he needs, no simulated clicking required. He also notices a common pattern in the xpaths for the financial data he'll be scraping.

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to the quarterly income statement page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
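For instance, filling in those two templates for Apple's balance sheet and for the "Total Revenue" row produces (plain str.format, shown for illustration):

>>> url_form.format("aapl", "balance-sheet")
'http://www.nasdaq.com/symbol/aapl/financials?query=balance-sheet&data=quarterly'
>>> financials_xpath.format("Total Revenue")
"//tbody/tr/th[text() = 'Total Revenue']/../td[contains(text(), '$')]"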

The first thing he wants to grab is the company ticker symbol, just so he can verify he’s scraping the correct page.

from selenium.common.exceptions import TimeoutException

for i, symbol in enumerate(symbols):
    ## navigate to the quarterly income statement page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:  # the element never appeared within 10 seconds
        company = nan

Notice the line that assigns the company variable. This tells Selenium to check whether the element is present, just as it normally would. If the element isn't present, Selenium checks again every half second until the specified 10 seconds are up, at which point it raises a TimeoutException. This sort of precaution can be very useful for making your scrapers more reliable.
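The half-second interval is WebDriverWait's default polling frequency; it can be tuned with the poll_frequency parameter. A minimal sketch:

## same wait as above, but polling once per second instead of every half second
company = WebDriverWait(browser, 10, poll_frequency=1).until(
    EC.presence_of_element_located((By.XPATH, company_xpath))).text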

Examining the xpaths for the rest of the financial info, Todd sees that he will be collecting data points in groups of 4 (one data point for each quarter). To account for the possibility that some data might be missing, and to efficiently extract the text from the web elements, Todd writes the following function to simplify the scraping code.

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of each element
        text = []
        for e in elements:
            text.append(e.text)
        return text
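A side note for readers on newer installs: Selenium 4 deprecates the find_elements_by_xpath shortcut in favor of find_elements(By.XPATH, ...), so there the function would look like this, behavior unchanged:

from selenium.webdriver.common.by import By

def get_elements(xpath):
    elements = browser.find_elements(By.XPATH, xpath)  # Selenium 4 style
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    return [e.text for e in elements]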

Todd then finishes the code to loop through each of the company symbols and get the quarterly financial data from each of the financial statements.

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to the quarterly income statement page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    company = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, company_xpath))).text
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))

    ## navigate to the quarterly balance sheet page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))

    ## navigate to the quarterly cash flow page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

So for each iteration of the loop, Todd is collecting these data points, but he needs somewhere to store them. That's where the pandas dataframe comes in. The following for loop, nested inside the loop over companies, places each data point in the appropriate row of the dataframe.

    ## fill the dataframe with the scraped data, 4 rows per company
    ## (still inside the for loop over symbols)
    for j in range(4):
        row = (i * 4) + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
        df.loc[row, 'total_equity'] = total_equity[j]

After remembering to close the browser and write his dataframe to a .csv file, Todd has his scraper. Kicking his feet back up on his desk, he breathes a sigh of relief and continues his deep meditations on the nature of being while selenium once again does his work for him.
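One optional refinement not in Todd's original script: the scraped figures are strings like "$53,265", so before doing any arithmetic on them you'd want to convert them to numbers. A sketch with pandas (note that if the site renders negative values in parentheses, their sign would still need to be restored separately):

money_columns = ['total_revenue', 'gross_profit', 'net_income',
                 'total_assets', 'total_liabilities', 'total_equity',
                 'net_cash_flow']
for col in money_columns:
    ## strip dollar signs, commas, and parentheses, then convert;
    ## unparseable entries (including the nan placeholders) become NaN
    df[col] = pd.to_numeric(
        df[col].astype(str).str.replace(r'[$,()]', '', regex=True),
        errors='coerce')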

If you enjoyed this post be sure to subscribe, and let me know if you have any other topics you’d like to see covered. Full script below.

import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    elements = browser.find_elements_by_xpath(xpath)  # find the elements
    if len(elements) != 4:  # if any are missing, return all nan values
        return [nan] * 4
    else:  # otherwise, return just the text of each element
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities', 'total_equity',
                           'net_cash_flow'])

## launch the Chrome browser
my_path = "C:\\Users\\gstanton\\Downloads\\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to the quarterly income statement page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    company = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, company_xpath))).text
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))

    ## navigate to the quarterly balance sheet page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))

    ## navigate to the quarterly cash flow page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = (i * 4) + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]

browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)


Comments

  1. Taylor says

    December 2, 2016 at 9:42 pm

    Could you either point me to an XPath tutorial or create an XPath tutorial?

    Thanks!

    Reply
    • Grayson Stanton says

      December 3, 2016 at 7:06 pm

      Hi there. Sure thing, here are a few. And here’s an overview of all the different options for selecting elements in selenium. I do plan to do an xpath tutorial at some point in the near future, so look for that. Best of luck!

      Reply
  2. Jay says

    April 18, 2017 at 1:54 pm

    Just wanted to say thank you for providing such clear and helpful tutorials. I was stuck for a week on trying to understand selenium before I found your site.

    Reply
    • Grayson Stanton says

      April 18, 2017 at 4:09 pm

Hi Jay. Thanks for your comment, I'm glad you found the tutorials helpful. Let me know if there are any other subjects you're stuck on; I'm always open to suggestions on what to cover.

      Reply
  3. Jay says

    April 18, 2017 at 4:46 pm

For some reason I'm only retrieving one quarter of data for Amazon, Apple, Facebook, and IBM. But for Microsoft, I'm getting all four quarters. I tried to play around with the code but I haven't been successful.

    Reply
    • Grayson Stanton says

      April 18, 2017 at 6:52 pm

      Ah, thanks for pointing that out, there was indeed a bug in the code. That last section of code where the data is being copied into the dataframe wasn’t properly indexing things and so data was being overwritten, except for Microsoft data because Microsoft was last. I also noticed I made the initial dataframe too big (I think originally I was going to do 10 companies, so 40 rows were needed) but that’s changed now too.

      Note that there is also a pop up that sometimes appears, which can result in a page’s data not being scraped. To get around this, you could try clicking to close the window, or maybe even reload the page if the elements aren’t detected (assuming the pop up isn’t likely to occur twice in a row). I’m planning to write another financial scraping post that shouldn’t have this problem and should be an improvement over this whole strategy, though hopefully this is still useful as an exercise in selenium.

      Thanks again, hope this helps.

      Reply
      • Jay says

        April 18, 2017 at 7:06 pm

        Thank you for getting back to me, everything works now. I think I’m going to try to add on to this and eventually try to create discounted cash flow model. Thanks again for the help, I appreciate it.

        Reply
  4. michel Dupuis says

    March 11, 2018 at 4:09 pm

Thank you, this is the best tutorial I've found on the subject!
https://www.nasdaq.com/symbol/dcix/financials
Just one small bug, tried with "dcix" (last Period Ending: 12/31/2016)!
Get Quarterly Data =>
There is currently no data for this symbol.
Thanks (Michel from France)

    Reply
    • Grayson Stanton says

      March 12, 2018 at 5:54 pm

      Hi Michel. It looks like, for some unspecified reason, that Nasdaq is currently missing that data for DCIX. However, Yahoo, Marketwatch, and Morningstar all appear to have it, so you might check those out:
      https://finance.yahoo.com/quote/DCIX/financials?p=DCIX
      https://www.marketwatch.com/investing/stock/dcix/financials/income/quarter
      http://financials.morningstar.com/income-statement/is.html?t=DCIX&region=usa&culture=en-US

      Reply
      • Michel says

        March 14, 2018 at 4:10 pm

Morningstar is the fastest to update the data.
Example: ADXS, March 12 AMC => updated March 14, nothing on the other sites.

It can also be taken directly from the source:
https://www.sec.gov/edgar/searchedgar/companysearch.html
search_bar = browser.find_element_by_xpath("//label[@for='cik']")
search_bar.send_keys('ACY') # ACY earnings March 13 AMC
Bug!!!!
Maybe with your help!

        Reply
        • Grayson Stanton says

          March 15, 2018 at 12:33 am

          Hi Michel. Assuming you’re sending keys to the “Fast Search” bar, this worked for me:

          search_bar = browser.find_element_by_xpath("//input[@id='cik']")
          search_bar.send_keys('ACY')

          If you still get an error from this, make sure you have the most up-to-date gecko/chrome driver, browser, and selenium module. Let me know how this goes for you.

          Reply
        • Michel says

          March 15, 2018 at 8:19 am

Hello,
after some trial and error:

search_bar = browser.find_element_by_xpath("//*[@id='fast-search']/fieldset/input[@type='text']")
search_bar.send_keys('ACY')
search_button = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='fast-search']/fieldset/input[@id='cik_find']"))).click()
dernier = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='interactiveDataBtn']"))).click()
F_stat = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='menu_cat2']"))).click() # opens the menu
cons_bal = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='r2']/a"))).click()

Have a good day

          Reply
          • Grayson Stanton says

            March 15, 2018 at 4:36 pm

            Awesome, I’m glad you got that figured out. Best of luck!

            Reply
  5. Michel Dupuis says

    March 12, 2018 at 11:43 am

I tried to do the same with Yahoo without success…!
https://finance.yahoo.com/quote/adsk/financials?p=adsk
If you could help me?

f_xpath_yahoo = "//tbody/tr/td[text() = 'Total Revenue']/????
#copy xpath => //*[@id="Col1-1-Financials-Proxy"]/section/div[3]/table/tbody/tr[2]/td[1]/span
#copy selector => #Col1-1-Financials-Proxy > section > div.Mt\28 10px\29.Ovx\28 a\29.W\28 100\25 \29 > table > tbody > tr:nth-child(2) > td.Fz\28 s\29.H\28 35px\29.Va\28 m\29 > span
#copy element => Total Revenue
I am completely lost…!

    Reply
    • Grayson Stanton says

      March 12, 2018 at 7:11 pm

      Hi Michel. Here’s part of the script which I’ve revised to work on Yahoo:

url_form = "https://finance.yahoo.com/quote/{0}/financials?p={0}"
financials_xpath = "//table/tbody/tr/td[1]/span[text() = '{}']/../../td[position()>=2]"

## company ticker symbols
symbols = ["dcix"]

for i, symbol in enumerate(symbols):
    ## navigate to the quarterly income statement page
    url = url_form.format(symbol)
    browser.get(url)

    company_xpath = "//h1[contains(text(), '{}')]".format(symbol.upper())
    company = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text

    quarters_button_xpath = "//button/div/span[contains(text(), 'Quarterly')]"
    quarters_button = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, quarters_button_xpath))).click()

    quarter_endings_xpath = "//table/tbody/tr[1]/td[position()>=2]/span"
    quarter_endings = get_elements(quarter_endings_xpath)

    quarters = ['4th', '3rd', '2nd', '1st']

    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
      

      Does this make any sense? See if you can figure the rest out from here. If not, let me know. Best of luck!

      Reply
      • Michel says

        March 13, 2018 at 1:32 pm

Thank you,
it works perfectly!
Just the chart windows are a bit slow to open…
But since PhantomJS no longer works, there's no other solution!
Thanks again

        Reply
  6. Joshua Thompson says

    July 2, 2021 at 3:32 am

    Hello, is there a way to adapt this code to Google Colab? Thank you very much!

    Reply
    • Grayson Stanton says

      July 6, 2021 at 8:25 pm

      See if this stackoverflow question helps at all: https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com

      Reply
