How to automate web scraping in Python? Maps, Finance, and more.
Automating web scraping using Python to extract data from the internet.
Web scraping is a useful technique for taking information from the internet and storing it offline, where it can then be analysed or visualised. The automation aspect means setting up a program to run through the scraping steps and produce an output without needing your attention. I will take you through the three examples below.
Google Maps
To set the scene: I wanted to move back to London after Covid, because that was where my job was based, but I didn't know exactly where to move. So I created a scorecard to score each London area on rental prices, commute time to work, distance to friends, greenery coverage and even crime rates.
I had a list of X London areas and needed to find the commute time between each one and my workplace. Doing this by hand would take a long time and be quite tedious and boring, so I automated the task with web scraping.
Code snippet:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome()
d.get(url)  # url is the Google Maps directions link built earlier
wait = WebDriverWait(d, 15)
# Wait for the redirect to the directions page
wait.until(EC.url_contains('https://www.google.com/maps/dir/' + url2[:3]))
redirect = d.current_url
# The cookie consent dialog sits inside an iframe, so switch into it first
iframe = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div.widget-consent-frame-container > iframe')))
d.switch_to.frame(iframe)
wait.until(EC.element_to_be_clickable((By.ID, "introAgreeButton"))).click()
d.switch_to.default_content()
This part of the code opens the Chrome WebDriver, navigates to Google Maps and then clicks the agree button on the cookie pop-up, which sits inside an iframe. The code then goes on to input the start and end destinations, as well as the departure time, to find the public transport journey time between the two points of interest.
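As a rough illustration of that last step, here is a minimal sketch of reading the suggested journey time from the directions page. The XPath is an assumption for illustration only; Google Maps markup changes often, so it would need checking against the live page.

# Hypothetical sketch: read the first suggested route from the directions panel.
# This XPath is an assumption; Google Maps markup changes frequently.
duration = wait.until(EC.visibility_of_element_located(
    (By.XPATH, '//div[contains(@id, "section-directions-trip-0")]')
)).text
print(duration)  # raw text of the first suggested route, including its duration
d.quit()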
Finance
I invest my money through a stocks and shares ISA, mostly in funds, but I also invest in individual publicly traded companies. My method for picking individual companies is to look for an exponential trend in operating profits. Of course, there are thousands of companies to analyse, and going through each one by hand would be slow and tedious.
Instead I wrote a script to look up each company's operating profits for the last five years. After the operating profits are scraped from the internet, they are stored offline in tables; the script then looks for an exponential trend and outputs only the companies that fulfil the criterion. From thousands of companies I get a shortlist of around a hundred or fewer, and I only need to go through this subset to decide which to invest in.
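One simple way to test for an exponential trend is to fit a line to the logarithm of the profits, since an exponential curve is linear in log space. The function below is a minimal sketch of that idea; the R² threshold and the example numbers are illustrative assumptions, not the exact criterion the script uses.

import numpy as np
import pandas as pd

def looks_exponential(profits: pd.Series, min_r2: float = 0.95) -> bool:
    # Hypothetical filter: an exponential trend is linear in log space,
    # so fit a line to log(profit) and check the quality of the fit.
    # The R^2 threshold is an illustrative assumption.
    p = profits.astype(float)
    if (p <= 0).any():  # log is undefined for non-positive profits
        return False
    years = np.arange(len(p))
    logs = np.log(p)
    slope, intercept = np.polyfit(years, logs, 1)
    residuals = logs - (slope * years + intercept)
    r2 = 1 - (residuals ** 2).sum() / ((logs - logs.mean()) ** 2).sum()
    return slope > 0 and r2 >= min_r2

# Example with five years of made-up operating profits
print(looks_exponential(pd.Series([10, 14, 20, 29, 41])))  # True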
Code snippet:
def stocklist(alpha):
    # (uses the same Selenium imports as the Google Maps snippet)
    url1 = r"https://www.hl.co.uk/shares/shares-search-results/"
    url2 = alpha
    url = url1 + url2
    d = webdriver.Chrome()
    d.get(url)
    # Wait until the accept-cookies button is clickable, then click it
    wait = WebDriverWait(d, 2)
    wait.until(EC.element_to_be_clickable((By.ID, "acceptCookieButton"))).click()
    # Wait for the first share result to load before collecting links
    wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="mainContent"]/div/div/div[2]/ul[2]/li[1]/a')))
    stocks = d.find_elements(By.XPATH, '//*[@href]')
The above code navigates to the Hargreaves Lansdown website and clicks the agree button on the cookie pop-up. As you can see, there are multiple ways to reference the HTML element a button is assigned to, so there are many ways to do this. The code then navigates to the financial information of a particular stock we are interested in, selects its operating profits and stores them in a pandas DataFrame.
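As a sketch of that last step: if the financials page renders its figures in an HTML table, pandas can parse them straight from the page source. Here, stock_url is a placeholder for one share's financial-statements page, and the table index is an assumption about the page layout; both would need checking against the real page.

import io
import pandas as pd

# Hypothetical sketch: parse the financial tables straight from the page.
# stock_url is a placeholder and the table index is an assumption.
d.get(stock_url)
tables = pd.read_html(io.StringIO(d.page_source))  # parses every <table> on the page
financials = tables[0]  # the table holding operating profit (assumed)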
Courses
A friend of mine, an entrepreneur, wanted a list of courses from a website in a single table. On the website the courses were split by category and then spread across several pages. Starting from the links to those pages, you can scrape each one and then collate the information into a single table.
This saves my friend from manually copying and pasting each course from the website. In all likelihood he would have outsourced the work, so it also saved him money. In any case, it saves resources.
Code snippet:
url1 = r"https://dentlearn.co.uk/listings-no-map/"
url2 = r"https://dentlearn.co.uk/listings/"
d = webdriver.Chrome()
d.get(url1)
wait = WebDriverWait(d, 5)
wait.until(EC.element_to_be_clickable((By.XPATH, '// [ @ id = "main"]/article/div/div/div/div/div[1]/article/a')))
courselinks = d.find_elements_by_xpath('//[@href]')
coursenames = d.find_elements_by_xpath('//h2[@class]')
The above code extracts the course names and course links for one of my clients, who is a qualified dentist.
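To finish the job, the scraped names and links need collating into the single table my friend asked for. Below is a minimal sketch assuming the two lists line up one-to-one, which would need checking against the real page structure.

import pandas as pd

# Hypothetical sketch: pair up names and links and write one table.
# Assumes coursenames and courselinks correspond one-to-one.
rows = [
    {"course": name.text, "link": link.get_attribute("href")}
    for name, link in zip(coursenames, courselinks)
]
pd.DataFrame(rows).to_csv("courses.csv", index=False)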
Conclusion
In all cases, some setup work is needed, for example finding the main URL for the website or identifying a pattern for navigating to the right webpage. The script can also take a while to run. However, the main benefit of automation is that the script runs by itself, so you can use the time to do something else in parallel.