Python

Webscraping On Xpath

Alternative To Google Sheet IMPORTXML()

Kean Teng Blog

--

Images from Unsplash

Google Sheet importxml() function is a wonderful tool to allow you to scrape for website information by just identifying the corresponding Xpath. But it has a limit, when you have hundreds or thousands of data items to be gathered, your Sheet's cell will get stuck at endless loading. As you can see from the image below, for about 250 rows of data, after waiting for more than 30 minutes, the loading cell does not refresh.

# sheet function
=IMPORTXML("URL", 'XPATH')
Google Sheet Data Extraction Using IMPORTXML()

Luckily, the Python selenium package offers a powerful solution to this issue. With just a few lines of code, you can scrape hundreds or even thousands of rows of data and compile them into a Python data frame for you to perform further analysis, with speed and efficiency.

1. What is Selenium

Selenium is a project that provides a range of tools and libraries to support the automation of web browsers. It supports extensions to emulate user interaction with browsers, a distribution of servers for scaling browser allocation and allows you to write interchangeable code for all major web browsers.

You can install the package by writing this code on Windows Terminal:

py -m pip install selenium

2. Finding Xpath

Previously, I have shown the Random Malaysia Address on Google Sheet, where IMPORTXML() is used to scrape for random addresses. The website used is here.

Site For Random Address Generation

This site allows user to generate 20 random addresses at once and the addresses are shown in several blue boxes as shown in the above images. In order to locate the addresses, or in other words, all the information inside each blue box, we would need to locate their Xpath. The Xpath can be found by opening the developer mode on your web browser by pressing Ctrl + Shift + I and search for the required element. Once you have found the required element, right click and select Copy Full Xpath. The Xpath is as follows:

/html/body/section[2]/div/div[2]/div[2]/ul/li[1]

As mentioned earlier, there are 20 random addresses generated. Thus, notice that the * in /html/body/section[2]/div/div[2]/div[2]/ul/li[*], it will be numbered from 1 to 20, representing 20 different boxes containing random addresses.

Finding Site Element Xpath

3. Web Scraping in Action

Everything is easy after knowing the Xpath of the web page element you want. We will start by importing the required package:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd
import random

Then, we need to write a function to extract the random address based on the Xpath we copied. We will write a get_address function for that.

def get_address(url):
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)

container1 = driver.find_element(By.XPATH, "/html/body/section[2]/div/div[2]/div[2]/ul/li[1]")

driver.close()

return container1.text

But the output will be in this format:

//output:
Street: G Kenyalang Shopping Centre 6D Jln Datuk Sim Kheng Hong Kenyalang Park Kuching Ma
City: Kuching
State/province/area: Sarawak
Phone number 08233-1761
Zip code 93300
Country calling code +60
Country Malaysia

//What we want:
Street: G Kenyalang Shopping Centre 6D Jln Datuk Sim Kheng Hong Kenyalang Park Kuching Ma; City: Kuching; State/province/area: Sarawak; Phone number 08233-1761; Zip code 93300; Country calling code +60; Country Malaysia

The output makes us hard to put in a data frame. Therefore, we need an additional function called textline to convert the output to one string as shown above:

def get_address(url):
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)

container1 = driver.find_element(By.XPATH, "/html/body/section[2]/div/div[2]/div[2]/ul/li[1]")
container2 = driver.find_element(By.XPATH, "/html/body/section[2]/div/div[2]/div[2]/ul/li[2]")

def textLine(poem):
lst=list(poem)
string=''
for i in lst:
string+=i
# print(string)
lst1=string.split("\n")
str1=""

for i in lst1:
str1+=i+" ;"
str2=str1[:-2]
return str2

location1 = textLine(container1.text)
location2 = textLine(container2.text)

driver.close()

temp = [location1, location2]

return temp

Now, we can put the output into a Python data frame:

url = https://www.bestrandoms.com/random-address-in-my?quantity=20
df = pd.DataFrame({'address':get_address(url)})

And we are done! You can also write a loop if you need more random locations as each function can return at most 20 random locations.

4. Scrape More Than 1000 Locations

Check out my GitHub for the full Jupyter notebook to scrape more than 1000 random locations with a .csv output at the end.

Originally published at https://keanteng.github.io on August 13, 2023.

--

--

Kean Teng Blog

My personal blog page to publish my writing on a variety of topics and ideas.