A Beginner’s Guide to Web Scraping Using Python
This article serves as a beginner’s guide to web scraping using Python and looks at the different frameworks and methods you can use, outlined in simple terms.
Image by jemastock on Freepik
Sometimes it is necessary to gather large quantities of information from a website so it can be used for various purposes. This is called web scraping and can be achieved in several ways. One effective web scraping method is to use a programming language known as Python.
What is Web Scraping?
Web scraping is the extraction of data (primarily unstructured data) from a website, usually in large quantities. Once collected, this information is exported into a usable, structured format such as a spreadsheet or an Application Programming Interface (API).
This can be done manually for small datasets; however, it’s best to use automated systems to handle large volumes of data as it is quicker and less costly.
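As a simple illustration of the "structured format" idea, scraped records can be written to a CSV file with Python's built-in csv module; the product rows below are made up for the example:

```python
import csv

# Hypothetical scraped rows standing in for real extracted data.
rows = [("Phone A", "$199"), ("Phone B", "$299")]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Product Name", "Price"])  # header row
    writer.writerows(rows)                      # one row per product
```

The same unstructured-to-structured step appears at the end of virtually every scraping pipeline, whatever library collects the data.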
There is no one-size-fits-all approach to web scraping, as websites come in many different shapes and sizes. Each site can present obstacles that need to be navigated, such as CAPTCHA challenge-response tests, which is why web scrapers need to be very versatile.
What is the Purpose of Web Scraping?
Web scrapers can be used for any number of purposes. Some of the most popular uses are listed below:
- Comparison shopping websites
- Real estate listings
- Lead generation
- Displaying industry-specific statistics and insights
- Current stock prices, crypto prices, and other financial data
- Product data from sites like eBay and Amazon
- Sports stats for gambling websites and fantasy leagues
As with any web project, adhering to the law and regulations is very important. Not only can this avoid any legal action, but it can also ensure your system is better protected from hackers and cybercrime. Always make sure you follow good digital citizenship practices, such as protecting your privacy, changing your passwords regularly, and reporting any illegal activity you come across online.
What is Python, and Why is it used for Web Scraping?
Python is a general-purpose computer programming language that can be used for various tasks, from building websites and software to automating specific tasks and even machine learning. It is compatible with almost any type of program and wasn’t developed for any single objective.
Why Is Python a Good Option for Web Scraping?
There are five key reasons why you should choose Python for your web scraping project.
1. Python Has a Wide Selection of Libraries
Python has a large number of libraries that can be repurposed for your project (a library is a collection of prewritten code that anyone can include in their own programs). Popular Python libraries include pandas, Matplotlib, NumPy, and more.
These libraries can be used for many different functions and are perfect for data manipulation and web crawling projects.
2. Python Is Relatively Simple
Python is one of the simplest programming languages to get to grips with, as it doesn’t rely on symbols such as semicolons and curly brackets, making the code less convoluted.
3. Python Is Dynamic
Python is dynamically typed, meaning you do not need to declare data types for variables. Instead, you can assign values of any type whenever needed, making the process much quicker.
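A quick sketch of what dynamic typing looks like in practice: the same variable can hold values of different types over its lifetime.

```python
# No type declarations are needed; the type follows the value.
value = 42            # an integer
value = "forty-two"   # now a string
value = [4, 2]        # now a list
print(type(value).__name__)  # list
```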
4. Python Can Complete Complex Tasks With Only a Small Amount of Code
The goal of web scraping is to save time and collect data quickly, but this isn’t much good if writing the code is a lengthy process. Python, however, is streamlined and only requires a small amount of code to achieve the user’s goal.
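To illustrate, the short sketch below pulls every link out of an HTML snippet using only the standard library's html.parser module; the HTML fragment is invented for the example:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Gather the href attribute of every <a> tag encountered.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

html = '<p><a href="/phones">Phones</a> and <a href="/tablets">Tablets</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/phones', '/tablets']
```

A dozen lines of Python accomplish what would take considerably more boilerplate in many other languages.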
5. Python Syntax Can Be Learned Quickly
Python syntax (the rules determining how the code will be written) is very straightforward to learn compared to other programming languages. Each scope or block is easily distinguishable within the code, which makes it easy to follow, even for beginners.
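For example, each block below is delimited by indentation alone, so the structure is visible at a glance:

```python
def describe(number):
    # The if/else blocks are distinguished purely by indentation.
    if number % 2 == 0:
        return "even"
    else:
        return "odd"

print(describe(4))  # even
print(describe(7))  # odd
```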
A Beginner’s Guide to Web Scraping Using Python
In this section, we will discuss the frameworks you can consider to help build your web scraping program. This will be followed by an example method of how you can scrape an e-commerce website.
Python Frameworks for Web Scraping
When using Python for web scraping, there are three main frameworks to choose from: Beautiful Soup, Scrapy, and Selenium.
- Beautiful Soup - The Beautiful Soup framework is used for parsing XML and HTML files to extract data. When scraping a website, the requests library must be used to send website requests and receive responses. The HTML is then extracted and delivered to the Beautiful Soup object for parsing.
- Scrapy - Scrapy is one of the top web crawling and scraping frameworks that Python uses, effectively crawling websites and extracting structured data. It can be used for numerous tasks, including data mining, web monitoring, and automated testing. Scrapy is HTML focused and works by simply specifying a URL.
- Selenium - The Selenium framework scrapes websites that load dynamic content (Facebook, Twitter, and other social media sites, for example). It can also scrape websites that require a login or registration.
NB! As well as the frameworks above, you should also be aware of the data analysis and manipulation library Pandas. This library is used to extract the data and then save it in the user’s preferred format.
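As a minimal sketch of the Beautiful Soup workflow, the example below parses a hard-coded HTML fragment instead of a live response fetched with the requests library; the class names and products are invented for illustration:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page.
html = """
<div class="product"><span class="name">Phone A</span><span class="price">$199</span></div>
<div class="product"><span class="name">Phone B</span><span class="price">$299</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
names = [div.find("span", class_="name").text
         for div in soup.find_all("div", class_="product")]
prices = [div.find("span", class_="price").text
          for div in soup.find_all("div", class_="product")]
print(names, prices)
```

With a real site, the html string would instead come from `requests.get(url).text`; the parsing logic stays the same.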
Scraping an Online Shopping Website - An Example
For this method, you will need:
- Ubuntu Operating System
- Google Chrome Browser
- Python 3 with the Selenium, Pandas, and Beautiful Soup libraries installed.
Step One
The first step is to find the URL of the page/pages you want to scrape. In this example, we will scrape one of the largest e-commerce websites to extract the prices, names, and ratings of smartphones.
Step Two
Next, you should inspect your chosen page and view its page source. The data you are looking for will usually be within tags, so you must first determine where the information you want to scrape is within the page’s code.
In Google Chrome, right-click on any element within the web page and click inspect. You can then view the page elements. To find your data’s location, view the source code by right-clicking on an image or price and then selecting ‘View Page Source.’
Step Three
Search the page source for the data you want to extract. In this case, the rating, name, and price information will be nested in “div” tags.
Step Four
Now, it is time to develop the code using Python.
1. First, open the Ubuntu terminal and create a file with a .py extension using gedit. We will call the file ‘web-scrape;’ therefore, the command is:
gedit web-scrape.py
2. Use the command below to extract the required libraries:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
3. Ensure you have Python 3+ and Beautiful Soup installed
4. Set the path to Chrome driver to use the Chrome browser:
from selenium.webdriver.chrome.service import Service

# In Selenium 4+, the driver path is passed via a Service object.
driver = webdriver.Chrome(service=Service("/usr/lib/chromium-browser/chromedriver"))
5. Next, we need to open the web page and store the collected information as a list:
products = []  # store name of the product
prices = []    # store price of the product
ratings = []   # store rating of the product
driver.get("<insert URL>")
6. Now, you’re ready to extract the data. Enter the div tags where the data is nested:
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
for a in soup.find_all('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)
7. Run the code:
python web-scrape.py
8. Save the collected information in your preferred format; in this example, we will save it as a CSV file.
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
9. Then, run the program once more to generate the CSV file and complete the process.
As you can see, with the right tools installed and knowing the simple commands, websites can be easily scraped using Python. We hope you have found this guide useful and that you can apply some of the above techniques to your next web scraping project.
Nahla Davies is a software developer and tech writer. Before devoting her work full time to technical writing, she managed — among other intriguing things — to serve as a lead programmer at an Inc. 5,000 experiential branding organization whose clients include Samsung, Time Warner, Netflix, and Sony.