
Image Scraping with Python

In this article, we will learn how to scrape images from any website using Python and the BeautifulSoup library.


What is Image Scraping?

Image scraping is a subset of web scraping. While web scraping deals with all forms of web data extraction, image scraping focuses on media content – images and, by extension, other media such as videos and audio.


We will use the requests library to fetch the web page containing the target images, then pass the response into BeautifulSoup to grab the address from each img tag's src attribute. Finally, we will write each image file into a folder to download the images.


How to Fetch Image URLs With Python's BeautifulSoup

Now go ahead and create a Python file in your project root folder. Ensure that you append the .py extension to the filename.


Open the Python file with any good code editor and use the following code to request a web page:

import requests
URL = "imagesiteURL" # Replace this with the website's URL
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
print(getURL.status_code)

If the above program outputs a 200 response code, the request was successful. If not, check that your network connection is stable and that you've supplied a valid URL.


Now use BeautifulSoup to read the content of the web page with the built-in html.parser:

from bs4 import BeautifulSoup
 
soup = BeautifulSoup(getURL.text, 'html.parser')
 
images = soup.find_all('img')
print(images)

This code creates a list of objects, each representing an image from the web page. However, what you need from this data is the value of each image's src attribute.


To extract the source from each img tag:

imageSources = []
 
for image in images:
    imageSources.append(image.get('src'))
 
print(imageSources)

Rerun your code, and the image addresses should now appear in a new list (imageSources). You've successfully extracted each image source from the target web page.


How to Save the Images With Python

First, create a download destination folder in your project root directory and name it images.
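You can also create the folder from the script itself, so it doesn't fail if the folder is missing (a small optional convenience, not required if you created the folder by hand):

```python
import os

# Create the download folder if it doesn't already exist.
os.makedirs('images', exist_ok=True)
```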

For Python to successfully download the images, their paths need to be full absolute URLs. In other words, they need to include the "http://" or "https://" prefix, plus the full domain of the website. If the web page references its images using relative URLs, you'll need to convert them into absolute URLs.


In the easy case, when the URL is absolute, initiating the download is just a matter of requesting each image from the earlier extracted sources:

for image in imageSources:
    webs = requests.get(image)
    with open('images/' + image.split('/')[-1], 'wb') as f:
        f.write(webs.content)

The expression image.split('/')[-1] splits the image link at every forward slash (/) and takes the last element, which is the image's file name (including any extension).


Bear in mind that, in rare cases, image filenames might clash, resulting in download overwrites. Feel free to explore solutions to this problem as an extension to this example.


Resolving URLs by hand can get quite complicated, with lots of edge cases to cover. Fortunately, there's a useful function in the requests.compat module called urljoin (a re-export of the standard library's urllib.parse.urljoin). It returns a full URL, given a base URL and a URL that may be relative, which lets you resolve the values you'll find in href and src attributes.
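A quick standalone illustration of how urljoin resolves the common cases (shown here via urllib.parse, which provides the same function that requests.compat re-exports):

```python
from urllib.parse import urljoin  # same function as requests.compat.urljoin

base = "https://example.com/gallery/page.html"

# A root-relative path replaces everything after the domain.
print(urljoin(base, "/img/logo.png"))     # https://example.com/img/logo.png

# A plain relative path is resolved against the page's directory.
print(urljoin(base, "thumbs/cat.jpg"))    # https://example.com/gallery/thumbs/cat.jpg

# An already-absolute URL is returned unchanged.
print(urljoin(base, "https://cdn.example.com/a.png"))
```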


The final code looks like this:

import requests
from bs4 import BeautifulSoup

URL = "imagesiteURL" # Replace this with the website's URL
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')
resolvedURLs = []

for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))

for image in resolvedURLs:
    webs = requests.get(image)
    with open('images/' + image.split('/')[-1], 'wb') as f:
        f.write(webs.content)


Resource: Makeuseof.com


The Tech Platform


