Scraping webpage header text with Python

This content originally appeared on DEV Community and was authored by Eva

Photo by Javier Quesada on Unsplash

| task |

We have a list of URLs and need to get the header text from each page. We assume every header is wrapped in an h1 tag and the pages are built with plain HTML/CSS.

| tech |

Python* and Beautiful Soup (bs4)
*You need to install Python and the requests and bs4 packages (for example with pip install requests beautifulsoup4), and set up an environment to run Python in.

| solution |

First, create a .txt file listing all the URLs, one per line. You can store it anywhere on your machine, but I recommend keeping it inside the project folder; the script below expects the lists in a urls subfolder. After that, create another .txt file to store your output.

Create a Python file (.py) to write the script in.

Here is the logic:

1) open the URL and get the HTML content of the page
2) check that the request succeeded and that an h1 tag exists
3) if so, get the text inside the h1 tag and put it into our output.txt file
4) repeat the above steps for all the URLs
5) repeat the above for every URL list .txt file in the folder, if you have separate URL lists.

Full script:

import requests
from bs4 import BeautifulSoup

def print_h1(url: str):
    response = requests.get(url)  # request the page; returns a Response object
    soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
    # report a failure if the request did not succeed or the page has no h1
    if response.status_code != 200 or soup.h1 is None:
        print("FAILED: ", url)
        return
    print("\t", soup.h1.text)

# URL list files to process
files = ["urls/urllist1.txt", "urls/urllist2.txt"]

for file in files:
    with open(file) as f:
        urls = f.readlines()
        print(file)  # label the headers that follow with their list file
        for url in urls:
            print_h1(url.strip())

Here is the breakdown:
1) First, we need to import dependencies.
import requests
from bs4 import BeautifulSoup

2) Write a loop that goes through the URL list files and, within each file, through all of its URLs.
We use a custom function "print_h1" to extract the h1 from each page. url.strip() removes the whitespace around each URL, including the newline that readlines() keeps at the end of every line. We also added "print(file)" so we know which URL list file the headers belong to.
for file in files:
    with open(file) as f:
        urls = f.readlines()
        print(file)
        for url in urls:
            print_h1(url.strip())
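
One thing to watch for in this loop: a blank line in a URL list becomes an empty string after strip() and would still be passed to print_h1. A minimal sketch of one way to filter those out is below. read_urls is a hypothetical helper name, not part of the original script, and the demo writes to a throwaway temporary file rather than a real URL list.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty, whitespace-stripped lines of a URL list file."""
    urls = []
    with open(path) as f:
        for line in f:
            url = line.strip()  # drops the trailing newline and stray spaces
            if url:             # skip blank lines
                urls.append(url)
    return urls


# Demo with a throwaway file containing one blank line and some stray spaces.
with tempfile.TemporaryDirectory() as folder:
    list_path = os.path.join(folder, "urllist1.txt")
    with open(list_path, "w") as f:
        f.write("https://example.com\n\n  https://example.org  \n")
    demo_urls = read_urls(list_path)

print(demo_urls)  # ['https://example.com', 'https://example.org']
```

In the script, the inner loop could then iterate over read_urls(file) instead of calling readlines() and strip() itself.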

3) Create a list with the paths of all the URL list files. In the script it sits above the loop, because the loop reads it.
files = ["urls/urllist1.txt", "urls/urllist2.txt"]
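
If the folder holds many lists, typing each path by hand gets tedious. As a sketch, assuming every list is a .txt file sitting directly inside one folder, the files list could be built with pathlib instead. find_url_lists is a hypothetical helper name, and the demo uses a temporary folder as a stand-in for urls.

```python
import tempfile
from pathlib import Path


def find_url_lists(folder):
    """Return the paths of all .txt files in `folder`, sorted by name."""
    return sorted(str(p) for p in Path(folder).glob("*.txt"))


# Demo in a throwaway folder standing in for "urls".
with tempfile.TemporaryDirectory() as folder:
    for name in ("urllist2.txt", "urllist1.txt", "notes.md"):
        Path(folder, name).write_text("")
    demo_names = [Path(p).name for p in find_url_lists(folder)]

print(demo_names)  # ['urllist1.txt', 'urllist2.txt']
```

With this in place, the hand-written list could become files = find_url_lists("urls").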

At the end of #2, we have a function "print_h1" that needs to be defined. Now let's create the function.

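response = requests.get(url)
This sends an HTTP GET request to the URL we read from the .txt file and stores the server's response in an object.

soup = BeautifulSoup(response.text, 'html.parser')
This line parses the page's HTML (response.text) with Python's built-in HTML parser and creates a Beautiful Soup object we can search.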
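
To see what the parsing step gives us without touching the network, here is a small sketch that runs Beautiful Soup on hard-coded HTML strings. first_h1_text is a hypothetical helper name and the HTML snippets are made up for illustration.

```python
from bs4 import BeautifulSoup


def first_h1_text(html):
    """Return the text of the first h1 tag in `html`, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.h1 is None:  # soup.h1 is None when the page has no h1 tag
        return None
    return soup.h1.get_text().strip()


with_heading = "<html><body><h1> Hello, world </h1><h1>Second</h1></body></html>"
without_heading = "<html><body><p>No heading here</p></body></html>"

print(first_h1_text(with_heading))     # Hello, world
print(first_h1_text(without_heading))  # None
```

Note that soup.h1 only ever returns the first h1 on the page, so a second h1 is ignored.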

if response.status_code != 200 or soup.h1 is None:
    print("FAILED: ", url)
    return

This is our error-catching block. If the response status code is not 200 (OK), or there is no h1 tag on the page, we print "FAILED: " followed by the URL and stop processing that URL.
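
The status-code check only runs if requests.get returns at all. A malformed URL, a DNS failure, or a server that never answers raises an exception instead, which would stop the whole loop. A hedged sketch of one way to guard against that follows. fetch_html is a hypothetical name and the 10-second timeout is an arbitrary assumption, not something the original script sets.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException:
        return None
    return response.text


# A string without a scheme is rejected before any connection is attempted.
print(fetch_html("not-a-valid-url"))  # None
```

raise_for_status only rejects 4xx and 5xx responses, which is slightly looser than the != 200 check in the script; keep the explicit comparison if you want exactly the original behaviour.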

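print("\t", soup.h1.text)
We print the text of the first h1 tag on the page. "\t" adds a tab in front of the text.
As written, print sends its output to the terminal rather than to output.txt. To capture it in the file, either redirect the script's output when you run it, or open output.txt in the script and pass the file object to print through its file argument.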
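
To get the output into output.txt from inside the script, print accepts a file argument that sends the text to an open file instead of the terminal. A minimal sketch, assuming the header texts have already been collected into a list: write_lines is a hypothetical helper name, and the demo writes to a temporary file standing in for output.txt.

```python
import os
import tempfile


def write_lines(lines, out_path):
    """Write each line to out_path, tab-indented like the script's print call."""
    with open(out_path, "w", encoding="utf-8") as out:
        for line in lines:
            print("\t", line, file=out)  # same call as in print_h1, plus file=


# Demo with a throwaway file standing in for output.txt.
with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "output.txt")
    write_lines(["First heading", "Second heading"], out_path)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(repr(written))  # '\t First heading\n\t Second heading\n'
```

Applied to print_h1, this would mean opening output.txt once, passing the file object into the function, and adding file=out to both print calls.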

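Run the Python file with its output redirected to the output file (for example, python script.py > output.txt, where script.py is a placeholder for your file's name), and output.txt will contain all the headers and error messages.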

credit:
Christian Rang

reference:
https://oxylabs.io/blog/beautiful-soup-parsing-tutorial
https://www.crummy.com/software/BeautifulSoup/bs4/doc/


Eva | Sciencx (2024-10-15T01:18:18+00:00) Scraping webpage header text with Python. Retrieved from https://www.scien.cx/2024/10/15/scraping-webpage-header-text-with-python/
