A Guide to Storing Scraped Data with Python

Web scraping is an efficient way to gather large amounts of data for data science projects, and depending on your goal, you’ll need to store that scraped data one way or another.

In this article, we’ll cover some common, efficient ways to store scraped data, such as storing the data in a file or in a database, and also a quick bonus. But let’s start with some considerations on the structure and consistency of the data.


Structure and Consistency

Before you start scraping or even thinking about the best way to store your data, it’s essential to consider how the data will be structured and how you’ll make it consistent. The question here is: what will your data look like when the code stops running?

If you’re collecting data from several similar pages on the same website, then you’ll probably be able to store everything in the same table, but if you’re gathering data from very different sources, it may be easier to store them separately at first.

It’s also important to know the type of each value you’re storing. Is it text or a number? And what’s the plan if a field is missing on a particular page? If it’s a number, maybe it makes sense to use zero instead. If it’s text, you can use any string to represent an empty field, such as “-”, for example.

If that doesn’t make sense either, replace it with an empty string, “”. No matter what you do, do something, or your data will end up inconsistent: some rows will be longer than others and you won’t be able to trust the data you just scraped.
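As a small sketch of that idea, assuming BeautifulSoup and a hypothetical product page where the price element happens to be missing:

from bs4 import BeautifulSoup
html = '<div class="product"><h2>Some product</h2></div>'  # hypothetical page with no price element
soup = BeautifulSoup(html, 'html.parser')
price_tag = soup.find('span', class_='price')
price = price_tag.text if price_tag is not None else '0'  # consistent placeholder when the field is missing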

CSV Files

A Final Note on Consistency

To properly store data in a CSV file, you need to consider the separator. As the name says, a comma-separated values (CSV) file is one where the values are separated by commas. But if your data itself contains lots of commas, your file will become a huge mess.

You can then either remove every comma from your data or use a different separator. If the commas are in numbers like “1,000,000”, it might be easier to remove them since they won’t be useful. But if it’s text that you’re scraping, the commas might be relevant, and it’s better to change the separator to a semicolon, for instance.
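For instance, a number scraped as text with thousands separators can be cleaned up before it’s stored:

raw_value = '1,000,000'  # number scraped as text, commas included
value = int(raw_value.replace(',', ''))  # 1000000, with the commas removed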

DataFrame to CSV

The first way to create a CSV file from your scraped data is to use the pandas DataFrame.to_csv() method. This is pretty straightforward: it just exports a DataFrame as a CSV file.

However, in order to export the DataFrame, you first need to have your data as a DataFrame. A simple way to achieve this is to create a big list of lists containing all the data you scraped. Each list inside this list of lists would represent a single row of the DataFrame, containing data from a single page.

When you’re done gathering data, you can just transform the list to a DataFrame and then export it as a CSV file.

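Here’s a minimal sketch of that idea; the scrape_page() helper, the URLs, and the column names are placeholders for whatever your scraper actually collects:

import pandas as pd

def scrape_page(url):
    # Placeholder: extract and return the values from a single page
    return ['Some product', 19.99, 'in stock']

urls = ['https://example.com/page-1', 'https://example.com/page-2']  # hypothetical pages
data = [scrape_page(url) for url in urls]  # one inner list per page, i.e. one row
df = pd.DataFrame(data, columns=['name', 'price', 'availability'])
df.to_csv('products.csv', index=False, sep=';')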

Notice that when the DataFrame is created, you can directly pass the names of the columns, and these will also appear in the CSV file.

Also, it’s recommended to set index to False in the to_csv() method, as it usually makes no sense to have the index in the file. Finally, notice the sep parameter being used to separate the values with semicolons.

This approach has a big downside, though: you’ll only have a CSV file with any data in it after the code finishes. If something goes wrong and your code crashes, say, with only half of the pages scraped, you won’t have access to that data and will have to start all over again.

If you choose this approach, then, it becomes even more important to keep your code from crashing. So make sure to use try and except clauses where needed, to insert pauses in the code to avoid overloading the server, and to take advantage of a proxy provider, such as Infatica, which can provide a better infrastructure of IP addresses so you can be confident your code will keep running.
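As a minimal sketch of those precautions, assuming requests and a hypothetical list of URLs (the proxy setup is left out):

import time
import requests

urls = ['https://example.com/page-1', 'https://example.com/page-2']  # hypothetical pages
for url in urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        continue  # skip a page that fails instead of crashing the whole run
    # ... parse response.text and collect the data here ...
    time.sleep(1)  # pause between requests to avoid overloading the server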

Writing in the File

Another option is to write everything in the CSV file as you’re scraping the data. This approach reduces the damage in case your code stops running, since everything you have already scraped will be in the CSV file, and you won’t need to start from scratch.

It’s pretty simple to implement too. You can either use Python’s csv module for more complex operations or just use the built-in open() function.

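A minimal sketch of this approach, opening the file in append mode and using try and finally to make sure it gets closed; the row itself is a placeholder:

row = ['Some product', '19.99', 'in stock']  # hypothetical data scraped from one page
file = open('products.csv', 'a')
try:
    file.write(';'.join(row) + '\n')  # one semicolon-separated line per page
finally:
    file.close()  # runs even if the write raises an exception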

Two notes on this code:

  1. The second argument in the open function is the mode. There are several modes to open a file. Here are some of them:
  • r: open for reading (default);
  • a: open for writing, appending to the end of the file if it exists. If the file doesn’t exist, it will be created;
  • w: open for writing, truncating the file first. If the file doesn’t exist, it will be created.

For other options, check the documentation.

2. It’s always important to close the file. That’s why you need the try and finally clauses: if an error happens while writing, the file will still be closed.

Another way to perform this same operation is to use a with statement. It’s useful because it guarantees that the file will be closed no matter what, which makes your code much simpler.

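Here’s the same sketch using a with statement; the row is again a placeholder:

row = ['Some product', '19.99', 'in stock']  # hypothetical data scraped from one page
with open('products.csv', 'a') as file:
    file.write(';'.join(row) + '\n')  # the file is closed automatically when the block ends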

Using this approach, your CSV file won’t have the names of the columns unless you add them manually or use an if clause to write them to the file only before the first page you’re scraping. This is how it could look:

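A minimal sketch of that, assuming the scraping loop keeps a page counter so the header goes in only once:

rows = [['Some product', '19.99', 'in stock'], ['Other product', '9.99', 'sold out']]  # hypothetical scraped rows
for page_number, row in enumerate(rows):
    with open('products.csv', 'a') as file:
        if page_number == 0:
            file.write('name;price;availability\n')  # column names, written only before the first page
        file.write(';'.join(row) + '\n')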

Comparison

Although it may seem that the second approach consumes more time since the code is opening the file every time it scrapes a new page, that’s not the case.

I used each approach to scrape the same web page one hundred times. The scrapers were exactly the same, the only difference was how I stored the data. Here’s the average time elapsed with each approach.

# First approach
0.6287886786460877 seconds
# Second approach
0.6286333727836609 seconds

Writing as I scraped was even a tiny bit faster.

Storing in a Database

If you have a more complex system going on, then maybe you’ll want the data in a database. For this scenario, it’s easy to integrate your Python code with a SQL database.

This section assumes that you have some basic SQL knowledge and is not intended to be an SQL crash course. The idea is just to show how to quickly integrate a scraper with a database.

Python has some useful libraries that allow the user to connect with a database, such as PyMySQL, if you’re using MySQL, and sqlite3, if you’re using SQLite, for example.

The use of both these libraries is pretty similar, therefore we’ll just use sqlite3 as an example. For this, I’m assuming you already created the database and the table. If so, we then have some steps to follow:

  1. Create a connection with the database;
  2. Instantiate a Cursor object;
  3. Make sure you’re working with the correct database, especially if you have more than one.

import sqlite3
conn = sqlite3.connect('my_database.db')  # connecting to the file already selects the database
cur = conn.cursor()

With sqlite3, pointing connect() at the right database file takes care of step 3. If you’re using MySQL through PyMySQL instead, you can select the database with cur.execute('USE my_database'); note that the SQL command is passed as a string.

With this done, all we need to do is use the cursor to insert the scraped data into the database for every page we scrape and then use the connection to commit.

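Continuing from the connection above, a minimal sketch of that step, assuming a hypothetical products table and one row of scraped values:

row = ('Some product', 19.99, 'in stock')  # hypothetical data scraped from one page
cur.execute('INSERT INTO products (name, price, availability) VALUES (?, ?, ?)', row)
conn.commit()  # commit after each page so nothing is lost if the code crashes later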

Notice that this approach is similar to writing in a CSV file, except that we’re committing to a database. Also, it’s essential to close both the cursor and the connection, so you might want to use the try and finally clauses there to make sure that, no matter what happens, they’ll be closed.
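For instance, a sketch of that structure, assuming a hypothetical scraped_rows iterable with one row per page:

try:
    for row in scraped_rows:  # hypothetical iterable of scraped rows
        cur.execute('INSERT INTO products (name, price, availability) VALUES (?, ?, ?)', row)
        conn.commit()
finally:
    cur.close()
    conn.close()  # both get closed no matter what happens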

Bonus: E-mail

Sometimes you may not need to store the data, or at least not only store it. You may want to be notified about it immediately, maybe because it means something is wrong, maybe because it’s a great opportunity that you can’t miss.

So, as a bonus, let’s see how to send e-mails with Python. Using the smtplib module, it’s easy to write a function to send an email automatically.

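A minimal sketch of such a function; the sender address, the password, and the smtp.live.com host are assumptions you’d replace with your own account’s settings:

import smtplib
from email.message import EmailMessage

def send_email(subject, message, recipient):
    sender = 'your_address@hotmail.com'  # placeholder credentials
    password = 'your_password'
    msg = EmailMessage()
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = recipient
    msg.set_content(message)
    try:
        with smtplib.SMTP('smtp.live.com', 587) as server:  # Live host assumed; Gmail and others differ
            server.starttls()
            server.login(sender, password)
            server.send_message(msg)
    except smtplib.SMTPException as error:
        print(f'Could not send the e-mail: {error}')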

The function receives the subject of the email, the message, and the recipient, but you can change it to receive information about the sender as well. Also, this particular function is set to send emails through the Live server, but it’s possible to use Gmail and others. Check the docs.

Finally, the try and except clauses are important to protect your code from crashing if something goes wrong while creating the connection with the email server.

I hope this article is useful to you in some way. If you have a question, a suggestion, or just want to be in touch, feel free to contact me through Twitter, GitHub, or LinkedIn.




This content originally appeared on Level Up Coding - Medium and was authored by Otávio Simões Silveira
