Use Jupyter to Restart the Script from the Point Where the Scraper Terminated
Photo by Aron Visuals on Unsplash
Have you ever had a situation where your scraper ran into an error (a server error, say, or a scraper block) and had to start over from scratch?
You’re in luck! You can use Jupyter to restart the script from the point where the scraper terminated. I won’t go deep into the internals, but let me give you a brief explanation of how to use this workaround.
This solution largely depends on Jupyter’s cell-based execution, which lets us run blocks of code independently of one another while they share the same kernel state.
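To see why that matters, here is a minimal stand-in sketch (no real browser; a plain dictionary plays the role of the live session) showing how state set up in one cell survives a crash in a later cell, so a third cell can pick up where the loop died:

```python
# Illustration of the Jupyter workaround: variables (like a live
# browser session) survive between cell runs, so a crashed loop can
# be resumed in a new cell. `session_state` is a hypothetical stand-in
# for the real Selenium session.

# --- Cell 1: set up state that must stay alive ---
session_state = {"page": 1, "rows": []}

# --- Cell 2: first scraping batch; pretend the server dies at page 4 ---
try:
    for page in range(session_state["page"], 11):
        if page == 4:
            raise ConnectionError("server is down")
        session_state["rows"].append(f"data from page {page}")
        session_state["page"] = page + 1
except ConnectionError:
    pass  # this cell dies, but session_state survives in the kernel

# --- Cell 3: resume from where it stopped ---
for page in range(session_state["page"], 11):
    session_state["rows"].append(f"data from page {page}")
    session_state["page"] = page + 1

print(len(session_state["rows"]))  # prints 10: all 10 pages scraped
```

In a real notebook, each `--- Cell ---` section would be its own cell, and cell 3 is the one you write after the crash.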
Start with the usual scraping libraries:
Selenium, pandas, Beautiful Soup, and your good old friend time
are all the libraries you need for this project.
In this explanation, I won’t be diving deeply into my source code, but I will show what my first batch of scraping code would look like.
In brief, my code does the following:
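Since the original screenshots aren’t reproduced here, this is a hedged sketch of what such a first batch might look like, not the author’s actual code. Names like `parse_page` and `set_no` are hypothetical, and in the real script `html` would come from `driver.page_source` after Selenium loads each page:

```python
from bs4 import BeautifulSoup
import pandas as pd
import time

def parse_page(html):
    """Pull one record per item out of a page's HTML (hypothetical selector)."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".item")]

rows = []
set_no = 1                      # which "set" of pages this batch covers
for page in range(1, 3):        # the real script loops over many more pages
    # stand-in for driver.get(...) followed by driver.page_source:
    html = f'<div class="item">record {page}</div>'
    rows.extend(parse_page(html))
    time.sleep(0)               # real script: sleep longer to be polite

# export this set, so progress isn't lost if a later page fails
pd.DataFrame({"record": rows}).to_csv(f"set_{set_no}.csv", index=False)
```

Exporting per set is what makes the resume trick safe: each batch's results are already on disk before the next one starts.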
Here we… what? The servers are down….
People who scrape a lot hate this error message
But, as I said in the first part of this article, you can create another batch of code that continues from where the first one stopped.
Before you continue:
1) Do not close the Chrome browser connected to your script. If you do, all your session/progress will be lost.
2) Check the last page number you scraped and adjust your next batch of code accordingly.
So run the program again… with one small change: update the set number in the export part.
I edited the set number before I ran this block of code
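As a hedged sketch of that continuation cell (again, not the author’s actual code): the kernel, and the Chrome session it holds, are still alive, so only the resume point and the set number in the export filename need to change. `last_scraped_page`, `set_no`, and the page range are hypothetical names and values:

```python
import pandas as pd

last_scraped_page = 37   # check this manually before rerunning (tip 2 above)
set_no = 2               # bump the set number so exports don't collide

rows = []
for page in range(last_scraped_page + 1, 40):   # pick up where batch 1 died
    # real script: driver.get(...) on the still-open browser,
    # then read driver.page_source; a stand-in string here
    rows.append(f"<p>record {page}</p>")

# export under the new set number
pd.DataFrame({"record": rows}).to_csv(f"set_{set_no}.csv", index=False)
```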
and ta-da! We move forward to scraping the rest of the set
the rest of the code is running 🙂
But do beware of repeated errors if the website’s server is unstable, like my target website’s was. Repeat the process above whenever your program’s connection to the website gets cut off.
I wasn’t able to figure out how to automate this part, so if any of you awesome readers know the answer, please comment below.
Anyway, that’s it, folks~ the Jupyter scraping time machine.