This will require you some knowledge of HTML and CSS so that you can easily play around with Tags and Elements.
We are finally grabbing elements from the HTML text we beautified from the last time! This will be the finale! Don’t miss out on the core Web Scraping!
Congratulations on completing your Part 2 series of this journey! What? You didn’t go through Part 2 yet? Worry not, Here’s the link, Quickly go through it, Perform all the instructions, it will not take more than 15 mins:
You weren’t there in part 1 too? I got you, check this out:
Also we are following along the GitHub Repository which I created while learning Web Scraping. Here are the notebooks for you:
Wow, that was a lot of promotion! I am glad you didn’t just fume over me :’)
Now this will require you some knowledge of HTML and CSS so that you can easily play around with Tags and Elements.
This also doesn’t mean that you cannot do web scraping if you are a complete newbie in HTML and CSS.
This guy just made the job easier for you! Watch some parts of this video and you will get the essence! We will require no more than preliminary knowledge about it…
Grabbing elements off of HTML text
So coming back, We wanted to grab elements off of beautified soup which we currently have from our last article.
Syntax for grabbing elements:
Now if we wanted to grab paragraph elements from the HTML text. Then upon running the First command, it will generate an output which will contain list of all paragraph elements.
Second command will give out the first element of the list which is at the index 0.
Third one will simply give the text inside the paragraph tag selected.
soup.select('p') soup.select('p') soup.select('p').get_text()
Grabbing this Table of Contents’ sub points from: https://en.wikipedia.org/wiki/Jonas_Salk
Wikipedia is open sourced, So we can scrape out the data without any difficulties!
You will have to find out the class which will give out the contents, In my case ‘.toclevel-2’ class contained these sub headings.
So this is the little hidden trick which you will have to solve every time you want to scrape something specific out!
Also, remember classes can change accordingly from IP Address to IP Address. So, You will most probably get to see some another class containing these sub headings.
If you have performed everything till now, and it all worked fine, Then Congratulations! You can now successfully scrape data out from the websites over the internet from your Python Script!
I have not gone explaining too much so that not to overwhelm you and for the sake of simplicity! I am expecting that you know the art of googling! :)
By the Way,
You would have noticed master or main written like this in my terminal:
This means my web scraping folder is a git initialized repository. Don’t know what git is? Let me know if you would want an easy explanation about it!
I promise I will explain it in one article! :)
Get to know all about Web Development in this short article:
~Follow Harsh Gaurav for more Technical and Interestingly random content :)