One of the most pressing and most often glossed-over truths in Data Science is that we almost never get clean, ready-to-use data. The .xlsx and .csv files arrive with issues from the source, and tracking them all down is the opposite of a treasure hunt: not fun! If you’ve ever had difficulty extracting data from websites, you’ve landed in the right place. There is a brighter side: some websites, such as Twitter (X) or LinkedIn, provide APIs for the kind of data you would like to use. By the end of this post, we should be able to scrape a basic website with Python!

You’ll come away knowing the basics of web scraping, with enough of a foundation to experiment further on websites of your choice! The three most important things we are interested in:

  1. DATA
  2. DATA
  3. You know it already..DATA!

Start with an Exciting Project!

Web scraping, simply put, is a technique for extracting data (structured or unstructured) from websites. It’s like sending a scavenger on a treasure hunt through the website, who diligently brings back the treasure (data). You can use this technique for all sorts of tasks, from academic projects to job searches and from market research to lead generation, and either save the data to your local computer or store it in a database table.

Please be mindful of a website’s terms and conditions before accessing its data.

It might seem easy and unnecessary at first glance, but it is a powerful tool. Web scraping can dramatically increase your speed, efficiency, and accuracy in data collection. With automation, you can gather large amounts of data in minutes.

What’s the deal!

Web scraping can also refresh the data on a regular schedule, keeping your treasure current and reliable. So, as our King Khan says: “Don’t underestimate the power of a common man (or woman) who reads to become faster and more efficient.” There are many tools and libraries out there, but the popular ones are BeautifulSoup (which we will be using), Scrapy, and Selenium. Any of them will get the job done.

Dive In

Web scraping works by imitating what a web browser does. A program called a web scraper sends a request to a web server, just as a browser does when you visit a site. Once the server responds with the page’s HTML code, the scraper parses it (breaks it down into a structure it can navigate) to locate the particular HTML tags, classes, or attributes that contain the data you want.

NEED CODE? NEED HTML?

The easiest way to automate scraping is to use code! A small Python script can do the job for you. Web pages are built out of HTML code, and the data is “wrapped” in tags that enable the browser to make sense of it. These tags allow you to extract the data you need.

Most web browsers provide built-in tools for inspecting elements. To open this tool, right-click the web page and select Inspect. Here you can view the HTML code and identify the tags containing the data you want to extract. This tool is like a blueprint of the webpage, showing how the page is structured.
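To make the idea of data “wrapped” in tags concrete, here is a minimal sketch that uses only Python’s built-in html.parser module. The HTML snippet and the TagCollector class are made up for illustration; a real page will be far larger and messier.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet, invented for illustration only.
SAMPLE_HTML = """
<div class="movie">
  <h3><a href="/title/1">Example Movie</a></h3>
  <span class="runtime">120 min</span>
</div>
"""


class TagCollector(HTMLParser):
    """Records each piece of text together with the tag that wraps it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags that are currently open
        self.found = []      # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only text between tags.
        if text and self.open_tags:
            self.found.append((self.open_tags[-1], text))


collector = TagCollector()
collector.feed(SAMPLE_HTML)
for tag, text in collector.found:
    print(f"<{tag}> wraps: {text}")
```

Each printed line pairs a tag with the text inside it, which is the same relationship you look for by eye in the Inspect panel.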

We will use what is probably your most-used website for movie ratings: IMDb.

Selectors: The Way to Filter!

Selectors are like filters for web scraping, helping you find specific information on the page. Here are the main types:

  • Element Selector: Selects all instances of a particular HTML element, like paragraphs (<p>), headings (<h1>, <h2>, etc.), or links (<a>). It’s like saying: “Get me all the paragraphs on this page!”
  • Class Selector: If you have worked with HTML/CSS, you know what classes are. If not, think of classes like labels. If an element has a label (class) attached to it, you can use the class selector to scoop up all elements with that label. For example, selecting all the elements with class “article.”
  • ID Selector: IDs are unique identifiers given to individual elements. So, if you’re after something special, like a unique product on an e-commerce site, the ID selector is your go-to. For example, to select the element with the id main-content, you would use the ID selector.
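The three selector types above can be tried out on a small page, as in the sketch below. It assumes the beautifulsoup4 package is installed; the HTML is a made-up example that reuses the “article” class and main-content id mentioned in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for illustration only.
SAMPLE_HTML = """
<div id="main-content">
  <h1>Latest posts</h1>
  <p class="article">First article</p>
  <p class="article">Second article</p>
  <p>Footer note</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Element selector: every <p> on the page.
paragraphs = [p.get_text() for p in soup.select("p")]

# Class selector: every element labelled with the class "article".
articles = [el.get_text() for el in soup.select(".article")]

# ID selector: the single element whose id is "main-content".
main = soup.select_one("#main-content")

print(paragraphs)
print(articles)
print(main.name)
```

Here select() returns a list of all matches, while select_one() returns only the first match (or None if nothing matches), which suits IDs since they should be unique.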

The Beauty of “BeautifulSoup”

Various web scraping tools are available, and BeautifulSoup is one of them. BeautifulSoup is a magic library for web scraping in Python. It’s super easy to use and works great for simple, small web scraping tasks. If you are new to the game of “Vice City,” this is the first cheat code you should use. Here, we will learn to use BeautifulSoup to scrape movie data (‘Title’, ‘Genre’, ‘Stars’, ‘Runtime’, ‘Rating’) from the IMDb website: https://www.imdb.com/list/ls566941243/

NOTE: Before scraping data, make sure the website allows it. Review the site’s “Terms of Use”, and use an official API where that is the permitted way to access the data.

Here we go!

  1. Start by defining the URL.
  2. Import the necessary libraries.
  3. Send a “get” request to the desired URL. This sends an HTTP request to the website’s server.
  4. The server responds with the page’s HTML code. Parse the HTML content of the response with BeautifulSoup and store the result in a variable. You can also save the content to a text file (imdb.txt here) for reference.
  5. Use the selectors described earlier to extract the movie data from the parsed HTML. Use your browser’s “Inspect” feature to look through the HTML and locate the data you want. If you cannot find the “Inspect” option, you can look at the imdb.txt file saved in the previous step instead.
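The first four steps above might look roughly like the sketch below. It is written under a few assumptions: the “get” request is sent with the requests library, a browser-like User-Agent header and a 10-second timeout are added as precautions, and the function names fetch_html, parse_html, and save_html are hypothetical helpers chosen for this example.

```python
import requests
from bs4 import BeautifulSoup

# Step 1: the list URL used in this guide.
URL = "https://www.imdb.com/list/ls566941243/"


def fetch_html(url):
    """Step 3: send a GET request and return the page's HTML as text."""
    # Assumption: some sites reject requests that lack a browser-like User-Agent.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.text


def parse_html(html):
    """Step 4: parse the HTML into a navigable BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


def save_html(soup, path="imdb.txt"):
    """Step 4 (optional): save a readable copy of the HTML for reference."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(soup.prettify())
```

Calling save_html(parse_html(fetch_html(URL))) would fetch the live page and write imdb.txt. Nothing is fetched until you make that call, so you can check the site’s terms first.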

Example to see how this works:

Movie titles are stored inside <div> elements with the class “lister-item-content.” Extract these titles by selecting the <div> element and finding the <a> tags within each <div>. These <a> tags are the clickable links that contain the movie titles.

Inspect the page to find the right tags, then use the “.find()” function to extract each piece of movie data from the parsed HTML and store it in a list.
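As a sketch of the title extraction described above, the snippet below runs on a small made-up HTML fragment that copies the “lister-item-content” structure rather than on the live page. The movie names and links are placeholders, and IMDb’s real markup may differ or change over time, so always confirm the class names with Inspect.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the parsed IMDb page; titles and links are placeholders.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="/title/tt0000001/">First Movie</a></h3>
</div>
<div class="lister-item-content">
  <h3><a href="/title/tt0000002/">Second Movie</a></h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

titles = []
for item in soup.find_all("div", class_="lister-item-content"):
    link = item.find("a")  # first <a> tag inside this <div>
    # find() returns None when nothing matches, so guard before reading the text.
    titles.append(link.get_text(strip=True) if link else None)

print(titles)
```

The same loop can collect the other fields (genre, stars, runtime, rating) into their own lists once you know which tags and classes hold them.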

Now, after all the hardships of the 5 steps above, all that is left is to create a pandas DataFrame to store the data in a structured format.
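A minimal sketch of this last step, assuming pandas is installed. The values in the lists are invented placeholders standing in for the lists built during extraction, and the variable names are hypothetical; the column names are the five fields this guide sets out to collect.

```python
import pandas as pd

# Hypothetical placeholder values; in the real workflow these lists
# are filled in by the extraction step.
titles = ["First Movie", "Second Movie"]
genres = ["Drama", "Comedy"]
stars = ["Actor A, Actor B", "Actor C, Actor D"]
runtimes = ["120 min", "95 min"]
ratings = [8.1, 7.4]

# Each list becomes one column; all lists must have the same length.
movies = pd.DataFrame(
    {
        "Title": titles,
        "Genre": genres,
        "Stars": stars,
        "Runtime": runtimes,
        "Rating": ratings,
    }
)

print(movies)
```

If a field is missing for some movie, append None for it during extraction so the lists stay the same length and the rows stay aligned.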

Aaand you have your first dataframe scraped from a web page! Congratulations!

Web scraping is a huge topic with multiple levels to it. Here are some resources that will help you grow and build up the skill!

WAIT!

Before going to the next blog of Startup Analytics, have a look at the final data frame obtained:

😊😊😊HAPPY SCRAPING!!! 😊😊😊

With this basic workflow, we were able to scrape a web page using simple Python libraries. While this seems simple, the information extracted can be crucial, and lots of websites are quite particular about their data and put restrictions and limits on data retrieval.
