Sunday, May 19, 2024
HomeProgrammingNet Scraping With Python — Get the High 100 Most-Streamed Songs on...

Net Scraping With Python — Get the High 100 Most-Streamed Songs on Spotify | by Vitor Xavier | Sep, 2022

One advantage of music: When it hits you, you are feeling no ache!

Whats up! First I’d prefer to recommend you place in your headphones and hearken to your favourite album whereas we undergo this challenge, as a result of music makes all the things higher.

In the present day we’re going to do a Net Scraping in a dataset from Wikipedia, the place we’ve got the High 100 most-streamed songs of 2022 on Spotify, in addition to their Rating, the variety of Streams (billions), the Artists, and the Publish Date of the Songs.

Let’s import the libraries we’re going to work:

from bs4 import BeautifulSoup
import requests

And now, connect with the web site and pull in knowledge:

URL = ''headers = {'Person-Agent': 'Mozilla/5.0 (Home windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36'}web page = requests.get(URL, headers=headers)

You may get your Person Agent on this web site.

Now that we’ve linked to the web site’s knowledge, let’s use Stunning Soup for getting the web page content material, and likewise to construction the HTML in a ‘fairly’ html design utilizing .prettify:

soup1 = BeautifulSoup(web page.content material, 'html.parser')soup2 = BeautifulSoup(soup1.prettify(),'html.parser')

We are able to see the distinction between soup1 and soup2:

soup2 is structured in a way more seen manner than soup1, as a result of the .prettify operate is used to facilitate the evaluation.

Let’s check out how the information is on Wikipedia:

So, the very first thing we wish is to get the Rank values. For this I’ve pressed F12 to open the DevTools, see the web page’s html construction and choose the Examine Operate:

After which examine the rating factor:

Trying carefully to the html operate:

It’s doable to see that the rating’s textual content is inside a th operate with attributes: fashion=”text-align:heart;

So, let’s create an inventory to get all of the values vary 1–100 from rating, utilizing .find_all to seek for the references talked about above:

rating = []
for i in vary (0,100):
rank = soup2.find_all('th', attrs = {'fashion': 'text-align:heart;'})[i].textual content
rank ='n','')
rank = rank.strip()

Discover that I even have carried out just a little cleansing, taking off strings ‘n utilizing .exchange and eradicating clean areas from starting and finish of the string utilizing .strip(), if this was not carried out, the outcome can be this:

So, after the cleansing we’ve got this rating record:

Good! Now that the rating record is finished, we’re going to take a look at the subsequent variable: songs.

I attempted utilizing the identical technique of rating within the songs however it didn’t work. So the tactic I discovered handiest was to examine the variety of intervals between the tune’s registers:

So, after seeing that the tune’s registers seem each 5 rows, and the registers are inside a td operate, I created the tune variable:

song_list = []
for i in vary(0,500,5) :
tune = soup2.find_all('td')[i].textual content
tune ='n','')
tune ='"','')
tune = tune.strip()
tune = ' '.be a part of(tune.break up())

Discover that if we don’t use this mix of be a part of() and break up() operate, the tune ‘I Took a Tablet in Ibiza (Seeb Remix)’ would have lots of duplicated clean areas, like this:

So, after the cleansing, we’ve got this song_list:

Now let’s get the variety of streams, in billions:

streams_list = []
for i in vary(1,500,5) :
streams = soup2.find_all('td')[i].textual content
streams ='n','')
streams = streams.strip()

Discover that the vary sample continues (each 5 rows), and inside a td operate. For streams_list we’ve got:

After streams, we have to get the artists:

artist_list = []
for i in vary(2,500,5):
artist = soup2.find_all('td')[i].textual content
artist ='n','')
artist = artist.strip()
artist = ' '.be a part of(artist.break up())

Right here we additionally had to make use of the mix of be a part of() and break up() features, in any other case the artist_list can be like this:

And, after all of the cleansing the end result for artist_list is:

And, to complete, let’s get the publish date of the songs:

date_list = []
for i in vary(3,500,5):
date = soup2.find_all('td')[i].textual content
date ='n', '')
date = date.strip()

The outcome for date_list is:

That’s good!! All we obtained to do now’s put these lists in a dictionary with the intention to rework the information right into a dataframe:

knowledge = {
'Rank' : rating,
'Music' : song_list,
'Streams' : streams_list,
'Artist' : artist_list,
'Date' : date_list

Now, let’s import pandas and switch the dictionary right into a dataframe:

import pandas as pd
df = pd.DataFrame(knowledge)

That’s good! now we’ve got our dataset:

We are able to additionally import this dataframe to a CSV, and outline a operate for the Net Scraping, in order that we are able to replace the information(CSV) from the web site each time we set the operate to work.

To import to CSV, all we obtained to do is:


And now we’re prepared to research the dataset and take some insights from it.

However let’s reserve the evaluation half for one more time sooner or later, because the important objective right here on this challenge is to indicate how we are able to do internet scraping.

It’s vital to know that if the information you need from an internet site isn’t displayed in simply 1 web page, you can too loop the URL with the intention to get all this knowledge (keep in mind all the time to get the patterns).



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments