Tennis Racquet Data Mining

Methodology:

  1. Introduction
  2. Data
  3. Conclusion

1. Introduction

Every data science project begins with data collection. Data repositories are everywhere, but sometimes it’s more convenient to turn the internet itself into your database. Web scraping is typically a job for software engineers, but at a smaller company you might find yourself doing it. Let’s get to it!

2. Data

The data comes from tennisexpress.com, because I play tennis. It’s a small dataset with 213 instances and 15 features.

I always give a shout-out to sources I find useful, and this time is no exception: https://www.youtube.com/watch?v=MeBU-4Xs2RU. Feel free to watch the video if anything here is unclear.

Essentially, we are building three functions: request(), parse(), and output(). Their purposes are pretty self-explanatory: connect, read, and save.

We are going to use the “requests_html” library, available at https://pypi.org/project/requests-html/. One great benefit of this library is that it can render dynamically loaded content, which exists on basically all modern webpages. What is dynamically loaded content, you ask? It is information that the client (your browser) fills in by running scripts, rather than information that arrives ready-made in the HTML sent by the server. So what? Well, the short answer is that you might not find the data you are looking for. The long answer is that when I download a page and parse it with a library like “BeautifulSoup”, I only get what the server sent. Anything produced by JavaScript or AJAX (aka dynamically loaded content) is missing, because the server doesn’t run those scripts; the client does.

Don’t panic. First check whether the data you need is dynamically loaded by inspecting the element you want in your browser and looking for a script tag (JavaScript or AJAX) around it. If you see a script tag inside a bigger tag that also wraps the content you want, then I am sorry, your data is dynamically loaded. The good news is that most modern websites have some sort of dynamically loaded content, so the skill is in demand.
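The same check can be approximated outside the browser by comparing the raw HTML from the server with what you see on screen. The sketch below is only a rough heuristic, not part of requests_html: the names `looks_dynamic` and `count_scripts` and the sample markup are made up for illustration. It uses nothing but the standard library.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Counts <script> tags in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1


def looks_dynamic(raw_html, visible_text):
    """Return True if text seen in the browser is absent from the server's HTML."""
    return visible_text not in raw_html


def count_scripts(raw_html):
    """Return how many <script> tags the raw HTML contains."""
    counter = ScriptCounter()
    counter.feed(raw_html)
    return counter.scripts


# Hypothetical response body: the spec table is filled in later by JavaScript.
raw = """
<div id="specs"><script type="text/javascript">loadSpecs()</script></div>
"""

print(looks_dynamic(raw, "Head Size: 100 sq. in."))  # True: the text is not in the raw HTML
print(count_scripts(raw))                            # 1
```

If the text you want is missing from the raw HTML and the page carries scripts, the data is most likely filled in by the client, and you will need to render the page before scraping it.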

There are many ways to resolve this, such as using Selenium, or even something as crazy as writing a new class to mimic a client. However, these are often too complicated because they were not made for this purpose. Luckily, we have the “requests_html” library. Let’s start with the first of our three functions: request().

The main purpose of the request() function is to render dynamically loaded content and return the target data. But how? Ignore ‘productList’ for now; I will explain it in the parse() function. To render the page, we first use the get() method to download it. Notice that the URL is passed in as a parameter, which makes the code reusable: just pass in a different URL. Below the get() call is our main character, the html.render() method, which acts as a client and renders all the dynamically loaded content. After that, I used an XPath (in the browser inspector, right-click the element, select Copy, then Copy XPath) to locate the section I needed. I think this is the easiest way, though it’s certainly not the only one. Again, check the library’s website for other options.
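Here is a minimal sketch of what a request() function along these lines could look like, assuming the requests_html calls get(), html.render(), and html.xpath(). The XPath string and the example URL are placeholders, not the real values for tennisexpress.com. Passing the session in as a parameter is a choice made for this sketch so the function can be tried with a stand-in object.

```python
# Placeholder XPath: copy the real one from your browser's inspector.
PRODUCT_SECTION_XPATH = '//*[@id="product-list"]'


def request(url, session, xpath=PRODUCT_SECTION_XPATH):
    """Fetch an index page, render its JavaScript, and return the product section."""
    r = session.get(url)    # download the page as the server sends it
    r.html.render(sleep=1)  # run the page's JavaScript in headless Chromium
    # first=True returns a single element (or None) instead of a list
    return r.html.xpath(xpath, first=True)


# Typical use (needs the requests_html package and network access;
# the URL is a made-up example):
#   from requests_html import HTMLSession
#   products = request("https://example.com/racquets?page=1", HTMLSession())
```

The first call to render() downloads a copy of Chromium, so expect it to be slow once.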

Before moving on, I need to quickly explain how the website is structured and what my plan is, so that everything makes sense. The data I need is the specs of each racquet, and those live on each racquet’s own page rather than on the racquet index page. Therefore, I first need to grab the links to every racquet from the index page, which is what we did in the request() function. Then I pass that index section to the parse() function, which visits the racquet pages one at a time and collects the specs from each. That brings us to the next function: parse().

parse()

In the parse() function, I pass in the index page returned by request(), which contains the racquet links along with other information of no use. I use the “absolute_links” property that comes with the library to pull out just the links, and run them through a for-loop. For each racquet page link, we use the get() method to download the page, just like we did in request(). From this point, you can use various techniques to find the data you are looking for, all of which are documented on the library’s main page. Since some of the data I am looking for does not exist for every product, I wrapped the lookups in “try/except” handling so the program doesn’t crash. Next, you create a product dictionary to store the data you just scraped and append it to the “productList” list that we created in the beginning. As for the bar, it just shows me the progress of the scraping so I know the program hasn’t crashed. It’s not essential but neat to have.

There is one last thing to pay attention to before we are done with this function: what if the racquet page is rendered with JavaScript? You didn’t see the render() method used here because the data I needed is not dynamically loaded. However, if I needed ratings, for example, which are dynamically loaded, I would have to render each racquet page individually. That can take forever. Luckily there are “async” functions, but that’s a story for another time. This concludes the parse() function. Let’s go to the last function, output().
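A rough sketch of a parse() function in this spirit follows. It assumes the requests_html `absolute_links` property and `find()` method. The CSS selectors and the two field names are hypothetical stand-ins for the real specs, and the progress bar is left out to keep the example short. The session and the list are passed in as arguments here, which is a choice made for this sketch.

```python
def parse(products, session, product_list):
    """Visit every racquet link in the index section and collect its specs."""
    for link in products.absolute_links:
        r = session.get(link)  # no render() here: these fields are in the raw HTML

        # find(..., first=True) returns None when nothing matches, so .text
        # raises AttributeError for products that lack the field.
        try:
            name = r.html.find("h1", first=True).text
        except AttributeError:
            name = None
        try:
            # ".spec-head-size" is a made-up selector for illustration
            head_size = r.html.find(".spec-head-size", first=True).text
        except AttributeError:
            head_size = None

        product = {"name": name, "head_size": head_size, "link": link}
        product_list.append(product)
    return product_list
```

Catching only AttributeError, rather than every exception, keeps a missing field from crashing the run while still surfacing real problems such as a failed connection.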

output()

The output() function is very straightforward. Don’t forget to import the pandas library, then use the DataFrame() constructor to turn “productList” into a data frame. After that, call to_csv() to write the data to a CSV file in the desired location.
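As a sketch, output() can be as short as the following. The file name `racquets.csv` and the `index=False` argument are assumptions made for this example, as is importing pandas inside the function.

```python
def output(product_list, path="racquets.csv"):
    """Turn the list of product dictionaries into a CSV file."""
    # Imported here so the sketch only needs pandas when it is actually called.
    import pandas as pd

    df = pd.DataFrame(product_list)  # one row per product, one column per key
    df.to_csv(path, index=False)     # index=False leaves out the row-number column
    return df
```

Products that are missing a field simply get an empty cell in that column, which is why storing None in the dictionary is enough.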

The last step is to put all the functions together. Although this step is not directly related to data mining, iterating through pages is an essential task because many websites split their product listings across pages. Because the page number is part of the link, I set x equal to 1 and run a while loop that stops after 4, the total number of pages. Inside the loop, I pass each page link to request() to get the index section containing the racquet links, then pass that on to parse() to collect the specs of each racquet and append them to “productList”. After the loop is done, I call output() to turn “productList” into a data frame via pandas and save it as a CSV file. Now run the script in the terminal and open the location where you saved the file; you should see the data you just mined, nice and neat.
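The loop described above can be sketched like this. The name `scrape_all` and the URL pattern are hypothetical. The three functions are passed in as arguments, each assumed to be already tied to whatever session and list it needs, so the loop itself stays independent of how they are written.

```python
def scrape_all(url_template, total_pages, request, parse, output):
    """Walk the paginated index, scrape every page, then save once at the end."""
    x = 1
    while x <= total_pages:
        index_section = request(url_template.format(page=x))  # links on page x
        parse(index_section)  # visit each racquet page and store its specs
        x += 1
    output()  # write everything collected so far to disk


# Hypothetical call for a four-page listing:
#   scrape_all("https://example.com/racquets?page={page}", 4, request, parse, output)
```

Calling output() once after the loop, rather than once per page, means the CSV file is written a single time with all the rows.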

3. Conclusion

Understanding how to gather dynamically loaded content is critical in today’s data mining landscape because most websites contain at least some of it. And if you ask me, I believe it will only get more prevalent because it makes the code neater and more logical. If this sounds like a job for a web app developer, you are right: understanding web app architecture is not part of a typical data scientist job description. However, I do think data scientist is a role that requires a little knowledge of everything. And if you want to be a good data scientist, having a firm grasp of the beginning and the end of the data science project cycle is critical.

Balance is the key Personal Website: https://sites.google.com/view/luoyuan/home

袁晗 | Luo, Yuan Han
