A few months ago I stumbled upon an unusual task in my everyday work. It consisted of analyzing a considerable amount of data, on the order of terabytes, looking for patterns in it. This information came from so-called CDRs (Call Detail Records), and I didn’t know how to approach it. On the one hand, I have a good foundation in statistics and mathematics, and I have programmed in PHP for most of my years as a programmer. Some time ago I also started looking into Big Data and data science, but above all into Machine Learning.
My main programming language has always been PHP. I squeezed it to its last drop, sometimes even asking for things it was not made for. Examples of this include making it run as a daemon and handling signals from the OS so we could stop it gracefully, avoiding the kill -9 command, or worse.
Then I started to notice that the current books on the topic presented their hands-on examples in Python and R, but inside a kind of interactive IDE. Being web-based, it lets you share what you are doing with others in real time. It blew my mind. It was very simple; I say IDE, though I could also use the word sandbox. Either way, it is ideal for quick views, proofs of concept, and showing someone relevant and timely information in a few minutes.
That is the Jupyter Notebook side. But what really makes the magic happen is the ecosystem of Python modules for machine learning and data science, led by pandas and NumPy.
A quick example
Since the data in the CDRs is confidential, I cannot show you a real customer example. Instead, it occurred to me to illustrate with a personal project I started a few weeks ago. It is based on a paper I found while looking for information on predictive models. It tries to predict the outcome of professional tennis matches based on the players’ characteristics, and it also considers match features (previous results) to say who will win the game with a given probability.
The author is very clear about how he deals with the different approaches (multilayer neural networks, logistic regression, support vector machines), but he does not present any code for their implementation. So I liked the idea of seeing whether I could reproduce his results. I’m still working on it, but I have enough to show you the first steps.
The first thing is to install Anaconda and Jupyter. I recommend using this guide (the GitHub repo with the hands-on material for O’Reilly’s book Hands-On Machine Learning with Scikit-Learn and TensorFlow). Then, with a single command, start the Jupyter Notebook server to see something like this (I started the server inside the hands-on repo, but you can do it from any directory).
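As a rough sketch of that setup, assuming Anaconda is already installed (the environment name and package list here are just illustrative; the guide linked above covers the details):

```shell
# Create and activate a conda environment with the usual data-science stack
# (the environment name "handson" is an arbitrary example):
conda create -n handson python=3 numpy pandas matplotlib jupyter
conda activate handson

# A single command then starts the notebook server in the current directory:
jupyter notebook
```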
We create a new notebook for this example. It looks like this.
In my case, I used a database of tennis matches downloaded from this site. It consists of CSV files, with one row per match and enough features to cover almost the entire scope of the paper. I concatenated the 17 files, which hold the data from 2000 to 2017 for all the official tournaments of the ATP men’s tour, a total of about 55k matches. Then I imported them into a MySQL table, which will be my data source. (Pandas provides a complete API for reading the data into memory straight from the original CSV files, so importing them into a MySQL database is not mandatory.)
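The concatenation step might look something like the sketch below. The per-season file names are an assumption on my part, and two tiny stand-in files are written first so the snippet runs anywhere; with the real download you would skip that part.

```python
import glob
import pandas as pd

# Stand-in data so this sketch is self-contained; in the real project the
# per-season files (names assumed here, e.g. atp_matches_2000.csv) come
# from the download site and share the same columns.
pd.DataFrame({"winner_name": ["Federer"], "minutes": [95]}).to_csv(
    "atp_matches_2000.csv", index=False)
pd.DataFrame({"winner_name": ["Nadal"], "minutes": [210]}).to_csv(
    "atp_matches_2001.csv", index=False)

# Concatenate every per-season file into a single DataFrame.
files = sorted(glob.glob("atp_matches_*.csv"))
matches = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(matches.shape)
```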
The first step is to load the data into memory as a pandas DataFrame.
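A minimal sketch of that load: in the real project this reads the MySQL table, but here an in-memory SQLite database (seeded with two toy rows) stands in so the snippet is self-contained; the table and column names are assumptions.

```python
import sqlite3
import pandas as pd

# In the real setup this would be a MySQL connection, e.g. (hypothetical):
#   engine = create_engine("mysql+pymysql://user:password@localhost/tennis")
# SQLite stands in here so the sketch runs without a database server.
conn = sqlite3.connect(":memory:")

# Seed a tiny stand-in for the matches table.
pd.DataFrame({
    "winner_name": ["Federer", "Nadal"],
    "minutes": [95, 210],
}).to_sql("atp_matches", conn, index=False)

# Load the whole table into a pandas DataFrame.
df = pd.read_sql("SELECT * FROM atp_matches", conn)
print(df.shape)
```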
Note that the operation takes a bit over 6 seconds. After that, I have the 55k matches in memory in a pandas DataFrame.
We call head() to see the first records of the set.
When calling describe().
It gives us information such as the count of non-null values, the mean, the standard deviation, the 25th, 50th, and 75th percentiles, and the min and max values. Powerful, right? And this is for all the features at once. Since everything is in memory, this output takes only about 200 ms. With that, we already have a good initial picture of our data and the first insights.
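On a toy stand-in for the matches DataFrame (a single made-up minutes column), these two calls look like this:

```python
import pandas as pd

# Toy stand-in for the real 55k-row matches DataFrame.
df = pd.DataFrame({"minutes": [95, 210, 130, 88, 145]})

print(df.head(3))      # first records of the set

# count, mean, std, min, 25/50/75 percentiles, and max for every
# numeric column at once.
stats = df.describe()
print(stats)
```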
Now we can invoke the hist() method of pandas DataFrames. This is the one that surprised me the most: it gives us a histogram of each of the features in seconds.
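A sketch of that call on synthetic data (two random stand-in columns; in a notebook you would typically run %matplotlib inline first instead of forcing a backend):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic stand-ins for two of the match features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "minutes": rng.normal(110, 25, 500),
    "w_svpt": rng.normal(80, 15, 500),
})

# One histogram per numeric column, laid out in a grid.
axes = df.hist(bins=50, figsize=(12, 8))
```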
And if we want to see the histogram of a single feature, we can invoke hist() like this.
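For a single column, the same method is available on the Series (again with synthetic stand-in data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use %matplotlib inline
import numpy as np
import pandas as pd

# Synthetic stand-in for the minutes feature.
rng = np.random.default_rng(1)
df = pd.DataFrame({"minutes": rng.normal(110, 25, 500)})

# Histogram of just one feature.
ax = df["minutes"].hist(bins=50)
```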
We can see an almost normal distribution, but with a peak at 0. This is because some matches have no information for this feature, and they will have to be cleaned up later. Of course, there are ways to do this that take only one line, although I won’t go into them here.
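As an aside, one such one-liner could look like the sketch below: treat the 0 placeholders as missing values and drop those rows (toy data; whether dropping or imputing is the right call depends on the model).

```python
import numpy as np
import pandas as pd

# Toy data: 0 stands for "duration unknown".
df = pd.DataFrame({"minutes": [95, 0, 130, 0, 145]})

# One line: turn the 0 placeholders into NaN, then drop those rows.
cleaned = df.replace({"minutes": {0: np.nan}}).dropna(subset=["minutes"])
print(len(cleaned))
```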
Finally, we can analyze the correlation between the features, that is, how they relate to each other, in a few lines, using the correlation matrix like this.
As an example, I took the feature minutes (total duration of the match) and looked for its correlation with the other features. The highest correlations are with w_svpt (total points served by the winner), at 0.891721, and l_svpt (total points served by the loser), at 0.888824. This is quite obvious: the longer the match, the greater the number of points played, and therefore served, by both players. An obvious correlation, but it serves as a sanity check that we are on the right track.
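A minimal sketch of that lookup, on synthetic data shaped like the real features (w_svpt and l_svpt are random stand-ins built to grow with match length, so the exact numbers will differ from the ones above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: points served grow with match length, plus noise.
rng = np.random.default_rng(2)
minutes = rng.normal(110, 25, 500)
df = pd.DataFrame({
    "minutes": minutes,
    "w_svpt": minutes * 0.7 + rng.normal(0, 8, 500),
    "l_svpt": minutes * 0.7 + rng.normal(0, 8, 500),
})

# Full correlation matrix, then every feature's correlation with
# `minutes`, strongest first.
corr = df.corr()
print(corr["minutes"].sort_values(ascending=False))
```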
Lastly, the way this correlation matrix can be plotted for a set of features is remarkable. For example, taking minutes, w_svpt, and w_1stIn, we can see graphically what the correlation between them looks like. Note that on the diagonal a histogram is plotted instead, since it makes no sense to plot the correlation between a variable and itself.
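That plot comes from pandas' scatter_matrix; a sketch with the same three feature names on synthetic stand-in data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use %matplotlib inline
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-ins correlated with match length.
rng = np.random.default_rng(3)
minutes = rng.normal(110, 25, 300)
df = pd.DataFrame({
    "minutes": minutes,
    "w_svpt": minutes * 0.7 + rng.normal(0, 8, 300),
    "w_1stIn": minutes * 0.45 + rng.normal(0, 6, 300),
})

# Pairwise scatter plots for every feature combination; the diagonal
# shows each feature's histogram instead of the trivial self-correlation.
axes = scatter_matrix(df, figsize=(10, 10))
```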
This is just the tip of the iceberg. Fewer than 20 lines of code are enough, thanks to this ecosystem built around Jupyter Notebooks and Anaconda, which manages most of the packages related to Big Data and Machine Learning.
Further reading on Data Structures on our blog: Probabilistic Data Structures to Improve System Performance