
From Viz to ML

  • Writer: Faline Rezvani
  • May 14, 2024
  • 2 min read

Updated: May 28



Colson, G. (2011). Leading British Phobias [Painting].


Data visualization can quickly and clearly reveal the distribution of habits, preferences, or even fears within a population. Even simple representations of open data can lay the foundation for a machine learning (ML) project.

The Young People Survey is a publicly available dataset collected in 2013 from Slovak participants aged 15 to 30 (Sabo, M. n.d.). It contains 1,010 samples and 157 features, several of which are devoted to phobias.


Understanding the Data

 
With this information, we can learn more about ophidiophobia, or the fear of snakes. In addition to rating various hobbies and interests, the survey participants were asked to rate their fear of snakes on a 5-point scale, with 1 representing the lowest level of fear. Two quick visualizations show how many times each rating occurs, as well as the minimum, maximum, median, and interquartile range (IQR), the span between the lower and upper quartiles.
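The numbers behind those two charts can be sketched in a few lines. This is a minimal illustration, not the code used for the charts in this post: the ratings list is made up, and it assumes the summary statistics are taken over the individual ratings.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical ratings on the survey's 1-5 scale (not the real responses).
ratings = [1, 1, 2, 3, 5, 5, 4, 1, 2, 5, 3, 1]

# Bar-chart data: how many times each rating occurs.
counts = Counter(ratings)
for rating in range(1, 6):
    print(rating, counts[rating])

# Box-plot data: min, quartiles, median, max, and IQR.
# method="inclusive" treats the list as the whole population of interest.
q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
summary = {
    "min": min(ratings),
    "q1": q1,
    "median": median(ratings),
    "q3": q3,
    "max": max(ratings),
    "iqr": q3 - q1,
}
print(summary)
```

Different tools use slightly different quartile conventions, so the IQR shown in a charting tool may not match this calculation exactly.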


To further understand the meaning associated with each rating, we can reference the Fear Cognition Scale (FCS) developed by Murad Salman Mirza (Mirza, M. 2018):


Mirza, M. (2018). Fear Cognition Scale [Digital image].


In Tableau, we can enhance our initial bar chart:


Now viewers can quickly and clearly see there are almost as many people with no fear of snakes as there are people with a critical fear.
 
 
 

Insight into Action

 
Having learned of this large population uninhibited by the fear of snakes, an online publication for snake enthusiasts may want to optimize its advertising efforts. It can explore the statistics and relationships of this dataset and build a predictive model to help target readers who are “not fearful” of snakes.
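Before a model can predict who is “not fearful”, the 1 to 5 rating has to become a yes/no label. The sketch below shows one way to do that. The ratings are made up, and the cutoff (only a rating of 1 counts as “not fearful”) is an assumption for illustration; a different cutoff may suit the publication better.

```python
# Hypothetical snake-fear ratings on the 1-5 scale (1 = lowest fear).
snake_ratings = [1, 3, 5, 1, 2, 4, 1]

# Assumption: only a rating of 1 counts as "not fearful".
NOT_FEARFUL_MAX = 1


def is_not_fearful(rating: int) -> int:
    """Return 1 for a "not fearful" respondent, 0 otherwise."""
    return int(rating <= NOT_FEARFUL_MAX)


# Binary target column a classifier could be trained on.
labels = [is_not_fearful(r) for r in snake_ratings]
share = sum(labels) / len(labels)
print(labels, f"{share:.0%} not fearful")
```

Raising NOT_FEARFUL_MAX to 2 would widen the target audience at the cost of including respondents with some fear.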
 
Using Colab, a secure Google Cloud-based programming environment, we can load the dataset and begin exploring. Rather than use all 157 features of the survey for this exercise, we load only 12 columns unrelated to demographics, plus our one target column.
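Loading a subset of columns might look something like the sketch below. It is not the notebook code from this post: the in-memory CSV stands in for the real file, the column headers and values are hypothetical, and only three feature columns are shown for brevity.

```python
import csv
import io

# Stand-in for the survey CSV; headers and values are hypothetical.
raw_csv = io.StringIO(
    "Music,Dancing,Cars,Age,Snakes\n"
    "5,3,2,20,1\n"
    "4,1,5,19,5\n"
    "5,,3,22,2\n"
)

FEATURES = ["Music", "Dancing", "Cars"]  # interest columns to keep
TARGET = "Snakes"                        # column to predict


def load_columns(handle, columns):
    """Read only the named columns, skipping rows with missing ratings."""
    rows = []
    for record in csv.DictReader(handle):
        values = [record[name] for name in columns]
        if all(values):  # blank strings mark unanswered questions
            rows.append({name: int(v) for name, v in zip(columns, values)})
    return rows


rows = load_columns(raw_csv, FEATURES + [TARGET])
print(len(rows), "complete rows; columns:", list(rows[0]))
```

This version uses only the standard library so it runs anywhere. In Colab, pandas' read_csv with its usecols parameter is the more usual way to pull in selected columns.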




Before the features are used to make predictions, they must be inspected for relationships with one another. For example, as the rating for ‘Music’ increases, will the rating for ‘Dancing’ also increase?


Luckily, no. Features that are strongly correlated with one another can result in misleading ML models. With this heatmap we can see the most closely related features are ‘Cars’ and ‘Science and Technology’, but the relationship is weak enough that it won’t pose a problem.
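The values in such a heatmap are pairwise correlation coefficients. As a rough sketch, assuming Pearson correlation is the measure, they can be computed as follows. The ratings are made up, and the 0.8 cutoff for “too closely related” is an illustrative assumption rather than a fixed rule.

```python
from itertools import combinations
from math import sqrt

# Hypothetical ratings per feature; names follow the article, values do not.
features = {
    "Music": [5, 4, 5, 3, 5, 4],
    "Dancing": [3, 1, 4, 2, 1, 5],
    "Cars": [2, 5, 3, 4, 1, 5],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


THRESHOLD = 0.8  # assumed cutoff for "too closely related"

# One coefficient per unordered pair of features.
pairs = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}
flagged = [pair for pair, r in pairs.items() if abs(r) >= THRESHOLD]

for pair, r in pairs.items():
    print(pair, round(r, 2))
print("flagged:", flagged)
```

Any pair that lands in flagged is a candidate for dropping or combining before training.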
 
Only after thoroughly dissecting the dataset with exhaustive exploratory analysis can the ML model life cycle at last begin.



Increased awareness of the process underlying predictive modeling can help any individual build a framework of questions prior to collaborating on an ML project.

 

 

 

 

 

“The secret of getting ahead is getting started.”

- Mark Twain

 

  

 

 

 

 



