Smoking Data Analysis
Data Visualization, Interactive Dashboard
ROLES
Research
Data Cleaning
Analysis
Visualization
TIMELINE
November 2024
(14 hours)
TOOLS
Tableau, RStudio
My partner and I were challenged to identify a research question relevant to us and gather data to explore and answer it.
During our topic exploration, we came across an article suggesting that the correlation between smoking and health metrics like cholesterol levels and blood pressure is often misunderstood. This discovery led us to pose the question:
"What factors cause a person to smoke, and what are its effects?"
To investigate, we examined potential causal factors such as age, gender, ethnicity, and education, as well as biometric indicators like heart rate and cholesterol levels, as highlighted in the article that inspired us.
Data Wrangling
RStudio
We used RStudio to wrangle both datasets, which included removing entries that contained missing or extraneous values.
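The snippet below is a minimal Python sketch of the missing-value filter, not the R script used in the project. The field names, the sample rows, and the set of markers treated as "missing" are all hypothetical.

```python
# Hypothetical rows standing in for a raw survey export.
rows = [
    {"age": 38, "gender": "Female", "cigs_per_week": 12},
    {"age": None, "gender": "Male", "cigs_per_week": 20},   # missing age
    {"age": 45, "gender": "", "cigs_per_week": 0},          # missing gender
    {"age": 29, "gender": "Male", "cigs_per_week": 5},
]

# Assumption: these markers denote a missing value in the export.
MISSING = {None, "", "NA"}

def drop_incomplete(records):
    """Keep only records in which every field has a usable value."""
    return [r for r in records if not any(v in MISSING for v in r.values())]

clean = drop_incomplete(rows)
print(len(rows), "->", len(clean))
```

Note that a legitimate value of 0 (a non-smoker) is kept, since only the explicit missing markers are filtered out.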
To investigate our question, "What factors cause a person to smoke, and what are its effects?", we worked with two datasets:
Smoking Dataset from UK (National STEM Centre), 1691 Entries

We used this dataset to explore the potential causes of smoking, checking whether smoking habits vary with demographics such as age, gender, ethnicity, and education.
Hypertension Risk Dataset (MD Raihan Khan), 3900 Entries

We used this dataset to explore the effects of smoking on overall health, focusing on cholesterol levels and heart rate.

We also used R to add new columns to the Hypertension Risk Dataset that assign each entry a benchmark category for cholesterol level (healthy, at risk, high) and heart rate (healthy, normal, high).

Cholesterol Levels
Healthy (< 200 mg/dL)
At Risk (≥ 200 mg/dL & < 240 mg/dL)
High (≥ 240 mg/dL)

Heart Rate
Healthy (< 60 bpm)
Normal (≥ 60 bpm & ≤ 100 bpm)
High (> 100 bpm)
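The benchmark columns can be sketched as two small classification functions. This is a Python illustration of the thresholds listed above, not the project's R code; the function names, field names, and sample records are hypothetical.

```python
def cholesterol_level(mg_dl):
    """Map a total cholesterol reading (mg/dL) to a benchmark category."""
    if mg_dl < 200:
        return "Healthy"
    if mg_dl < 240:
        return "At Risk"
    return "High"

def heart_rate_level(bpm):
    """Map a heart rate reading (bpm) to a benchmark category."""
    if bpm < 60:
        return "Healthy"
    if bpm <= 100:
        return "Normal"
    return "High"

# Hypothetical records; the real dataset's column names may differ.
records = [
    {"cholesterol": 195, "heart_rate": 58},
    {"cholesterol": 225, "heart_rate": 80},
    {"cholesterol": 260, "heart_rate": 104},
]

# Add the two derived columns to each record.
for r in records:
    r["cholesterol_level"] = cholesterol_level(r["cholesterol"])
    r["heart_rate_level"] = heart_rate_level(r["heart_rate"])

for r in records:
    print(r)
```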
Interactive Dashboard
Tableau
Causes for Smoking
(Age, Gender, Ethnicity, and Education)
Age + Gender (Stacked Bar Chart)
Features:
Hover (View Gender, Age, # of Cigs Smoked/Week)
Click (Highlight all data in chart corresponding to a gender)
Observations:
For age, we found that there may be a potential relationship, as the results displayed a bell-shaped curve. Individuals in the middle age range (around 30–40 years old) appeared to smoke more on average than those in younger or older age groups. However, additional research would be needed to confirm this correlation, as the trend could simply reflect the age distribution of the dataset.
For gender, we observed no clear relationship: the amounts smoked looked similar across genders.
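One way to probe the age caveat above is to compare the total cigarettes smoked in each age band with the average per respondent in that band. The Python sketch below uses made-up respondents, chosen so that everyone smokes the same amount, to show how a band with more respondents can still produce a taller bar. The 10-year band width and all values are assumptions.

```python
from collections import defaultdict

# Hypothetical (age, cigarettes per week) pairs; every respondent smokes 30.
respondents = [(22, 30), (34, 30), (36, 30), (38, 30), (61, 30)]

def age_band(age, width=10):
    """Label the age band an age falls into, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

totals = defaultdict(int)
counts = defaultdict(int)
for age, cigs in respondents:
    band = age_band(age)
    totals[band] += cigs
    counts[band] += 1

# Per band: (number of respondents, total cigarettes, mean per respondent).
summary = {b: (counts[b], totals[b], totals[b] / counts[b]) for b in totals}

for band in sorted(summary):
    print(band, summary[band])
```

Here the 30-39 band has the largest total only because it has the most respondents; the per-respondent mean is identical in every band.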
Ethnicity (Box Plots)
Features:
Hover to View:
Box Plot Statistics: Maximum, Minimum, Median, Mean, Q1, Q3
Point Statistics: # of Cigs Smoked/Week, Ethnicity
Observations:
We found there may be a potential relationship, as individuals identified as "White" tended to smoke more on average than those identified as Asian, Black, or Chinese. However, additional research would be needed to confirm this correlation, as the trend could be influenced by the larger number of data points representing individuals who were "White" compared to other ethnicities.
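The box plot statistics shown on hover, along with the group sizes behind the sample-size caveat, can be sketched in Python as follows. The group labels and values are hypothetical, and quartile conventions differ between tools, so these numbers would not necessarily match what Tableau displays.

```python
from statistics import mean, median, quantiles

def box_stats(values):
    """Summary statistics for one box plot, plus the group size n."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }

# Hypothetical cigarettes-per-week values for two groups of unequal size.
cigs_by_group = {
    "Group A": [0, 5, 10, 15, 20, 25, 30, 40],
    "Group B": [5, 10, 20],
}

stats = {group: box_stats(values) for group, values in cigs_by_group.items()}
for group, s in stats.items():
    print(group, s)
```

Reporting n next to each box makes it easier to judge how much weight a difference between groups can carry.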
Education Status (Pie Charts)
Features:
Hover (View Highest Education Level Achieved, Smoking Status (yes, no), % of Pie Chart)
Click (Adjust dashboard to reflect data pertaining to selected section of pie chart)
Observations:
We observed no clear relationship between highest education level achieved and whether an individual smoked. The distribution of education levels was similar for both smoking statuses (yes, no), and when we clicked a pie section to filter the other visualizations, the trends stayed similar across education levels.
Effects of Smoking
(Cholesterol Levels, Heart Rate)
Cholesterol Levels (Pie Charts, Box Plots)
Features:
Pie Charts:
Hover (View Cholesterol Risk Level, Smoking Status (yes, no), % of Pie Chart)
Box Plots:
Hover to View:
Box Plot Statistics: Maximum, Minimum, Median, Mean, Q1, Q3
Point Statistics: Cholesterol Risk Level, # of Cigs Smoked per Day
Observations:
We observed no significant correlation suggesting that smoking affects an individual's cholesterol levels.
The pie charts showed nearly identical distributions of risk levels between smokers and non-smokers. This pattern was also reflected in the box plots, where the central values for the number of cigarettes smoked per day were similar across different cholesterol levels.
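The pie chart comparison amounts to computing each risk level's share within smokers and within non-smokers, then checking how far apart the shares are. The Python sketch below illustrates this with hypothetical labels and counts; it is not a statistical test, only the arithmetic behind the visual comparison.

```python
from collections import Counter

def share_by_level(levels):
    """Percentage of entries in each risk level, rounded to one decimal."""
    counts = Counter(levels)
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}

# Hypothetical risk-level labels for two groups of different sizes.
smokers = ["Healthy", "At Risk", "At Risk", "High"]
non_smokers = ["Healthy", "Healthy", "At Risk", "At Risk",
               "At Risk", "At Risk", "High", "High"]

smoker_share = share_by_level(smokers)
non_smoker_share = share_by_level(non_smokers)

# Largest difference, in percentage points, between the two pies.
all_levels = set(smoker_share) | set(non_smoker_share)
largest_gap = max(
    abs(smoker_share.get(lv, 0.0) - non_smoker_share.get(lv, 0.0))
    for lv in all_levels
)

print(smoker_share)
print(non_smoker_share)
print("largest gap (percentage points):", largest_gap)
```

Because shares are computed within each group, the comparison is not distorted by one group being larger than the other.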
Heart Rate (Pie Charts, Grouped Histogram, Box Plots)
Features:
Pie Charts:
Hover (View Heart Rate Risk Level, Smoking Status (yes, no), % of Pie Chart)
Click (Adjust dashboard to reflect data pertaining to selected section of pie chart)
Grouped Histogram:
Hover (View Smoking Status (yes, no), Heart Rate, Count)
Box Plots:
Hover to View:
Box Plot Statistics: Maximum, Minimum, Median, Mean, Q1, Q3
Point Statistics: Heart Rate Risk Level, # of Cigs Smoked per Day
Observations:
We observed no significant correlation suggesting that smoking has an effect on an individual's heart rate.
The pie charts showed nearly identical distributions of risk levels between smokers and non-smokers. Similarly, the grouped histogram revealed that the frequency of individuals across heart rate bins was fairly consistent regardless of smoking status. This trend was further supported by the box plots, where the central values (medians) for the number of cigarettes smoked per day appeared comparable.
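The counts behind a grouped histogram like this one can be sketched in Python by binning heart rates and tallying each bin separately for smokers and non-smokers. The readings and the 10 bpm bin width are hypothetical; the dashboard's actual bins may differ.

```python
from collections import Counter

def bin_label(bpm, width=10):
    """Label the heart rate bin a reading falls into, e.g. 75 -> '70-79'."""
    low = (bpm // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical (is_smoker, heart rate in bpm) readings.
readings = [
    (True, 62), (True, 75), (True, 78), (True, 91),
    (False, 65), (False, 72), (False, 77), (False, 95),
]

# Count entries per (bin, smoking status) pair.
counts = Counter((bin_label(bpm), smoker) for smoker, bpm in readings)

for label in sorted({b for b, _ in counts}):
    print(label, "smokers:", counts[(label, True)],
          "non-smokers:", counts[(label, False)])
```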
Telling a Story...
The hardest part of this project was figuring out how to make the data tell a story that made sense to someone seeing it for the first time.

We ultimately solved this by structuring the dashboard in two parts: one to visualize the potential causes of smoking (age, gender, ethnicity, education), and the other to explore its potential effects through key health indicators (cholesterol levels, heart rate).

To further improve clarity, we applied a consistent and distinct color scheme to each factor. This was especially important in the "effects" section, where multiple visualizations were used for each factor. By using the same color across related visuals, viewers could easily identify which charts corresponded to which factor, resulting in a more cohesive and intuitive experience.

This experience showed me how powerful clear design and storytelling can be when it comes to making data both understandable and impactful.