Jubayer Hossain

Undergraduate Research - Importance, Benefits, and Challenges

Mon, 08 Aug 2022 00:00:00 +0000

Heart Disease Analysis and Prediction Using Machine Learning

Sun, 07 Aug 2022 00:00:00 +0000

Introduction

Heart disease describes a range of conditions that affect your heart. Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease; heart rhythm problems (arrhythmia); and heart defects you’re born with (congenital heart defects), among others. The term “heart disease” is often used interchangeably with the term “cardiovascular disease.” Cardiovascular disease refers to conditions characterized by narrowed or blocked blood vessels, which can result in a heart attack, chest pain (angina), or stroke. Other heart conditions, such as those affecting your heart’s muscle, valves, or rhythm, are also classified as heart disease. Many types of heart disease can be avoided or treated by adopting a healthy lifestyle.

Source: https://www.cdc.gov/heartdisease/about.htm

Symptoms

Chest pain, chest tightness, chest pressure and chest discomfort (angina)
Shortness of breath
Pain, numbness, weakness or coldness in your legs or arms if the blood vessels in those parts of your body are narrowed
Pain in the neck, jaw, throat, upper abdomen or back

Source: https://www.cdc.gov/heartdisease/risk_factors.htm

Objective(s)

Using the provided heart dataset, we analyse how various features relate to the likelihood of a heart attack, and then use that analysis to predict whether an individual is prone to a heart attack.
The detailed analysis begins with exploratory data analysis (EDA).
Classification for prediction can be done with various machine learning algorithms; we choose the model best suited to heart attack analysis and finally save it to a pickle (.pkl) file.
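
The fit-then-save step described above can be sketched in R, the language used in the later posts of this feed. This is only an illustration under stated assumptions: the data are simulated (not the Kaggle file), logistic regression is just one candidate classifier, and saveRDS() stands in for the pickle (.pkl) file, which is a Python format. The names toy, fit, and model_path are hypothetical.

```r
# Simulated stand-in for the heart dataset (hypothetical values, not the Kaggle data)
set.seed(42)
n <- 200
toy <- data.frame(
  age    = round(runif(n, 30, 75)),
  trtbps = round(rnorm(n, 130, 15)),
  chol   = round(rnorm(n, 240, 40))
)
# Simulated binary target, loosely driven by age (an assumption for illustration only)
toy$target <- rbinom(n, 1, plogis(-6 + 0.1 * toy$age))

# Logistic regression as one simple candidate classifier
fit <- glm(target ~ age + trtbps + chol, data = toy, family = binomial)

# Persist the fitted model; saveRDS() plays the role that pickle plays in Python
model_path <- file.path(tempdir(), "heart_model.rds")
saveRDS(fit, model_path)

# Reload the model and predict probabilities of the positive class
reloaded <- readRDS(model_path)
pred <- predict(reloaded, newdata = toy, type = "response")
head(pred)
```

In a real analysis the model would be chosen by comparing several algorithms on held-out data rather than fitted once on the full sample.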

Research Question(s)

Does the age of a person contribute towards heart attack?
Are different types of chest pain related to each other or the possibility of getting a heart attack?
Does high blood pressure increase the risk of heart attack?
Does the cholesterol level eventually contribute as a risk factor towards heart attack?

Dataset Information

Age : Age of the patient
Sex : Sex of the patient
exang: exercise induced angina (1 = yes; 0 = no)
ca: number of major vessels (0-3)
cp : Chest Pain type
- Value 1: typical angina
- Value 2: atypical angina
- Value 3: non-anginal pain
- Value 4: asymptomatic
trtbps : resting blood pressure (in mm Hg)
chol : cholesterol in mg/dl fetched via BMI sensor
fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
rest_ecg : resting electrocardiographic results
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
thalach : maximum heart rate achieved
target :
- 0 = less chance of heart attack
- 1 = more chance of heart attack

Data Source: https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

Data Analysis Workflow

Data Collection
Importing Data
Data Cleaning
- Handling Missing Data
- Outlier Detection and Removal
Exploring Data using Descriptive Statistics
- Understanding Data using
  - Univariate Analysis
  - Bivariate Analysis
  - Multivariate Analysis
- Understanding Data using Visualizations
  - Univariate
    - Histograms
    - Density Plot
  - Bivariate
    - Scatter Plot
    - Boxplot
  - Multivariate
    - Correlation Matrix
    - Covariance Matrix
Decision Making using Inferential Statistics
- Hypothesis Testing (T-Test, Z-Test, Chi-square, ANOVA)
- Creating Predicting Models

Biocomputing: A New and Quickly Developing Field

Thu, 04 Aug 2022 00:00:00 +0000

Do you enjoy maths and biology? Do you enjoy writing code? Then focus on biocomputing, the career of the future that will give you a variety of job options. In this article, I will discuss what biocomputing is and why we should learn these skills in the modern day. In particular, students with a life science background should learn the biocomputing workflow.

What is Biocomputing?

Biocomputing is an innovative branch of technology that functions at the interface of biology, engineering, and computer science. It tries to employ cells or their subcomponent molecules (such as DNA or RNA) to do activities normally performed by a computer.

Features of Biocomputing

Biocomputing (Bioinformatics) is placed at the intersection of Medicine, Biology, Applied Mathematics, and Computer Science. Those who have chosen this career are responsible for addressing global issues such as:

Search for methods of treatment of cancer, chronic and autoimmune diseases;
Extending the life of the population, improving the ecological situation and searching for the longevity genome;
Development, planning, implementation of mathematical methods, algorithms and programs used for the analysis of medical and biological information;
Application of the obtained research and practice results.

Why Biocomputing?

Modern diagnostic and research techniques have led to a growth in the quantity of scientific data, which is extremely difficult to manually process. In this instance, Biocomputing (Bioinformatics) is of assistance. In the second part of the 20th century, it emerged as an interdisciplinary science. Biocomputing incorporates components of applied mathematics, statistics, computer science, mathematical and computer modeling, and programming.

The field is new and expanding rapidly, and it is likely to keep growing because computational methods offer high accuracy and speed and reduce human error. There is a demand for biocomputing technology in biochemistry, molecular biology, microbiology, pharmacy, biophysics, ecology, pharmacology, agriculture, and genetics, among other fields.

Advantages of Biocomputing

All options are open to Biocomputing professionals, from local research institutions to reputable worldwide IT firms.
Biocomputing specialists do not engage directly with patients or biological material, as their work involves mathematical methodologies and computer systems.
Knowledge of programming languages and the fundamentals of applied mathematics enables Biocomputing professionals to quickly transition to other fields, such as traditional programming, genomic data science, software development, and testing.
Continuous self-development and improvement of professional skills.
The ability to analyze data sets, knowing that the results of work in the long term will save the lives of thousands of people.

Some Major Courses in Biocomputing

Introduction to Life Sciences & Bioinformatics
Applied Mathematics
Web programming and Databases
Bioinformatics Lab I
Bioinformatics Lab II
Introduction to Programming Languages (Python, R, and Julia)
Fundamentals of Biostatistics
Data Science: Introduction
Machine Learning (Artificial Intelligence)

Leading Positions in Biocomputing

Data Analyst
Bioinformatician
Bioinformatics Scientist
Data Scientist

We are working on designing a biocomputing program at CHIRAL Bangladesh. Next, I will share how you can start learning biocomputing.

Interpreting Data Using Descriptive Statistics with R

Thu, 28 Jul 2022 00:00:00 +0000

Introduction

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Exploratory data analysis (EDA) methods are often called Descriptive Statistics due to the fact that they simply describe, or provide estimates based on, the data at hand.

Exploratory Data Analysis

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” —John Tukey

EDA consists of:

Organizing and summarizing the raw data,
Discovering important features and patterns in the data and any striking deviations from those patterns
Interpreting our findings in the context of the problem

And can be useful for:

Describing the distribution of a single variable (center, spread, shape, outliers)
Checking data (for errors or other problems)
Checking assumptions to more complex statistical analyses
Investigating relationships between variables

Features of Exploratory Data Analysis

This notebook covers two broad topics:
- Examining Distributions — exploring data one variable at a time.
- Examining Relationships — exploring data two variables at a time.
In Exploratory Data Analysis, our exploration of data will always consist of the following two elements:
- Visual displays
- Numerical measures.

Working with Data using R

In this lesson, we will explore the pulse dataset using R and perform exploratory data analysis on it.

Load Packages

# Load packages 
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(gtsummary)
library(gt)
library(datasets)

Load and Explore Data

# Read Data 
data <- read.csv("data/pulse_data.csv", stringsAsFactors = TRUE)
gt(head(data))

Height	Weight	Age	Gender	Smokes	Alcohol	Exercise	Ran	Pulse1	Pulse2	BMI	BMICat
1.73	57	18	Female	No	Yes	Moderate	No	86	88	19.04507	Underweight
1.79	58	19	Female	No	Yes	Moderate	Yes	82	150	18.10181	Underweight
1.67	62	18	Female	No	Yes	High	Yes	96	176	22.23099	Normal
1.95	84	18	Male	No	Yes	High	No	71	73	22.09073	Normal
1.73	64	18	Female	No	Yes	Low	No	90	88	21.38394	Normal
1.84	74	22	Male	No	Yes	Low	Yes	78	141	21.85728	Normal

# Check data structure 
glimpse(data)

Rows: 108
Columns: 12
$ Height   <dbl> 1.73, 1.79, 1.67, 1.95, 1.73, 1.84, 1.62, 1.69, 1.64, 1.68, 1…
$ Weight   <dbl> 57, 58, 62, 84, 64, 74, 57, 55, 56, 60, 75, 58, 68, 59, 72, 1…
$ Age      <int> 18, 19, 18, 18, 18, 22, 20, 18, 19, 23, 20, 19, 22, 18, 18, 2…
$ Gender   <fct> Female, Female, Female, Male, Female, Male, Female, Female, F…
$ Smokes   <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, No, No, …
$ Alcohol  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Ye…
$ Exercise <fct> Moderate, Moderate, High, High, Low, Low, Moderate, Moderate,…
$ Ran      <fct> No, Yes, Yes, No, No, Yes, No, No, No, Yes, Yes, No, No, No, …
$ Pulse1   <dbl> 86, 82, 96, 71, 90, 78, 68, 71, 68, 88, 76, 74, 70, 78, 69, 7…
$ Pulse2   <dbl> 88, 150, 176, 73, 88, 141, 72, 77, 68, 150, 88, 76, 71, 82, 6…
$ BMI      <dbl> 19.04507, 18.10181, 22.23099, 22.09073, 21.38394, 21.85728, 2…
$ BMICat   <fct> Underweight, Underweight, Normal, Normal, Normal, Normal, Nor…

One Categorical Variable

Distribution of One Categorical Variable
Numerical Summaries
- One-way Frequency Table(Counts)
- One-way Frequency Table(Percentages)
- One-way Frequency Table(Combination of Counts and Percentages)
Visual or Graphical Displays
- Bar Chart - Great for categorical data visualization
- Pie Chart - Use with caution for summarizing categorical data

Distribution of One Categorical Variable

Here is some information that would be interesting to get from these data:

What percentage of the sampled respondents fall into each category?
How are respondents divided across the four BMI categories? Are they equally divided? If not, do the percentages follow some other kind of pattern?

Numerical Measures

In order to summarize the distribution of a categorical variable, we first create a table of the different values (categories) the variable takes, how many times each value occurs (count) and, more importantly, how often each value occurs (by converting the counts to percentages).

The result is often called a Frequency Distribution or Frequency Table.
A Frequency Distribution or Frequency Table is the primary set of numerical measures for one categorical variable.
Consists of a table with each category along with the count and percentage for each category.
Provides a summary of the distribution for one categorical variable.

One-way Frequency Table(Counts)

# One-way frequency table 
data %>% 
  group_by(BMICat) %>% 
  summarise(frequency = n())

# A tibble: 4 × 2
  BMICat      frequency
  <fct>           <int>
1 Normal             62
2 Obese               2
3 Overweight         17
4 Underweight        27

One-way Frequency Table(Counts, Percentage)

data %>% 
  group_by(BMICat) %>% 
  summarise(counts = n()) %>% 
  mutate(percent = counts/sum(counts) *100)

# A tibble: 4 × 3
  BMICat      counts percent
  <fct>        <int>   <dbl>
1 Normal          62   57.4 
2 Obese            2    1.85
3 Overweight      17   15.7 
4 Underweight     27   25

Visual or Graphical Displays

There are two simple graphical displays for visualizing the distribution of one categorical variable:

Bar Charts
Pie Charts

Bar Charts

To describe the number of observations in each category of the discrete variable
To visualize estimated error for discrete variables

# Visualize one categorical variable (bars in the default factor order)
data %>% 
  ggplot(aes(x = BMICat))+
  geom_bar(fill = "#97B3C6")

# Sort the bars by frequency using `fct_infreq()`
data %>% 
  ggplot(aes(x = fct_infreq(BMICat)))+
  geom_bar(fill = "#97B3C6")

# Summaries(counts) data for visualizing the distribution 
df1 <- data %>% 
  group_by(BMICat) %>% 
  summarise(counts = n()) %>% 
  arrange(counts)

df1

# A tibble: 4 × 2
  BMICat      counts
  <fct>        <int>
1 Obese            2
2 Overweight      17
3 Underweight     27
4 Normal          62

# Show the number of observations on top of each bar 
ggplot(df1, aes(x = BMICat, y = counts)) +
  geom_bar(fill = "#97B3C6", stat = "identity") +
  geom_text(aes(label = counts), vjust = -0.3)

# Sort the bars in ascending order using `reorder()`
ggplot(df1, aes(x = reorder(BMICat, counts), y = counts)) +
  geom_bar(fill = "#97B3C6", stat = "identity") +
  geom_text(aes(label = counts), vjust = -0.3)

# Sort the bars in descending order using `reorder()` and `desc()`
ggplot(df1, aes(x = reorder(BMICat, desc(counts)), y = counts)) +
  geom_bar(fill = "#97B3C6", stat = "identity") +
  geom_text(aes(label = counts), vjust = -0.3)

# Calculate percentage of each category 
df2 <- data %>% 
  group_by(BMICat) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(BMICat)) %>% 
  mutate(prop = round(counts*100/sum(counts), 1))
df2

# A tibble: 4 × 3
  BMICat      counts  prop
  <fct>        <int> <dbl>
1 Underweight     27  25  
2 Overweight      17  15.7
3 Obese            2   1.9
4 Normal          62  57.4

# Sorting the bars using `reorder()`
ggplot(df2, aes(x = reorder(BMICat, counts), y = prop)) +
  geom_bar(fill = "#97B3C6", stat = "identity") +
  geom_text(aes(label = prop), vjust = -0.3)

# Show the percentage (%) on top of each bar 
ggplot(df2, aes(x = BMICat, y = prop)) +
  geom_bar(fill = "#97B3C6", stat = "identity") +
  geom_text(aes(label = prop), vjust = -0.3)

# Sorting the bars using `reorder()` and  `desc()`
ggplot(df2, aes(x = reorder(BMICat, desc(prop)), y = prop)) +
  geom_bar(fill = "#97B3C6", stat = "identity") +
  geom_text(aes(label = prop), vjust = -0.3)

# Customize the plot
ggplot(df2, aes(x = reorder(BMICat, desc(counts)), y = prop)) +
  geom_bar(fill = "#97B3C6", stat = "identity") +
  geom_text(aes(label = prop), vjust = -0.3)+
  labs(title = "Distribution of BMICat", 
       x = "BMI Category", 
       y = "Percentage (%)", 
       caption = "Data Source: https://bolt.mph.ufl.edu/")

# Create bar chart using ggpubr 
ggbarplot(df2, x = "BMICat", y = "counts", fill = "#97B3C6")

# show counts 
ggbarplot(df2, x = "BMICat", y = "counts", fill = "#97B3C6", label = TRUE, lab.pos = "out")

# show percentages 
ggbarplot(df2, x = "BMICat", y = "prop", fill = "#97B3C6", label = TRUE, lab.pos = "out")

Pie charts

ggpie(df2, "prop", label = "BMICat", fill = "BMICat", 
      color = "white", 
      palette = c("#00AFBB", "#E7B800", "#FC4E07", "#97B3C6"))

# Show group names and value as labels
labs <- paste0(df2$BMICat, " (", df2$prop, "%)")

ggpie(df2, "prop", label =labs, fill = "BMICat", 
      color = "white", 
      palette = c("#00AFBB", "#E7B800", "#FC4E07", "#97B3C6"), lab.pos = "in")

# Change the position and font color of labels
labs <- paste0(df2$BMICat, "(", df2$prop, "%)")

ggpie(df2, "prop", label =labs, 
      lab.pos = "in", lab.font = "white",
      fill = "BMICat", 
      color = "white", 
      palette = c("#00AFBB", "#E7B800", "#FC4E07", "#97B3C6"))

One Quantitative Variable

Distribution of One Quantitative Variable
Numerical Measures
Graphs

Distribution of One Quantitative Variable

In this section, we will explore the data collected from a quantitative variable, and learn how to describe and summarize the important features of its distribution.

We will learn how to display the distribution using graphs and discuss a variety of numerical measures.

Numerical Measures

Measures of Center

Introduction
Mean
Median
Comparing the Mean and the Median

Mean

# Average BMI
data %>% 
  summarise(avg_bmi = mean(BMI))

   avg_bmi
1 22.03186

Median

# Median BMI
data %>% 
  summarise(median_bmi = median(BMI))

  median_bmi
1   21.57798

Graphs

Histograms

Shape: Overall appearance of histogram. Can be symmetric, bell-shaped, left skewed, right skewed, etc.
Center: Mean or Median
Spread: How far our data spreads. Range, Interquartile Range (IQR),standard deviation, variance.
Outliers: Data points that fall far from the bulk of the data

Interpretation (generic example; these numbers do not come from the pulse data): The distribution is bell shaped with a center of about 10, a range of 11 (5 to 16), and no apparent outliers.

# Calculate average height 
data %>% 
  summarise(avg_height = mean(Height))

  avg_height
1   1.732685

# Show the center in histogram 
gghistogram(data, x = "Height", add = "mean")

# Calculate median height 
data %>% 
  summarise(median_height = median(Height))

  median_height
1          1.73

# Show the center in histogram 
gghistogram(data, x = "Height", add = "median")

# Add mean  
gghistogram(data, x = "Height", bins = 15, fill = "#97B3C6", title = "Histogram of Height", xlab = "Height(m)", ylab = "Frequency", add = "mean")

Interpretation: The distribution of height is roughly bell shaped with a center of about 1.7m, a range of 0.55 meters (1.40 to 1.95), and no apparent outliers.

# Same histogram (15 bins) with a different fill color 
gghistogram(data, x = "Height", bins = 15, fill = "#58508d" , add = "mean")

# Compare mean and median 
data %>% 
  summarise(avg_bmi = mean(BMI), 
            median_bmi = median(BMI))

   avg_bmi median_bmi
1 22.03186   21.57798
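
To see why the mean and the median can differ, here is a minimal sketch on made-up numbers (the vector x is hypothetical, not taken from the pulse data). A single large value pulls the mean upward, while the median barely moves.

```r
# Toy right-skewed sample: one large value pulls the mean upward
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 50)

mean_x   <- mean(x)    # sensitive to the extreme value
median_x <- median(x)  # resistant to the extreme value

# Dropping the extreme value changes the mean a lot but not the median
x_trim <- x[x < 50]

c(mean = mean_x, median = median_x)
c(mean = mean(x_trim), median = median(x_trim))
```

When the mean is noticeably larger than the median, as for BMI above, it is worth checking the histogram for a longer right tail.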

Describing Distributions

Features of Distributions of Quantitative Variables
Shape (Symmetry/Skewness, Modality)
Center
Spread
Outliers

# Load and explore diabetes data 
diabetes <- read.csv("data/diabetes.csv", stringsAsFactors = TRUE)
gt(head(diabetes))

Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
6	148	72	35	0	33.6	0.627	50	1
1	85	66	29	0	26.6	0.351	31	0
8	183	64	0	0	23.3	0.672	32	1
1	89	66	23	94	28.1	0.167	21	0
0	137	40	35	168	43.1	2.288	33	1
5	116	74	0	0	25.6	0.201	30	0

Shape

When describing the shape of a distribution, we should consider:

Symmetry/skewness of the distribution.
Peakedness (modality) — the number of peaks (modes) the distribution has.

Symmetric Distributions

Note that all three distributions are symmetric, but are different in their modality (peakedness).

The first distribution is unimodal — it has one mode (roughly at 10) around which the observations are concentrated.
The second distribution is bimodal — it has two modes (roughly at 10 and 20) around which the observations are concentrated.
The third distribution is kind of flat, or uniform. The distribution has no modes, or no value around which the observations are concentrated. Rather, we see that the observations are roughly uniformly distributed among the different values.

# Check distribution of BMI 
gghistogram(diabetes, x = "BMI", fill = "#665191")

Skewed Right Distributions

A distribution is called skewed right if, as in the histogram above, the right tail (larger values) is much longer than the left tail (small values).

# Check distribution of age 
gghistogram(diabetes, x = "Age", fill = "#665191")

Skewed Left Distributions

A distribution is called skewed left if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values).

# Check distribution of Glucose 
gghistogram(diabetes, x = "Glucose", fill = "#665191")

Comments:
- Distributions with more than two peaks are generally called multimodal.
- Bimodal or multimodal distributions can be evidence that two distinct groups are represented.
- Unimodal, bimodal, and multimodal distributions may or may not be symmetric.

Center

The center of the distribution is often used to represent a typical value.

# Check distribution of BMI 
gghistogram(diabetes, x = "BMI", fill = "#665191", add = "mean", add_density = TRUE)

# Check distribution of BMI 
gghistogram(diabetes, x = "BMI", fill = "#665191", add = "median")

Spread

One way to measure the spread (also called variability or variation) of the distribution is to use the approximate range covered by the data.

# Check distribution of BloodPressure 
gghistogram(diabetes, x = "BloodPressure", fill = "#665191", add = "median")

Outliers

Outliers are observations that fall outside the overall pattern.

# Check distribution of BloodPressure 
gghistogram(diabetes, x = "BloodPressure", fill = "#665191", add = "median")

Measures of Spread

Range
Inter-Quartile Range (IQR)
Standard Deviation
Properties of the Standard Deviation
Choosing Numerical Measures

Range

The range covered by the data is the most intuitive measure of variability. The range is exactly the distance between the smallest data point (min) and the largest one (Max).

Range = Max – min

data %>% 
  summarise(max_height = max(Height), 
            min_height = min(Height), 
            range = max_height - min_height)

  max_height min_height range
1       1.95        1.4  0.55

Inter-Quartile Range (IQR)

While the range quantifies the variability by looking at the range covered by ALL the data, the Inter-Quartile Range or IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.

IQR = Q3 – Q1
Q3 = 3rd Quartile = 75th Percentile
Q1 = 1st Quartile = 25th Percentile

data %>% 
  summarise(
            min = fivenum(Weight)[1],
            Q1 = fivenum(Weight)[2],
            median = fivenum(Weight)[3],
            Q3 = fivenum(Weight)[4],
            max = fivenum(Weight)[5], 
            IQR = Q3 - Q1)

  min   Q1 median Q3 max  IQR
1  41 56.5     63 75 110 18.5
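
One detail worth knowing: fivenum() returns Tukey's hinges, which can differ slightly from the quartiles that quantile() and IQR() compute by default. The sketch below uses a small made-up vector (not the pulse data) to show the difference.

```r
# Small toy vector where hinges and default quartiles differ
x <- 1:10

hinges <- fivenum(x)                                  # min, lower hinge, median, upper hinge, max
quarts <- quantile(x, c(0.25, 0.75), names = FALSE)   # default (type 7) quartiles

iqr_hinges <- hinges[4] - hinges[2]   # spread based on hinges
iqr_quant  <- IQR(x)                  # spread based on quantile()

hinges
quarts
c(iqr_hinges = iqr_hinges, iqr_quant = iqr_quant)
```

For larger samples the two versions are usually close, so either is a reasonable choice as long as it is used consistently.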

Standard Deviation

data %>% 
  summarise(avg_height = mean(Height), 
            std = sd(Height))

  avg_height       std
1   1.732685 0.1012133

Variance

data %>% 
  summarise(avg_height = mean(Height), 
            var_height = var(Height))

  avg_height var_height
1   1.732685 0.01024412

Measures of Position

Percentiles
Five-Number Summary
Standardized Scores (Z-Scores)
Measures of Position

Percentiles

In general, the P-th percentile is a value such that approximately P% of the values in the distribution fall below it and approximately (100 – P)% fall above it.
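
As a quick check of this definition, the sketch below computes the 90th percentile of a made-up vector (the integers 1 to 100, not the pulse data) with quantile() and counts how much of the data falls on each side.

```r
# Toy data: the integers 1 to 100
x <- 1:100

# 90th percentile using R's default quantile definition (type 7)
p90 <- quantile(x, 0.90, names = FALSE)

# Share of values below and above the percentile
below <- mean(x < p90)
above <- mean(x > p90)

c(p90 = p90, below = below, above = above)
```

The shares are only approximately P% and (100 – P)% in general, because real data contain ties and a finite number of observations.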

Five Number Summary

data %>% 
  summarise(
            min = fivenum(Weight)[1],
            Q1 = fivenum(Weight)[2],
            median = fivenum(Weight)[3],
            Q3 = fivenum(Weight)[4],
            max = fivenum(Weight)[5])

  min   Q1 median Q3 max
1  41 56.5     63 75 110

Standardized Scores (Z-Scores)

Z = (x – mean)/standard deviation

# Standardize BMI: subtract the mean first, then divide by the standard deviation
# (rounded to 2 decimals for display)
data %>% 
  mutate(zscore = round((BMI - mean(BMI)) / sd(BMI), 2)) %>% 
  head()

  Height Weight Age Gender Smokes Alcohol Exercise Ran Pulse1 Pulse2      BMI
1   1.73     57  18 Female     No     Yes Moderate  No     86     88 19.04507
2   1.79     58  19 Female     No     Yes Moderate Yes     82    150 18.10181
3   1.67     62  18 Female     No     Yes     High Yes     96    176 22.23099
4   1.95     84  18   Male     No     Yes     High  No     71     73 22.09073
5   1.73     64  18 Female     No     Yes      Low  No     90     88 21.38394
6   1.84     74  22   Male     No     Yes      Low Yes     78    141 21.85728
       BMICat zscore
1 Underweight  -0.90
2 Underweight  -1.19
3      Normal   0.06
4      Normal   0.02
5      Normal  -0.20
6      Normal  -0.05

Measures of Position

Measures of position also allow us to compare values from different distributions. For example, we can present the percentiles or z-scores of an individual’s height and weight. These two measures together would provide a better picture of how the individual fits in the overall population than either would alone.

Although measures of position are not stressed in this course as much as measures of center and spread, we have seen and will see many measures of position used in various aspects of examining the distribution of one variable and it is good to recognize them as measures of position when they appear.

Outliers Detection

Using the IQR to Detect Outliers
The 1.5(IQR) Criterion for Outliers
The 3(IQR) Criterion for Outliers
Understanding Outliers

Using the IQR to Detect Outliers

So far we have quantified the idea of center, and we are in the middle of the discussion about measuring spread, but we haven’t really talked about a method or rule that will help us classify extreme observations as outliers. The IQR is commonly used as the basis for a rule of thumb for identifying outliers.

The 1.5(IQR) Criterion for Outliers

An observation is considered a suspected outlier or potential outlier if it is:

below Q1 – 1.5(IQR) or
above Q3 + 1.5(IQR)

The following picture (not to scale) illustrates this rule:

The 3(IQR) Criterion for Outliers

An observation is considered an EXTREME outlier if it is:

below Q1 – 3(IQR) or
above Q3 + 3(IQR)
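
The two criteria above can be sketched as a pair of small helper functions. The names iqr_fences() and flag_outliers() are hypothetical, the data are made up, and quantile() with its default definition is assumed for Q1 and Q3.

```r
# Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Return the observations that fall outside the fences
flag_outliers <- function(x, k = 1.5) {
  f <- iqr_fences(x, k)
  x[!is.na(x) & (x < f[["lower"]] | x > f[["upper"]])]
}

# Toy data with one moderately large and one very large value
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 20, 30, 100)

suspected <- flag_outliers(x, k = 1.5)  # 1.5(IQR) criterion
extreme   <- flag_outliers(x, k = 3)    # 3(IQR) criterion

suspected
extreme
```

Here Q1 = 13.5 and Q3 = 19, so the 1.5(IQR) fences are 5.25 and 27.25 and the 3(IQR) fences are -3 and 35.5. The value 30 is a suspected outlier only, while 100 is also an extreme outlier.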

ds <- read.csv("data/500_Person_Gender_Height_Weight_Index.csv")
head(ds)

  Gender Height Weight Index
1   Male    174     96     4
2   Male    189     87     2
3 Female    185    110     4
4 Female    195    104     3
5   Male    149     61     3
6   Male    189    104     3

gghistogram(ds, x = "Height")

gghistogram(ds, x = "Weight")

Boxplots

The Five Number Summary
The Boxplot
Side-By-Side (Comparative) Boxplots

The Five Number Summary

So far, in our discussion about measures of spread, some key players were:

the extremes (min and Max), which provide the range covered by all the data; and
the quartiles (Q1, M and Q3), which together provide the IQR, the range covered by the middle 50% of the data.

Recall that the combination of all five numbers (min, Q1, M, Q3, Max) is called the five number summary, and provides a quick numerical description of both the center and spread of a distribution.

ds %>% 
  summarise(
            min = fivenum(Height)[1],
            Q1 = fivenum(Height)[2],
            median = fivenum(Height)[3],
            Q3 = fivenum(Height)[4],
            max = fivenum(Height)[5])

  min  Q1 median  Q3 max
1 140 156  170.5 184 199

The Boxplot

The central box spans from Q1 to Q3. In our example, the box spans from 32 to 41.5. Note that the width of the box has no meaning.
A line in the box marks the median M, which in our case is 35.
Lines extend from the edges of the box to the smallest and largest observations that were not classified as suspected outliers (using the 1.5xIQR criterion). In our example, we have no low outliers, so the bottom line goes down to the smallest observation, which is 21. Since we have three high outliers (61, 74, and 80), the top line extends only up to 49, which is the largest observation that has not been flagged as an outlier.
Outliers are marked with asterisks (*).

To summarize, the following information is visually depicted in the boxplot:

the five number summary (blue)
the range and IQR (red)
outliers (green)
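
The numbers a boxplot is drawn from can be inspected directly with base R's boxplot.stats(). The sketch below uses a made-up vector rather than the example described above; note that boxplot.stats() works with hinges (as fivenum() does) and applies the 1.5(IQR) rule through its coef argument.

```r
# Toy data with two high values
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 20, 30, 100)

bs <- boxplot.stats(x, coef = 1.5)

# Lower whisker, lower hinge (Q1), median, upper hinge (Q3), upper whisker
bs$stats

# Observations flagged as suspected outliers and drawn as separate points
bs$out
```

The upper whisker stops at 20, the largest observation not flagged as an outlier, while 30 and 100 are reported separately in bs$out.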

Side-By-Side (Comparative) Boxplots

ggboxplot(ds, y = "Height")

ggboxplot(ds, y = "Weight")

ggboxplot(ds, x = "Gender", y = "Height")

The “Normal” Shape

The Standard Deviation Rule
Visual Methods of Assessing Normality
Standardized Scores (Z-Scores)

The Standard Deviation Rule

The Standard Deviation Rule:

Approximately 68% of the observations fall within 1 standard deviation of the mean.
Approximately 95% of the observations fall within 2 standard deviations of the mean.
Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
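
The rule can be checked empirically with a simulation. This is a sketch on simulated standard normal data (not one of the datasets used in this lesson); the seed and sample size are arbitrary choices.

```r
# Simulate a large sample from a normal distribution
set.seed(1)
z <- rnorm(100000)

# Proportion of observations within 1, 2, and 3 standard deviations of the mean
within_sd <- sapply(1:3, function(k) mean(abs(z - mean(z)) <= k * sd(z)))

# Theoretical proportions from the normal distribution
theoretical <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))

round(rbind(simulated = within_sd, theoretical = theoretical), 4)
```

The rule only describes data that are approximately normal; for strongly skewed data the proportions can be quite different.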

Visual Methods of Assessing Normality

Standardized Scores (Z-Scores)

Z = (x - mean) / standard deviation

ds %>% 
  mutate(zscore = (Height - mean(Height)) / sd(Height)) %>% 
  head()

  Gender Height Weight Index     zscore
1   Male    174     96     4  0.2476907
2   Male    189     87     2  1.1637067
3 Female    185    110     4  0.9194357
4 Female    195    104     3  1.5301130
5   Male    149     61     3 -1.2790025
6   Male    189    104     3  1.1637067

Role Type Classification

While it is fundamentally important to know how to describe the distribution of a single variable, most studies pose research questions that involve exploring the relationship between two (or more) variables. These research questions are investigated using a sample from the population of interest.

Example Research Question(s)

Is there a relationship between gender and test scores on a particular standardized test? Other ways of phrasing the same research question:

Is performance on the test related to gender?
Is there a gender effect on test scores?
Are there differences in test scores between males and females?

Are the smoking habits of a person (yes, no) related to the person’s gender(male, female)?

Role of a Variable in a Study

In most studies involving two variables, each of the variables has a role. We distinguish between:

Response variable — the outcome of the study; and
Explanatory variable — the variable that claims to explain, predict or affect the response.

As we mentioned earlier the variable we wish to predict is commonly called the dependent variable, the outcome variable, or the response variable. Any variable we are using to predict (or explain differences) in the outcome is commonly called an explanatory variable, an independent variable, a predictor variable, or a covariate.

Typically the explanatory variable is denoted by X, and the response variable by Y.

Example

Research Question: Is there a relationship between gender and test scores on a particular standardized test? Other ways of phrasing the same research question:

Is performance on the test related to gender?
Is there a gender effect on test scores?
Are there differences in test scores between males and females?
Gender is the explanatory variable
Test score is the response variable

Role-Type Classification

If we further classify each of the two relevant variables according to type (categorical or quantitative), we get the following 4 possibilities for “role-type classification”

Categorical explanatory and quantitative response (Case CQ)
Categorical explanatory and categorical response (Case CC)
Quantitative explanatory and quantitative response (Case QQ)
Quantitative explanatory and categorical response (Case QC)
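
The four cases above can be determined mechanically from the column types. The helpers var_type() and role_type() below are hypothetical names, and the data frame is a made-up miniature of the pulse data. The sketch assumes that categorical variables are stored as factors or characters; a category coded as 0/1 numbers would be misread as quantitative.

```r
# "Q" for numeric columns, "C" for everything else (factor, character, logical)
var_type <- function(v) if (is.numeric(v)) "Q" else "C"

# Combine the explanatory and response types into a case label
role_type <- function(explanatory, response) {
  paste0(var_type(explanatory), var_type(response))
}

# Tiny made-up data frame with two categorical and two quantitative variables
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Male")),
  Ran    = factor(c("No", "Yes", "No")),
  Height = c(1.73, 1.95, 1.84),
  Weight = c(57, 84, 74)
)

role_type(toy$Gender, toy$Height)  # categorical explanatory, quantitative response
role_type(toy$Gender, toy$Ran)     # both categorical
role_type(toy$Height, toy$Weight)  # both quantitative
role_type(toy$Height, toy$Ran)     # quantitative explanatory, categorical response
```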

Example

Returning to the research question above (is there a relationship between gender and test scores on a particular standardized test?):

Gender is the explanatory variable (categorical)
Test score is the response variable (quantitative)
Therefore this is an example of case C → Q.

Case C-Q Categorical Explanatory and Quantitative Response

data %>% 
  select(Gender, BMI) %>% 
  group_by(Gender) %>% 
  summarise(Avg_BMI = mean(BMI))

# A tibble: 2 × 2
  Gender Avg_BMI
  <fct>    <dbl>
1 Female    20.8
2 Male      23.1

data %>% 
  group_by(Gender) %>% 
  summarise(n = n(),
            min = fivenum(BMI)[1],
            Q1 = fivenum(BMI)[2],
            median = fivenum(BMI)[3],
            Q3 = fivenum(BMI)[4],
            max = fivenum(BMI)[5])

# A tibble: 2 × 7
  Gender     n   min    Q1 median    Q3   max
  <fct>  <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
1 Female    50  16.6  19.0   20.6  22.2  29.0
2 Male      58  16.8  20.2   22.9  25.1  32.1

ggboxplot(data, x = "Gender", y = "BMI")

Case C-C - Two Categorical Variables

# https://www.statology.org/dplyr-crosstab/
df3 <- data %>% 
  group_by(Gender, Ran) %>% 
  tally() %>% 
  spread(Ran, n)
df3

# A tibble: 2 × 3
# Groups:   Gender [2]
  Gender    No   Yes
  <fct>  <int> <int>
1 Female    28    22
2 Male      35    23

# https://www.statology.org/dplyr-crosstab/
df3 <- data %>% 
  group_by(Gender, BMICat) %>% 
  tally() %>% 
  spread(BMICat, n)

df3

# A tibble: 2 × 5
# Groups:   Gender [2]
  Gender Normal Obese Overweight Underweight
  <fct>   <int> <int>      <int>       <int>
1 Female     29    NA          3          18
2 Male       33     2         14           9
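
The NA in the Obese column for females presumably means that no female respondent falls into that category: `group_by()` followed by `tally()` only returns the combinations that actually occur, so the missing cell is left empty after reshaping. As a hedged alternative, the sketch below uses base R `table()`, which counts every combination and reports 0 for empty cells, and `prop.table()` to turn the counts into row proportions. The `toy` data frame is invented for illustration and is not the dataset used above.

```r
# Toy data, invented for illustration; no female falls in the "Obese" category
toy <- data.frame(
  Gender = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  BMICat = factor(c("Normal", "Normal", "Underweight",
                    "Normal", "Obese", "Overweight"))
)

# table() counts every Gender x BMICat combination, including empty ones (count 0)
counts <- table(toy$Gender, toy$BMICat)
counts

# Row proportions: the distribution of BMICat within each gender
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

Row proportions are often easier to compare than raw counts when the groups differ in size, as the two genders do in the tables above.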

Case Q-Q - Two Quantitative Variables

data %>% 
  select(Height, Weight) %>% 
  cor()

          Height    Weight
Height 1.0000000 0.7413042
Weight 0.7413042 1.0000000
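
By default, `cor()` computes the Pearson correlation coefficient. To make that number less of a black box, the sketch below recomputes it by hand from the deviations of each variable around its mean and compares the result with `cor()`. The `height` and `weight` vectors are toy values invented for illustration, not the data used above.

```r
# Toy data, invented for illustration only
height <- c(150, 160, 165, 172, 180, 185)
weight <- c(48, 55, 63, 66, 79, 82)

# Deviations of each observation from its variable's mean
dx <- height - mean(height)
dy <- weight - mean(weight)

# Pearson correlation: sum of cross-products divided by the
# square root of the product of the sums of squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in computation (method = "pearson" is the default)
r_builtin <- cor(height, weight)

all.equal(r_manual, r_builtin)  # compare the two results up to numerical tolerance
```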

Scatterplots

Creating Scatterplots
Interpreting Scatterplots
Direction
Form
Strength

Interpreting Scatterplots

Figure 2: Figure Caption

Direction

Figure 3: Figure Caption

A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other.

A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other.

Not all relationships can be classified as either positive or negative.
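
The three statements above can be checked numerically with a small sketch. The data are invented for illustration, and the sign of the Pearson correlation is used here as an assumed stand-in for direction. For the symmetric curve the correlation is essentially zero even though the two variables are clearly related, which is why such a relationship is neither positive nor negative.

```r
# Toy data, invented for illustration only
x <- -5:5

y_pos   <- 2 * x + 1    # increases as x increases
y_neg   <- -3 * x + 4   # decreases as x increases
y_curve <- x^2          # decreases, then increases (U-shape)

sign(cor(x, y_pos))     # positive direction
sign(cor(x, y_neg))     # negative direction
cor(x, y_curve)         # essentially zero: no single direction
```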

Form

Figure 4: Figure Caption

The form of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms. Here are a couple that are quite common.

Relationships with a linear form are most simply described as points scattered about a line:

Figure 5: Figure Caption

A scatterplot in which the points are slightly above or below a line which has been drawn through the points. Overall, the points create a shape that appears to be a fat line. In this example, the points create a negative relationship.

Relationships with a non-linear (sometimes called curvilinear) form are most simply described as points dispersed around the same curved line:

Figure 6: Figure Caption

There are many other possible forms for the relationship between two quantitative variables, but linear and curvilinear forms are quite common and easy to identify. Another form-related pattern that we should be aware of is clusters in the data:

Strength

Figure 7: Figure Caption

The strength of the relationship is determined by how closely the data follow the form of the relationship. Let’s look, for example, at the following two scatterplots displaying positive, linear relationships:

We can see that in the left scatterplot the data points follow the linear pattern quite closely. This is an example of a strong relationship. In the right scatterplot, the points also follow the linear pattern, but much less closely, and therefore we can say that the relationship is weaker. In general, though, assessing the strength of a relationship just by looking at the scatterplot is quite problematic, and we need a numerical measure to help us with that. We will discuss that later in this section.
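
As a rough illustration of strong versus weak linear relationships, the sketch below simulates two data sets that share the same underlying line but differ in how much noise surrounds it, then compares their correlation coefficients. Everything here is an assumption made for illustration: the slope, the noise levels, the seed, and the use of the Pearson correlation as the numerical measure of strength.

```r
set.seed(1)  # arbitrary seed, used only to make the simulation reproducible

x <- seq(1, 50)

# Same underlying line (slope 2), different amounts of random scatter
y_strong <- 2 * x + rnorm(length(x), sd = 3)   # points close to the line
y_weak   <- 2 * x + rnorm(length(x), sd = 40)  # points far from the line

r_strong <- cor(x, y_strong)
r_weak   <- cor(x, y_weak)

# The closer the points follow the line, the closer the correlation is to 1
c(strong = r_strong, weak = r_weak)
```

Plotting `x` against `y_strong` and `y_weak` side by side would show the same contrast as the left and right scatterplots described above.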

Figure 8: Figure Caption

Data points that deviate from the pattern of the relationship are called outliers. We will see several examples of outliers during this section. Two outliers are illustrated in the scatterplot below:

Figure 9: Figure Caption

ggscatter(data, x = "Height", y = "Weight",
          shape = 21, size = 3,
          add = "reg.line",
          fill = "lightgray",
          color = "Gender")

State the Art of Microbial Genome Analysis

Thu, 28 Jul 2022 00:00:00 +0000

Perception of Students on Antibiotic Resistance and Prevention: An Online, Community-Based Case Study from Dhaka, Bangladesh

Wed, 27 Jul 2022 00:00:00 +0000

Interventions to Improve the Mental Health among Intimate Partner Violence Survivors in Low and Middle-income countries: A Systematic Review Protocol

Wed, 25 May 2022 00:00:00 +0000

Knowledge, Attitudes, and Practice regarding Thalassemia Among High School Students in Bangladesh

Wed, 25 May 2022 00:00:00 +0000

Knowledge, Attitudes, and Practices among University Students regarding the Concept of Safe Marriages for Thalassemia Prevention in Bangladesh

Wed, 25 May 2022 00:00:00 +0000

Quality of Life among Bangladeshi Patients with Thalassemia using the SF-36 Questionnaire

Thu, 19 Aug 2021 12:21:13 +0600

Knowledge and Attitudes of Thalassemia among Public University Students in Bangladesh

Thu, 19 Aug 2021 12:17:26 +0600

Perception and the Impact of Distance Learning on Students from the Science Faculty at Jagannath University, Dhaka during COVID-19: An Exploratory Study

Fri, 23 Apr 2021 23:24:21 +0600

Survey

Perception of Students on Antibiotic Resistance and Prevention: An Online, Community-Based Case Study from Dhaka, Bangladesh

Mon, 19 Apr 2021 00:31:46 +0600

Introduction

Antibiotics are either cytotoxic or cytostatic to micro-organisms, allowing the body’s natural defences, such as the immune system, to eliminate them. They often act by inhibiting the synthesis of the bacterial cell wall, proteins, deoxyribonucleic acid (DNA), or ribonucleic acid (RNA), by disorganizing the cell membrane, or through other specific actions. Antibiotics may also enter the cell wall of the bacteria by binding to it, using the energy-dependent transport mechanisms in ribosomal sites, which subsequently leads to the inhibition of protein synthesis.

Objectives

Improve awareness and understanding of antimicrobial resistance through effective communication, education and training.
Strengthen the knowledge and evidence base through surveillance and research.
Reduce the incidence of infection through effective sanitation, hygiene and infection prevention measures.
Optimize the use of antimicrobial medicines in human and animal health.
Develop the economic case for sustainable investment that takes account of the needs of all countries, and increase investment in new medicines, diagnostic tools, vaccines and other interventions.

Survey

Self Management And Knowledge of Diabetic Patients in Bangladesh and the Prevalence rate of Diabetes

Fri, 16 Apr 2021 06:41:30 +0600

Introduction

Diabetes mellitus is associated with significant morbidity and mortality in Bangladesh, where healthcare facilities and accessibility are inadequate. Diabetes is also linked to heart disease, stroke, and kidney failure. The key to achieving clinical goals in ambulatory treatment is assumed to be sufficient patient awareness of self-care. In Bangladesh, there are a few studies on the relationship between awareness and self-care practices among type 2 diabetes patients, but none are recent. Moreover, little research has been done on type 1 diabetes patients’ self-care habits and awareness. The foundation of diabetes mellitus (DM) management is diabetes education combined with adequate motivation of patients and caregivers.

Objectives

Since many people with diabetes are uncertain whether they have Type 1 or Type 2, the questionnaire included four questions to help determine the most likely type.
The aim of this study was to see if there was a connection between diabetes awareness and self-care practices among Type 1 and Type 2 diabetes patients.
Estimate the proportion of respondents with each diabetes type based on these questions.

Survey

A survey on the general concept of diabetes among students of different universities in Bangladesh.

Fri, 16 Apr 2021 06:40:02 +0600

Perception of Students on Antibiotic Resistance and Prevention: An Online, Community-Based Case Study from Dhaka, Bangladesh

Fri, 16 Apr 2021 06:38:12 +0600

Experiences, Side Effects, and Opinions Following COVID-19 Vaccination in Bangladesh: a cross-sectional community e-survey in Bangladesh

Thu, 28 Jan 2021 00:00:00 +0000

About Me

Mon, 01 Jan 0001 00:00:00 +0000

Consulting

Mon, 01 Jan 0001 00:00:00 +0000

Training

Mon, 01 Jan 0001 00:00:00 +0000