
Understanding Distributions in Statistics

Contributed by: Venkat M
LinkedIn Profile: https://www.linkedin.com/in/venkat-murali-3753bab/

What is Distribution? 

The distribution of a statistical dataset is the spread of the data which shows all possible values or intervals of the data and how they occur.

A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest, and then they can be presented graphically.

A probability distribution provides a parameterized mathematical function that gives the probability of any individual observation from the sample space.

Before moving on to distributions, it is important to understand the term “data”, which is critical for any data analyst or data scientist.

To understand more about distribution in statistics, watch this complete video where Abhinand Sarkar will share some of his thoughts on distribution.

What is Data?

Data is a collection of information (numbers, words, measurements, observations) about facts, figures and statistics collected together for analysis.

Example: Distribution of Categorical Data (True/False, Yes/No): It shows the number or percentage of individuals in each group.

How to Visualize Categorical Data: Bar Plot, Pie Chart and Pareto Chart.

Distribution of Numerical Data (Height, Weight and Salary): The values are first sorted in ascending order and grouped based on similarity. They are then represented in graphs and charts to examine the amount of variance in the data.

How to Visualize Numerical Data: Histogram, Line Plot and Scatter Plot.
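The two kinds of distribution described above can be sketched with the Python standard library alone. The survey answers and heights below are made-up illustration data, and the 10 cm bin width is an arbitrary choice; a plotting library would normally draw the bar plot or histogram, which is approximated here with text.

```python
from collections import Counter

# Hypothetical survey answers (categorical data).
answers = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

# Number of individuals in each group.
counts = Counter(answers)

# Percentage of individuals in each group.
percentages = {group: 100 * n / len(answers) for group, n in counts.items()}

# Hypothetical heights in cm (numerical data): sort, then group into 10 cm bins.
heights = sorted([152, 171, 165, 158, 180, 167, 174, 169, 161, 177])
bins = Counter((h // 10) * 10 for h in heights)

print(counts)
print(percentages)

# A crude text histogram: one '#' per observation in each bin.
for start in sorted(bins):
    print(f"{start}-{start + 9} cm: {'#' * bins[start]}")
```

The counts and percentages are what a bar plot or pie chart would display, and the binned heights are what a histogram would display.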

Measurement level of Data

Qualitative:

  • Nominal – Brand name, Zip code and Gender
  • Ordinal – Grades, Star Reviews

Quantitative:

  • Ordinal – Position in Race and Date
  • Interval – Temperature in Celsius, Year of Birth
  • Ratio – Height, Age, Weight

What does Data do? In what ways does it matter most?

  1. Identifies the relationship between two variables
  2. Supports prediction and forecasting based on previous trends in the data
  3. Reveals patterns that exist in the dataset
  4. Detects fraud and anomalies

Why are distributions important?

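Sampling distributions are important in statistics because we usually cannot observe the whole population. Instead, we collect a sample and use it to estimate the parameters of the population distribution. Hence distributions are necessary to make inferences about the overall population.

For example, the most common measures of variability are the standard deviation, which describes how observations within a sample differ from each other, and the standard error of the mean, which describes how sample means differ from one sample to another.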
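As a small illustration of these two measures, the sketch below computes the sample standard deviation and the standard error of the mean for a made-up sample. It assumes the usual estimate of the standard error, the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

# Hypothetical sample of eight measurements.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.2]

sample_mean = statistics.mean(sample)

# Sample standard deviation (n - 1 in the denominator).
sample_sd = statistics.stdev(sample)

# Standard error of the mean: how much the sample mean itself is
# expected to vary from one sample of this size to another.
standard_error = sample_sd / math.sqrt(len(sample))

print(f"mean = {sample_mean:.2f}")
print(f"standard deviation = {sample_sd:.3f}")
print(f"standard error of the mean = {standard_error:.3f}")
```

The standard error is always smaller than the standard deviation for samples of more than one observation, and it shrinks as the sample grows.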

Difference between Frequency and Probability Distribution

Frequency Distribution: It records how often an event occurs, that is, the number of times each numerical value is observed. It is based on actual observations.

Probability Distribution: It records how likely an event is to occur, as a list of the probabilities associated with each possible numerical value. It is based on theoretical assumptions about what should happen.
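The contrast can be made concrete with a die. In the sketch below, the frequency distribution comes from 600 simulated rolls (the seed and the number of rolls are arbitrary choices), while the probability distribution is the theoretical one for an assumed fair die.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the simulated rolls are reproducible

# Frequency distribution: based on actual (here, simulated) observations.
rolls = [random.randint(1, 6) for _ in range(600)]
frequency = Counter(rolls)

# Probability distribution: based on the theoretical assumption of a fair die.
probability = {face: 1 / 6 for face in range(1, 7)}

for face in range(1, 7):
    observed_share = frequency[face] / len(rolls)
    print(f"face {face}: observed {frequency[face]:3d} times "
          f"({observed_share:.3f}), theoretical {probability[face]:.3f}")
```

With more rolls, the observed shares tend to move closer to the theoretical probabilities.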

Types of Distributions

  • Bernoulli Distribution
  • Uniform Distribution
  • Binomial Distribution
  • Normal Distribution
  • Poisson Distribution
  • Gamma Distribution
  • Exponential Distribution

Python Libraries for Distributions
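The original listing for this section is not reproduced here. Third-party packages such as NumPy and SciPy (its scipy.stats module) are commonly used for this kind of work; the sketch below instead assumes only the Python standard library, which is enough for the small examples in this article.

```python
import math        # comb(), exp(), factorial() for probability formulas
import random      # random draws: uniform, gauss, expovariate, gammavariate
import statistics  # mean, stdev and the NormalDist class

random.seed(42)  # arbitrary seed, for reproducible draws

# One random draw from a normal distribution with mean 0 and SD 1.
draw = random.gauss(0, 1)

# A normal distribution object with cdf(), pdf() and inv_cdf() methods.
standard_normal = statistics.NormalDist(mu=0, sigma=1)

print(draw)
print(standard_normal.cdf(0))     # probability of a value below the mean
print(math.comb(10, 3))           # number of ways to choose 3 items from 10
```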

Bernoulli Distribution

The Bernoulli distribution is a special case of the binomial distribution with a single trial. It is a discrete probability distribution with exactly two possible outcomes: 1 (Success) and 0 (Failure).

Example: In cricket, tossing a coin leads to either winning or losing the toss. There is no intermediate result. The occurrence of a head denotes success, and the occurrence of a tail denotes failure.

As a numerical example, suppose the probability of success (1) is p = 0.4. The probability of failure (0) is then 1 – p = 0.6.

Bernoulli Distribution in Python
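A minimal sketch of a Bernoulli trial, using the success probability of 0.4 from the example above. The helper name bernoulli_trial, the seed and the number of simulated trials are illustrative choices, not taken from the original listing.

```python
import random

def bernoulli_trial(p, rng=random):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0

p = 0.4  # probability of success; failure has probability 1 - p = 0.6

random.seed(1)  # arbitrary seed, for reproducibility
trials = [bernoulli_trial(p) for _ in range(10_000)]

# Share of successes observed in the simulation.
success_rate = sum(trials) / len(trials)

# Theoretical mean and variance of a Bernoulli(p) variable.
mean = p
variance = p * (1 - p)

print(f"observed success rate: {success_rate:.3f}")
print(f"theoretical mean: {mean}, variance: {variance:.2f}")
```

The observed success rate should land close to 0.4, and the theoretical variance is p(1 – p) = 0.24.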

Normal Distribution

It is otherwise known as the Gaussian distribution. It is a continuous probability distribution that is symmetric about the mean, and the majority of the observations cluster around the central peak.

It is a bell-shaped curve.

Examples: Performance appraisal ratings, height, blood pressure (BP), measurement error and IQ scores approximately follow a normal distribution.

Mean = Median = Mode

The standard normal distribution is a normal distribution with µ = 0 and σ = 1.

Basic Properties:

  • The normal distribution extends from –∞ to +∞
  • Zero skewness: the distribution is symmetrical about the mean
  • Zero excess kurtosis (a kurtosis of 3)
  • About 68% of the values are within 1 SD of the mean
  • About 95% of the values are within 2 SD of the mean
  • About 99.7% of the values are within 3 SD of the mean

Normal Distribution in Python
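The 68%, 95% and 99.7% figures listed in the basic properties can be checked with the standard library's statistics.NormalDist class. This is a sketch using the standard normal distribution; the helper name share_within is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k, dist=standard_normal):
    """Probability that a value lies within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
# within 1 SD of the mean: 0.6827
# within 2 SD of the mean: 0.9545
# within 3 SD of the mean: 0.9973
```

Because these shares depend only on the number of standard deviations, the same result holds for any choice of mean and standard deviation.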

Binomial Distribution

The binomial distribution is the most widely known discrete probability distribution. It has been used for hundreds of years.

Assumptions:

  1. The experiment involves n identical trials.
  2. Each trial has only two possible outcomes – success or failure.
  3. Each trial is independent of the previous trials.
  4. The terms p and q remain constant throughout the experiment, where p is the probability of getting a success on any one trial and q = (1 – p) is the probability of getting a failure on any one trial.

Binomial Distribution in Python
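Under the assumptions listed above, the probability of exactly k successes in n trials is C(n, k) p^k q^(n – k). The sketch below evaluates this formula with the standard library; the values n = 10 and p = 0.4 are illustrative, not taken from the original listing.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p  # probability of failure on any one trial
    return comb(n, k) * p**k * q**(n - k)

n, p = 10, 0.4  # hypothetical experiment: 10 trials, success probability 0.4

# Probability of each possible number of successes, from 0 to n.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The mean of the distribution, which equals n * p.
mean = sum(k * prob for k, prob in enumerate(pmf))

for k, prob in enumerate(pmf):
    print(f"P(X = {k:2d}) = {prob:.4f}")
print(f"mean = {mean:.2f}")
```

With n = 1 the same function reduces to the Bernoulli distribution, which is why the Bernoulli distribution is called a special case of the binomial.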

Poisson Distribution

It is the discrete probability distribution of the number of times an event is likely to occur within a specified period of time. It is used for independent events which occur at a constant rate within a given interval of time.

The occurrences in each interval can range from zero to infinity (0 to ∞).

Examples:

  1. The number of black cars in a random sample of 50 cars
  2. The number of cars arriving at a car wash during a 20-minute time interval
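For the car wash example, the Poisson probability of exactly k arrivals in an interval is λ^k e^(–λ) / k!, where λ is the average number of arrivals per interval. The sketch below assumes an average of 3 cars per 20 minutes, which is a made-up rate for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count per interval is lam."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3  # hypothetical average: 3 cars arrive per 20-minute interval

# Probability that no car arrives in a given interval.
p_zero = poisson_pmf(0, lam)

# Probability that at most two cars arrive in a given interval.
p_at_most_two = sum(poisson_pmf(k, lam) for k in range(3))

print(f"P(no cars) = {p_zero:.4f}")
print(f"P(at most 2 cars) = {p_at_most_two:.4f}")
```

Both the mean and the variance of a Poisson distribution are equal to λ.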

Uniform Distribution

The continuous uniform distribution is also called the rectangular distribution. It describes an experiment where the outcome lies between certain boundaries and every value within those boundaries is equally likely.

Examples:

  1. The time to fly from Newark to Atlanta ranges from 120 to 150 minutes. If we monitor the flight time of many commercial flights, it will follow a more or less uniform distribution.
  2. The time taken for students to finish a one-hour test may range from 50 to 60 minutes. If roughly equal numbers of students finish in each equal-width interval within this range, the finishing time of the test can be approximated by a uniform distribution.
  3. The time for pizza delivery from Nanganallur to Alandur may range uniformly from 20 to 30 minutes, measured from the time the delivery person leaves the Pizza Hut.

Uniform Distribution in Python
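A minimal sketch based on the flight time example above, assuming flight times are uniform between 120 and 150 minutes. The helper names, the seed and the 130-minute threshold are illustrative choices.

```python
import random

low, high = 120, 150  # flight time boundaries in minutes

def uniform_pdf(x, a, b):
    """Density of the uniform distribution on [a, b]: constant inside, zero outside."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """Probability that a uniform value on [a, b] is at most x."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

random.seed(7)  # arbitrary seed, for reproducibility
samples = [random.uniform(low, high) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)

# Probability that a flight takes at most 130 minutes.
p_under_130 = uniform_cdf(130, low, high)

print(f"density inside the range: {uniform_pdf(135, low, high):.4f}")
print(f"P(time <= 130) = {p_under_130:.4f}")
print(f"simulated mean = {sample_mean:.1f}, theoretical mean = {(low + high) / 2}")
```

The constant density is what gives the distribution its rectangular shape.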

Gamma Distribution

It deals with continuous, positive variables that take on a wide range of values, such as individual call times. Using a gamma distribution function, we can model probabilities across any range of possible values. It has two parameters: the first is the shape parameter (α) and the second is the scale parameter (β).

Examples:

  • The amount of rainfall accumulated in a reservoir.
  • The size of loan defaults and aggregate insurance claims
  • The flow of items through manufacturing and distribution processes
  • The load on web servers

Gamma Distribution in Python
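The sketch below draws samples with the standard library's random.gammavariate(alpha, beta), which takes the shape and scale parameters in that order. The values α = 2 and β = 3, the seed and the sample size are illustrative. With shape α and scale β, the theoretical mean is αβ and the variance is αβ².

```python
import random
import statistics

alpha, beta = 2.0, 3.0  # hypothetical shape and scale parameters

random.seed(3)  # arbitrary seed, for reproducibility
samples = [random.gammavariate(alpha, beta) for _ in range(20_000)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.variance(samples)

print(f"simulated mean = {sample_mean:.2f}, theoretical mean = {alpha * beta}")
print(f"simulated variance = {sample_var:.2f}, theoretical variance = {alpha * beta**2}")
```

Some sources define the second parameter as a rate (the inverse of the scale), so it is worth checking which convention a given library uses.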

Exponential Distribution

It is concerned with the amount of time until some specific event occurs.

Examples:

  • The amount of time until an earthquake occurs
  • The duration of business telephone calls
  • The length of time a car battery lasts
  • The amount of money customers spend on one trip to the supermarket. There are more people who spend small amounts of money and fewer people who spend large amounts of money.

The exponential distribution is widely used in the field of reliability. 

Note: Reliability deals with the amount of time a product lasts.

Exponential Distribution in Python
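A reliability-style sketch using the car battery example: assuming battery lifetimes are exponential with a mean of 4 years (a made-up figure), the probability that a battery lasts longer than t years is e^(–t / mean). The simulation uses random.expovariate, whose argument is the rate, that is, 1 divided by the mean.

```python
import math
import random

mean_lifetime = 4.0       # hypothetical mean battery life in years
rate = 1 / mean_lifetime  # rate parameter (lambda) of the distribution

def survival(t, rate):
    """Probability that the time until the event exceeds t."""
    return math.exp(-rate * t)

random.seed(11)  # arbitrary seed, for reproducibility
lifetimes = [random.expovariate(rate) for _ in range(20_000)]

# Share of simulated batteries that last longer than 5 years.
simulated_share = sum(t > 5 for t in lifetimes) / len(lifetimes)

print(f"theoretical P(lifetime > 5) = {survival(5, rate):.4f}")
print(f"simulated   P(lifetime > 5) = {simulated_share:.4f}")
```

Short lifetimes are the most common and long ones are increasingly rare, which matches the supermarket spending example above.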
