UNIT-1: Population, Sample and Data Condensation

Population

In statistical terms, population refers to the complete set of individuals or objects that have a certain characteristic or attribute of interest. When conducting research or making statistical inferences, it is often not possible or practical to study every member of the population. Instead, a sample of the population is selected and studied in order to make inferences about the larger group.

The population can be described using various measures such as mean, median, mode, variance, and standard deviation. These measures provide information about the central tendency, dispersion, and distribution of the population data.

It’s important to note that when working with a sample rather than the complete population, the estimates obtained from the sample may not perfectly reflect the characteristics of the population. This is due to sampling error, which can be reduced by increasing the sample size or using a more representative sample.
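
The effect of sample size on sampling error can be sketched with a small simulation. This is only an illustration under stated assumptions: the population (the integers 1 to 1000), the seed, and the helper name `sampling_error` are all hypothetical and not part of the original notes.

```python
import random
import statistics

def sampling_error(population, n, rng):
    """Absolute gap between one sample mean and the population mean."""
    sample = rng.sample(population, n)  # simple random sample, no replacement
    return abs(statistics.mean(sample) - statistics.mean(population))

# Hypothetical population: the integers 1..1000.
population = list(range(1, 1001))
rng = random.Random(0)  # fixed seed so the run is repeatable

for n in (10, 100, 1000):
    # Average the error over repeated samples of the same size.
    errors = [sampling_error(population, n, rng) for _ in range(100)]
    print(f"n={n:4d}  average error={statistics.mean(errors):.2f}")
```

When the sample is the whole population (n = 1000 here), the error is zero, which is the limiting case of increasing the sample size.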

In summary, a population is the complete group of individuals or objects of interest, and studying it helps to describe and understand the characteristics of that group.

Sample and Data Condensation

Sample and data condensation are two important concepts in statistics.

A sample is a subset of the population that is selected for study. Sampling is an important aspect of statistical analysis because it allows researchers to make inferences about a population based on a smaller, more manageable subset of data. There are various sampling methods, including random sampling, stratified sampling, and cluster sampling, each with its own strengths and weaknesses.
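
The three sampling methods named above can be sketched roughly as follows. The population of 100 students, the strata, the classroom clusters, and all function names are hypothetical choices made for illustration; proportional allocation is one common way to stratify, not the only one.

```python
import random

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)

def stratified_sample(strata, n, rng):
    """Draw from each stratum in proportion to its share of the population."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding may not sum to n for other sizes.
        k = round(n * len(members) / total)
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(42)
# Hypothetical population of 100 students split into two strata.
strata = {
    "undergraduate": [f"U{i}" for i in range(60)],
    "postgraduate": [f"P{i}" for i in range(40)],
}
everyone = strata["undergraduate"] + strata["postgraduate"]
# Hypothetical clusters: five classrooms of 20 students each.
clusters = {f"room{r}": everyone[r * 20:(r + 1) * 20] for r in range(5)}

print(simple_random_sample(everyone, 10, rng))
print(stratified_sample(strata, 10, rng))
print(cluster_sample(clusters, 2, rng))
```

With a 60/40 split and a sample of 10, the stratified draw always contains 6 undergraduates and 4 postgraduates, whereas the simple random draw only matches that split on average.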

Data condensation refers to the process of summarizing and simplifying a large amount of data into a more manageable form. This can be done using various methods such as frequency tables, histograms, and box plots, which can provide a visual representation of the data and help to identify patterns and trends.

Another method of data condensation is descriptive statistics, which involves calculating measures such as mean, median, mode, variance, and standard deviation to summarize the data. Descriptive statistics can provide important insights into the distribution of the data and help to identify outliers and other features of interest.
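
As a minimal sketch, these measures can be computed with Python's standard `statistics` module. The eight observations below are a hypothetical data set chosen so the results are easy to check by hand.

```python
import statistics

# Hypothetical data set of eight observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "variance": statistics.pvariance(data),  # population variance
    "std_dev": statistics.pstdev(data),      # population standard deviation
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

`pvariance` and `pstdev` treat the data as a complete population. When the data are a sample, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.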

In summary, sampling and data condensation are important concepts in statistics that help researchers work with smaller subsets of data and summarize large amounts of information in a more manageable form.

Definition and scope of statistics

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It is a tool used to make sense of data and draw meaningful conclusions and inferences about a population based on a sample of data.

The scope of statistics is broad and includes a wide range of applications in fields such as biology, social sciences, engineering, economics, and many others. Some of the key areas in which statistics is used include:

  1. Data collection and analysis: Statistics is used to design studies and experiments, collect data, and analyze it to draw conclusions and make inferences about a population.
  2. Descriptive statistics: This involves summarizing and describing data using measures such as mean, median, mode, and standard deviation, as well as visualizing data using techniques such as histograms and box plots.
  3. Inferential statistics: This involves using a sample of data to make inferences about a population. Inferential statistics involves the use of statistical models and hypothesis testing to determine the likelihood of a relationship between variables or to make predictions about future events.
  4. Probability: Statistics also involves the study of probability, which is used to model and understand random events and make predictions about the likelihood of certain outcomes.
  5. Survey design and analysis: Statistics is used to design and analyze surveys, which are used to gather information about a population, such as opinions or attitudes.

In summary, the scope of statistics is wide and involves the collection, analysis, interpretation, and presentation of data for a variety of purposes and applications.

Concept of population and sample with illustration

The concepts of population and sample are important in statistical analysis and research.

A population refers to the complete set of individuals or objects being studied. It is the entire group of interest and includes all relevant data. For example, the population of a country includes all individuals who live in that country.

A sample, on the other hand, is a subset of the population. It is a smaller group of individuals or objects selected from the population for study. Sampling is done because it is often not feasible or practical to study the entire population. For example, if a researcher wants to study the income level of a population, it may not be possible to gather data from every individual. In this case, a sample of the population can be selected and studied instead.

Here’s a simple illustration to help understand the concept:

Imagine a bag containing 100 marbles, 50 red and 50 blue. The bag represents the population. If we draw 10 marbles from the bag, those 10 marbles form the sample. The sample might contain 6 red marbles and 4 blue marbles. It is a smaller representation of the population, but not a perfect one, because the bag itself is split evenly.
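
The marble illustration can be tried out in a few lines of Python. The seed is an arbitrary assumption, so the exact split of colours in the sample will differ from run to run if it is changed.

```python
import random
from collections import Counter

# The population: a bag of 100 marbles, 50 red and 50 blue.
bag = ["red"] * 50 + ["blue"] * 50

rng = random.Random(7)        # seed chosen arbitrarily for repeatability
sample = rng.sample(bag, 10)  # draw 10 marbles without replacement
counts = Counter(sample)

print("population share of red:", bag.count("red") / len(bag))
print("sample counts:", dict(counts))
print("sample share of red:", counts["red"] / len(sample))
```

The population share of red is always 0.5, while the sample share varies from draw to draw, which is the point of the illustration.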

Raw data, attributes and variables

Raw data refers to unprocessed data that has been collected from various sources, such as surveys, experiments, or databases. Raw data is usually in its original form and hasn’t been manipulated or analyzed.

Attributes are characteristics or features of the data. For example, in a study of a population of individuals, the attributes might include age, gender, education level, and income.

Variables are attributes that can take on different values. For example, in a study of a population of individuals, the variable “age” can take on different values for each person in the population, such as 20, 25, 30, etc. In statistical analysis, variables are used to answer questions and make predictions.

In summary, raw data is the starting point for any analysis, while attributes and variables are used to describe and analyze the data.

What is Frequency Distribution?

Frequency distribution is one of the first methods used to organize raw data effectively. The raw data are examined systematically, arranged according to how often each value occurs, and then laid out as a frequency table.

A frequency distribution is the systematic representation of the different values of a variable along with their corresponding frequencies. The values are often grouped on the basis of class intervals.

A class interval is the range of values covered by each class into which the data are divided. The size (or width) of a class is the difference between its upper and lower limits. Grouped data of this kind are commonly displayed as a histogram or bar graph.

Types of Class Intervals

Class intervals fall into two categories: exclusive and inclusive. An example of each follows:

  1. Exclusive Class Interval

A class interval in which the upper limit of one class is the same as the lower limit of the next class is called an exclusive class interval. A value equal to the upper limit is counted in the next class. For example,

S. No  Marks  No. of students
1      0-20   8
2      20-40  7
3      40-60  3

  2. Inclusive Class Interval

A class interval in which both the lower and the upper limits belong to the class, so that the upper limit of one class differs from the lower limit of the next, is called an inclusive class interval. For example,

S. No  Marks  Number of students
1      1-20   7
2      21-40  9
3      41-60  8
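
The practical difference between the two schemes shows up at the class boundaries. The sketch below uses the class limits from the two example tables above; the function names are hypothetical.

```python
def exclusive_class(mark, classes):
    """Find the class with lower <= mark < upper (upper limit excluded)."""
    for lower, upper in classes:
        if lower <= mark < upper:
            return (lower, upper)
    return None

def inclusive_class(mark, classes):
    """Find the class with lower <= mark <= upper (both limits included)."""
    for lower, upper in classes:
        if lower <= mark <= upper:
            return (lower, upper)
    return None

exclusive = [(0, 20), (20, 40), (40, 60)]  # limits from the exclusive example
inclusive = [(1, 20), (21, 40), (41, 60)]  # limits from the inclusive example

print(exclusive_class(20, exclusive))  # a mark of 20 goes to the 20-40 class
print(inclusive_class(20, inclusive))  # a mark of 20 stays in the 1-20 class
```

Under the exclusive scheme as written here, a mark of exactly 60 falls outside the last class, so a further class (or a closed final class) would be needed to cover it.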


What is Discrete and Continuous Frequency Table Distribution?

Frequency distributions are further classified into two types, depending on whether class intervals are used: the discrete frequency table and the continuous frequency table. An example of each follows:

  1. Discrete Frequency Table

When the data are listed by individual values and no class intervals are given, the result is termed a discrete frequency distribution. For example,

S. No  Number of items  Number of packets
1      1                23
2      2                12
3      3                34
4      4                20
5      5                72
       Total            161

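
A discrete frequency table of this kind can be built from raw observations with `collections.Counter`. The raw data below are hypothetical and much smaller than the example table; they only illustrate the counting step.

```python
from collections import Counter

# Hypothetical raw data: number of items found in each inspected packet.
items_per_packet = [1, 3, 2, 5, 3, 1, 5, 5, 4, 3, 2, 5, 1, 4, 5]

frequency = Counter(items_per_packet)  # value -> how many times it occurs

print("Number of items  Number of packets")
for value in sorted(frequency):
    print(f"{value:>15}  {frequency[value]:>17}")
print(f"{'Total':>15}  {sum(frequency.values()):>17}")
```

The total of the frequency column always equals the number of raw observations, which is a quick consistency check on any frequency table.
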
  2. Continuous Frequency Table

When the data are arranged in class intervals, the result is called a continuous frequency distribution. For example,

S. No  Marks  Number of students
1      0-10   5
2      20-30  7
3      30-40  12
4      40-50  32
5      50-60  4
       Total  60
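
Grouping raw values into equal-width classes can be sketched as below. The marks are hypothetical, the classes are treated as exclusive (lower limit included, upper limit excluded), and `grouped_frequencies` is an illustrative helper name.

```python
def grouped_frequencies(values, start, width, n_classes):
    """Count values per class [lower, upper) using equal-width classes."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)  # which class the value falls in
        if 0 <= index < n_classes:         # ignore values outside the range
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical raw marks out of 60.
marks = [3, 12, 18, 25, 27, 31, 33, 38, 39, 40, 44, 47, 52, 59]

table = grouped_frequencies(marks, start=0, width=10, n_classes=6)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

A mark of 40 is counted in the 40-50 class, consistent with the exclusive convention.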


Types of Frequency Distribution Methods

There are two types of frequency distribution methods:

  1. Grouped frequency distribution.
  2. Ungrouped frequency distribution.

  1. Grouped Frequency Distribution

As the name suggests, a grouped frequency distribution arranges the data into groups. It is used when the variable is continuous, as with measurements such as age or salary taken during data collection. The entire data set is classified into class intervals. For example,

Family Income  Number of persons
Below 20,000   52
20,001-30,000  14
30,001-40,000  6
40,001-50,000  8

  2. Ungrouped Frequency Distribution

As the name suggests, an ungrouped frequency distribution does not use class intervals. It is applied to discrete data rather than continuous data. Examples include data on gender, marital status, and certain medical data. For example,

Variable        Number of persons
GENDER
Female          19
Male            22
MARITAL STATUS
Single          32
Married         4
Divorced        4

Other Types of Frequency Distribution

  1. Cumulative Frequency Distribution

A cumulative frequency distribution shows, for each score, the running total of frequencies up to and including that score. It is often presented together with a percentage frequency distribution, which gives the percentage of the sample whose scores fall in each group.

Percentages make it easier to compare the data with the findings of other studies that have different sample sizes. In this layout, frequencies, percentages, and their cumulative totals appear together in a single table. For example,

Score  Frequency  Percentage  Cumulative frequency  Cumulative percentage
1      4          8           4                     8
4      6          12          10                    20
5      8          16          18                    36
6      14         28          32                    64
7      8          16          40                    80
8      6          12          46                    92
9      4          8           50                    100
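
The percentage and cumulative columns can be derived from the frequencies alone. The sketch below uses the frequencies from the table above, taken in ascending order of score. The score label 6 for the row with frequency 14 is an assumption, inferred from that row's cumulative frequency of 32, which places it between scores 5 and 7.

```python
from itertools import accumulate

scores = [1, 4, 5, 6, 7, 8, 9]       # the label 6 is an assumption (see above)
frequencies = [4, 6, 8, 14, 8, 6, 4]

total = sum(frequencies)
percentages = [100 * f / total for f in frequencies]
cumulative_frequencies = list(accumulate(frequencies))  # running totals
cumulative_percentages = [100 * c / total for c in cumulative_frequencies]

rows = zip(scores, frequencies, percentages,
           cumulative_frequencies, cumulative_percentages)
for row in rows:
    print("{:>5} {:>9} {:>10.0f} {:>10} {:>10.0f}".format(*row))
```

The last cumulative frequency equals the sample size and the last cumulative percentage is 100, which gives a simple check on the whole table.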