Sampling and Estimation

Quantitative Methods

A sample is a part of a larger body specially selected to represent the whole. Sampling is the process by which this part is chosen; that is, taking a portion of a population or universe as representative of that population or universe. For a sample to be useful, it should reflect the similarities and differences found in the total group. The main objective of drawing a sample is to make inferences about the larger population from the smaller sample.

A poll is a type of sample survey dealing mainly with public opinion or elections: people's attitudes about candidates for political office or about public issues. Polls are conducted by large polling organizations such as the Roper poll, the Harris poll, the American Institute of Public Opinion, and the National Opinion Research Center.

A census is a survey in which information is gathered from or about all members of a population.

Simple Random Sampling

Simple random sampling refers to any sampling method that has the following properties.

•The population consists of N objects.

•The sample consists of n objects.

•All possible samples of n objects are equally likely to occur.

An important benefit of simple random sampling is that it allows researchers to use statistical methods to analyze sample results. For example, given a simple random sample, researchers can use statistical methods to define a confidence interval around a sample mean. These inferential methods are generally not valid when non-random sampling methods are used, because the chance of selecting each member is then unknown.

There are many ways to obtain a simple random sample. One way would be the lottery method. Each of the N population members is assigned a unique number. The numbers are placed in a bowl and thoroughly mixed. Then, a blind-folded researcher selects n numbers. Population members having the selected numbers are included in the sample.
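The lottery method can be sketched in a few lines of Python. This is only an illustration: the population of 50 numbered members, the sample size of 5, and the function name lottery_sample are assumptions made for the example, not part of the notes.

```python
import random

def lottery_sample(population, n, seed=None):
    """Draw a simple random sample of n members without replacement."""
    rng = random.Random(seed)  # seeded generator so a draw can be reproduced
    # random.sample picks n distinct items; every subset of size n is equally likely
    return rng.sample(population, n)

# Hypothetical population: members numbered 1..N, as in the lottery method
N = 50
population = list(range(1, N + 1))
sample = lottery_sample(population, n=5, seed=42)
print(sample)
```

Drawing without replacement matters here: a member whose number has been pulled from the bowl cannot be pulled again.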

Sampling Error

Sampling process error occurs because researchers draw different subjects from the same population, and those subjects have individual differences. A sample is only a subset of the entire population; therefore, there may be a difference between the sample and the population.

The most frequent cause of this error is a biased sampling procedure. Every researcher must seek to establish a sample that is free from bias and is representative of the entire population.

Another possible cause of this error is chance. Randomization and probability sampling are used to minimize sampling process error, but it is still possible that the randomly selected subjects are not representative of the population.

The most common result of sampling error is systematic error, wherein the results from the sample differ significantly from the results from the entire population. It follows logically that if the sample is not representative of the entire population, its results will most likely differ from the results taken from the entire population.

Sample Size and Sampling Error

Given two otherwise identical studies (same sampling method, same population), the study with the larger sample size will have less sampling process error than the study with the smaller sample size. As the sample size increases, it approaches the size of the entire population; the sample therefore comes closer to reflecting all the characteristics of the population, which decreases sampling process error.

Standard Deviation and Sampling Error

Standard deviation is used to express the variability of the population. More technically, it is the square root of the average squared difference between the subjects' actual scores and the mean of all the scores. Therefore, if the sample has a high standard deviation, it follows that the sample also has high sampling process error. As the sample size increases, it is the variability of the sample mean that decreases; the standard deviation of the individual scores does not shrink but settles near the population value. Imagine having only 10 subjects: with so small a sample, the sample mean tends to vary greatly from one sample to the next. Then imagine increasing the sample size to 100: the sample means tend to cluster around the population mean, so sampling error is lower.
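The effect of sample size on sampling error can be illustrated with a small simulation. The sketch below assumes a hypothetical population of 10,000 normally distributed scores (mean 100, standard deviation 15) and compares how much the sample mean varies across repeated samples of 10 and of 100 subjects; all numbers and names are illustrative.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats, rng):
    """Standard deviation of the means of `repeats` random samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

rng = random.Random(0)
# Hypothetical population of 10,000 scores, mean 100 and standard deviation 15
population = [rng.gauss(100, 15) for _ in range(10_000)]

spread_small = spread_of_sample_means(population, 10, 500, rng)
spread_large = spread_of_sample_means(population, 100, 500, rng)
print(round(spread_small, 2), round(spread_large, 2))
```

Under these assumptions the means of the 100-subject samples should cluster far more tightly than the means of the 10-subject samples.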

Sampling Distribution

A sampling distribution is analogous to a population distribution: it describes the range of all possible values that the sampling statistic can take. In assessing the quality of a sample, the usual approach is to compare the distribution of the sample with the population distribution. We expect the sample to show a pattern similar to the population distribution; that is, if a population is normally distributed, the sample should also be approximately normally distributed. If the sample is skewed when we were expecting a normal pattern with most of the observations centered around the mean, this indicates potential problems with the sample and/or the methodology.

Stratified Sampling

Stratified sampling is a probability sampling technique wherein the researcher divides the entire population into different subgroups or strata, then randomly selects the final subjects proportionally from the different strata.

It is important to note that the strata must be non-overlapping. Overlapping subgroups would give some individuals a higher chance of being selected as subjects. This completely negates the concept of stratified sampling as a type of probability sampling.

Equally important is the fact that the researcher must use simple random sampling within the different strata. The most common strata used in stratified random sampling are age, gender, socioeconomic status, religion, nationality and educational attainment.
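The procedure described above (split the population into non-overlapping strata, then draw a simple random sample of the same proportion from each) might be sketched as follows. The three region strata, their sizes, the 10% sampling fraction, and the function name stratified_sample are hypothetical choices for the example.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Proportional stratified sample: the same fraction is drawn at random from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # each unit lands in exactly one stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))       # simple random sampling within the stratum
    return sample

# Hypothetical population: 600 "urban", 300 "suburban", 100 "rural" units
units = ([("urban", i) for i in range(600)]
         + [("suburban", i) for i in range(300)]
         + [("rural", i) for i in range(100)])
sample = stratified_sample(units, stratum_of=lambda u: u[0], fraction=0.1, seed=1)
counts = {name: sum(1 for u in sample if u[0] == name)
          for name in ("urban", "suburban", "rural")}
print(counts)  # {'urban': 60, 'suburban': 30, 'rural': 10}
```

Because every unit is assigned to exactly one stratum, no individual gets more than one chance of selection, which is the non-overlap requirement stated above.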

Uses of Stratified Random Sampling

•It is used when the researcher wants to highlight a specific subgroup within the population. This technique is useful in such research because it ensures the presence of the key subgroup within the sample.

•Researchers also use stratified random sampling when they want to know about existing relationships between two or more subgroups. With a simple random sampling technique, the researcher cannot be sure whether the subgroups to be observed are represented equally or proportionately within the sample.

•With stratified sampling, the researcher can representatively sample even the smallest and most inaccessible subgroups in the population. This allows the researcher to sample the rare extremes of the given population.

Time Series Data

A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones Industrial Average and the annual flow volume of the Nile River at Aswan. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, and communications engineering.

Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. Regression analysis is often used to test theories that the current values of one or more independent time series affect the current value of another time series, but this kind of analysis is not usually called "time series analysis", a term that refers to comparing values of a single time series at different points in time.

Time series data have a natural temporal ordering. This makes time series analysis distinct from other common data analysis problems, in which there is no natural ordering of the observations (e.g. explaining people's wages by reference to their respective education levels, where the individuals' data could be entered in any order). Time series analysis is also distinct from spatial data analysis, where the observations typically relate to geographical locations (e.g. accounting for house prices by the location as well as the intrinsic characteristics of the houses). A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations further apart. In addition, time series models will often make use of the natural one-way ordering of time, so that values for a given period will be expressed as deriving in some way from past values rather than from future values.
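The idea that nearby observations are more closely related than distant ones can be checked with the sample autocorrelation. The sketch below is an assumption-laden illustration: it generates a hypothetical series in which each value is 0.8 times the previous value plus random noise (so values derive from past values), then compares the correlation at a lag of 1 period with the correlation at a lag of 10 periods.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

rng = random.Random(0)
# Hypothetical series: each value is 0.8 times the previous value plus noise
series = [0.0]
for _ in range(1999):
    series.append(0.8 * series[-1] + rng.gauss(0, 1))

r1 = autocorrelation(series, 1)    # observations one period apart
r10 = autocorrelation(series, 10)  # observations ten periods apart
print(round(r1, 2), round(r10, 2))
```

For a series built this way, the lag-1 correlation should come out well above the lag-10 correlation.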

Cross-sectional Data

Cross-sectional data is information that is gathered at one point in time to reflect social conditions. Unlike longitudinal data, which is information gathered over several periods of time, cross-sectional data is simply a snapshot of information at one particular point in time. This type of data is limited in that it cannot describe changes over time or cause-and-effect relationships in which one variable affects another over time.

The Central Limit Theorem

The Central Limit Theorem describes the characteristics of the "population of the means", which is created from the means of an infinite number of random samples of size N, all of them drawn from a given "parent population". The Central Limit Theorem predicts that, regardless of the distribution of the parent population (provided it has a finite variance):

•The mean of the population of means is always equal to the mean of the parent population from which the samples were drawn.

•The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N).

•Most remarkably, the distribution of means will increasingly approximate a normal distribution as the size N of the samples increases.

A consequence of the Central Limit Theorem is that if we average measurements of a particular quantity, the distribution of our average tends toward a normal one. In addition, if a measured variable is actually a combination of several other uncorrelated variables, all of them "contaminated" with a random error of any distribution, our measurements tend to be contaminated with a random error that is normally distributed as the number of these variables increases.
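The first two predictions of the theorem can be checked numerically. The sketch below is a rough illustration under stated assumptions: the parent population is taken to be exponential with mean 1 and standard deviation 1 (a strongly skewed distribution), the sample size is 30, and 5,000 samples stand in for the "infinite number" of samples in the definition.

```python
import random
import statistics

rng = random.Random(0)
n = 30          # size of each sample (N in the text above)
repeats = 5000  # number of samples, a finite stand-in for "infinite"

# Parent population: exponential with mean 1 and standard deviation 1 (strongly skewed)
means = [statistics.mean([rng.expovariate(1.0) for _ in range(n)])
         for _ in range(repeats)]

mean_of_means = statistics.mean(means)  # should be close to the parent mean, 1
sd_of_means = statistics.stdev(means)   # should be close to 1 / sqrt(n)
predicted_sd = 1.0 / n ** 0.5
print(round(mean_of_means, 3), round(sd_of_means, 3), round(predicted_sd, 3))
```

A histogram of means would also show the third prediction: a roughly bell-shaped distribution even though the parent population is skewed.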

Standard Error

The standard error is the standard deviation of the sampling distribution of a statistic. The term may also be used to refer to an estimate of that standard deviation, derived from a particular sample used to compute the estimate.

For example, the sample mean is the usual estimator of a population mean. However, different samples drawn from that same population would in general have different values of the sample mean, so there is a distribution of sampled means (with its own mean and variance). The standard error of the mean (i.e., of using the sample mean as a method of estimating the population mean) is the standard deviation of those sample means over all possible samples (of a given size) drawn from the population. Secondly, the standard error of the mean can refer to an estimate of that standard deviation, computed from the sample of data being analyzed at the time.
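The second meaning, an estimate computed from the sample at hand, is usually obtained by dividing the sample standard deviation by the square root of the sample size. A minimal sketch, using eight made-up measurements and a hypothetical helper name:

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(sample))

# Hypothetical measurements: mean 13, sample standard deviation 2
sample = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 13.0]
se = standard_error_of_mean(sample)
print(round(se, 4))  # 0.7071
```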

In regression analysis, the term "standard error" is also used in the phrase standard error of the regression to mean the ordinary least squares estimate of the standard deviation of the underlying errors.

Properties of an Estimator

The three desirable properties of an estimator are that it is unbiased, efficient and consistent:

Unbiasedness - The expected value (mean) of the estimate's sampling distribution is equal to the underlying population parameter; that is, there is no upward or downward bias.

Efficiency - While there are many unbiased estimators of the same parameter, the most efficient has a sampling distribution with the smallest variance.

Consistency - Larger sample sizes tend to produce more accurate estimates; that is, the sample estimate converges on the population parameter.
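Unbiasedness can be illustrated by comparing two estimators of a population variance: one that divides by n and one that divides by n - 1. The simulation below is a sketch under assumed values (a normal population with standard deviation 2, samples of size 5, 10,000 repetitions); averaged over many samples, the n - 1 version should land near the true variance of 4, while the n version should fall short by a factor of (n - 1)/n.

```python
import random
import statistics

rng = random.Random(0)
n, repeats = 5, 10000
true_variance = 4.0  # assumed population: normal with standard deviation 2

biased, unbiased = [], []
for _ in range(repeats):
    sample = [rng.gauss(0, 2) for _ in range(n)]
    biased.append(statistics.pvariance(sample))   # divides by n
    unbiased.append(statistics.variance(sample))  # divides by n - 1

avg_biased = statistics.mean(biased)      # expected near 4 * (n - 1) / n = 3.2
avg_unbiased = statistics.mean(unbiased)  # expected near 4.0
print(round(avg_biased, 2), round(avg_unbiased, 2))
```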

Degrees of Freedom

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In mechanics, the degrees of freedom are the number of independent ways in which a dynamic system can move without violating any constraint imposed on it; in other words, the minimum number of independent coordinates that completely specify the position of the system.

Estimates of statistical parameters can be based upon different amounts of information or data. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom. In general, the degrees of freedom of an estimate of a parameter equal the number of independent scores that go into the estimate minus the number of parameters used as intermediate steps in the estimation of the parameter itself. For example, the sample variance has N-1 degrees of freedom, since it is computed from N random scores minus the 1 parameter estimated as an intermediate step, which is the sample mean.
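The sample variance example can be made concrete. In the sketch below (four made-up scores), the deviations from the sample mean must sum to zero, so once N - 1 of them are known the last one is fixed; that is why only N - 1 values are free to vary.

```python
import statistics

scores = [4.0, 7.0, 9.0, 12.0]           # hypothetical sample, N = 4
mean = statistics.mean(scores)           # 8.0, the one parameter estimated first
deviations = [x - mean for x in scores]  # [-4.0, -1.0, 1.0, 4.0]

# The deviations must sum to zero, so the last one is fixed by the other N - 1
last_from_others = -sum(deviations[:-1])

dof = len(scores) - 1                    # N - 1 degrees of freedom
sample_variance = sum(d ** 2 for d in deviations) / dof
print(last_from_others, dof, round(sample_variance, 3))  # 4.0 3 11.333
```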

Student's t Distribution

Student's t-distribution (or simply the t-distribution) is a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown. Whereas a normal distribution describes a full population, t-distributions describe samples drawn from a full population; accordingly, the t-distribution for each sample size is different, and the larger the sample, the more the distribution resembles a normal distribution.
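The dependence on sample size can be seen by simulation. The sketch below assumes a standard normal population and computes the statistic t = (sample mean - population mean) / (s / sqrt(n)) for many samples; the cutoff of 1.96 is the value that a normal distribution exceeds in absolute value about 5% of the time. With samples of 5 the statistic should exceed the cutoff noticeably more often than 5%, while with samples of 100 it should come close to the normal figure. Function names and repetition counts are illustrative.

```python
import math
import random

def t_statistic(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu) / (s / math.sqrt(n))

def share_beyond(n, cutoff, repeats, rng):
    """Fraction of simulated t statistics with absolute value above the cutoff."""
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # population standard deviation treated as unknown
        if abs(t_statistic(sample, 0.0)) > cutoff:
            hits += 1
    return hits / repeats

rng = random.Random(0)
small = share_beyond(n=5, cutoff=1.96, repeats=5000, rng=rng)
large = share_beyond(n=100, cutoff=1.96, repeats=5000, rng=rng)
print(small, large)
```

The heavier tails at small sample sizes are the reason t-based confidence intervals are wider than normal-based ones when the sample is small.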

The t-distribution plays a role in a number of widely used statistical analyses, including the Student's t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis. The Student's t-distribution also arises in the Bayesian analysis of data from a normal family.