Statistical aspects

The design and implementation of quantitative surveys raise specific statistical problems.

First, for all sample surveys (i.e., when only part of the population is surveyed), a sampling plan must be defined; that is, a method for drawing the sample that will be as statistically representative as possible while complying with logistical and budget constraints.

Second, once the survey has been conducted, the question of estimation arises, i.e., how to produce values that will apply to the population at large on the basis of the selected sample (extrapolation). Extrapolation often involves weighting, a statistical procedure that takes into account not only the sampling plan but also non-response and external data sources that can be used to improve estimates. The general term used to describe this procedure is data adjustment. Lastly, sample quality can be assessed through indicators such as bias and precision.

An entire branch of survey statistics is devoted to sampling and estimation questions.

Sampling

Before a quantitative survey can be conducted, the sample selection method and the sample size must be determined.

A sample or the total target population?

Statistically speaking, the ideal way of proceeding is to survey the total target population. However, this is not usually possible, for reasons of cost, logistics, risk of an adverse impact on data quality, or simple feasibility.

If it is not possible to survey the total target population, then that population must be sampled. The questions to be asked at this stage concern the size of the sample to be surveyed and how the sample is to be drawn. The underlying question of the sampling frame must also be addressed.

Sample size

Sample size is the main determinant of statistical precision (or statistical power) and has to be set to meet that requirement. Other constraints that make it necessary to limit the number of individuals in the sample (cost, logistics, etc.) must also be taken into account.

Furthermore, it is important to anticipate non-response and out-of-scope responses by increasing the sample size to take account of potential “losses” due to those causes.
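The reasoning above can be sketched numerically. The example below is only an illustration under stated assumptions: it uses the textbook formula for estimating a proportion under simple random sampling, and the function names, the 3-point margin of error, the 60% response rate and the 90% eligibility rate are hypothetical values chosen for the example, not figures from any actual survey.

```python
import math


def required_sample_size(margin, p=0.5, z=1.96, population=None):
    """Completed interviews needed to estimate a proportion p within
    +/- margin at the confidence level implied by z, assuming simple
    random sampling (an assumption of this sketch)."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: fewer interviews are needed
        # when the sample is a sizeable share of the population.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


def inflate_for_losses(n_completed, response_rate, eligibility_rate=1.0):
    """Number of units to draw so that the expected number of completed,
    in-scope interviews reaches n_completed."""
    return math.ceil(n_completed / (response_rate * eligibility_rate))


# Hypothetical planning values.
n_needed = required_sample_size(margin=0.03)
n_to_draw = inflate_for_losses(n_needed, response_rate=0.60, eligibility_rate=0.90)
print(n_needed, n_to_draw)
```

With these illustrative inputs, nearly twice as many units must be drawn as interviews are needed, which shows why expected losses should be built into the sample size from the outset.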

The sampling frame

The sampling frame is the complete listing of individuals belonging to the target population. A good sampling frame has the following characteristics:

• it is exhaustive and no individual is counted twice;
• it is up-to-date;
• it includes a name and variables that can be used to identify and contact eligible individuals (address, telephone number, etc.);
• it includes “auxiliary variables” that may prove useful in sampling, response collection and data adjustment.

For major surveys of the French population, a partnership with INSEE is often required to obtain access to a sampling frame. The telephone directory is also a potential sampling frame, though an incomplete one.

Sampling methods (or sampling plan)

Using the auxiliary information available in the sampling frame, the aim is to draw a sample that will produce the most accurate estimates while complying with existing constraints (size of the total sample and of the specific group(s) of interest, etc.).

Several sampling techniques may be used and possibly combined: stratification, multistage sampling (including cluster sampling), balanced sampling, etc. Simulations may be needed to compare various sampling options. It is important to allocate time for this task in the survey schedule.
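As an illustration of one of these techniques, the sketch below draws a stratified sample with proportional allocation from a toy sampling frame. It is a minimal example, not a recommended design: the frame, the "region" auxiliary variable and the function name are hypothetical, and a real plan would usually compare several allocations by simulation.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, n_total, seed=0):
    """Draw a simple random sample within each stratum, with stratum
    sample sizes proportional to stratum sizes in the frame.
    Returns (unit, design_weight) pairs."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for units in strata.values():
        # Proportional allocation; rounding means the realised total
        # can differ slightly from n_total in general.
        n_h = max(1, round(n_total * len(units) / len(frame)))
        n_h = min(n_h, len(units))
        weight = len(units) / n_h  # design weight N_h / n_h
        for unit in rng.sample(units, n_h):
            sample.append((unit, weight))
    return sample


# Hypothetical frame: 1,000 individuals with one auxiliary variable.
frame = [{"id": i, "region": "north" if i < 600 else "south"} for i in range(1000)]
sample = stratified_sample(frame, lambda u: u["region"], n_total=100)
print(len(sample), sum(w for _, w in sample))
```

Because each stratum is sampled separately, the weighted sample reproduces the stratum sizes of the frame exactly, which is the main attraction of stratifying on an auxiliary variable related to the survey topic.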

How to proceed when there is no sampling frame

In this case, it may be possible to use a partial sampling frame; e.g., the complete list of housing units in a set of randomly selected geographical zones (primary units).

This type of sample drawing is similar to two-stage sampling, where a filter survey is conducted with a large sample and a smaller sample is drawn from the target population thus identified.

In other cases, “indirect sampling” may be used. This consists of reaching the target population by means of another, related population. For example, children can be reached through a sample of parents, or social service users through a sample of services (as in surveys of homeless people).

As a last resort, a sample may be determined using a non-probabilistic method such as the quota, random route or purposive methods. Quota sampling deserves special mention, as it is the method most often used by polling institutes in general population surveys. The idea is to ensure that the final sample accurately reflects the population researchers wish to study. Because the selection probabilities are unknown, it is in theory impossible to calculate sampling precision, so this method cannot be used for studies that must meet precision requirements.

Once the sample survey has been conducted, it is important to calculate “weights,” that is, coefficients that can be used to extrapolate from the sample to the entire target population.

The calculation of weights depends largely on the sampling plan. However, estimates produced with this initial weighting can be improved to take account of non-response and auxiliary data from the sampling frame or outside sources. This procedure is known as data “adjustment.”
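To make the link between the sampling plan and the initial weights concrete, the sketch below assumes that each respondent's inclusion probability is known from the sampling plan and takes the design weight as its inverse. The function names and the four respondents are hypothetical; the weighted mean shown is the usual ratio form (weighted total divided by the sum of weights).

```python
def design_weight(inclusion_probability):
    """Initial weight: the number of population units that one sampled
    unit 'represents' under the sampling plan."""
    return 1.0 / inclusion_probability


def estimate_total(values, weights):
    """Weighted estimate of a population total."""
    return sum(w * y for y, w in zip(values, weights))


def estimate_mean(values, weights):
    """Weighted estimate of a population mean or proportion."""
    return estimate_total(values, weights) / sum(weights)


# Hypothetical respondents: inclusion probabilities from the sampling
# plan and a yes/no answer coded 1/0.
probabilities = [0.01, 0.01, 0.02, 0.02]
answers = [1, 0, 1, 1]

weights = [design_weight(p) for p in probabilities]
print(estimate_total(answers, weights), estimate_mean(answers, weights))
```

Units drawn with a lower probability carry a larger weight, so an unweighted proportion (3 out of 4 here) can differ noticeably from the weighted one.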

Adjustments are made primarily on the basis of auxiliary data, i.e., information that is either internal to the survey or can be found in external sources. Specific techniques are used to make the most of such information: adjusting for total non-response, post-stratification or marginal calibration.
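Marginal calibration can be illustrated with raking (iterative proportional fitting), one common way of calibrating on margins; it is shown here as a sketch, not as the method any particular survey uses. The variable names, categories and population totals are hypothetical. Post-stratification corresponds to the special case with a single margin.

```python
def rake(records, weights, margins, n_iter=50, tol=1e-8):
    """Adjust weights so that weighted totals match known population
    totals for each category of each variable in `margins`.
    margins: {variable: {category: population_total}}"""
    w = list(weights)
    for _ in range(n_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total for each category of this variable.
            totals = {}
            for rec, wi in zip(records, w):
                totals[rec[var]] = totals.get(rec[var], 0.0) + wi
            # Scale every weight by target / current for its category.
            for i, rec in enumerate(records):
                factor = targets[rec[var]] / totals[rec[var]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break  # all margins matched within tolerance
    return w


# Hypothetical respondents and external population margins.
records = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
]
initial = [25.0, 25.0, 25.0, 25.0]
margins = {"sex": {"F": 60.0, "M": 40.0}, "age": {"young": 50.0, "old": 50.0}}

raked = rake(records, initial, margins)
print(raked)
```

Only the margins need to be known, not the full cross-tabulation of the population, which is why this kind of adjustment is convenient when auxiliary totals come from separate external sources.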

Internal information may come from:

• the sampling frame, either in the form of aggregated statistics (totals, proportions) or at the individual level. In the latter case, that information should be available for all individuals drawn for the sample, regardless of whether they are actually respondents;
• the data collection process. Here, too, information should be available for all individuals drawn, whether or not they actually responded. The information may include reasons for not answering (refusal, non-contact, out-of-scope individual, etc.), number of times the individual had to be contacted before the interview could be conducted, etc. These types of information are called paradata.
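The requirement that this information be available for every individual drawn, respondent or not, is what makes a simple adjustment for total non-response possible. The sketch below uses weighting classes: within each class, respondents' weights are inflated so that they also represent the non-respondents of that class. The "area" variable, the weights and the function name are hypothetical, and the approach assumes that response behaviour is similar within a class.

```python
from collections import defaultdict


def nonresponse_adjusted_weights(sampled, class_of):
    """sampled: dicts with a design 'weight' and a 'responded' flag for
    every unit drawn. Returns (unit, adjusted_weight) for respondents."""
    drawn = defaultdict(float)
    responded = defaultdict(float)
    for unit in sampled:
        c = class_of(unit)
        drawn[c] += unit["weight"]
        if unit["responded"]:
            responded[c] += unit["weight"]

    adjusted = []
    for unit in sampled:
        if unit["responded"]:
            c = class_of(unit)
            # Inverse of the weighted response rate in the class.
            factor = drawn[c] / responded[c]
            adjusted.append((unit, unit["weight"] * factor))
    return adjusted


# Hypothetical sample: half of the urban units responded, all rural did.
sampled = (
    [{"area": "urban", "weight": 10.0, "responded": i < 2} for i in range(4)]
    + [{"area": "rural", "weight": 10.0, "responded": True} for _ in range(4)]
)
adjusted = nonresponse_adjusted_weights(sampled, lambda u: u["area"])
print([(u["area"], w) for u, w in adjusted])
```

After adjustment, the respondents' weights add up to the same total as the design weights of all units drawn, so the lower urban response rate no longer leads to urban individuals being under-represented.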

Correcting for partial non-response

Data can also be affected by partial non-response, i.e., when a respondent has not answered a question or a section of the questionnaire. In this case, imputation techniques are used to supply the missing data, specifically:

• cold-deck imputation (using information external to the survey data);
• hot-deck imputation (using information internal to the survey).

Other techniques include multiple imputation regression, imputations by partial least squares (PLS) regression algorithms (particularly NIPALS), reduced variance, etc.
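Hot-deck imputation can be sketched in a few lines. The version below is a random hot deck within imputation classes, which is one common variant among several: a missing answer is replaced by the answer of a randomly chosen respondent (the "donor") from the same class. The "region" and "income" variables and the function name are hypothetical.

```python
import random


def hot_deck_impute(records, target, class_vars, seed=0):
    """Fill missing values (None) of `target` with the value of a random
    donor sharing the same values on `class_vars`. Returns new records;
    the input list is left unchanged."""
    rng = random.Random(seed)

    # Collect observed values (donors) by imputation class.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            key = tuple(rec[v] for v in class_vars)
            donors.setdefault(key, []).append(rec[target])

    completed = []
    for rec in records:
        new = dict(rec)
        if new[target] is None:
            pool = donors.get(tuple(new[v] for v in class_vars))
            if pool:  # stays missing when the class has no donor
                new[target] = rng.choice(pool)
                new["imputed"] = True  # flag imputed values for analysis
        completed.append(new)
    return completed


# Hypothetical data with partial non-response on "income".
records = [
    {"region": "north", "income": 100},
    {"region": "north", "income": None},
    {"region": "south", "income": 200},
    {"region": "east", "income": None},
]
completed = hot_deck_impute(records, "income", ["region"])
print(completed)
```

Flagging imputed values, as done here, lets analysts distinguish observed answers from imputed ones when assessing data quality.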

Information on non-respondents

Whether for adjustment or imputation purposes, it is important to study non-response in order to understand what causes it and the mechanisms behind it.

To do so, researchers need information on non-respondents. In cases of total non-response, the information in the sampling frame at least is known, but it is useful to collect other information during data collection, even if the questionnaire could not be administered. For example, it is useful to know why the respondent did not answer (refusal, no facilitating interlocutor, language problem, etc.).

In cases of partial non-response, researchers have the respondent’s answers to some questions. The appropriate imputation method can then be identified by means of a response probability study.

Quality assessment

The aim of quantitative and qualitative quality assessment is to measure any “errors” that may compromise the quality of the collected data.

Sources of error in surveys

Three types of error may occur in a survey: sampling error, non-response error and measurement error. They all have consequences in terms of the bias and variance of the estimates obtained.

Sampling errors are due to the sampling procedure and how it was implemented. Conducting a sample survey (which by definition only polls a part of the target population) rather than a survey of all members of the target population gives rise to uncertainty, and that uncertainty decreases as the sample size increases: under simple random sampling, the standard error of an estimate is roughly inversely proportional to the square root of the sample size. Moreover, a poor sampling frame (obsolete, incomplete, etc.) can introduce bias, e.g., through under- or over-representation of certain groups.
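The relationship between sample size and sampling uncertainty can be illustrated for the simplest case, a proportion estimated from a simple random sample. This is a sketch under that assumption only; more complex sampling plans require design-specific variance formulas. The function name and the sample sizes are chosen for the example.

```python
import math


def standard_error_proportion(p, n, population=None):
    """Standard error of an estimated proportion under simple random
    sampling, with an optional finite population correction."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return se


# Half-width of an approximate 95% confidence interval for p = 0.5.
for n in (100, 400, 1600):
    half_width = 1.96 * standard_error_proportion(0.5, n)
    print(n, half_width)
```

Quadrupling the sample size only halves the standard error, which is why gains in precision become increasingly expensive as samples grow.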

Non-response from eligible respondents—for whatever reason (refusal to answer, respondent could not be contacted, respondent did not fill out the questionnaire, etc.)—can induce bias if non-respondent characteristics differ from respondent characteristics. Non-response also reduces the size of the usable sample, thereby reducing estimation precision.

Non-response may be partial (the respondent did not answer one or more questions) or total (no questionnaire returned, blank questionnaire returned).
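The way non-response induces bias can be made explicit with a simple identity: if the population is viewed as split into people who would respond and people who would not, the bias of the unadjusted respondent mean equals the non-response rate multiplied by the difference between the two groups. The sketch below assumes that simplified, deterministic view of response, and the figures are hypothetical.

```python
def nonresponse_bias(mean_respondents, mean_nonrespondents, response_rate):
    """Bias of the unadjusted respondent mean relative to the mean of
    the whole population (respondents and non-respondents combined)."""
    return (1.0 - response_rate) * (mean_respondents - mean_nonrespondents)


# Hypothetical case: 30% of respondents report a behaviour, against 20%
# of non-respondents, with a 60% response rate.
bias = nonresponse_bias(0.30, 0.20, 0.60)
population_mean = 0.60 * 0.30 + 0.40 * 0.20
print(bias, population_mean)
```

The bias vanishes either when everyone responds or when respondents and non-respondents do not differ on the variable of interest, which is why a low response rate does not by itself imply a large bias.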

Measurement errors may be due to several factors, such as difficulty administering the questionnaire (respondent does not properly understand certain questions, translation difficulties, anomalous response category), poor interviewer-respondent interaction (interviewer improperly reformulates questions or does not correctly follow interviewing instructions, etc.) or the collection method (all other things being equal, different interview methods—face-to-face, telephone, self-administered questionnaire—may elicit different responses from respondents). Responses may also vary with the interview conditions (interview location, whether or not another person is present, etc.). It is also important to look out for possible data processing errors (transcription errors, lost questionnaires, etc.).