D.9: Glossary of Statistical Terms

Accuracy: The closeness of a measured value to a known (or standard) value. The closer the measured value is to the known value, the more accurate your results are. If your sample result shows an average age of 42 years among sample members, but the known value for the total population is 35 years, the accuracy is rather low. Accuracy is different from (and independent of) precision.
Bias: Your data are biased (or: contain a bias) if certain persons in your sample, or answering options in your questionnaire, are systematically favoured (i.e. are chosen more often than others). Statistically, a bias is a systematic (not random) deviation from the true value.
Census: Data collection from all elements of your population of interest.
Confidence interval: A range of values, calculated from your sample, that is likely (= of which you can be confident) to contain the true population value (e.g. a range of 35 to 45 ha around a measured average farm size of 40 ha).
Confidence level: The proportion of confidence intervals that would contain the true population value if you repeated the sampling many times. The most common confidence level is 95%, which corresponds to z = 1.96 for a normal distribution.
Elements: Individual units or members (often people, but also organisations or companies) who make up a population.
Nonprobability sampling: A non-random (personally influenced) sampling procedure. If you do nonprobability sampling, you cannot claim representativeness.
Population: All elements (e.g. farmers, processors or consumers) who have some characteristics in common. If these characteristics are important for you, you talk about your ‘population of interest’ (e.g. organic farmers).
Precision: The closeness of two or more measurements to each other. If you took several measurements which are all very close to each other, the result is precise. It is possible to have precise, but inaccurate results.
Probability sampling: A random sampling procedure in which every element has a known, non-zero chance of being included in the sample (in simple random sampling, all elements have the same chance). Only probability samples can be used to get representative results.
Random error: An error that arises from random changes or differences in participants or in measurement situations. Random errors tend to cancel each other out on average and do not lead to a bias in your results.
Representative sample: A sample whose members accurately reflect the total population from which the sample is taken. Representativeness can only be gained by probability sampling.
Sample: The group of elements of your population of interest that you select for data collection.
Sampling error (also: Random sampling error): An error that occurs because the selected sample is an imperfect representation of the population. The size of the sampling error tells you how far a result you get from a sample is likely to differ from the result you would have got if you had asked your total population of interest (census).
Sampling frame: A list of all elements which are part of your population of interest, or the procedure of creating such a list.
Sampling unit: A unit of your population of interest that you choose during the sampling process. A sampling unit contains one or more of the individual elements of the population (e.g. elements: individual people; unit: households).
Standard deviation: The square root of the variance. Like variance, it tells you about dispersion of values. However, unlike variance the standard deviation is expressed in the same units as the data.
Systematic error: An error that leads to a bias in your results, e.g. due to mistakes in sampling or during data collection.
Target population: Another expression for the population of interest.
Variance: A measure of the dispersion of individual values around their mean. If all values you measured are close to each other, your variance is low.
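
The relationship between variance, standard deviation and a confidence interval can be illustrated with a small calculation. The following is only a minimal sketch: the farm sizes are invented for illustration, the function name `summarise` is hypothetical, and the interval uses the normal approximation with z = 1.96 mentioned above (for a sample as small as this one, a t-distribution would usually be more appropriate).

```python
import math
import statistics

Z_95 = 1.96  # z-value for a 95% confidence level (normal approximation)


def summarise(values, z=Z_95):
    """Return mean, sample variance, standard deviation and a
    normal-approximation confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance, divides by n - 1
    std_dev = math.sqrt(variance)           # same units as the data (here: ha)
    margin = z * std_dev / math.sqrt(n)     # half-width of the interval
    return {
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "ci": (mean - margin, mean + margin),
    }


# Hypothetical farm sizes in hectares (assumed values, not real data).
farm_sizes_ha = [32.0, 38.0, 40.0, 41.0, 44.0, 45.0]
summary = summarise(farm_sizes_ha)

print(f"mean:     {summary['mean']:.2f} ha")
print(f"variance: {summary['variance']:.2f} ha^2")
print(f"std dev:  {summary['std_dev']:.2f} ha")
print(f"95% CI:   {summary['ci'][0]:.2f} to {summary['ci'][1]:.2f} ha")
```

Note that the variance is expressed in squared units (ha²), whereas the standard deviation and the confidence interval are in hectares, which is why the latter two are easier to report.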

Quality dimensions were defined by the Statistical Office of the European Union (Eurostat) to establish a framework for the analysis and evaluation of the quality of statistical data and their sources: relevance, accuracy, timeliness and punctuality, accessibility and clarity, comparability, and coherence (Eurostat, 2009).

These quality dimensions are explained in more detail in the European Statistics Code of Practice (CoP), which presents the desired structure and content of a quality report to harmonise quality reporting across member states and to facilitate comparisons (Eurostat, 2009). In addition, the European Statistical System Committee prepared a Quality Assurance Framework to explain activities, methods and tools that help to implement the CoP (Eurostat, 2012). In the following paragraphs, each data quality dimension is explained by a number of aspects and questions which should be taken into account when assessing data quality. The introduction of a quality report has to include: a brief history of the statistical process and outputs in question; a main text body on statistics to which the outputs belong; and limitations of the quality report with references to related reports (Hahn & Linden, 2007).

The first quality dimension is relevance. It is defined as the degree to which statistical outputs meet current and potential user needs. To further describe the relevance of the statistical output it is necessary to refer to its contents and to provide the key outputs/estimates desired by different users.

The second quality dimension, accuracy, implies the closeness of data to the true values and covers sampling as well as non-sampling errors which have been explained above. The method of data collection (e.g. online survey) that is used needs to be presented in order to understand and assess specific errors. Furthermore, a section on the main sources of random and systematic errors needs to be provided. Depending on the type of study, particular errors have to be defined in more detail and have to be handled individually. In addition to sampling errors, one can find coverage, measurement, nonresponse, and data handling errors as described earlier in this chapter.
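
The difference between random and systematic errors can be made visible with a small simulation. This is a sketch under stated assumptions, not a description of any Eurostat procedure: the function `simulate_measurements`, the noise level and the shift of 7 years are all hypothetical, chosen to mirror the glossary example of a true mean age of 35 and a sample result of 42.

```python
import random
import statistics


def simulate_measurements(true_value, n, noise_sd, systematic_shift=0.0, seed=0):
    """Simulate n measurements of true_value with random noise and an
    optional constant systematic shift (bias)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [
        true_value + systematic_shift + rng.gauss(0.0, noise_sd)
        for _ in range(n)
    ]


TRUE_MEAN_AGE = 35.0  # assumed known population value

# Random error only: individual values scatter, but the mean stays close to 35.
random_only = simulate_measurements(TRUE_MEAN_AGE, n=10_000, noise_sd=5.0)

# Random plus systematic error: the whole distribution is shifted by 7 years.
with_bias = simulate_measurements(
    TRUE_MEAN_AGE, n=10_000, noise_sd=5.0, systematic_shift=7.0
)

mean_random_only = statistics.mean(random_only)
mean_with_bias = statistics.mean(with_bias)

print(f"random error only:     mean = {mean_random_only:.2f}")
print(f"with systematic error: mean = {mean_with_bias:.2f}")
```

Increasing `n` reduces the effect of the random error on the mean, but it does nothing to remove the systematic shift, which is why the two error types have to be reported and handled separately.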

The third dimension consists of two quality indicators: timeliness and punctuality. Timeliness is defined as the length of time between the date to which the data refer and the date on which they become available to the public. Punctuality is the time lag between the actual release date and the target release date of the data. The reasons for non-punctual releases need to be explained.
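
Both indicators are simple date differences, which can be sketched as follows. The function names and all dates are hypothetical and serve only to illustrate the two definitions.

```python
from datetime import date


def timeliness_days(reference_period_end, release_date):
    """Days between the end of the period the data refer to and their release."""
    return (release_date - reference_period_end).days


def punctuality_days(target_release_date, release_date):
    """Days between the announced target release date and the actual release.
    Positive values mean the release was late."""
    return (release_date - target_release_date).days


# Assumed example: data for the year 2023, announced for the end of June 2024,
# actually released in mid-July 2024.
reference_period_end = date(2023, 12, 31)
target_release = date(2024, 6, 30)
actual_release = date(2024, 7, 15)

timeliness = timeliness_days(reference_period_end, actual_release)
punctuality = punctuality_days(target_release, actual_release)

print(f"timeliness:  {timeliness} days")
print(f"punctuality: {punctuality} days late")
```

In this example the data became available 197 days after the end of the reference period and 15 days after the target date, so the quality report would need to explain the 15-day delay.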

Accessibility and clarity describe the simplicity and ease with which users can access and understand statistics. The conditions of data access depend on the following factors: media, support, pricing policies, and possible restrictions. The understanding of statistical outputs can be enhanced by providing accompanying explanatory information. The best way to evaluate this quality dimension is to reflect on the feedback of users, which is an essential part of the quality report.

The two remaining dimensions refer to coherence and comparability of statistical data. The quality of statistical outputs depends on the use of the same concepts and harmonised methods. In this context comparability is defined as a special case of coherence. A lack of coherence is explained by differences in concepts and methods. Hence one part of the quality section needs to deal with the assessment of possible effects of each reported difference on the output values. To further explain this quality dimension it can be related to a variety of attributes. First of all, comparability can be regarded over time and across regions; secondly coherence can be evaluated internally, but also in comparison with national accounts or with other statistics; and finally the quality can be checked with the help of so-called mirror statistics, which usually tackle the same topic but use a different sample or a different method (Eurostat, 2009).

More information on analyses and publication of data can be found in one of the subsequent chapters.