Which of the following is used to summarize two potentially related categorical variables?

1 Data, statistical information and statistics

1.1 Definitions

Data, statistical information and statistics are closely related, but understanding the key differences between these concepts is important for anyone who needs to navigate the ever-rising ocean of information produced by modern society. Data are the raw materials for producing statistical information, of which statistics are a specific type.

Data

Data are facts, figures, observations, or recordings that can take the form of images, sounds, text or physical measurements (e.g., distance, weight, wavelengths). Data can be gathered and processed in order to form conclusions. Data can come from many sources and can be split into two groups based on the form they take: structured data and unstructured data.

Structured data are data that are organized into pre-defined items that each relate to a specific concept or data item. A set of data gathered using a questionnaire or other fillable form is a good example of structured data: the questions on a questionnaire represent separate, well-defined concepts. In the case of a closed question, the answer will fit in one of multiple pre-defined categories. For an open question, the answer may take the form of text or numerical values. If an answer was recorded for each question, the data are complete. If not, there are missing values.

For example, consider how each column in table 1.1.1 on Canadian universities relates to a single, separate concept:

Table 1.1.1
Example of structured data

Name of institution | City | Province | Established | Number of students
Université Laval | Quebec | QC | 1852 | 43,000
University of Waterloo | Waterloo | ON | 1955 | 30,000
Dalhousie University | Halifax | NS | 1818 | 18,000
Simon Fraser University | Burnaby | BC | 1965 | 30,000

Each row includes the values for one observation unit for which information was collected. Rows are referred to as observations or records. Concepts presented in each column are often called variables. Data sets are groupings of data that have common definitions of observation units and variables.

In order to be processed and analyzed, structured data need to be compiled in a digital data structure that naturally aligns with pre-defined concepts or variables, such as a spreadsheet, a database or a delimited text file. The data can then be read by statistical software that allows the data user to transform and summarize the data, to perform mathematical operations on the data or to visualize them.
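As a rough illustration of the paragraph above, the following Python sketch stores Table 1.1.1 as delimited text and reads it back into one record per observation. The column names (`name`, `city`, `province`, `established`, `students`) and the helper `load_records` are assumptions made for this example, not part of the original table.

```python
import csv
import io

# Table 1.1.1 written as delimited text: one line per observation,
# one comma-separated field per variable. Column names are illustrative.
DELIMITED_TEXT = """name,city,province,established,students
Université Laval,Quebec,QC,1852,43000
University of Waterloo,Waterloo,ON,1955,30000
Dalhousie University,Halifax,NS,1818,18000
Simon Fraser University,Burnaby,BC,1965,30000
"""

def load_records(text):
    """Return one dict per observation, converting numeric variables to int."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["established"] = int(row["established"])
        row["students"] = int(row["students"])
        records.append(row)
    return records

records = load_records(DELIMITED_TEXT)

# Once the data are structured, simple operations become one-liners.
oldest = min(records, key=lambda r: r["established"])
print(len(records), "observations; oldest institution:", oldest["name"])
```

Each dictionary corresponds to a row (an observation) and each key to a column (a variable), which is the alignment between data structure and concepts that the text describes.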

Unstructured data are any data that are not arranged according to a pre-defined model. To produce statistical information based on unstructured data, additional processing is needed to organize the information contained in the data. Table 1.1.2 presents examples of how text, images and sounds can be transformed into structured data that can be used for text analysis and for pattern and speech recognition.

Table 1.1.2
Transforming unstructured data into structured data

Unstructured data | Processing | Structured data
A text | Parsing, to split the text into a list of words; aggregation, to count how many times the same word occurs; use of dictionaries and rules to classify words. | A spreadsheet: each row holds one distinct word, and the three columns present the word, the number of occurrences and the category of the word.
An image | Assignment of RGB values to pixels; segmentation of the image into blocks of pixels based on red (R), green (G) and blue (B) components. | A database: each record is a group of pixels and the variables summarize the colour components in each group.
A recording of someone’s voice | Segmentation of the recording into distinct sounds; measurement of duration and frequencies. | A list of segments with duration and frequencies.
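The first row of Table 1.1.2 (parsing, aggregation and classification of a text) can be sketched in a few lines of Python. This is a minimal sketch: the `CATEGORIES` dictionary and the function `text_to_rows` are hypothetical names invented here, and real text analysis would use far richer dictionaries and rules.

```python
import re
from collections import Counter

# Hypothetical dictionary used to classify words; unknown words fall into "other".
CATEGORIES = {"data": "noun", "are": "verb", "raw": "adjective"}

def text_to_rows(text):
    """Turn free text into rows of (word, occurrences, category)."""
    words = re.findall(r"[a-z']+", text.lower())      # parsing into words
    counts = Counter(words)                            # aggregation by word
    return [(word, n, CATEGORIES.get(word, "other"))  # classification
            for word, n in sorted(counts.items())]

rows = text_to_rows("Data are raw. Data are facts.")
for word, occurrences, category in rows:
    print(word, occurrences, category)
```

The result has one row per distinct word and three columns, which matches the spreadsheet layout described in the table.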

With the increased use of computers and smartphones in all areas of our lives, a huge part of the digital data that is being created now is unstructured. Assessing the potential of this data and creating innovative ways of gathering, processing and analyzing it in order to produce valuable statistical information is one of the great challenges of the data revolution.

But what is the difference between statistical information and data?

Statistical information

Statistical information is data that has been recorded, classified, organized, related, or interpreted within a framework so that meaning emerges. Statistical information that is communicated to information users should help them understand the story told by the data and communicate to them the quality of the information that is presented. Statistical information can be presented in various formats: texts, tables, graphs, infographics, videos, or even databases.

Many examples of statistical information produced at Statistics Canada will be presented on the next page, but it is first important to understand one major part of the process of producing statistical information from data: the use of statistics!

Statistics

In general, statistics relate to numerical data; in fact, the term “statistics” can refer to the science of dealing with numerical data itself. Statistics are also a type of information obtained through mathematical operations on data. Above all, statistics aim to provide useful information by means of numbers.

The most commonly used statistics to report statistical information are called descriptive statistics. For numeric variables, measures of central tendency provide the value that is the most representative of the units found in a data set. Measures of dispersion describe the spread of the data around the central tendency. For categorical variables, frequency distributions are used to summarize the data. Proportions, ratios and rates are also useful statistics to analyze the data.
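To make these descriptive statistics concrete, here is a small Python sketch that uses the "Number of students" column of Table 1.1.1. It relies only on the standard `statistics` module; the variable names are chosen for this example, and the population standard deviation is just one of several possible dispersion measures.

```python
import statistics

# "Number of students" for the four institutions in Table 1.1.1.
students = {
    "Université Laval": 43000,
    "University of Waterloo": 30000,
    "Dalhousie University": 18000,
    "Simon Fraser University": 30000,
}
values = list(students.values())

# Measures of central tendency.
mean_students = statistics.mean(values)
median_students = statistics.median(values)

# Measures of dispersion.
value_range = max(values) - min(values)
std_dev = statistics.pstdev(values)  # population standard deviation

# Proportion: each institution's share of the total number of students.
total = sum(values)
shares = {name: n / total for name, n in students.items()}

print("mean:", mean_students, "median:", median_students)
print("range:", value_range, "standard deviation:", round(std_dev, 1))
```

With only four observations these numbers are illustrative rather than meaningful, but the same calls apply unchanged to larger data sets.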

When each row in a data set displays statistics that summarize the information for many units of observation, these data are called aggregate data. Conversely, when each row displays the information for a single unit of observation, the data are referred to as microdata.
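The distinction can be sketched in Python by treating the rows of Table 1.1.1 as microdata (one row per institution) and summarizing them into aggregate data. Grouping by the century of establishment is an arbitrary choice made for this illustration, and the function name `aggregate_by_century` is hypothetical.

```python
from collections import defaultdict

# Microdata: one row per unit of observation (values from Table 1.1.1).
microdata = [
    ("Université Laval", 1852, 43000),
    ("University of Waterloo", 1955, 30000),
    ("Dalhousie University", 1818, 18000),
    ("Simon Fraser University", 1965, 30000),
]

def aggregate_by_century(rows):
    """Summarize microdata into one row per century: count and total students."""
    groups = defaultdict(lambda: {"institutions": 0, "students": 0})
    for _name, established, students in rows:
        century = f"{established // 100 * 100}s"  # e.g. 1852 -> "1800s"
        groups[century]["institutions"] += 1
        groups[century]["students"] += students
    return dict(groups)

aggregate = aggregate_by_century(microdata)
for century, summary in sorted(aggregate.items()):
    print(century, summary)
```

Each row of `aggregate` now describes several institutions at once, so the individual observations can no longer be recovered from it.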


Date modified: 2021-09-02

What is used to summarize two potentially related categorical variables?

Contingency tables (also called crosstabs or two-way tables) are used in statistics to summarize the relationship between several categorical variables. A contingency table is a special type of frequency distribution table, where two variables are shown simultaneously.
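As a minimal sketch of how a contingency table is built, the Python example below cross-tabulates two categorical variables by counting each combination of values. The observations (province and study mode) are invented purely for illustration, and `contingency_table` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Hypothetical microdata: one (province, study_mode) pair per respondent.
observations = [
    ("ON", "full-time"), ("ON", "part-time"), ("ON", "full-time"),
    ("QC", "full-time"), ("QC", "part-time"), ("QC", "part-time"),
]

def contingency_table(pairs):
    """Return row labels, column labels and a matrix of joint frequencies."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    matrix = [[counts[(row, col)] for col in col_labels] for row in row_labels]
    return row_labels, col_labels, matrix

row_labels, col_labels, matrix = contingency_table(observations)

print("     ", col_labels)
for label, row in zip(row_labels, matrix):
    print(label, row)
```

Each cell holds the number of observations that share one value of the first variable and one value of the second, which is what makes the table a joint frequency distribution.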

Which of the following is the best to describe the relationship between two categorical variables?

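A grouped (clustered) or stacked bar chart is a common way to display the relationship between two categorical variables; the counts it plots are those of the corresponding contingency table.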

What technique is typically used to investigate relationships between two categorical variables?

Checking whether two categorical variables are related is useful not just in building predictive models, but also in data science research work. One statistical test that does this is the chi-square test of independence, which is used to determine whether there is an association between two categorical variables.
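The chi-square statistic compares the observed counts in a contingency table with the counts expected if the two variables were independent. The following pure-Python sketch computes the statistic and its degrees of freedom. The 2x2 table of counts is hypothetical, no continuity correction is applied, and the function assumes every expected count is greater than zero.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Assumes a rectangular table whose row and column totals are all non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are groups, columns are response categories.
observed = [[30, 10],
            [20, 40]]
statistic, dof = chi_square_statistic(observed)
print(f"chi-square = {statistic:.3f}, degrees of freedom = {dof}")
```

For one degree of freedom, the critical value at the 5% significance level is about 3.841, so a statistic well above that would suggest the two variables are associated. In practice a statistics library would normally be used to obtain the p-value directly.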

Which of the following describes what a categorical variable is?

Categorical variables are variables whose values are categories or labels rather than numerical measurements. Gender, race and ethnicity are examples of categorical variables.