Data Sets for Social Sciences: Data Basics

Data Basics

Data vs. Statistics

The terms data and statistics are often used interchangeably, but in research they mean different things. Understanding the distinction will help you determine which one you need.

Data are the raw ingredients from which statistics are created, and they usually need statistical software (e.g. SPSS, Stata) to be manipulated. Statistical analysis can be performed on data to test a hypothesis, show relationships among the variables collected, generate custom tables, or run procedures such as regression, t-tests, and ANOVA. Through secondary data analysis, many different researchers can reuse the same data set for different purposes.

Statistics are in a "ready to use" format: the data have already been analyzed and processed to produce information in an easy-to-read form such as facts, figures, charts, tables, and graphs. Statistics are useful when you just need a few numbers to support an argument (for example: in 2003, 98.2% of American households had a television set, according to the Statistical Abstract of the United States).

On the most basic level, it helps to think of data as the entire collection of information gathered during a survey. A statistic takes those data and refines them down to a single number or percentage in order to answer a specific question, such as "What is the median age of Vanderbilt freshmen?"
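As a small illustration of this distinction, the sketch below treats a list of ages as the data and reduces it to one statistic. The ages are invented for the example and do not come from any actual survey.

```python
from statistics import median

# Hypothetical survey responses (invented values, not real Vanderbilt data).
# Each entry is one respondent's age: this list is the "data".
freshman_ages = [18, 18, 19, 17, 18, 20, 19, 18]

# Reducing the whole list to a single number produces a "statistic".
median_age = median(freshman_ages)
print(median_age)
```

With only the median in hand you could not go back and ask a new question, such as how many respondents were under 18. For that you would need the underlying list.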

As a general rule of thumb:

  • If what you need is a number to back up your argument, then a statistic will probably do.
  • If, on the other hand, you need to manipulate the information to answer a new or different question, you will likely need to get your hands on some data.

Some data types:

  • Numeric Data are made up of numbers. Numeric Data are processed using statistical software like SPSS, Stata, or SAS.
  • Qualitative Data are data that describe a property or attribute. Examples of qualitative data are interviews, case studies, comments collected on a questionnaire, etc.
  • Spatial Data are geographic information that is used for analysis with GIS software like ArcGIS.
  • Primary Data are data collected through your own research study directly through instruments such as surveys, observations, etc.
  • Secondary Data are data from a research study conducted by someone else. Usually, when you are asked to locate statistics on a topic, you are using secondary data. One example of secondary data is the statistics from the Census of Population and Housing.

Aggregate or Macro Data are higher-level data that have been compiled from smaller units of data. For example, the Census data that you find on American FactFinder have been aggregated to preserve the confidentiality of individual respondents. Microdata contain individual cases, usually individual people or, in the case of Census data, individual households. The Integrated Public Use Microdata Series (IPUMS) provides access to the actual survey data from the Census but removes information that would identify individuals.
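The relationship between microdata and aggregate data can be sketched in a few lines of code. The household records, field names, and county labels below are hypothetical and chosen only for illustration; real Census microdata have a much richer structure.

```python
from collections import defaultdict

# Hypothetical microdata: one record per household (invented values).
households = [
    {"county": "A", "household_size": 3},
    {"county": "A", "household_size": 1},
    {"county": "B", "household_size": 4},
    {"county": "B", "household_size": 2},
    {"county": "B", "household_size": 3},
]


def aggregate_by_county(records):
    """Collapse household-level records into county-level totals."""
    totals = defaultdict(lambda: {"households": 0, "persons": 0})
    for record in records:
        county = totals[record["county"]]
        county["households"] += 1
        county["persons"] += record["household_size"]
    return dict(totals)


county_totals = aggregate_by_county(households)
print(county_totals)
```

The aggregated result no longer shows any single household, which is why aggregation helps protect confidentiality. It also means the individual cases cannot be recovered from the totals.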


In ICPSR, a data set or study is made up of the raw data file and any related files, usually the codebook and setup files. The codebook is your guide to making sense of the raw data. For survey data, the codebook usually contains the methodology, the actual questionnaire, and the values for the responses to each question, along with information on the structure, content, and layout of the data file and any other relevant information about the data set. For more information on how to use a codebook, I recommend Princeton University's How to Use a Codebook.
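To make the role of a codebook concrete, here is a minimal sketch of how coded responses in a raw data file are translated into readable answers. The variable name, question wording, and codes are invented for this example and are not taken from any ICPSR study.

```python
# Hypothetical codebook entry (variable name, question, and codes are invented).
codebook = {
    "Q1": {
        "question": "Does your household own a television set?",
        "values": {1: "Yes", 2: "No", 9: "Refused"},
    }
}

# Raw data files typically store the numeric codes, not the labels.
raw_responses = [1, 1, 2, 9, 1]


def decode(variable, codes, book):
    """Translate numeric codes into the labels listed in the codebook."""
    labels = book[variable]["values"]
    return [labels.get(code, "Unknown code") for code in codes]


decoded = decode("Q1", raw_responses, codebook)
print(decoded)
```

Without the codebook, the values 1, 2, and 9 in the raw file would be uninterpretable.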

ICPSR uses the term series to describe collections of studies that have been repeated over time. For example, the National Health Interview Survey is conducted annually. In the ICPSR archive, you will find a description of the series that provides an overview. You will also find individual descriptions of each study (e.g. National Health Interview Survey, 2004). The study number in ICPSR refers to the individual survey.

Cross-Sectional describes data that are collected only once, at a single point in time.

Time Series data track the same variables over time. The National Health Interview Survey is an example of time series data because the questions generally remain the same over time, but the individual respondents vary.

Longitudinal Studies are surveys that are conducted repeatedly, in which the same group of respondents is surveyed each time. This allows for examining changes over the life course. The Project on Human Development in Chicago Neighborhoods (PHDCN) Series contains a longitudinal component that tracks changes in the lives of individuals over time through interviews.
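Because a longitudinal study follows the same respondents, each person's answers can be linked across waves. The sketch below assumes a hypothetical two-wave panel with invented respondent IDs and income values. It is not based on PHDCN or any other real study.

```python
# Hypothetical panel: the same respondent IDs appear in both waves (invented values).
wave_1 = {"r01": 32000, "r02": 45000, "r03": 28000}
wave_2 = {"r01": 35000, "r02": 44000, "r03": 31000}


def individual_changes(earlier, later):
    """Compute each respondent's change between two waves.

    Only respondents present in both waves can be compared.
    """
    shared_ids = earlier.keys() & later.keys()
    return {rid: later[rid] - earlier[rid] for rid in sorted(shared_ids)}


changes = individual_changes(wave_1, wave_2)
print(changes)
```

This kind of person-level comparison is not possible when each round of a survey draws different respondents. In that case only group-level figures, such as averages, can be compared over time.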

For more definitions, I highly recommend the Glossary of Selected Social Science Computing Terms and Social Science Data Terms compiled by Jim Jacobs, Data Services Librarian, UCSD.
