Data Sharing @ Pitt

Learn about the principles and how-to of sharing academic research data.

Data dictionaries and codebooks

If you share tabular data such as spreadsheets, you should also describe what the columns in your spreadsheet(s) mean. This is accomplished with a data dictionary. A data dictionary is a separate table or spreadsheet, with one row for each column in the data of interest.

For example, if you have a spreadsheet with two columns, treatment_group and tumor_size, then the corresponding data dictionary should have two rows, one for treatment_group and one for tumor_size. Each variable is then described by its human-friendly meaning, its data type, and any other relevant details. This kind of information (information about the data of interest) is also referred to as metadata.

The table trees_data has columns tree_id, species, height, and condition. The derived table trees_data_dictionary has one row for each of these columns (tree_id, species, height, and condition), with the fields name, type, and description.
Figure: An example data dictionary creation process. Data adapted from WPRDC.
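
The process in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the descriptions, and the helper names infer_type and build_data_dictionary are hypothetical and are not taken from the WPRDC data. The type labels are deliberately coarse.

```python
import csv
import io

# Hypothetical sample rows; the real WPRDC data is not reproduced here.
trees_data = [
    {"tree_id": 1, "species": "Red Maple", "height": 12.5, "condition": "Good"},
    {"tree_id": 2, "species": "Pin Oak", "height": 20.0, "condition": "Fair"},
]

# Hypothetical human-written descriptions, one per column.
descriptions = {
    "tree_id": "Unique identifier for the tree",
    "species": "Common name of the tree species",
    "height": "Height of the tree (units not stated here)",
    "condition": "Assessed condition of the tree",
}


def infer_type(values):
    """Return a coarse type label for the values of one column."""
    if all(isinstance(v, int) for v in values):
        return "integer"
    if all(isinstance(v, (int, float)) for v in values):
        return "number"
    return "text"


def build_data_dictionary(rows, descriptions):
    """Build one dictionary entry (row) per column of the data."""
    columns = list(rows[0].keys())
    return [
        {
            "name": col,
            "type": infer_type([row[col] for row in rows]),
            "description": descriptions.get(col, ""),
        }
        for col in columns
    ]


trees_data_dictionary = build_data_dictionary(trees_data, descriptions)

# Serialize the dictionary as CSV text so it can be shared as its own table.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "type", "description"])
writer.writeheader()
writer.writerows(trees_data_dictionary)
dictionary_csv = buffer.getvalue()
```

Inferred types are only a starting point. The descriptions still have to be written by someone who knows how the data were collected.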

For any categorical variables in your data, there should also be descriptions of the levels (categories) available in the variable and their meaning. Information about categorical variables (such as label, meaning, and assignment criteria) is collected in a codebook, which may be expressed within the data dictionary (if the categories are simple) or in its own document. A dedicated codebook document may go into much more detail and may sometimes be encoded in a machine-readable format such as XML.
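
As a rough illustration, a simple codebook for one categorical variable could be kept as a small mapping from each level to its meaning and assignment criteria. The levels, wording, and the helper name find_undocumented_levels below are hypothetical, chosen only to show the idea of checking observed values against the documented categories.

```python
# Hypothetical codebook for a categorical variable named "condition".
# The levels and criteria are illustrative assumptions, not real survey rules.
condition_codebook = {
    "Good": {
        "meaning": "Tree appears healthy",
        "criteria": "No visible damage or disease",
    },
    "Fair": {
        "meaning": "Tree shows minor problems",
        "criteria": "Some dead branches or minor damage",
    },
    "Poor": {
        "meaning": "Tree shows major problems",
        "criteria": "Extensive damage, disease, or dieback",
    },
}


def find_undocumented_levels(values, codebook):
    """Return the distinct observed values that the codebook does not define."""
    return sorted(set(values) - set(codebook))


# Hypothetical observed values; "good" differs from "Good" only by case.
observed = ["Good", "Fair", "good", "Poor"]
undocumented = find_undocumented_levels(observed, condition_codebook)
```

A check like this tends to surface inconsistent capitalization or spelling early, which is one way a codebook supports consistency across files and team members.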

Each of your tabular data sets (i.e., each spreadsheet) should have an accompanying data dictionary, although the same data dictionary can describe multiple files so long as they all follow the same format.

Data dictionaries and codebooks help others to interpret your data after publication, but they can also be useful during a long-running or collaborative project, where such documentation keeps files and team members consistent. For this reason, consider developing and maintaining data dictionaries and codebooks early in (and throughout) your data collection process.

Typical fields (columns) in a data dictionary include:

  • Variable’s name (as it appears in the file)
  • Variable’s human-friendly name, if needed
  • Variable meaning / definition
  • Measurement units (cm, kg, etc.)
  • Allowed values (e.g., numeric range, or category levels)
  • Nullability (whether the column may be empty for any row)
  • Notes about data collection, any inconsistencies found, etc.
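
The fields listed above could be represented roughly as follows. This is a minimal sketch, not a standard format: the class name Variable, its attribute names, the example values, and the function check_row are all assumptions made for illustration. It shows how allowed values and nullability recorded in a data dictionary can be used to check a row of data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass
class Variable:
    """One row of a data dictionary (attribute names are illustrative)."""
    name: str                                   # name as it appears in the file
    label: str                                  # human-friendly name
    definition: str                             # meaning of the variable
    units: Optional[str] = None                 # measurement units, if any
    allowed_range: Optional[Tuple[float, float]] = None
    allowed_levels: Optional[FrozenSet[str]] = None
    nullable: bool = False                      # may the value be empty?
    notes: str = ""                             # collection notes, known issues


def check_row(row, dictionary):
    """Return a list of problems found in one data row."""
    problems = []
    for var in dictionary:
        value = row.get(var.name)
        if value is None or value == "":
            if not var.nullable:
                problems.append(f"{var.name}: missing value")
            continue
        if var.allowed_range is not None:
            low, high = var.allowed_range
            if not (low <= value <= high):
                problems.append(
                    f"{var.name}: {value} outside allowed range {low} to {high}"
                )
        if var.allowed_levels is not None and value not in var.allowed_levels:
            problems.append(f"{var.name}: unknown level {value!r}")
    return problems


# Hypothetical dictionary entries; the units and range are assumed values.
dictionary = [
    Variable("tree_id", "Tree ID", "Unique identifier for the tree"),
    Variable("height", "Height", "Height of the tree", units="m",
             allowed_range=(0.0, 60.0), nullable=True),
    Variable("condition", "Condition", "Assessed condition of the tree",
             allowed_levels=frozenset({"Good", "Fair", "Poor"})),
]

problems_out_of_range = check_row(
    {"tree_id": 1, "height": 75.0, "condition": "Good"}, dictionary)
problems_nullable = check_row(
    {"tree_id": 2, "height": None, "condition": "Fair"}, dictionary)
problems_missing_id = check_row(
    {"tree_id": None, "height": 10.0, "condition": "Bad"}, dictionary)
```

Keeping these rules next to the definitions means the same document serves both human readers and simple automated checks.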

More resources for data dictionaries and codebooks