
Research Data Management @ Pitt

This guide will assist researchers in planning for the various stages of managing their research data and in preparing data management plans required with funding proposals.

Finding and reusing your data

Now where did I put that file?

Finding and reusing your data will be easier, both for you and for other researchers, if you give a little thought early in the process to how you will name your data files and what file formats you will use to store your data. If you are planning to archive or share your data, you will also want to consider best practices for describing your data.

Choosing a file format

The format of the electronic data files you work with during your research may be determined by the research equipment and computer hardware and software that you have access to. However, for long-term preservation and ease of sharing, best practices may dictate that the files be converted to a different format after your project has ended. Give some thought to this eventuality at the outset. Considerations include:

  • Will your data be in a format that requires proprietary software to access it?
  • If you will be depositing your data in a repository at the end of your project, does the repository have specific guidelines or requirements with respect to file format?
  • What features of your data might be lost or modified in the conversion to another file format?

Stanford University Libraries - Data Management Services provides a useful overview of preferred file formats. From the Stanford resource:

  • Containers: TAR, GZIP, ZIP

  • Databases: XML, CSV

  • Geospatial: SHP, DBF, GeoTIFF, NetCDF

  • Moving images: MOV, MPEG, AVI, MXF

  • Sounds: WAVE, AIFF, MP3, MXF

  • Statistics: ASCII, DTA, POR, SAS, SAV

  • Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP

  • Tabular data: CSV

  • Text: XML, PDF/A, HTML, ASCII, UTF-8

  • Web archive: WARC


Naming your data files

Before you begin your research, decide on a naming convention for your files. Document the naming convention you choose, and make sure that you and your collaborators follow it. It will save you time and will help others who may use your files in the future. Best practices include:

  • Give files a meaningful name. A file name might include a combination of elements, such as type of equipment used, date, and researcher's surname. Decide on the best order for elements in a file name; it will affect how the files are sorted.
  • Keep names a reasonable length; some applications won't work well with long file names. A maximum of 25 characters is a good rule of thumb.
  • To separate elements in a file name, use underscores (_) or hyphens (-). Avoid blank spaces in a file name. Use periods only to separate the file name from the file type extension (.txt, .jpg, etc.).
  • If including date as part of the file name, use the standard format yyyymmdd to ensure that files sort in chronological order.
  • If your file name will include a numerical component, such as a subject number or version number, use leading zeros (001, 002, etc.) so that files sort in sequential order.
  • Avoid special characters like ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' "
  • Account for versions. The US Geological Survey recommends including a number after the file name to indicate the version, e.g.:
    • Bisondata_1.0 = original document
    • Bisondata_1.1 = original document with minor revisions
    • Bisondata_2.0 = document with substantial revisions
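
The naming practices above can be sketched in a few lines of Python. This is only an illustration: the function name `build_filename`, its parameters, and the example elements (an instrument code and a surname) are assumptions made for the sketch, not part of any standard tool.

```python
import re
from datetime import date

MAX_STEM_LENGTH = 25  # rule of thumb from the list above, not a hard limit


def build_filename(elements, when, number, extension, width=3):
    """Join name elements, a yyyymmdd date, and a zero-padded number."""
    # Keep only letters, digits, and hyphens inside each element, so spaces,
    # periods, and special characters never reach the file name.
    cleaned = [re.sub(r"[^A-Za-z0-9-]", "", element) for element in elements]
    stem = "_".join(cleaned + [when.strftime("%Y%m%d"), str(number).zfill(width)])
    if len(stem) > MAX_STEM_LENGTH:
        raise ValueError(f"{stem!r} is longer than {MAX_STEM_LENGTH} characters")
    return f"{stem}.{extension}"


name = build_filename(["GCMS", "Smith"], date(2016, 4, 2), 7, "txt")
print(name)  # GCMS_Smith_20160402_007.txt
```

Putting the elements in a fixed order inside one function is one way to make sure every collaborator produces names that sort the same way.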


Data versioning

Versioning refers to saving new copies of your files when you make changes so that you can go back and retrieve specific versions later. Saving multiple versions makes it possible to decide at a later time that you prefer an earlier version. You can then revert to that version immediately instead of having to retrace your steps to recreate it.
 
In its most basic form, versioning relies on a sequential numbering system. Within a given version number category (major, minor), these numbers are generally assigned in increasing order and correspond to changes in the data. The US Geological Survey recommends the following structure: 
 
  • DataFileName_1.0 = original document
  • DataFileName_1.1 = original document with minor revisions
  • DataFileName_2.0 = document with substantial revisions
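
As a rough sketch of how the major and minor numbers advance under this structure, the hypothetical helper below increments the version at the end of a file name. The function name `next_version` and the `substantial` flag are assumptions made for illustration; the sketch also assumes the name ends in `_major.minor` with no file extension.

```python
def next_version(filename_stem, substantial=False):
    """Return the stem with its trailing major.minor version incremented."""
    # Split "DataFileName_1.0" into the base name and the version string.
    base, _, version = filename_stem.rpartition("_")
    major, minor = (int(part) for part in version.split("."))
    if substantial:
        major, minor = major + 1, 0  # substantial revision: 1.1 -> 2.0
    else:
        minor += 1                   # minor revision: 1.0 -> 1.1
    return f"{base}_{major}.{minor}"


print(next_version("DataFileName_1.0"))                    # DataFileName_1.1
print(next_version("DataFileName_1.1", substantial=True))  # DataFileName_2.0
```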
 
The ETDplus project, led by the Educopia Institute, offers additional guidance for version control. Versioning should be taken into account when developing the folder and file naming structure. The following guidance is taken from the ETDplus brief on version control, available on the project site:

At the beginning of a research project, it is important to create a stable folder structure in which you can organize materials. The specific folders will depend on your own research process. File organization could be based on how you plan to gather materials, which experiment or process generated them, when they were created, or other strategies. The key is to use folders that make sense to you and allow you to easily find your materials.

A simple method to designate a revision is to note it at the end of the file name. This way, files can be grouped by their name and sorted by version number. For example:

  • image1_v1.jpg
  • image1_v2.jpg
  • image2_v1.jpg
  • image2_v2.jpg
  • ...
If you use version numbers, one issue that can arise is that computers will sort files based on the position of the characters. This can lead to strange, unhelpful results. For example:
  • image1_v1.jpg
  • image1_v10.jpg
  • image1_v2.jpg
  • ...
A good practice that can help you avoid these problems is to use dates to designate versions. If you choose this strategy, format dates as year-month-day (20150930). Using this order will help avoid confusion when collaborating with other researchers or systems that use a day-month-year or month-day-year order, and it will help your computer sort versions in chronological order. For example:
  • image1_20151021
  • image1_20151214
  • image1_20160123
  • ...
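
The sorting behavior described above is easy to reproduce. The short Python sketch below reuses the example names from this section; the third list, with the same dates written day-month-year, is added here only for comparison.

```python
numbered = ["image1_v2.jpg", "image1_v10.jpg", "image1_v1.jpg"]
# Strings are compared character by character, so "v10" sorts before "v2".
print(sorted(numbered))
# ['image1_v1.jpg', 'image1_v10.jpg', 'image1_v2.jpg']

dated = ["image1_20160123", "image1_20151021", "image1_20151214"]
# With year-month-day stamps, alphabetical order is also chronological order.
print(sorted(dated))
# ['image1_20151021', 'image1_20151214', 'image1_20160123']

day_first = ["image1_23012016", "image1_21102015", "image1_14122015"]
# The same three dates written day-month-year sort out of chronological order.
print(sorted(day_first))
# ['image1_14122015', 'image1_21102015', 'image1_23012016']
```
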
If the files you are using are created or edited collaboratively, you may want to incorporate names or initials into your file naming conventions so that you know which versions contain updates by each individual on your team. For example:
  • dataset1_20160402_KES
  • dataset1_20160301_WTC
  • dataset1_20160814_GSC
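
A consistent pattern like this also makes file names machine-readable. The sketch below assumes every name follows `dataset_yyyymmdd_initials` exactly; the `VersionInfo` type and the `parse_name` function are hypothetical names chosen for the example.

```python
from datetime import datetime
from typing import NamedTuple


class VersionInfo(NamedTuple):
    dataset: str
    saved_on: datetime
    initials: str


def parse_name(stem):
    """Split a 'dataset_yyyymmdd_initials' name into its three elements."""
    dataset, stamp, initials = stem.split("_")
    return VersionInfo(dataset, datetime.strptime(stamp, "%Y%m%d"), initials)


stems = ["dataset1_20160402_KES", "dataset1_20160301_WTC", "dataset1_20160814_GSC"]
# Sort by the parsed date to find out who saved the most recent version.
versions = sorted((parse_name(stem) for stem in stems), key=lambda v: v.saved_on)
latest = versions[-1]
print(latest.initials, latest.saved_on.date())  # GSC 2016-08-14
```

Because `strptime` rejects anything that is not a valid yyyymmdd date, a sketch like this would also flag names that drift from the agreed convention.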