Data Sharing @ Pitt

Learn about the principles and how-to of sharing academic research data.

Open file formats

Consider the file formats of your data files: for example, Excel spreadsheets (XLSX), comma-separated values (CSV), or Word documents of interview transcripts. File formats vary not only in their content and in how they store information, but also in how tightly they are controlled as intellectual property. Some formats, such as CSV and XML (eXtensible Markup Language), are openly documented, free to use, and readable in a broad variety of software. Other formats, such as Stata’s .dta format, are proprietary and are designed to be used with the format creator’s software. As followers of the FAIR Principles (specifically Accessibility and Interoperability), we would like to minimize the amount of data that we need to offer in proprietary formats and maximize the amount offered in open formats.

A certain class of formats is formally proprietary but functionally open: the Microsoft Office formats (such as DOCX and XLSX) and Adobe's Portable Document Format (PDF). While the companies retain ownership of the IP, they have published documentation that allows developers to implement file reading and writing in environments such as Python, R, the command line, or graphical software. Practically speaking, this means that reviewers and future readers will probably be able to open and analyze your Word documents, Excel spreadsheets, and PDFs. Even so, it is a good idea to also export these rich formats to a more basic, open format such as CSV or plain text (TXT). Such conversions lose rich formatting (fonts, colors, formulas, layout); this is an inevitable tradeoff.
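As an illustration of the conversion described above, here is a minimal sketch that extracts plain text from a Word document using only the Python standard library. It relies on the publicly documented fact that a DOCX file is a ZIP archive whose main text lives in `word/document.xml`. The function name `docx_to_text` is hypothetical, and the sketch ignores headers, footers, footnotes, and all formatting, so treat it as a starting point rather than a complete converter.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path, txt_path):
    """Write the paragraphs of a DOCX file to a UTF-8 text file.

    Hypothetical helper: returns the list of paragraph strings it wrote.
    """
    # A DOCX file is a ZIP archive; the body text is in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)

    # One paragraph per line; all rich formatting is discarded.
    Path(txt_path).write_text("\n".join(paragraphs) + "\n", encoding="utf-8")
    return paragraphs
```

Calling `docx_to_text("interview.docx", "interview.txt")` (file names assumed for illustration) would leave the original document untouched and place a plain-text copy next to it, so both versions can be shared.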

Offering data in more formats also hedges against risk to any one file: for example, if a data set’s Excel file gets corrupted due to “bit rot” (i.e., gradual degradation of stored data) in some distant future, then the data may still be recoverable from the corresponding CSV file.
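Corruption of this kind is only useful to know about if it can be detected. One common way to do that, offered here as a hedged sketch and not as a practice this guide prescribes, is to record a checksum for each shared file and compare it later. The helper names `sha256_of`, `write_manifest`, and `is_intact` are hypothetical; the hashing itself uses Python's standard `hashlib` module.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest_path):
    """Record one 'digest  filename' line per file (hypothetical layout)."""
    lines = [f"{sha256_of(p)}  {Path(p).name}" for p in paths]
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def is_intact(path, expected_hex):
    """True if the file still matches the digest recorded earlier."""
    return sha256_of(path) == expected_hex
```

If `is_intact` returns `False` for the Excel file but `True` for the CSV copy, the CSV is the version to recover from.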

The table below lists some types of information and examples of associated open file formats. If you are sharing information listed in the left column, check whether it is already in a format listed in the right column.

Common open file formats

| Information type | Open file format(s) |
| --- | --- |
| Text | TXT (encoded in UTF-8 or ASCII) |
| Tabular | CSV, TSV |
| Structured and/or tagged document | XML, TEX, HTML |
| Image | PNG, JPEG*, SVG |
| Audio | FLAC, MP3*, WAV |
| Video | H.264 codec in MP4, MKV |
| Archive (collection of files) | ZIP, TAR, GZ |

* Asterisk (*) denotes inherently "lossy" formats, which are less desirable for preservation. Certain formats and codecs, such as H.264 and TIFF, support both lossy and lossless compression.

PDF and Office Suite files are acceptable but ideally accompanied by alternatives such as TXT and CSV.
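The check suggested for the table above can be sketched in a few lines of Python. The mapping below mirrors the table's rows; the inclusion of `.jpg` as an alias for JPEG and the names `OPEN_FORMATS`, `information_type`, and `needs_open_copy` are assumptions made for illustration. The sketch looks only at the file name extension, so it cannot confirm a text file's encoding or which codec an MP4 or MKV file actually contains.

```python
from pathlib import Path

# Extensions taken from the "Common open file formats" table.
# ".jpg" is an assumed alias for JPEG.
OPEN_FORMATS = {
    "Text": {".txt"},
    "Tabular": {".csv", ".tsv"},
    "Structured and/or tagged document": {".xml", ".tex", ".html"},
    "Image": {".png", ".jpeg", ".jpg", ".svg"},
    "Audio": {".flac", ".mp3", ".wav"},
    "Video": {".mp4", ".mkv"},
    "Archive": {".zip", ".tar", ".gz"},
}


def information_type(filename):
    """Return the table's information type for a file, or None if its
    extension is not one of the listed open formats."""
    suffix = Path(filename).suffix.lower()
    for info_type, extensions in OPEN_FORMATS.items():
        if suffix in extensions:
            return info_type
    return None


def needs_open_copy(filenames):
    """List the files whose extension is not in the table, i.e. the
    candidates for an additional export to an open format."""
    return [name for name in filenames if information_type(name) is None]
```

For example, `needs_open_copy(["survey.csv", "survey.xlsx", "model.dta"])` flags the XLSX and DTA files as the ones that would benefit from an accompanying open-format copy.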

💡 A final tip about file formats: in File Explorer (Windows) or Finder (macOS), make sure to enable visibility of file name extensions! Your file listing should show “Document.docx” instead of “Document”, for example. Instructions are available for both Windows and macOS.

More resources for open file formats