Guides: Data Sharing @ Pitt: Open file formats

Open file formats

Consider the file formats of your data files: for example, Excel spreadsheets (XLSX), comma-separated values (CSV), or Word documents (DOCX) of interview transcripts. File formats vary not only in their content and information storage methods, but also in how tightly they are controlled as intellectual property. Some formats, such as CSV and XML (eXtensible Markup Language), are openly documented and readily accessible in a broad variety of software. Other formats, such as Stata’s .dta format, are proprietary and generally require the format creator’s software to use. As followers of the FAIR Principles (specifically Accessibility and Interoperability), we would like to minimize the amount of data that we need to offer in proprietary formats and maximize the amount offered in open formats.

A certain class of formats is formally proprietary but functionally open: the Microsoft Office formats (such as DOCX and XLSX) and Adobe’s Portable Document Format (PDF). While the companies maintain ownership of the IP, they have also shared documentation with the public developer community, which allows file reading and writing to be implemented in environments such as Python, R, the command line, or a graphical software product. Practically speaking, this means that reviewers and future readers will probably be able to access and analyze your Word documents, Excel spreadsheets, and PDFs. Even so, it is a good idea to also export these rich formats to a more basic, open format such as CSV or plain text (TXT). Such conversions discard rich formatting; this is an unavoidable tradeoff.
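As a minimal sketch of what such an export can look like, the Python example below writes tabular values to a UTF-8 encoded CSV file using only the standard library, then reads them back. The rows, column names, and function names are hypothetical and stand in for values taken from a spreadsheet; reading an XLSX file directly would require a third-party library (for example openpyxl or pandas), which is not shown here.

```python
import csv
from pathlib import Path

# Hypothetical rows standing in for values taken from a spreadsheet.
ROWS = [
    ["participant_id", "age", "response"],
    ["P001", 34, "Strongly agree"],
    ["P002", 51, "Neutral; see notes"],
]


def write_csv(rows, path):
    """Write rows to a UTF-8 encoded CSV file and return its path."""
    path = Path(path)
    # newline="" lets the csv module manage line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return path


def read_csv(path):
    """Read a CSV file back as a list of rows; every value is a string."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

Calling `write_csv(ROWS, "survey.csv")` and then `read_csv("survey.csv")` returns the same values, but as plain strings: cell colors, fonts, formulas, and number types do not survive, which is the tradeoff described above.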

Offering data in more than one format also hedges against the loss of any one file: for example, if a data set’s Excel file is corrupted by “bit rot” (i.e., gradual degradation of stored data) at some point in the distant future, the data may still be recoverable from the corresponding CSV file.
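Corruption of this kind is easier to act on if you can detect it. One common approach, offered here as an illustration rather than as part of this guide's recommendations, is to record a checksum for each file when you deposit it and compare it later. The sketch below uses Python's standard `hashlib` module; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Return True if the file's current digest matches the recorded one."""
    return sha256_of(path) == recorded_digest
```

If `is_unchanged` returns False for the Excel file but True for the CSV file, the CSV copy is the one to rely on.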

The table below lists some types of information and examples of associated open file formats. If you are sharing information listed in the left column, check whether it is already in a format listed in the right column.

**Common open file formats**
Information type	Open file format(s)
Text	TXT (encoded in UTF-8 or ASCII)
Tabular	CSV, TSV
Structured and/or tagged document	XML, TEX, HTML
Image	PNG, JPEG*, SVG
Audio	FLAC, MP3*, WAV
Video	H.264 codec in MP4, MKV
Archive (collection of files)	ZIP, TAR, GZ

* An asterisk denotes inherently "lossy" formats, which are less desirable for preservation. Some formats and codecs, such as TIFF and H.264, support both lossy and lossless compression.

PDF and Office Suite files are acceptable but ideally accompanied by alternatives such as TXT and CSV.
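Checking a folder of files against the table can be done by hand, but a short script may help with larger data sets. The Python sketch below classifies files by their extension only. The set of open extensions is taken from the table above (with "jpg" added as the common short form of JPEG); the list of PDF and Office extensions, the category labels, and the function names are assumptions made for illustration.

```python
from pathlib import Path

# Extensions from the table above (lower-case, without the leading dot).
# Note: an extension such as "mp4" names a container, not the codec inside it.
OPEN_EXTENSIONS = {
    "txt", "csv", "tsv", "xml", "tex", "html",
    "png", "jpeg", "jpg", "svg", "flac", "mp3", "wav",
    "mp4", "mkv", "zip", "tar", "gz",
}

# Assumption: common PDF and Office extensions; adjust for your own files.
ACCEPTABLE_EXTENSIONS = {"pdf", "docx", "xlsx", "pptx"}


def classify(filename):
    """Classify a file name as 'open', 'acceptable', or 'check format'."""
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension in OPEN_EXTENSIONS:
        return "open"
    if extension in ACCEPTABLE_EXTENSIONS:
        return "acceptable"  # ideally accompanied by a TXT or CSV copy
    return "check format"


def report(folder):
    """Map each file name directly inside a folder to its classification."""
    return {
        path.name: classify(path.name)
        for path in sorted(Path(folder).iterdir())
        if path.is_file()
    }
```

For example, `classify("survey.dta")` returns `"check format"`, which signals that the file may need an open-format copy before sharing. An extension is only a hint about a file's contents, so treat the result as a starting point rather than a verdict.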

💡 A final tip about file formats: in File Explorer (Windows) or Finder (macOS), make sure to enable the display of file name extensions, so that your file listing shows “Document.docx” instead of “Document”, for example.

More resources for open file formats

Open Data Institute, "Choosing the Right Format for Open Data"
Data Carpentry, "Exporting data" (describes how and why to convert an Excel spreadsheet to a CSV)
US Geological Survey, "File Formats"

University of Pittsburgh Library System
