Answered By: Kathryn Ruddock
Last Updated: Apr 12, 2019

In addition to creating good metadata for a study, it’s important to label, describe and organize the data files and associated documentation so that the data is easier to reuse.

We recommend you use the same practices for labeling, describing and organizing files (including file formats) for all projects in your data deposit.

Labeling Files

By labeling files logically and consistently, both the original creator and future users of the data can more easily identify the contents.

Best practices:

  • Plan ahead with overall architecture and conventions for naming files (consider the order of the characters for logical sorting)
  • Keep names concise but descriptive (fewer than 25 characters)
  • Avoid spaces, dots and special characters (since they can be interpreted as commands in some operating systems)
  • Use underscores or hyphens between words (e.g. Project_Galapagos) or capitalize first letters (e.g. ProjectGalapagos)
  • Format dates consistently (e.g. YYYYMMDD)
  • Include versioning where appropriate
  • Label files independently of folder structure or storage location to avoid ambiguity
  • Maintain a README file with explanations of any abbreviations used in file names

Common elements used in file names:

  • Project name, abbreviation or number
  • Type of data
  • Location
  • Name of creator or initials or research team
  • Version number
  • Creation date
  • File extension
Example: YYYYMMDD_ProjectAbbreviation_FileCategory_Description_VERXX
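
The naming pattern above can be assembled and checked in code. The following is a minimal sketch, not a required tool: the function name build_file_name, its parameters and the sample values (GAL, Survey, FinchCounts) are made up for illustration. It formats the date as YYYYMMDD, joins the elements with underscores and rejects spaces, dots and special characters.

```python
import re
from datetime import date

# Letters, digits, underscores and hyphens only (no spaces, dots or special characters).
_SAFE = re.compile(r"^[A-Za-z0-9_-]+$")


def build_file_name(created, project, category, description, version, extension):
    """Return a name like YYYYMMDD_Project_Category_Description_VERXX.ext."""
    parts = [
        created.strftime("%Y%m%d"),  # consistent date format, sorts chronologically
        project,
        category,
        description,
        f"VER{version:02d}",  # two-digit version number
    ]
    stem = "_".join(parts)
    if not _SAFE.match(stem):
        raise ValueError(f"unsafe characters in file name: {stem!r}")
    return f"{stem}.{extension}"


name = build_file_name(date(2019, 4, 12), "GAL", "Survey", "FinchCounts", 1, "csv")
print(name)  # 20190412_GAL_Survey_FinchCounts_VER01.csv
```

This sketch does not check the length guideline; names that include every element of the pattern can easily run past 25 characters, so abbreviate where you can and explain the abbreviations in the README file.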

Describing files (e.g. metadata, readme files)

By describing files appropriately, users will better understand why and how the data was collected and analyzed, what the files contain and how they relate to each other, and within a file, how the variables are defined.

Best practices:

  • Provide documentation at the study level (e.g. research question, methodology)
  • Provide documentation at the file or database level (e.g. how files relate to each other, software required, explanation of changes between versions)
  • Provide documentation at the variable or item level (e.g. variable names, labels, descriptions, units of measurement) in a separate accompanying codebook or data dictionary
Examples: readme files, study descriptions, protocols, questionnaires, codebooks, data dictionaries
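
One way to start variable-level documentation is to generate an empty data dictionary from a data file's column headers and fill it in by hand. The sketch below assumes the data is CSV text with a header row; the function name codebook_skeleton, the four codebook columns and the sample variables are illustrative choices, not a required layout.

```python
import csv
import io


def codebook_skeleton(data_csv):
    """Return codebook CSV text with one row per column of the data file."""
    header = next(csv.reader(io.StringIO(data_csv)))  # first row holds the variable names
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["variable", "label", "description", "units"])
    for variable in header:
        # Blank cells are placeholders for the researcher to fill in.
        writer.writerow([variable, "", "", ""])
    return out.getvalue()


sample = "site_id,beak_mm,obs_date\nA1,11.2,20190412\n"
print(codebook_skeleton(sample))
```

The result can be saved as its own csv file next to the data, so the codebook stays separate from the data file it describes.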

Organizing files (e.g. tags, hierarchical folder structures)

For studies with multiple datasets and pieces of documentation (e.g. readme file, study description, codebook, etc.), users can more easily identify which files to download when they are grouped in a logical way.

Best practices:

  • Use a standardized list of tags to distinguish between data files and documentation files.
  • For data with a hierarchical folder structure, upload the dataset as a gzip-compressed tar archive (.tar.gz) so that Dataverse does not unpack it on upload, as it does with zip files.
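
A .tar.gz archive can be created with most archiving tools. As one possible approach, the sketch below uses Python's standard tarfile module; the function name pack_dataset and the folder name in the usage note are placeholders.

```python
import tarfile
from pathlib import Path


def pack_dataset(folder, archive):
    """Write `folder` and everything under it to a gzip-compressed tar archive."""
    folder = Path(folder)
    with tarfile.open(archive, "w:gz") as tar:
        # arcname stores paths relative to the dataset folder's own name,
        # so the archive does not record the full path on the local machine.
        tar.add(folder, arcname=folder.name)
    return Path(archive)
```

For example, pack_dataset("ProjectGalapagos", "ProjectGalapagos.tar.gz") would pack a folder named ProjectGalapagos, subfolders included, into a single file. Write the archive outside the folder being packed so that it does not end up inside itself.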

File formats (e.g. non-proprietary, suitable formats for data analysis)

Files in proprietary formats typically can only be opened and read with the software used to create them. This can be a challenge if the software (or the version) is no longer available. In contrast, open or standard formats (in which the format specification is published) can be read by more than one application and are more likely to be readable in the future. One drawback of migrating files to an open format is that some information or quality can be lost in the conversion.

Best practices:

  • Where possible, save data files in a non-proprietary format so they can be read by others in the future.
  • Ideally, develop an accompanying codebook/user guide to support reuse. Check out the Data Documentation Initiative if you are not sure where to start.
  • If it’s not possible or desirable to save in non-proprietary formats, use formats that have widespread adoption by researchers or industry (e.g. SPSS).
Examples of non-proprietary formats: txt, asc, csv, tab, html, xml, pdf, tif, jpeg, mp4, flac
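
Before depositing, it can help to list the files whose format is not among the examples above. This is a rough sketch: the set OPEN_EXTENSIONS simply copies that list, the function name flag_other_formats and the sample file names are invented, and a flagged file is not necessarily proprietary (it only means the extension is not in the list).

```python
from pathlib import Path

# Extensions copied from the examples of non-proprietary formats above; extend as needed.
OPEN_EXTENSIONS = {"txt", "asc", "csv", "tab", "html", "xml", "pdf", "tif", "jpeg", "mp4", "flac"}


def flag_other_formats(file_names):
    """Return the names whose extension is not in OPEN_EXTENSIONS."""
    flagged = []
    for name in file_names:
        extension = Path(name).suffix.lower().lstrip(".")  # ".TIF" becomes "tif"
        if extension not in OPEN_EXTENSIONS:
            flagged.append(name)
    return flagged


print(flag_other_formats(["counts.csv", "survey.sav", "README.txt", "photo.TIF"]))
# ['survey.sav']
```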