The Chicago School Library: Research Data Management: Organizing Data

Why is File Naming Important?

Consistent file naming conventions help you avoid errors or duplication in your research, make your files both machine- and human-readable, and make file sorting and organization easier.

For example, when two lab members give the same sample two different names, the result is confusion and duplicated work.

Being specific and consistent in your file naming makes it easier to quickly read and identify files in a list, and makes it clear what type of information is contained within.

In the example below, each file name contains the date, the experiment (gene expression), and the replicate number, and the .csv extension shows that it is a spreadsheet file.

2016_05_10_gene_expression_rep01.csv

2016_06_01_gene_expression_rep02.csv

2016_06_20_gene_expression_rep03.csv

2016_07_11_gene_expression_rep04.csv

Naming your data files

Before you begin your research, decide on a naming convention for your files. Document the naming convention you choose, and make sure that you and your collaborators follow it. It will save you time and will help others who may use your files in the future. Best practices include:

Give files a meaningful name. A file name might include a combination of elements, such as type of equipment used, date, and researcher's surname. Decide on the best order for elements in a file name; it will affect how the files are sorted.
Keep names a reasonable length; some applications won't work well with long file names. A maximum of 25 characters is a good rule of thumb.
To separate elements in a file name, consider using underscores (_) or hyphens (-). Avoid using blank spaces in a file name. Use periods only to separate the file name from the file type extension (.txt, .jpg, etc.)
If including date as part of the file name, use the standard format yyyymmdd to ensure that files sort in chronological order.
If your file name will include a numerical component, such as a subject number or version number, use leading zeros (001, 002, etc.) so that files sort in sequential order.
Avoid special characters like ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “
Account for versions. The US Geological Survey recommends the following: Include a number behind the file name to indicate the version, e.g.:
- Bisondata_1.0 = original document
- Bisondata_1.1 = original document with minor revisions
- Bisondata_2.0 = document with substantial revisions
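
The practices above can be combined in a small helper that assembles file names the same way every time. This is only a sketch: the function name `build_file_name`, the order of elements, and the `rep` prefix are assumptions chosen to mirror the gene expression example earlier in this guide, not a required convention.

```python
from datetime import date


def build_file_name(when: date, description: str, replicate: int, extension: str) -> str:
    """Assemble a name such as 20160510_gene_expression_rep01.csv."""
    # yyyymmdd comes first so that files sort in chronological order.
    date_part = when.strftime("%Y%m%d")
    # Blank spaces in the description are replaced with underscores.
    description_part = "_".join(description.split())
    # Leading zeros keep replicate numbers in sequential order.
    replicate_part = f"rep{replicate:02d}"
    # The only period is the one before the file type extension.
    return f"{date_part}_{description_part}_{replicate_part}.{extension.lstrip('.')}"


print(build_file_name(date(2016, 5, 10), "gene expression", 1, "csv"))
# prints 20160510_gene_expression_rep01.csv
```

Generating names from one function, rather than typing them by hand, is one way to make sure everyone on a project follows the documented convention.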

More considerations for naming files can be found at these websites:

Best practices for file naming (Stanford University Libraries - Data Management Services)
File naming and versioning (University of Wisconsin Research Data Services)
File naming conventions (Purdue University Libraries - Data Management for Graduate Researchers)
File management (Cornell University Research Data Management Service Group)

Versioning Data Files

Versioning refers to saving new copies of your files when you make changes so that you can go back and retrieve specific versions of your files later. Saving multiple versions makes it possible to decide at a later time that you prefer an earlier version. You can then immediately revert to that version instead of having to retrace your steps to recreate it.

In its most basic form, versioning relies on a sequential numbering system. Within a given version number category (major, minor), these numbers are generally assigned in increasing order and correspond to changes in the data. The US Geological Survey recommends the following structure:

DataFileName_1.0 = original document

DataFileName_1.1 = original document with minor revisions

DataFileName_2.0 = document with substantial revisions
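
As an illustration of the major/minor scheme, the sketch below computes the next version of a name. The helper `bump_version` is hypothetical, and it assumes the version always follows the last underscore in the form major.minor, as in the examples above.

```python
def bump_version(file_name: str, major: bool = False) -> str:
    """Return file_name with its trailing major.minor version increased."""
    # Split "DataFileName_1.0" into the stem and the version after the last underscore.
    stem, _, version = file_name.rpartition("_")
    major_num, minor_num = (int(part) for part in version.split("."))
    if major:
        # Substantial revisions: increase the major number and reset the minor number.
        major_num, minor_num = major_num + 1, 0
    else:
        # Minor revisions: increase only the minor number.
        minor_num += 1
    return f"{stem}_{major_num}.{minor_num}"


print(bump_version("DataFileName_1.0"))              # prints DataFileName_1.1
print(bump_version("DataFileName_1.1", major=True))  # prints DataFileName_2.0
```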

The ETDplus project, led by the Educopia Institute, offers additional guidance for version control. Versioning should be taken into account when developing the folder and file naming structure. The following guidance is taken from the ETDplus brief on version control, available on the project site:

At the beginning of a research project, it is important to create a stable folder structure in which you can organize materials. The specific folders will depend on your own research process. File organization could be based on how you plan to gather materials, which experiment or process generated them when they were created, or other strategies. The key is to use folders that make sense to you and allow you to find your materials easily.

A simple method to designate a revision is to note it at the end of the file name. This way, files can be grouped by their name and sorted by version number. For example:

image1_v1.jpg
image1_v2.jpg
image2_v1.jpg
image2_v2.jpg

If you use version numbers, one issue that can arise is that computers sort file names character by character rather than by numeric value, so v10 is listed before v2. This can lead to strange, unhelpful results. For example:

image1_v1.jpg
image1_v10.jpg
image1_v2.jpg
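
The ordering above can be reproduced in a few lines of Python. This sketch also shows one possible workaround, sorting on the numeric value of the version; the helper `version_number` and its `_v<digits>.` pattern are assumptions made for this example.

```python
import re

names = ["image1_v1.jpg", "image1_v2.jpg", "image1_v10.jpg"]

# Default sorting compares names character by character, so "v10" lands before "v2".
alphabetical = sorted(names)


def version_number(name: str) -> int:
    """Extract the integer that follows '_v' in a file name."""
    match = re.search(r"_v(\d+)\.", name)
    if match is None:
        raise ValueError(f"no version number found in {name!r}")
    return int(match.group(1))


# Sorting on the numeric value restores the intended order.
numeric = sorted(names, key=version_number)

print(alphabetical)  # ['image1_v1.jpg', 'image1_v10.jpg', 'image1_v2.jpg']
print(numeric)       # ['image1_v1.jpg', 'image1_v2.jpg', 'image1_v10.jpg']
```

A custom sort only helps inside your own scripts; file explorers and other tools may still use the character-by-character order.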

A good practice that can help you avoid these problems is to use dates in place of version numbers. If you choose this strategy, format dates as year-month-day (20150930). This order avoids confusion when collaborating with researchers or systems that use day-month-year or month-day-year, and it lets your computer sort versions in chronological order. For example:

image1_20151021
image1_20151214
image1_20160123

If the files you are using are created or edited collaboratively, you may want to incorporate names or initials into your file naming conventions so that you know which versions contain updates by each individual on your team. For example:

dataset1_20160402_KES
dataset1_20160301_WTC
dataset1_20160814_GSC

Formatting Dates

Date formats can vary between countries. The most common confusion is between the United States and European formats:

US - April 8, 2021, or 04/08/2021 vs. European - 8th April, 2021 or 08/04/2021.

Choosing a standard format for dates, and using a numerical notation, will help avoid confusion and errors.

ISO 8601 is the best standard for date formats:

YYYY-MM-DD = 2021-04-08 or 20210408

You can also break this down further with time notation if needed:

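YYYY-MM-DDTHH:MM:SS = 2021-04-08T15:21:09, or YYYYMMDDTHHMMSS = 20210408T152109 (the second form avoids colons, which are not allowed in file names on some systems)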
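
ISO 8601 dates and times can be generated with Python's standard `datetime` module rather than typed by hand. The sketch below first converts the US and European dates from the example above into the same ISO date, then formats a timestamp; the helper name `to_iso_date` is an assumption for illustration.

```python
from datetime import datetime


def to_iso_date(text: str, source_format: str) -> str:
    """Parse a date written in a known local format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, source_format).date().isoformat()


# The same day written in US (month/day/year) and European (day/month/year) order.
us = to_iso_date("04/08/2021", "%m/%d/%Y")
eu = to_iso_date("08/04/2021", "%d/%m/%Y")
print(us, eu)  # prints 2021-04-08 2021-04-08

stamp = datetime(2021, 4, 8, 15, 21, 9)
compact_date = stamp.strftime("%Y%m%d")    # 20210408
extended = stamp.isoformat()               # 2021-04-08T15:21:09
compact = stamp.strftime("%Y%m%dT%H%M%S")  # 20210408T152109, no colons
print(compact_date, extended, compact)
```

Note that the source format has to be known in advance: 04/08/2021 on its own does not say which number is the month.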

Learn more about ISO 8601 here.

Extensibility

As the consistent file naming example above shows, extensible file names help organize and sort files with numerical content. Most of us have opened a folder in a file explorer, found the numbered files out of order, and had to hunt for the one we needed. The answer to that is extensibility!

When creating your file naming structure, think about whether you will be producing image outputs or other ordered content, and plan for that. If you know you will have hundreds or even thousands of files, building in enough placeholder digits (leading zeros) from the start will allow you to easily order and find your files.

**Example of file extensibility**
Good	Bad!
AtherRat_ex001_lipitor.tif	AtherRat_ex1_lipitor.tif
AtherRat_ex002_lipitor.tif	AtherRat_ex10_lipitor.tif
AtherRat_ex003_lipitor.tif	AtherRat_ex2_lipitor.tif
AtherRat_ex004_lipitor.tif	AtherRat_ex3_lipitor.tif
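
If files have already been named without leading zeros, the numbers can be padded afterwards. The following is a minimal sketch, assuming the number always follows `_ex` as in the table above; `pad_experiment_number` is a hypothetical helper, and it only computes new names without renaming anything on disk.

```python
import re


def pad_experiment_number(name: str, width: int = 3) -> str:
    """Zero-pad the digits that follow '_ex' to a fixed width."""
    return re.sub(r"(?<=_ex)\d+", lambda m: m.group(0).zfill(width), name)


bad = [
    "AtherRat_ex1_lipitor.tif",
    "AtherRat_ex10_lipitor.tif",
    "AtherRat_ex2_lipitor.tif",
    "AtherRat_ex3_lipitor.tif",
]

# After padding, ordinary sorting puts ex010 after ex003.
good = sorted(pad_experiment_number(name) for name in bad)
for name in good:
    print(name)
```

Choose the width from the largest number you expect: three digits cover up to 999 files, four digits up to 9999.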

Choosing a File Format

The format of the electronic data files you work with during your research may be determined by the research equipment and computer hardware and software that you have access to. However, for long-term preservation and ease of sharing, best practices may dictate that the files be converted to a different format after your project has ended. Give some thought to this eventuality at the outset. Considerations include:

Will your data be in a format that requires proprietary software to access it?
If you will be depositing your data in a repository at the end of your project, does the repository have specific guidelines or requirements with respect to file format?
What features of your data might be lost or modified in the conversion to another file format?

Stanford University Libraries - Data Management Services provides a useful overview of preferred file formats. From the Stanford resource:

Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Geospatial: SHP, DBF, GeoTIFF, NetCDF
Moving images: MOV, MPEG, AVI, MXF
Sounds: WAVE, AIFF, MP3, MXF
Statistics: ASCII, DTA, POR, SAS, SAV
Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
Tabular data: CSV
Text: XML, PDF/A, HTML, ASCII, UTF-8
Web archive: WARC

Additional helpful guidelines for selecting file formats can be found at these websites:

Choosing formats (Cambridge University Libraries - Data Management)
File formats (Cornell University - Research Data Management Service Group)
Library of Congress File Formats (Library of Congress)

Open Formats for Data

Best practice for preservation is to save your data in preservation formats. These four formats are the gold standard for making sure your data will be available over the long term, as they can be opened and viewed on any operating system with widely available software. They are:

XML: Extensible markup language -- this is designed for simplicity, generality and usability across the internet and can be used to save documents or web service content
CSV: Comma separated values -- this is an ideal way to save spreadsheets in a preservation format. Excel, Google Sheets and other spreadsheet applications can open CSV files
PDF: Ideal for saving documents in perpetuity. Note that PDFs are not easily editable, and should be used to freeze a document in time that will not be changed.
TIFF: Tagged image file format -- the gold standard for saving image files. TIFFs are preservation-ready, and will preserve the quality of images over time.

Other Tips for File Naming

Avoid using special characters in your file names:

~ ! @ # $ % ^ & * ( ) ’ “ ; < > ? { } [ ]

Some operating systems and applications will not allow these characters in names; avoid them even where they are allowed. Special characters can be misinterpreted by coding or scripting languages and cause errors.
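
A quick check for problem characters can be scripted. The sketch below takes a stricter approach than the list above: instead of listing every special character, it flags anything that is not a letter, digit, underscore, hyphen, or period. That allow-list, and the helper name `unsafe_characters`, are assumptions for illustration rather than a rule from this guide.

```python
import re

# Anything outside letters, digits, underscore, period, and hyphen is flagged.
UNSAFE = re.compile(r"[^A-Za-z0-9_.-]")


def unsafe_characters(name: str) -> list:
    """Return the distinct flagged characters found in a file name, sorted."""
    return sorted(set(UNSAFE.findall(name)))


print(unsafe_characters("2018_09_20_mouse_weight.csv"))  # prints []
print(unsafe_characters("mouse weight (final)!.csv"))    # prints [' ', '!', '(', ')']
```

Because it works from an allow-list, this check also flags blank spaces and accented letters, which may or may not be what your project wants.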

Avoid using abbreviations in your file names. This leads to confusion and makes the file name difficult to read. You will probably forget the abbreviation you created! Make file names clear and human-readable.

BAD: msewt.csv

GOOD: 2018_09_20_mouse_weight.csv

File Naming Software Resources

Starting fresh with a new project and developing a file naming scheme is the best way to save time and aggravation. But if you need to clean up an existing file structure, there are tools that can help and make the job less time-consuming. No endorsement is implied for any specific tool below; this is simply a list of available options, and there may be others.