Benchmark project structure#
This page describes the expected folder structure and file naming conventions for pose estimation benchmark datasets.
Note
We mark requirements with italicised keywords that should be interpreted as described by the Network Working Group in RFC 2119. In decreasing order of requirement, these are: must, should, and may.
Overview#
A benchmark dataset is organised into a Train and a Test split.
Each split contains one or more projects (i.e. datasets contributed by different groups).
Each project contains one or more sessions.
A session centres on a single video file (the session video), from which frames (individually sampled images) and optionally clips (short video segments) are extracted.
Frames and clips are accompanied by label files in COCO keypoints format.
The current scope is limited to single-animal pose estimation from a single camera view. Support for multi-camera setups is planned for a future version.
Folder structure#
.
├── Train/
│   └── <ProjectName>/
│       └── sub-<subjectID>_ses-<sessionID>/
│           ├── Frames/
│           │   ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_frame-<frameID>.png
│           │   ├── ...
│           │   └── sub-<subjectID>_ses-<sessionID>_cam-<camID>_framelabels.json
│           ├── Clips/ (optional)
│           │   ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>.mp4
│           │   ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_cliplabels.json
│           │   └── ...
│           └── sub-<subjectID>_ses-<sessionID>_cam-<camID>.mp4
└── Test/
    └── <ProjectName>/
        └── sub-<subjectID>_ses-<sessionID>/
            ├── Frames/
            │   ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_frame-<frameID>.png
            │   └── ...
            ├── Clips/ (optional)
            │   ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>.mp4
            │   ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_startlabels.json
            │   └── ...
            └── sub-<subjectID>_ses-<sessionID>_cam-<camID>.mp4
Note
The Test split follows the same structure as Train, but includes different label files (see Label format for details).
Train / Test#
The top level must contain a Train and a Test folder.
Each split must contain at least one project folder.
Each session must belong to exactly one split.
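The rule that a session belongs to exactly one split can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a session is identified by its project folder name together with its session folder name.

```python
from collections import defaultdict
from pathlib import Path


def find_split_conflicts(session_paths):
    """Return (project, session) pairs that appear in more than one split.

    `session_paths` is an iterable of relative paths of the form
    "<Split>/<ProjectName>/<session folder>".
    Assumption: a session is identified by (project name, session folder name).
    """
    splits_by_session = defaultdict(set)
    for path in session_paths:
        split, project, session = Path(path).parts[:3]
        splits_by_session[(project, session)].add(split)
    return sorted(
        key for key, splits in splits_by_session.items() if len(splits) > 1
    )


def list_session_paths(root):
    """Collect session folders under `root` as "<Split>/<Project>/<Session>"."""
    root = Path(root)
    return [
        p.relative_to(root).as_posix()
        for p in root.glob("*/*/sub-*_ses-*")
        if p.is_dir()
    ]
```

On a real dataset, `find_split_conflicts(list_session_paths(root))` should return an empty list.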
Project#
Each project must have exactly one project-level folder within a given split.
The project folder name should be descriptive and without spaces (e.g. SWC-plusmaze, IBL-headfixed, AIND-openfield).
Session#
Each session must have exactly one session-level folder within a project.
Session folder names must be formatted as sub-<subjectID>_ses-<sessionID>.
<subjectID> and <sessionID> must be strictly alphanumeric (i.e. only A-Z, a-z, 0-9).
A session folder must contain exactly one session video file at its root.
A session folder must contain a Frames folder.
A session folder may contain a Clips folder.
Examples
valid: sub-M708149_ses-20200317, sub-001_ses-01
invalid:
mouse-M708149_ses-20200317: the first key should be sub.
sub-M708149_20200317: missing the ses key.
sub-M70_8149_ses-20200317: underscores are not allowed within values (ambiguous parsing).
sub-M70-8149_ses-2020-03-17: hyphens are not allowed within values (ambiguous parsing).
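The naming rule for session folders can be expressed as a regular expression. This is a minimal sketch rather than an official validator; the names SESSION_FOLDER_RE and parse_session_folder are hypothetical.

```python
import re

# sub-<subjectID>_ses-<sessionID>, with strictly alphanumeric (ASCII) values.
SESSION_FOLDER_RE = re.compile(r"sub-([A-Za-z0-9]+)_ses-([A-Za-z0-9]+)")


def parse_session_folder(name):
    """Return (subjectID, sessionID), or None if the folder name is invalid."""
    match = SESSION_FOLDER_RE.fullmatch(name)
    return match.groups() if match else None
```

Using fullmatch rather than match ensures that trailing characters, such as an extra hyphenated date component, cause the name to be rejected.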
Session video#
All video files (session videos and clips) should be in MP4 format (H.264 codec, yuv420p pixel format). Contributors should re-encode their videos to this format before submission (see SLEAP documentation for guidance).
Session video filenames must follow the pattern:
sub-<subjectID>_ses-<sessionID>_cam-<camID>.mp4.
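One possible way to re-encode a video to H.264 with the yuv420p pixel format is sketched below. It assumes the ffmpeg command-line tool is installed and built with the libx264 encoder. The helper names are hypothetical, and quality settings are left at ffmpeg's defaults; the SLEAP documentation should be consulted for recommended values.

```python
import subprocess


def build_reencode_command(src, dst):
    """Build an ffmpeg command that re-encodes `src` to H.264 / yuv420p."""
    return [
        "ffmpeg",
        "-y",                   # overwrite the output file if it exists
        "-i", str(src),         # input video
        "-c:v", "libx264",      # H.264 video codec
        "-pix_fmt", "yuv420p",  # pixel format
        str(dst),               # output path, expected to end in .mp4
    ]


def reencode(src, dst):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_reencode_command(src, dst), check=True)
```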
Frames#
The Frames folder contains individually sampled images. In the Train split, it also contains a label file with keypoint annotations.
Frames must be extracted from the session video.
Frame images should be in PNG format (.png). JPEG format (.jpg or .jpeg) may also be used.
Frame image filenames must follow the pattern: sub-<subjectID>_ses-<sessionID>_cam-<camID>_frame-<frameID>.<ext>, where <ext> is png, jpg, or jpeg.
<frameID> must be the 0-based index of the frame in the session video.
<frameID> must be padded to a consistent width across all frame files within a session (e.g. 0000, 1000).
In the Train split, a single label file must be provided per camera view, named sub-<subjectID>_ses-<sessionID>_cam-<camID>_framelabels.json. At present, only one camera view is included, so each Frames folder in the Train split contains exactly one such label file. See Frame labels for details.
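Consistent zero-padding is easy to get wrong when frames are exported by hand. The sketch below builds frame filenames for a session. The function name is hypothetical, and it assumes the padding width is taken from the largest sampled frame index; a project could equally pad to the width of the last frame index in the video.

```python
def frame_filenames(subject, session, cam, frame_indices, ext="png"):
    """Build frame image filenames with a consistent zero-padded <frameID>.

    Assumption: the padding width is the number of digits in the largest
    index in `frame_indices` (0-based indices into the session video).
    """
    width = len(str(max(frame_indices)))
    stem = f"sub-{subject}_ses-{session}_cam-{cam}"
    return [
        f"{stem}_frame-{index:0{width}d}.{ext}"
        for index in sorted(frame_indices)
    ]
```

For example, `frame_filenames("M708149", "20200317", "topdown", [1000, 2300, 19800])` pads every index to five digits, so the first filename ends in frame-01000.png.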
Clips#
A session may include a Clips folder containing short video segments and their label files.
Clips must be extracted from the session video and must have the same file format.
Clip filenames must follow the pattern: sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>.mp4.
<frameID> in the start field must be the 0-based index of the first frame of the clip in the session video, padded to a consistent width (e.g. 0500, 1000).
<nFrames> in the dur field must be the duration of the clip in number of frames (e.g. 5, 30).
A single label file must be provided per clip:
In the Train split, the file is named sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_cliplabels.json and contains keypoint annotations for every frame in the clip. See Clip labels for details.
In the Test split, the file is named sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_startlabels.json and contains keypoint annotations only for the first frame of the clip. See Clip start labels for details.
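The start and dur fields together determine which session-video frames a clip covers, and the split determines the name of its label file. A minimal sketch, with the hypothetical names CLIP_RE and clip_info:

```python
import re

CLIP_RE = re.compile(
    r"sub-[A-Za-z0-9]+_ses-[A-Za-z0-9]+_cam-[A-Za-z0-9]+"
    r"_start-(?P<start>[0-9]+)_dur-(?P<dur>[0-9]+)\.mp4"
)

# Label-file suffix expected in each split.
LABEL_SUFFIX = {"Train": "cliplabels", "Test": "startlabels"}


def clip_info(filename, split):
    """Return the frame range covered by a clip and its label file name."""
    match = CLIP_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a valid clip filename: {filename!r}")
    start = int(match["start"])
    dur = int(match["dur"])
    return {
        "first_frame": start,           # 0-based index in the session video
        "last_frame": start + dur - 1,  # inclusive
        "label_file": f"{filename[:-len('.mp4')]}_{LABEL_SUFFIX[split]}.json",
    }
```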
File naming#
All filenames follow a key-value pair convention, similar to the BIDS standard and NeuroBlueprint.
Filenames must consist of key-value pairs separated by underscores, with keys and values separated by hyphens. A filename may end with an additional suffix (not a key-value pair) before the extension:
<key>-<value>_<key>-<value>.<extension>
<key>-<value>_<key>-<value>_<suffix>.<extension>
The recognised suffixes are:
framelabels for frame label files.
cliplabels for clip label files.
startlabels for clip start label files.
The following keys are used:
| Key | Description | Examples |
| --- | --- | --- |
| sub | Subject identifier | sub-001, sub-M708149 |
| ses | Session identifier | ses-02, ses-25, ses-20200317 |
| cam | Camera identifier | cam-topdown, cam-side2 |
| frame | 0-based frame index in the session video | frame-0000, frame-0500, frame-1000 |
| start | 0-based frame index of the first frame of a clip in the session video | start-0000, start-0500, start-1000 |
| dur | Clip duration in number of frames | dur-5, dur-30 |

The keys sub, ses, and cam must appear in every filename, in that order.
Key values must be strictly alphanumeric for sub, ses and cam (i.e. only A-Z, a-z, 0-9).
Key values must be strictly numeric for frame, start and dur (i.e. only 0-9).
Filenames must not contain spaces.
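The key-value convention makes filenames straightforward to parse. The sketch below is one possible parser and validator, not a reference implementation; the function and constant names are hypothetical, and it assumes that no keys other than those listed above are permitted.

```python
import re

ALNUM_KEYS = ("sub", "ses", "cam")
NUMERIC_KEYS = ("frame", "start", "dur")
SUFFIXES = ("framelabels", "cliplabels", "startlabels")


def parse_filename(filename):
    """Split a filename into (key-value pairs, suffix or None, extension).

    Raises ValueError if the filename breaks the naming rules.
    """
    if " " in filename:
        raise ValueError("filenames must not contain spaces")
    stem, _, extension = filename.rpartition(".")
    parts = stem.split("_")
    suffix = parts.pop() if parts[-1] in SUFFIXES else None
    pairs = {}
    for part in parts:
        key, hyphen, value = part.partition("-")
        if not hyphen or key in pairs:
            raise ValueError(f"malformed or repeated field: {part!r}")
        if key in ALNUM_KEYS:
            pattern = r"[A-Za-z0-9]+"
        elif key in NUMERIC_KEYS:
            pattern = r"[0-9]+"
        else:
            raise ValueError(f"unknown key: {key!r}")  # assumption: closed key set
        if re.fullmatch(pattern, value) is None:
            raise ValueError(f"invalid value for {key!r}: {value!r}")
        pairs[key] = value
    # sub, ses and cam must come first, in that order.
    if tuple(pairs)[:3] != ALNUM_KEYS:
        raise ValueError("filename must start with sub, ses and cam, in that order")
    return pairs, suffix, extension
```

Values are returned as strings so that zero-padding in frame and start is preserved.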
Label format#
The Train split includes ground-truth keypoint annotations both for the sampled frames (framelabels.json) and for entire clips (cliplabels.json), if present.
The Test split includes keypoint annotations only for the first frame of each clip (startlabels.json), if clips are present. Labels for frames and entire clips are withheld to support evaluation of pose estimation and point tracking methods.
Labels must be stored in the same folder as the corresponding frames or clips.
Labels must be stored in COCO keypoints format, with additional requirements described below. Each label file is a JSON file with images, annotations, and categories arrays. Image, annotation and category id values must be unique integers within a label file.
Note
Annotation and category id values should be 1-indexed. This convention follows sleap-io’s save_coco function and avoids conflicts with models that treat category 0 as background.
Image id values are always 0-indexed. The indexing origin differs for frame labels and clip labels, and clip start labels follow the same conventions as clip labels. Details are provided below.
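The id requirements can be checked on a loaded label file. The sketch below is illustrative: check_ids is a hypothetical name, and it assumes that uniqueness applies within each of the three arrays separately (as in standard COCO files, where an image and an annotation may share the same id value).

```python
def check_ids(labels):
    """Return a list of problems with the id values in a label file.

    `labels` is the parsed JSON content (a dict with "images",
    "annotations" and "categories" arrays).
    """
    problems = []
    for section in ("images", "annotations", "categories"):
        ids = [entry["id"] for entry in labels[section]]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not all(isinstance(i, int) and not isinstance(i, bool) for i in ids):
            problems.append(f"{section}: id values must be integers")
        elif len(set(ids)) != len(ids):
            problems.append(f"{section}: id values must be unique")
    for section in ("annotations", "categories"):
        if any(entry["id"] == 0 for entry in labels[section]):
            problems.append(f"{section}: id values should be 1-indexed")
    return problems
```

A typical use would be `check_ids(json.load(f))` on an opened label file, with an empty list indicating that no problems were found.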
Frame labels (framelabels.json)#
Frame labels must only exist in the Train split.
Within the Frames folder, there must be one frame label file per camera view, named sub-<subjectID>_ses-<sessionID>_cam-<camID>_framelabels.json.
Each entry in the images array must have an id equal to the 0-based frame index in the session video (matching the <frameID> in the corresponding image filename).
Each entry in the images array must have a file_name that exactly matches the name of an existing frame image in the Frames folder (including the extension).
Example
For a session with 5 labelled frames sampled from different parts of the video, the images array would be:
[
{"id": 1000, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01000.png", "width": 1300, "height": 1028},
{"id": 2300, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-02300.png", "width": 1300, "height": 1028},
{"id": 3500, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-03500.png", "width": 1300, "height": 1028},
{"id": 7200, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-07200.png", "width": 1300, "height": 1028},
{"id": 19800, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-19800.png", "width": 1300, "height": 1028}
]
Here each id is the 0-based frame index in the session video (matching the <frameID> in the filename), and each file_name includes the .png extension.
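The two requirements on images entries in a frame label file (id matches the frame index in file_name, and file_name names an existing image) can be verified as sketched below. The names FRAME_RE and check_frame_images are hypothetical; the set of existing filenames is passed in explicitly so the check does not depend on how the folder is listed.

```python
import re

# Captures <frameID> from a frame image filename (extension required).
FRAME_RE = re.compile(r"_frame-([0-9]+)\.(?:png|jpg|jpeg)$")


def check_frame_images(images, frame_files):
    """Return a list of problems found in a frame label file's images array.

    `images` is the parsed "images" array; `frame_files` is the set of
    image filenames present in the Frames folder.
    """
    problems = []
    for entry in images:
        name = entry["file_name"]
        match = FRAME_RE.search(name)
        if match is None:
            problems.append(f"{name}: not a frame image filename")
        elif int(match.group(1)) != entry["id"]:
            problems.append(f"{name}: id {entry['id']} does not match <frameID>")
        if name not in frame_files:
            problems.append(f"{name}: no such file in the Frames folder")
    return problems
```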
Clip labels (cliplabels.json)#
Clip labels must only exist in the Train split.
If a Clips folder is present, there must be one clip label file per clip, named sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_cliplabels.json.
The images array must contain an entry for every frame in the clip, in consecutive, monotonically increasing order (covering the entire clip duration).
Clip labels follow the same COCO keypoints format as frame labels, but with different conventions for image id and file_name values:
Each image id must be the 0-based index of the frame within the clip (i.e. 0, 1, 2, …), not the index in the session video.
Each file_name must follow the same pattern as frame image filenames, but without the extension. The frame field in the file_name must correspond to the index of that frame in the session video.
This means that each entry in the images array encodes two pieces of information: the id gives the local position within the clip, while the frame field in file_name gives the global position in the session video. Note that in both cases the indices are 0-based.
Example
For a clip starting at frame 1000 with a duration of 5 frames, the images array would be:
[
{"id": 0, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-1000", "width": 1300, "height": 1028},
{"id": 1, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-1001", "width": 1300, "height": 1028},
{"id": 2, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-1002", "width": 1300, "height": 1028},
{"id": 3, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-1003", "width": 1300, "height": 1028},
{"id": 4, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-1004", "width": 1300, "height": 1028}
]
Here id: 0 through id: 4 are the local clip indices, while frame-1000 through frame-1004 in the file_name values refer to the original frame positions in the session video.
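The mapping between local clip indices and session-video indices can be generated programmatically. In this sketch the function name is hypothetical, and the frame field is assumed to be padded to the same width as the clip's start field.

```python
def clip_images(stem, start_field, dur, width, height):
    """Build the images array for a clip label file.

    `stem` is "sub-<subjectID>_ses-<sessionID>_cam-<camID>", `start_field`
    is the zero-padded start value as a string (e.g. "1000"), and `dur` is
    the number of frames in the clip.
    Assumption: the frame field reuses the padding width of `start_field`.
    """
    start = int(start_field)
    pad = len(start_field)
    return [
        {
            "id": local,  # 0-based index within the clip
            # The frame field is the 0-based index in the session video.
            "file_name": f"{stem}_frame-{start + local:0{pad}d}",
            "width": width,
            "height": height,
        }
        for local in range(dur)
    ]
```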
Clip start labels (startlabels.json)#
Clip start labels must only exist in the Test split.
If a Clips folder is present, there must be one clip start label file per clip, named sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_startlabels.json.
Clip start labels provide keypoint annotations for the first frame of the clip only. They are intended for point-tracker evaluation, where the annotated points serve as the initial positions from which a tracker should propagate.
Clip start labels are identical to Clip labels, except that the images array must contain exactly one entry corresponding to the first frame of the clip, and therefore must have id: 0.
Example
For a clip starting at frame 1000 with a duration of 5 frames, the images array would be:
[
{"id": 0, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-1000", "width": 1300, "height": 1028}
]
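When full clip labels already exist, clip start labels can be derived from them by keeping only the first frame. This is a sketch of one possible approach, with the hypothetical name to_start_labels; it relies on the standard COCO image_id field that links each annotation to its image.

```python
def to_start_labels(clip_labels):
    """Derive clip start labels from parsed clip labels.

    Keeps the single image with id 0 and the annotations attached to it;
    all other top-level content (e.g. categories) is copied unchanged.
    """
    first = [image for image in clip_labels["images"] if image["id"] == 0]
    if len(first) != 1:
        raise ValueError("expected exactly one image with id 0")
    return {
        **clip_labels,
        "images": first,
        "annotations": [
            annotation
            for annotation in clip_labels["annotations"]
            if annotation["image_id"] == 0
        ],
    }
```

The input dictionary is not modified; a new dictionary is returned.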
Visibility encoding#
Keypoint visibility must use ternary encoding:
0: not labelled
1: labelled but not visible (occluded)
2: labelled and visible
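In COCO keypoints format, each annotation stores its keypoints as a flat list of (x, y, visibility) triplets. The sketch below decodes such a list and validates the visibility flags; the function names are hypothetical.

```python
VISIBILITY = {
    0: "not labelled",
    1: "labelled but not visible (occluded)",
    2: "labelled and visible",
}


def decode_keypoints(flat):
    """Turn a flat [x1, y1, v1, x2, y2, v2, ...] list into (x, y, v) tuples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints length must be a multiple of 3")
    triplets = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    for _, _, visibility in triplets:
        if visibility not in VISIBILITY:
            raise ValueError(f"invalid visibility flag: {visibility!r}")
    return triplets


def num_labelled(flat):
    """Count keypoints with visibility > 0 (occluded or visible)."""
    return sum(1 for _, _, v in decode_keypoints(flat) if v > 0)
```

The count returned by num_labelled corresponds to what COCO stores in an annotation's num_keypoints field.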
Example#
Below is a concrete example project structure:
.
├── Train/
│   └── SWC-plusmaze/
│       └── sub-M708149_ses-20200317/
│           ├── Frames/
│           │   ├── sub-M708149_ses-20200317_cam-topdown_frame-01000.png
│           │   ├── sub-M708149_ses-20200317_cam-topdown_frame-02300.png
│           │   ├── sub-M708149_ses-20200317_cam-topdown_frame-03500.png
│           │   ├── sub-M708149_ses-20200317_cam-topdown_frame-07200.png
│           │   ├── sub-M708149_ses-20200317_cam-topdown_frame-19800.png
│           │   └── sub-M708149_ses-20200317_cam-topdown_framelabels.json
│           ├── Clips/
│           │   ├── sub-M708149_ses-20200317_cam-topdown_start-1000_dur-5.mp4
│           │   └── sub-M708149_ses-20200317_cam-topdown_start-1000_dur-5_cliplabels.json
│           └── sub-M708149_ses-20200317_cam-topdown.mp4
└── Test/
    └── SWC-plusmaze/
        └── sub-M235678_ses-20210415/
            ├── Frames/
            │   ├── sub-M235678_ses-20210415_cam-topdown_frame-00500.png
            │   ├── sub-M235678_ses-20210415_cam-topdown_frame-01200.png
            │   ├── sub-M235678_ses-20210415_cam-topdown_frame-04800.png
            │   ├── sub-M235678_ses-20210415_cam-topdown_frame-09100.png
            │   └── sub-M235678_ses-20210415_cam-topdown_frame-15300.png
            ├── Clips/
            │   ├── sub-M235678_ses-20210415_cam-topdown_start-0500_dur-5.mp4
            │   └── sub-M235678_ses-20210415_cam-topdown_start-0500_dur-5_startlabels.json
            └── sub-M235678_ses-20210415_cam-topdown.mp4