5. Data Storage
SPADES leverages the AWS S3 key-value store for scalable file storage. All files stored in S3 are encrypted with AES-256 to ensure data security. SPADES supports two file formats for file uploads and processing:
- mHealth files
- Actigraph GT3X files
mHealth Format for Storing Annotated Physical Activity Data
This document describes a data format (referred to as “mHealth Format”) for storing annotated physical activity data, in particular when the data consists of raw accelerometer signals and other low and high sampling rate sensors. Annotated data are data that are labeled with start/end times for various activities, states, or contexts.
The mHealth Format is intended to be:
- flexible/extensible so that new data types can be added as needed,
- easy to understand,
- easy to manage and share for research purposes,
- easy to break into subsets of data or recombine (e.g., by day or hour),
- conducive to storing data collected in both short lab sessions and over many days, weeks, or months,
- decipherable by people by visual inspection without special software (for low-sampling rate data),
- machine interpretable,
- conducive to use for data saving on mobile phones,
- not impacted by changes in daylight savings time or changes of clocks on mobile devices, so there is no risk of data loss, and
- appropriate for storage on a distributed file system to enhance data accessibility.
The format is sufficiently extensible to handle a variety of new sensors that may permit multi-day continuous data collection, variable sampling rates, variable sensitivity levels, multi-modal data collection, and acquisition of other useful information about data quality.
This format is intended to be used in projects where data are collected on people measuring their movement, behaviors, physiological state, and/or context. These data can be collected from multiple sensors and sensor types, including commercial wearable monitors, custom wearable monitors, and phones.
The goal of this dataset format is to permit easy sharing of data for research purposes. For this reason, data are stored as simply as possible, but with all information that might be needed for researchers who did not collect the data themselves to use it in new projects. Moreover, we want as much of the dataset as possible to be both human and machine interpretable, so that software can be built to easily manipulate and visualize the data. To make this possible, in addition to raw data collected from sensors, the dataset must contain well-organized meta-information about the data. The meta-information should include information such as sensor operating characteristics, but also information about how the sensors were worn on the body or placed in the environment. Typically, while the sensors are being worn, observational data are also collected, which may include information about the type, intensity, duration, and/or location of physical activity, as well as the contexts in which the data are being collected.
The format is expected to be used in studies where data are collected for many days or weeks, ultimately from hundreds or thousands of participants in single studies. Therefore, it is important that the data not only be in a format that is easy to understand but also in a format that allows efficient storage in a distributed file system. Further, the format should permit efficient manipulation of data using distributed computational operations (e.g., computing meta-data from the raw data).
This data format and the tools that use it are intended to simplify the sharing of such data with other researchers. When datasets are adequately formatted for reuse, researchers should be able to easily and automatically load large datasets into programs that perform batch analyses and make inferences about the activities performed. Visualization software should make it easier to inspect large datasets.
Researchers may want to browse an example dataset collected from one individual before reading further, to get a sense for the files and data structure being proposed: [TO DO add link].
Format design decisions and explanation
To meet the needs mentioned above, the format has the following characteristics.
- All sensor data (high and low sampling rate), including annotation and meta-data, are stored in a human-readable, comma-delimited text (ASCII) format. The CSV (comma-separated values) text file format is human readable (via common spreadsheet and data analysis programs), is easily parsed by computer, and can be edited by hand if necessary. CSV files are particularly well suited for analysis by high-throughput analytical platforms (such as MapReduce and Spark), and they can be easily imported into database systems (such as MySQL), statistical software (such as Matlab and R), and big data frameworks (such as Apache Hadoop) that are important for managing massive datasets (the type of data that will soon be generated from mobile devices).
- All CSV files are compressed (using the commonly available and efficient gzip compression library) to preserve disk space. The downside to storing data in the CSV (comma-separated values) text format is that high sampling rate data may require substantial disk space, especially when the format of the data is known a priori and format-specific, binary storage strategies might be used. Disk space concerns are addressed by gzipping all CSV files as they grow in size (see Appendix: Gzip Compression for a justification). The computational overhead of gzipping CSV files with high-resolution data is acceptable, even on mobile phones, when the files are limited to about an hour of high sampling rate data per file (e.g., 1-2 seconds of compression time per file).
- All sensor data, regardless of original data format, is converted into the CSV format and synchronized with other sensor data. This conversion requires special software for each sensor type. As long as the native format used by the sensor to save the data is known, converters can be easily written to create timestamped CSV files for the mHealth format. For example, converters are necessary (and need to be provided) to read and convert data from common devices such as the Actigraph and ActivPAL into the format described here. The converters will generate one file per sensor, with timestamps that allow sensor synchronization across files.
- All CSV files begin with a header line, labeling each column of data, to ensure the file is easily human interpretable. The first line in each CSV file is the comma-delimited row containing keys for each column (e.g., HEADER_TIME_STAMP, AUC, ACTIVITY_COUNT, START_DT, STOP_DT, etc.). Some of the most common header names are standardized (see Table 2 below) to provide data type information to software, but researchers can add data type labels as needed. For all sensor, event, or feature data files, the first key/column in the header line is HEADER_TIME_STAMP (because sensor data files will always have timestamps), and for all other data files, the first key/column should start with “HEADER_[First column_name]”, which permits any code reading the files to determine if a line is a header or data line, and then to determine the column name if the line is a header line. [1]
- Meta-data are saved in computer- and human- readable CSV files with standard names.
Meta-data typically include:
- Descriptions of sensor types, properties, and placements at the time of data collection. These files are also data files that include sufficient detail so that the protocol can be both understood and reproduced by others. For instance, in some cases, the locations where sensors are worn might change, and the files should allow for proper annotation of such sensor swapping. The format described in more detail below provides naming conventions that should be followed for common meta-data types. (see sensor file and sensors file)
- Descriptions of each participant. These meta-data files include all special notes that might influence interpretation of that particular participant’s data, e.g., dominant arm, anthropometrics that affect measurement (tilt angle of sensor on waist), and gait/motion inconsistencies due to musculoskeletal deficiencies that can affect measurement (e.g., limp, Parkinson’s, injury). These notes should be easily understood by those not present during data collection. (see subject data file)
- Timestamped labels of the activities performed, context-information, and/or physiological states. The time synchronization will depend upon the nature of the data (e.g., for some datasets and sensors such as gait recognition, the labels will be synchronized with millisecond accuracy; for others, synchronization within a few seconds may be sufficient). Examples of activities might be running, walking, and talking. Examples of contexts might be location or properties of location (e.g., distance to a park). Examples of physiological states might be heart rate or galvanic skin response. (see annotation data file and ontology data file)
[1] The header/data line distinction is important because in some cases scripts may concatenate CSV files. Header lines from multiple CSV files (e.g., for different hours of data from the same data source) could therefore end up in the same data file. To allow software to strip these lines, the lines should begin with “HEADER_”.
- Self-reported data provided by the participant throughout the study (e.g., Ecological Momentary Assessment). Answers to questions and other self-reported data are synchronized with all other data sources.
- Notes on any special considerations about the data. These notes include special exceptions for particular subjects, descriptions of how various data sources were synchronized, or any other information not already specified in other files that might help others interpret the data. (See session data file)
- Data are saved in a directory structure that segregates data by year, month, day, and hour. The directory structure accomplishes several goals. It permits data for particular timeframes to be easily identified and isolated. It permits data collected at different timeframes to be easily merged. It logically breaks up large amounts of data into manageable, intuitive chunks. It prevents data from being overwritten (even when there is a time zone change or daylight savings time change). Finally, it creates a logical structure for saving data that spans multiple hours, days, or even months (permitting scripts that allow further dataset parsing/merging). The details of the directory format are described below.
- Data files are stored with the start time of data saving included in the file name. This format, described in detail below, ensures that data conflicts are avoided and data will never be accidentally overwritten.
- Data files should not contain data from more than one time zone. If the device (e.g., a mobile phone) changes time zones, it should stop saving data to the current file and create a new file for the new time zone.
- Time zone information (UTC offset) is encoded in filenames for easy access to the file/directory’s time zone information. UTC offset is the difference in hours and minutes from Coordinated Universal Time (UTC) for a particular place and date. Within files, the timestamps are stored in local time. However, to ensure that time zone information (and especially daylight savings time shift information) is not lost and does not create data storage conflicts, UTC offset is stored in the filename structure.
- All file and directory names are limited to alphanumeric characters and selected special characters (0-9, a-z or A-Z, and “-”). This prevents problems from displaying these filenames on the web or saving them in various file systems. Spaces should not be used. File naming uses CamelCase structure (e.g., “OriginalRaw”). A dot is used in the filename but the format uses it as a separator, so it should not be added to the name.
- Timestamps are in local time format. There are two types of timestamps. Timestamps in file names are in local time with timezone information. Timestamps within files are in local time without timezone information (because that information can be gathered from the file name).
- Many files contain timestamp information using the following convention: [YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm], where YYYY=year (4 digits), MM=month (2 digits [01-12]), DD=day (2 digits [01-31]), hh=hour (2 digits [00-23]), mm=minute (2 digits [00-59]), ss=second (2 digits [00-59]), mmm=millisecond (3 digits [000-999]), P=plus and M=minus and hhmm represents the hour and minutes of the offset of the time from GMT. The International Standard ISO 8601 is used to store all time-related data. Example: 3:15 AM local time on January 1, 2000 for a timezone five hours behind GMT would be listed in the filename as “2000-01-01-03-15-00-000-M0500”.
- Internal to files, the timestamp is listed in the same way but without the timezone information following this convention: [YYYY]-[MM]-[DD] [hh]:[mm]:[ss].[mmm]. Example: 3:15 AM local time on January 1, 2000 would be listed in the timestamp column of a data file as “2000-01-01 03:15:00.000”. Timestamps should typically be the first column in a data file. The header for this column should be “HEADER_TIME_STAMP”. The timestamps always refer to the local time in the place where the data were saved. Timezone information can always be obtained by parsing the filename of the file storing the data.
- Data in CSV files are formatted in the following way to minimize encoding problems:
- Each record is one line, but fields may contain embedded line-breaks (see below) so that a record may span more than one line.
- Fields are separated with commas.
- Headers are all caps to help differentiate them from other data. Examples of header names are HEADER_TIME_STAMP and START_TIME and DATA_SAMPLE.
- Variable names use alphanumeric characters and CamelCase. Examples of variable names are PhoneOn, Yes, No, 0, 1, and ButtonClicked.
- All data entries that are text values with more than one word or special characters should be surrounded by double-quote characters. This is so that special characters, such as commas, can be handled and strings can be more than one word in length (e.g. “This is a sample text”). One-word text values do not need to be surrounded by double-quote (e.g. male, female, High, Moderate, etc.). If the text value contains double quotes, the double quotes need to be properly escaped with a backslash (e.g. “The subject said, \”I did not feel well today \”, repeatedly”).
- Fields with embedded commas must be delimited with double-quote characters. For example, a field with value New York, NY should be stored as “New York, NY” because the value has an embedded comma.
- Fields that contain double quote characters must be surrounded by double-quotes, and the embedded double-quotes must each be represented by a pair of consecutive double quotes. For example, The subject said “No” should be stored as “The subject said ““No”””.
- A field that contains embedded line-breaks must be surrounded by double-quotes. For example (this is a single field in a record), John, Please bring the patient in. J.D. should be stored as: “John, Please bring the patient in. J.D.”
- Leading and trailing space characters (i.e., spaces and tabs) adjacent to comma field separators and not in quotes are not allowed and will be ignored (e.g., “Eating” , “Drinking”, “Walking” is equivalent to “Eating”,“Drinking”,“Walking” because the spaces will be removed).
- Fields with leading or trailing spaces must be delimited with double-quote characters. (e.g. To preserve the spaces in the previous example, the record should be stored as: “Eating ”,“ Drinking”, “ Walking”)
- Fields may always be delimited with double quotes. The delimiters will always be discarded.
- The first record in a CSV file is a header record containing column (field) names, and the first header should start with “HEADER_[HEADER_NAME]”. (E.g., “HEADER_FIRST_NAME”)
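The header and quoting rules above can be illustrated with a minimal sketch that uses Python's standard csv module. The function names `read_mhealth_csv` and `write_mhealth_csv` are hypothetical and not part of the format. The sketch assumes the doubled-double-quote escaping convention (which the csv module uses by default) rather than backslash escaping, and it relies on `QUOTE_MINIMAL`, which quotes only fields that contain commas, quotes, or line breaks; multi-word values without special characters are therefore left unquoted here.

```python
import csv
import io


def read_mhealth_csv(text):
    """Return (header, rows) from CSV text, dropping repeated header lines."""
    # skipinitialspace drops spaces that directly follow a comma separator.
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    header, rows = None, []
    for record in reader:
        if not record:
            continue  # ignore blank lines
        if record[0].startswith("HEADER_"):
            if header is None:
                header = record  # keep only the first header record
            continue  # later header records come from concatenated files
        rows.append(record)
    return header, rows


def write_mhealth_csv(header, rows):
    """Serialize a header and rows; embedded double quotes are doubled."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

Reading the concatenation of two such files yields a single header and the combined data rows, which is the behavior the “HEADER_” prefix is meant to enable.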
The remainder of this document describes the details of the format, which uses conventions for data files, file naming, and directory structure.
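As an illustration of the two timestamp conventions described earlier (filename timestamps with a UTC offset, in-file timestamps without one), the following is a minimal Python sketch. The helper names `filename_timestamp`, `in_file_timestamp`, and `parse_filename_timestamp` are hypothetical, and the sketch assumes offsets are whole minutes.

```python
from datetime import datetime, timedelta, timezone


def filename_timestamp(dt):
    """Format an aware datetime as YYYY-MM-DD-hh-mm-ss-mmm-[P/M]hhmm."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("filename timestamps need a UTC offset")
    total_minutes = int(offset.total_seconds() // 60)
    sign = "P" if total_minutes >= 0 else "M"
    hours, minutes = divmod(abs(total_minutes), 60)
    millis = dt.microsecond // 1000
    return f"{dt:%Y-%m-%d-%H-%M-%S}-{millis:03d}-{sign}{hours:02d}{minutes:02d}"


def in_file_timestamp(dt):
    """Format a datetime as YYYY-MM-DD hh:mm:ss.mmm (local time, no offset)."""
    millis = dt.microsecond // 1000
    return f"{dt:%Y-%m-%d %H:%M:%S}.{millis:03d}"


def parse_filename_timestamp(text):
    """Parse a filename timestamp back into an aware datetime."""
    parts = text.split("-")
    if len(parts) != 8:
        raise ValueError(f"unexpected timestamp: {text!r}")
    year, month, day, hour, minute, second, millis = (int(p) for p in parts[:7])
    offset_text = parts[7]
    sign = {"P": 1, "M": -1}[offset_text[0]]
    offset = sign * timedelta(hours=int(offset_text[1:3]),
                              minutes=int(offset_text[3:5]))
    return datetime(year, month, day, hour, minute, second,
                    millis * 1000, tzinfo=timezone(offset))
```

For 3:15 AM local time on January 1, 2000 in a timezone five hours behind GMT, these helpers produce the two example strings given in the text.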
Uniquely identifying sensors (SensorType)
The format assigns meaningful “SensorType” names to all sensors (e.g., “Wocket”, “ActigraphGT3XPlus”, “FitbitV3”) so that simply by looking at the file name, researchers can have a sense of what type of data will be in the file. Refer to Table 1 for some SensorType examples. Individual pieces of sensor hardware are uniquely identifiable by their SensorType combined with a “SensorID.” SensorIDs differentiate sensors of the same type using a unique identifier. For wireless sensors, the SensorID might be a MAC Address. For phone sensors, the SensorID might be the unique phone ID (e.g., IMEI). When such a unique number does not exist, GUIDs (Globally Unique Identifiers) generated by the user’s computer system are used to ensure uniqueness. They can be randomly generated. The SensorType name should be in CamelCase style.
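A minimal Python sketch of choosing a SensorID follows. The function name `make_sensor_id` is hypothetical. Stripping separators from a hardware identifier and using the 32-character hexadecimal form of a random GUID are assumptions made for illustration; the format itself only requires that the identifier be unique.

```python
import uuid


def make_sensor_id(hardware_id=None):
    """Return a SensorID from a hardware identifier, or a random GUID."""
    if hardware_id:
        # Keep only letters and digits, e.g. drop the ":" in a MAC address.
        return "".join(ch for ch in hardware_id if ch.isalnum()).upper()
    # No hardware identifier available: fall back to a random GUID.
    return uuid.uuid4().hex.upper()
```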
Table 1. Sensor Types
|SensorType (unique type identifier, CamelCase style)||Source for SensorID||VersionInfo ([…] with ‘x’; when no […] ‘NA’ for not […])||Description|
|AndroidPhone||IMEI||e.g., LGxNexus4xAndroid5x1x1||3-axis accelerometer, gyroscope, light, steps, location|
|iPhone||IMEI|| ||3-axis accelerometer, gyroscope, light, steps, location|
|AndroidWearWatch||MAC address or serial number (the watch has no IMEI)|| ||3-axis accelerometer, gyroscope, heart rate monitor|
|Wocket||MAC Address|| ||3-axis -/+ 4 g 10 bit 40 Hz (5-120) accelerometer|
|ActigraphGT3XPlus||Serial Number||e.g., v5x3||3-axis -/+ 6 g 12 bit 30-100Hz accelerometer, steps|
|ActigraphGT9X||Serial Number|| ||3-axis -/+ 8 g 12 bit 30-100Hz accelerometer and -/+ 16 g IMU accelerometer|
|ActigraphGT3XBT||Serial Number|| ||3-axis -/+ 8 g 12 bit 30-100Hz accelerometer|
|ActivPAL||Serial Number|| ||3-axis 0-1.5g 8 bit 10 Hz accelerometer|
|GENEActiv||Serial Number|| ||-/+ 8g 12 bit Selectable 10-100Hz (10Hz) accelerometer|
|OmronHJ720IT||Serial Number|| ||Steps per minute|
|QstarzGPSBTQ1000XT||Serial Number|| ||Location longitude, latitude, velocity|
|ZephyrHxMBT||MAC Address|| ||Heart rate beats per minute|
As more sensors are created, the corresponding SensorType name must be added to the table above.
Typing sensor data (DataType)
More information on the type of sensor data in a file can be obtained by looking at the first header line in the file. As much as possible, researchers are encouraged to use headers in sensor data files that are consistent, so that data computation and visualization software tools can determine the type of sensor data from the datasets. Table 2 includes the most common types of sensor data and the recommended header fields.
Table 2. Data Types
|Type of data name (DataType) (CamelCase, alphanumeric characters only, no spaces)||Recommended header data column names|
Data directory structure
The top-level directory structure for data storage follows this convention, where all names include only alphanumeric characters and “-” (text following the “//” are comments):
- [StudyName] // Name of study in CamelCase
- [ParticipantID1] // All data for participant ID1 is under this directory. A good structure for ParticipantIDs would be “Participant_001”, “Participant_002”, etc. or something similar.
- [ParticipantID2] // All data for participant ID2 is under this directory.
- [ParticipantIDn] // All data for participant IDn is under this directory.
- OriginalRaw // Raw data files from each sensor in the original manufacturer’s format are stored here, untouched. This is useful in case an error is found in a converter or some other type of unexpected problem is discovered.
- MasterSynced // This directory stores the fully synchronized dataset with no additional processing or computed data files. This is the raw data collected from the field but converted into the mHealth format and fully synchronized. These data should never be deleted.
- [YYYY] // 4-digit year
- [MM] // 2-digit month. January is “01” and December is “12”.
- [DD] // 2-digit day. “01”-“31”.
- [HH] // 2-digit hour in format HR, where Midnight-1AM is the “00” hour, 6AM-7AM is the “06” hour, and 11PM-Midnight is the “23” hour. All sensor data from high sampling rate sensors and most data from low sampling rate sensors are stored in the hour directories. Data file naming conventions provide clues to the data file content and prevent naming collisions. Three types of data files are stored in these hour directories: sensor data (*.sensor.csv.gz), annotation data (*.annotation.csv.gz), and event data (*.event.csv.gz). Only original data are contained in this directory. Meta-data computed from these data are placed in the Meta-data directory.
- [SensorType-DataType-VersionInfo].[SensorID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].sensor.csv.gz // Compressed, CSV sensor data from this particular sensor, where the time is the time the first line of data was written. The SensorType and SensorID and VersionInfo together uniquely identify the sensor. The content of these files is described below.
- [OntologyID].[AnnotatorID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].annotation.csv.gz // Compressed, CSV annotation data, where the OntologyID and AnnotatorID are project-specific and combine to uniquely identify the type of data contained within the file. The time is the time the first line of data was written. The content of these files is described below.
- [EventType].[EventID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].event.csv.gz // Compressed, CSV event data, where the EventType and EventID are project-specific and combine to uniquely identify the type of data contained within the file. The time is the time the first line of data was written. The content of these files is described below.
- … // Other raw data, annotation, or event data files
- Subject.csv // Data on this particular subject that never changes.
- [SensorType].[SensorID].annotation.csv.gz // Data on how the SensorType + SensorID sensor is placed/worn that may change over time
// Participant-specific data computed by a system from the data contained within MasterSynced. This would include all computed summary data, all computed features, etc.
- MetaData-[TaskName1] // This label should describe the data contained within, which is data computed from MasterSynced. The structure of this directory mirrors the structure of MasterSynced and can include copies of the data that have been manipulated by algorithms. These should be data that could be automatically regenerated from MasterSynced if necessary.
- [YYYY] // 4-digit year
- [MM] // 2-digit month.
- [DD] // 2-digit day.
- [HH] // 2-digit hour. Only data that can be (re)computed from MasterSynced are stored here.
- [AlgorithmType].[AlgorithmID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].feature.csv.gz // Computed data. AlgorithmType and AlgorithmID should be unique and descriptive. Otherwise, the format is identical to the event data file described above.
- … // Other feature data
- MetaData-[TaskName2] // More meta-data (e.g., classification output).
- Sessions.csv // Information about sessions that will not change.
- Sensors.csv // Information about sensors that will not change.
- Annotators.csv // Information about all annotators.
- Ontology.[OntologyID1].csv // Information about the [OntologyID1] ontology.
- Ontology.[OntologyID2].csv // Information about the [OntologyID2] ontology.
- MetadataCrossParticipant // Data computed across participants for certain sessions, recorded by time in the study (versus absolute date)
- [SessionName1] // A session that must be defined in the Sessions.csv.gz file
- Year[Number] // Year index (Number>0). E.g., “Year1”, “Year2”, ...
- Month[Number] // Month index (Number>0). E.g., “Month1”, “Month11”, “Month12”
- [SessionName1].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].feature.csv.gz // Month-level data computed from original raw data or meta-data
- Week[Number] // Week in study index (Number>0). E.g., “Week1”, “Week3”, “Week11”, where 7 days are a week.
- [SessionName1].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].feature.csv.gz // Week-level data
- Day[Number] // Day in study index (Number>0). E.g., “Day1”, “Day23”, “Day1043”
- [SessionName1].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].feature.csv.gz // Day-level data
The above directory format segregates data logically as follows:
- [StudyName]. Contains all information related to a specific study.
- [ParticipantID]. Contains all information related to a specific participant within that study.
- OriginalRaw. Contains all the original raw files output during the study from various devices for a particular participant. Files in this directory should never be deleted, to preserve the original, raw, untouched data.
- MasterSynced. Contains the master copy of the original data, synchronized, in the mHealth format. Once the data have been cleaned and synchronized, any transformation done to these original data files should create new files in the appropriate unique subdirectory(s) in the Meta-data directory.
- Meta-data. This directory contains meta-data created when processing the MasterSynced data, broken into Meta-data “TaskName” directories. These may include fully or partially transformed data or algorithm outputs that can be subsequently used as the source of data for later tasks/visualization and can be deleted or modified easily without any impact on the OriginalRaw data or the MasterSynced data. Each meta-data task contains all information/data related to a specific analysis, algorithm, and/or transformation on a specific participant’s original data. This analysis, algorithm, and/or transformation can be anything ranging from simple time-shifted transformations of the original data to creating session-specific data sets, which can then be transformed further. For example, a researcher may want to run an algorithm to find the average number of ambulatory steps a specific participant takes in an hour. In this case, a [MetadataTask] could be named “AverageStepsPerHour”. These “task” directories should maintain a directory structure identical to ParticipantID’s MasterSynced directory structure where the synchronized raw data are stored.
- MetadataCrossParticipant. Contains all information/data related to any type of analysis, algorithm, and/or transformation done on multiple participants’ data. For example, a researcher may want to calculate the average number of steps taken in Day1, Day2, etc. for a particular study session. In this case, the “task” subdirectory could be named “AverageStepsPerHour”. In these directories, data are organized by day-, month-, and year-in-study.
- [YearNumber], [MonthNumber], [WeekNumber], [DayNumber]. These directories represent the number of year/month/week/day of a particular session of the study. For example, a researcher may want to run a cross-participant session (named “Session1”) of a certain study (named “Study1”) for one week. The data files for this session should be stored in this directory structure: Study1/MetadataCrossParticipant/Session1/Year1/Month1/Week1/[Day1-Day7], where [Day1-Day7] represents 7 subdirectories under Week1, one for each day in the session. Another example could be a session of a study being run on multiple participants for 3 years, in which case the data files should be stored in this directory structure: [StudyName]/MetadataCrossParticipant/[SessionName]/[Year1 - Year3].
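The hour-level layout can be sketched in Python as below. The function name `master_synced_hour_dir` is hypothetical, and the sketch assumes the nesting [StudyName]/[ParticipantID]/MasterSynced/[YYYY]/[MM]/[DD]/[HH], with the directory chosen from the local time of the data.

```python
from datetime import datetime
from pathlib import PurePosixPath


def master_synced_hour_dir(study, participant, local_time):
    """Return the hour directory that holds data saved at local_time."""
    return PurePosixPath(
        study,
        participant,
        "MasterSynced",
        f"{local_time:%Y}",  # 4-digit year
        f"{local_time:%m}",  # 2-digit month, 01-12
        f"{local_time:%d}",  # 2-digit day, 01-31
        f"{local_time:%H}",  # 2-digit hour, 00-23
    )
```

Because each path component is zero-padded, sorting the directory names alphabetically also sorts them chronologically.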
Data and meta-data file descriptions
Sensor data file.
Sensor data are always stored hourly in gzipped ASCII csv files with this filename structure:
[SensorTypeDataTypeVersionInfo].[SensorID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].sensor.csv.gz
The [P/M][hhmm] field encodes the offset from GMT (P = plus, M = minus), which also captures the daylight savings time mode. SensorType is a unique identifier, and SensorID is a unique ID for that sensor type (see Table 1). DataType is a unique identifier for a specific data type the sensor can generate (see Table 2). For example, a sensor file could be: WocketAccelerationCalibrated.00066606D3AD.2009-06-25-15-00-42-001-M0500.sensor.csv.gz. Within the csv file, the data always have this format: TimeStamp, Comma-Delimited-Numeric-Data. The first line in the csv file will always be the header line, which describes the format of the numeric data. In most cases data use pre-defined headers in Table 2. For example, the header line in an ActigraphGT3XPlus sensor.csv file could look like this:
HEADER_TIME_STAMP,X_ACCELERATION_METERS_PER_SECOND_SQUARED,Y_ACCELERATION_METERS_PER_SECOND_SQUARED,Z_ACCELERATION_METERS_PER_SECOND_SQUARED
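A filename that follows this structure can be split into its parts with a regular expression. The following Python sketch is illustrative only: the names `DataFileName` and `parse_data_file_name` are hypothetical, and the pattern assumes that the dot-separated fields themselves contain no dots and that the same layout applies to sensor, event, annotation, and feature files.

```python
import re
from typing import NamedTuple

_NAME_RE = re.compile(
    r"^(?P<prefix>[^.]+)\.(?P<identifier>[^.]+)\."
    r"(?P<start>\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}-\d{3})-"
    r"(?P<sign>[PM])(?P<off_h>\d{2})(?P<off_m>\d{2})\."
    r"(?P<kind>sensor|event|annotation|feature)\.csv(?P<gz>\.gz)?$"
)


class DataFileName(NamedTuple):
    prefix: str              # e.g. SensorType/DataType, EventType, OntologyID
    identifier: str          # e.g. SensorID, EventID, AnnotatorID
    start: str               # local start time, YYYY-MM-DD-hh-mm-ss-mmm
    utc_offset_minutes: int  # negative for "M" offsets
    kind: str                # sensor, event, annotation, or feature
    gzipped: bool


def parse_data_file_name(name):
    """Split a data file name into its dot-separated components."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized data file name: {name!r}")
    minutes = int(match["off_h"]) * 60 + int(match["off_m"])
    if match["sign"] == "M":
        minutes = -minutes
    return DataFileName(match["prefix"], match["identifier"], match["start"],
                        minutes, match["kind"], match["gz"] is not None)
```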
Event data file.
Events with duration can be stored hourly in gzipped ASCII csv files with this filename structure:
[EventType].[EventID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].event.csv
The EventType and EventID together form a unique identifier. A sample filename that might store phone call log data could be: PhoneCalls.00066606D3AD.2009-06-25-15-00-42-001-M0500.event.csv. In the example, the EventID could be the phone ID. Event data are stored in this structure: TimeStampDataWritten, StartTimeEvent, EndTimeEvent, Detail (arbitrary string or comma-separated strings). Events with no duration use the StartTimeEvent and leave the EndTimeEvent blank. The first line in the csv file will always be the header line, which describes the format of the data. In most cases data use pre-defined headers in Table 2. For example, the first two lines in the PhoneCalls event.csv file could look like this:
HEADER_TIME_STAMP, START_TIME, END_TIME, CALL_INCOMMING_OR_OUTGOING, PHONE_NUMBER
2014-08-22 11:18:18.060, 2014-08-22 11:18:18.060, 2014-08-22 11:19:00.011,Out, 617-111-1111
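As a rough illustration, event records like the PhoneCalls example can be read with Python's csv module. The function names `parse_time` and `read_events` and the added `DURATION_SECONDS` key are hypothetical. The sketch assumes the in-file timestamp layout YYYY-MM-DD hh:mm:ss.mmm and treats a blank END_TIME as an event with no duration.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_time(text):
    """Parse an in-file timestamp; blank values become None."""
    text = (text or "").strip()
    return datetime.strptime(text, TIME_FORMAT) if text else None


def read_events(stream):
    """Read event rows from a text stream and add a duration in seconds."""
    # skipinitialspace drops the space that follows each comma in the example.
    reader = csv.DictReader(stream, skipinitialspace=True)
    events = []
    for row in reader:
        start = parse_time(row["START_TIME"])
        end = parse_time(row["END_TIME"])
        # DURATION_SECONDS is an illustrative extra key, not part of the format.
        row["DURATION_SECONDS"] = (end - start).total_seconds() if end else None
        events.append(row)
    return events
```

For a gzipped file, the stream could be opened with `gzip.open(path, "rt", newline="")` before being passed to `read_events`.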
Feature data file.
The feature data file is identical in format to the event data file, but named with feature.csv instead of event.csv and with a descriptive AlgorithmType. Typically the feature data file will be stored in the meta-data directories and will include processed data that can be recomputed from the raw data. For entries that do not require an END_TIME, that entry is just left blank. The structure of the filename is: [AlgorithmType].[AlgorithmID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].feature.csv.gz
Annotation data file.
Annotated activity labels and annotation definitions are encoded in a human- and computer-readable CSV file. The annotation data file stores information about events with duration, where it is helpful to have additional meta-data about the event available (e.g., annotation definitions). The filename format is: [OntologyID].[AnnotatorID].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].annotation.csv
A sample filename is: PhysicalActivities.jpn1009.2009-06-25-15-00-42-001-M0500.annotation.csv, where “jpn1009” is the unique identifier of the person who generated the annotations (it could alternatively be the name of an algorithm that generated annotations). Annotation data are stored in this structure: TimeStamp,StartTime,StopTime,LABEL_NAME,Label_GUID, SPECIFIC_SENSOR_ID[Other optional information]. The following properties may be defined in the file:
|LABEL_NAME||Required short name of the label to be used in the visualization software (e.g., “Running” or “Doing homework”). This name should correspond to the name found in the Ontology.[OntologyID].csv file for the LABEL_GUID.|
|LABEL_ID||Optional (but recommended) unique global identifier attached to the label. This is pulled from the Ontology.[OntologyID].csv file. [Note about grabbing this from some sort of common dictionary.]|
|RATING_TIME_STAMP||Optional timestamp indicator of when the rating was given by annotator.|
|RATING_INTENSITY||Optional label intensity value. It should be one of the values for this LABEL_GUID that can be found in the Ontology.[OntologyID].csv file.|
|RATING_CONFIDENCE||Optional confidence value (0 to 1) of the rating assigned by annotator.|
|SPECIFIC_SENSOR_ID||Optional sensor id if the annotation is related to a specific sensor.|
The first two lines of an annotation.csv file could look like this:
HEADER_TIME_STAMP,START_TIME,STOP_TIME,LABEL_NAME,LABEL_GUID
2014-08-22 11:18:18.060,2014-08-21 11:08:00.060,2014-08-21 11:28:00.060,"Walking 3MPH",ed9f3ad4-8c7b-4542-bcde-9fdb4a7d1d0b
More information about the assigned label can be gathered using the Ontology.[OntologyID].csv.gz file. More information about the person who provided this annotation can be obtained from the Annotators.csv file.
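A minimal sketch of reading annotation rows is given here for illustration. It uses Python's standard csv module on the two example lines; the function name read_annotations and the returned dictionary keys are hypothetical, and a blank STOP_TIME is treated as an annotation without duration.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# The two example lines of an annotation.csv file.
SAMPLE = (
    "HEADER_TIME_STAMP,START_TIME,STOP_TIME,LABEL_NAME,LABEL_GUID\n"
    "2014-08-22 11:18:18.060,2014-08-21 11:08:00.060,2014-08-21 11:28:00.060,"
    '"Walking 3MPH",ed9f3ad4-8c7b-4542-bcde-9fdb4a7d1d0b\n'
)


def read_annotations(text):
    """Return one dictionary per annotation row, with parsed start/stop times."""
    annotations = []
    # skipinitialspace tolerates files that put a space after each comma.
    for row in csv.DictReader(io.StringIO(text), skipinitialspace=True):
        start = datetime.strptime(row["START_TIME"].strip(), TIME_FORMAT)
        stop_text = (row.get("STOP_TIME") or "").strip()
        stop = datetime.strptime(stop_text, TIME_FORMAT) if stop_text else None
        annotations.append({
            "label": row["LABEL_NAME"],
            "guid": row["LABEL_GUID"],
            "start": start,
            "stop": stop,
            "duration": (stop - start) if stop else None,
        })
    return annotations


annotations = read_annotations(SAMPLE)
print(annotations[0]["label"], annotations[0]["duration"])
```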
Annotators data file.
The Annotators data file contains additional information about any person who has saved annotations. The file is named: Annotators.csv. The file has one comma-delimited line per annotator, with the following header information:
|ANNOTATOR_ID||A unique string that corresponds to the AnnotatorID used in the annotation file names||(e.g., a GUID or “DoeJohn5”)|
|ANNOTATOR_NAME||A string name of the annotator||(e.g., “John Doe”)|
|ANNOTATOR_EMAIL||A string email of the annotator||(e.g., “email@example.com”)|
|NOTES||A string with notes about the annotator that might be important when interpreting the dataset|
Ontology data file.
The Ontology file contains the definitions of and relationships between annotation labels and categories for a particular OntologyID. Labels are grouped into categories. Within each category, labels may be mutually exclusive. Category and label GUIDs are included to permit tracking of change histories and synchronizing definitions between databases when definitions do not change. The file is named as: Ontology.[OntologyID].csv. It uses a KEY, VALUE(s) structure on each line, with the following key/value(s) allowed:
|ONTOLOGY_NAME||String||A short name for the ontology (e.g., to be used in the visualization software). An example is “Exercise Activities” or “Activity Study Activities”|
|DESCRIPTION||String||A clear, concise description of the ontology.|
|DATE_CREATED||Timestamp||Timestamp when this annotation definition was created. Use the timestamp convention.|
|LAST_MODIFIED||Timestamp||Timestamp when this annotation was last modified. Use the timestamp convention.|
|METHOD||String||[String, Choice of 4 options] Indication of how the annotation was done. Options are:
|NOTES||String||Any special notes should be included here. In particular, a full description of how the intensity values for activities are supposed to be determined, if those are used.|
|CATEGORY||String||String name, Category GUID, String Description, Exclusivity [“True” or “False”], Defines a category, which includes a set of labels and properties on those labels.
|LABEL||String||String name, Label GUID, String Description, comma-separated String of IntensityOptions The label is used to annotate data. Labels could be activities, states, contexts, or output from algorithms.
Although ontologies are experiment-specific, researchers are encouraged to reuse the annotation schema of others whenever possible, in order to facilitate data sharing. Examples from prior experiments would be featured on the data sharing website. An example AnnotationData file can be found here: annotation.example.csv and the corresponding example ontology file can be found here: ontology.example.csv
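To make the KEY, VALUE(s) structure concrete, the sketch below parses a small ontology file. It is only an illustration under stated assumptions: the sample contents, the placeholder GUIDs, and the function name parse_ontology are hypothetical, and the code assumes that each line starts with the key followed by its comma-separated values in the order listed in the table above. How labels are associated with categories is not assumed here, so both are kept as flat lists.

```python
import csv
import io

# Hypothetical file contents; the names and GUIDs are placeholders.
SAMPLE = (
    "ONTOLOGY_NAME,Exercise Activities\n"
    "DESCRIPTION,Labels used in a treadmill protocol\n"
    "CATEGORY,Posture,cat-guid-1,Body posture,True\n"
    'LABEL,Walking,label-guid-1,Walking on a treadmill,"slow,medium,fast"\n'
)


def parse_ontology(text):
    """Collect single-valued properties, categories, and labels from the file."""
    ontology = {"properties": {}, "categories": [], "labels": []}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        key = row[0].strip()
        values = [value.strip() for value in row[1:]]
        if key == "CATEGORY":
            name, guid, description, exclusive = values[:4]
            ontology["categories"].append({
                "name": name,
                "guid": guid,
                "description": description,
                "exclusive": exclusive.lower() == "true",
            })
        elif key == "LABEL":
            name, guid, description = values[:3]
            # Intensity options may be one quoted field or several fields.
            options = [
                option.strip()
                for field in values[3:]
                for option in field.split(",")
                if option.strip()
            ]
            ontology["labels"].append({
                "name": name,
                "guid": guid,
                "description": description,
                "intensity_options": options,
            })
        else:
            # Rejoin in case an unquoted value contained commas.
            ontology["properties"][key] = ",".join(values)
    return ontology


ontology = parse_ontology(SAMPLE)
print(ontology["properties"]["ONTOLOGY_NAME"], ontology["labels"][0]["intensity_options"])
```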
It is also recommended to include the following sensor ontology information in the ontology data file if needed. The CATEGORY and LABEL entries defined in the file are:
|SIDE_OF_BODY||The side of the body on which the sensor was worn (left, right, or center).|
|LOCATION||Preferred/default position of sensor on the body from the following list: hand, wrist, forearm, bicep, upper_arm, shoulder, upper_chest, lower_chest, hip, waist, front_pocket, rear_pocket, ear, back_head, top_head, small_back, upper_back, lower_back, forehead, thigh, above_knee, below_knee, calf, above_ankle, top_foot, below_foot, side_ankle, in_bag, in_purse, in_backpack, in_briefcase, in_saddlebag, unknown.|
|BODY_ATTACHMENT||How the monitor is adhered to the body (tape, belt, clipped_on_clothing, in_pouch)|
Sensors data file.
Sensor configuration information is encoded in a human and computer readable Sensors.csv file. This is stored in the [StudyName] directory. Each line in the file contains information about one sensor used in the study, where sensors are uniquely identified using the SensorType and SensorID values. In cases where a single physical sensor saves two different types of data streams, such as accelerometer data and heart rate, the sensor would have two entries (i.e., two lines) in the file. The following header information is required:
|SENSOR_TYPE||The SensorType as found in Table 1.|
|SENSOR_ID||The unique identifier such as serial number, MAC address, or other device-specific number that uniquely identifies a piece of hardware (see Table 1).|
|MODEL||The manufacturer’s model designation for the sensor.|
|VENDOR||The vendor of the sensor.|
|DESCRIPTION||A description of the sensor and what it does.|
|NICKNAME||A short nickname for the sensor (e.g., to be displayed in visualization software).|
|NOTES||Any special notes that another researcher might want or need to know about this particular sensor.|
|TYPE_OF_DATA||The data type of the sensor described in this line (see Table 2, column 1 for names). A single sensor can generate multiple data types, especially a phone; IMU sensors can also generate multiple data types.|
Optional header information is as follows:
|SAMPLING_RATE_HZ||The expected sampling rate in samples/second. For some types of emerging sensors, “variable” is a legitimate value, meaning that the sampling rate may change throughout the dataset, either intentionally or because of the design of the hardware. “unknown” is also a legitimate value.|
|NUMBER_OF_AXES||The number of axes in each sample.|
|RANGE_MINIMUM||The minimum value the sensor will output in the current configuration.|
|RANGE_MAXIMUM||The maximum value the sensor will output in the current configuration.|
|G_RANGE||The +/- range in g’s measured by an accelerometer sensor. For some types of emerging sensors, “variable” is a legitimate value.|
|CALIBRATION||Calibration information that may be used by some algorithms. For an accelerometer, this might be x1g (output of x-axis of accelerometer when oriented towards gravity), x1ng (output of x-axis of accelerometer when oriented away from gravity), y1g, y1ng, z1g, z1ng, xstd (the mean standard deviation of output values when the sensor is set on a solid, non-moving surface), ystd, and zstd.|
|BITS||The number of bits of resolution for each reading. For accelerometers, this is typically 10 or 12 or 24.|
|EPOCH_SECONDS||The length of the epoch, in seconds.|
Examples of descriptions for several common sensor types and output types are included in the sample sensors.csv file, found here: sensors.csv.
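The header rules for Sensors.csv can be checked mechanically. The sketch below is one possible approach, not a reference implementation: the function names are hypothetical, and the identifier column is assumed to be named SENSOR_ID, matching the SensorID value that the text says identifies a sensor together with the SensorType.

```python
# Required headers as listed in the text; SENSOR_ID is an assumed column name.
REQUIRED_HEADERS = [
    "SENSOR_TYPE", "SENSOR_ID", "MODEL", "VENDOR",
    "DESCRIPTION", "NICKNAME", "NOTES", "TYPE_OF_DATA",
]


def missing_required(fieldnames):
    """Return the required headers that are absent from a header row."""
    present = {name.strip() for name in fieldnames}
    return [name for name in REQUIRED_HEADERS if name not in present]


def parse_sampling_rate(value):
    """Return a float in Hz, or the keyword 'variable' or 'unknown'."""
    text = value.strip().lower()
    if text in ("variable", "unknown"):
        return text
    return float(text)


print(missing_required(["SENSOR_TYPE", "MODEL", "VENDOR"]))
print(parse_sampling_rate("30"), parse_sampling_rate("variable"))
```

Returning the keyword unchanged, rather than a placeholder number, forces calling code to handle variable-rate sensors explicitly.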
Subject data file.
Subject information that will not change across the study is recorded in a human and computer readable CSV file. Whenever possible the data should be stored in universally-accepted formats such as SI units (e.g., kg vs. pounds, kph vs. mph). This subject information is stored in the [ParticipantID] directory. The file uses the CSV format and stores the following required information with these headers:
|SUBJECT_ID||Unique ID for the subject. It should be a unique ID generated by the system when the participant enrolls. The IMEI of a person’s phone should not be used as the SUBJECT_ID, because this would cause problems if the person switches phones or maintains more than one mobile phone.|
|SEX||Male or Female|
The following information should also be included when available:
|SPECIAL_NOTES||Any notes about the subject that might be important when using the data.|
|DOMINANT_HAND||Left or Right or Unknown|
|DOMINANT_LEG||Left or Right or Unknown|
|RESTING_HEART_RATE||Resting heart rate|
|MAXIMAL_HEART_RATE||Maximal heart rate|
|PHONE_LOCATION||Where a mobile phone is normally carried|
|SERVICE_PROVIDER||Mobile phone service provider|
|MOBILE_PHONE_MODEL||Mobile phone model|
|SELF_REPORTED_FITNESS||Results of the Stanford Brief Activity Survey, a test that can be easily administered in two minutes.|
If certain information is not available, it is left blank. An example Subject.csv file can be found here: Subject.csv
Sessions data file.
The sessions description file contains information about the session that someone analyzing the data might need to know. This is stored in the [StudyName] directory. Recommended information to include is: dataset name, dataset purpose, information about the setup, and any information someone using the dataset would need to be aware of. The file uses the CSV format and stores the following information for each session, with these headers:
|SESSION_NAME||This should not include underscores or spaces because it is used in the file structure|
|SESSION_START_DATE_TIME||Date when data collection for all participants started for this session|
|SESSION_END_DATE_TIME||Date when data collection for all participants ended for this session|
|SESSION_DESIGN||An overview of the session design|
|SESSION_ISSUES||Known issues about the design of the session|
|SENSOR_PLACEMENT||Detailed information about how sensors were placed on a person during this particular session. There should be enough information so that specific details can be reproduced.|
|REFERENCE||Reference to cite if using the data|
|ACKNOWLEDGEMENTS||Acknowledgements indicating who collected the data and supported that data collection in what way|
|CONTACT_NAME||Contact name for more information|
|CONTACT_EMAIL||Contact email for more information|
|WEBSITE||Website for more information|
Storing the session’s start and end timestamps allows both humans and computers to easily find a participant’s session data in the participant’s “Sensors” directory. These data can then be exported using an algorithm that understands the simple structure of the data storage directory and how to parse this sessions.csv file. This algorithm should produce a gzipped file. Exporting data in this way should generate a file in the participant’s root subdirectory (i.e., [StudyName]/[ParticipantID]) with the name [MetaDataTaskName].[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[mmm]-[P/M][hhmm].csv.gz. An example Sessions.csv file can be found here: Sessions.csv.
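The export filename can be assembled from a timestamp as sketched below. This is an illustrative example only: the function names and the task name SessionExport are hypothetical, and P/M is assumed to denote a positive or negative UTC offset.

```python
from datetime import datetime, timedelta, timezone


def format_mhealth_stamp(moment):
    """Format an aware datetime as YYYY-MM-DD-hh-mm-ss-mmm-[P/M]hhmm."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("an aware datetime with a UTC offset is required")
    # Assumption: M marks a negative UTC offset, P a zero or positive one.
    sign = "M" if offset < timedelta(0) else "P"
    minutes = int(abs(offset).total_seconds()) // 60
    return (
        moment.strftime("%Y-%m-%d-%H-%M-%S")
        + f"-{moment.microsecond // 1000:03d}"
        + f"-{sign}{minutes // 60:02d}{minutes % 60:02d}"
    )


def export_filename(task_name, moment):
    """Build [MetaDataTaskName].[timestamp].csv.gz for an export."""
    return f"{task_name}.{format_mhealth_stamp(moment)}.csv.gz"


moment = datetime(2009, 6, 25, 15, 0, 42, 1000,
                  tzinfo=timezone(timedelta(hours=-5)))
print(export_filename("SessionExport", moment))
# SessionExport.2009-06-25-15-00-42-001-M0500.csv.gz
```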
Sensor annotation file.
A special annotation file is used to store data about each sensor that cannot be stored in the Sensors.csv file because it could change throughout the study. This file is designated with this filename: [SensorType].[SensorID].annotation.csv. It is stored in the [ParticipantID] directory. This annotation file uses the Ontology.Sensor.csv file. It should have the same format as the annotation data file. An example could be:
HEADER_TIME_STAMP, START_TIME, STOP_TIME, FIRMWARE, SIDE_OF_BODY, LOCATION, BODY_ATTACHMENT
2014-08-22 11:18:18.060,2014-08-21 11:08:00.060,2014-08-21 11:28:00.060,1.1.0,Left,Wrist,Tape
Using text formats with Gzip compression
One concern some have raised with saving data in a text (ASCII) format instead of binary is that text is not an efficient use of disk space, and that the I/O processing of the text files will dramatically slow down software using the data. Each concern is discussed below.
Efficient use of disk space
The mHealth format saves all data in text (ASCII) files and then calls for compressing those files using gzip compression. Some tests were therefore performed to verify that saving high sampling rate CSV text data and then compressing the resulting file was both efficient in terms of storage and efficient in terms of computation required.
Best- and worst-case csv files with high resolution 3-axis accelerometer data were generated to assess the impact on compression. For the best-case scenario, the csv contained 1 hour of 50 Hz data with a full readable timestamp per row and floating point g values that are not changing (mimicking sedentary behavior). Below are 10 rows from that file:
The resulting CSV file is approximately 7.9MB for a full hour, and when gzipped it compresses to 499KB, a 94% compression ratio. [Add how long it takes on a phone to do the compression]
For the worst-case scenario, a second csv file with 1 hour of 50 Hz data with a full readable timestamp per row and floating point g values that are randomly changing over a 6 g range was generated. Below are 10 rows from the file:
The resulting CSV file is approximately 7.8 MB for a full hour. Notice that it has approximately the same size as the previous case because both files have approximately the same number of characters. Gzipping the file reduces it to 2.0MB, a 74% compression ratio.
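A comparable experiment can be sketched with Python's gzip module. The example below is not the original test: it generates only one minute of 50 Hz data, and the column names, the constant sample values, and the number formatting are assumptions, so the resulting ratios will differ from the figures reported here. The compression ratio is computed as 1 minus compressed size over original size, which is consistent with the figures above (for example, 1 - 2.0/7.8 is about 74%).

```python
import gzip
import random
from datetime import datetime, timedelta


def make_csv(n_rows, rate_hz, next_sample):
    """Build CSV bytes with a full readable timestamp and x, y, z per row."""
    start = datetime(2014, 8, 22, 11, 0, 0)
    step = timedelta(seconds=1 / rate_hz)
    lines = ["HEADER_TIME_STAMP,X,Y,Z"]  # hypothetical column names
    for i in range(n_rows):
        # Trim microseconds to milliseconds to mimic the timestamp convention.
        stamp = (start + i * step).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
        x, y, z = next_sample()
        lines.append(f"{stamp},{x:.6f},{y:.6f},{z:.6f}")
    return ("\n".join(lines) + "\n").encode("ascii")


def compression_ratio(raw):
    """Fraction of the original size saved by gzip."""
    return 1 - len(gzip.compress(raw)) / len(raw)


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = 50 * 60          # one minute at 50 Hz
constant = make_csv(rows, 50, lambda: (0.012345, -0.987654, 0.054321))
noisy = make_csv(rows, 50, lambda: tuple(rng.uniform(-3, 3) for _ in range(3)))

print(f"constant values: {compression_ratio(constant):.0%}")
print(f"random values:   {compression_ratio(noisy):.0%}")
```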
In contrast, the same tests were performed where the data were saved in binary format. For the best-case scenario, the resulting binary file is approximately 7.0MB for a full hour, and when gzipped it compressed to 486KB, a 93% compression ratio. For the worst-case scenario, the resulting binary file is approximately 7.0MB for a full hour, and when gzipped it compressed to 2.3MB, a 67% compression ratio. Note that for both tests, the resulting binary files have an identical file size, which is expected because all floating point numbers (accelerometer x, y, z) are stored as bytes (4 bytes each, 12 bytes total).
Comparing the two versions of the second test (CSV vs. binary, randomized data points), we can conclude that the CSV format yielded a smaller file size after gzipping (2.0MB vs. 2.3MB, respectively). This is expected, because CSV files can be gzipped more efficiently: in the average case of an accelerometer data file, the first 2-3 characters (including the minus sign for negative numbers) of the floating point values are largely repeated across the file. In the case of the binary data file, all floating point values are always stored as 4 bytes which have to be preserved by the gzip compression, resulting in a less efficient compression rate.
Realistically, accelerometer values do not change randomly and dramatically on all axes as shown in the previous example. There are physical limitations to movement and people spend considerable time sedentary. Therefore, it is likely that the average data file size for an hour of accelerometer data collected from a wearable sensor will be in the middle of the range between 500KB and 2.0MB (i.e., around 1.2MB). In addition, the inclusion of a full timestamp per row has minimal impact because most of the characters are repeated in every row and therefore gzip compresses the timestamp efficiently. Indeed, compressed Actigraph data from real users (without timestamps) led to csv.gz files of about 700KB per hour.
|gt3x||mHealth**||compressed differentials***||csv||sampling rate(Hz)||start (UTC)||end (UTC)|
|35.6 kB||34.1 kB||23.4 kB||332.5 kB||30||2010/10/26 13:30:00||2010/10/26 13:35:56|
|1.8 MB||1.7 MB||1.1 MB||13.5 MB||30||2015/04/09 14:02:00||2015/04/09 17:36:33|
|1.8 MB||1.7 MB||1.1 MB||13.1 MB||30||2015/04/09 14:04:00||2015/04/09 17:36:06|
|486.2 kB||1.3 MB||0.9 MB||13.5 MB||30||2015/04/09 14:00:00||2015/04/09 17:36:55|
|20.8 MB||19.6 MB||12.2 MB||167.2 MB||80||2011/10/13 17:30:00||2011/10/14 09:11:26|
|61.5 MB*||173.5 MB||123.6 MB||1.9 GB||30||2013/11/19 00:00:00||2013/12/09 11:29:02|
|72.2 MB*||192.0 MB||129.2 MB||1.9 GB||30||2013/12/09 00:00:00||2013/12/09 11:19:48|
* Large gaps occur in the data for GT3X file format version 2. When converting to csv (mHealth or otherwise), Actigraph’s guide instructs developers to fill in these gaps using the last known data point, causing the resulting mHealth file to be larger than the compressed gt3x version, depending on how large and frequent the gaps are within the gt3x file. For files spanning over longer ranges of time, the size difference may become more apparent given that Actigraph leaves gaps in the data empty. For the two marked data points, without filling the gaps, the mHealth file sizes are 66.3 MB (csv is 409.2 MB, compressed differentials is 39.6 MB, ~60% total gap in the data) and 88.1 MB (csv is 473.9 MB, compressed differentials is 48.7 MB, ~55% total gap in the data) respectively.
** Output from gt3x-to-mhealth converter was not split into separate hourly files for comparison sake.
*** For an hour’s worth of data sampled at 30Hz where the person wearing the accelerometer is sedentary, compressed differentials can yield an output as low as 316.4 kB (e.g., 3rd row in the table), where each data point is 3 bytes except the first line, which has the full timestamp and data (to serve as a reference point).
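The 316.4 kB figure is consistent with 3 bytes per data point at 30Hz for one hour, as the short calculation below shows. It assumes that kB here means 1024 bytes and ignores the slightly larger first reference line.

```python
SAMPLING_RATE_HZ = 30
SECONDS_PER_HOUR = 3600
BYTES_PER_DATA_POINT = 3

# One hour of samples, each stored as a 3-byte differential.
total_bytes = SAMPLING_RATE_HZ * SECONDS_PER_HOUR * BYTES_PER_DATA_POINT
total_kb = total_bytes / 1024  # assumption: 1 kB = 1024 bytes

print(total_bytes)         # 324000
print(round(total_kb, 1))  # 316.4
```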