Model your data as a series of process observations or measures, each associated with an outcome of interest. Compose each observation from a common set of features (also known as independent variables or factors) along with the associated outcome. Both features and outcomes are either numerical (dates, times, ages, etc.), binary (yes/no, true/false, etc.) or categorical (Gender, Service line, Floor unit, Shift, DRG, etc.).
Prepare a comma-delimited file with a header row as shown below. Include a single categorical or binary outcome column as shown in the rightmost column. Feature columns can be either numerical or categorical, and contain values that may be expected to have some impact on the outcome. Feature selection, also called variable selection, is best determined in advance of data preparation by subject matter experts. We work closely with our customers to incorporate domain knowledge within the feature set design.
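As a minimal sketch of the file layout described above, the following Python reads comma-delimited text with a header row and treats the rightmost column as the outcome. The column names, the sample values, and the helper name load_observations are hypothetical and chosen only for illustration; they are not taken from Table 1.

```python
import csv
import io

# Hypothetical sample: column names and values are illustrative only.
SAMPLE_CSV = """Gender,ServiceLine,Shift,Response
F,Cardiology,Day,Always
M,Orthopedics,Night,Other
F,Orthopedics,Day,Always
"""


def load_observations(text):
    """Return (feature_names, outcome_name, rows) from comma-delimited text.

    The first row is the header; the rightmost column is the outcome.
    Each entry in rows is a (feature_values, outcome_value) pair.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    feature_names, outcome_name = header[:-1], header[-1]
    rows = [(row[:-1], row[-1]) for row in reader if row]
    return feature_names, outcome_name, rows


feature_names, outcome_name, rows = load_observations(SAMPLE_CSV)
print(feature_names, outcome_name, len(rows))
```

Keeping the outcome in the last column lets the same loader work for any number of feature columns.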
Numerical feature values must begin with a number and have an ordered relationship, such as patient temperature.
Categorical feature values must begin with a letter and have no order among the possible values of the variable (for example, there is no order relationship between Sunny and Rain; neither is bigger or smaller than the other, they are simply distinct).
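The first-character convention above can be checked before a file is submitted. The sketch below is one possible check, not a description of how the product itself parses values; the function names value_kind and column_kind are hypothetical.

```python
def value_kind(value):
    """Classify one raw cell by its first character.

    Values starting with a digit are treated as numerical, values starting
    with a letter as categorical. Anything else (for example a leading
    minus sign) is reported as "unknown" so it can be reviewed by hand.
    """
    text = value.strip()
    if not text:
        return "empty"
    if text[0].isdigit():
        return "numerical"
    if text[0].isalpha():
        return "categorical"
    return "unknown"


def column_kind(values):
    """Return the single kind shared by all cells, or "mixed" otherwise."""
    kinds = {value_kind(v) for v in values}
    return kinds.pop() if len(kinds) == 1 else "mixed"


print(column_kind(["98.6", "101.2", "99.1"]))
print(column_kind(["Sunny", "Rain"]))
print(column_kind(["98.6", "Sunny"]))
```

A column reported as "mixed" or "unknown" is a signal to clean the data before training, since a feature should be consistently numerical or consistently categorical.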
The sample below shows encounter data coupled with healthcare HCAHPS survey results. In this example, all encounter data features are categorical. Use as many categories (columns) as you wish; however, use as few categorical values as possible within each category. For instance, if the answer to a question is 'sometimes' or 'usually', those answers should be rolled up into a single category. In our example below, responses of 'sometimes', 'usually', and 'never' were rolled up into a category of 'Other'. A good practice is to avoid categorical values that do not help distinguish between outcomes of interest. In this case we were only interested in distinguishing leading factors between 'Always', the only desired response, and all other responses.
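The roll-up described above can be sketched in a few lines of Python. The function name roll_up and its parameters are hypothetical; the response labels are the ones named in the text.

```python
from collections import Counter

DESIRED = "Always"  # the only desired survey response


def roll_up(response, desired=DESIRED, other_label="Other"):
    """Map the desired response to itself and every other answer to one label."""
    if response.strip().lower() == desired.lower():
        return desired
    return other_label


raw_responses = ["Always", "usually", "sometimes", "never", "Always"]
rolled = [roll_up(r) for r in raw_responses]
print(Counter(rolled))
```

After the roll-up the outcome column holds only two values, which turns the task into a binary distinction between the desired response and everything else.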
This is the question:
Although Table 1 below consists of only 100 samples and 8 encounter features, it is difficult to manually identify which factors most distinguish the undesired response of 'Other' from the desired response of 'Always'. View the table below. For us humans there are too many inputs, too many outputs, too many anomalies and too much randomness. We would never be able to determine, from looking at the data, which data relationships predict the outcome. However, machine learning is able to train on 80% of this data, predict the response label (outcome) of the other 20% with 96-100% accuracy, and tell you which findings it used in the prediction!
We can use a series of rules derived from machine learning to implement process interventions designed to improve patient satisfaction. For example, we may learn that a particular service line and staff shift are strong factors in a negative survey response. One can then focus on those process areas for a positive impact on survey results.
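The text does not say which learning algorithm is used, so the sketch below illustrates only the 80/20 hold-out split and the accuracy calculation, using the Python standard library. The function names, the fixed seed, and the placeholder rows are assumptions made for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predicted labels that match the actual labels.

    Assumes actual is not empty.
    """
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Placeholder rows standing in for 100 observations.
train, test = train_test_split(range(100))
print(len(train), len(test))
```

With 100 observations this yields 80 rows for training and 20 held-out rows; the model's predictions on the held-out rows are then compared with their true labels using accuracy.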
Table 1 - Simulated encounter and HCAHPS results data