From: Artificial intelligence tool development: what clinicians need to know?
No | Data curation step | Explanation |
---|---|---|
1. | Data Exploration | This involves getting familiar with the dataset by examining its structure, size and basic statistics. This step helps identify any missing values, outliers or inconsistencies in the data. It provides insights into the distribution of features and potential patterns that may exist within the dataset |
2. | Linking and Combining Different Sources into One Dataset | This involves integrating data from multiple datasets or sources into a single cohesive dataset. This step allows for a comprehensive analysis of data from various sources, enabling insights and patterns that may not be apparent when analysing individual datasets separately |
3. | Deidentification (Pseudo or Anonymisation) | Deidentification involves removing or obfuscating personally identifiable information from the dataset to protect individual privacy. This step is crucial for handling sensitive data and ensuring compliance with data protection regulations. Deidentified data can still be used for analysis and modelling while preserving the anonymity of individuals |
4. | Data Annotation | This involves labelling or tagging the data with relevant information such as class labels or categories, to prepare it for supervised learning tasks. This step is essential for training ML models as it provides ground truth labels for the algorithm to learn from. Data annotation can be done manually by human annotators or using automated tools, depending on the complexity and scale of the dataset |
5. | Data Preprocessing | This critical step includes several substeps to clean and prepare the data for analysis or modelling. This may involve removing noise or irrelevant information, handling missing values, addressing class imbalances, encoding categorical variables and scaling or normalising features. Data preprocessing aims to ensure that the data is in a standardised format and suitable for ML algorithms. Scaling or normalising ensures that all features have a similar scale and distribution, which can improve the performance of certain ML algorithms such as gradient descent-based methods. Common techniques include scaling features to have zero mean and unit variance (standardisation) or scaling features to a specified range (min–max scaling); a brief code sketch follows this table |
6. | Data Quality Assurance | This involves ensuring the integrity, accuracy and reliability of the dataset throughout the curation process. This includes conducting thorough checks for errors, inconsistencies or biases in the data, as well as validating the annotations and preprocessing steps. Data quality assurance aims to identify and rectify any issues that could impact the performance or validity of the ML models trained on the dataset |
7. | Data Splitting | This involves dividing the dataset into training, validation and test sets to evaluate the performance of ML models. A common practice is to split the data into a 70–30 or 80–20 ratio, with the larger portion allocated for training. The training set is used to train the model, the validation set is used to tune hyperparameters and optimise model performance, and the test set is used to evaluate the final performance of the model on unseen data. Proper data splitting ensures that the model’s performance estimates are unbiased and generalisable to new data (see the splitting sketch after this table) |
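To make the scaling techniques mentioned in the data preprocessing row concrete, the minimal sketch below standardises and min–max scales a feature matrix with scikit-learn. The synthetic arrays and the choice of scikit-learn are illustrative assumptions rather than part of the table; the key point is that scalers are fitted on the training data only and then reapplied to the test data to avoid information leakage.

```python
# Minimal sketch of standardisation and min-max scaling (scikit-learn assumed).
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Placeholder feature matrices (rows = patients, columns = numeric features).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 4))
X_test = rng.normal(loc=50, scale=10, size=(30, 4))

standardiser = StandardScaler()                    # zero mean, unit variance
X_train_std = standardiser.fit_transform(X_train)  # fit on training data only
X_test_std = standardiser.transform(X_test)        # reuse the training statistics

minmax = MinMaxScaler(feature_range=(0, 1))        # rescale each feature to [0, 1]
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)
```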
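The data splitting row can likewise be illustrated with two successive calls to scikit-learn's train_test_split, here producing an approximate 70/15/15 split. The synthetic data and exact proportions are assumptions for illustration, chosen to be consistent with the ratios mentioned in the table; stratifying on the labels keeps class proportions similar across the three sets.

```python
# Minimal sketch of a stratified train/validation/test split (scikit-learn assumed).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))          # 1000 illustrative samples, 10 features
y = rng.integers(0, 2, size=1000)        # illustrative binary labels

# Hold out 15% as the test set, then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, stratify=y_trainval, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```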
AI/ML algorithms typically require datasets containing thousands of unique inputs for their features/variables. When data quantity remains limited or data quality remains poor even after the data cleaning and preprocessing steps described above, several strategies can be employed to overcome these challenges:

No | Strategy | Explanation |
---|---|---|
1. | Data Augmentation | This involves generating additional training samples by applying various transformations to the existing data, such as random rotations, flips, crops or colour adjustments for image data, or adding noise or perturbations for other types of data (see the augmentation sketch after this table). Data augmentation helps increase the diversity and variability of the training data, improving the model’s ability to generalise to unseen examples and reducing the risk of overfitting |
2. | Feature/Variable Engineering | This involves creating new features or transforming existing ones to improve the performance of ML models. This may include extracting relevant information from raw data, combining or aggregating features, or applying mathematical transformations to make the data more informative or discriminative. Feature engineering may improve the model’s ability to capture underlying patterns in the data |
3. | Ensemble Methods | Combine predictions from multiple weak models to create a stronger and more robust model. Ensemble methods such as bagging, boosting or stacking can help mitigate the effects of noisy or low-quality data by leveraging diverse models and averaging their predictions (see the ensemble sketch after this table) |
4. | Semisupervised Learning | Incorporate unlabelled data alongside the limited labelled data when training the model. This may leverage the abundant unlabelled data to improve model performance and generalisation, even when labelled data are scarce |
5. | Active Learning | Strategically select which samples to label by iteratively training the model on a small labelled dataset, then using the model to select the most informative samples for annotation. This approach maximises the utility of limited labelling resources |
6. | Transfer Learning | Utilise models pretrained on large, relevant datasets and fine-tune them on the smaller or lower-quality target dataset (a fine-tuning sketch follows this table). Transfer learning leverages the knowledge learned by the pretrained model to boost performance on the target task with limited data |
7. | Domain Knowledge Integration | Incorporate domain knowledge and expertise into the modelling process to guide feature selection, model architecture design and interpretation of results. Domain knowledge can help compensate for data limitations and improve the relevance and accuracy of the model's predictions |
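As a minimal sketch of data augmentation for image data (assuming torchvision is available; the specific transform parameters are illustrative, not prescriptive), the pipeline below applies random rotations, flips, crops and colour jitter on the fly, so the model sees a slightly different version of each training image every epoch.

```python
# Illustrative on-the-fly image augmentation pipeline (torchvision assumed).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # small random rotations
    transforms.RandomHorizontalFlip(p=0.5),                     # random left-right flips
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # random crop, resized to 224x224
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # mild colour perturbation
    transforms.ToTensor(),                                      # convert to a tensor for the model
])
# Applied per image during training, e.g. augmented = augment(pil_image).
```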
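The ensemble row can be sketched with scikit-learn: a stacking classifier that combines a bagged tree ensemble (a random forest) with a boosted tree ensemble and lets a logistic regression meta-model combine their predictions. The synthetic dataset and the particular base models are assumptions for illustration only.

```python
# Illustrative ensemble combining bagging, boosting and stacking (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # synthetic data

stack = StackingClassifier(
    estimators=[
        ("bagged_trees", RandomForestClassifier(n_estimators=200, random_state=0)),  # bagging
        ("boosted_trees", GradientBoostingClassifier(random_state=0)),               # boosting
    ],
    final_estimator=LogisticRegression(),  # meta-model that combines the base predictions
)
print(cross_val_score(stack, X, y, cv=5).mean())  # cross-validated accuracy of the ensemble
```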
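Finally, a minimal transfer-learning sketch, assuming PyTorch and a recent torchvision (older versions use pretrained=True instead of the weights argument): a ResNet-18 pretrained on ImageNet is frozen and only a newly added classification head is trained on the small target dataset. The variable num_classes is a placeholder for the number of categories in the target task.

```python
# Illustrative fine-tuning setup: freeze a pretrained backbone, train a new head (PyTorch assumed).
import torch.nn as nn
from torchvision import models

num_classes = 3  # placeholder, e.g. three diagnostic categories in the target dataset

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # weights learned on ImageNet
for param in model.parameters():
    param.requires_grad = False                                   # freeze the feature extractor

model.fc = nn.Linear(model.fc.in_features, num_classes)           # new, trainable classification head
# Only model.fc.parameters() would then be passed to the optimiser and trained on the small dataset.
```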