Data cleaning
Some guidelines for cleaning your data:
- Open the file and read the data description to check that the two match. Do the data comply with the GDPR?
- How are missing values encoded? Be careful: there can be several types of missing values ("I don't want to answer" is not the same as "I don't know the answer").
- Individuals/observations should be in rows, not in columns.
- Column names should be written on one line, not two (this makes importing the data easier).
- Remove unnecessary rows and columns (and avoid leaving empty columns in the middle of the data).
- Check the data types: there should be only one per column.
- If you import the data, check that they are the same before and after the import. For example, you should have the same number of rows and columns.
- For numerical variables, compute summary statistics and plots (minimum, mean, main quantiles, maximum, boxplots, histograms, ...) and check that the observed values are possible (e.g. a negative number of heartbeats is not).
- Check the levels of the categorical variables, especially for differences in the way they are written (e.g. Belgium is different from belgium).
- Look for duplicated observations.
- Check for consistency between your variables (e.g. marital status = single, but in the column "other marital status" the person wrote "married").
- Always document the changes you make to your data. Keep this record in a file alongside your data.
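
Several of the checks above can be automated. The following is a minimal sketch in Python using only the standard library. The sample data, the column names (`id`, `country`, `heart_rate`, `marital_status`), the missing-value codes (`NA`, `refused`) and the plausible range for heart rate are all assumptions made up for illustration; adapt them to your own data description.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: column names and values are illustrative only.
RAW = """id,country,heart_rate,marital_status
1,Belgium,72,single
2,belgium,-5,married
3,France,NA,single
3,France,NA,single
4,Belgium,refused,married
"""

# Assumed encodings: "NA" = answer unknown, "refused" = declined to answer.
MISSING_CODES = {"NA": "unknown", "refused": "declined"}


def load_rows(text):
    """Read CSV text into a list of dicts, one per observation (row)."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows, column):
    """Count each type of missing value separately for one column."""
    return Counter(
        MISSING_CODES[r[column]] for r in rows if r[column] in MISSING_CODES
    )


def out_of_range(rows, column, low, high):
    """Return the ids of rows whose numeric value lies outside [low, high].

    Values listed in MISSING_CODES are skipped; any other non-numeric
    value raises ValueError, which itself signals a mixed-type column.
    """
    bad = []
    for r in rows:
        value = r[column]
        if value in MISSING_CODES:
            continue
        if not low <= float(value) <= high:
            bad.append(r["id"])
    return bad


def inconsistent_levels(rows, column):
    """Group spellings that differ only by case or surrounding whitespace."""
    groups = {}
    for r in rows:
        groups.setdefault(r[column].strip().lower(), set()).add(r[column])
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


def duplicated_rows(rows):
    """Return each row that appears more than once (all columns compared)."""
    counts = Counter(tuple(r.items()) for r in rows)
    return [dict(items) for items, n in counts.items() if n > 1]


rows = load_rows(RAW)
print(len(rows), "rows,", len(rows[0]), "columns")   # compare with the source file
print(missing_counts(rows, "heart_rate"))
print(out_of_range(rows, "heart_rate", 20, 250))     # assumed plausible range
print(inconsistent_levels(rows, "country"))
print(duplicated_rows(rows))
```

Each function only reports problems and never modifies the data, so any correction remains a deliberate step that you can record in your change log.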