Data is the lifeblood of modern decision-making, yet raw data is seldom clean. It usually arrives messy, incomplete, or inconsistent, which makes analysing it a real challenge. For a data analyst, data cleaning is not just good practice but a prerequisite. Here are the key data-cleaning methods that belong in every analyst's toolkit; they help ensure your insights rest on a solid foundation.
Handling Missing Values
One of the most widespread data quality problems is missing values. They can arise for many reasons, including data-entry mistakes, system malfunctions, or information that was simply never collected. Ignoring them can lead to biased results and flawed models.
- Row Deletion: When a row contains many missing values, or the dataset is large enough to spare it, deleting the entire row is an option. It is appropriate when the overall rate of missing data is low (e.g. under 5 per cent).
- Column Deletion: When a column has a high percentage of missing values, it may be better to delete the column entirely, since it is unlikely to provide much useful information.
- Imputation: The missing value is replaced with an estimated one. Common approaches include:
- Mean / Median / Mode Imputation: Replace missing numerical values with the column's mean or median. For categorical data, use the mode (the most frequent value).
- Forward/Backward Fill: For time-series data, carry the previous (or next) valid observation forward (or backward).
- Regression Imputation: Estimate missing values with a regression model fitted on the other variables. It is a more sophisticated approach, but it can introduce bias if applied carelessly.
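To make these options concrete, here is a minimal sketch using Python's Pandas library (one of the tools mentioned below); the DataFrame, column names, and values are invented purely for illustration:

```python
import pandas as pd
import numpy as np

# A toy dataset with gaps; all names and values are hypothetical.
df = pd.DataFrame({
    "age":   [25, np.nan, 31, 29, np.nan],
    "city":  ["Delhi", "Noida", None, "Delhi", "Noida"],
    "sales": [100.0, 105.0, np.nan, 98.0, 102.0],
})

# Row deletion: drop any row containing a missing value.
dropped_rows = df.dropna()

# Column deletion: keep only columns that are at least 50% populated.
dropped_cols = df.dropna(axis=1, thresh=int(len(df) * 0.5))

# Median imputation for a numeric column, mode for a categorical one.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill for ordered (e.g. time-series) data.
df["sales"] = df["sales"].ffill()
```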
Removing Duplicates
Duplicate records can distort statistics, inflate counts, and misrepresent the underlying data. Identifying and eliminating them is essential for accurate analysis.
- Exact Duplicates: Most data cleaning tools and programming languages (such as Python with Pandas, or SQL) provide simple functions to detect and delete rows that are identical across all columns, or across a chosen subset of columns.
- Partial Duplicates: Records may not be exact copies yet still describe the same entity with minor variations. Catching these may require fuzzy matching algorithms, string-similarity measures (e.g. Levenshtein distance), or unique identifiers where available.
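A short pandas sketch of both cases; the sample records are invented, and the similarity check uses the standard library as a stand-in for a dedicated Levenshtein implementation:

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({
    "name":  ["Asha Rao", "Asha Rao", "A. Rao"],
    "email": ["asha@example.com", "asha@example.com", "asha@example.com"],
})

# Exact duplicates: rows identical across every column.
df = df.drop_duplicates()

# Duplicates on a subset of columns, keeping the first occurrence.
df = df.drop_duplicates(subset=["email"], keep="first")

# Partial duplicates: a simple similarity score between two strings;
# dedicated libraries (e.g. rapidfuzz) offer true Levenshtein scoring.
score = SequenceMatcher(None, "Asha Rao", "A. Rao").ratio()
```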
Correcting Inconsistent Data
Inconsistencies usually stem from differing data-entry conventions, a lack of standardisation, or the merging of datasets collected on different bases. They take many forms, such as variant spellings and non-standard formats.
- Case Consistency: Convert all text to a uniform case (e.g. new york, New York, and NEW YORK all become New York).
- Whitespace Removal: Strip leading and trailing whitespace.
- Spelling Correction: Common spelling errors can be corrected using dictionaries or fuzzy matching.
- Value Mapping: Use a look-up table to map inconsistent values to a single standard one (e.g. all variants of USA map to United States).
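These text fixes chain together naturally in pandas; the country column and the look-up table below are hypothetical examples:

```python
import pandas as pd

df = pd.DataFrame({"country": ["  USA", "U.S.A.", "united states", "US "]})

# Whitespace removal and case consistency in one pass.
df["country"] = df["country"].str.strip().str.title()

# Value mapping: a small look-up table of known variants.
variants = {
    "Usa": "United States",
    "U.S.A.": "United States",
    "Us": "United States",
}
df["country"] = df["country"].replace(variants)
```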
Standardising Dates and Numbers
- Date Format Unification: Convert all dates to a single format (e.g. YYYY-MM-DD).
- Unit Conversion: Convert all quantitative measurements to a common set of units (e.g. all weights to kilograms, all currencies to USD).
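A minimal sketch of both steps; the mixed date strings and the weight column are invented, and the `format="mixed"` option assumes pandas 2.0 or later:

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["03/01/2024", "2024-01-05", "Jan 7, 2024"],
    "weight_lb":  [2.2, 11.0, 4.4],
})

# Date format unification: parse mixed inputs into proper datetimes,
# which pandas renders in ISO YYYY-MM-DD form.
# (format="mixed" requires pandas >= 2.0.)
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

# Unit conversion: pounds to kilograms (1 lb = 0.45359237 kg).
df["weight_kg"] = df["weight_lb"] * 0.45359237
```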
Handling Outliers
Outliers are observations that differ markedly from the rest of the data. They may be valid but unusual data points, or they may be errors. Depending on their nature, they can drastically affect statistical analyses and machine learning models.
- Visualisation: Box plots, scatter plots, and histograms are excellent visual tools for spotting outliers.
- Z-score: Measures how many standard deviations a data point lies from the mean. A common threshold flags values with a Z-score above 3 or below -3.
- Removal: When an outlier is clearly the result of a data-entry error, it can simply be removed.
- Transformation: Reduce the influence of skewed distributions and outliers with mathematical transformations (e.g. a log transformation).
- Imputation: Treat the outliers as missing values and impute them with one of the methods described above.
- IQR Method: Flag points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR; box plots display exactly these fences, which is why they are so useful here.
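Here is a rough sketch of the Z-score and IQR checks, plus a log transformation; the series is synthetic, with one outlier planted deliberately:

```python
import pandas as pd
import numpy as np

# Synthetic data: 100 normal points plus one planted outlier.
rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(50, 5, 100), [120.0]))

# Z-score method: flag points more than 3 standard deviations from the mean.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR method (what a box plot shows): flag points outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Log transformation to damp large values (suitable for positive data).
s_log = np.log1p(s)
```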
Validating Data Types
Incorrect data types are a subtle but serious problem. A numerical column stored as text will not permit mathematical operations, and dates stored as strings cannot be used as date-time objects.
- Explicit Conversion: Convert columns to their correct data types, for example turning an object column into a numeric, date-time, or categorical column.
- Error Handling: Do not ignore conversion errors (e.g. non-numeric characters in a numerical column); either replace them with nulls or fix them at the source.
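A minimal pandas sketch of both points; the columns and bad entries are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "price":   ["10.5", "12", "n/a"],
    "signup":  ["2024-01-03", "2024-02-10", "not a date"],
    "segment": ["retail", "retail", "wholesale"],
})

# Explicit conversion with error handling: unparseable entries become
# NaN/NaT instead of raising an exception, so they can be inspected,
# fixed, or imputed afterwards.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

# Low-cardinality text columns can be stored as categoricals.
df["segment"] = df["segment"].astype("category")
```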
Removing Irrelevant Data
A dataset can also contain columns or rows that are irrelevant to the question being analysed. Even if they are not dirty in themselves, they add noise, increase processing time, and distract from the core insights.
- Column Removal: Delete columns that do not pertain to the analysis, are redundant, or have too many empty fields to be useful.
- Row Filtering: Filter out rows that are not relevant to your analysis (e.g. data from a different period, or categories you are not studying).
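A brief illustrative sketch; the columns, date cut-off, and region filter are all hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id":      [1, 2, 3],
    "region":        ["North", "South", "North"],
    "order_date":    pd.to_datetime(["2023-12-30", "2024-01-02", "2024-01-05"]),
    "internal_note": ["", "", ""],
})

# Column removal: drop columns that carry no analytical value.
df = df.drop(columns=["internal_note"])

# Row filtering: keep only the period and categories under study.
df = df[(df["order_date"] >= "2024-01-01") & (df["region"] == "North")]
```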