If you want to trust your analysis, you first have to trust your data. To raise your data quality, you first need to find the noisy values. The approach here is to give every value a score: the probability that it is wrong.
I. Null or Blank Values
The IS NULL test that every query language provides is enough to find these. Such a value should get a score of 1, meaning it is definitely wrong.
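As a rough illustration of this rule outside a query language, here is a minimal Python sketch. The function name `null_score` and the sample row are made up for the example, and treating whitespace-only strings as blank is an assumption.

```python
def null_score(value):
    """Return 1.0 for a null or blank value, or None when this rule does not apply."""
    if value is None:
        return 1.0
    # Assumption: a string containing only whitespace counts as blank.
    if isinstance(value, str) and value.strip() == "":
        return 1.0
    return None


# Hypothetical row, for illustration only.
row = {"name": "Ada", "job": "  ", "income": None}
scores = {column: null_score(value) for column, value in row.items()}
```

Columns that come back as None are left for the later rules to score.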
II. Definitely Wrong Values
- Comparison with Other Databases
Some databases, such as government databases, can be trusted more than your own. When you compare your data with them, any value that does not match should be considered wrong and scored as 1.
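A minimal sketch of this comparison, assuming both sources can be loaded as dictionaries keyed by the same record id. The names `reference_score`, `ours`, and `trusted` are hypothetical, and giving a matching value a score of 0 is an assumption the text does not state.

```python
def reference_score(value, trusted_value):
    """Score a value against the same field in a more trusted source."""
    if trusted_value is None:
        return None  # no reference record, so this rule cannot decide
    # Assumption: a match is scored 0.0; the text only defines the mismatch case.
    return 0.0 if value == trusted_value else 1.0


# Hypothetical birth dates keyed by record id.
ours = {101: "1980-04-02", 102: "1975-11-30", 103: "1990-01-15"}
trusted = {101: "1980-04-02", 102: "1975-12-30"}

scores = {key: reference_score(ours[key], trusted.get(key)) for key in ours}
```

Record 103 has no counterpart in the trusted source, so it stays unscored by this rule.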
- Comparison with Internal Databases
If the same value is held in two of your own databases and the two copies do not match, at least one of them must be wrong. Scoring both of them 0.66 initially is logical.
(
We know the result is false, so only three possibilities remain. The truth table, where True means the value is correct:
Column I    Column II    Result
True        False        False
False       True         False
False       False        False
Assumption: all three possibilities are equally likely.
Each column is wrong in two of the three rows, so each has a 2/3 (about 66%) probability of being wrong.
This is only an initial assumption; after some cleansing activity these probabilities can be replaced with the observed ones.
)
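The 0.66 figure can be checked by enumerating the truth table. This is a small sketch under the same equal-probability assumption; the variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Each state is (column_1_is_correct, column_2_is_correct).
# The two values disagree, so "both correct" is ruled out.
states = [s for s in product([True, False], repeat=2) if s != (True, True)]


def wrong_probability(column_index):
    """Share of the remaining, equally likely states in which the column is wrong."""
    wrong = sum(1 for s in states if not s[column_index])
    return Fraction(wrong, len(states))


p_first = wrong_probability(0)
p_second = wrong_probability(1)
```

Both probabilities come out as 2/3, which the text writes as 0.66.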
- Comparison of Different Columns
Every database has related columns. For example, a person whose job column says doctor cannot have high school in the education column. At least one of these values must be wrong, so scoring both of them 0.66 initially is logical.
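A minimal sketch of such a cross-column check, assuming the incompatible combinations are kept as a hand-written rule set. `INCOMPATIBLE`, `INITIAL_SCORE`, and the column names `job` and `education` are hypothetical.

```python
# Hypothetical rule set: (job, education) pairs that cannot both be right.
INCOMPATIBLE = {("doctor", "high school")}

# Initial score from the text, to be revised after cleansing feedback.
INITIAL_SCORE = 0.66


def cross_column_scores(row):
    """Score both columns when their combination breaks a rule."""
    if (row["job"], row["education"]) in INCOMPATIBLE:
        return {"job": INITIAL_SCORE, "education": INITIAL_SCORE}
    return {}  # the rule does not flag this row


flagged = cross_column_scores({"job": "doctor", "education": "high school"})
clean = cross_column_scores({"job": "doctor", "education": "university"})
```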
III. Values That May Be Wrong
This is the prediction part of database scoring.
Two methods can be used here:
1. Clustering-Segmentation
Each cluster has some dominant characteristics, for example most of its customers are married, or its mean income is 2,000. The further a value lies from these characteristics, the more likely it is to be wrong.
2. Modeling
A model is built for every single column. This is a much more expensive method than the first.
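The clustering idea can be sketched as follows, assuming the records have already been assigned to clusters. The two scoring formulas are assumptions chosen only to illustrate "further from the dominant characteristic means more suspicious"; the text does not prescribe any particular formula, and the function names and sample data are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def numeric_deviation_score(value, cluster_values):
    """Map the distance from the cluster mean into the range [0, 1)."""
    mu = mean(cluster_values)
    sigma = pstdev(cluster_values)
    if sigma == 0:
        return 0.0 if value == mu else 1.0
    z = abs(value - mu) / sigma
    # Assumption: z / (1 + z) is just one simple way to squash z below 1.
    return z / (1 + z)


def categorical_deviation_score(value, cluster_values):
    """Score a category by how rare it is inside the cluster."""
    counts = Counter(cluster_values)
    return 1 - counts[value] / len(cluster_values)


# Hypothetical cluster with a mean income of 2,000 and mostly married customers.
incomes = [1800, 2000, 2200]
statuses = ["married", "married", "married", "single"]

near = numeric_deviation_score(2100, incomes)
far = numeric_deviation_score(5000, incomes)
rare = categorical_deviation_score("single", statuses)
```

Here `far` is larger than `near`, and the minority status `single` scores higher than `married` would.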
IV. Values That Cannot Be Predicted
There will always be some data that cannot be predicted. These values are equally likely to be right or wrong, so initially they should get a score of 0.5.
As stated above, these are only initial scores. All of these segments should be re-scored using the feedback from data cleansing.
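One possible way to do the re-scoring is to replace a segment's initial score with the error rate observed during cleansing. This is a sketch under that assumption; the function `rescore`, its `min_checked` parameter, and the counts used are hypothetical.

```python
def rescore(initial_score, checked, found_wrong, min_checked=1):
    """Replace the initial score with the observed error rate once enough values were checked."""
    if checked < min_checked:
        return initial_score  # not enough feedback yet, keep the initial assumption
    return found_wrong / checked


# Hypothetical feedback: 200 flagged values were checked by hand, 50 were really wrong.
before_feedback = rescore(0.66, checked=0, found_wrong=0)
after_feedback = rescore(0.66, checked=200, found_wrong=50)
```

With this feedback the segment's score would drop from 0.66 to 0.25.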
Güven Kızıltaş