Data Cleansing and De-Duplication
However, investigations show that many such applications fail to work successfully. These failures have many causes, such as poor system infrastructure design or poor query performance.
Its implementation was carried out on a user-defined database, and the results are presented in this chapter. Clean and Match is made up of three components: Data, Clean, and Match. Data lists the imported tables. Clean consists of seven modules, each with a different purpose, and is used to analyze, clean, correct, and correctly populate a given table without removing duplicates. Its cleansing modules are the Statistics module, Case Converter, Text Cleaner, Column Cleaner, E-mail Cleaner, Column Splitter, and Column Merger. Match detects duplicates using a fuzzy-matching de-duplication technique. WinPure Clean and Match uses a unique three-step approach to finding duplicates in a given list or database.
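The idea behind fuzzy-matching de-duplication can be illustrated with a minimal sketch. It does not reproduce WinPure's matching algorithm, which is not described here. As a stand-in it uses the similarity ratio from Python's standard difflib module, and the function names, the sample records, and the 0.85 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (a rough stand-in for a cleaning pass)."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records: list[str], threshold: float = 0.85) -> list[tuple[int, int, float]]:
    """Compare every pair of records and report pairs at or above the threshold.

    The threshold is an assumed value; a real project would tune it on
    records whose duplicate status is already known.
    """
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= threshold:
                matches.append((i, j, round(score, 2)))
    return matches


# Hypothetical sample records: the first three are probable duplicates.
names = ["John Smith", "john  smith", "Jon Smith", "Mary Jones"]
duplicates = find_duplicates(names)
for first, second, score in duplicates:
    print(names[first], "<->", names[second], score)
```

Because every record is compared with every other record, the cost grows quadratically with the size of the list. Flagged pairs are candidates for review rather than confirmed duplicates.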
Study objectives codetermine the required precision of the outcome measures, the error rate that is acceptable, and, therefore, the necessary investment in data cleaning. Longitudinal studies necessitate checking the temporal consistency of data. Plots of serial individual data, such as growth data or repeated measurements of categorical variables, often show a recognizable pattern from which a discordant data point clearly stands out [Ki FY, Liu JP, Wang W, Chow SC, 1995]. In clinical trials, there may be concerns about investigator bias resulting from the close data inspections that occur during cleaning, so that examination by an independent expert may be needed. In small studies, a single outlier will have a greater distorting effect on the results. Some screening methods, such as examination of data tables, will be more effective, whereas others, such as statistical outlier detection, may become less valid with smaller samples.
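Statistical screening of one subject's serial measurements can be sketched as follows. No particular detection method is prescribed here, so the sketch assumes a modified z-score based on the median and the median absolute deviation (MAD), with the commonly used constants 0.6745 and 3.5. It also assumes the series is roughly stable; trending data such as growth curves would need the trend removed first. The function name and the sample values are hypothetical.

```python
from statistics import median


def flag_outliers(values: list[float], cutoff: float = 3.5) -> list[int]:
    """Return the indices of values whose modified z-score exceeds the cutoff."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical, so there is no spread
        # to measure against; this simple sketch flags nothing in that case.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > cutoff
    ]


# Hypothetical repeated weights (kg) for one subject; the fifth entry
# looks like a misplaced decimal point.
weights = [70.1, 70.4, 69.8, 70.2, 7.02, 70.0]
suspect_indices = flag_outliers(weights)
print(suspect_indices)
```

A flagged value is only a candidate error. With few observations per subject the median and MAD are themselves unstable, which is one reason such tests lose validity in small samples.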
The data cleansing process mainly consists of detecting and identifying errors and then correcting them. Although data often need to be analyzed quickly, data cleansing is complex and time-consuming, because it must ensure that the cleansed data are of better quality. The importance of a domain expert in the data cleansing process is undeniable, since verification and validation of the cleansed data are the main concerns.
Society for Clinical Data Management. Good clinical data management practices, version 3.0. Milwaukee (Wisconsin): Society for Clinical Data Management; 2003. Available: http://www.scdm.org/GCDMP
Armitage P, Berry G. Statistical methods in medical research, 2nd ed. Oxford: Blackwell Scientific Publications; 1987. 559 pp.
Ki FY, Liu JP, Wang W, Chow SC. The impact of outlying subjects on decision of bio-equivalence. J Biopharm Stat. 1995;5:71–94.