Data Cleansing and De-Duplication

Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data.

However, investigations show that many such applications fail to work successfully. The failures have many causes, such as poor system infrastructure design or poor query performance.

Data cleaning is needed when combining heterogeneous data sources with relations or tables in databases. Data cleaning, also called data cleansing or data scrubbing, is defined as detecting and removing errors and ambiguities in files and log tables. Its aim is to improve the quality of data. Data quality and data cleaning are closely related: the more regularly data are cleansed, the better their quality becomes. Various data cleaning tools are freely available on the Internet, including WinPure Clean and Match, OpenRefine, Wrangler, DataCleaner and many more. The thesis presents information about the WinPure Clean and Match data cleaning tool, and its benefits and applications in a running environment, which stem from its three-filter mechanism for cleaning data. It has been implemented on a user-defined database, and the results are presented in this chapter.

Clean and Match is made up of three components: Data, Clean and Match. Data gives the list of imported tables. The Clean option consists of seven modules, each with a different purpose: Statistics Module, Case Converter, Text Cleaner, Column Cleaner, E-mail Cleaner, Column Splitter and Column Merger. The Clean section is used to analyze, clean, correct and correctly populate a given table without removing duplicates. The Match section is used to detect duplicates using a fuzzy matching de-duplication technique.

WinPure Clean and Match uses a three-step approach to find duplicates in a given list or database.
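To make the idea of fuzzy matching de-duplication more concrete, the following is a minimal sketch in Python. It is not WinPure's algorithm, which the source does not describe. It assumes a simple normalisation step (lower-casing and whitespace cleanup, loosely in the spirit of a case converter and text cleaner) followed by pairwise comparison with the standard library's `difflib.SequenceMatcher`. The function names, the sample records and the threshold of 0.85 are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lower-case the text and collapse runs of whitespace (assumed cleaning step)."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two normalised strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records: list[str], threshold: float = 0.85) -> list[tuple[int, int, float]]:
    """Compare every pair of records and return the pairs at or above the threshold."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Hypothetical sample data: the first two rows describe the same customer.
customers = [
    "John Smith, 12 High Street",
    "john  smith, 12 High St",
    "Mary Jones, 4 Park Lane",
]
duplicate_pairs = find_duplicates(customers)
```

With this sample, only the first two records are reported as a likely duplicate pair. Comparing every pair is quadratic in the number of records, so a sketch like this is only practical for small lists.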

The sensitivity of the chosen statistical analysis method to outlying and missing values can have consequences for the amount of effort the investigator wants to invest in detecting and remeasuring such values. It also influences decisions about what to do with remaining outliers (leave unchanged, eliminate, or weight during analysis) and with missing data (impute or not) [Armitage P, Berry G., 1987]. Study objectives codetermine the required precision of the outcome measures, the error rate that is acceptable, and, therefore, the necessary investment in data cleaning. Longitudinal studies necessitate checking the temporal consistency of data. Plots of serial individual data such as growth data or repeated measurements of categorical variables often show a recognizable pattern from which a discordant data point clearly stands out [Ki FY, Liu JP, Wang W, 1995]. In clinical trials, there may be concerns about investigator bias resulting from the close data inspections that occur during cleaning, so that examination by an independent expert may be needed.
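The temporal consistency check mentioned above can be illustrated with a small sketch. It assumes a single individual's serial measurements and flags any interior point that jumps away from both of its neighbours by more than a chosen amount. The function name, the `max_step` parameter and the sample growth data are hypothetical and are not taken from the cited studies.

```python
def flag_discordant(values: list[float], max_step: float) -> list[int]:
    """Return indices of interior points that differ from BOTH neighbours by more than max_step.

    The first and last points are not checked because they have only one neighbour.
    """
    flagged = []
    for i in range(1, len(values) - 1):
        jump_from_previous = abs(values[i] - values[i - 1])
        jump_to_next = abs(values[i] - values[i + 1])
        if jump_from_previous > max_step and jump_to_next > max_step:
            flagged.append(i)
    return flagged


# Hypothetical serial height measurements (cm) for one child; the third value looks mistyped.
heights = [100.0, 101.5, 130.0, 104.0, 105.2]
suspect_indices = flag_discordant(heights, max_step=5.0)
```

Here only the third measurement (index 2) is flagged. A flagged point is a candidate for review or remeasurement, not proof of an error.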

In small studies, a single outlier will have a greater distorting effect on the results. Some screening methods, such as examination of data tables, will be more effective, whereas others, such as statistical outlier detection, may become less valid with smaller samples.
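The distorting effect of a single outlier in a small sample can be shown with a short sketch. The source does not name a particular detection rule, so the example below assumes Tukey's fences (values more than 1.5 times the interquartile range beyond the quartiles) as one common choice. The sample values and the function name are hypothetical.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values lying outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical small sample in which one measurement is far from the rest.
small_sample = [4.8, 5.1, 5.0, 4.9, 5.2, 9.5]

flagged = iqr_outliers(small_sample)
kept = [v for v in small_sample if v not in flagged]

mean_with_outlier = statistics.mean(small_sample)   # pulled upward by the extreme value
mean_without_outlier = statistics.mean(kept)        # mean of the remaining values
median_with_outlier = statistics.median(small_sample)  # barely affected by the extreme value
```

In this sample the single extreme value moves the mean from about 5.0 to about 5.75, while the median stays near 5.05. With only six observations the quartiles themselves are unstable, which is one reason such a rule should be treated with caution in small studies.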

As has been noted, some methods are no longer suitable for big data. The data cleansing process mainly consists of identifying errors, detecting them and correcting them.

Although the data need to be analyzed quickly, the data cleansing process is complex and time-consuming, because it must ensure that the cleansed data are of better quality. The importance of the domain expert in the data cleansing process is undeniable, since verification and validation of the cleansed data are the main concerns.

Society for Clinical Data Management. Good clinical data management practices, version 3.0. Milwaukee (Wisconsin): Society for Clinical Data Management; 2003. Available:

Armitage P, Berry G. Statistical methods in medical research, 2nd ed. Oxford: Blackwell Scientific Publications; 1987. 559 pp.

Ki FY, Liu JP, Wang W, Chow SC. The impact of outlying subjects on decision of bio-equivalence. J Biopharm Stat. 1995;5:71–94.
