Data sets collected over a period of time without clear boundaries are often difficult to understand and simulate. The results they produce are often confusing because the data sets are not properly initialized, and values extended during data mining often do not match the original values. Based on the data collected, a prediction model can determine attributes. In the simplest case an attribute value is either true or false, so the model classifies the data into two categories: category one matches one attribute value and category two matches the other. The results are analyzed using the observers' expertise, historical data and simulation results. Tools that support data prediction models are widely used by the research community. Complex data sets can define more than two attributes, which can refine the prediction model. For example, with patient data, a patient can be anaemic or non-anaemic. A correlation model can then cover an anaemic patient with or without depression, and vice versa.
A query using the Pearson coefficient [1] to determine the correlation between two attributes:
SELECT
Group1, Group2,
((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n)))
AS
r, n
FROM
(SELECT
n1.Group AS Group1,
n2.Group AS Group2,
SUM(n1.depression) AS sum1,
SUM(n2.depression) AS sum2,
SUM(n1.depression * n1.depression) AS sum1sq,
SUM(n2.depression * n2.depression) AS sum2sq,
SUM(n1.depression * n2.depression) AS psum,
COUNT(*) AS n
FROM
testdata AS n1
LEFT JOIN
testdata AS n2
ON
n1.anaemic = n2.nonanaemic
WHERE
n1.Group > n2.Group
GROUP BY
n1.Group, n2.Group) AS step1
ORDER BY
r DESC,
n DESC
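The arithmetic in the outer SELECT can be checked outside the database. The following is a minimal Python sketch, assuming the paired depression values are already available as two equal-length lists. The function name `pearson_r` and the sample values are hypothetical and not part of the original query. The variable names mirror the column aliases used in the query (sum1, sum2, sum1sq, sum2sq, psum, n).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson coefficient computed from the same running sums as the query.

    Returns None when the coefficient is undefined (no rows, or one of the
    two series has zero variance).
    """
    if len(xs) != len(ys):
        raise ValueError("both series must have the same length")
    n = len(xs)
    if n == 0:
        return None

    sum1 = sum(xs)                               # SUM(n1.depression)
    sum2 = sum(ys)                               # SUM(n2.depression)
    sum1sq = sum(x * x for x in xs)              # SUM(n1.depression * n1.depression)
    sum2sq = sum(y * y for y in ys)              # SUM(n2.depression * n2.depression)
    psum = sum(x * y for x, y in zip(xs, ys))    # SUM(n1.depression * n2.depression)

    numerator = psum - (sum1 * sum2 / n)
    spread = (sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n)
    if spread <= 0:
        # Zero variance in at least one series: r is undefined.
        return None
    return numerator / sqrt(spread)


if __name__ == "__main__":
    # Hypothetical depression scores for paired rows, for illustration only.
    group_a = [1, 2, 3, 4]
    group_b = [2, 4, 6, 8]
    print(pearson_r(group_a, group_b))
```

Unlike the SQL expression, this sketch returns None instead of dividing by zero when one series is constant. The query may need a similar guard if a group can contain identical values.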
[1] http://www.vanheusden.com/misc/pearson.php