
Question

Find the information gain from splitting the dataset on Age.

| #  | Age    | Prescription | Astigmatism | Rate    | Lenses (Class) |
|----|--------|--------------|-------------|---------|----------------|
| 1  | Young  | Myope        | No          | Reduced | None           |
| 2  | Young  | Myope        | Yes         | Normal  | Hard           |
| 3  | Young  | Hypermetrope | No          | Reduced | None           |
| 4  | Young  | Hypermetrope | Yes         | Reduced | None           |
| 5  | Young  | Hypermetrope | Yes         | Normal  | Hard           |
| 6  | Young  | Myope        | No          | Reduced | None           |
| 7  | Middle | Myope        | Yes         | Reduced | None           |
| 8  | Middle | Myope        | Yes         | Normal  | Hard           |
| 9  | Middle | Hypermetrope | No          | Normal  | Soft           |
| 10 | Middle | Hypermetrope | Yes         | Reduced | None           |
| 11 | Middle | Hypermetrope | Yes         | Normal  | None           |
| 12 | Middle | Myope        | No          | Reduced | None           |
| 13 | Middle | Myope        | No          | Normal  | None           |
| 14 | Senior | Myope        | Yes         | Reduced | None           |
| 15 | Senior | Myope        | Yes         | Normal  | Hard           |
| 16 | Senior | Hypermetrope | No          | Reduced | None           |
| 17 | Senior | Hypermetrope | No          | Normal  | Soft           |

Studdy Solution

STEP 1

1. The dataset consists of 17 examples with the attributes Age, Prescription, Astigmatism, and Rate, plus the target class Lenses.
2. The goal is to compute the information gain from splitting the dataset on the attribute Age.
3. Information gain (IG) is calculated using the concept of entropy from information theory.
4. Entropy is a measure of the impurity, or disorder, of a dataset.
5. For a classification problem with $k$ classes, the entropy is
$$H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i,$$
where $p_i$ is the proportion of examples in $S$ belonging to class $i$. Here there are three classes: None, Hard, and Soft.
6. The information gain for an attribute is
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v),$$
where $S$ is the original dataset, $A$ is the attribute, $\mathrm{Values}(A)$ is the set of values $A$ takes, $S_v$ is the subset of $S$ on which $A = v$, and $|S|$ is the number of examples in $S$.
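Before working through the numbers, here is a minimal Python sketch of these two formulas, useful for checking the arithmetic in the steps below. The names `entropy` and `information_gain` are illustrative, not from any particular library.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # By convention 0 * log2(0) = 0, so empty classes are skipped.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """IG(S, A): parent entropy minus the size-weighted subset entropies."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted
```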

STEP 2

1. Calculate the entropy of the original dataset.
2. Split the dataset based on the attribute Age and calculate the entropy for each subset.
3. Calculate the weighted average entropy of the subsets.
4. Compute the information gain by subtracting the weighted average entropy from the original entropy.

STEP 3

Calculate the entropy of the original dataset. First, count the number of instances of each class (None, Hard, Soft).
There are 17 examples in total. Rows 2, 5, 8, and 15 are Hard, rows 9 and 17 are Soft, and the remaining 11 rows are None, so the counts are: None: 11, Hard: 4, Soft: 2.
The proportions are:
$$p_{\text{None}} = \frac{11}{17}, \quad p_{\text{Hard}} = \frac{4}{17}, \quad p_{\text{Soft}} = \frac{2}{17}$$
The entropy $H(S)$ is:
$$H(S) = -\left( \frac{11}{17} \log_2 \frac{11}{17} + \frac{4}{17} \log_2 \frac{4}{17} + \frac{2}{17} \log_2 \frac{2}{17} \right)$$

STEP 4

Calculate the numerical value of the entropy $H(S)$.
$$p_{\text{None}} = \frac{11}{17} \approx 0.647, \quad p_{\text{Hard}} = \frac{4}{17} \approx 0.235, \quad p_{\text{Soft}} = \frac{2}{17} \approx 0.118$$
$$H(S) = -\left( 0.647 \log_2 0.647 + 0.235 \log_2 0.235 + 0.118 \log_2 0.118 \right)$$
$$H(S) \approx -\left( 0.647 \times (-0.628) + 0.235 \times (-2.087) + 0.118 \times (-3.087) \right)$$
$$H(S) \approx 0.406 + 0.491 + 0.363 \approx 1.261$$
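A quick standalone check of this value using only the standard library:

```python
from math import log2

# Class proportions for None, Hard, Soft out of 17 examples.
p = [11/17, 4/17, 2/17]
h_s = -sum(pi * log2(pi) for pi in p)
print(round(h_s, 3))  # 1.261
```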

STEP 5

Split the dataset based on the attribute Age and calculate the entropy for each subset.
The unique values of Age are: Young, Middle, Senior.
Subset for Young (rows 1–6): None: 4, Hard: 2, Soft: 0
Subset for Middle (rows 7–13): None: 5, Hard: 1, Soft: 1
Subset for Senior (rows 14–17): None: 2, Hard: 1, Soft: 1
Calculate the proportions for each subset.
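These per-subset counts can also be tallied mechanically; a small sketch, where `rows` is a hand-typed list of the (Age, Lenses) pairs from the table in the question:

```python
from collections import Counter

# (Age, Lenses) pairs copied from the table in the question.
rows = [
    ("Young", "None"), ("Young", "Hard"), ("Young", "None"),
    ("Young", "None"), ("Young", "Hard"), ("Young", "None"),
    ("Middle", "None"), ("Middle", "Hard"), ("Middle", "Soft"),
    ("Middle", "None"), ("Middle", "None"), ("Middle", "None"),
    ("Middle", "None"), ("Senior", "None"), ("Senior", "Hard"),
    ("Senior", "None"), ("Senior", "Soft"),
]
for (age, lens), n in sorted(Counter(rows).items()):
    print(age, lens, n)
# Middle Hard 1 / Middle None 5 / Middle Soft 1 / Senior Hard 1 /
# Senior None 2 / Senior Soft 1 / Young Hard 2 / Young None 4
```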

STEP 6

Calculate the entropy for the subset where Age is Young.
There are 6 examples: None: 4, Hard: 2, Soft: 0.
The proportions are:
$$p_{\text{None}} = \frac{4}{6}, \quad p_{\text{Hard}} = \frac{2}{6}, \quad p_{\text{Soft}} = \frac{0}{6} = 0$$
The entropy $H(\text{Young})$ is (using the convention $0 \log_2 0 = 0$):
$$H(\text{Young}) = -\left( \frac{4}{6} \log_2 \frac{4}{6} + \frac{2}{6} \log_2 \frac{2}{6} + 0 \log_2 0 \right)$$

STEP 7

Calculate the numerical value of the entropy $H(\text{Young})$.
$$p_{\text{None}} = \frac{4}{6} \approx 0.667, \quad p_{\text{Hard}} = \frac{2}{6} \approx 0.333$$
$$H(\text{Young}) = -\left( 0.667 \log_2 0.667 + 0.333 \log_2 0.333 \right)$$
$$H(\text{Young}) \approx -\left( 0.667 \times (-0.585) + 0.333 \times (-1.585) \right)$$
$$H(\text{Young}) \approx 0.390 + 0.528 \approx 0.918$$

STEP 8

Calculate the entropy for the subset where Age is Middle.
There are 7 examples: None: 5, Hard: 1, Soft: 1.
The proportions are:
$$p_{\text{None}} = \frac{5}{7}, \quad p_{\text{Hard}} = \frac{1}{7}, \quad p_{\text{Soft}} = \frac{1}{7}$$
The entropy $H(\text{Middle})$ is:
$$H(\text{Middle}) = -\left( \frac{5}{7} \log_2 \frac{5}{7} + \frac{1}{7} \log_2 \frac{1}{7} + \frac{1}{7} \log_2 \frac{1}{7} \right)$$

STEP 9

Calculate the numerical value of the entropy $H(\text{Middle})$.
$$p_{\text{None}} = \frac{5}{7} \approx 0.714, \quad p_{\text{Hard}} = \frac{1}{7} \approx 0.143, \quad p_{\text{Soft}} = \frac{1}{7} \approx 0.143$$
$$H(\text{Middle}) = -\left( 0.714 \log_2 0.714 + 0.143 \log_2 0.143 + 0.143 \log_2 0.143 \right)$$
$$H(\text{Middle}) \approx -\left( 0.714 \times (-0.485) + 0.143 \times (-2.807) + 0.143 \times (-2.807) \right)$$
$$H(\text{Middle}) \approx 0.347 + 0.401 + 0.401 \approx 1.149$$

STEP 10

Calculate the entropy for the subset where Age is Senior.
There are 4 examples: None: 2, Hard: 1, Soft: 1.
The proportions are:
$$p_{\text{None}} = \frac{2}{4}, \quad p_{\text{Hard}} = \frac{1}{4}, \quad p_{\text{Soft}} = \frac{1}{4}$$
The entropy $H(\text{Senior})$ is:
$$H(\text{Senior}) = -\left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{1}{4} \log_2 \frac{1}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right)$$

STEP 11

Calculate the numerical value of the entropy $H(\text{Senior})$.
$$p_{\text{None}} = \frac{2}{4} = 0.500, \quad p_{\text{Hard}} = \frac{1}{4} = 0.250, \quad p_{\text{Soft}} = \frac{1}{4} = 0.250$$
$$H(\text{Senior}) = -\left( 0.500 \log_2 0.500 + 0.250 \log_2 0.250 + 0.250 \log_2 0.250 \right)$$
$$H(\text{Senior}) = -\left( 0.500 \times (-1.000) + 0.250 \times (-2.000) + 0.250 \times (-2.000) \right)$$
$$H(\text{Senior}) = 0.500 + 0.500 + 0.500 = 1.500$$
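All three subset entropies can be verified in one pass; a standalone check (same `entropy` helper as in the first sketch):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts (None, Hard, Soft) per Age subset.
for name, counts in [("Young", [4, 2, 0]),
                     ("Middle", [5, 1, 1]),
                     ("Senior", [2, 1, 1])]:
    print(name, round(entropy(counts), 3))
# Young 0.918 / Middle 1.149 / Senior 1.5
```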

STEP 12

Calculate the weighted average entropy of the subsets.
The weights are the proportions of each subset relative to the total dataset.
$$\text{Weight for Young} = \frac{|S_{\text{Young}}|}{|S|} = \frac{6}{17}, \quad \text{Weight for Middle} = \frac{|S_{\text{Middle}}|}{|S|} = \frac{7}{17}, \quad \text{Weight for Senior} = \frac{|S_{\text{Senior}}|}{|S|} = \frac{4}{17}$$
$$H_{\text{weighted}} = \frac{6}{17} \, H(\text{Young}) + \frac{7}{17} \, H(\text{Middle}) + \frac{4}{17} \, H(\text{Senior})$$

STEP 13

Calculate the numerical value of the weighted average entropy.
$$H_{\text{weighted}} = \left( \frac{6}{17} \times 0.918 \right) + \left( \frac{7}{17} \times 1.149 \right) + \left( \frac{4}{17} \times 1.500 \right)$$
$$H_{\text{weighted}} \approx 0.324 + 0.473 + 0.353$$
$$H_{\text{weighted}} \approx 1.150$$
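The same weighted average, checked with the rounded subset entropies from above:

```python
# Size-weighted average of the subset entropies.
h_weighted = 6/17 * 0.918 + 7/17 * 1.149 + 4/17 * 1.500
print(round(h_weighted, 3))  # 1.15
```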

STEP 14

Compute the information gain by subtracting the weighted average entropy from the original entropy.
$$IG(S, \text{Age}) = H(S) - H_{\text{weighted}}$$
$$IG(S, \text{Age}) \approx 1.261 - 1.150 \approx 0.111$$
Solution: The information gain from splitting the dataset on Age is approximately 0.111 bits.
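Putting it all together, a self-contained sketch that recomputes the whole answer directly from the table; `rows` repeats the (Age, Lenses) pairs typed out earlier:

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

rows = [
    ("Young", "None"), ("Young", "Hard"), ("Young", "None"),
    ("Young", "None"), ("Young", "Hard"), ("Young", "None"),
    ("Middle", "None"), ("Middle", "Hard"), ("Middle", "Soft"),
    ("Middle", "None"), ("Middle", "None"), ("Middle", "None"),
    ("Middle", "None"), ("Senior", "None"), ("Senior", "Hard"),
    ("Senior", "None"), ("Senior", "Soft"),
]

# Parent entropy over all 17 examples.
h_s = entropy(Counter(lens for _, lens in rows).values())

# Size-weighted entropy of the Age subsets.
h_weighted = 0.0
for age in {a for a, _ in rows}:
    subset = [lens for a, lens in rows if a == age]
    h_weighted += len(subset) / len(rows) * entropy(Counter(subset).values())

print(round(h_s - h_weighted, 3))  # 0.111
```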
