CM146 Problem Set 1-5 Solutions: MLE and Decision Trees Guide

Understanding Maximum Likelihood Estimation for Bernoulli Data

Maximum Likelihood Estimation (MLE) is a fundamental method in statistics and machine learning for estimating the parameters of a probability distribution. In the CM146 problem set, you are asked to estimate the parameter θ of a Bernoulli distribution from observed data. The same task arises in A/B testing, where you estimate the probability that a user clicks a button or converts. Similarly, given a team's record of wins and losses in the 2026 FIFA World Cup qualifiers, MLE finds the win probability that makes the observed record most likely.

Likelihood Function for Bernoulli Trials

Given n independent and identically distributed (iid) Bernoulli random variables X₁, ..., Xₙ with parameter θ, each observation has probability P(Xₖ = xₖ; θ) = θ^xₖ * (1-θ)^(1-xₖ), and the likelihood function is the product of these terms: L(θ) = Πₖ₌₁ⁿ P(Xₖ = xₖ; θ) = θ^(sum xₖ) * (1-θ)^(n - sum xₖ). This function depends on the data only through the count sum xₖ, not on the order of the observations, because multiplication is commutative. For example, whether you observe heads then tails or tails then heads, the likelihood is θ(1-θ) either way.
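The order-invariance claim above can be checked numerically. The following is a minimal sketch using only the standard library; the function names, the sample, and the value θ=0.3 are illustrative assumptions, not part of the problem set.

```python
import math
from itertools import permutations

def bernoulli_likelihood(xs, theta):
    """Multiply P(X_k = x_k; theta) over the observations, in the given order."""
    value = 1.0
    for x in xs:
        value *= theta if x == 1 else (1.0 - theta)
    return value

def closed_form(xs, theta):
    """theta^(sum x_k) * (1 - theta)^(n - sum x_k)."""
    k = sum(xs)
    return theta ** k * (1.0 - theta) ** (len(xs) - k)

sample = [1, 0, 1, 1, 0]   # hypothetical observations: three 1s, two 0s
theta = 0.3                # arbitrary parameter value for the check

# Every reordering of the sample should give the same likelihood
# (up to floating-point rounding) as the closed-form expression.
order_invariant = all(
    math.isclose(bernoulli_likelihood(p, theta), closed_form(sample, theta))
    for p in permutations(sample)
)
print(order_invariant)
```

Comparing with math.isclose rather than == avoids false alarms from floating-point rounding when the factors are multiplied in a different order.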

Log-Likelihood and Derivatives

Taking the natural log turns the product into a sum: ℓ(θ) = (sum xₖ) log θ + (n - sum xₖ) log(1-θ). The first derivative is dℓ/dθ = (sum xₖ)/θ - (n - sum xₖ)/(1-θ). Setting this to zero gives (sum xₖ)(1-θ) = (n - sum xₖ)θ, so the MLE is θ̂ = (1/n) * sum xₖ, i.e., the sample mean. The second derivative, d²ℓ/dθ² = -(sum xₖ)/θ² - (n - sum xₖ)/(1-θ)², is negative for θ in (0,1), confirming a maximum.
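As a sanity check on the derivation above, the sketch below evaluates the first and second derivatives at θ̂ for the case of six 1s in ten trials. The function names are illustrative, and the second-derivative formula is the standard one obtained by differentiating dℓ/dθ once more.

```python
import math

def log_likelihood(theta, k, n):
    """l(theta) = k log(theta) + (n - k) log(1 - theta), with k = sum of x_k."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def first_derivative(theta, k, n):
    """dl/dtheta = k/theta - (n - k)/(1 - theta)."""
    return k / theta - (n - k) / (1 - theta)

def second_derivative(theta, k, n):
    """d2l/dtheta2 = -k/theta^2 - (n - k)/(1 - theta)^2."""
    return -k / theta ** 2 - (n - k) / (1 - theta) ** 2

n, k = 10, 6
theta_hat = k / n  # sample mean

print("dl/dtheta at theta_hat:", first_derivative(theta_hat, k, n))
print("d2l/dtheta2 at theta_hat:", second_derivative(theta_hat, k, n))
print("l(theta_hat) > l(0.5):", log_likelihood(theta_hat, k, n) > log_likelihood(0.5, k, n))
```

The first derivative should vanish at θ̂ up to rounding error, and the second derivative should be negative, which matches the claim that the sample mean is a maximum.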

Python Plotting Example

For n=10 with six 1s and four 0s, the MLE is 0.6. The following Python code (using numpy and matplotlib) plots the likelihood function:

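import numpy as np
import matplotlib.pyplot as plt
# Grid of candidate values for θ on [0, 1]
theta = np.linspace(0, 1, 101)
# n trials, of which k are 1s
n, k = 10, 6
# Bernoulli likelihood L(θ) = θ^k (1-θ)^(n-k)
L = theta**k * (1 - theta)**(n - k)
plt.plot(theta, L)
plt.xlabel('θ')
plt.ylabel('L(θ)')
# Mark the closed-form MLE k/n with a dashed vertical line
plt.axvline(k / n, color='r', linestyle='--')
plt.show()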

The plot shows a peak at θ=0.6, confirming the closed-form MLE. For n=5 with three 1s, n=100 with sixty 1s, and n=10 with five 1s, the MLEs are 0.6, 0.6, and 0.5 respectively. For a fixed proportion of 1s, the likelihood function becomes narrower around its peak as n increases, indicating greater certainty about θ.
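The narrowing can be made concrete without a plot. The sketch below computes the MLE for each dataset and, as one possible measure of width (an assumption chosen here for illustration, not something the problem set prescribes), the length of the θ-interval on which the likelihood is at least half its peak value.

```python
import math

def log_likelihood(theta, n, k):
    """Bernoulli log-likelihood for k ones in n trials."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def half_max_width(n, k, steps=1000):
    """Width of the grid interval where L(theta) >= 0.5 * L(theta_hat)."""
    cutoff = log_likelihood(k / n, n, k) + math.log(0.5)
    inside = [i / steps for i in range(1, steps)
              if log_likelihood(i / steps, n, k) >= cutoff]
    return max(inside) - min(inside)

# (n, k) pairs from the text: number of trials and number of 1s
datasets = [(5, 3), (10, 6), (100, 60), (10, 5)]
mles = [k / n for n, k in datasets]
widths = [half_max_width(n, k) for n, k in datasets]

for (n, k), mle, w in zip(datasets, mles, widths):
    print(f"n={n:3d} k={k:2d}  MLE={mle:.2f}  half-max width={w:.3f}")
```

Working with the log-likelihood keeps the comparison numerically stable for larger n, where the raw likelihood becomes very small.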

Decision Tree Splitting Heuristics: Entropy vs. Error

Decision trees are popular in machine learning for classification tasks. The ID3 algorithm chooses splits by entropy reduction (information gain). This problem explores why entropy is preferred over simply reducing misclassification error. In the given setting, there are n boolean features X1, ..., Xn with n≥4, the target function is Y = X1 OR X2 OR X3, and the training set contains all 2^n possible examples.

Mistakes of a 1-Leaf Tree

A 1-leaf tree predicts the majority class. Since Y=1 for all examples except those with X1=X2=X3=0, there are 2^(n-3) examples with Y=0 (X4, ..., Xn can take any values). That is one eighth of the 2^n examples, fewer than half (2^(n-3) < 2^(n-1)), so Y=1 is the majority. Thus the tree predicts Y=1 for all examples and makes 2^(n-3) mistakes. For n=4, that's 2^1 = 2 mistakes.
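The count above can be confirmed by brute force. This is a small sketch that enumerates all 2^n boolean vectors; the helper names are illustrative.

```python
from itertools import product

def target(x):
    """Y = X1 OR X2 OR X3 (the first three features, 0-indexed here)."""
    return int(x[0] or x[1] or x[2])

def one_leaf_mistakes(n):
    """Mistakes made by a single leaf that predicts the majority label."""
    labels = [target(x) for x in product([0, 1], repeat=n)]
    positives = sum(labels)
    negatives = len(labels) - positives
    # A majority-vote leaf misclassifies exactly the minority class.
    return min(positives, negatives)

for n in range(4, 9):
    print(n, one_leaf_mistakes(n), 2 ** (n - 3))
```

For each n the enumerated count should agree with the closed form 2^(n-3).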

Can a Split Reduce Mistakes?

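No split can reduce mistakes by even one. Because the target depends only on X1, X2, and X3, splitting on any other feature yields two subsets with the same proportion of Y=1 as the whole, each contributing 2^(n-4) mistakes. Splitting on X1, X2, or X3 creates one pure subset (e.g., X1=1 gives all Y=1) and one subset of 2^(n-1) examples in which Y=1 is still the majority (3/4 of the examples), so that leaf still predicts Y=1 and misclassifies all 2^(n-3) examples with Y=0. Either way, the total number of mistakes remains 2^(n-3). This shows that error reduction is a poor splitting heuristic: it sees no benefit in any first split, even though splitting on X1, X2, and X3 in turn would eventually classify every example correctly.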
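To see that no single split helps, the sketch below tries a split on each feature in turn and counts the mistakes of the resulting two-leaf tree, with each leaf predicting its own majority label. The function names and the choice n=5 are illustrative.

```python
from itertools import product

def target(x):
    """Y = X1 OR X2 OR X3 (the first three features, 0-indexed here)."""
    return int(x[0] or x[1] or x[2])

def majority_mistakes(labels):
    """Mistakes of a leaf that predicts the majority label of its examples."""
    positives = sum(labels)
    return min(positives, len(labels) - positives)

def mistakes_after_split(n, j):
    """Total mistakes of a 2-leaf tree that splits on feature index j."""
    examples = list(product([0, 1], repeat=n))
    total = 0
    for value in (0, 1):
        leaf_labels = [target(x) for x in examples if x[j] == value]
        total += majority_mistakes(leaf_labels)
    return total

n = 5
baseline = 2 ** (n - 3)  # mistakes of the 1-leaf tree
split_mistakes = [mistakes_after_split(n, j) for j in range(n)]
print("1-leaf mistakes:", baseline)
print("mistakes after splitting on each feature:", split_mistakes)
```

Every entry of split_mistakes should equal the baseline, which is why a greedy learner driven by error count has no reason to prefer any first split.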

Entropy and Information Gain

Let B(q) = -q log₂ q - (1-q) log₂(1-q) denote the entropy of a binary variable that is 1 with probability q. The entropy of Y at the root is H(Y) = B((2^n - 2^(n-3)) / 2^n) = B(1 - 1/8) = B(7/8) ≈ 0.5436 bits, independent of n. Splitting on X1 reduces entropy: the conditional entropy is H(Y|X1) = (1/2)H(Y|X1=0) + (1/2)H(Y|X1=1). For X1=1, Y=1 always, so the entropy is 0. For X1=0, Y depends on X2 and X3: the probability of Y=1 is 3/4, so the entropy is B(3/4) ≈ 0.8113. Thus H(Y|X1) = 0.5*0 + 0.5*0.8113 ≈ 0.4056, yielding an information gain of about 0.138 bits. This non-zero gain demonstrates that entropy can find useful splits even when error reduction fails.
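The arithmetic above is easy to reproduce. The sketch below assumes B denotes the binary entropy function in bits and recomputes the three quantities; the variable names are illustrative.

```python
from math import log2

def binary_entropy(q):
    """B(q) = -q log2(q) - (1 - q) log2(1 - q), with B(0) = B(1) = 0."""
    if q == 0.0 or q == 1.0:
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

h_root = binary_entropy(7 / 8)            # H(Y) before any split
h_x1_is_1 = binary_entropy(1.0)           # X1 = 1: Y is always 1
h_x1_is_0 = binary_entropy(3 / 4)         # X1 = 0: Y = 1 with probability 3/4
h_cond = 0.5 * h_x1_is_0 + 0.5 * h_x1_is_1
gain = h_root - h_cond

print(f"H(Y)    = {h_root:.4f} bits")
print(f"H(Y|X1) = {h_cond:.4f} bits")
print(f"gain    = {gain:.4f} bits")
```

The special case for q equal to 0 or 1 follows the usual convention that 0 log 0 is taken to be 0.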

Entropy and Information: Zero Gain Condition

If splitting on attribute Xj produces subsets where the proportion of positives is the same in all subsets, then the conditional entropy equals the original entropy, so information gain is zero. This is because the split does not change the distribution of Y. For example, if a feature is irrelevant (like X4 in our OR function), splitting on it yields subsets with identical positive ratios, leading to zero gain.
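The zero-gain condition can be illustrated by enumeration on the OR target. The sketch below computes information gain for a split on any feature index; the helper names and the choice n=4 are illustrative.

```python
from itertools import product
from math import log2

def target(x):
    """Y = X1 OR X2 OR X3 (the first three features, 0-indexed here)."""
    return int(x[0] or x[1] or x[2])

def entropy(labels):
    """Entropy in bits of a list of 0/1 labels."""
    q = sum(labels) / len(labels)
    if q == 0.0 or q == 1.0:
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def information_gain(n, j):
    """H(Y) - H(Y | X_j) over all 2^n examples, for feature index j."""
    examples = list(product([0, 1], repeat=n))
    labels = [target(x) for x in examples]
    conditional = 0.0
    for value in (0, 1):
        subset = [target(x) for x in examples if x[j] == value]
        conditional += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - conditional

n = 4
gain_x1 = information_gain(n, 0)  # relevant feature
gain_x4 = information_gain(n, 3)  # irrelevant feature
print("gain from X1:", gain_x1)
print("gain from X4:", gain_x4)
```

Splitting on X4 leaves 7/8 of the labels positive in both subsets, so the weighted conditional entropy equals the root entropy and the gain is zero, whereas X1 gives a positive gain.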

Programming Exercise: Applying Decision Trees to Titanic Data

The Titanic dataset is a classic benchmark for binary classification. Using scikit-learn's DecisionTreeClassifier, you can predict survival from features such as passenger class, sex, and age. The problem asks you to preprocess the data (handle missing values, encode categorical variables) and evaluate the model using cross-validation. For instance, you might find that sex is the most important feature, as women had higher survival rates. This exercise mirrors real-world applications like predicting customer churn or loan default.
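The workflow described above might look roughly like the following sketch. It does not load the real Titanic file: the table is synthetic, and the column names (Pclass, Sex, Age, Survived), the survival rates used to generate labels, and the tree settings are all assumptions made for illustration. Replace the synthetic frame with the course's data file to run the actual exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic table (NOT the real data).
# Column names and survival rates are illustrative assumptions.
rng = np.random.default_rng(0)
m = 200
sex = rng.choice(["male", "female"], size=m)
age = rng.uniform(1, 70, size=m)
age[rng.random(m) < 0.2] = np.nan  # simulate missing ages
survive_prob = np.where(sex == "female", 0.75, 0.2)
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=m),
    "Sex": sex,
    "Age": age,
    "Survived": (rng.random(m) < survive_prob).astype(int),
})

# Preprocessing: impute missing ages with the median, encode sex as 0/1.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

X = df[["Pclass", "Sex", "Age"]]
y = df["Survived"]

# Entropy-based tree, evaluated with 5-fold cross-validation.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy:", scores.mean())

# Fit on all rows to inspect which features the tree relies on.
clf.fit(X, y)
importances = dict(zip(X.columns, clf.feature_importances_))
print(importances)
```

Imputing the median on the full table before cross-validation is a simplification; wrapping the imputation and the classifier in a scikit-learn Pipeline would keep each fold's preprocessing separate.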

By mastering MLE and decision trees, you gain tools essential for data science roles in tech, finance, and healthcare. These concepts underpin many AI applications, from recommendation systems to autonomous driving.