Probabilistic Artificial Intelligence
Professor: Andreas Krause
1. Gaussian
$N(x\mid\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
- among all distributions over the real numbers with known mean and variance, the Gaussian distribution has the maximum entropy.
1.0. Derivation
Suppose the Gaussian density is $f(x)$.
Assumption 1: the distribution is isotropic, so we consider it in 2-D, where
$f(x)\,f(y) = f(\sqrt{x^2+y^2})\,f(0)$
Then divide both sides by $f(0)^2$:
$\frac{f(x)}{f(0)}\cdot\frac{f(y)}{f(0)} = \frac{f(\sqrt{x^2+y^2})}{f(0)}$
Then apply $\ln$ to turn the product into a sum.
$\ln\frac{f(x)}{f(0)} + \ln\frac{f(y)}{f(0)} = \ln\frac{f(\sqrt{x^2+y^2})}{f(0)}$
To make it easier, we denote $g(x) = \ln\frac{f(x)}{f(0)}$:
$g(x) + g(y) = g(\sqrt{x^2+y^2})$
Now it is easy to see the solution for $g(x)$ ($c$ is a constant):
$g(x) = cx^2$
So we obtain the expression for $f(x)$:
$f(x) = f(0)\,e^{cx^2}$
Assumption 2: The integral of f(x) is 1.
$\int_{-\infty}^{\infty} f(x)\,dx = 1$
Again work in 2-D:
$\left(\int_{-\infty}^{\infty}f(x)dx\right)^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f(x)f(y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f(0)^2e^{c(x^2+y^2)}\,dx\,dy = \int_0^{\infty}\int_0^{2\pi}f(0)^2e^{cr^2}r\,d\theta\,dr = 2\pi f(0)^2\left.\frac{e^{cr^2}}{2c}\right|_{r=0}^{r=\infty} = -\frac{\pi}{c}f(0)^2 = 1$
So we can express $f(0)$ in terms of $c$ (with $c < 0$):
$f(x) = \sqrt{\frac{-c}{\pi}}\,e^{cx^2}$
Now suppose we have $n$ samples $x_1, x_2, \cdots, x_n$ from the distribution $f(x)$, with empirical mean $\mu$ and variance $\sigma^2$:
$\mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2$
To express $c$ in terms of $\mu, \sigma$, we use the maximum likelihood estimate:
$L(x) = \prod_{i=1}^n f(x-x_i) \;\Rightarrow\; \ln L(x) = \sum_{i=1}^n\ln f(x-x_i) \;\Rightarrow\; \begin{cases}\frac{\partial\ln L(x)}{\partial x} = 0\\[2pt] \frac{\partial\ln L(x)}{\partial c} = 0\end{cases} \;\Rightarrow\; \begin{cases}x = \mu\\[2pt] c = -\frac{1}{2\sigma^2}\end{cases}$
So we get the full expression of $f(x)$, which corresponds to $N(0,\sigma)$:
$f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)$
Shifting by $\mu$ gives the density of $N(\mu,\sigma)$:
$f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
In higher dimensions, $\mu, x\in\mathbb{R}^d$ become vectors. In the simplest case the components of $x$ are independent with a common scale $\sigma$:
$f(x) = \frac{1}{(2\pi)^{d/2}\sigma^d}\exp\left(-\frac{(x-\mu)^T(x-\mu)}{2\sigma^2}\right)$
But in general the components of $x$ are correlated. Consider the whitening transformation
$y = \frac{A(x-\mu)}{\sigma}$
so that $y$ is standard normal. Changing variables back to $x$ (the Jacobian contributes $|A|/\sigma^d$) and denoting $\Sigma = \sigma^2(A^TA)^{-1}$, we get the general form of the Gaussian density:
$f(x) = \frac{|A|}{(2\pi)^{d/2}\sigma^d}\exp\left(-\frac{(x-\mu)^TA^TA(x-\mu)}{2\sigma^2}\right) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
1.1. p(N∣N)
Suppose $X_A, X_B$ are disjoint index subsets of $X\sim N(\mu_V,\Sigma_{VV})$.
Then $p(X_A\mid X_B=x_B) = N(\mu_{A\mid B},\Sigma_{A\mid B})$ with
$\mu_{A\mid B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B)$
$\Sigma_{A\mid B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$
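A minimal numpy sketch of this conditioning rule (the joint mean, covariance, and index sets below are made-up toy values):

```python
import numpy as np

# Toy joint Gaussian over 3 variables; indices A = [0], B = [1, 2] are arbitrary.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A, B = [0], [1, 2]
x_B = np.array([1.5, -0.5])          # observed values of X_B

S_AB = Sigma[np.ix_(A, B)]
S_BB = Sigma[np.ix_(B, B)]
S_AA = Sigma[np.ix_(A, A)]

# mu_{A|B} = mu_A + S_AB S_BB^{-1} (x_B - mu_B)
mu_cond = mu[A] + S_AB @ np.linalg.solve(S_BB, x_B - mu[B])
# Sigma_{A|B} = S_AA - S_AB S_BB^{-1} S_BA
Sigma_cond = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)
print(mu_cond, Sigma_cond)
```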
1.2. p(MN)
Suppose $X\sim N(\mu_X,\Sigma_{XX})$ and $M\in\mathbb{R}^{m\times d}$.
Then $Y = MX \sim N(\mu_Y,\Sigma_{YY})$ with
$\mu_Y = M\mu_X$
$\Sigma_{YY} = M\Sigma_{XX}M^T$
1.3. p(N+N)
Suppose $X\sim N(\mu_X,\Sigma_{XX})$ and $X'\sim N(\mu_{X'},\Sigma_{X'X'})$ are independent.
Then $Y = X + X' \sim N(\mu_Y,\Sigma_{YY})$ with
$\mu_Y = \mu_X + \mu_{X'}$
$\Sigma_{YY} = \Sigma_{XX} + \Sigma_{X'X'}$
1.4. p(NN)
Suppose $X\sim N(\mu_X,\Sigma_{XX})$ and $X'\sim N(\mu_{X'},\Sigma_{X'X'})$.
Then the product of the two densities is proportional to $N(\mu_Y,\Sigma_{YY})$ with
$\Sigma_{YY} = (\Sigma_{XX}^{-1} + \Sigma_{X'X'}^{-1})^{-1}$
$\mu_Y = \Sigma_{YY}\Sigma_{XX}^{-1}\mu_X + \Sigma_{YY}\Sigma_{X'X'}^{-1}\mu_{X'}$
If $X'\sim N(0,I)$:
$\Sigma_{YY} = (\Sigma_{XX}^{-1} + I)^{-1}$
$\mu_Y = \Sigma_{YY}\Sigma_{XX}^{-1}\mu_X$
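A small numpy sketch of the precision-weighted combination above (dimensions and inputs are arbitrary):

```python
import numpy as np

def gaussian_product(mu1, S1, mu2, S2):
    """Parameters of N(mu_Y, S_Y) proportional to N(mu1,S1) * N(mu2,S2)."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)   # precisions
    S_Y = np.linalg.inv(P1 + P2)
    mu_Y = S_Y @ (P1 @ mu1 + P2 @ mu2)
    return mu_Y, S_Y

mu_Y, S_Y = gaussian_product(np.array([0.0, 1.0]), np.eye(2) * 2.0,
                             np.array([1.0, -1.0]), np.eye(2) * 0.5)
print(mu_Y, S_Y)
```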
2. Probability
Mean
$E[x] = \frac{1}{n}\sum_{i=1}^n x_i$
Variance
$Var[x] = E[x^2] - E[x]^2$
Covariance
Cov[x,y]=E[xy]−E[x]E[y]
Var[x−y]=Var[x]+Var[y]−2Cov[x,y]
Maximum Likelihood Estimation (MLE)
$\hat{\theta} = \arg\max_\theta P(X_{1:n}\mid\theta)$
- sometimes a regularizer is added: the $\ell_2$ norm $\|\cdot\|_2$ for vectors, the Frobenius norm $\|\cdot\|_F$ for matrices
Maximum a posteriori (MAP)
$\hat{\theta} = \arg\max_\theta P(\theta\mid X_{1:n}) = \arg\max_\theta P(X_{1:n}\mid\theta)\,P(\theta)$
KL-Divergence
$D_{KL}(p\,\|\,q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$
- $D_{KL}(p\,\|\,q)$ (forward KL): mode averaging / mass covering
- $D_{KL}(q\,\|\,p)$ (reverse KL): mode seeking; e.g. for $p\sim N\left(0,\begin{bmatrix}\sigma_1^2 & 0\\ 0 & \sigma_2^2\end{bmatrix}\right)$ the best isotropic $q$ has $\sigma_q^2 = \frac{2}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}$ (see the sketch below)
- $D_{KL}(q\,\|\,p)$ is well defined only if the support of $q$ is contained in the support of $p$
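A quick numerical check of the reverse-KL claim in the list above, using the closed-form KL between Gaussians (the diagonal variances are arbitrary toy values):

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """D_KL( N(mu0, S0) || N(mu1, S1) ) in closed form."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# p = N(0, diag(sigma1^2, sigma2^2)); fit an isotropic q = N(0, s2 * I) by reverse KL.
s1_sq, s2_sq = 1.0, 9.0
Sp = np.diag([s1_sq, s2_sq])
grid = np.linspace(0.1, 10.0, 2000)
rev = [kl_gauss(np.zeros(2), s * np.eye(2), np.zeros(2), Sp) for s in grid]
best = grid[int(np.argmin(rev))]
print(best, 2.0 / (1.0 / s1_sq + 1.0 / s2_sq))   # both ~1.8: the harmonic-mean variance
```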
$\Gamma $ distribution
$\Gamma(x;\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$
$E[\Gamma(x;\alpha,\beta)] = \frac{\alpha}{\beta}$
$Var[\Gamma(x;\alpha,\beta)] = \frac{\alpha}{\beta^2}$
Beta distribution
$B(x;\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$
$E[B(x;\alpha,\beta)] = \frac{\alpha}{\alpha+\beta}$
$Var[B(x;\alpha,\beta)] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
Bernoulli distribution
$\mathrm{Bernoulli}(x;p) = p^x(1-p)^{1-x}$
$E[\mathrm{Bernoulli}(x;p)] = p$
$Var[\mathrm{Bernoulli}(x;p)] = p(1-p)$
Poisson distribution
$\Pr(x=k;\lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$
$E[x] = \lambda$
$Var[x] = \lambda$
3. Bayesian Linear Regression
Assume
dataset $X = \{x_1,\cdots,x_m\}$, $Y = \{y_1,\cdots,y_m\}$
prior
- $f(x_i) = x_i^T w$
- $y_i = f(x_i) + \epsilon_i$, $\epsilon_i\sim N(0,\sigma_n^2)$
- $w\sim N(0,\sigma_p^2 I)$
Then
$P(w\mid Y,X) = N(\mu,\Sigma)$
$\mu = \frac{1}{\sigma_n^2}\Sigma X^T Y$
$\Sigma = \left(\frac{1}{\sigma_n^2}X^T X + \frac{1}{\sigma_p^2}I\right)^{-1}$
$f(x_*) \sim N(x_*^T\mu,\ x_*^T\Sigma x_*)$
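A compact numpy sketch of these posterior formulas on synthetic data (the design matrix and the scales $\sigma_n$, $\sigma_p$ are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_n, sigma_p = 0.1, 1.0                  # noise / prior scales (assumed)
X = rng.normal(size=(50, 3))                 # toy design matrix
w_true = np.array([0.5, -1.0, 2.0])
Y = X @ w_true + sigma_n * rng.normal(size=50)

# Sigma = (X^T X / sigma_n^2 + I / sigma_p^2)^{-1},  mu = Sigma X^T Y / sigma_n^2
Sigma = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(3) / sigma_p**2)
mu = Sigma @ X.T @ Y / sigma_n**2

x_star = np.array([1.0, 0.0, -1.0])
pred_mean = x_star @ mu
pred_var = x_star @ Sigma @ x_star           # variance of f(x*); add sigma_n^2 for y*
print(pred_mean, pred_var)
```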
online
$X^\top X = \sum_{i=1}^t x_i x_i^\top,\qquad X^\top Y = \sum_{i=1}^t y_i x_i$
$\Rightarrow\ X_{new}^\top X_{new} = X^\top X + x_{t+1}x_{t+1}^\top,\qquad X_{new}^\top Y_{new} = X^\top Y + y_{t+1}x_{t+1}$
fast
- reduces the update from $O(d^3)$ to $O(d^2)$ via the rank-one (Sherman-Morrison) formula
$(A+xx^\top)^{-1} = A^{-1} - \frac{(A^{-1}x)(A^{-1}x)^\top}{1+x^\top A^{-1}x}$
$(X_{new}^\top X_{new} + \sigma_n^2 I)^{-1} = (\underbrace{X^\top X + \sigma_n^2 I}_{A} + x_{t+1}x_{t+1}^\top)^{-1}$
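A sketch of the rank-one update; `sherman_morrison_update` is an illustrative helper name, not from the lecture:

```python
import numpy as np

def sherman_morrison_update(A_inv, x):
    """(A + x x^T)^{-1} computed from A^{-1} in O(d^2)."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

rng = np.random.default_rng(1)
d = 4
X = rng.normal(size=(20, d))
sigma_n = 0.5
A = X.T @ X + sigma_n**2 * np.eye(d)
A_inv = np.linalg.inv(A)

x_new = rng.normal(size=d)                        # new data point x_{t+1}
fast = sherman_morrison_update(A_inv, x_new)      # O(d^2) update
slow = np.linalg.inv(A + np.outer(x_new, x_new))  # O(d^3) recomputation
print(np.allclose(fast, slow))                    # True
```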
4. Gaussian Process
- closed formulae for Bayesian posterior update exist
kernel
RBF Kernel
$k(u,v) = \sigma_F^2\exp\left(-\frac{(u-v)^2}{2l^2}\right)$
- $l$ (length scale): controls how quickly correlation decays with distance
- $\sigma_F$ (output scale): controls the magnitude
- $\lim_{l\to 0} k(u,v) = \sigma_F^2\,\delta(u-v)$
- posterior variance in this limit (at a training point): $\lim_{l\to 0} k'(x,x) = k(x,x) - k_{xX}(K_{XX}+\sigma_n^2 I)^{-1}k_{xX}^\top = \frac{\sigma_F^2\sigma_n^2}{\sigma_F^2+\sigma_n^2}$
Assume
dataset $X = \{x_1,\cdots,x_m\}$, $Y = \{y_1,\cdots,y_m\}$
prior
- f∼GP(μ,k)
- $y_i = f(x_i) + \epsilon_i$, $\epsilon_i\sim N(0,\sigma_n^2)$
Then
$P(f\mid X,Y) = GP(f;\mu',k')$
$\mu'_{x_*} = \mu_{x_*} + K_{x_*X}(K_{XX}+\sigma_n^2 I)^{-1}(Y-\mu_X)$
$k'_{x_*x_*} = K_{x_*x_*} - K_{x_*X}(K_{XX}+\sigma_n^2 I)^{-1}K_{Xx_*}$
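A minimal GP-posterior sketch assuming a zero prior mean and the RBF kernel from above (data and hyperparameters are toy choices):

```python
import numpy as np

def rbf(A, B, sigma_f=1.0, length=0.5):
    """RBF kernel matrix between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / (2 * length**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                 # training inputs (toy)
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)      # noisy observations
Xs = np.linspace(-3, 3, 5)[:, None]                  # test inputs
sigma_n = 0.1

K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
Ks = rbf(Xs, X)
Kss = rbf(Xs, Xs)

# Zero prior mean assumed, so mu' = K_{*X} (K_{XX} + sigma_n^2 I)^{-1} Y
mu_post = Ks @ np.linalg.solve(K, Y)
cov_post = Kss - Ks @ np.linalg.solve(K, Ks.T)
print(mu_post, np.diag(cov_post))
```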
Especially for the linear kernel:
$k(x,x') = \lambda x^T x'$ is equivalent to BLR with $\lambda = \sigma_p^2$
Woodbury push-through
U(VU+I)−1=(UV+I)−1U
Fast GPs
$k(x,x')\approx\phi(x)^T\phi(x')$, with $\phi(x)\in\mathbb{R}^m$
computational cost: $O(nm^2 + m^3)$
Fourier Features
For stationary kernels $k(x,x') = k(x-x')$:
$k(x-x') = \int_{\mathbb{R}^d}p(\omega)\,e^{j\omega^\top(x-x')}\,d\omega$
Bochner's theorem: $k$ is a valid (positive semi-definite) kernel iff its Fourier transform satisfies $p(\omega)\ge 0$; sampling $\omega\sim p$ gives random Fourier features
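A sketch of random Fourier features for the RBF kernel, using the fact that its spectral density is Gaussian with scale $1/l$ (feature count and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 2000                       # input dim, number of random features
length, sigma_f = 0.7, 1.0           # RBF hyperparameters (assumed)

W = rng.normal(scale=1.0 / length, size=(m, d))   # omega ~ p(omega) for the RBF kernel
b = rng.uniform(0, 2 * np.pi, size=m)

def phi(x):
    """Random Fourier feature map: phi(x)^T phi(x') ~= k(x, x')."""
    return np.sqrt(2.0 * sigma_f**2 / m) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = sigma_f**2 * np.exp(-np.sum((x - y) ** 2) / (2 * length**2))
print(exact, phi(x) @ phi(y))        # approximately equal for large m
```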
Inducing points
Subset of Regressors (SoR): $q_{SoR}(f\mid u) = N(K_{f,u}K_{u,u}^{-1}u,\ 0)$, approximating $p(f\mid u) = N(K_{f,u}K_{u,u}^{-1}u,\ K_{f,f}-Q_{f,f})$ with $Q_{f,f} = K_{f,u}K_{u,u}^{-1}K_{u,f}$
Fully independent training conditional (FITC): $q_{FITC}(f\mid u) = N(K_{f,u}K_{u,u}^{-1}u,\ \mathrm{diag}(K_{f,f}-Q_{f,f}))$, approximating the same $p(f\mid u)$
optimize hyperparameters via maximizing the marginal likelihood
computational cost: cubic only in the number of inducing points ($\because K_{u,u}^{-1}$)
Kalman Filter
$x_{t+1} = F_t x_t + \epsilon_t$, $\epsilon_t\sim N(0,\Sigma_x)$
$y_t = H_t x_t + \eta_t$, $\eta_t\sim N(0,\Sigma_y)$
predict
$\hat{x}_{t+1} = F_t x_t$
$\hat{\Sigma}_{t+1} = F_t\Sigma_{x,t}F_t^\top + \Sigma_x$
correct
$K_{t+1} = \hat{\Sigma}_{t+1}H_t^\top\left(H_t\hat{\Sigma}_{t+1}H_t^\top + \Sigma_y\right)^{-1}$
$x_{t+1} = \hat{x}_{t+1} + K_{t+1}(y_{t+1} - H_t\hat{x}_{t+1})$
$\Sigma_{x,t+1} = (I - K_{t+1}H_t)\hat{\Sigma}_{t+1}$
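A numpy sketch of one predict/correct step (the constant-velocity model and noise levels are made up):

```python
import numpy as np

def kalman_step(x, Sigma, y, F, H, Sx, Sy):
    """One predict + correct step of the Kalman filter."""
    # predict
    x_hat = F @ x
    S_hat = F @ Sigma @ F.T + Sx
    # correct
    K = S_hat @ H.T @ np.linalg.inv(H @ S_hat @ H.T + Sy)
    x_new = x_hat + K @ (y - H @ x_hat)
    S_new = (np.eye(len(x)) - K @ H) @ S_hat
    return x_new, S_new

# Toy 1-D constant-velocity model (F, H, and noise levels are illustrative).
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Sx, Sy = 0.01 * np.eye(2), np.array([[0.5]])
x, Sigma = np.zeros(2), np.eye(2)
x, Sigma = kalman_step(x, Sigma, np.array([1.2]), F, H, Sx, Sy)
print(x, Sigma)
```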
Approximation Method
Laplace approximation
- unimodal
- no previous knowledge needed
- (figure omitted) left: backward/reverse KL $D_{KL}(q\,\|\,p)$; right: forward KL $D_{KL}(p\,\|\,q)$ (blue is $q$, orange is $p$)
ELBO
$Q^* = \arg\min_{Q\in\mathcal{Q}} D_{KL}\big(Q(Z)\,\|\,P(Z\mid X)\big) = \arg\max_{Q\in\mathcal{Q}} \mathbb{E}_{Z\sim Q(Z)}[\log P(X\mid Z)] - D_{KL}\big(Q(Z)\,\|\,P(Z)\big)$
Reparameterization trick: write $\theta = g(\epsilon;\lambda)$ with $\epsilon$ drawn from a fixed distribution (e.g. $\theta = \mu + \sigma\epsilon$, $\epsilon\sim N(0,I)$), so gradients with respect to the variational parameters can flow through the sample.
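A tiny illustration of the trick on a toy objective $E[\theta^2]$, chosen only because its gradient $2\mu$ is known in closed form:

```python
import numpy as np

# Pathwise (reparameterized) gradient of E_{theta ~ N(mu, sigma^2)}[theta^2] w.r.t. mu.
rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.normal(size=100_000)       # eps ~ N(0, 1), independent of mu and sigma
theta = mu + sigma * eps             # theta = g(eps; mu, sigma)
grad_mu = np.mean(2 * theta * 1.0)   # chain rule through the sample: d theta / d mu = 1
print(grad_mu, 2 * mu)               # Monte-Carlo estimate ~3.0 vs exact 3.0
```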
5. Markov Chain Monte Carlo(MCMC)
π=πP
where $\pi$ is the stationary distribution and $P$ is the transition matrix
ergodic: $\exists t\in\mathbb{N}$ such that every entry of $P^t$ is positive
Metropolis-Hastings Algorithm (MH Algorithm)
given proposal distribution R(x′∣x) and unnormalized stationary distribution Q(x)
acceptance probability $\alpha = \min\left\{1,\ \frac{Q(x')R(x\mid x')}{Q(x)R(x'\mid x)}\right\}$
new transition matrix: column-normalize($R\odot\alpha$)
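A sketch of random-walk Metropolis-Hastings with a symmetric proposal (so the $R$ terms cancel in $\alpha$); the target is a toy standard normal:

```python
import numpy as np

def metropolis_hastings(log_q, x0, n_steps, step=0.5, rng=None):
    """Random-walk MH targeting the unnormalized density exp(log_q)."""
    rng = rng or np.random.default_rng(0)
    x, samples = x0, []
    for _ in range(n_steps):
        x_prop = x + step * rng.normal(size=x.shape)   # symmetric proposal R(x'|x)
        if np.log(rng.uniform()) < log_q(x_prop) - log_q(x):
            x = x_prop                                 # accept with probability alpha
        samples.append(x)
    return np.array(samples)

# Toy target: unnormalized standard normal in 2-D.
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 20_000)
print(samples[5000:].mean(axis=0), samples[5000:].std(axis=0))   # ~0 mean, ~1 std
```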
Gibbs Sampling
a special case of the MH algorithm, with acceptance probability $\alpha = 1$
Metropolis Adjusted Langevin Algorithm(MALA)
$R(x'\mid x) = N\big(x';\ x-\tau\nabla f(x),\ 2\tau I\big)$, where $f = -\log p$ is the energy
Stochastic Graident Langevin Dynamics (SGLD)
$\theta_{t+1} = \theta_t + \eta\left(\nabla\log p(\theta_t) + \frac{N}{n}\sum_{j=1}^n\nabla\log p(y_{i_j}\mid\theta_t,x_{i_j})\right) + \epsilon_t,\quad\epsilon_t\sim N(0,2\eta I)$
where the sum is over a mini-batch of size $n$ out of $N$ data points, and $p(y_{i_j}\mid\theta_t,x_{i_j})$ is the likelihood
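A sketch of SGLD on a 1-D conjugate toy problem where the exact posterior is known, so the samples can be checked (data, step size, and batch size are made up):

```python
import numpy as np

# SGLD on a toy posterior: prior N(0, 1), likelihood y_i ~ N(theta, 1).
rng = np.random.default_rng(0)
N, n, eta = 200, 20, 1e-3            # dataset size, mini-batch size, step size
y = rng.normal(loc=2.0, size=N)      # synthetic data

theta, samples = 0.0, []
for t in range(20_000):
    batch = rng.choice(y, size=n, replace=False)
    grad_log_prior = -theta                          # d/dtheta log N(theta; 0, 1)
    grad_log_lik = (N / n) * np.sum(batch - theta)   # rescaled mini-batch gradient
    theta = theta + eta * (grad_log_prior + grad_log_lik) \
            + np.sqrt(2 * eta) * rng.normal()
    samples.append(theta)

# True posterior mean is N*ybar/(N+1); the SGLD samples should roughly match it.
print(np.mean(samples[5000:]), N * y.mean() / (N + 1))
```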
6. Bayesian Deep Learning
P(y∣x,θ)=N(y;fμ(x,θμ),exp(fσ(x,θσ)))
$\hat{\theta} = \arg\min_\theta\ -\ln P(\theta) - \sum_{i=1}^n\ln P(y_i\mid x_i,\theta) = \arg\min_\theta\ \lambda\|\theta\|_2^2 + \frac{1}{2}\sum_{i=1}^n\left[\frac{\|y_i-\mu(x_i;\theta_\mu)\|^2}{\sigma(x_i;\theta_\sigma)^2} + \ln\sigma(x_i;\theta_\sigma)^2\right]$
Variational Inference (Bayes by Backprop)
$\arg\min_{q\in\mathcal{Q}} KL\big(q\,\|\,p(\cdot\mid y)\big) = \arg\max_{q\in\mathcal{Q}} \mathbb{E}_{\theta\sim q}[\ln P(y\mid\theta)] - KL\big(q\,\|\,p(\cdot)\big)$
here $p(\cdot)$ is the prior over $\theta$ and $p(\cdot\mid y)$ the posterior
Evidence Lower Bound (ELBO)
L(q)=Eθ∼q[lnP(y∣θ)]−KL(q∣∣p(⋅))
Prediction
P(y′∣x′,X,Y)=Eθ∼q[P(y′∣x′,θ)]
Var[y′∣x′,X,Y]=Eθ∼q[Var[y′∣x′,θ]]+Var[E[y′∣x′,θ]]
$\mathbb{E}_{\theta\sim q}[Var[y'\mid x',\theta]]$ : aleatoric uncertainty (noise in the data)
$Var_{\theta\sim q}[\mathbb{E}[y'\mid x',\theta]]$ : epistemic uncertainty (lack of knowledge); a small sketch of this decomposition follows
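A small Monte-Carlo sketch of this variance decomposition, assuming an ensemble of Gaussian predictive distributions (all numbers are made up):

```python
import numpy as np

# One predictive mean/variance per sampled theta_i ~ q (values are illustrative).
rng = np.random.default_rng(0)
mus = rng.normal(loc=1.0, scale=0.3, size=50)     # E[y'|x', theta_i] per sample
vars_ = np.full(50, 0.2**2)                       # Var[y'|x', theta_i] per sample

aleatoric = vars_.mean()          # E_theta[ Var[y'|x', theta] ]
epistemic = mus.var()             # Var_theta[ E[y'|x', theta] ]
total = aleatoric + epistemic
print(aleatoric, epistemic, total)
```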
Stochastic Weight Averaging-Gaussian(SWAG)
$\mu_{SWA} = \frac{1}{T}\sum_{i=1}^T\theta_i$
$\Sigma_{SWA} = \frac{1}{T-1}\sum_{i=1}^T(\theta_i-\mu_{SWA})(\theta_i-\mu_{SWA})^T$
7. Bayesian Optimization
Uncertainty Sampling
$x_t = \arg\max_{x\in D}\sigma_{t-1}^2(x)$
- maximizes information gain in the homoscedastic noise case
- as $t\to\infty$, for $\hat{x}_t = \arg\max_{x\in D}\mu_t(x)$ we get $f(\hat{x}_t)\to f(x^*)$
Mutual Information
$F(S) = H(f) - H(f\mid y_S) = \frac{1}{2}\log|I+\sigma^{-2}K_S|$
$F(S_T) \ge \left(1-\frac{1}{e}\right)\max_{S\subseteq D,\,|S|\le T}F(S)$ (greedy selection; near-optimal by submodularity)
regret
$R_T = \sum_{t=1}^T\left(\max_{x\in D}f(x) - f(x_t)\right)$
sublinear if $\frac{R_T}{T}\to 0$
Multi-armed bandit
the acquisition function trades off exploitation (favoring high $\mu_t$) and exploration (favoring high $\sigma_t$)
Upper Confidence Bound (UCB)
a=μt(x)+βσt(x)
Probability of Improvement(PI)
$a = \Phi\left(\frac{\mu_t(x)-f^*}{\sigma_t(x)}\right)$
Expected Improvement(EI)
$a = (\mu_t(x)-f^*)\,\Phi\left(\frac{\mu_t(x)-f^*}{\sigma_t(x)}\right) + \sigma_t(x)\,\phi\left(\frac{\mu_t(x)-f^*}{\sigma_t(x)}\right)$
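A sketch of the three acquisition rules above, using scipy only for $\Phi$ and $\phi$ (the candidate means and variances are made up):

```python
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, beta=2.0):
    return mu + beta * sigma

def probability_of_improvement(mu, sigma, f_best):
    return norm.cdf((mu - f_best) / sigma)

def expected_improvement(mu, sigma, f_best):
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy posterior over 5 candidate points.
mu = np.array([0.1, 0.4, 0.3, 0.8, 0.2])
sigma = np.array([0.5, 0.1, 0.3, 0.05, 0.6])
f_best = 0.6
for acq in (ucb(mu, sigma), probability_of_improvement(mu, sigma, f_best),
            expected_improvement(mu, sigma, f_best)):
    print(np.argmax(acq), acq)    # index each rule would query next
```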
8. Reinforcement Learning
Bellman Theorem
$V^*(x) = \max_a\left(r(x,a) + \gamma\sum_{x'}P(x'\mid x,a)\,V^*(x')\right)$
Hoeffding bound
$P\left(\left|\mu - \frac{1}{n}\sum_{i=1}^n Z_i\right|>\varepsilon\right)\le 2\exp\left(-\frac{2n\varepsilon^2}{C^2}\right)$
model-based RL
learn the Markov Decision Process, i.e. estimate $p$ and $r$ (e.g. with a Gaussian process)
model-free RL
learn value function directly
on policy
the agent learns about the policy it is executing (it controls the actions that generate the data)
off policy
the agent can learn from data generated by a different (behavior) policy
Value Iteration
polynomial time per iteration; the number of iterations depends on the input
- guaranteed to converge to an $\varepsilon$-optimal policy, not the exact optimal policy (a minimal sketch follows)
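A minimal value-iteration sketch on a made-up two-state MDP (the array layouts are my own convention, not the lecture's):

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, eps=1e-6):
    """P[a, x, x'] transition probs, r[x, a] rewards; returns V* and a greedy policy."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[x, a] = r[x, a] + gamma * sum_x' P[a, x, x'] V[x']
        Q = r + gamma * np.einsum("axy,y->xa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny 2-state, 2-action MDP with made-up dynamics.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
              [[0.1, 0.9], [0.8, 0.2]]])     # action 1
r = np.array([[0.0, 1.0], [2.0, 0.0]])       # r[x, a]
print(value_iteration(P, r))
```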
Policy Iteration
- guaranteed to monotonically improve the policy
ϵ greedy Algorithm
model-based
with probability $\epsilon$ take a random action, otherwise act greedily
$R_{\max}$ algorithm
model-based
initially set the reward to $R_{\max}$ and the transition probability $P(x^*\mid x,a)=1$ for a "fairy-tale" state $x^*$
- with probability $1-\delta$, $R_{\max}$ reaches an $\varepsilon$-optimal policy
- in time polynomial in $|X|, |A|, T, \frac{1}{\varepsilon}, \log\frac{1}{\delta}$
Temporal Difference(TD) - Learning
model-free
on policy
V^π(x)←(1−αt)V^π(x)+αt(r+γV^π(x′))
Theorem ∑tαt=∞,∑tαt2<∞⇒P(V^π→Vπ)=1
Q-Learning
model-free: estimate Q∗ directly from samples,
off policy
$\hat{Q}^*(x,a) \leftarrow (1-\alpha_t)\hat{Q}^*(x,a) + \alpha_t\left(r + \gamma\max_{a'}\hat{Q}^*(x',a')\right)$
Init: $\hat{Q}^*(x,a) = \frac{R_{\max}}{1-\gamma}\prod_{t=1}^{T_{init}}(1-\alpha_t)^{-1}$
Theorem ∑tαt=∞,∑tαt2<∞⇒P(Q^∗→Q∗)=1
- with probability $1-\delta$, optimistic Q-learning (with the $R_{\max}$ initialization above) reaches an $\varepsilon$-optimal policy
- in time polynomial in $|X|, |A|, T, \frac{1}{\varepsilon}, \log\frac{1}{\delta}$
- needs a decaying learning rate to guarantee convergence (a tabular sketch follows)
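A tabular Q-learning sketch on the same style of toy MDP, using a plain zero initialization and an $\epsilon$-greedy behavior policy rather than the optimistic init above:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])      # P[a, x, x'] (made up)
r = np.array([[0.0, 1.0], [2.0, 0.0]])        # r[x, a]
gamma, eps = 0.9, 0.2
Q = np.zeros((2, 2))                          # Q[x, a]

x = 0
for t in range(1, 200_000):
    alpha = 1.0 / t**0.6                      # decaying rate: sum alpha = inf, sum alpha^2 < inf
    a = rng.integers(2) if rng.uniform() < eps else int(np.argmax(Q[x]))
    x_next = rng.choice(2, p=P[a, x])
    target = r[x, a] + gamma * Q[x_next].max()
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    x = x_next

print(Q)   # should approach Q* of the toy MDP
```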
DQN
$L(\theta) = \sum\left(r + \gamma\max_{a'}Q(x',a';\theta^{old}) - Q(x,a;\theta)\right)^2$
problem : maximization bias
Double DQN
$L(\theta) = \sum\left(r_t + \gamma\,Q\big(x_{t+1},\ \arg\max_{a'}Q'(x_{t+1},a';\theta');\ \theta\big) - Q(x_t,a_t;\theta)\right)^2$
θ′←τθ+(1−τ)θ′
Policy Gradient
model free
$J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[r(\tau)]$
$\nabla J(\theta) = \mathbb{E}_{(x,a)\sim\pi_\theta}\left[Q(x,a)\nabla\log\pi(a\mid x;\theta)\right]$
REINFORCE
on policy
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=0}^T\gamma^t\left(\sum_{t'=t}^T\gamma^{t'-t}r_{t'}\right)\nabla_\theta\ln\pi_\theta(a_t\mid x_t)\right]$
$\theta \leftarrow \theta + \eta_t\nabla_\theta J(\theta)$
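A one-step REINFORCE sketch on a two-armed bandit with a softmax policy, where the return collapses to the immediate reward (rewards and step size are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0])      # expected reward per arm (toy values)
theta = np.zeros(2)                      # policy logits
eta = 0.05

for _ in range(5000):
    pi = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=pi)
    rwd = true_rewards[a] + rng.normal(scale=0.1)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                # d/dtheta log softmax(theta)[a]
    theta += eta * rwd * grad_log_pi     # stochastic policy-gradient step

print(np.exp(theta) / np.exp(theta).sum())   # probability mass shifts to the better arm
```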
Policy Search
Actor-Critic
on policy
$\nabla_{\theta_\pi}J(\theta_\pi) = \mathbb{E}_{(x,a)\sim\pi_{\theta_\pi}}\left[Q_{\theta_Q}(x,a)\nabla_{\theta_\pi}\ln\pi_{\theta_\pi}(a\mid x)\right]$
$\theta_\pi \leftarrow \theta_\pi + \eta_t\nabla_{\theta_\pi}J(\theta_\pi)$
$\theta_Q \leftarrow \theta_Q - \eta_t\left(Q_{\theta_Q}(x,a) - r - \gamma Q_{\theta_Q}(x',\pi_{\theta_\pi}(x'))\right)\nabla_{\theta_Q}Q_{\theta_Q}(x,a)$
use a baseline to reduce variance
Trust-region policy optimization (TRPO)
on policy
Proximal Policy Optimization (PPO)
on policy
$L_{\theta_k}(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_k}}\left[\sum_{t=0}^\infty\frac{\pi_\theta(a\mid x)}{\pi_{\theta_k}(a\mid x)}\left(r + \gamma Q^{\pi_{\theta_k}}(x',a') - Q^{\pi_{\theta_k}}(x,a)\right)\right]$
$\theta_{k+1} \leftarrow \theta_k + \eta_t\nabla_\theta L_{\theta_k}(\theta)\big|_{\theta=\theta_k}$
Deep Deterministic Policy Gradients(DDPG)
off policy
add random noise to ensure sufficient exploration
$\theta_Q \leftarrow \theta_Q - \eta\nabla_{\theta_Q}\mathbb{E}_{\tau\sim\pi_{\theta_\pi}}\left[\left(Q_{\theta_Q}(x,a) - \big(r+\gamma Q_{\theta_Q^{old}}(x',\pi_{\theta_\pi^{old}}(x'))\big)\right)^2\right]$
$\theta_\pi \leftarrow \theta_\pi + \eta\nabla_{\theta_\pi}\mathbb{E}_{\tau\sim\pi_{\theta_\pi}}\left[Q_{\theta_Q}(x,\pi_{\theta_\pi}(x))\right]$
$\theta_Q^{old} \leftarrow (1-\rho)\theta_Q^{old} + \rho\,\theta_Q$
$\theta_\pi^{old} \leftarrow (1-\rho)\theta_\pi^{old} + \rho\,\theta_\pi$
Soft Actor Critic(SAC)
off policy
Random Shooting methods
Monte-Carlo Tree Search