Talking face generation synthesizes a lip-synchronized talking face video from an arbitrary face image and an audio clip. People naturally make spontaneous head motions to enhance their speech while talking. Generating head motion from speech is inherently difficult due to the nondeterministic mapping from speech to head motions. Most existing works map speech to motion deterministically, conditioned on certain styles, which leads to sub-optimal results. In this paper, we decompose speech-driven motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a shallow diffusion motion model (SDM) equipped with a two-stream architecture, i.e., a pose mode branch for primary posture generation and a rhythmic motion branch for rhythmic dynamics synthesis. On one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synchronized with the speech prosody. Extensive experiments demonstrate superior performance over several baselines in terms of fidelity, similarity, and synchronization with speech.
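To make the two-stream decomposition concrete, the minimal sketch below pairs a pose-mode branch (conditional sampling in a latent space from utterance-level semantic speech features) with a rhythmic branch (frame-wise dynamics from prosodic features), fusing them additively. This is an illustrative sketch under assumed interfaces, not the SDM implementation: the module names, feature dimensions, the simple Gaussian latent sampling (in place of the diffusion-based sampling), and the additive fusion are all placeholders.

```python
# Illustrative two-stream sketch (NOT the authors' SDM implementation).
# All module names, feature dimensions, and the Gaussian latent sampling
# are assumptions made for illustration only.
import torch
import torch.nn as nn

class PoseModeBranch(nn.Module):
    """Samples a primary head-pose 'mode' from a latent space conditioned
    on utterance-level semantic speech features (assumed 128-d here)."""
    def __init__(self, sem_dim=128, latent_dim=16, pose_dim=6):
        super().__init__()
        self.to_latent = nn.Linear(sem_dim, 2 * latent_dim)  # mean and log-variance
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def forward(self, sem_feat):
        mu, logvar = self.to_latent(sem_feat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # conditional sampling
        return self.to_pose(z)                                 # one pose mode per clip

class RhythmicBranch(nn.Module):
    """Predicts frame-wise dynamics from prosodic features (assumed 32-d
    per frame, e.g. energy/pitch), so the motion follows the speech rhythm."""
    def __init__(self, prosody_dim=32, hidden=64, pose_dim=6):
        super().__init__()
        self.rnn = nn.GRU(prosody_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, prosody_seq):
        h, _ = self.rnn(prosody_seq)
        return self.out(h)                                     # per-frame offsets

class TwoStreamHeadMotion(nn.Module):
    """Combines a sampled pose mode with rhythmic dynamics: final motion =
    static mode (broadcast over time) + per-frame rhythmic offsets."""
    def __init__(self):
        super().__init__()
        self.pose_mode = PoseModeBranch()
        self.rhythm = RhythmicBranch()

    def forward(self, sem_feat, prosody_seq):
        mode = self.pose_mode(sem_feat).unsqueeze(1)           # (B, 1, pose_dim)
        dynamics = self.rhythm(prosody_seq)                    # (B, T, pose_dim)
        return mode + dynamics

# Usage with random placeholder features: a batch of 2 clips, 100 frames each.
model = TwoStreamHeadMotion()
sem = torch.randn(2, 128)
prosody = torch.randn(2, 100, 32)
head_poses = model(sem, prosody)                               # (2, 100, 6)
```

Separating a per-clip pose mode from per-frame dynamics lets the sampled mode carry the nondeterministic, style-like variation while the rhythmic branch stays tightly coupled to prosody; how the two signals are actually fused in SDM is described in the paper, and the sum used here is only a placeholder.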