To demonstrate that our proposed model can significantly improve the naturalness and expressiveness of synthesized speech, we provide audio samples for comparison. GT (Reconstructed) denotes audio reconstructed from the ground-truth mel-spectrogram. FastSpeech 2 denotes an open-source implementation of FastSpeech 2. HCE denotes the hierarchical context encoder model, which predicts global-level style from the context. In addition, a well-trained HiFi-GAN is used as the vocoder to generate the waveforms.