Character-level text generation. Follow the steps below and show your work in a Jupyter notebook. The requirements of the notebook are emphasized in bold.
- Find and download a suitable dataset for training a character-level text generation model.
- The text material can be English, Chinese, or code.
- The training data should contain at least 5000 sentences or a comparable amount of data.
- Gather the statistics of $p(x_t|x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ for the training dataset, where the $x_i$ are characters in the training dataset. Do the following for at least 2 different values of $n$:
- Store the information in a counter.
- Show how many distinct tuples $(x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ there are in the training data.
- Which tuple $(x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ appears most often in the dataset?
- Show the top 3 candidates for $x_t$ that are most likely to appear right after the above tuple.
- Generate 3 paragraphs of text according to the statistics. You may start from any tuple that appears in the dataset.
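
The counting steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the function names (`build_ngram_counts`, `most_common_context`, `generate`) and the toy corpus are assumptions made for the example, and each context is stored as a plain string in reading order, $x_{t-n} \ldots x_{t-1}$, rather than as the reversed tuple written above.

```python
import random
from collections import Counter, defaultdict


def build_ngram_counts(text, n):
    """Map each n-character context to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(n, len(text)):
        context = text[i - n:i]          # the n characters before position i
        counts[context][text[i]] += 1    # one more observation of x_t after this context
    return counts


def most_common_context(counts):
    """Return (context, total occurrences) for the most frequent context."""
    totals = ((ctx, sum(c.values())) for ctx, c in counts.items())
    return max(totals, key=lambda pair: pair[1])


def generate(counts, seed, length, rng=random):
    """Sample up to `length` characters, starting from a context seen in the data."""
    n = len(seed)                        # assumes n >= 1
    out = seed
    for _ in range(length):
        followers = counts.get(out[-n:])
        if not followers:                # context never seen with a successor: stop
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out


# Toy placeholder corpus; replace with the downloaded training text.
text = "the cat sat on the mat. the cat ate the rat. " * 50
for n in (2, 4):
    counts = build_ngram_counts(text, n)
    ctx, total = most_common_context(counts)
    print(f"n={n}: {len(counts)} distinct contexts")
    print(f"most frequent context {ctx!r} appears {total} times")
    print("top 3 next characters:", counts[ctx].most_common(3))
    print(generate(counts, ctx, 200))
```

Sampling in proportion to the counts is equivalent to sampling from the empirical $p(x_t|x_{t-1}, \ldots, x_{t-n})$, so no explicit normalization is needed. Larger $n$ gives more coherent text but many more distinct contexts, most of them seen only a few times.
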
- Warm-up for neural networks and deep learning
- Choose any deep learning framework. PyTorch Lightning and Keras are recommended.
- Instead of using a counter, train a neural network to learn $p(x_t|x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ or $p(x_t|x_{t-1}, x_{t-2}, \ldots)$.
- The network can be an RNN, a 1D-CNN, or an MLP. You can limit $x_i$ to a subset of the characters, e.g., only the 100 to 1000 most frequent characters.
- Generate 3 paragraphs of text using this neural network model.
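
One possible shape for the neural-network part is sketched below, assuming plain PyTorch and the MLP option with a fixed context length $n$. The class and function names (`CharMLP`, `make_dataset`, `train`, `generate`), the layer sizes, and the toy corpus are all illustrative assumptions. Full-batch training is used only because the toy corpus is tiny; a real dataset would need mini-batches.

```python
import torch
import torch.nn as nn


class CharMLP(nn.Module):
    """Predict logits for x_t from the n preceding characters."""

    def __init__(self, vocab_size, n, embed_dim=16, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.net = nn.Sequential(
            nn.Flatten(),                          # (batch, n, E) -> (batch, n * E)
            nn.Linear(n * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, contexts):                   # contexts: (batch, n) integer ids
        return self.net(self.embed(contexts))      # (batch, vocab_size) logits


def make_dataset(text, n):
    """Encode text as ids and slice it into (context, next character) pairs."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    ids = torch.tensor([stoi[ch] for ch in text])
    windows = ids.unfold(0, n + 1, 1)              # (len(text) - n, n + 1)
    return chars, stoi, windows[:, :n], windows[:, n]


def train(model, X, y, epochs=200, lr=1e-2):
    """Full-batch training with cross-entropy; returns the last loss value."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()


@torch.no_grad()
def generate(model, chars, stoi, seed, length):
    """Sample `length` characters, feeding the last n characters back in each step."""
    n = len(seed)
    out = list(seed)
    for _ in range(length):
        ctx = torch.tensor([[stoi[ch] for ch in out[-n:]]])
        probs = torch.softmax(model(ctx), dim=-1)  # (1, vocab_size)
        out.append(chars[torch.multinomial(probs, 1).item()])
    return "".join(out)


torch.manual_seed(0)
# Toy placeholder corpus; replace with the downloaded training text.
text = "the cat sat on the mat. the cat ate the rat. " * 20
n = 4
chars, stoi, X, y = make_dataset(text, n)
model = CharMLP(len(chars), n)
final_loss = train(model, X, y)
print(f"final loss: {final_loss:.3f}")
print(generate(model, chars, stoi, text[:n], 200))
```

The softmax over the output logits plays the role of the normalized counter: for a given context it yields a distribution over $x_t$, which is sampled in the same way. To restrict the vocabulary to the most frequent characters, `chars` could be built from a frequency count and the remaining characters mapped to a single placeholder id.
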