Commit 693bd3a5 authored by Marco Kuhlmann

Add the notebook on the sampling recipe

parent daa79130
%% Cell type:markdown id:8e7907e8 tags:
# Sampling words by frequency
%% Cell type:markdown id:67167697 tags:
This notebook illustrates the sampling recipe mentioned in lab L1.
%% Cell type:markdown id:4167876e tags:
## Goal
We want to sample words from a vocabulary with a probability that is proportional to their counts (absolute frequencies) in some given text. That is, if we have two words $w_1$ and $w_2$, where $w_2$ appears $k$ times as often as $w_1$, then the expected number of times we sample $w_2$ should be $k$ times the expected number of times we sample $w_1$.
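The goal can also be written as a formula. The notation is introduced here only for illustration: $c(w)$ stands for the count of word $w$ and $V$ for the vocabulary.

$$P(w) = \frac{c(w)}{\sum_{w' \in V} c(w')}$$

Under this definition, $c(w_2) = k \cdot c(w_1)$ implies $P(w_2) = k \cdot P(w_1)$, because both probabilities share the same denominator.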
%% Cell type:markdown id:81d0f6a3 tags:
## Sampling recipe
Imagine the words in the vocabulary laid out along a line that runs from 0 to the sum of all word counts, where each word covers an interval whose length equals its count. To sample a word, we choose a point on that line uniformly at random and return the word whose interval contains the chosen point. In doing so, we sample each word with a probability that is proportional to its count.
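The recipe can be sketched in a few lines of plain Python. This is a minimal sketch using only the standard library, not the implementation expected in the lab; the helper names `build_boundaries`, `word_index_at` and `sample_index`, as well as the three-word toy vocabulary, are made up for illustration.

```python
import bisect
import itertools
import random


def build_boundaries(counts):
    """Return the right endpoint of each word's interval on the counts line."""
    return list(itertools.accumulate(counts))


def word_index_at(boundaries, point):
    """Return the index of the word whose interval contains the given point."""
    # Word i covers the points p with boundaries[i-1] < p <= boundaries[i].
    return bisect.bisect_left(boundaries, point)


def sample_index(counts, rng):
    """Sample a word index with probability proportional to its count."""
    boundaries = build_boundaries(counts)
    # rng.random() is uniform on [0, 1); scale it to the length of the line.
    point = rng.random() * boundaries[-1]
    return word_index_at(boundaries, point)


# Hypothetical toy vocabulary with made-up counts.
vocabulary = ["the", "cat", "sat"]
toy_counts = [5, 3, 2]
rng = random.Random(0)
print([vocabulary[sample_index(toy_counts, rng)] for _ in range(10)])
```

Recomputing the boundaries on every call keeps the sketch short; when drawing many samples, one would compute them once and reuse them.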
%% Cell type:markdown id:1ded3597 tags:
## Example
We illustrate the sampling recipe with a concrete example.
%% Cell type:code id:bf86a87a tags:
``` python
import numpy as np
import torch
```
%% Cell type:markdown id:cc9f8460 tags:
Here is a list of counts for words in a ten-word vocabulary:
%% Cell type:code id:864d01f4 tags:
``` python
counts = np.array([14507, 5014, 4602, 4529, 4000, 3219, 3010, 2958, 2225, 1271])
```
%% Cell type:markdown id:32bf71c5 tags:
To implement the sampling recipe, we need the cumulative sums of these counts. We can get them with the function [`torch.cumsum()`](https://pytorch.org/docs/stable/generated/torch.cumsum.html).
%% Cell type:code id:f73b493e tags:
``` python
cumulative_sums = torch.cumsum(torch.from_numpy(counts), dim=0)
cumulative_sums
```
%% Cell type:markdown id:8cf8c78e tags:
To choose a random point on the counts line, we sample a random number uniformly from the interval $[0, 1)$ and multiply it by the sum of all counts, which is the last entry in the tensor of cumulative sums. Here we choose $5$ such points.
%% Cell type:code id:8f0b6933 tags:
``` python
random_points = torch.rand(5) * cumulative_sums[-1]
random_points
```
%% Cell type:markdown id:d0838e73 tags:
To return the word whose interval on the counts line includes a chosen point, we use the function [`torch.searchsorted()`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). This function takes a sorted sequence and a tensor of values and, for each value, finds the index in the sorted sequence at which the value could be inserted while keeping the sequence sorted. Applied to the cumulative sums, this index is the position of the first cumulative sum that is greater than or equal to the chosen point, which is exactly the index of the word whose interval contains that point.
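To make the lookup concrete, here is a small sketch that uses `bisect.bisect_left` from the Python standard library as a stand-in for `torch.searchsorted()`. It assumes that the default behaviour of `torch.searchsorted()` (`right=False`) matches `bisect_left`. The boundaries are the cumulative sums of the counts above, and the query points are hand-picked for illustration rather than sampled.

```python
import bisect
import itertools

counts = [14507, 5014, 4602, 4529, 4000, 3219, 3010, 2958, 2225, 1271]

# Right endpoints of the word intervals on the counts line.
boundaries = list(itertools.accumulate(counts))

# Hand-picked points, including two that fall exactly on a boundary.
points = [100.0, 14507.0, 14507.5, 30000.0, 45335.0]

# For each point, find the first boundary that is >= the point.
indices = [bisect.bisect_left(boundaries, p) for p in points]
print(indices)  # [0, 0, 1, 4, 9]
```

A point that falls exactly on a boundary is assigned to the word on its left, so the last boundary still maps to the last word.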
%% Cell type:code id:d6035c26 tags:
``` python
torch.searchsorted(cumulative_sums, random_points)
```
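As a sanity check, one can draw many samples and compare the observed relative frequencies with the proportions given by the counts. The sketch below does this with the standard library only; the seed and the number of samples are arbitrary choices, and the variable names are made up for illustration.

```python
import bisect
import itertools
import random

counts = [14507, 5014, 4602, 4529, 4000, 3219, 3010, 2958, 2225, 1271]
boundaries = list(itertools.accumulate(counts))
total = boundaries[-1]

rng = random.Random(0)  # fixed seed so that the run is repeatable
num_samples = 100_000   # arbitrary; larger values give tighter estimates

# Count how often each word index is sampled.
hits = [0] * len(counts)
for _ in range(num_samples):
    point = rng.random() * total
    hits[bisect.bisect_left(boundaries, point)] += 1

# Compare the target proportion with the observed relative frequency.
for count, hit in zip(counts, hits):
    print(f"expected {count / total:.3f}   observed {hit / num_samples:.3f}")
```

With this many samples, the two columns should agree to roughly two decimal places.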
%% Cell type:markdown id:2e7f0b8c tags:
Good luck with the lab!