Tools to help construct datasets, covering tasks like loading, processing, and encoding data.
%load_ext autoreload
%autoreload 2
%matplotlib inline
# Only needed for testing.
from collections import Counter
from itertools import chain
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

from htools import eprint, assert_raises

probabilistic_hash_item[source]

probabilistic_hash_item(x, n_buckets, mode='int', n_hashes=3)

Slightly hacky way to probabilistically hash an integer by
first converting it to a string.

Parameters
----------
x: int or str
    The integer or string to hash.
n_buckets: int
    The number of buckets items will be mapped to. Taking the hash modulo
    the number of buckets would typically happen outside the hashing
    function, but since the intended use case here is so narrow, it is
    handled here for convenience.
mode: type
    The type of input you want to hash. This is user-provided to prevent
    accidents where we pass in a different item than intended and hash
    the wrong thing. One of (int, str). When using this inside a
    BloomEmbedding layer, this must be `int` because there are no
    string tensors. When used inside a dataset or as a one-time
    pre-processing step, you can choose either as long as you
    pass in the appropriate inputs.
n_hashes: int
    The number of times to hash x, each time with a different seed.

Returns
-------
list[int]: A list of integers with length `n_hashes`, where each integer
    is in [0, n_buckets).
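
For intuition, here is a minimal sketch of one way such a function could be implemented (an assumption about the approach, not necessarily the library's actual source; `hash_item_sketch` is a hypothetical name): convert x to a string, prepend a different seed for each hash, and take each digest modulo n_buckets.

import hashlib

def hash_item_sketch(x, n_buckets, mode=int, n_hashes=3):
    # Guard against accidentally hashing the wrong kind of input.
    assert isinstance(x, mode), 'Input does not match the specified mode.'
    # One hash per seed, each mapped into [0, n_buckets).
    return [int(hashlib.md5(f'{seed}_{x}'.encode()).hexdigest(), 16) % n_buckets
            for seed in range(n_hashes)]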

probabilistic_hash_tensor[source]

probabilistic_hash_tensor(x_r2, n_buckets, n_hashes=3, pad_idx=0)

Hash a rank 2 LongTensor.

Parameters
----------
x_r2: torch.LongTensor
    Rank 2 tensor of integers. Shape: (bs, seq_len)
n_buckets: int
    Number of buckets to hash items into (i.e. the number of
    rows in the embedding matrix). Typically a moderately large
    prime number, like 251 or 997.
n_hashes: int
    Number of hashes to take for each input index. This determines
    the number of rows of the embedding matrix that will be summed
    to get the representation for each word. Typically 2-5.
pad_idx: int or None
    If you want to pad sequences with vectors of zeros, pass in an
    integer (same as the `padding_idx` argument to nn.Embedding).
    If None, no padding index will be used. The sequences must be
    padded before passing them into this function.

Returns
-------
torch.LongTensor: Tensor of indices where each row corresponds
    to one of the input indices. Shape: (bs, seq_len, n_hashes)
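
Conceptually, this just applies probabilistic_hash_item to every index in the batch, mapping padding positions to rows of pad_idx. Here is a rough sketch under that assumption (`hash_tensor_sketch` is a hypothetical name, not the library's implementation):

def hash_tensor_sketch(x_r2, n_buckets, n_hashes=3, pad_idx=0):
    # Hash every index in the (bs, seq_len) tensor; padding indices map to
    # rows filled with pad_idx so they hit the embedding's padding row.
    hashed = [[[pad_idx] * n_hashes if (pad_idx is not None and idx == pad_idx)
               else probabilistic_hash_item(idx, n_buckets, int, n_hashes)
               for idx in row.tolist()]
              for row in x_r2]
    return torch.tensor(hashed)  # Shape: (bs, seq_len, n_hashes)
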
sents = [
    'I walked to the store so I hope it is not closed.',
    'The theater is closed today and the sky is grey.',
    'His dog is brown while hers is grey.'
]
labels = [0, 1, 1]
class Data(Dataset):
    """Toy dataset that maps tokenized sentences to fixed-length sequences
    of integer indices.
    """

    def __init__(self, sentences, labels, seq_len):
        x = [s.split(' ') for s in sentences]
        self.w2i = self.make_w2i(x)
        self.seq_len = seq_len
        self.x = self.encode(x)
        self.y = torch.tensor(labels)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

    def __len__(self):
        return len(self.y)

    def make_w2i(self, tok_rows):
        # Map each word to an integer index, starting at 1 so that 0 is free
        # to serve as the padding/unknown index.
        return {k: i for i, (k, v) in
                enumerate(Counter(chain(*tok_rows)).most_common(), 1)}

    def encode(self, tok_rows):
        # Truncate each row to seq_len tokens and pad shorter rows with zeros.
        enc = np.zeros((len(tok_rows), self.seq_len), dtype=int)
        for i, row in enumerate(tok_rows):
            trunc = [self.w2i.get(w, 0) for w in row[:self.seq_len]]
            enc[i, :len(trunc)] = trunc
        return torch.tensor(enc)

We construct a toy dataset with a vocabulary of size 23. In reality, you might wish to lowercase text or use a better tokenizer, but this is sufficient for the purposes of demonstration.

ds = Data(sents, labels, 10)
len(ds.w2i)
23
dl = DataLoader(ds, batch_size=3)
x, y = next(iter(dl))
x, y
(tensor([[ 2,  5,  6,  3,  7,  8,  2,  9, 10,  1],
         [13, 14,  1, 15, 16, 17,  3, 18,  1,  4],
         [19, 20,  1, 21, 22, 23,  1,  4,  0,  0]]), tensor([0, 1, 1]))
x.shape
torch.Size([3, 10])

We hash each word index 4 times, as specified by the n_hashes parameter in probabilistic_hash_tensor. Notice that we only use 7 buckets, meaning the embedding matrix will have 7 rows rather than 23 (not counting a padding row).

x_hashed = probabilistic_hash_tensor(x, n_buckets=7, n_hashes=4)
print('x shape:', x.shape)
print('x_hashed shape:', x_hashed.shape)
x shape: torch.Size([3, 10])
x_hashed shape: torch.Size([3, 10, 4])

Below, each row of 4 numbers encodes a single word.

x_hashed
tensor([[[2, 0, 2, 2],
         [1, 6, 2, 4],
         [2, 0, 1, 4],
         [5, 2, 6, 5],
         [1, 5, 1, 3],
         [0, 4, 0, 0],
         [2, 0, 2, 2],
         [2, 4, 0, 2],
         [3, 4, 4, 6],
         [5, 0, 3, 6]],

        [[5, 4, 4, 2],
         [5, 3, 3, 1],
         [5, 0, 3, 6],
         [2, 1, 1, 1],
         [2, 4, 4, 4],
         [2, 4, 2, 6],
         [5, 2, 6, 5],
         [3, 5, 0, 0],
         [5, 0, 3, 6],
         [1, 5, 2, 6]],

        [[5, 5, 3, 4],
         [4, 5, 5, 1],
         [5, 0, 3, 6],
         [6, 2, 0, 6],
         [4, 2, 6, 1],
         [3, 6, 1, 6],
         [5, 0, 3, 6],
         [1, 5, 2, 6],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]])

See how each word is mapped to a list of 4 indices.

for word, i in zip(sents[0].split(' '), x[0]):
    print(word, probabilistic_hash_item(i.item(), 7, int, 4))
I [2, 0, 2, 2]
walked [1, 6, 2, 4]
to [2, 0, 1, 4]
the [5, 2, 6, 5]
store [1, 5, 1, 3]
so [0, 4, 0, 0]
I [2, 0, 2, 2]
hope [2, 4, 0, 2]
it [3, 4, 4, 6]
is [5, 0, 3, 6]

Notice that hashing the words directly is also possible, but the resulting hashes will differ from those obtained by hashing the words' integer encodings. This is fine as long as you are consistent.

for row in [s.split(' ') for s in sents]:
    eprint(list(zip(row, (probabilistic_hash_item(word, 11, str) for word in row))))
    print()
 0: ('I', [0, 5, 5])
 1: ('walked', [2, 5, 1])
 2: ('to', [10, 4, 6])
 3: ('the', [4, 1, 4])
 4: ('store', [4, 6, 3])
 5: ('so', [7, 8, 8])
 6: ('I', [0, 5, 5])
 7: ('hope', [1, 2, 7])
 8: ('it', [3, 9, 0])
 9: ('is', [3, 1, 3])
10: ('not', [6, 10, 4])
11: ('closed.', [3, 6, 10])

 0: ('The', [1, 6, 9])
 1: ('theater', [8, 10, 2])
 2: ('is', [3, 1, 3])
 3: ('closed', [5, 5, 0])
 4: ('today', [3, 10, 8])
 5: ('and', [7, 2, 4])
 6: ('the', [4, 1, 4])
 7: ('sky', [1, 2, 9])
 8: ('is', [3, 1, 3])
 9: ('grey.', [7, 6, 7])

 0: ('His', [0, 10, 3])
 1: ('dog', [8, 6, 6])
 2: ('is', [3, 1, 3])
 3: ('brown', [9, 8, 9])
 4: ('while', [9, 2, 8])
 5: ('hers', [0, 5, 4])
 6: ('is', [3, 1, 3])
 7: ('grey.', [7, 6, 7])

Below, we show that we can obtain unique representations for >99.9% of words in a vocabulary of 30,000 words with a far smaller embedding matrix. The number of buckets is the number of rows in the embedding matrix.

def unique_combos(tups):
    return len(set(tuple(sorted(x)) for x in tups))

def hash_all_idx(vocab_size, n_buckets, n_hashes):
    return [probabilistic_hash_item(i, n_buckets, int, n_hashes)
            for i in range(vocab_size)]

vocab_size = 30_000
buckets2hashes = {127: 5,
                  251: 4,
                  997: 3,
                  5_003: 2}
for b, h in buckets2hashes.items():
    tups = hash_all_idx(vocab_size, b, h)
    unique = unique_combos(tups)
    print('\n\nBuckets:', b, '\nHashes:', h, '\nUnique combos:', unique,
          '\n% unique:', round(unique / 30_000, 4))

Buckets: 127 
Hashes: 5 
Unique combos: 29998 
% unique: 0.9999


Buckets: 251 
Hashes: 4 
Unique combos: 29996 
% unique: 0.9999


Buckets: 997 
Hashes: 3 
Unique combos: 29997 
% unique: 0.9999


Buckets: 5003 
Hashes: 2 
Unique combos: 29969 
% unique: 0.999

Datasets

@auto_repr
class LazyDataset(Dataset):
    """Lazily load batches from an enormous dataframe that can't fit into 
    memory.
    """

    def __init__(self, df_path, length, shuffle, chunksize=1_000, 
                 c=2, classes=('neg', 'pos'), **kwargs):
        """
        Parameters
        ----------
        df_path: str
            File path of dataframe to load.
        length: int
            Number of rows of data to use. This is required so that we don't 
            have to go through the whole file and count the number of lines,
            which can be enormous with a big dataset. It also makes it easy to
            work with a subset (the data should already be shuffled, so 
            choosing the top n rows is fine).
        shuffle: bool
            If True, shuffle the data in each chunk. Note that if the batch
            size is close to the chunk size, this will have minimal effect.
            The training set should therefore load as large a chunk as
            possible if we want to shuffle the data. Shuffling is unnecessary
            for the validation set.
        chunksize: int
            Number of rows of df to load at a time. This should usually 
            be significantly larger than the batch size in order to retain
            some randomness in the batches.
        c: int
            Number of classes. Used if training with FastAI.
        classes: iterable
            List or tuple of class names. Used if training with FastAI.
        kwargs: any
            Additional keyword arguments to pass to `read_csv`, e.g.
            compression='gzip'.
        """
        if length < chunksize:
            warnings.warn('Total # of rows < 1 full chunk. LazyDataset may '
                          'not be necessary.')

        self.length = length
        self.shuffle = shuffle
        self.chunksize = chunksize
        self.df_path = df_path
        self.df = None
        self.chunk = None
        self.chunk_idx = None
        self.df_kwargs = kwargs
        
        # Additional attributes required by FastAI. 
        # c: Number of classes in model.
        self.c = c
        self.classes = list(classes)
        
    def __len__(self):
        return self.length
    
    def __getitem__(self, idx):
        """Because not all indices are loaded at once, we must do shuffling
        in the dataset rather than the dataloader (e.g. if the loader randomly
        samples index 5000 but we have indices 0-500 loaded, it will be
        unavailable).

        Parameters
        ----------
        idx: int
            Retrieve item i in dataset.

        Returns
        -------
        tuple[np.array]: x array, y array
        """
        # Load next chunk of data if necessary. Must specify nrows, otherwise
        # we will chunk through the whole file.
        if not self.chunk_idx:
            while True:
                try:
                    self.chunk = self.df.get_chunk()
                    break
                except (AttributeError, StopIteration):
                    self.df = pd.read_csv(self.df_path, engine='python',
                                          chunksize=self.chunksize,
                                          nrows=len(self),
                                          **self.df_kwargs)

            self.chunk_idx = self.chunk.index.values
            if self.shuffle: np.random.shuffle(self.chunk_idx)
            self.chunk_idx = deque(self.chunk_idx)
            
        *x, y = self.chunk.loc[self.chunk_idx.popleft()].values
        return np.array(x), y.astype(float)
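
A minimal usage sketch: the file path 'train.csv' and its column layout are hypothetical, and the last column is assumed to hold the label (matching __getitem__ above). Shuffling happens inside the dataset, so the DataLoader itself should not shuffle.

ds = LazyDataset('train.csv', length=1_000_000, shuffle=True, chunksize=10_000)
dl = DataLoader(ds, batch_size=64, shuffle=False)
x, y = next(iter(dl))
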
class ImageMixer:
    """The transformation that powers MixupDS.

    Inspired by the "Visual-Spatial - Entangled Figures" task here:
    http://www.happy-neuron.com/brain-games/visual-spatial/entangled-figures

    The key idea: when a human plays the Happy Neuron task, it is much easier
    when viewing an entanglement of objects they recognize (e.g. a bicycle, a
    leaf) than one of random scribbles. I noticed that for the harder levels,
    my strategy was to quickly identify a couple of distinctive features, then
    search for them in the individual images once they appeared. This quick
    feature extraction seems very close to what we want to achieve during the
    pre-training step.
    """

    def __init__(self, n=3, a=5, b=8, dist=None, **kwargs):
        """
        Parameters
        ----------
        n: int
            Number of images to use as inputs. With the current implementation,
            the constructed image will use exactly 2 of these. The rest will
            be negatives (zero weight).
        a: int
            Parameter in beta distribution.
        b: int
            Parameter in beta distribution.
        dist: torch.distribution
            This can be anything with a "sample()" method that generates a
            random value between 0 and 1. By default, we use a Beta
            distribution. If one is passed in, a and b are ignored.
        kwargs: any
            Makes it easier to use in `get_databunch` function. Extra kwargs
            are ignored.
        """
        assert n >= 2, 'n must be >=2 so we can combine images.'

        self.dist = dist or torch.distributions.beta.Beta(a, b)
        self.n = n

    def transform(self, *images):
        """Create linear combination of images.

        Parameters
        ----------
        images: torch.tensors

        Returns
        -------
        tuple: The first item is a tuple of self.n + 1 tensors, each of shape
        (n_channel, h, w): the new combined image followed by the original
        input images. The second item is the rank 1 tensor of weights used to
        generate the combination. These will serve as labels in our
        self-supervised task.
        """
        w = self._generate_weights()
        return (self._combine_images(images, w), *images), w

    def _generate_weights(self):
        """
        Returns
        -------
        weights: torch.Tensor
            Vector of length self.n. Exactly 2 of these values are
            nonzero and they sum to 1. This will be used to compute a linear
            combination of a row of images.
        """
        weights = np.zeros(self.n)
        p = self.dist.sample()
        indices = np.random.choice(self.n, size=2, replace=False)
        weights[indices] = p, 1 - p
        return torch.tensor(weights, dtype=torch.float)

    def _combine_images(self, images, weights):
        """Create linear combination of multiple images.

        Parameters
        ----------
        images: torch.Tensor
        weights: torch.Tensor
            Vector with 1 value for each image. Exactly 2 of these values are
            nonzero and they sum to 1. I.e. if we have 3 images a, b, and c,
            weights would look something like [0, .3, .7].

        Returns
        -------
        torch.tensor: Shape (channels, height, width), same as each of the
            input images.
        """
        images = torch.stack(images, dim=0)
        # 3 new dimensions correspond to (c, h, w), NOT self.n.
        return (weights[:, None, None, None] * images).sum(0).float()
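
A quick usage sketch with random tensors standing in for real images:

mixer = ImageMixer(n=3)
imgs = [torch.rand(3, 16, 16) for _ in range(3)]
(new_img, *originals), weights = mixer.transform(*imgs)
# new_img is a weighted combination of exactly 2 of the 3 inputs.
print(new_img.shape)   # torch.Size([3, 16, 16])
print(weights)         # e.g. tensor([0.0000, 0.3724, 0.6276])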

class RandomTransform[source]

RandomTransform(func, p=0.5)

Wrap a function to create a data transform that occurs with some
probability p.
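
A quick usage sketch, under two assumptions not stated above: a RandomTransform instance is callable on its own (like the pipeline below), and it returns its input unchanged when the transform is skipped.

rt = RandomTransform(str.upper, p=0.5)
print([rt('dog') for _ in range(5)])  # a mix of 'dog' and 'DOG'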

class RandomPipeline[source]

RandomPipeline(*transforms, p=0.5) :: BasicPipeline

Create a pipeline of callables that are applied in sequence, each with
some random probability p (this can be the same or different for each
step). This is useful for on-the-fly data augmentation (e.g. in the
__getitem__ method of a torch Dataset).

Below, we define a few toy functions to demonstrate how we can alter an input string. In practice, we'd use more useful transformations like the ones defined in incendio.nlp, e.g. ParaphraseTransform.

def to_upper(t):
    return t.upper()

def times_3(t):
    return t * 3

def join(t, sep='---'):
    return sep.join(t)
text = 'dog'
pipeline = RandomPipeline(to_upper, times_3, join)
pipeline
RandomPipeline(
	RandomTransform(to_upper, p=0.5),
	RandomTransform(times_3, p=0.5),
	RandomTransform(join, p=0.5)
)
for i in range(5):
    print(pipeline(text))
d---o---g
dog
D---O---G---D---O---G---D---O---G
D---O---G---D---O---G---D---O---G
DOGDOGDOG
pipeline = RandomPipeline(to_upper, times_3, join, p=[1., .4, 1])

for i in range(5):
    print(pipeline(text))
D---O---G---D---O---G---D---O---G
D---O---G
D---O---G---D---O---G---D---O---G
D---O---G
D---O---G
with assert_raises(ValueError):
    pipeline = RandomPipeline(join, p=0)
As expected, got ValueError(p must be in range (0, 1]. I.E. you can choose to always apply a transform, but if you never want to apply it there's no need to include it in the pipeline.).
with assert_raises(AssertionError):
    pipeline = RandomPipeline(to_upper, times_3, join, p=[.2, 1])
As expected, got AssertionError(p must be a float or a list with one float for each transform.).
transforms = {times_3: .33,
              join: .67,
              to_upper: .95}
pipeline = RandomPipeline.from_dict(transforms)

for i in range(5):
    print(pipeline(text))
D---O---G
D---O---G
D---O---G---D---O---G---D---O---G
DOG
D---O---G

File Handling

class BotoUploader[source]

BotoUploader(bucket, verbose=True)

Uploads files to S3. Built as a public alternative to Accio. Note that the
two interfaces are not identical, so be careful to know which one you're
using.
up = BotoUploader('gg-datascience')
ft = up._convert_local_path('data/v1/history.csv')
tt = up._convert_local_path('data/v1/history.csv', 'hmamin')
tf = up._convert_local_path('data/v1/history.csv', 'hmamin', retain_tree=False)
ff = up._convert_local_path('data/v1/history.csv', retain_tree=False)

print('No S3 prefix, Yes retain file tree:\n' + ft)
print('\nYes S3 prefix, Yes retain file tree:\n' + tt)
print('\nYes S3 prefix, No retain file tree:\n' + tf)
print('\nNo S3 prefix, No retain file tree:\n' + ff)
No S3 prefix, Yes retain file tree:
data/v1/history.csv

Yes S3 prefix, Yes retain file tree:
hmamin/data/v1/history.csv

Yes S3 prefix, No retain file tree:
hmamin/history.csv

No S3 prefix, No retain file tree:
history.csv

Plotting

plot_images[source]

plot_images(images, titles=None, nrows=None, figsize=None, tight_layout=True, title_colors=None)

Plot a grid of images.

Parameters
----------
images: Iterable
    List of tensors/arrays to plot.
titles: Iterable[str] or None
    Title for each subplot. Must be same length and order as `images`.
nrows: int or None
    If provided, this manually sets the number of rows in the grid.
figsize: tuple[int] or None
    Determines size of plot. By default, we double the number of rows and
    columns, respectively, to get dimensions.
tight_layout: bool
    Often helps matplotlib formatting when we have many images.
title_colors: Iterable[str] or None
    If provided, this should have the same length as `titles` and
    `images`. It will be used to determine the color of each title (see
    PredictionExaminer in incendio.core for an example).

Below, we demonstrate plotting a list of images stored as numpy arrays in a grid.

images = [np.clip(5 * np.random.uniform(size=(3, 16, 16)) / i, 0, 1)
          for i in range(1, 17)]
plot_images(images, titles=[f'Image {i}' for i in range(16)])

Here is a similar example showcasing the title color functionality. Notice we can use tensors here too.

images = [torch.clamp(5 * torch.rand(3, 16, 16) / i, 0, 1)
          for i in range(1, 17)]
plot_images(images, titles=[f'Image {i}' for i in range(16)], 
            title_colors=['red' if i%2 == 0 else 'green' for i in range(16)])