%load_ext autoreload
%autoreload 2
%matplotlib inline

# Only needed for testing.
from string import ascii_lowercase

import pandas as pd

# Nonsense sample text.
text = [
    f"Row {i}: I went, yesterday; she wasn't here after school? Today. --2"
    for i in range(25_000)
]

df = pd.DataFrame(text, columns=['a'])
df.tail()

NLP = tokenizer()

Word tokenize a single string.

Parameters
----------
x: str
    A piece of text to tokenize.
nlp: spacy tokenizer, e.g. spacy.lang.en.English
    By default, a spacy tokenizer with a small English vocabulary
    is used. NER, parsing, and tagging are disabled. Any spacy
    tokenizer can be passed in, but keep in mind other configurations
    may slow down this function dramatically.

Returns
-------
list[str]: List of word tokens from a single input string.

Word tokenize a sequence of strings using multiprocessing. The maximum
number of available processes is used.

Parameters
----------
rows: Iterable[str]
    A sequence of strings to tokenize. This could be a list, a column of
    a DataFrame, etc.
nlp: spacy tokenizer, e.g. spacy.lang.en.English
    By default, a spacy tokenizer with a small English vocabulary
    is used. NER, parsing, and tagging are disabled. Any spacy
    tokenizer can be passed in, but keep in mind other configurations
    may slow down this function dramatically.
chunk: int
    This determines how many items to send to multiprocessing at a time.
    The default of 1,000 is usually fine, but if you have extremely
    long pieces of text and memory is limited, you can always decrease it.
    Very small chunk sizes may increase processing time. Note that larger
    values will generally cause the progress bar to update more choppily.

Returns
-------
list[list[str]]: Each nested list of word tokens corresponds to one
of the input strings.
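
The chunking strategy described above can be sketched roughly as follows. This is a minimal illustration rather than the library's implementation: `whitespace_tokenize`, `iter_chunks`, `tokenize_chunk`, and `tokenize_many_sketch` are hypothetical names, and a plain `str.split` stands in for the spacy tokenizer.

```python
from itertools import islice
from multiprocessing import Pool


def whitespace_tokenize(text):
    """Stand-in tokenizer (assumption): the real function uses spacy."""
    return text.split()


def iter_chunks(rows, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def tokenize_chunk(chunk):
    """Tokenize one chunk of strings; this runs inside a worker process."""
    return [whitespace_tokenize(text) for text in chunk]


def tokenize_many_sketch(rows, chunk=1_000, processes=None):
    """Tokenize `rows` chunk by chunk across worker processes.

    processes=None lets Pool start os.cpu_count() workers.
    """
    tokens = []
    with Pool(processes) as pool:
        # imap preserves input order, so results line up with `rows`.
        for part in pool.imap(tokenize_chunk, iter_chunks(rows, chunk)):
            tokens.extend(part)
    return tokens
```

On platforms that spawn rather than fork worker processes, call `tokenize_many_sketch` from under an `if __name__ == '__main__':` guard. Smaller chunks return results more often at the cost of more inter-process overhead, which is the trade-off the `chunk` parameter describes.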

# ~5-6 seconds
x = df.a.apply(tokenize)

# ~1-2 seconds
x = tokenize_many(df.a)

Embeddings object. Lets us easily map word to index, index to
word, and word to vector. We can use this to find similar words,
build analogies, or get 2D representations for plotting.
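
To make the three mappings concrete, here is a small sketch of the idea using numpy. `TinyEmbeddings` and its methods are hypothetical names chosen for illustration and do not reflect the actual `Embeddings` interface; the toy vectors are made up.

```python
import numpy as np


class TinyEmbeddings:
    """Toy word <-> index <-> vector lookup (illustrative only)."""

    def __init__(self, words, matrix):
        self.i2w = list(words)
        self.w2i = {w: i for i, w in enumerate(self.i2w)}
        self.mat = np.asarray(matrix, dtype=float)

    def vec(self, word):
        """Return the vector stored for `word`."""
        return self.mat[self.w2i[word]]

    def nearest(self, word, n=1):
        """Return the n other words with the highest cosine similarity."""
        v = self.vec(word)
        # Assumes no all-zero rows, which would cause division by zero.
        norms = np.linalg.norm(self.mat, axis=1) * np.linalg.norm(v)
        sims = self.mat @ v / norms
        order = np.argsort(-sims)
        return [self.i2w[i] for i in order if self.i2w[i] != word][:n]


emb = TinyEmbeddings(['king', 'queen', 'apple'],
                     [[1.0, 0.9], [0.9, 1.0], [-1.0, 0.2]])
print(emb.w2i['queen'], emb.i2w[2], emb.nearest('king'))
```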

Parameters
----------

Returns
-------
str: Same language and basically the same content as the original text,
    but usually with slightly altered grammar, sentence structure, and/or
    vocabulary.

text = """
Visit ESPN to get up-to-the-minute sports news coverage, scores, highlights and commentary for NFL, MLB, NBA, College Football, NCAA Basketball and more.
"""
back_translate(text, 'es')

'Visit ESPN to get coverage of sports news, scores, highlights and comments from the NFL, MLB, NBA, college football, NCAA basketball and more.'

text = """
Visit ESPN to get up-to-the-minute sports news coverage, scores, highlights and commentary for NFL, MLB, NBA, College Football, NCAA Basketball and more.
"""
back_translate(text, 'fr')

'Visit ESPN for up-to-date sports information, scores, highlights and commentary for the NFL, MLB, NBA, college football, NCAA basketball and more.'

Implements the algorithm from the paper:

All-But-The-Top: Simple and Effective Post-Processing
for Word Representations (https://arxiv.org/pdf/1702.01417.pdf)

There are three steps:
1. Compute the mean embedding and subtract this from the
original embedding matrix.
2. Perform PCA and extract the top d components.
3. Remove each embedding's projection onto those components from the
mean-adjusted embeddings.

Parameters
----------
emb: np.array
    Embedding matrix of size (vocab_size, embedding_length).
d: int
    Number of components to use in PCA. Defaults to
    embedding_length/100 as recommended by the paper.
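
The three steps can be sketched with numpy as below. This is a minimal sketch under stated assumptions, not necessarily how `postprocess_embeddings` is written: `all_but_the_top` is a hypothetical name, PCA is done through an SVD of the centered matrix, and the `max(1, ...)` floor on the default `d` is my own addition so that short embeddings still remove one component.

```python
import numpy as np


def all_but_the_top(emb, d=None):
    """Subtract the mean, then remove the top-d principal directions."""
    emb = np.asarray(emb, dtype=float)
    if d is None:
        # embedding_length / 100, floored at 1 (the floor is an assumption).
        d = max(1, emb.shape[1] // 100)

    # Step 1: subtract the mean embedding.
    centered = emb - emb.mean(axis=0)

    # Step 2: PCA via SVD. Rows of `vt` are principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:d]  # shape (d, embedding_length)

    # Step 3: remove each row's projection onto the top-d directions.
    return centered - centered @ top.T @ top
```

The output keeps the input shape of (vocab_size, embedding_length); only the variance along the dominant directions is removed.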

Reduce embedding dimension as described in the paper:

Simple and Effective Dimensionality Reduction for Word Embeddings
(https://lld-workshop.github.io/2017/papers/LLD_2017_paper_34.pdf)

Parameters
----------
emb: np.array
    Embedding matrix of size (vocab_size, embedding_length).
d: int
    Number of components to use in the post-processing
    method described here: https://arxiv.org/pdf/1702.01417.pdf
    Defaults to embedding_length/100 as recommended by the paper.

Returns
-------
np.array: Compressed embedding matrix of shape (vocab_size, new_dim).
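
As I read the referenced paper, the reduction sandwiches a PCA projection between two rounds of the post-processing step. The sketch below follows that reading; `all_but_the_top` and `compress_sketch` are hypothetical names, and reusing one `d` (derived from the original embedding length) for both rounds is an assumption on my part.

```python
import numpy as np


def all_but_the_top(emb, d):
    """Subtract the mean, then remove the top-d principal directions."""
    centered = emb - emb.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:d]
    return centered - centered @ top.T @ top


def compress_sketch(emb, new_dim, d=None):
    """Post-process, project onto `new_dim` PCA directions, post-process."""
    emb = np.asarray(emb, dtype=float)
    if d is None:
        # Assumption: derive d once from the original embedding length.
        d = max(1, emb.shape[1] // 100)

    cleaned = all_but_the_top(emb, d)

    # PCA projection down to new_dim columns.
    centered = cleaned - cleaned.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:new_dim].T

    return all_but_the_top(reduced, d)
```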

Data Augmentation

Text transform that paraphrases input text as a method of data
augmentation. This is rather slow so it's recommended to precompute
samples and save them, but you could generate samples on the fly if
desired. One further downside of that approach is you'll have a huge
paraphrasing model on the GPU while (presumably) training another model.

Other paraphrasing models exist on Model Hub but as of 11/14/2020, none of
the results compared favorably to this pegasus model, at least based on
a rough "eyeball check". While smaller and presumably faster, many of
these appear to require processing a single example at a time which
diminishes these gains. If you do attempt to use them, you'll likely need
to write a new class with a preprocessing method that does something like
the following:
_preprocess(text) -> 'paraphrase: {text}</s>'
I'm recording this here because many are missing documentation and it took
me some time to discover this.
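
A preprocessing helper along those lines might look like the sketch below. The function name is hypothetical; only the `'paraphrase: {text}</s>'` format comes from the note above.

```python
def preprocess_paraphrase_input(text):
    """Wrap a string (or each string in a list) in the prompt format."""
    if isinstance(text, str):
        return f'paraphrase: {text}</s>'
    return [f'paraphrase: {t}</s>' for t in text]


print(preprocess_paraphrase_input('Birds were chirping.'))
```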

text = 'It was a beautiful sunny day and birds were chirping.'
texts = ['Play fun games online for free! Watch your favorite movies and tv '
         'shows here.', 
         'Bill hated school, especially math. His teacher was losing '
         'patience with him.']

p_tfm = ParaphraseTransform()

p_tfm(text, n=3, temperature=10)

['Birds were singing in the sun.',
 'Birds were singing on a nice sunny day.',
 'Birds were singing on a sunny day.']

Text transform that truncates a piece of text and completes it using
a text generation model for the purposes of data augmentation. We
recommend precomputing samples and saving them for later use, but you
could generate samples on the fly if desired. Aside from speed, this
approach also has the drawback of having a text generation model on the
GPU while (presumably) training another model.

g_tfm = GenerativeTransform(n=3)

print(texts)
g_tfm._preprocess(texts, drop_pct=.75)

['Play fun games online for free! Watch your favorite movies and tv shows here.', 'Bill hated school, especially math. His teacher was losing patience with him.']

['Play fun games online', 'Bill hated school,']
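
The truncation step can be approximated with a whitespace split. The sketch below is a guess at the behavior, not the actual `_preprocess` code: `truncate_text` is a hypothetical name and the default values are assumptions. It does reproduce the two truncated strings shown above for `drop_pct=.75`.

```python
def truncate_text(text, drop_pct=0.25, min_keep=1):
    """Keep the leading words of `text`, dropping roughly `drop_pct` of them.

    The defaults here are assumptions chosen for illustration.
    """
    words = text.split()
    n_drop = int(drop_pct * len(words))  # round the drop count down
    n_keep = max(min_keep, len(words) - n_drop)
    return ' '.join(words[:n_keep])


texts = ['Play fun games online for free! Watch your favorite movies and tv '
         'shows here.',
         'Bill hated school, especially math. His teacher was losing '
         'patience with him.']
print([truncate_text(t, drop_pct=.75) for t in texts])
```

The `min_keep` floor guards against very short inputs being truncated to nothing, leaving the generation model at least a few words of context.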

g_tfm(text, drop_pct=.75)

['It was a disaster. It was a disaster in',
 'It was a really rough thing to say," the',
 'It was a different kind of world. The city']

g_tfm(texts)

['Play fun games online for free! Watch your favorite movies and tv series from Disney in this game',
 'Play fun games online for free! Watch your favorite movies and TV shows on Netflix, Amazon Video',
 'Play fun games online for free! Watch your favorite movies and TV shows online at your mobile device',
 'Bill hated school, especially math. His teacher was losing her eyesight due to a small',
 'Bill hated school, especially math. His teacher was losing sight of the fact that his work',
 'Bill hated school, especially math. His teacher was losing touch with him. Even though she']

res = g_tfm(texts, flat=False)
res

[['Play fun games online for free! Watch your favorite movies and streams live for free! Read,',
  'Play fun games online for free! Watch your favorite movies and TV shows play out! Explore The',
  'Play fun games online for free! Watch your favorite movies and play them yourself.\n\nFree'],
 ["Bill hated school, especially math. His teacher was losing sleep over his student's grades,",
  'Bill hated school, especially math. His teacher was losing her job at the time.\n',
  'Bill hated school, especially math. His teacher was losing control of his mind. He wouldn']]

g_tfm(texts, n=2, min_length=3, max_length=5)

["Play fun games online for free! Watch your favorite movies and sports videos and you'll",
 'Play fun games online for free! Watch your favorite movies and television shows in real time',
 'Bill hated school, especially math. His teacher was losing her temper, so he',
 'Bill hated school, especially math. His teacher was losing her job and he just']

g_tfm(text, n=5, drop_pct=.5, min_keep=2)

["It was a beautiful sunny day. I'm thankful. My",
 'It was a beautiful sunny day in August but we had to',
 'It was a beautiful sunny day, and I went down to',
 'It was a beautiful sunny spot in our neighbourhood. We had',
 'It was a beautiful sunny Friday evening.\n\n"It']

Text transform that masks one or more words in a piece of text and
fills them using RoBERTa for the purposes of data augmentation. We
recommend precomputing samples and saving them for later use, but you
could generate samples on the fly if desired. In addition to being slow,
that approach also entails having a mask filling model on the GPU while
(presumably) training another model.
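
The masking half of this process is simple enough to sketch in plain Python. The helper names below are hypothetical, and splitting on whitespace is an assumption about how words are chosen; `<mask>` is the mask token RoBERTa expects. A fill-mask model would then propose replacements for the masked position.

```python
import random


def mask_word(text, idx, mask_token='<mask>'):
    """Replace the whitespace-delimited word at position `idx`."""
    words = text.split()
    words[idx] = mask_token
    return ' '.join(words)


def mask_random_word(text, seed=None, mask_token='<mask>'):
    """Mask one word chosen uniformly at random."""
    rng = random.Random(seed)
    n_words = len(text.split())
    return mask_word(text, rng.randrange(n_words), mask_token)


text = 'It was a beautiful sunny day and birds were chirping.'
print(mask_word(text, 8))
```

Because the split is on whitespace, punctuation attached to a word is masked along with it, so masking the final word here would also remove the period.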

fm_tfm = FillMaskTransform(n=4, max_n=8)

fm_tfm(texts)

['Play fun games online for free! Watch your favorite movies and tv shows here.',
 'Play fun videos online for free! Watch your favorite movies and tv shows here.',
 'Play fun music online for free! Watch your favorite movies and tv shows here.',
 'Play fun shows online for free! Watch your favorite movies and tv shows here.',
 'Bill hated school, especially math. His teacher was losing patience with him.',
 'Bill hated school, especially math. His teacher was losing touch with him.',
 'Bill hated school, especially math. His teacher was losing faith with him.',
 'Bill hated school, especially math. His teacher was losing friends with him.']

fm_tfm(texts, flat=False)

[['Play fun games online for free! Watch your favorite movies and tv shows here.',
  'Play fun games online for free! Watch your favourite movies and tv shows here.',
  'Play fun games online for free! Watch your own movies and tv shows here.',
  'Play fun games online for free! Watch your favorites movies and tv shows here.'],
 ['Bill hated school, especially math. His teacher was losing patience with him.',
  'Bill hated school, especially math. His teacher was having patience with him.',
  'Bill hated school, especially math. His teacher was no patience with him.',
  'Bill hated school, especially math. His teacher was lacking patience with him.']]

fm_tfm(text, n=2, strategy='best')

['It was a beautiful sunny day and birds were chirping.',
 'It was a beautiful sunny day and birds kept chirping.']

fm_tfm(text, n=2, strategy='random')

['It was a beautiful sunny day and they were chirping.',
 'It was a beautiful sunny day and monkeys were chirping.']

fm_tfm(text, n_mask=2, return_all=True, flat=False)

[['It was a beautiful sunny day and birds were chirping.'],
 ['It was a beautiful sunny day and birds were chirping.',
  "It's a beautiful sunny day and birds were chirping.",
  'It is a beautiful sunny day and birds were chirping.',
  'It seemed a beautiful sunny day and birds were chirping.'],
 ['It was a beautiful sunny day and birds were chirping.',
  'It was a beautiful sunny day and birds kept chirping.',
  'It was a beautiful sunny day and birds started chirping.',
  'It was a beautiful sunny day and birds are chirping.']]

Notice how quickly samples pile up when using n=-1, n_mask>1, and return_all=True.

fm_tfm(text, n=-1, n_mask=2, return_all=True)

['It was a beautiful sunny day and birds were chirping.',
 'It was a beautiful sunny day and birds were chirping.',
 'This was a beautiful sunny day and birds were chirping.',
 'Today was a beautiful sunny day and birds were chirping.',
 'Yesterday was a beautiful sunny day and birds were chirping.',
 'Sunday was a beautiful sunny day and birds were chirping.',
 ' It was a beautiful sunny day and birds were chirping.',
 ' it was a beautiful sunny day and birds were chirping.',
 'Saturday was a beautiful sunny day and birds were chirping.',
 'It was a beautiful sunny day and birds were chirping.',
 'It was a beautiful sunny day and we were chirping.',
 'It was a beautiful sunny day and kids were chirping.',
 'It was a beautiful sunny day and frogs were chirping.',
 'It was a beautiful sunny day and chickens were chirping.',
 'It was a beautiful sunny day and they were chirping.',
 'It was a beautiful sunny day and children were chirping.',
 'It was a beautiful sunny day and monkeys were chirping.',
 'This was a beautiful sunny day and birds were chirping.',
 'This was a beautiful sunny day and birds kept chirping.',
 'This was a beautiful sunny day and birds are chirping.',
 'This was a beautiful sunny day and birds started chirping.',
 'This was a beautiful sunny day and birds began chirping.',
 'This was a beautiful sunny day and birds happily chirping.',
 'This was a beautiful sunny day and birds everywhere chirping.',
 'This was a beautiful sunny day and birds stopped chirping.',
 'Today was a beautiful sunny day and birds were chirping.',
 'Today was a beautiful sunny afternoon and birds were chirping.',
 'Today was a beautiful sunny morning and birds were chirping.',
 'Today was a beautiful sunny spring and birds were chirping.',
 'Today was a beautiful sunny summer and birds were chirping.',
 'Today was a beautiful sunny weather and birds were chirping.',
 'Today was a beautiful sunny sky and birds were chirping.',
 'Today was a beautiful sunny evening and birds were chirping.',
 'Yesterday was a beautiful sunny day and birds were chirping.',
 'Yesterday was a beautiful sunny day where birds were chirping.',
 'Yesterday was a beautiful sunny day when birds were chirping.',
 'Yesterday was a beautiful sunny day as birds were chirping.',
 'Yesterday was a beautiful sunny day; birds were chirping.',
 'Yesterday was a beautiful sunny day while birds were chirping.',
 'Yesterday was a beautiful sunny day, birds were chirping.',
 'Yesterday was a beautiful sunny day - birds were chirping.',
 'Sunday was a beautiful sunny day and birds were singing',
 'Sunday was a beautiful sunny day and birds were flying',
 'Sunday was a beautiful sunny day and birds were plentiful',
 'Sunday was a beautiful sunny day and birds were nesting',
 'Sunday was a beautiful sunny day and birds were soaring',
 'Sunday was a beautiful sunny day and birds were abundant',
 'Sunday was a beautiful sunny day and birds were buzzing',
 'Sunday was a beautiful sunny day and birds were watching',
 'It was a beautiful sunny day and birds were chirping.',
 'It was a beautiful sunny day and we were chirping.',
 'It was a beautiful sunny day and kids were chirping.',
 'It was a beautiful sunny day and frogs were chirping.',
 'It was a beautiful sunny day and chickens were chirping.',
 'It was a beautiful sunny day and they were chirping.',
 'It was a beautiful sunny day and children were chirping.',
 'It was a beautiful sunny day and monkeys were chirping.',
 'it was a beautiful sunny day and birds were chirping.',
 'it was another beautiful sunny day and birds were chirping.',
 'it was this beautiful sunny day and birds were chirping.',
 'it was the beautiful sunny day and birds were chirping.',
 'it was one beautiful sunny day and birds were chirping.',
 'it was that beautiful sunny day and birds were chirping.',
 'it was an beautiful sunny day and birds were chirping.',
 'it was very beautiful sunny day and birds were chirping.',
 'It was a beautiful sunny day and birds were chirping.',
 'This was a beautiful sunny day and birds were chirping.',
 'Today was a beautiful sunny day and birds were chirping.',
 'Yesterday was a beautiful sunny day and birds were chirping.',
 'Sunday was a beautiful sunny day and birds were chirping.',
 ' It was a beautiful sunny day and birds were chirping.',
 ' it was a beautiful sunny day and birds were chirping.',
 'Saturday was a beautiful sunny day and birds were chirping.']

fm_tfm(texts, n=2, n_mask=2, return_all=True, flat=False)

[[['Play fun games online for free! Watch your favorite movies and tv shows here.'],
  ['Play fun games online for free! Watch your favorite movies and tv shows here.',
   'Play fun games online for free! Watch your favourite movies and tv shows here.'],
  ['Play fun games online for free! Watch your favorite movies and TV shows here.',
   'Play fun games online for free! Watch your favorite movies and tv shows here.']],
 [['Bill hated school, especially math. His teacher was losing patience with him.'],
  ['Bill hated school, especially math. His teacher was losing patience with him.',
   'Bill hated school, especially math. His teacher was losing patience for him.'],
  ['Bill hated school, especially math. His teacher was losing patience with him.',
   'Bill hated school, especially math. His teacher was having patience with him.']]]

We also provide a convenience function that allows us to easily generate new samples from a source dataframe or csv while preserving any desired metadata so we can map each generated row to the correct label, ID, raw text, etc.

Create augmented versions of a dataframe of text, optionally preserving
other columns for identification purposes. We recommend precomputing and
saving variations of your data rather than generating them on the fly in a
torch dataset, since the transforms can be rather space- and time-intensive.
Augmented versions of an input row should generally be kept in the same
training split: in order to keep the label the same, we usually want to
make relatively limited changes to the raw text (just enough to provide a
regularizing effect).

Parameters
----------
source: str, Path, or pd.DataFrame
    If str or Path, this is a csv containing our text data. Alternatively,
    you can pass in a dataframe itself.
transform: str or callable
    If str, this must be one of the keys in `NLP_TRANSFORMS` from this
    same module - this will be used to create a new transform object.
    Alternatively, you can pass in a previously created object (NOT the
    class). The default is the mask filling transform as it's relatively
    quick and effective. 'paraphrase' may give better (but slower)
    results. Anecdotally, 'generative' seems to provide lower quality
    results, but perhaps by experimenting with hyperparameters it could
    be more useful.
dest: str, Path, or None
    If str or Path, this is where the output file will be saved to
    (directories will be created as needed). If None, nothing will be
    saved and the function will merely return the output DF for you to do
    with as you wish.
n: int
    Number of samples to generate for each raw row.
text_col: str
    Name of column in DF containing the text to augment.
id_cols: Iterable[str]
    Columns containing identifying information such as labels, row_ids,
    etc. These also help us map the augmented text rows to their
    corresponding raw rows.
nrows: int or None
    Max number of rows from the source DF to generate text for. Useful for
    testing (equivalently, you could pass in df.head(nrows) and leave this
    as None).
tfm_kwargs: dict
    Arguments to pass to `transform`'s constructor. These are ignored when
    passing in a transform object rather than a string.
call_kwargs: dict
    Arguments to pass to the __call__ method of `transform` to affect
    the augmentation process.

Returns
-------
pd.DataFrame: DF of generated text with columns `text_col` and `id_cols`.
By default, this will have 5x as many rows as your source DF, but this can
easily be adjusted through the `n` parameter.
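
The way ID columns travel with each generated sample can be sketched with pandas as follows. This is an illustration of the idea only: `case_transform` is a trivial stand-in for a real transform, and `augment_df_sketch` is a hypothetical name that skips file I/O, `nrows`, and the keyword-argument plumbing described above.

```python
import pandas as pd


def case_transform(texts, n=2):
    """Stand-in transform: returns n trivial case variants per text."""
    variants = [str.upper, str.lower, str.title]
    return [[variants[i % len(variants)](t) for i in range(n)]
            for t in texts]


def augment_df_sketch(df, transform, n=2, text_col='text', id_cols=()):
    """Generate n rows per source row, copying `id_cols` onto each one."""
    generated = transform(df[text_col].tolist(), n=n)
    records = []
    for i, samples in enumerate(generated):
        # Metadata that maps each generated row back to its source row.
        meta = {col: df[col].iloc[i] for col in id_cols}
        for sample in samples:
            records.append({text_col: sample, **meta})
    return pd.DataFrame(records, columns=[text_col, *id_cols])


src = pd.DataFrame({'text': ['Hello there', 'Good day'], 'label': [0, 1]})
out = augment_df_sketch(src, case_transform, n=2, id_cols=['label'])
print(out)
```

Keeping the label or row ID on every generated row is what makes it possible to assign all variants of one source row to the same training split later.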

	a
24995	Row 24995: I went, yesterday; she wasn't here ...
24996	Row 24996: I went, yesterday; she wasn't here ...
24997	Row 24997: I went, yesterday; she wasn't here ...
24998	Row 24998: I went, yesterday; she wasn't here ...
24999	Row 24999: I went, yesterday; she wasn't here ...

NLP

`tokenize`[source]

`tokenize_many`[source]

`class` `Vocabulary`[source]

`class` `Embeddings`[source]

`back_translate`[source]

`postprocess_embeddings`[source]

`compress_embeddings`[source]

Data Augmentation

`class` `ParaphraseTransform`[source]

`class` `GenerativeTransform`[source]

`class` `FillMaskTransform`[source]

`augment_text_df`[source]
