Skip to content

Utils

Utility functions used by other roboduck modules.

Functions

colored(text, color)

Add tags to color text and then reset color afterwards. Note that this does NOT actually print anything.

Parameters:

Name Type Description Default
text str

Text that should be colored.

required
color str

Color name, e.g. "red". Must be available in the colorama lib. If None or empty str, just return the text unchanged.

required

Returns:

Type Description
str

Note that you need to print() the result for it to show up in the desired color. Otherwise it will just have some unintelligible characters appended and prepended.

Source code in lib/roboduck/utils.py
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
def colored(text, color):
    """Add tags to color text and then reset color afterwards. Note that this
    does NOT actually print anything.

    Parameters
    ----------
    text : str
        Text that should be colored.
    color : str
        Color name, e.g. "red". Must be available in the colorama lib. If None
        or empty str, just return the text unchanged.

    Returns
    -------
    str
        Note that you need to print() the result for it to show up in the
        desired color. Otherwise it will just have some unintelligible
        characters appended and prepended.
    """
    if not color:
        return text
    color = getattr(Fore, color.upper())
    return f'{color}{text}{Style.RESET_ALL}'

colordiff_new_str(old, new, color='green')

Given two strings, return the new one with new parts in green. Note that deletions are ignored because we want to retain only characters in the new string. Remember colors are only displayed correctly when printing the resulting string - otherwise it just looks like we added extra junk characters.

Idea is that when displaying a revised code snippet from gpt, we want to draw attention to the new bits.

Adapted from this gist + variations in comments: https://gist.github.com/ines/04b47597eb9d011ade5e77a068389521

Parameters:

Name Type Description Default
old str

This is what new is compared to when identifying differences.

required
new str

Determines content of output str.

required
color str

Text color for new characters.

'green'

Returns:

Type Description
str

Same content as new but color new parts in a different color.

Source code in lib/roboduck/utils.py
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
def colordiff_new_str(old, new, color='green'):
    """Given two strings, return the new one with new parts in green. Note that
    deletions are ignored because we want to retain only characters in the new
    string. Remember colors are only displayed correctly when printing the
    resulting string - otherwise it just looks like we added extra junk
    characters.

    Idea is that when displaying a revised code snippet from gpt, we want to
    draw attention to the new bits.

    Adapted from this gist + variations in comments:
    https://gist.github.com/ines/04b47597eb9d011ade5e77a068389521

    Parameters
    ----------
    old : str
        This is what `new` is compared to when identifying differences.
    new : str
        Determines content of output str.
    color : str
        Text color for new characters.

    Returns
    -------
    str
        Same content as `new` but color new parts in a different color.
    """
    res = []
    matcher = difflib.SequenceMatcher(None, old, new)
    for opcode, s1, e1, s2, e2 in matcher.get_opcodes():
        if opcode == 'delete':
            continue
        chunk = new[s2:e2]
        if opcode in ('insert', 'replace'):
            chunk = colored(chunk, color)
        res.append(chunk)
    return ''.join(res)

type_annotated_dict_str(dict_, func=repr)

String representation (or repr) of a dict, where each line includes an inline comment showing the type of the value.

Parameters:

Name Type Description Default
dict_ dict

The dict to represent.

required
func function

The function to apply to each key and value in the dict to get some kind of str representation. Note that it is applied to each key/value as a whole, not to each item within that key/value. See examples.

repr

Returns:

Type Description
str

Examples:

Notice below how foo and cat are not in quotes but ('bar',) and ['x'] do contain quotes.

>>> d = {'foo': 'cat', ('bar',): ['x']}
>>> type_annotated_dict_str(d, str)
{
    foo: cat,   # type: str
    ('bar',): ['x'],   # type: list
}
Source code in lib/roboduck/utils.py
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
def type_annotated_dict_str(dict_, func=repr):
    """String representation (or repr) of a dict, where each line includes
    an inline comment showing the type of the value.

    Parameters
    ----------
    dict_ : dict
        The dict to represent.
    func : function
        The function to apply to each key and value in the dict to get some
        kind of str representation. Note that it is applied to each key/value
        as a whole, not to each item within that key/value. See examples.

    Returns
    -------
    str

    Examples
    --------
    Notice below how foo and cat are not in quotes but ('bar',) and ['x'] do
    contain quotes.
    >>> d = {'foo': 'cat', ('bar',): ['x']}
    >>> type_annotated_dict_str(d, str)
    {
        foo: cat,   # type: str
        ('bar',): ['x'],   # type: list
    }
    """
    type_strs = [f'\n    {func(k)}: {func(v)},   # type: {type(v).__name__}'
                 for k, v in dict_.items()]
    return '{' + ''.join(type_strs) + '\n}'

truncated_repr(obj, max_len=79)

Return an object's repr, truncated to ensure that it doesn't take up more characters than we want. This is used to reduce our chances of using up all our available tokens in a gpt prompt simply communicating that a giant data structure exists, e.g. list(range(1_000_000)). Our use case doesn't call for anything super precise so the max_len should be thought of as more of guide than an exact max. I think it's enforced but I didn't put a whole lot of thought or effort into confirming that.

Parameters:

Name Type Description Default
obj any
required
max_len int

Max number of characters for resulting repr. Think of this more as an estimate than a hard guarantee - precision isn't important in our use case. The result will likely be shorter than this because we want to truncate in a readable place, e.g. taking the repr of the first k items of a list instead of taking the repr of all items and then slicing off the end of the repr.

79

Returns:

Type Description
str

Repr for obj, truncated to approximately max_len characters or fewer. When possible, we insert ellipses into the repr to show that truncation occurred. Technically there are some edge cases we don't handle (e.g. if obj is a class with an insanely long name) but that's not a big deal, at least at the moment. I can always revisit that later if necessary.

Source code in lib/roboduck/utils.py
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
def truncated_repr(obj, max_len=79):
    """Return an object's repr, truncated to ensure that it doesn't take up
    more characters than we want. This is used to reduce our chances of using
    up all our available tokens in a gpt prompt simply communicating that a
    giant data structure exists, e.g. list(range(1_000_000)). Our use
    case doesn't call for anything super precise so the max_len should be
    thought of as more of guide than an exact max. I think it's enforced but I
    didn't put a whole lot of thought or effort into confirming that.

    Parameters
    ----------
    obj : any
    max_len : int
        Max number of characters for resulting repr. Think of this more as an
        estimate than a hard guarantee - precision isn't important in our use
        case. The result will likely be shorter than this because we want to
        truncate in a readable place, e.g. taking the repr of the first k items
        of a list instead of taking the repr of all items and then slicing off
        the end of the repr.

    Returns
    -------
    str
        Repr for obj, truncated to approximately max_len characters or fewer.
        When possible, we insert ellipses into the repr to show that truncation
        occurred. Technically there are some edge cases we don't handle (e.g.
        if obj is a class with an insanely long name) but that's not a big
        deal, at least at the moment. I can always revisit that later if
        necessary.
    """
    def qualname(obj):
        """Similar to type(obj).__qualname__() but that method doesn't always
        include the module(s). e.g. pandas Index has __qualname__ "Index" but
        this funnction returns "<pandas.core.indexes.base.Index>".
        """
        text = str(type(obj))
        names = re.search("<class '([a-zA-Z_.]*)'>", text).groups()
        assert len(names) == 1, f'Should have found only 1 qualname but ' \
                                f'found: {names}'
        return f'<{names[0]}>'

    open2close = {
        '[': ']',
        '(': ')',
        '{': '}'
    }
    repr_ = repr(obj)
    if len(repr_) < max_len:
        return repr_
    if isinstance(obj, pd.DataFrame):
        cols = truncated_repr(obj.columns.tolist(), max_len - 26)
        return f'pd.DataFrame(columns=' \
               f'{truncated_repr(cols, max_len - 22)})'
    if isinstance(obj, pd.Series):
        return f'pd.Series({truncated_repr(obj.tolist(), max_len - 11)})'
    if isinstance(obj, dict):
        length = 5
        res = ''
        for k, v in obj.items():
            if length >= max_len - 2:
                break
            new_str = f'{k!r}: {v!r}, '
            length += len(new_str)
            res += new_str
        return "{" + res.rstrip() + "...}"
    if isinstance(obj, str):
        return repr_[:max_len - 4] + "...'"
    if isinstance(obj, Iterable):
        # A bit risky but sort of elegant. Just recursively take smaller
        # slices until we get an acceptable length. We may end up going
        # slightly over the max length after adding our ellipses but it's
        # not that big a deal, this isn't meant to be super precise. We
        # can also end up with fewer items than we could have fit - if we
        # exhaustively check every possible length one by one until we
        # find the max length that fits, we can get a very slow function
        # when inputs are long.
        # Can't easily pass smaller max_len value into recursive call
        # because we always want to compare to the user-specified value.
        n = int(max_len / len(repr_) * len(obj))
        if n == len(obj):
            # Even slicing to just first item is too long, so just revert
            # to treating this like a non-iterable object.
            return qualname(obj)
        # Need to slice set while keeping the original dtype.
        if isinstance(obj, set):
            slice_ = set(list(obj)[:n])
        else:
            try:
                slice_ = obj[:n]
            except Exception as e:
                warnings.warn(f'Failed to slice obj {obj}. Result may not be '
                              f'truncated as much as desired. Error:\n{e}')
                slice_ = obj
        repr_ = truncated_repr(slice_, max_len)
        non_brace_idx = len(repr_) - 1
        while repr_[non_brace_idx] in open2close.values():
            non_brace_idx -= 1
        if non_brace_idx <= 0 or (non_brace_idx == 3
                                  and repr_.startswith('set')):
            return repr_[:-1] + '...' + repr_[-1]
        return repr_[:non_brace_idx + 1] + ',...' + repr_[non_brace_idx + 1:]

    # We know it's non-iterable at this point.
    if isinstance(obj, type):
        return f'<class {obj.__name__}>'
    if isinstance(obj, (int, float)):
        return truncated_repr(format(obj, '.3e'), max_len)
    return qualname(obj)

load_yaml(path, section=None)

Load a yaml file. Useful for loading prompts.

Borrowed from jabberwocky.

Parameters:

Name Type Description Default
path str or Path
required
section str or None

I vaguely recall yaml files can define different subsections. This lets you return a specific one if you want. Usually leave as None which returns the whole contents.

None

Returns:

Type Description
dict
Source code in lib/roboduck/utils.py
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
def load_yaml(path, section=None):
    """Load a yaml file. Useful for loading prompts.

    Borrowed from jabberwocky.

    Parameters
    ----------
    path : str or Path
    section : str or None
        I vaguely recall yaml files can define different subsections. This lets
        you return a specific one if you want. Usually leave as None which
        returns the whole contents.

    Returns
    -------
    dict
    """
    with open(path, 'r') as f:
        data = yaml.load(f, Loader=yaml.FullLoader) or {}
    return data.get(section, data)

update_yaml(path, delete_if_none=True, **kwargs)

Update a yaml file with new values.

Parameters:

Name Type Description Default
path str or Path

Path to yaml file to update. If it doesn't exist, it will be created. Any necessary intermediate directories will be created too.

required
delete_if_none bool

If True, any k-v pairs in kwargs where v is None will be treated as an instruction to delete key k from the yaml file. If False, we will actually set {field}: None in the yaml file.

True
kwargs any

Key-value pairs to update the yaml file with.

{}
Source code in lib/roboduck/utils.py
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
def update_yaml(path, delete_if_none=True, **kwargs):
    """Update a yaml file with new values.

    Parameters
    ----------
    path : str or Path
        Path to yaml file to update. If it doesn't exist, it will be created.
        Any necessary intermediate directories will be created too.
    delete_if_none : bool
        If True, any k-v pairs in kwargs where v is None will be treated as an
        instruction to delete key k from the yaml file. If False, we will
        actually set `{field}: None` in the yaml file.
    kwargs : any
        Key-value pairs to update the yaml file with.
    """
    path = Path(path).expanduser()
    os.makedirs(path.parent, exist_ok=True)
    try:
        data = load_yaml(path)
    except FileNotFoundError:
        data = {}
    for k, v in kwargs.items():
        if v is None and delete_if_none:
            data.pop(k, None)
        else:
            data[k] = v
    with open(path, 'w') as f:
        yaml.dump(data, f)

extract_code(text, join_multi=True, multi_prefix_template='\n\n# {i}\n')

Extract code snippet from a GPT response (e.g. from our debug chat prompt. See Examples for expected format.

Parameters:

Name Type Description Default
text str
required
join_multi bool

If multiple code snippets are found, we can either choose to join them into one string or return a list of strings. If the former, we prefix each snippet with multi_prefix_template to make it clearer where each new snippet starts.

True
multi_prefix_template str

If join_multi=True and multiple code snippets are found, we prepend this to each code snippet before joining into a single string. It should accept a single parameter {i} which numbers each code snippet in the order they were found in text (1-indexed).

'\n\n# {i}\n'

Returns:

Type Description
str or list

Code snippet from text. If we find multiple snippets, we either join them into one big string (if join_multi is True) or return a list of strings otherwise.

Examples:

text = '''Appending to a tuple is not allowed because tuples are immutable.
However, in this code snippet, the tuple b contains two lists, and lists
are mutable. Therefore, appending to b[1] (which is a list) does not raise
an error. To fix this, you can either change b[1] to a tuple or create a
new tuple that contains the original elements of b and the new list.

```python
# Corrected code snippet
a = 3
b = ([0, 1], [2, 3])
b = (b[0], b[1] + [a])
```'''

print(extract_code(text))

# Extracted code snippet
'''
a = 3
b = ([0, 1], [2, 3])
b = (b[0], b[1] + [a])
'''
Source code in lib/roboduck/utils.py
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
def extract_code(text, join_multi=True, multi_prefix_template='\n\n# {i}\n'):
    """Extract code snippet from a GPT response (e.g. from our `debug` chat
    prompt. See `Examples` for expected format.

    Parameters
    ----------
    text : str
    join_multi : bool
        If multiple code snippets are found, we can either choose to join them
        into one string or return a list of strings. If the former, we prefix
        each snippet with `multi_prefix_template` to make it clearer where
        each new snippet starts.
    multi_prefix_template : str
        If join_multi=True and multiple code snippets are found, we prepend
        this to each code snippet before joining into a single string. It
        should accept a single parameter {i} which numbers each code snippet
        in the order they were found in `text` (1-indexed).

    Returns
    -------
    str or list
        Code snippet from `text`. If we find multiple snippets, we either join
        them into one big string (if join_multi is True) or return a
        list of strings otherwise.

    Examples
    --------
    ```plaintext
    text = '''Appending to a tuple is not allowed because tuples are immutable.
    However, in this code snippet, the tuple b contains two lists, and lists
    are mutable. Therefore, appending to b[1] (which is a list) does not raise
    an error. To fix this, you can either change b[1] to a tuple or create a
    new tuple that contains the original elements of b and the new list.

    ```python
    # Corrected code snippet
    a = 3
    b = ([0, 1], [2, 3])
    b = (b[0], b[1] + [a])
    ```'''

    print(extract_code(text))

    # Extracted code snippet
    '''
    a = 3
    b = ([0, 1], [2, 3])
    b = (b[0], b[1] + [a])
    '''
    ```
    """
    chunks = re.findall("(?s)```(?:python)?\n(.*?)\n```", text)
    if not join_multi:
        return chunks
    if len(chunks) > 1:
        chunks = [multi_prefix_template.format(i=i) + chunk
                  for i, chunk in enumerate(chunks, 1)]
        chunks[0] = chunks[0].lstrip()
    return ''.join(chunks)

parse_completion(text)

This function is called on the gpt completion text in roboduck.debug.DuckDB.ask_language_model (i.e. when the user asks a question during a debugging session, or when an error occurs when in auto-explain errors mode).

Users can define their own custom function as a replacement (mostly useful when defining custom prompts too). The only requirements are that the function must take 1 string input and return a dict containing the keys "explanation" and "code", with an optional key "extra" that can be used to store any additional information (probably in a dict). For example, if you wrote a prompt that asked gpt to return valid json, you could potentially use json.loads() as your drop-in replacement (ignoring validation/error handling, which you might prefer to handle via a langchain chain anyway).

Parameters:

Name Type Description Default
text str

GPT completion. This should contain both a natural language explanation and code.

required

Returns:

Type Description
dict[str]
Source code in lib/roboduck/utils.py
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
def parse_completion(text):
    """This function is called on the gpt completion text in
    roboduck.debug.DuckDB.ask_language_model (i.e. when the user asks a
    question during a debugging session, or when an error occurs when in
    auto-explain errors mode).

    Users can define their own custom function as a replacement (mostly
    useful when defining custom prompts too). The only requirements are that
    the function must take 1 string input and return a dict containing the
    keys "explanation" and "code", with an optional key "extra" that can be
    used to store any additional information (probably in a dict). For example,
    if you wrote a prompt that asked gpt to return valid json, you could
    potentially use json.loads() as your drop-in replacement (ignoring
    validation/error handling, which you might prefer to handle via a langchain
    chain anyway).

    Parameters
    ----------
    text : str
        GPT completion. This should contain both a natural language explanation
        and code.

    Returns
    -------
    dict[str]
    """
    # Keep an eye out for how this performs - considered going with a more
    # complex regex or other approach here but since part 2 is supposed to be
    # code only, maybe that's okay. Extract_code could get weird if gpt
    # uses triple backticks in a function docstring but that should be very
    # rare, and the instructions sort of discourage it.
    explanation = text.partition('\n```')[0]
    code = extract_code(text)
    return {'explanation': explanation,
            'code': code}

available_models()

Show user available values for model_name parameter in debug.DuckDB class/ debug.duck function/errors.enable function etc.

Returns:

Type Description
dict[str, list[str]]

Maps provider name (e.g. 'openai') to list of valid model_name values. Provider name should correspond to a langchain or roboduck class named like ChatOpenai (i.e. Chat{provider.title()}). Eventually would like to support other providers like anthropic but never got off API waitlist.

Source code in lib/roboduck/utils.py
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
def available_models():
    """Show user available values for model_name parameter in debug.DuckDB
    class/ debug.duck function/errors.enable function etc.

    Returns
    -------
    dict[str, list[str]]
        Maps provider name (e.g. 'openai') to list of valid
        model_name values. Provider name should correspond to a langchain or
        roboduck class named like ChatOpenai (i.e. Chat{provider.title()}).
        Eventually would like to support other providers like anthropic but
        never got off API waitlist.
    """
    # Weirdly, env var is set and available but openai can't seem to find it
    # unless we explicitly set it here.
    openai.api_key = os.environ.get('OPENAI_API_KEY')
    res = {}

    # This logic may not always hold but as of April 2023, this returns
    # openai's available chat models.
    openai_res = openai.Model.list()
    res['openai'] = [row['id'] for row in openai_res['data']
                     if row['id'].startswith('gpt')]
    return res

make_import_statement(cls_name)

Given a class name like 'roboduck.debug.DuckDB', construct the import statement (str) that should likely be used to import that class (in this case 'from roboduck.debug import DuckDB'.

Parameters:

Name Type Description Default
cls_name str

Class name including module (essentially qualname?), e.g. roboduck.DuckDB. (Note that this would need to be roboduck.debug.DuckDB if we didn't include DuckDB in roboduck's init.py.)

required

Returns:

Type Description
str

E.g. "from roboduck import DuckDB"

Source code in lib/roboduck/utils.py
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
def make_import_statement(cls_name):
    """Given a class name like 'roboduck.debug.DuckDB', construct the import
    statement (str) that should likely be used to import that class (in this
    case 'from roboduck.debug import DuckDB'.

    Parameters
    ----------
    cls_name : str
        Class name including module (essentially __qualname__?), e.g.
        roboduck.DuckDB. (Note that this would need to be roboduck.debug.DuckDB
        if we didn't include DuckDB in roboduck's __init__.py.)

    Returns
    -------
    str
        E.g. "from roboduck import DuckDB"
    """
    parts = cls_name.split('.')
    if len(parts) == 1:
        return f'import {parts[0]}'
    else:
        lib = '.'.join(parts[:-1])
        return f'from {lib} import {parts[-1]}'