1 Setup

Code

from functools import partial
from datetime import datetime
import logging, string, pandas as pd, sqlparse
from fastcore.all import *
from fastcore.docments import *
from IPython.display import Markdown,display, HTML
import pandas as pd

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

def print_function_source(fn):
    fn = print_decorator
    formatter = HtmlFormatter()
    display(HTML('<style type="text/css">{}</style>{}'.format(
        formatter.get_style_defs('.highlight'),
        highlight(inspect.getsource(fn), PythonLexer(), formatter))))

2 Purpose

The purpose of this is to introduce concepts I believe data scientists could benefit from knowing.

I am assuming that the reader knows the basics of programming. I will cover concepts I frequently see that I think are not used enough or appropriately, regardless of how basic or advanced they may be.

3 Comprehensions

Comprehensions in python should be used when possible. They are faster than forloops and require less code when they fit

x = [2,3,4,5]

out=[]
%timeit for i in range(1000000): out.append(i+1)

87.2 ms ± 647 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit [i+1 for i in range(1000000)]

56.8 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is basically special syntax for a forloop, and are useful in a subset of forloops. Basically any time you see the pattern where you initialize something, then modify or build it in the forloop you can likely use a comprehension

out = []
for o in range(5): out.append(o**o)
out

[1, 1, 4, 27, 256]

[o**o for o in range(5)]

[1, 1, 4, 27, 256]

List comprehensions are most common but you can also do tuple comprehension, set comprehension, dict comprehension, or other data types.

set(o**o for o in range(5))

{1, 4, 27, 256}

{str(o):o**o for o in range(5)}

{'0': 1, '1': 1, '2': 4, '3': 27, '4': 256}

A few handy patterns are:

Reversing a dictionary
Combining lists
All unique combos from multiple lists (nested comprehension)

adict = {"a":1,"b":2}
{v:k for k,v in adict.items()}

{1: 'a', 2: 'b'}

x = [1,2,3,4]
y = [5,6,7,8]

[a+b for a,b in zip(x,y)]

[6, 8, 10, 12]

unique_combos = L((a,b) for a in x for b in y)
unique_combos

(#16) [(1, 5),(1, 6),(1, 7),(1, 8),(2, 5),(2, 6),(2, 7),(2, 8),(3, 5),(3, 6)...]

4 Destructuring

Destructured assignments mean to can break up iterables when you assign. This is handy to reduce pointless lines of code.

a,b = 5,6
a,b,c = [],[],{}

Another use is to break up lists to create lists where we take all the first elements out into it’s own list, and the second elements out into their own lists.

I often see this done with multiple list comprehension, doing [o[0] for o in [x,y,z]] to get the first element, then repeating for other elements.

However, we can do this easier with the help of zip and destructured assignments

nested_list = [[1,2,3],[4,5,6],[7,8,9]]
nested_list

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

first_elements, second_elements, third_elements = list(zip(*nested_list))

print(f"{first_elements=}")
print(f"{second_elements=}")
print(f"{third_elements=}")

first_elements=(1, 4, 7)
second_elements=(2, 5, 8)
third_elements=(3, 6, 9)

5 Fastcore

Fastcore is a great library to know. It’s got a lot of useful features and extensions to the python standard library and it’s designed to be used in live environments like jupyter notebooks.

5.1 Parallel Processing

See this blog post

5.1.1 Docments

Nice way of documenting code concisely and being able to access info from code. It’s concise, easy to manipulate to display how you want, and easy to read. I much prefer it over the large numpy style docstrings that are big string blocks

from fastcore.docments import *

def distance(pointa:tuple,  # tuple representing the coordinates of the first point (x,y)
             pointb:tuple=(0,0) # tuple representing the coordinates of the first point (x,y)
            )->float: # float representing distance between pointa and pointb
    '''Calculates the distance between pointa and pointb'''
    edges = np.abs(np.subtract(pointa,pointa))
    distance = np.sqrt((edges**2).sum())
    return distance

docstring(distance)

'Calculates the distance between pointa and pointb'

docments(distance)

{ 'pointa': 'tuple representing the coordinates of the first point (x,y)',
  'pointb': 'tuple representing the coordinates of the first point (x,y)',
  'return': 'float representing distance between pointa and pointb'}

docments(distance,full=True)

{ 'pointa': { 'anno': <class 'tuple'>,
              'default': <class 'inspect._empty'>,
              'docment': 'tuple representing the coordinates of the first '
                         'point (x,y)'},
  'pointb': { 'anno': <class 'tuple'>,
              'default': (0, 0),
              'docment': 'tuple representing the coordinates of the first '
                         'point (x,y)'},
  'return': { 'anno': <class 'float'>,
              'default': <class 'inspect._empty'>,
              'docment': 'float representing distance between pointa and '
                         'pointb'}}

5.1.2 Testing

Everyone agrees testing is important. But not all testing is equal. The needs for unit testing the google code base are not the same as the needs a data scientist needs for building and deploying models, libraries, and most software.

Fastcore is a great tool for most of my testing needs. Fast and simple enough that I can add tests as I build and as I am exploring and building models. I want testing to enhance my development workflow, not be something I have to painstakingly build at the end.

Sometimes simple assert statements are sufficient, but there’s small annoyances. For example, a small change in type can mean a failed test. Sometimes that change in type should cause a failure, sometimes I’m ok if it’s a different type if the values are the same

from fastcore.test import *

test_eq([1,2],(1,2))

For floating points it has handy functionality for that, which is very common in data science. For example, we may want .1 + .1 + .1 == .3 to be true, because they are close enough based on floating point precision

.1 + .1 + .1 == .3

False

test_close(.1 + .1 + .1, .3)

We can test that something fails, if there are particular situation we want to ensure raise errors.

def _fail(): raise Exception("foobar")
test_fail(_fail)

We can test if 2 lists have the same values, just in different orders (convenient for testing some situations with random mini-batches).

a = list(range(5))
b = a.copy()
b.reverse()
test_shuffled(a,b)

There’s more of course, check out the docs

5.1.3 L

L is a replacement for a list, but with lots of adding functionality. Some of it are functional programming concepts, some is numpy like stuff, and some is just niceities (like cleaner printing).

alist = L(1,2,3,4,3)

alist.sort()
alist.sorted()

(#5) [1,2,3,3,4]

alist.unique()

(#4) [1,2,3,4]

alist.filter(lambda x: x < 3)

(#2) [1,2]

alist.map(lambda x: x * 2)

(#5) [2,4,6,8,6]

5.1.4 AttrDict

Attrdict is another nice thing from fastcore, that makes dictionaries a bit nicer to use.

regdict = {'a':2,'b':3}
adict = AttrDict({'a':2,'b':3})

adict

{'a': 2, 'b': 3}

adict.a

def _fail(): return regdict.a
test_fail(_fail)

6 Logging

Logging is super important. if you log stuff as you work properly you can always look back at what was done previously. Sometimes it’s hard to tell what’s going on as you run and re-run different things. Logging is handy not just in production for debugging, but also as a tool when you are developing. There are many tools to help with logging and visualizing results (for example W&B or tensorboard for deep learning) - but the foundations are good to understand and use too!

logging.basicConfig(filename="./mylog.log")
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def get_current_time(): return datetime.now().strftime('%Y%m%d_%H%M%S')

logger.info    (f'{get_current_time()}|This is an info message')

!head -4 mylog.log

INFO:root:20221106_111500|This is an info message
INFO:root:20221106_111521|Starting the model training process
INFO:root:20221106_111521|Training set has 50 records
INFO:root:20221106_111521|Validtion set has 70 records

def log_stuff(msg,**kwargs): 
    dt = get_current_time()
    logger.info(f"{dt}|{msg}")
    for k,v in kwargs.items(): logger.info(f"{dt}|{k}={v}")

log_stuff('this is what I want to log',
          trainig_set='50 records',
          validation_set='70_records')

7 Higher Order Functions & Callbacks

This is a simple example of what these terms mean:

A higher order function is a function that takes a function as an argument
A callback is a function that is passed in as an argument to a higher order function

def callbackFunc1(s): print('Callback Function 1: Length of the text file is : ', s)
def callbackFunc2(s): print('Callback Function 2: Length of the text file is : ', s)

def HigherOrderFunction(path, callback):
    with open(path, "r") as f: callback(len(f.read()))

HigherOrderFunction("mylog.log", callbackFunc1)
HigherOrderFunction("mylog.log", callbackFunc2)

Callback Function 1: Length of the text file is :  1130
Callback Function 2: Length of the text file is :  1130

This is handy in a lot of situations.

7.1 Filter

Filter is a common higher order function.

L(1,2,3,4,5).filter(lambda x: x>3)

(#2) [4,5]

This is very flexible because we can put filtering logic of any complexity in a function and use that to filter a list of any type.

7.2 Map

Map is another very common higher order function.

L(1,2,3,4,5).map(lambda x: x**2)

(#5) [1,4,9,16,25]

It is again super flexible because we can apply a function of any complexity to have it be applied and modify each element of the list.

L(1,2,3,4,5).map(lambda x: string.ascii_lowercase[x])

(#5) ['b','c','d','e','f']

7.3 Simple Logging

We could make a function for logging, where we can pass a function in that we want to use for logging (ie info vs warning).

def log_stuff(msg,fn=logger.info,**kwargs): 
    dt = get_current_time()
    fn(f"{dt}|{msg}")
    for k,v in kwargs.items(): fn(f"{dt}|{k}={v}")

log_stuff('abcd',a=1,b=55)

!tail -3 mylog.log

INFO:root:20221106_193211|abcd
INFO:root:20221106_193211|a=1
INFO:root:20221106_193211|b=55

log_stuff('something might be awry',fn=logger.critical,a=1,b=55)

!tail -3 mylog.log

CRITICAL:root:20221106_193211|something might be awry
CRITICAL:root:20221106_193211|a=1
CRITICAL:root:20221106_193211|b=55

7.4 File Processor

You can also make a generic file processor that you can pass callbacks to. This file processor can include log statements to log what you’re doing, so you can minimize repeating lots of code. For now, we’ll do a simple processor, and callbacks to clean and format a messy sql file.

def process_file(fpath,callbacks): 
    with open(fpath, "r") as f: contents = f.read()
    for callback in callbacks: contents = callback(contents)
    return contents

7.5 Format and clean SQL file

sql_formatter_cb = partial(sqlparse.format,
                strip_comments=True,comma_first=True,
                keyword_case='upper', identifier_case='lower',
                reindent=True, indent_width=4,)



qrys = process_file('test.sql',[sql_formatter_cb,sqlparse.split])

def sql_pprint(sql): display(Markdown(f"```sql\n\n{sql}\n\n```"))
for qry in qrys: sql_pprint(qry)


SELECT top 25 *
FROM some_table;


SELECT count(1)
FROM another TABLE ;


SELECT date_time
     , mbr_id
     , transactions
     , count(1)
FROM table3
WHERE date_time > '2021-02-02'
GROUP BY 1
       , 2
       , 3;

8 Decorators

Decorators give you a way to add the same functionality to many functions (like inheritance does for classes). You typically use decorator using the @ syntax, which modified the function.

8.1 Silly Simple Example

def add_another(func):
    def wrapper(number):
        print(f"The decorator took over!")
        print(f"I could log the original number ({number}) here!")
        print(f"Or I could log the original answer ({func(number)}) here!")
        return func(number) + 1
    return wrapper
    
@add_another
def add_one(number): return number + 1

So when we use a decorator, the code in the wrapper function is called instead of the original function. Typically the wrapper function calls the original function (otherwise there would be no point in decorating it as you’d just have a new unrelated function).

8.2 Useful Example

For example, maybe you want to print (or log) particular function call times and the args. See this decorator that does just that (and can be used on methods too)

from datetime import datetime

def print_decorator(func):
    def wrapper(*args, **kwargs):
        print(f"{datetime.now()}:{func}:args={args}:kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper

@print_decorator
def simple_add(a,b): return a + b

simple_add(2,4)

2022-11-02 14:18:56.635936:<function simple_add>:args=(2, 4):kwargs={}

@print_decorator
def complex_add(a,b,*args,**kwargs): 
    out = a + b
    for arg in args: out = out + arg
    for kwarg in kwargs.values(): out = out + kwarg
    return out

complex_add(5,2,3,foo=6,bar=10)

2022-11-02 14:18:57.716085:<function complex_add>:args=(5, 2, 3):kwargs={'foo': 6, 'bar': 10}

8.3 Use on Existing Functions

What we have seen is applying a decorator to functions we fully define but we can also apply them to previously existing functions like ones we import from a library. This is helpful not just in understanding one way you can extend an existing libraries functionality, but also in understanding what decorators are. They aren’t magical.

Let’s add logging to pd.DataFrame using our existing decorator so we can see when a dataframe is constructed.

LoggingDataFrame = print_decorator(pd.DataFrame)
df = LoggingDataFrame([1,2,3])

2022-11-02 14:53:16.323144:<class 'pandas.core.frame.DataFrame'>:args=([1, 2, 3],):kwargs={}

df.head()

	0
0	1
1	2
2	3

The key thing to notice here is that the @ syntax really isn’t doing anything magical. It’s just passing the function into the decorator and using that as the function definition. It’s just syntactic sugar for a higher order function that takes a function and returns a function.

To understand why this works, think through what our decorator is doing. 1. It’s a function that takes a function as an argument 2. It creates a new function called wrapper. This wrapper function called the argument passed into it, but also has other code. 3. It returns that function as the output

print_function_source(print_decorator)

def print_decorator(func):
    def wrapper(*args, **kwargs):
        print(f"{datetime.now()}:{func}:args={args}:kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper

9 Inheritance

Inheritance is the idea that you a class can “Inherit” attributes and methods from other classes.

For example a class could have an attribute a, and it can be used to create a new class to give it that attribute without having to specify it.

9.1 Silly Simple Example

class aClass: a = 2
    
class bClass(aClass): pass
    
aClass.a == bClass.a

True

9.2 Useful Examples

In many cases there are common things we want to inherit in lots of classes. One example is having access to the date. Often you want this for logging, or printing, or any number of things. By subclassing you don’t have to reformat the date each time in your classes.

class DateMinuteMixin:
    date_format='%Y%m%d_%H%M%S'
    dte = datetime.now()

    @property
    def date_str(self): return self.dte.strftime(self.date_format)

Another handy use is to have generic behavior for handling different file types. In this case, we have a mixin where it opens and reads a sql file. Rather than rewriting this code for every class that needs to read a sql file, you can inherit from a class when you need that functionality.

Tip

You can define an abstract property like below to let users know that after inheriting this class, they need to define that property. In this case, they define the sql_filepath, and they get the contents of the file for free via the other methods.

import abc

class SqlFileMixin:
    @abc.abstractproperty
    def sql_filepath(self):
        pass

    @property
    def sql_file(self):
        return open(self.sql_filepath)

    @property
    def query(self):
        return self.sql_file.read()

10 Dunders

import numpy as np
class someClass:
    def __init__(self,a): self.a = a
    def __str__(self): return f"This object's a is : {self.a}"
    def __getitem__(self,idx): return self.a[idx-1]
    def __add__(self,some_class): return list(map(lambda x,y: x + y, self.a, some_class.a))

a = someClass(x)
a.a

[1, 2, 3, 4, 25]

a + a

[2, 4, 6, 8, 50]

a[1]

<__main__.someClass>

print(a)

This object's a is : [1, 2, 3, 4, 25]

11 Iterators/Data Streaming

Iterators are useful when you don’t want to just load all data in memory all at once. They are often defined with yield, but there are other ways.

11.1 Silly Simple Example

def mapper(items,fn):
    for item in items: yield item

it = mapper([2,4,6,8],square)
it

<generator object mapper>

next(it), next(it), next(it)

(2, 4, 6)

You can also process it sequentially in a loop.

for item in mapper([2,4,6,8],square): 
    print(item)

11.2 Useful Example

11.2.1 File Streaming

print_plus = partial(print,end='\n++++++\n')

with open('test.txt', 'rb') as f:
    iterator = iter(partial(f.read, 64), b'')
    print_plus(type(iterator))
    for block in iterator: print_plus(block)

<class 'callable_iterator'>
++++++
b'one\ntwo\nthree\nfour\nfive\nsix\nseven\neight\nnine\nten\neleven\ntwelve\nt'
++++++
b'hirteen\nninety nine thousand nine hundred ninety\nninety nine tho'
++++++
b'usand nine hundred ninety one\nninety nine thousand nine hundred '
++++++
b'ninety two\nninety nine thousand nine hundred ninety three\nninety'
++++++
b' nine thousand nine hundred ninety four\nninety nine thousand nin'
++++++
b'e hundred ninety five\nninety nine thousand nine hundred ninety s'
++++++
b'ix\nninety nine thousand nine hundred ninety seven\nninety nine th'
++++++
b'ousand nine hundred ninety eight\nninety nine thousand nine hundr'
++++++
b'ed ninety nine\n'
++++++