The purpose of this is to introduce concepts I believe data scientists could benefit from knowing.
I am assuming that the reader knows the basics of programming. I will cover concepts that I frequently see underused or misused, regardless of how basic or advanced they may be.
3 Comprehensions
Comprehensions in Python should be used whenever they fit. They are typically faster than an equivalent for loop that appends to a list, and they require less code.
x = [2,3,4,5]
out = []
%timeit for i in range(1000000): out.append(i+1)
87.2 ms ± 647 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [i+1 for i in range(1000000)]
56.8 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A comprehension is essentially special syntax for a for loop, and it is useful for a subset of for loops. Any time you see the pattern where you initialize something and then modify or build it inside a for loop, you can likely use a comprehension.
out = []
for o in range(5): out.append(o**o)
out
[1, 1, 4, 27, 256]
[o**o for o in range(5)]
[1, 1, 4, 27, 256]
List comprehensions are the most common, but you can also write set and dict comprehensions, or pass a generator expression to a constructor such as set() or tuple() to build other data types.
set(o**o for o in range(5))
{1, 4, 27, 256}
{str(o):o**o for o in range(5)}
{'0': 1, '1': 1, '2': 4, '3': 27, '4': 256}
A few handy patterns are:
Reversing a dictionary
Combining lists
All unique combos from multiple lists (nested comprehension)
adict = {"a":1,"b":2}
{v:k for k,v in adict.items()}
{1: 'a', 2: 'b'}
x = [1,2,3,4]
y = [5,6,7,8]
[a+b for a,b in zip(x,y)]
[6, 8, 10, 12]
unique_combos = L((a,b) for a in x for b in y)
unique_combos
Destructuring assignment means you can break up iterables when you assign. This is handy for reducing pointless lines of code.
a,b = 5,6
a,b,c = [],[],{}
Another use is to split a list of lists so that all the first elements go into their own list, all the second elements go into their own list, and so on.
I often see this done with multiple list comprehensions, doing [o[0] for o in [x,y,z]] to get the first element, then repeating for the other elements.
However, we can do this more easily with the help of zip and destructuring assignment.
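As a minimal sketch of that idea, the example below compares the repeated-comprehension approach with zip plus destructuring. The sample lists x, y, and z are hypothetical data chosen for illustration.

```python
# Hypothetical sample data: each inner list holds (first, second, third) values.
x, y, z = [1, 2, 3], [4, 5, 6], [7, 8, 9]

# The repetitive version: one comprehension per position.
firsts_slow = [o[0] for o in [x, y, z]]
seconds_slow = [o[1] for o in [x, y, z]]

# zip(*rows) transposes the rows, and destructuring names each result.
firsts, seconds, thirds = zip(*[x, y, z])

print(firsts)   # (1, 4, 7)
print(seconds)  # (2, 5, 8)
print(thirds)   # (3, 6, 9)
```

Note that zip produces tuples, so wrap a result in list() if you need a list.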
Fastcore is a great library to know. It’s got a lot of useful features and extensions to the python standard library and it’s designed to be used in live environments like jupyter notebooks.
Docments are a nice way of documenting code concisely while still being able to access that information from code. They are easy to manipulate to display how you want and easy to read. I much prefer them over large numpy-style docstrings, which are big string blocks.
from fastcore.docments import *
import numpy as np

def distance(pointa:tuple,  # tuple representing the coordinates of the first point (x,y)
             pointb:tuple=(0,0) # tuple representing the coordinates of the second point (x,y)
            )->float: # float representing distance between pointa and pointb
    '''Calculates the distance between pointa and pointb'''
    edges = np.abs(np.subtract(pointa,pointb))
    distance = np.sqrt((edges**2).sum())
    return distance
'Calculates the distance between pointa and pointb'
Everyone agrees testing is important, but not all testing is equal. The needs for unit testing the Google code base are not the same as the needs of a data scientist building and deploying models, libraries, and most software.
Fastcore is a great tool for most of my testing needs. Fast and simple enough that I can add tests as I build and as I am exploring and building models. I want testing to enhance my development workflow, not be something I have to painstakingly build at the end.
Sometimes simple assert statements are sufficient, but there are small annoyances. For example, a small change in type can mean a failed test. Sometimes that change in type should cause a failure; sometimes I’m OK with a different type as long as the values are the same.
from fastcore.test import *
It also has handy functionality for comparing floating point numbers, which is very common in data science. For example, we may want .1 + .1 + .1 == .3 to be treated as true, because the two values are close enough given floating point precision.
test_close(.1+.1+.1, .3)
We can also test that something fails, for particular situations where we want to ensure an error is raised.
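The same ideas can be sketched with only the standard library, which may help show what such test helpers are doing conceptually. This is not fastcore’s implementation; check_fails is a hypothetical helper written for this example.

```python
import math

def check_fails(fn, *args, **kwargs):
    """Return True if calling fn raises an exception (hypothetical helper)."""
    try:
        fn(*args, **kwargs)
    except Exception:
        return True
    return False

# Strict equality is sensitive to floating point rounding.
print(.1 + .1 + .1 == .3)              # False
# A tolerance-based comparison treats the two values as equal.
print(math.isclose(.1 + .1 + .1, .3))  # True

# Values can match even when the container types differ.
print([1, 2, 3] == (1, 2, 3))          # False: list vs tuple
print(list((1, 2, 3)) == [1, 2, 3])    # True once the types are aligned

# Confirm that a bad input raises instead of passing silently.
print(check_fails(int, "not a number"))  # True
print(check_fails(int, "42"))            # False
```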
L is a replacement for a list, but with lots of added functionality. Some of it comes from functional programming, some of it is numpy-like, and some of it is just niceties (like cleaner printing).
alist = L(1,2,3,4,3)
(#5) [1,2,3,3,4]
(#4) [1,2,3,4]
alist.filter(lambda x: x < 3)
(#2) [1,2]
alist.map(lambda x: x * 2)
(#5) [2,4,6,8,6]
5.1.4 AttrDict
AttrDict is another nice thing from fastcore that makes dictionaries a bit nicer to use.
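To give a feel for the idea, here is a minimal stand-in written in plain Python. It is not fastcore’s implementation, and SimpleAttrDict is a hypothetical name; it only illustrates reading and setting dictionary keys as attributes.

```python
class SimpleAttrDict(dict):
    """Hypothetical stand-in: a dict whose keys can also be read and set as attributes."""
    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value

params = SimpleAttrDict(lr=0.01, epochs=5)
params.batch_size = 32

print(params.lr)             # 0.01
print(params["batch_size"])  # 32
```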
Logging is super important. If you log properly as you work, you can always look back at what was done previously. Sometimes it’s hard to tell what’s going on as you run and re-run different things. Logging is handy not just in production for debugging, but also as a tool while you are developing. There are many tools to help with logging and visualizing results (for example W&B or tensorboard for deep learning), but the foundations are good to understand and use too!
logger.info(f'{get_current_time()}|This is an info message')
!head -4 mylog.log
INFO:root:20221106_111500|This is an info message
INFO:root:20221106_111521|Starting the model training process
INFO:root:20221106_111521|Training set has 50 records
INFO:root:20221106_111521|Validtion set has 70 records
def log_stuff(msg,**kwargs):
    dt = get_current_time()
    logger.info(f"{dt}|{msg}")
    for k,v in kwargs.items(): logger.info(f"{dt}|{k}={v}")
log_stuff('this is what I want to log', training_set='50 records', validation_set='70_records')
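For reference, a self-contained sketch of a basic logging setup with the standard logging module might look like the following. The logger name, the temporary file location, and the body of get_current_time are assumptions made for this example; the timestamp format is chosen to resemble the log lines shown above.

```python
import logging
import os
import tempfile
from datetime import datetime

def get_current_time():
    # Assumed format, chosen to match timestamps such as 20221106_111500.
    return datetime.now().strftime('%Y%m%d_%H%M%S')

# Write to a temporary directory so the sketch does not touch an existing mylog.log.
log_path = os.path.join(tempfile.mkdtemp(), 'mylog.log')

logger = logging.getLogger('logging_sketch')  # hypothetical logger name
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
logger.addHandler(handler)

logger.info(f'{get_current_time()}|This is an info message')
handler.flush()

# Read the file back to confirm what was written.
with open(log_path) as f:
    print(f.read())
```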
7 Higher Order Functions & Callbacks
This is a simple example of what these terms mean:
A higher order function is a function that takes a function as an argument
A callback is a function that is passed in as an argument to a higher order function
def callbackFunc1(s): print('Callback Function 1: Length of the text file is : ', s)
def callbackFunc2(s): print('Callback Function 2: Length of the text file is : ', s)
def HigherOrderFunction(path, callback):
    with open(path, "r") as f: callback(len(f.read()))

HigherOrderFunction("mylog.log", callbackFunc1)
HigherOrderFunction("mylog.log", callbackFunc2)
Callback Function 1: Length of the text file is : 1130
Callback Function 2: Length of the text file is : 1130
This is handy in a lot of situations.
7.1 Filter
Filter is a common higher order function.
L(1,2,3,4,5).filter(lambda x: x>3)
(#2) [4,5]
This is very flexible because we can put filtering logic of any complexity in a function and use that to filter a list of any type.
7.2 Map
Map is another very common higher order function.
L(1,2,3,4,5).map(lambda x: x**2)
(#5) [1,4,9,16,25]
It is again super flexible because we can apply a function of any complexity to each element of the list.
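The same pattern works with Python’s built-in filter and map and with named functions instead of lambdas. The sketch below uses two hypothetical helpers, is_large_even and describe, to show that the logic passed in can be as involved as you like.

```python
def is_large_even(n):
    """Hypothetical predicate: keep even numbers greater than 3."""
    return n > 3 and n % 2 == 0

def describe(n):
    """Hypothetical transform: turn a number into a labelled string."""
    return f"value={n}"

numbers = [1, 2, 3, 4, 5, 6]

# filter and map are lazy, so wrap them in list() to see the results.
kept = list(filter(is_large_even, numbers))
labelled = list(map(describe, kept))

print(kept)      # [4, 6]
print(labelled)  # ['value=4', 'value=6']
```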
log_stuff('something might be awry',fn=logger.critical,a=1,b=55)
!tail -3 mylog.log
CRITICAL:root:20221106_193211|something might be awry
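One way log_stuff could accept a logging function as a callback is sketched below. The fn parameter name comes from the call above, but the function body, the default value, and the logger name are assumptions for illustration.

```python
import logging
from datetime import datetime

logger = logging.getLogger('callback_sketch')  # hypothetical logger name

def get_current_time():
    # Assumed timestamp format.
    return datetime.now().strftime('%Y%m%d_%H%M%S')

def log_stuff(msg, fn=logger.info, **kwargs):
    # fn is the callback: any function that accepts a single message string.
    dt = get_current_time()
    fn(f"{dt}|{msg}")
    for k, v in kwargs.items():
        fn(f"{dt}|{k}={v}")

# Passing list.append as the callback collects messages instead of logging them.
captured = []
log_stuff('something might be awry', fn=captured.append, a=1, b=55)
print(captured)

# Passing a different logger method changes the severity of every line.
log_stuff('something might be awry', fn=logger.critical, a=1, b=55)
```

Because the callback only needs to accept a string, the same function can write to a logger, print to the screen, or collect messages for a test.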
7.4 File Processor
You can also make a generic file processor that you can pass callbacks to. This file processor can include log statements to log what you’re doing, so you can minimize repeating lots of code. For now, we’ll do a simple processor, and callbacks to clean and format a messy sql file.
def process_file(fpath,callbacks):
    with open(fpath, "r") as f: contents = f.read()
    for callback in callbacks: contents = callback(contents)
    return contents
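A possible usage of such a processor is sketched below. The three cleaning callbacks and the sample SQL text are hypothetical; each callback simply takes a string and returns a string, so they can be chained in any order.

```python
import os
import re
import tempfile

def process_file(fpath, callbacks):
    with open(fpath, "r") as f:
        contents = f.read()
    for callback in callbacks:
        contents = callback(contents)
    return contents

# Hypothetical cleaning callbacks: each takes a string and returns a string.
def strip_line_comments(sql):
    return re.sub(r'--[^\n]*', '', sql)

def collapse_whitespace(sql):
    return re.sub(r'\s+', ' ', sql).strip()

def uppercase_keywords(sql):
    return re.sub(r'\b(select|from|where)\b', lambda m: m.group(1).upper(), sql)

# Write a small messy SQL file to a temporary location for the demo.
messy_sql = "select   a,b -- pick columns\nfrom   t\n\n where a > 1"
path = os.path.join(tempfile.mkdtemp(), 'messy.sql')
with open(path, 'w') as f:
    f.write(messy_sql)

cleaned = process_file(path, [strip_line_comments, collapse_whitespace, uppercase_keywords])
print(cleaned)  # SELECT a,b FROM t WHERE a > 1
```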
Decorators give you a way to add the same functionality to many functions (like inheritance does for classes). You typically apply a decorator using the @ syntax, which modifies the function.
8.1 Silly Simple Example
def add_another(func):
    def wrapper(number):
        print(f"The decorator took over!")
        print(f"I could log the original number ({number}) here!")
        print(f"Or I could log the original answer ({func(number)}) here!")
        return func(number) + 1
    return wrapper

@add_another
def add_one(number): return number + 1
So when we use a decorator, the code in the wrapper function is called instead of the original function. Typically the wrapper function calls the original function (otherwise there would be no point in decorating it as you’d just have a new unrelated function).
8.2 Useful Example
For example, maybe you want to print (or log) particular function call times and the args. See this decorator that does just that (and can be used on methods too)
@print_decorator
def complex_add(a,b,*args,**kwargs):
    out = a + b
    for arg in args: out = out + arg
    for kwarg in kwargs.values(): out = out + kwarg
    return out
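The definition of print_decorator is not shown here, so the following is only a hypothetical sketch of what such a decorator might look like. It prints the call time, the function name, and the arguments, and because the wrapper accepts *args and **kwargs it should also work on methods.

```python
import functools
from datetime import datetime

def print_decorator(func):
    """Hypothetical reconstruction: print the call time and arguments of each call."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"{datetime.now():%Y-%m-%d %H:%M:%S} | {func.__name__} args={args} kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper

@print_decorator
def complex_add(a, b, *args, **kwargs):
    out = a + b
    for arg in args: out = out + arg
    for kwarg in kwargs.values(): out = out + kwarg
    return out

result = complex_add(1, 2, 3, extra=4)
print(result)  # 10
```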
So far we have applied decorators to functions we fully define, but we can also apply them to previously existing functions, like ones we import from a library. This is helpful not just for understanding one way you can extend an existing library’s functionality, but also for understanding what decorators are. They aren’t magical.
Let’s add logging to pd.DataFrame using our existing decorator so we can see when a dataframe is constructed.
The key thing to notice here is that the @ syntax really isn’t doing anything magical. It’s just passing the function into the decorator and using that as the function definition. It’s just syntactic sugar for a higher order function that takes a function and returns a function.
To understand why this works, think through what our decorator is doing:
1. It’s a function that takes a function as an argument.
2. It creates a new function called wrapper. This wrapper function calls the function that was passed in, but also runs other code.
3. It returns that wrapper function as the output.
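A small sketch of that equivalence, using math.sqrt from the standard library in place of pd.DataFrame to avoid extra dependencies. The announce decorator is a hypothetical example, not the decorator used earlier.

```python
import functools
import math

def announce(func):
    """Hypothetical decorator: print the function name before each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

# Form 1: the @ syntax on a function we define ourselves.
@announce
def double(n): return n * 2

# Form 2: the same thing done by hand on a function that already exists.
announced_sqrt = announce(math.sqrt)

print(double(4))           # 8
print(announced_sqrt(16))  # 4.0
```

Both forms pass a function into announce and keep whatever function it returns; the @ line is just shorthand for the explicit call.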
Inheritance is the idea that a class can “inherit” attributes and methods from other classes.
For example, a class could have an attribute a, and a new class can inherit from it to get that attribute without having to specify it again.
9.1 Silly Simple Example
class aClass: a = 2
class bClass(aClass): pass
aClass.a == bClass.a
9.2 Useful Examples
In many cases there are common things we want to inherit in lots of classes. One example is having access to the date. Often you want this for logging, or printing, or any number of things. By subclassing you don’t have to reformat the date each time in your classes.
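from datetime import datetime

class DateMinuteMixin:
    date_format='%Y%m%d_%H%M%S'
    dte = datetime.now()  # assumption: the original value was lost; the current time is used here
    def date_str(self): return self.dte.strftime(self.date_format)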
Another handy use is to have generic behavior for handling different file types. In this case, we have a mixin where it opens and reads a sql file. Rather than rewriting this code for every class that needs to read a sql file, you can inherit from a class when you need that functionality.
You can define an abstract property like below to let users know that after inheriting this class, they need to define that property. In this case, they define the sql_filepath, and they get the contents of the file for free via the other methods.
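A minimal sketch of this pattern using the standard abc module is shown below. The class names SqlFileMixin, DailyReport, and BrokenReport, the read_sql method, and the demo file are all hypothetical; only the sql_filepath property name comes from the description above.

```python
import os
import tempfile
from abc import ABC, abstractmethod

class SqlFileMixin(ABC):
    """Hypothetical mixin: subclasses supply sql_filepath and get file reading for free."""
    @property
    @abstractmethod
    def sql_filepath(self):
        """Path to the SQL file; every subclass must define this."""

    def read_sql(self):
        with open(self.sql_filepath, "r") as f:
            return f.read()

# Create a small SQL file in a temporary directory for the demo.
demo_path = os.path.join(tempfile.mkdtemp(), "query.sql")
with open(demo_path, "w") as f:
    f.write("SELECT 1;")

class DailyReport(SqlFileMixin):
    sql_filepath = demo_path  # satisfies the abstract property

print(DailyReport().read_sql())  # SELECT 1;

# Forgetting to define sql_filepath fails at instantiation time.
class BrokenReport(SqlFileMixin):
    pass

try:
    BrokenReport()
except TypeError as e:
    print(f"TypeError: {e}")
```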
import numpy as np

class someClass:
    def __init__(self,a): self.a = a
    def __str__(self): return f"This object's a is : {self.a}"
    def __getitem__(self,idx): return self.a[idx-1]
    def __add__(self,some_class): return list(map(lambda x,y: x + y, self.a, some_class.a))
a = someClass(x)
a.a
[1, 2, 3, 4, 25]
a + a
[2, 4, 6, 8, 50]
This object's a is : [1, 2, 3, 4, 25]
11 Iterators/Data Streaming
Iterators are useful when you don’t want to load all the data into memory at once. They are often defined with yield, but there are other ways.
11.1 Silly Simple Example
def square(x): return x**2  # assumed definition of square
def mapper(items,fn):
    for item in items: yield fn(item)
it = mapper([2,4,6,8],square)
it
<generator object mapper>
next(it), next(it), next(it)
(4, 16, 36)
You can also process it sequentially in a loop.
for item in mapper([2,4,6,8],square): print(item)
11.2 Useful Example
11.2.1 File Streaming
from functools import partial

print_plus = partial(print,end='\n++++++\n')
with open('test.txt', 'rb') as f:
    iterator = iter(partial(f.read, 64), b'')
    print_plus(type(iterator))
    for block in iterator: print_plus(block)
<class 'callable_iterator'>
b'hirteen\nninety nine thousand nine hundred ninety\nninety nine tho'
b'usand nine hundred ninety one\nninety nine thousand nine hundred '
b'ninety two\nninety nine thousand nine hundred ninety three\nninety'
b' nine thousand nine hundred ninety four\nninety nine thousand nin'
b'e hundred ninety five\nninety nine thousand nine hundred ninety s'
b'ix\nninety nine thousand nine hundred ninety seven\nninety nine th'
b'ousand nine hundred ninety eight\nninety nine thousand nine hundr'
b'ed ninety nine\n'
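The same streaming idea can be wrapped in a generator function so that callers just loop over blocks. This is a sketch under assumptions: read_in_blocks and the demo file are hypothetical names created for this example.

```python
import os
import tempfile
from functools import partial

def read_in_blocks(path, block_size=64):
    """Yield successive blocks of bytes so the whole file never sits in memory."""
    with open(path, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(block_size) until it returns b''.
        for block in iter(partial(f.read, block_size), b''):
            yield block

# Hypothetical demo file: 150 bytes written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(demo_path, 'wb') as f:
    f.write(b'a' * 150)

sizes = [len(block) for block in read_in_blocks(demo_path)]
print(sizes)  # [64, 64, 22]
```

Only one block is held in memory at a time, so the same loop works for files far larger than available RAM.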