Data Structures and Algorithms I
Data Structures and Algorithms I
slides online: https://mth251.fastzhong.com/
https://mth251.fastzhong.com/mth251.pdf
labs: https://github.com/fastzhong/mth251/tree/main/public/notebooks
🏃♂️ learning by doing, implementing the algo from scratch
🙇🏻♂️ problem solving
if you want to dive deeper into proofs and the mathematics of computer science:
📚 Building Blocks for Theoretical Computer Science by Margaret M. Fleck
⚠️ always seek the best time and space complexity by appling DSA taught in MTH251 & MTH252
⚠️ in principle, only the standard ADT operations allowed to use by default as the solution has to be language indenpendent
⚠️ advanced features and built-in functions from Python not allowed if not clearly asked by the question, e.g. sort/search/find (in)/min(list)/max(list)/set/match … , as the complexity becomes unknown and Python dependent
The TIOBE Programming Community index is an indicator of the popularity of programming languages.
✔ Easy To Learn
✔ Human Readable
✔ Productivity
✔ Cross Platform
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Readability counts.
Special cases aren’t special enough to break the rules.
There should be one-- and preferably only one --obvious way to do it.
If the implementation is hard to explain, it’s a bad idea.
✔ backend: Python vs. Java, C++, Go, Php
✔ devops: Python vs. Go, Ruby, Shell
✔ test automation: Python vs. Groovy, shell
✔ data engineering: Python vs. Java, C++
✔ data analytics & visualization: Python vs. R, Java, C++
✔ data science & machine learning: Python vs. R, Julia, C++
numbers: int float complex
strings: '' "" ' " \t \n \r \\ etc.
boolean: True False
boolean: True False
None
type conversion/casting: int() float() str() bool() hex() ord()
collections: list tuple set dictionary
list []: a collection of items, usually the items all have the same type
tuple (): a collection which is ordered and unchangeable
set {}: a collection which is unordered and unindexed
dictionary: a set of key: value pairs, unordered, changeable and indexed
variable
statement & comments
control flow
function
error/exception
Procedural vs. OOP vs. FP
OO Principal
class, instance, attributes, properties, method
override vs. overload vs. overwrite
modular programming: function → class → module → package
Modules: Python module (default main module), C module, Build-in module
Packages:
memory, copy vs. deepcopy
help
Cheatsheet:
🧾 Python Crash Course - Cheat Sheets
🧾 Comprehensive Python Cheatsheet
Programming with Mosh
Python Tutorial - Python for Beginners 2020
freeCodeCamp
Learn Python - Full Course for Beginners Tutorial
Python for Everybody - Full University Python Course
Intermediate Python Programming Course
Tech With Tim
A data structure is a way of organizing information so that it can be used effectively by computer
Algorithms provides computer step by step instructions to process the information and solve a problem
Program = Data Structure + Algorithm
- Input - Output - Definiteness - Effectiveness - Finiteness
Example: keyword searching
Data Structure: index
Algorithm: looking up the page no.
Example: adding two vectors
Data Structure: complex number
Algorithm: adding two complex numbers
a + b = (x + yi) + (u + vi) = (x + u) + (y + v)i
Data Structure
Algorithms
Why
✓ deeper understanding of computer system
✓ improve coding skill
✓ coding interview
✓ building powerful framework and library
How
🏃♂️ learning by doing, implementing from scratch
🙇🏻♂️ problem solving
How to measure Performance/Efficiency ?
Time Complexity : by giving the size of the data set as integer N, consider the number of operations that need to be conducted by computer before the algorithm can finish
Space Complexity : by giving the size of the data set as integer N, consider the size of extra space that need to be allocated by computer before the algorithm can finish
Good code:
✔ readability
✔ speed
✔ memory
👉 When: Accessing, Searching, Inserting, Deleting, …
Lef f(n) and g(n) be functions from positive integers to positive integers to positive reals
f = O(g) if there is a constant c > 0 such that f(n) ≦ c·g(n) for large n
f(n) grows no faster than g(n)
e.g.
f(n)=O(4n2+8n+16)→O(n2)→g(n)=n2
O(4n2+8n+16)=O(n2)
Big-O describes the trend of algorithm performance when the data size increases
O(1): | constant |
O(log∗n): | logarithmic |
O(n): | linear |
O(nlog∗n): | linearithmic |
O(n2): | polynomial |
O(2n): | exponential |
O(n!): | factorial |
👉 Master theorem (analysis of algorithms)
O(f)=f
O(c⋅f)=O(f)
O(f+g)=O(max(f,g))
O(f)⋅O(g)=O(f⋅g)
O(f⋅g)≤O(f⋅h) if & only if O(g)≤O(h)
O(xa)≤O(xb) if & only if a≤b
O(ax)<O(bx) if & only if a<b
O(xc)<O(dx) if & only if d>1 (assuming c≥1 and d≥1)
O(log∗x)<O(xc) if & only if c>0
Analysis of Algorithms:
worst case
ignore constant factors, lower-order terms
asymptotic analysis (large input size)
Formally, given functions f (x) and g(x), we define a binary relation: f(x)∼g(x) (as x→∞)
if and only if: x→∞limg(x)f(x)=1
asymptotic analysis is used to build numerical methods to approximate equation solutions: asymptotic expansion, asymptotic distribution, …
To store a list of similar things, example:
A list of names: [“Alex”, “Bob”, “Charles”, “David”] A list of numbers: [1, 2, 3, 4]
Each item in the array referred as “element”
Element Type: same type (array is structured data)
Element Size: fixed
# java String[] cars = {“BMW”, “Toyota”, “Tesla”} // declare & init Integer[] scores = new Integer[10] // declare // init scores[0] = 90 scores[1] = 80
# java String[] cars = {“BMW”, “Toyota”, “Tesla”} // declare & init Integer[] scores = new Integer[10] // declare // init scores[0] = 90 scores[1] = 80
students = [ [“Alex”, “M”, “S1111111A”], [“Bob”, “M”, “S2222222B”], [“James”, “M”, “S3333333C”], ] students[2] → [“James”, “M”, “S3333333C”] Students[1][2] → “S2222222B”
Index | 0 | 1 | 2 |
---|---|---|---|
0 | Alex | M | S1111111A |
1 | Bob | M | S2222222B |
2 | James | M | S3333333C |
str = "HELLO" = ['H', 'E', 'L', 'L', 'O']
data type: char data type size: 2 byte (1 byte = 8 bits, 0000 0000 ~ 1111 1111)
total_size = array_size * data_type_size
array[i].address = base_address + i * data_type_size
👉 O(1)
- what is the good space/slot size for an array?
- when is the good time to expand/shrink the array?
Growth Pattern:
👉 memory management is one of the biggest programming challenges.
Operation | Array | Dynamic Array |
---|---|---|
Accessing | O(1) | O(1) |
Searching | O(n) | O(n) |
Inserting | - | O(n) |
Deleting | - | O(n) |
List [a, b, c, ...] | Dicts {k:v, ...} | Set {a, b, c,} | |||
---|---|---|---|---|---|
mylist.append(val) | O(1) | mydict[key] = val | O(1) | myset.add(val) | O(1) |
mylist[i] | O(1) | mydict[key] | O(1) | ||
val in mylist | O(N) | key in mydict | O(1) | val in myset | O(1) |
for val in mylist: | O(N) | for key in mydict: | O(N) | for val in myset: | O(N) |
mylist.sort() | O(NlogN) |
# .. make a list .. if thing in my_list: # O(N) # ✅ Good # .. make a set .. if thing in my_set: # O(1) # ❌ Bad # .. make a list .. my_set = set(my_list) # O(N) if thing in my_set: # O(1) # ✅ Good # .. make a list .. my_set = set(my_list) # O(N) for many_times: if thing in my_set: # O(1)
# .. make a list .. if thing in my_list: # O(N) # ✅ Good # .. make a set .. if thing in my_set: # O(1) # ❌ Bad # .. make a list .. my_set = set(my_list) # O(N) if thing in my_set: # O(1) # ✅ Good # .. make a list .. my_set = set(my_list) # O(N) for many_times: if thing in my_set: # O(1)
An abstract data type (ADT) is an abstraction of a data structure which provides only the interface to which a data structure must adhere to. The interface does not give any specific details about how something should be implemented - ADT provides implementation-independent view of a data structure.
Programming language provides different data types to implement/represent a specific data structure.
Array - a linear abstract data type
Array - a java data type
Dynamic Array - array with changable size
List - a python data type, more flexible than a dynamic array
ArrayList/Vector - java data type, implementation of List
Sequential Access vs Random Access (such as Array)
LIFO (Last In First Out) sequential collection
Operation | Stack |
---|---|
Accessing | O(n) |
Searching | O(n) |
Inserting | O(1) (push) |
Deleting | O(1) (pop) |
Operation | Queue |
---|---|
Accessing | O(n) |
Searching | O(n) |
Inserting | O(1) (enqueue) |
Deleting | O(1) (dequeue) |
dynamic linear data structure
each item contains data & pointer
data stored in a “Node” class
each item holds a relative position relative to the other items: 1st, 2nd, …, last item
- add(element) - append(element) - insert(position, element) - delete(element) - is_empty() - __len__() - set(position, element) - get(position) - search(element) - index(element) - pop() - pop(position)
Operation | Linked List | Dynamic Array |
---|---|---|
Accessing | O(n) | O(1) |
Searching | O(n) | O(n) |
Inserting | O(1) 🤔 | O(n) |
Deleting | O(1) 🤔 | O(n) |
✔ dynamic, no need to deal with fixed memory size ✖ accessing speed
a linked list where all nodes are connected to form a circle
circular singly linked list, circular doubly linked list, sorted circular linked list
- append(element) - delete(element) - search(element) - is_empty() - __len__()
- first() - last() - before(position) - after(position) - set(position, element) - search(element) - is_empty() - __len__() - add_first(element) - add_last(element) - add_before(position, element) - add_after(position, element)
Recursion is the process of defining a problem (or the solution to a problem) in terms of (a simpler version of) itself
👉 Recursion by definition is a function that calls itself.
Example:
Fibonacci sequence 0, 1, 1, 2, 3, 5, 8, …
Anything with a recursion can be done iteratively (loop)
Max call stack size (stack overflow error)
Tail Call Optimization
Memorization
Fundamental technique to solve problem:
Identifying the base case
Identifying the recursion formula/equation to transform the problem to smaller version
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>simple</title> </head> <body> <h1>A simple web page</h1> <ul> <li>List item one</li> <li>List item two</li> </ul> <h2><a href="http://www.suss.edu.sg">SUSS</a><h2> </body> </html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>simple</title> </head> <body> <h1>A simple web page</h1> <ul> <li>List item one</li> <li>List item two</li> </ul> <h2><a href="http://www.suss.edu.sg">SUSS</a><h2> </body> </html>
HTML DOM tree
One root
Max 2 child nodes
One and only one path from root to each node (path)
Max nodes on level: 2l
Max nodes total: 2h+1−1
ROOT → Left → Right
A B D H I E J C F G K
Left → Root → Right
H D I B J E A F C K G
Left → Right → Root
H I D J E B F K G C A
A B C D E F G H I J K
Concept |
|
---|---|
Complexity | Accessing O(1), Searching O(n) |
Notes |
|
Hands-on | dynamic array, stack/queue, binary search, etc. |
Concept | LIFO/FILO |
---|---|
Complexity | Accessing O(n), Searching O(n), Inserting/push O(1), Deleting/pop O(1) |
Notes | Stack implementation by dynamic array or linked list |
Hands-on | function call stack, expression matching, etc. |
Concept | FIFO/LILO |
---|---|
Complexity | Accessing O(n), Searching O(n), Inserting/enqueue O(1), Deleting/dequeue O(1) |
Notes | Queue implementation by dynamic array or linked list |
Hands-on | priority queue, circular queue, job queue, resource pool, etc. |
Concept |
|
---|---|
Complexity | Accessing O(n), Searching O(n), Inserting O(1), Deleting O(1) |
Notes |
|
Hands-on | stack, queue, traverse/reverse/update/merge, etc. |
Concept |
|
---|---|
Complexity |
|
Notes | stored in array or linked nodes |
Hands-on | 4 traversal |
As we know, whenever we are given a sorted Array or LinkedList or Matrix, and we are asked to find a certain element, the best algorithm we can use is the Binary Search. - Shrink the search space every iteration (recursion) - Cannot exclude potential answers during each shrinking
Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky
— Donald Knuth
Further reading: Generalized Binary Search with predicates and main theorem:
https://www.topcoder.com/community/competitive-programming/tutorials/binary-search
In problems where we deal with sorted arrays (or LinkedLists) and need to find a set of elements that fulfill certain constraints, the Two Pointers approach becomes quite useful. The set of elements could be a pair, a triplet or even a subarray.
same direction
reverse direction
Given an array of sorted numbers and a target sum, find a pair in the array whose sum is equal to the given target.
Write a function that reverses a string. The input string is given as an array of characters char[]. Do not allocate extra space for another array, you must do this by modifying the input array in-place with O(1) extra memory.
You may assume all the characters consist of printable ascii characters.
Given the head of a Singly LinkedList, Find k'th node from the end of a linked list.
The Fast & Slow pointer approach, also known as the Hare & Tortoise algorithm, is a pointer algorithm that uses two pointers which move through the array (or sequence/LinkedList) at different speeds.
Fast & Slow Pointers, distance btw them
Speed
Given the head of a Singly LinkedList, write a function to determine if the LinkedList has a cycle in it or not.
In many problems dealing with an array (or a LinkedList), we are asked to find or calculate something among all the contiguous subarrays (or sublists) of a given size.
Given an array, find the average of all contiguous subarrays of size ‘K’ in it.
Given a string, find the length of the longest substring which has no repeating characters.
Example1 Input: String="aabccbb“ Output: 3
Explanation: The longest substring without any repeating characters is "abc".
Example2 Input: String="abbbb“ Output: 2
Explanation: The longest substring without any repeating characters is "ab".
Example3 Input: String="abccde“ Output: 3
Explanation: Longest substrings without any repeating characters are "abc" & "cde".
Also possible to perform via command line:
> # create the env > conda create -n mth251 python=3.8 > # activate the env > conda activate mth251 > # install jupyter > conda install -c conda-forge notebook > # multipledispatch for lab1 > conda install -c anaconda multipledispatch > # start jupyter notebook > jupyter notebook
> # create the env > conda create -n mth251 python=3.8 > # activate the env > conda activate mth251 > # install jupyter > conda install -c conda-forge notebook > # multipledispatch for lab1 > conda install -c anaconda multipledispatch > # start jupyter notebook > jupyter notebook
VS Code now fully integrated with Jupyter notebook, refer to this link:
Jupyter Notebooks in VS Code
Google provides online Jupyter env:
https://colab.research.google.com/
notebooks: https://github.com/fastzhong/mth251/tree/main/public/notebooks
colab: https://colab.research.google.com/github/fastzhong/mth251/blob/main/public/notebooks/lab1.ipynb
lab1.ipynb
Python
Big O
Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.
You may assume that each input would have exactly one solution, and you may not use the same element twice.
Example 1
Input: nums = [2,7,11,15], target = 9
Output: [0,1]
Because nums[0] + nums[1] == 9, we return [0, 1]
Example 2
Input: nums = [3,2,4], target = 6
Output: [1,2]
Example 3
Input: nums = [3,3], target = 6
Output: [0,1]
Given a string s containing just the characters '(', ')', '{', '}', '[' and ']', determine if the input string is valid.
An input string is valid if:
Example 1
Input: s = "()[]{}" Output: true
Example 2
Input: s = "(]" Output: false
Example 3
Input: s = "([)]" Output: false
Example 4
Input: s = "{[]}" Output: true
Priority Queue is similar to queue but the element with higher priority can be moved forward to the front. Use exiting Queue class to implement a priority queue (element with lower value has higher priority).
Priority Queue can be used in Printer Jobs or Schedule Tasks.
Circular Queue is a linear data structure in which the operations are performed based on FIFO (First In First Out) principle and the last position is connected back to the first position to make a circle. It is also called "Ring Buffer".
Design your implementation of circular queue.
Implementation of CircularQueue class:
convert binary tree from linked list to array
convert binary tree from array to linked list
check a balanced binary tree
A binary tree's maximum depth is the number of nodes along the longest path from the root node down to the farthest leaf node.
For example: the maximum depth is 3
A binary tree's minimum depth is the number of nodes along the shortest path from the root node down to the nearest leaf node.
For example: the minimum depth is 3
In a complete binary tree, every level, except possibly the last, is completely filled, and all nodes in the last level are as far left as possible. It can have between 1 and 2h nodes inclusive at the last level h.
complete = true:
complete = false:
review linear search, binary search
let us do some exercises 📝 lab5.ipynb
Go to https://www.cs.usfca.edu/~galles/visualization/Search.html to understand how Linear Search & Binary Search is working
Implement Linear Search & Binary Search in Python by yourself:
Implement a Python function to determines if a string is a palindrome, for example, ‘racecar’ and ‘level’ are palindromes
Given a sorted array nums, remove the duplicates in-place such that each element appears only once and returns the new length.
Do not allocate extra space for another array, you must do this by modifying the input array in-place with O(1) extra memory.
Example 1
Input: nums = [1,1,2]
Output: 2, nums = [1,2]
Explanation: Your function should return length = 2, with the first two elements of nums being 1 and 2 respectively. It doesn’t matter what you leave beyond the returned length.
Example 2
Input: nums = [0,0,1,1,1,2,2,3,3,4]
Output: 5, nums = [0,1,2,3,4]
Explanation: Your function should return length = 5, with the first five elements of nums being modified to 0, 1, 2, 3, and 4 respectively. It doesn’t matter what values are set beyond the returned length.
TMA review