Machine Learning — AI

Federicorudolf
12 min read · Mar 13, 2024
DALL-E image of an abstract conception of Machine Learning.

Machine Learning (ML) and Artificial Intelligence (AI) are closely intertwined (and frequently confused with one another), but ML is actually a subset of AI. While AI is the high-level concept, Machine Learning is the practice of working toward artificial intelligence with code: getting a system to learn from input data. We’re going to review the core concepts behind ML and how to use it.

To get started, we first need a definition. Machine Learning is the use of algorithms to train a program to learn from the input data we provide, so that it can extract features and generate accurate predictions without being explicitly coded for that task. This definition is a little basic and may fall short (especially nowadays, with all the hype around the topic), but it fits the needs of this article.

As you can imagine, the more data we provide, the better the predictions we get. That is why AI in general, and ML in particular, have grown so rapidly lately. Computing breakthroughs (especially in GPUs and their use for faster parallel computing) have paved the way for ML models to be trained on more and more data, which is why generative models (probably the most data-hungry of all ML models) are doing so well. The GPT-4 model is reported to have around 1.76 trillion (with a T 🤯) parameters.

Technically, these models work by applying statistics, probability, and some linear algebra; a common step is mapping high-dimensional datasets down to 2 or 3 dimensions. This allows humans to visualize the data in a comprehensible way and understand the relationships between different parameters and how they influence the whole dataset.

We want to go from something like this:

Image extracted from MathWorks

To something like this (the two images are just to illustrate the point; they are otherwise unrelated):

Image extracted from MathWorks
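
If you want to try this kind of projection yourself, here is a minimal sketch (my own toy example, unrelated to the MathWorks images above) that maps scikit-learn's 4-dimensional Iris dataset down to 2 dimensions with PCA:

# Minimal dimensionality-reduction sketch: 4-D data projected to 2-D with PCA
# (the Iris dataset stands in for any higher-dimensional dataset)
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data    # 150 samples x 4 features
y = iris.target  # 3 classes, used here only to color the plot

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # each sample is now a point in 2-D space

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('4-D Iris data projected to 2-D with PCA')
plt.show()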

The final goal is a system that is able to make predictions for a specific situation given a specific state. The machine builds a (mathematical) model that can produce an output based on the patterns it extracted from the data we provided.

Conceptually, there are three main types of Machine Learning: supervised learning, unsupervised learning, and reinforcement learning, and there are many different algorithms for each:

  1. Supervised Learning: We act as a teacher to the model: we tell the machine which input-output combination we expect, so that through its calculations it can decipher the underlying patterns in the data we provide. We refer to this data as labeled data; for example, we provide an image of a dog and tell the model "this is a dog", and repeat this with a large amount of data. When we’ve trained the model with enough data, it will be able to tell a dog apart from other animals when shown one, or to predict the price of a used car given its characteristics. With this type of model we usually get higher accuracy and more versatility (we can use these models for classification, regression or object detection, among others). On the downside, we rely only on the data we provide, which gives us less flexibility (these models tend to be designed for a specific task), and they’re not great at generalizing: they perform rather poorly when presented with new, unseen data.
    Some common algorithms used in this approach are regression analysis, decision trees, k-nearest neighbors, neural networks and SVMs.
  2. Unsupervised Learning: We let the model group similar data points into clusters and build relations and patterns between the individual entries of the provided dataset, without giving it any labels. This approach is great for discovering hidden relationships (e.g., Facebook’s "people you may know" feature) and it’s also really good at working with high-dimensional data (think of high-resolution images, which have many layers stacked on top of each other). One caveat is that unsupervised learning algorithms are a little unpredictable, since we don’t really know at first sight how the output was built. Also, noise (the term for peculiarities in the distribution of the data points) in the dataset we provide can lead to unreliable results. A minimal clustering sketch is shown right after this list.
    Some techniques used in this approach are social network analysis, k-means clustering and dimensionality/feature reduction.
  3. Reinforcement Learning: This is an important category of ML algorithms that does well on tasks such as game playing, robotics, or self-driving systems. An agent moves through a set of states, starting from an initial one, and receives a reward (or a penalty) at each state. The agent should try to maximize (or minimize, depending on what we want) that cumulative result as it goes through the possible states (we can limit the number of iterations the agent takes). There is one trade-off we’ll face with these algorithms: exploration vs. exploitation. We need to balance trying new moves or states against exploiting the ones whose reward we already know.
    Some useful algorithms used in this approach are Q-learning and deep reinforcement learning.
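
To make the unsupervised case more tangible, here is a minimal clustering sketch (my own toy data, not tied to any dataset mentioned in this article) that groups unlabeled points with k-means from scikit-learn:

import numpy as np
from sklearn.cluster import KMeans

# Toy, unlabeled 2-D data: two blobs of points around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Ask k-means for 2 clusters; note that no labels are ever provided
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(points)

print(kmeans.cluster_centers_)  # should land near (0, 0) and (5, 5)
print(cluster_ids[:10])         # cluster assigned to each of the first 10 points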

Now that we’ve established in a broad sense what Machine Learning is, and how it differs from Artificial Intelligence at the concept level, we can move on to what working with it looks like.

First of all, we need to define some tools to work with:

  1. Programming language: There are many alternatives here (practically every programming language out there), but the go-to recommendation is Python, due to its large community, reasonable learning curve, and library support. Some other alternatives are Java, C/C++/C#, R, Swift, JavaScript and Ruby. Below is a comparison between them on some relevant topics.
  2. Data source: You need to define where the data is going to come from. Maybe you already have a database you want to use; if not, there are many websites that offer free datasets. One option is https://www.kaggle.com/, where you can not only get datasets to work with but also start learning how to program ML models.
  3. Algorithm: The algorithm we choose will be closely related to the problem we face and the structure and amount of data available. For example, we could have a dataset with a few houses for sale in a given city, along with their characteristics, and we need to predict how much the next house will go for. For that particular problem and dataset, we should probably pick one of the Supervised Learning algorithms mentioned above (a minimal regression sketch is included right after this list). Another example: we have a huge dataset of spam emails, and we need to build a tool that will flag the next spam email based on the ones we provide. For this problem we could rely on one of the Unsupervised Learning algorithms mentioned earlier, since we need the machine to discover underlying patterns across all the spam emails.
  4. Infrastructure: To begin working in the field, a personal computer is usually more than enough (we can bear a couple of minutes of waiting for a program to run). Once we start scaling, we can either build our own GPU cluster or rely on distributed computing from a provider like Amazon, Google, or Microsoft.
  5. Code Libraries:
  • TensorFlow: Built and maintained by Google, it supports a range of programming languages such as Python, JavaScript and Swift. It provides most of the algorithms mentioned above, plus many of the tools needed to build ML models, such as activation functions and data-splitting utilities, among others.
  • Scikit-Learn: Originally developed at INRIA, made for Python. Much like TensorFlow, it provides practically every classic ML algorithm out there, plus the tools to build and evaluate models with them.
  • NumPy: This library is a must if you want to build any math-related model with Python. It provides practically every math function you’ll need, especially for working with arrays and matrices. A NumFOCUS-sponsored project, made for Python.
  • Pandas: This library provides data structures to better handle and transform the datasets we work with. It is mostly used to prepare data before we feed it into a model. Also a NumFOCUS-sponsored project for Python.
  • PyTorch: This is Facebook’s (now Meta’s) counterpart to TensorFlow. Really intuitive and easy to use, and it provides all the tools needed to work with ML models.
  • Keras: A library that is integrated into TensorFlow but was built independently. It focuses on neural networks, the building blocks of deep learning.
Programming Languages comparison
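
To illustrate the house-price example from point 3, here is a minimal supervised-regression sketch. The CSV path and the column names ('surface_m2', 'bedrooms', 'bathrooms', 'age_years', 'price') are hypothetical placeholders; swap in your own dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical dataset: one row per house, with a 'price' column to predict
df = pd.read_csv('./datasets/houses.csv')  # placeholder path

X = df[['surface_m2', 'bedrooms', 'bathrooms', 'age_years']]  # hypothetical feature columns
Y = df['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=10)

model = LinearRegression()
model.fit(X_train, Y_train)

predictions = model.predict(X_test)
print(mean_absolute_error(Y_test, predictions))  # average error, in the same units as the price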

Lastly, I’ll show some code examples for a few of the algorithms mentioned earlier. I won’t be showing any results; this is just to paint a picture of what the code for a simple model looks like:

Decision Trees Algorithm:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# https://www.kaggle.com/fayomi/advertising/data

df = pd.read_csv('./datasets/advertising.csv')

df = pd.get_dummies(df, columns=['Country', 'City'])

del df['Ad Topic Line']
del df['Timestamp']

# print(df.head())

# axis=1 tells the function to remove columns instead of rows
X = df.drop('Clicked on Ad', axis=1)
Y = df['Clicked on Ad']

# Note: train_size=0.3 means only 30% of the rows are used for training; the remaining 70% become the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=10, train_size=0.3, shuffle=True)

model = DecisionTreeClassifier()

model.fit(X_train, Y_train)

model_predict = model.predict(X_test)

# Now we assess the model

print(confusion_matrix(Y_test, model_predict))
# | 325 15 | 15 false positives
# | 33 327 | 33 false negatives
print('--------------------------------')
print(classification_report(Y_test, model_predict))
# Precision = 0.93 Recall = 0.93 F1 = 0.93

K-Nearest Neighbors algorithm:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv('./datasets/advertising.csv')

del df['Ad Topic Line']
del df['Timestamp']
del df['Male']
del df['Country']
del df['City']

# print(df.head())

# We'll use StandardScaler to standardize the features (zero mean, unit variance) before fitting KNN

scaler = StandardScaler()

scaler.fit(df.drop('Clicked on Ad', axis=1))
scaled_features = scaler.transform(df.drop('Clicked on Ad', axis=1))

# Now we set X and Y values

X = scaled_features
Y = df['Clicked on Ad']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=10, shuffle=True)

# Now we set the algorithm

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, Y_train)

model_predict = model.predict(X_test)

print(confusion_matrix(Y_test, model_predict))
print('--------------------------------')
print(classification_report(Y_test, model_predict))

print(model.predict(scaled_features)[0:10])  # predictions for the first 10 rows of the full (scaled) dataset

Support Vector Machine algorithm:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC # SVC = Support Vector Classifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv('./datasets/advertising.csv')

del df['Ad Topic Line']
del df['Timestamp']

# We'll use one-hot encoding to parse country and city variables into numeric values

df = pd.get_dummies(df, columns=['Country', 'City'])

X = df.drop('Clicked on Ad', axis=1)
Y = df['Clicked on Ad']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=10)

model = SVC()

model.fit(X_train, Y_train)

model_predict = model.predict(X_test)

# Now we'll use a confusion matrix to compare model_predict against the true values in Y_test

# Confusion matrix
# print(confusion_matrix(Y_test, model_predict))

# | [ 124, 22 ] | true negatives | false positives
# | [ 68, 86 ] | false negatives | true positives -> in this case the false negatives are a bit higher than expected.

# Classification report
# print(classification_report(Y_test, model_predict))

# Precision: 0.72
# Recall: 0.70
# F1: 0.69

# We can use grid search to find the optimal hyperparameters for this algorithm. There are many hyperparameters in an SVC, but we'll focus on C and gamma
# C controls the cost of misclassification: a softer margin is allowed to ignore data points that cross over it, which can lead to a better overall fit. The lower the C, the more errors the margin is permitted to ignore; a very small C means almost no penalty on misclassified data points
# gamma refers to the Gaussian radial basis function (RBF) kernel. Small gamma produces high-bias, low-variance models; conversely, large gamma produces low-bias, high-variance models.

# What grid search does is list a range of values to test for each hyperparameter. Since it tries every combination of those values, it can take a long time to run

hyperparameters = {'C': [10, 30, 50], 'gamma': [0.001, 0.0001, 0.00001]}

# From this we get that the optimal C is 50 and the optimal gamma is 0.00001. Going beyond that leads to very similar results with a lot more computational cost.

grid = GridSearchCV(SVC(), hyperparameters)

grid.fit(X_train, Y_train)

grid_predictions = grid.predict(X_test)

# Confusion matrix
print(confusion_matrix(Y_test, grid_predictions))

# | [ 129, 17 ] | true negatives | false positives
# | [ 15, 139 ] | false negatives | true positives

# Classification report
print(classification_report(Y_test, grid_predictions))


# Precision: 0.89
# Recall: 0.89
# F1: 0.89

Logistic Regression algorithm:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# https://www.kaggle.com/tayoaki/kickstarter-dataset
df = pd.read_csv('./datasets/18k_Projects.csv')

del df['Id']
del df['Name']
del df['Url']
del df['Location']
del df['Pledged']
del df['Creator']
del df['Category']
del df['Updates']
del df['Start']
del df['End']
del df['Latitude']
del df['Longitude']
del df['Start Timestamp (UTC)']
del df['End Timestamp (UTC)']
del df['Creator Bio']
del df['Creator Website']

# Now that we've cleaned up the dataset, we can proceed to transform data-points into numerical expressions
df = pd.get_dummies(df, columns=['State', 'Currency', 'Top Category', 'Facebook Connected', 'Has Video'], drop_first=True)

# print(df.isnull().sum())
# We can see here that 4 out of the 36 remaining columns contain missing values.

# Obtaining correlation coefficients with dependent variable (State_successful)
# print(df['State_successful'].corr(df['Facebook Friends'])) # This one has a relatively strong correlation (0.15), so we'll fill in its missing values with the mean
# print(df['State_successful'].corr(df['Creator - # Projects Backed'])) # This one has a relatively strong correlation (0.1), so we'll fill in its missing values with the mean
# print(df['State_successful'].corr(df['# Videos'])) # Since it doesn't have a strong correlation (0.05), and there are only 101 missing rows, we can remove them
# print(df['State_successful'].corr(df['# Words (Risks and Challenges)'])) # Since it doesn't have a strong correlation (0.007), and there are only 101 missing rows, we can remove them

# plt.figure(figsize=(12, 6))
# sns.kdeplot(df['Creator - # Projects Backed'])  # distplot is deprecated in recent seaborn; kdeplot draws the same density curve
# plt.show()

# We'll drop the rows with missing values in the Facebook Friends column due to its high variance, and fill in the missing values in the Creator - # Projects Backed column

df['Creator - # Projects Backed'] = df['Creator - # Projects Backed'].fillna(df['Creator - # Projects Backed'].mean())
df.dropna(axis=0, how='any', subset=None, inplace=True)
# print(df.columns)

# Now we set the variables
X = df.drop('State_successful', axis=1)
Y = df['State_successful']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=10, shuffle=True)

# Set the algorithm
model = LogisticRegression()
model.fit(X_train, Y_train)

model_predict = model.predict(X_test)

# Confusion matrix
print(confusion_matrix(Y_test, model_predict))

# Classification report
print(classification_report(Y_test, model_predict))

# Now we want to predict on new data
new_project = [
0, # Comments
9, # Rewards
250000, # Goal
15, # Backers
9, # Days Duration
31, # Facebook Friends
110, # Facebook Shares
1, # Creator - Projects Created
0, # Creator - Projects Backed
0, # Videos
12, # Images
8, # Words - description
65, # Words - risks...
0, # FAQs
0, # Currency_AUD
1, # Currency_CAD
0, # Currency_EUR
0, # Currency_GBP
0, # Currency_NZD
0, # Currency_USD
0, # Top Category_Art
0, # Top Category_Comics
0, # Top Category_Crafts
0, # Top Category_Dance
0, # Top Category_Design
0, # Top Category_Fashion
1, # Top Category_Film & Video
0, # Top Category_Food
0, # Top Category_Games
0, # Top Category_Journalism
0, # Top Category_Music
0, # Top Category_Photography
0, # Top Category_Publishing
0, # Top Category_Technology
0 # Top Category_Theater
]

new_pred = model.predict(pd.DataFrame([new_project], columns=X.columns))  # wrap in a DataFrame so the values line up with the training feature columns
print('Prediction: ', new_pred)

If you want to dig deeper into this topic, and start building your own Machine Learning models, here are a few resources I recommend:

  1. Andrew Ng Machine Learning Course: https://www.youtube.com/watch?v=jGwO_UgTS7I
  2. Scatterplot Press, building a house price prediction model: a free resource, part of a course on how to build a prediction model. https://scatterplotpress.teachable.com/p/house-prediction-model
  3. Machine Learning for beginners, Instagram: Instagram channel with Machine Learning related posts. https://www.instagram.com/machinelearning_beginners/?hl=es-la
  4. Mathematics for Machine Learning: This is a book to better understand what goes on behind the scenes. https://mml-book.github.io/book/mml-book.pdf
  5. “Machine Learning with random forests and decision trees: A visual Guide for beginners”: This is a book that goes through these patterns and helps visualize how they work. https://www.amazon.com/Machine-Learning-Random-Forests-Decision-ebook/dp/B01JBL8YVK

We can draw a parallel between Machine Learning as a working field and any other field, like medicine or law. Referring to ML is much like referring to Law as a whole: there are many branches within it, and many specializations. As we keep branching out in the ML field, we find there are many more topics to research and work with. On top of that, since it is software, it’s rapidly evolving and growing. Given the topic, I couldn’t wrap up this article without relying on ChatGPT for a punchline, so here it goes:

“As we navigate the expanding landscape of machine learning, it’s clear that our journey is far from over. With each new development, we’re not just coding; we’re crafting the future, piece by piece. So, stay tuned as we continue to explore this ever-changing terrain together, uncovering practical insights and forging new tools in the vast world of ML. There’s so much more to learn, and I can’t wait to see where this path takes us next.”

Thanks for reading!

— Fede
