How to get started with Python and its machine learning ecosystem
PyTorch, TensorFlow, and CUDA
Overview of ML In Python
As a standalone language, Python is a little lackluster. Don't get me wrong, there are plenty of tasks that can be accomplished using only the built-in libraries (so-called "pure Python"), but Python really unlocks its full potential through add-on packages. These packages provide additional functionality tailored to an application or field, such as machine learning.
Oftentimes, a package serves as a bridge to a highly optimized library written in a faster language — such as C. This allows us to call code from the optimized library from our Python script. We can write the logic of our program in an easily digestible language, but benefit from the speed of a lower-level language.
Python's Fundamental ML Libraries
A quick internet search for "python ml libraries" will reveal that there are dozens of tools designed to accomplish machine learning tasks in Python. Many of these libraries are built upon one of the two largest ML packages — PyTorch or TensorFlow. PyTorch and TensorFlow are what I consider to be the "base" machine learning frameworks in Python. They provide the code for fundamental operations (e.g. differentiation and backpropagation) that are used by other ML libraries.
Typically, you will need to choose either PyTorch or TensorFlow, not both. These libraries often conflict with one another, so you should only install one in a given virtual environment. Packages built on PyTorch or TensorFlow will often support both frameworks, but you may encounter a package that only works with one or the other.
CUDA - Hardware Acceleration for NVIDIA GPUs
By default, all computation done by PyTorch and TensorFlow will occur on your machine's CPU. This is sufficient for small tasks, but execution time grows beyond reasonable limits when delving into larger projects. Specialized devices — such as GPUs — can drastically speed up the calculations used to train neural networks. A library written by the device manufacturer allows other programs to leverage the hardware acceleration provided by the GPU.
CUDA is a parallel computing platform created by NVIDIA for use on its GPU products. It is widely supported in open-source libraries such as PyTorch and TensorFlow. Oftentimes, enabling GPU acceleration is as simple as a single line of code in your Python script.
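For instance, moving computation to the GPU in PyTorch can be sketched like this (a minimal example; it falls back to the CPU when no GPU is present):

```python
import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Moving a tensor (or an entire model) to the chosen device is a single call.
x = torch.ones(3, 3).to(device)
print(x.device)
```

The same `.to(device)` pattern works on models, so the rest of your training code stays unchanged regardless of which device is used.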
Installation
CUDA
Before you install PyTorch or TensorFlow, it is recommended that you install CUDA if you plan to use GPU acceleration. Unlike Python libraries, which must be installed in every new virtual environment, CUDA is installed once across your whole system. Depending on your operating system, installing CUDA can be a little tricky. Additionally, you may install CUDA without any issues, then discover that PyTorch or TensorFlow does not yet support that version.
Consequently, it is best to check that the Python library you plan to use supports the version of CUDA you have installed for your operating system and architecture.
On my Windows 11 machine, I have CUDA 11.8 installed, which you can find here.
After the installer exits, you may need to restart your computer for the installation to be complete.
You can check that CUDA is correctly installed with the following command: nvcc --version
PyTorch
I have found that PyTorch has the best support for Python on Windows. This webpage provides all the information you need to install it in your venv. If you are using the same software stack as me (Windows, CUDA 11.8, pip), you can install it using this command:
pip3 install torch --index-url https://download.pytorch.org/whl/cu118
If you visited the webpage I provided, you will notice that the command they provide also contains torchvision and torchaudio. These are extensions to PyTorch that provide datasets, transforms, and models (e.g. MNIST, cropping, ResNet) commonly used in the vision and audio domains. You don't need them to get started, and you can always install them later.
You can test your PyTorch installation and see which GPUs are detected using:
import torch

print('Import Successful!')
if torch.cuda.is_available():
    x = torch.cuda.device_count()
    for i in range(x):
        print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
else:
    print('No GPUs Detected')
TensorFlow
Unfortunately, I have found that TensorFlow has more compatibility issues than PyTorch. On the installation homepage, you will find that TensorFlow only works with Python versions 3.8-3.11, and has no GPU support on macOS. TensorFlow works best in the Linux ecosystem, but unless you have a dedicated machine for software development, you are likely running Windows or macOS on your personal machine.
Since version 2.11, TensorFlow no longer supports GPU acceleration natively on Windows. You can circumvent this by using the Windows Subsystem for Linux (WSL2), but that comes with its own set of headaches and compatibility issues.
For this reason, PyTorch is often preferred in research environments for its ease of use. TensorFlow has a slight edge in large-scale production projects, and is often used in industry. Since this blog mainly concerns R&D, I will be using PyTorch in all of my projects.
Beyond PyTorch
I would be remiss if I didn't mention a few of the other "big players" in Python's machine learning ecosystem. You will likely encounter the following libraries sometime along the road, and I will talk about some of them in the following section:
- XGBoost: Library for decision trees, ensemble learning, and gradient boosting
- Scikit-Learn: Wide range of data science tools, including clustering, regression, dimensionality reduction, and model selection.
- Hugging Face Transformers: Makes a huge selection of pretrained transformer models easily accessible. Part of the Hugging Face ecosystem.
- OpenCV: Computer vision library with a rich feature set, including many classical operations and techniques that underlie modern CNNs and vision transformers
- ONNX: An open interchange format for neural networks. Trained models can be exported to ONNX format and run on many platforms (desktop, web, mobile, microcontroller, etc.) via runtimes such as ONNX Runtime.
- Ray: A suite of libraries for optimization, scaling, and distributed computation. Includes RLlib, a library for training neural networks using reinforcement learning.
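To give a taste of one of these libraries, here is a minimal Scikit-Learn sketch (assuming scikit-learn is installed) that clusters a handful of 2-D points with k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious clusters of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 1 1] -- each pair lands in its own cluster
```

Most Scikit-Learn estimators follow this same fit/predict pattern, which makes the library easy to pick up.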
Other Data Science Libraries
If you are brand-new to Python, you may want to familiarize yourself with these additional libraries. They are considered Python staples and are widely used across application domains. When working with Python's machine learning ecosystem, you will likely encounter references to these libraries or may even need to use them yourself:
- NumPy: Essential for any application involving vectorization and array computation.
- Pandas: Dataframes in Python. Easily load, manipulate, and transform tabular data.
- Matplotlib: Fundamental visualization library. Commonly used to create plots and other data visualizations.
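As a small taste of NumPy's vectorized style, the following sketch performs elementwise arithmetic on a whole array at once, with no explicit Python loop:

```python
import numpy as np

# Vectorized arithmetic: the expression applies to every element at once.
a = np.arange(5)     # [0 1 2 3 4]
b = a * 2 + 1        # [1 3 5 7 9]
print(b.mean())      # 5.0
```

This style is the foundation that Pandas, PyTorch, and most of the libraries above build on, so it is worth getting comfortable with early.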