Choosing a Python distribution

Glossary

  • Python is a language. It has rules about what is and what is not valid syntax (see the language reference).
  • Python code is more tangible than the language; it is text that meets the requirements of the language. For example, print("hello world").
  • An implementation is required to run code. In the case of Python, this means an interpreter. CPython is the most commonly used Python implementation. Several others are listed at https://www.python.org/download/alternatives/.
  • A distribution is an implementation that has been made available for other people to install (for example, by downloading it directly or by using a package manager). Distributions typically contain other utilities such as a profiler, debugger, libraries and documentation.
  • Languages, code, implementations and distributions all go through different versions. Versions are usually identified using a version number, often in the "major.minor.patch" format.

Pre-installed Python

Operating systems often come with a distribution of Python pre-installed. This is true of Apocrita:

$ python
Python 2.7.5 (default, Aug  7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2

This shows that the Python version is 2.7.5, and that it was compiled with GCC version 4.8.5. You should not use this Python for your projects.

Don't use the built-in Python on Apocrita

  • The built-in version of Python is currently 2.7, but an OS upgrade (for example, to CentOS 8) would change the built-in version to 3
  • Very few Python 2.7 programs are valid Python 3 programs, so an upgrade to Python 3 would likely break your code
  • Python 2 has not been supported by the Python Software Foundation since January 2020, so new projects should be written in Python 3

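If a script might be started with the wrong interpreter by accident, a version guard at the top makes the failure obvious. The following is a minimal sketch, not a required pattern: the function name check_python_version and the minimum version (3, 6) are assumptions chosen for illustration. It avoids f-strings so that the file still parses under Python 2 and can report the problem.

```python
import sys

# Hypothetical minimum version; choose whatever your project actually needs.
MINIMUM_VERSION = (3, 6)


def check_python_version(minimum=MINIMUM_VERSION, current=None):
    """Return True if `current` (default: the running interpreter) is at least `minimum`."""
    if current is None:
        current = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(current[:2]) >= tuple(minimum)


if not check_python_version():
    # %-formatting is used on purpose: it is valid in both Python 2 and Python 3.
    sys.exit("This project requires Python %d.%d or newer" % MINIMUM_VERSION)
```

Run with the built-in Python 2.7, a script starting with this guard would exit with the message instead of failing later with a confusing syntax or import error.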
Identifying a distribution

Once we have loaded a Python module with, for example,

module load python/3.8.5

starting the interpreter from the command line will give the version and compiler information:

$ python
Python 3.8.5 (default, Oct 20 2020, 17:13:17)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux

To find out the implementation details, we can use the platform module from the standard library:

>>> import platform
>>> platform.python_implementation()
'CPython'
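
The same information can be collected from a script, which is handy when recording which interpreter produced a set of results. This is a small sketch using only standard-library calls; the helper name describe_interpreter is our own. The path in sys.executable often hints at which distribution is in use, although that depends on how it was installed.

```python
import platform
import sys


def describe_interpreter():
    """Collect identifying details about the running interpreter."""
    return {
        "implementation": platform.python_implementation(),  # e.g. 'CPython'
        "version": platform.python_version(),  # e.g. '3.8.5'
        "compiler": platform.python_compiler(),  # e.g. 'GCC 4.8.5 ...'
        "executable": sys.executable,  # path to the interpreter binary
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```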

In this case, we would refer to both the implementation and the distribution as "CPython". As we'll show below, the CPython implementation is also included in other distributions.

Distributions on Apocrita

There are currently three Python distributions available on Apocrita: CPython, Anaconda and Intel Distribution for Python. They all use the CPython implementation. You can see available Python modules with

module avail python anaconda intelpython

CPython

CPython is an implementation of Python and also a distribution. It is the reference implementation, is written in C (thus the name "CPython") and comes with the standard library, a large library of commonly used functions and classes written in Python and C.

On Apocrita, this implementation is accessed through python/ modules. For example

module load python

On another computer, it can be installed either by downloading from python.org or using your system's package manager, where it is probably called python or python3.

Why use CPython?

  • It is the "official" and most widely used implementation
  • You can view and modify the source code
  • It is available on a wide array of platforms, from Apocrita to Raspberry Pi

Why not use CPython?

  • There may be other distributions that are more widely used in your field
  • It may not be the quickest implementation
  • It does not come bundled with as many tools and libraries as other distributions

You can find out more about CPython on our Python page.

Anaconda

Anaconda is a distribution of Python and R that is aimed at data science. It bundles the CPython implementation with more than 200 commonly used data science and machine learning libraries. It also includes Jupyter (for making notebooks), Conda (a tool for package and environment management) and Spyder (an IDE).

On Apocrita, this distribution is accessed through the anaconda2/ and anaconda3/ modules. For example

module load anaconda3

On another computer, it can be installed by downloading it from anaconda.com.

Why use Anaconda?

  • It is widely used in the data science community
  • It includes useful software and packages
  • It may be somewhat faster than CPython

Why not use Anaconda?

  • Anything you can do with Anaconda, you can do with CPython (with a bit more work)
  • The performance improvement may be insignificant on your project
  • On your own computer, Anaconda may require considerably more space than CPython, unless you use Miniconda, the minimal installer of conda and Python

You can find out more on our Anaconda page.

Intel Distribution for Python

Intel Distribution for Python is a distribution that includes some proprietary high performance libraries. Like Anaconda, it includes its own versions of commonly used data science libraries such as NumPy and SciPy and comes with both the conda and pip commands for package management.

On Apocrita, this distribution is accessed through the intelpython/ modules. For example

module load intelpython

On another computer, there are several ways to install Intel Distribution for Python. The recommended method is to download it from Intel.

Why use Intel Distribution for Python?

  • It may be quicker than CPython for some tasks
  • It includes utility and performance tools such as Conda and MKL

Why not use Intel Distribution for Python?

  • Anything you can do with Intel Distribution for Python, you can do with CPython or Anaconda (with a bit more work)
  • The performance improvement, while large for some tasks, may be insignificant on your project
  • It is the least widely used of the three distributions, so could make collaboration harder

scikit-learn

If you are using scikit-learn, you should set the USE_DAAL4PY_SKLEARN environment variable to YES, TRUE or 1 as specified here. This is mentioned when you install scikit-learn but is easily missed.

conda channels

  • You should prefer the Intel conda channel when installing scientific libraries
  • You can do this explicitly when using the conda install command (e.g. you can install numpy with conda install -c intel numpy)
  • You can tell conda to prefer the Intel channel by putting it at the top of the channel list with conda config --prepend channels intel
  • You can confirm that the Intel version of a package has been installed with conda list

Choosing a distribution

Finally, we offer some thoughts on choosing a distribution for your project.

Speed

The speed difference between distributions is often only slight. In a small number of cases, Intel Distribution for Python is considerably faster because of its optimised DAAL and MKL libraries. If you think that your application could benefit from this, we strongly recommend comparing the performance of Intel Distribution for Python with that of either CPython or Anaconda, to make sure that you are actually getting an improvement.
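
One way to make such a comparison is to run the same timing script under each distribution in turn. The sketch below uses only the standard library. The workload function is a hypothetical stand-in; to see any difference between distributions you would replace it with a representative piece of your own code, in particular one that exercises the numerical libraries where the optimised builds apply.

```python
import platform
import statistics
import timeit


def workload():
    """Stand-in workload (hypothetical); replace with a representative piece of your code."""
    return sum(i * i for i in range(100_000))


def benchmark(func, repeats=5, number=10):
    """Time `func` and return (best, median) seconds per call."""
    # timeit.repeat returns one total time per repeat, each covering `number` calls.
    totals = timeit.repeat(func, repeat=repeats, number=number)
    per_call = [total / number for total in totals]
    return min(per_call), statistics.median(per_call)


if __name__ == "__main__":
    best, median = benchmark(workload)
    print(platform.python_implementation(), platform.python_version())
    print(f"best: {best:.6f} s, median: {median:.6f} s per call")
```

Load each module in a fresh session, run the script, and compare the reported times. Repeating the measurement guards against one-off noise from other activity on the machine.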

Given that the performance gained by changing distribution is mostly modest, your time may be better spent on other optimisation techniques such as vectorisation and parallelisation. These techniques can be used with all the Python distributions mentioned here.

Compatibility and reproducibility

When you write code on your machine, you want to know that you can run it on Apocrita or share it with a colleague quickly and reliably. We recommend making a new environment for each project you work on. The details of this environment can be saved to a file and kept under version control. The conda command that comes with Anaconda and Intel Distribution for Python offers a slightly more thorough way to do that by saving your environment to a .yml file with conda env export (see our Anaconda docs for more information). However, the same end result can be accomplished using pip, virtualenv and CPython. The most important thing is not which tool you use for this but that you use one at all.
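
To illustrate what "saving the details of an environment" means, the sketch below lists the packages visible to the running interpreter as name==version lines. It is similar in spirit to pip freeze, though pip freeze also handles cases such as editable and URL installs, so prefer the real tools for actual projects. The function name pinned_requirements is our own, and importlib.metadata requires Python 3.8 or newer.

```python
from importlib import metadata  # in the standard library from Python 3.8


def pinned_requirements():
    """Return sorted 'name==version' lines for packages visible to this interpreter."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip entries with incomplete metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for line in pinned_requirements():
        print(line)
```

Redirecting the output to a file and committing it alongside your code gives a record of the environment that a colleague, or you on Apocrita, can install from.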