Choosing a Python distribution¶
- Python is a language. It has rules about what is and what is not valid syntax (see the language reference).
- Python code is more tangible than the language; it is text that meets the
requirements of the language. For example,
- An implementation is required to run code. In the case of Python, this means an interpreter. CPython is the most commonly used Python implementation. Several others are listed at https://www.python.org/download/alternatives/.
- A distribution is an implementation that has been made available for other people to install (for example, by directly downloading or using a package manager). Distributions typically contain other utilities such as a profiler, a debugger, libraries and documentation.
- Languages, code, implementations and distributions all go through different versions. Versions are usually identified using a version number, often in the "major.minor.patch" format.
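The distinction between the language and code can be made concrete with a minimal sketch. It uses the standard library's ast module to ask the running interpreter whether a piece of text is valid under its version of the language. The helper name is_valid_python is our own and is not part of Python.

```python
import ast
import sys


def is_valid_python(source):
    """Return True if `source` parses under the running interpreter's grammar."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Versions are usually written in the "major.minor.patch" format.
print("Interpreter version: {}.{}.{}".format(*sys.version_info[:3]))

# Text that meets the requirements of the language is Python code...
print(is_valid_python("total = 1 + 2"))
# ...while text that breaks the syntax rules is rejected before anything runs.
print(is_valid_python("total = 1 +"))
```

Because the check is carried out by whichever interpreter runs the script, the answer for a given piece of text can differ between versions of the language.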
Operating systems often come with a distribution of Python pre-installed. This is true of Apocrita:
$ python
Python 2.7.5 (default, Aug 7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
This shows that the Python version is 2.7.5, and that it was compiled with GCC version 4.8.5. You should not use this Python for your projects.
Don't use the built-in Python on Apocrita
- The built-in version of Python is currently 2.7, but an OS upgrade (e.g. to CentOS 8) would change the built-in version to 3
- Very few Python 2.7 programs are valid Python 3 programs, so an upgrade to Python 3 would likely break your code
- Python 2 has not been supported by the Python Software Foundation since January 2020, so new projects should be written in Python 3
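One way to guard against a script being run by the wrong interpreter is to check the version at start-up and stop with a clear message. The sketch below is only an illustration: the minimum version of 3.6 and the name version_is_supported are assumptions, not requirements of Apocrita. The guard itself avoids Python 3 only syntax so that an old interpreter can still parse it and report the problem.

```python
import sys

# Assumption for illustration: this project needs Python 3.6 or newer.
MINIMUM_VERSION = (3, 6)


def version_is_supported(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True if the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if not version_is_supported():
    # sys.exit with a string prints it to stderr and exits with status 1.
    sys.exit(
        "This script needs Python {}.{} or newer, but is running under {}.{}".format(
            MINIMUM_VERSION[0], MINIMUM_VERSION[1],
            sys.version_info[0], sys.version_info[1]
        )
    )
```

Placed at the top of an entry-point script, this turns an accidental run under the built-in Python 2.7 into an immediate, readable error rather than a failure part-way through the program.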
Identifying a distribution¶
Python distribution module file conflicts
To prevent errors when running the Python interpreter, we have designed the Python distribution modules to produce an error when two or more are loaded into the same environment.
Once we have loaded a Python module with, for example,
module load python/3.8.5
starting the interpreter from the command line will give the version and compiler information:
$ python
Python 3.8.5 (default, Oct 20 2020, 17:13:17)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
To find out the implementation details, we can use the platform module:
>>> import platform
>>> platform.python_implementation()
'CPython'
In this case, we would refer to both the implementation and distribution as "CPython". As we'll show below, the CPython implementation is contained in different distributions.
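The same checks can be collected in a short script, which is convenient to run after loading a module. This is a minimal sketch: describe_interpreter is a name chosen here, and the conda-meta check is a heuristic we assume holds for conda-managed installations, not an official test for any particular distribution.

```python
import os
import platform
import sys


def describe_interpreter():
    """Collect details that help identify the running Python."""
    return {
        "implementation": platform.python_implementation(),
        "version": platform.python_version(),
        "compiler": platform.python_compiler(),
        "executable": sys.executable,
        # Heuristic (assumption): conda-managed installations keep a
        # "conda-meta" directory under the installation prefix.
        "conda_managed": os.path.isdir(os.path.join(sys.prefix, "conda-meta")),
    }


for key, value in describe_interpreter().items():
    print("{}: {}".format(key, value))
```

The executable path is often the quickest clue to which distribution is in use, because it shows where the interpreter was installed.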
Distributions on Apocrita¶
There are currently three Python distributions available on Apocrita: CPython, Anaconda and Intel Distribution for Python. They all use the CPython implementation. You can see available Python modules with
module avail python anaconda intelpython
CPython¶
CPython is an implementation of Python and also a distribution. It is the reference implementation, is written in C (thus the name "CPython") and comes with the standard library, a large library of commonly used functions and classes written in Python and C.
On Apocrita, this implementation is accessed through the python/ modules. For example:
module load python
On another computer, it can be installed either by downloading from
python.org or using your system's
package manager, where it is probably called
Why use CPython?
- It is the "official" and most widely used implementation
- You can view and modify the source code
- It is available on a wide array of platforms, from Apocrita to the Raspberry Pi
Why not use CPython?
- There may be other distributions that are more widely used in your field
- It may not be the quickest implementation
- It does not come bundled with as many tools and libraries as other distributions
You can find out more about CPython on our Python page.
Anaconda¶
Anaconda is a distribution of Python and R that is aimed at data science. It bundles the CPython implementation with more than 200 commonly used data science and machine learning libraries. It also includes Jupyter (for making notebooks), Conda (a tool for package and environment management) and Spyder (an IDE).
On Apocrita, this distribution is accessed through the anaconda3/ modules. For example:
module load anaconda3
On another computer, it can be installed by downloading it from anaconda.com.
Why use Anaconda?
- It is widely used in the data science community
- It includes useful software and packages
- It may be somewhat faster than CPython
Why not use Anaconda?
- Anything you can do with Anaconda, you can do with CPython (with a bit more work)
- The performance improvement may be insignificant on your project
- On your own computer, Anaconda may require considerably more space than CPython, unless you use Miniconda, the minimal installer of conda and Python
You can find out more on our Anaconda page.
Intel Distribution for Python¶
Intel Distribution for Python is a distribution that includes some proprietary high performance libraries. Like Anaconda, it includes its own versions of commonly used data science libraries such as NumPy and SciPy and comes with both the conda and pip commands for package management.
On Apocrita, this distribution is accessed through the intelpython/ modules. For example:
module load intelpython
On another computer, there are several ways to install Intel Distribution for Python. The recommended method is to download it from Intel.
Why use Intel Distribution for Python?
- It may be quicker than CPython for some tasks
- It includes utility and performance tools such as Conda and MKL
Why not use Intel Distribution for Python?
- Anything you can do with Intel Distribution for Python, you can do with CPython or Anaconda (with a bit more work)
- The performance improvement, while large for some tasks, may be insignificant on your project
- It is the least widely used of the three distributions, so could make collaboration harder
If you are using scikit-learn, you should set the
environment variable to
1 as specified
This is mentioned when you install scikit-learn but is easily missed.
- You should prefer the Intel conda channel when installing scientific libraries
- You can do this explicitly when using the conda install command (e.g. you can install numpy with conda install -c intel numpy)
- You can tell conda to prefer the Intel channel by increasing the channel priority: conda config --prepend channels intel
- You can confirm that the Intel version of a package has been installed
Choosing a distribution¶
Finally, we offer some thoughts on choosing a distribution for your project.
The speed difference between distributions is often only slight. For a small number of cases, Intel Distribution for Python is considerably faster because of its optimised DAAL and MKL libraries. If you think that your application could benefit from this, it is highly recommended that you compare the performance of Intel Distribution for Python with the performance of either CPython or Anaconda, to make sure that you are getting an improvement.
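A comparison like this can be as simple as running the same timing script under each distribution in turn. The sketch below uses only the standard library's timeit module. The workload function is a hypothetical placeholder for your own code, and best_time is a helper name chosen here. A pure-Python loop like this one is unlikely to differ much between distributions that share the CPython implementation, so replace it with the library calls your application actually makes.

```python
import platform
import timeit


def workload():
    """Hypothetical stand-in for your application: a sum of squares."""
    return sum(i * i for i in range(100000))


def best_time(func, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the measurement least disturbed by other processes.
    return min(timings) / number


print(platform.python_implementation(), platform.python_version())
print("best time per call: {:.6f} s".format(best_time(workload)))
```

Load one distribution module, run the script, then repeat in a fresh session with another module loaded, keeping the hardware and input data the same so the numbers are comparable.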
Given that the performance gained by changing distribution is mostly modest, your time may be better spent on other optimisation techniques such as vectorisation and parallelisation. These techniques can be used with all the Python distributions mentioned here.
Compatibility and reproducibility¶
When you write code on your machine, you want to know that you can run it on
Apocrita or share it with a colleague quickly and reliably. We recommend making
a new environment for each project you work on. The details of this environment
can be saved to a file and kept under version control. The conda command that
comes with Anaconda and Intel Distribution for Python offers a slightly more
thorough way to do that by saving your environment to a .yml file with
conda env export (see our Anaconda docs for more information). However, the
same end result can be accomplished using virtualenv and CPython. The most
important thing is not which tool you use for this but that you use one at all.
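To illustrate what "saving the details of an environment" means, the sketch below lists the packages installed in the current environment as name==version lines, using importlib.metadata from the standard library (Python 3.8 or newer). It is only an illustration with a helper name of our own choosing. In practice you would normally rely on pip freeze or conda env export, which also handle cases this sketch ignores.

```python
from importlib import metadata  # in the standard library from Python 3.8


def installed_packages():
    """Return sorted "name==version" strings for the current environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata has no name
            pins.add("{}=={}".format(name, dist.version))
    return sorted(pins, key=str.lower)


for pin in installed_packages():
    print(pin)
```

Redirecting the output to a file and committing that file alongside your code gives collaborators a record of the package versions the project was run with.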