I am a PhD student in Computational Linguistics. As such, I need to experiment with deep learning frameworks but have little money to build a powerful deep learning machine.
Unlike my last build, which was centered around an open source deep learning stack provided by AMD, this build is designed to be as cheap as possible.
There are some things that I included in this build that I just had lying around and some things that I had to buy to assemble it. I will break down the cost for both types of materials.
Building a deep learning machine in an old gaming computer case would probably be ideal. However, those kinds of systems have held their value better than I initially expected. In addition, RAM was a concern for systems old enough to be within my price range: gaming systems from around 2011 almost always seemed to top out at 8 GB of RAM. Of course, this can be expanded, but the highest memory capacity I could get out of an $80 motherboard seemed to be about 32 GB. I wanted the flexibility to work with large datasets in the future.
In addition, I am very familiar with old server hardware (due to the number of non-deep learning oriented servers I maintain). I started seeking out used workstations like the HP z620, z820, Lenovo Thinkstation s30, or the Dell Precision T3600.
I ended up settling on the HP z620 in particular because:
For the GPUs in this machine, I went with the GTX 1070, for two primary reasons. The first is that it is the cheapest GPU available from NVIDIA with 8 GB of VRAM. VRAM is where the models and data live on the GPU. Having more VRAM directly relates to the size of the models you can run as well as the speed with which you can train (training speed is highly dependent on batch size, and larger batches require more memory). The second reason is that the GTX 1070 typically requires only a single 8-pin power connector. Given that the z620 can supply only two 8-pin connectors, something like the 1080 Ti, which needs more power connectors, is only practical if I have no desire to expand to more than one GPU.
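To make the VRAM-versus-batch-size relationship concrete, here is a back-of-the-envelope sketch. The numbers are hypothetical, and real training also needs memory for weights, gradients, and optimizer state, so treat this as illustrative only:

```python
def batch_activation_bytes(batch_size, activations_per_example, bytes_per_value=4):
    """Rough memory needed to hold one batch of float32 activations."""
    return batch_size * activations_per_example * bytes_per_value

# e.g. a batch of 64 examples, each producing 10 million float32
# activation values (made-up figures for illustration):
gib = batch_activation_bytes(64, 10_000_000) / 2**30
print(f"{gib:.2f} GiB")  # roughly 2.4 GiB of the card's 8 GiB VRAM
```

Doubling the batch size doubles this figure, which is why an 8 GB card allows noticeably larger batches than a 4 GB one.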
The first GPU I selected was a cheap Gigabyte model that I bought off of an online forum.
The second GPU, which I bought after using the machine for a while, was a blower-design HP OEM card that I bought on eBay. This one was slightly more expensive because I had to pay tax.
Data takes up a lot of space, and I wanted no shortage of spinning metal to put mine on. The z620 came with one 500 GB drive. I also had a number of 1 TB 2.5" hard drives left over from a failed attempt at using them in my ProLiant DL380 (the drives' temperature sensors did not play well with that machine).
The z620 has two 5.25" bays. Only the top one has a CD drive. I bought a caddy from IcyDock that enabled me to put four 2.5" drives in one 5.25" bay.
| Component | Cost |
|---|---|
| Chassis | $200 |
| GTX 1070 | $200 |
| GTX 1070 | $215 |
| IcyDock Chassis | $20 |
| Total | $635 |
However, CUDA is proprietary, only works on NVIDIA GPUs, and requires proprietary Linux drivers. Many people, myself included, object to the monopolistic hold NVIDIA has established on the deep learning infrastructure market and to their non-open practices. In addition, using CUDA can be a flat-out pain on the administration side. In my experience, the CUDA utilities integrate poorly with package managers: installation pulls in a large number of additional packages, but removal only uninstalls a few of them, which has caused me repeated problems when removing CUDA or replacing it with a new version.
AMD's HIP/ROCm is slightly pickier than CUDA about the hardware it will run on. RX 5*0 GPUs, RX 4*0 GPUs, and the R9 3*0 series cannot run on older CPUs that lack PCIe v3 atomics support. Newer GPUs like the Vega 56, Vega 64, Vega Frontier Edition, and Radeon VII can run in a mode without PCIe v3 atomics, at a performance penalty.
CPUs with PCIe v3 atomics support include all Ryzen CPUs as well as all Intel CPUs from Haswell onward (i.e., Core model numbers 4000 and up). For more information on supported hardware, check out this page
Any RGB was merely an accident of pricing; I just went with the best performance for the money. In addition, a blower-style Vega 56 was used instead of an open-air Vega 56 like those from PowerColor, because the case has relatively poor airflow. Getting hot air out of the case was deemed much more important for sustained workload performance.
The main problem was that we needed a system that would work for both English and German data. In both cases, we were using social media data.
I'll make another post explaining why this text cleanup needed to be done and how YASS works, as well as some qualitative examples of the performance achieved.
The main point of this post is that PyPy drastically sped up the stemming process.
PyPy is an alternative Python engine that compiles code instead of only interpreting it on top of the PVM (Python Virtual Machine). Similar to the JVM, Python's PVM is a layer that runs bytecode generated by the interpreter. Even though the CPython interpreter is written in C, this layer still carries a large amount of overhead.
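You can see this bytecode layer directly with CPython's standard `dis` module, which disassembles a function into the instructions the PVM executes:

```python
import dis

def simple_add(a, b):
    # CPython compiles this to bytecode; the PVM then interprets it
    return a + b

# prints the bytecode instructions (a load, an add, a return)
dis.dis(simple_add)
```

Every one of those instructions is dispatched through the interpreter loop at runtime, which is where much of the overhead comes from.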
PyPy uses just-in-time compilation to get to machine code. Just-in-time compilation is a method where only the functions that are actually used are run through the compiler. Ordinary compiled languages like C or Go have separate compilation and execution stages. For example, in C you could run `gcc myprogram.c -o myprogram` to generate a binary that can then be run with `./myprogram`. PyPy and other JIT-compiled languages don't have separate compile and execute steps: the program is compiled as it is being executed, and only the pieces that need compiling are compiled. Typically, the compiled function is specialized to the types of the arguments it is called with. For example, calling this Python function:
```python
def simple_mul(num1, num2):
    """
    This function simply multiplies two numbers together and
    is just meant to be an example
    """
    return num1 * num2
```
as `simple_mul(3, 9)` would compile a version where both arguments are integers, while `simple_mul(3.2, 9)` would compile a version of the function where the first argument is a float and the second an integer.
This highly specific compilation is part of what makes languages like Julia and PyPy so fast.
An issue early on with PyPy was support for packages that use Python bindings to C libraries. These are common among data science and statistics packages, as binding to C is the usual way to get high performance out of Python. Libraries like numpy and scipy were not functional on the PyPy engine. However, recent advances have made these packages work quite well.
Unfortunately, scikit-learn does not work at the moment. Once it does, I will switch over to PyPy completely for all projects. As it is right now, I use PyPy for data extraction and preprocessing and then use vanilla Python for the classification stage.
The PyPy engine ran the code about twice as fast as the CPython version. With the full dataset of 15,616 unique tokens, the Python implementation took 1,114.59 seconds, or about 18.5 minutes. In contrast, the PyPy engine took 548.275 seconds, or about 9 minutes. This required no changes to the original Python implementation.
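The comparison itself is straightforward: the same script runs unmodified under both interpreters. Here is a minimal sketch of that kind of timing harness; the distance function is a stand-in for illustration, not the repository's actual stemming code:

```python
import time

def pairwise_distance_total(words):
    # stand-in workload: a simple character-mismatch distance
    # summed over every pair of words (illustrative only)
    total = 0
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            total += sum(c1 != c2 for c1, c2 in zip(w1, w2))
            total += abs(len(w1) - len(w2))
    return total

words = ["stem", "stems", "stemming", "stemmed"] * 50

start = time.perf_counter()
pairwise_distance_total(words)
print(f"{time.perf_counter() - start:.3f} s")
```

Running the same file with `python script.py` and then `pypy3 script.py` gives the two timings; no source changes are needed.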
A number of different vocabulary sizes were also tried to see how the performance results changed.
During execution, I noticed that file IO consumed a much larger portion of the total execution time under PyPy, while the actual logic of the code took a smaller proportion.
However, I did not quantify this. In the future, I would like to write a version of this program in Julia, with benchmarking of the initial file read, the calculation of the distances between every pair of words, the construction of the minimum spanning tree, and the writing of the results out to a file. Doing so will clarify where the performance profiles of Python, PyPy, and Julia differ.
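For reference, the minimum-spanning-tree step can be sketched in a few lines of pure Python with Prim's algorithm; the toy distance matrix below stands in for the real word-pair distances:

```python
def prim_mst(dist):
    """dist: symmetric matrix of pairwise distances (list of lists).
    Returns the MST as a list of (i, j, weight) edges."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # pick the cheapest edge leaving the current tree
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j, dist[i][j]))
        in_tree.add(j)
    return edges

# toy 4x4 symmetric distance matrix
dist = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
print(prim_mst(dist))  # [(0, 1, 1), (1, 2, 2), (2, 3, 1)]
```

This naive version scans every candidate edge at each step, which is exactly the kind of tight loop where the CPython/PyPy/Julia performance gap should show up.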
This gitea repository holds the code used.
The plot was generated using the Gadfly package for Julia.