This is an old revision of the document!
You've got yourself some storage attached to your gigabit network - now what? How do you make use of it? How do you access it?
There are several ways in which you can access that storage. Or should I say there are several protocols and software packages which enable you to access that storage. Each has advantages and disadvantages in terms of performance and convenience. This is what this article is about.
If you transfer one small file over the network the key point in choosing one of these technologies is convenience. If that file is very large (maybe a film, a CD or a DVD) you'll perform sequential access. Most of the different types of technologies offer very good performance for sequential access (with some exceptions though - see below) provided that the hardware (network card, CPU and HDD) can sustain the load. In this scenario choosing one of different technologies is again convenience.
The most interesting part though is when you have to transfer lots of files (maybe small ones). You may want to copy all your documents somewhere else or maybe move your photo or mp3 collection. Or if you're a brave one you might one to compile the kernel tree residing on your network attached storage. There are several real world scenarios that involve accessing a large number of files as fast as possible. This is the most interesting scenario because there are big performance differences between different technologies.
We all love benchmarks. Those little numbers that tell you that what you bought is 3% faster than competition. You may never notice the difference but the important part is that you got the best. In our case performance and convenience differences are so big that you don't really need some lab benchmarks to see the differences. This is why there won't be many numbers in this article.
However we did perform many tests. The tests involved copying a large tree of relatively small real files to and from our tumaBox over and over again. We used our /usr and /usr/lib partitions - that's right: real files from the real world. Of course we took necessary precautions such as clearing caches in order to avoid fake performance improvements. The reasons we don't put the actual numbers in the article are: 1. we don't think it's important whether you get 50 or 52.5 percent difference (nor even 20 or 30 percent) - the important thing is that there is a huge performance difference; 2. even though we took precautions to get real results we didn't perform a scientific benchmark by the book. So feel free to throw your stones and discard everything we say here.
This is the first type of sharing storage over the network. It basically means that client computers see the network attached storage as a tree of directories and files. Then they can issue commands to the server to read or write those files.
The key feature of this type of sharing is convenience. The protocols are very well supported by all major operating systems and you can mount the virtual the entire remote storage tree on your local computer and access it as a local resource. One other important feature of this type of sharing is that many clients can read and write remote files at the same time. The server takes care of keeping the data consistent all the time. However this comes at a cost because this mechanism implies locking, something that is done transparently by the server but which decreases performance most notably in our scenario: accessing a very large number of relatively small files.
This is one of the most used protocols. Chances are that you have already used it even without knowing this. It's the protocol used by default by all Windows variants (although in different flavours) but it's also accessible on any other major operating system. On Windows it's very easy to use: if you access something from your “Network Neighborhood” you're already using it. On Windows you can also mount it as a drive (D:, X: or whatever you like). You can also mount it and access it in a variety of ways on all other major operating systems. This is pretty much as easy as it gets.
The performance for sequential transfer (large files) is quite good - you can easily saturate your gigabit network card or even your old HDD. The down side is that accessing lots of small files is slow compared with other alternatives. In this scenario you might have difficulties saturating a 100 Mbps ethernet card and not because of the HDD. The performance problems come from the inherent network latencies combined with the protocol ping-pong over the network and the locks that need to be aquired and released.
The main competitor for SMB/CIFS is NFS (see below) - the competitor from the Unix world. There's a continuous battle on which has better performance. From our experience it depends on actual conditions. Over the years we got mixed results with different hardware, protocol variants and configuration options. But in our current particular tests SMB performs much lower on writing (think 30-40%) and a bit lower on reading (think around 10%) than NFS.
You may say that this is a huge performance difference but this doesn't matter much to us. We heavily use computers and our tumaBox and on a daily basis we don't see this difference as having an impact on our productivity. Throw a different bunch of files to them and the results may be very different. We don't usually copy /usr for our work.
What is more important from our point of view is the difference in accessing the files. Yes, you can mount both SMB and NFS in your local tree but with SMB you mount the remote tree in the name of some user (by supplying username and password) and then all operations you do on the remote files appear as performed by that user. You can see this a security feature or it may get you into a permission nightmare.
This is the alternative to SMB from the Unix the world. This is why with NFS you can mount the remote tree in the local tree and the local user permissions are enforced on the remote tree as well. From the local user perspective it's really like adding another directory with some extra storage. It's a very convenient way of using remote storage on Unixes and Linuxes and all their siblings but one may argue that this might have some security implications. Knowing these implications we can take the necessary precautions to make NFS a safe thing to use.
On the performance level NFS is very similar with SMB. Again single large file transfers can saturate network or HDD. Again transfering lots of small files has poor performance. In our tests it did perform better than SMB especially on the write side but again it may depend on the particular setup and scenario. The reason why we don't care much about the performance difference between SMB and NFS is because they both perform so much poorly on small files than other alternatives. So if you have to transfer lots of small files stay away from both.
All these rely on ssh, which (as the name suggests) was meant to provide secure shell access. But it actually provides much more than this: encryption for transfering files, forwarding internet traffic and so on. SFTP is actually the part that provides file transfers with an ftp like access. SCP is a very simple way to transfer files. FUSE is actuallly a piece of software to mount filesystems in userspace. What it actually means is that by writing a FUSE plugin you can make any filesystem available without modifying the kernel (or writing a driver). It has a SFTP plugin that can be used to mount SFTP remote filesystems just as you would with SMB. Last but not least rsync is a very powerful program that can transfer files over ssh. The power comes from it's flexibility and features. It is a much more powerful solution than scp or sftp.
Provided with a powerful CPU transfer over ssh is the fastest solution available. When I say that I mean not only for large file transfers but especially for small files. When using sftp transfer with on the fly compression you can even achieve transfer rates higher than the actual maximum network speed (for highly compressible data). On the fly compression is a standard feature of ssh which is absent by default in all other solutions. If you mention also that all transfers are encrypted by default this solution becomes a clear winner on the security part as well. None of the other solutions provide encryption by default and some lack much more than this on the security side.
On our particular set of tests transfering /usr data with rsync was 10 times faster than SMB or NFS. With SSHFS (FUSE with SSH plugin) we achieved only about 5 times the performance of SMB or NFS. In our tests we didn't use compression. If you transfer even smaller files the performance is expected to increase much more compared with SMB or NFS. The tremendous performance is achieved because of the protocol design but there's no point in going into details about this.
There are some caveats though. First is that if you read carefully you saw something about CPU power. All transfers over ssh are encrypted - this was the whole purpose of ssh in the first place. Encryption requires a lot of CPU power. And it gets even better: there are 2 CPUs involved for each transfer - the server one and the client one. If either one does not have enough power to sustain the transfer speed it becomes the bottleneck and it limits the entire transfer speed according to it's power. Some new CPUs try to help in this situation because they provide AES hardware acceleration (AES-NI instructions), AES being the most popular encryption block cipher. When you use such CPUs both on server and client the performance is expected to increase tremendously when using AES encryption. If at least one of the CPUs does not support this instruction set or you want to use a different encryption algorithm then your CPU might easily become the bottleneck.
One popular way to ease the pain when you don't have AES-NI on both sides is to use the Arcfour protocol aka RC4 (either arcfour, arcfour128 or arcfour256 in ssh Ciphers option - performance difference between them is marginal). All benchmarks point this one to be the fastest ssh encryption protocol without hardware acceleration. We used it ourselves in our tests (arcfour128). The downside in this case is that arcfour is currently considered insecure. This doesn't mean that any kid can instantly see your traffic but don't rely on NSA wasting too much time on decrypting your transfered files.
When you don't necessarily want the protection offered by ssh encryption (maybe you're on the local network in your bunker) you might want to drop it altogether and only use the tremendous speed. Unforturnately some people didn't consider this to be a very good idea so there's no current way in official ssh to drop encryption entirely. There are however some third party patches for this which you can use at your own risk.
When we get to the accessibility aspect there's only one way to compare SSH family to SMB or NFS: SSHFS over FUSE. This is the only way to mount your remote tree in your local tree over ssh and enjoy the performance (of powerful CPUs). All other ways (rsync, scp, sftp) are linux commands which provide an entirely different experience. They're great for transfering a bunch files fast but not for every day browsing. There are also file manager plugins and even standalone products that you could use to browse and transfer files in a quick and convenient way. But the key difference with these is that the remote files and folders are perceived locally as remote resources. This means that if for instance you click on a film in these file managers the entire film will be first copied locally (don't hold your breath - it will take quite some time) and then it will start playing. With all other systems (SMB, NFS, SSHFS etc.) When you click on that film it starts playing immediately and data is transfered in background as needed while you watch the film.
Even more than this FUSE is not available on all operating systems (although there are several attempts to implement something similar). When it comes to file manager plugins or stand alone programs you will probably find several of these for your operating system of choice but you won't get the user experience consistency you get from using something that mounts as a native resource.
SSHF does a good job but using it on daily basis seams cumbersome to us. First of all it's permission handling looks more like SMB (but with some notable differencies). You are logged in on the remote machine and perform all actions in the name of that user. Furthermore this means that is difficult to share the remote tree with some other local users: they each have to mount the remote tree in their own user space using their own credentials. When you look how it handles connection problems you soon start to look for alternative solutions. NFS is no stranger in this department either but the chances of some eventual recovery are somewhat bigger if it's your lucky day.