Reproducible Computing for Large & Complex Datasets
Introduction
This website facilitates the dissemination of statistical computations that are based on large and/or complex datasets (LCD). This website, www.FreeStatistics.net, has the same goals as its sister website, www.FreeStatistics.org, but uses a fundamentally different system design: the technical problems posed by large & complex data structures are solved with distributed storage and computing rather than a central repository.
The following features are currently available:
The scope of this project is related to the following scientific disciplines:
- Neuroimaging (EEG, fMRI, MEG, etc.)
- Finance (high frequency time series)
- Marketing (scanner data)
- Political sciences (international databases)
- etc.
Getting Started
To participate in this project, you should meet a number of organizational and technical requirements. Please read these requirements carefully before you start. Questions and suggestions can be sent to resa at pandora dot be.
The organizational requirements are as follows:
- The use of FreeStatistics.net services is free of charge, but this does not mean that there are no costs involved: we use a peer-to-peer protocol to share large amounts of data. You are therefore expected to participate in the network and contribute (at least) a reasonable amount of your upstream bandwidth and local computer storage. Of course, your organization (university) can also contribute to our project by hosting one or several Seedboxes.
Let us consider a first example. Suppose you are only interested in obtaining data that is made available through our network. In this case you are free to download a Torrent and associated data. However, after you have obtained the data you should leave the BitTorrent client open for a reasonable amount of time so that your machine can serve as a seeder.
Here is a second example. Suppose your research team wishes to collaborate with other teams and share large amounts of data for multiple studies, possibly spanning several years. In this case it is necessary to contribute to the peer-to-peer network in a structural manner (not just through occasional connections of PCs at your labs) by hosting one (or several) storage servers which act as Seedboxes. A Seedbox is very easy to set up and maintain in Linux. We will provide detailed information about the requirements of Seedboxes on this website (coming soon). The Installation & Configuration instructions are available here.
- The use of FreeStatistics.net services is limited to sharing data which you are legally allowed to share. The administrator has the right to ban end-users who violate legal property rights.
The technical & system requirements are as follows:
- If you wish to share scientific data through our BitTorrent Tracker, you must arrange all the data files and accompanying documents in a dedicated folder. Everything in this folder will be part of the Torrent, including subfolders. Make sure that this folder and its contents do not change during the period that the data is shared. If you need to make changes or corrections, you have to replace the old Torrent and restart the entire upload process.
- You need to download and install BitTorrent client software on your machine. For Linux, Unix, and Mac users we recommend Transmission, which is available free of charge at www.transmissionbt.com (but there are many other good clients available: see overview). Windows users might want to have a look at uTorrent, which received a high rating from CNET (see webpage).
- You need sufficient bandwidth to upload/download the data. Some Internet Service Providers limit the amount of data that can be transmitted during a given period (X GB per month). If you do not have enough bandwidth to upload/download the entire data folder in a (relatively) short period of time, you may want to configure your BitTorrent client so that the transmission speed is limited.
- Some Internet Service Providers (and academic institutions) restrict BitTorrent traffic for a variety of reasons. It is always a good idea to ask your local administrator about this. Sometimes it is necessary to configure the firewall (or other system software) to allow the transmission of BitTorrent traffic. Again, this may require help from qualified experts or administrators. If there are security concerns, it may be possible to allow only traffic which is tracked by our dedicated Tracker (while disallowing all other traffic): the so-called announce URL for our system is http://www.freestatistics.net/tracker.php/announce
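Two of the requirements above, keeping the dedicated folder unchanged while it is shared and having enough bandwidth, can be checked with a small script before you create a Torrent. The following is only a minimal sketch in Python and is not part of the FreeStatistics.net tooling: the function names (folder_manifest, folder_size_bytes, transfer_hours) are hypothetical, the use of SHA-256 checksums is our own assumption, and the time estimate assumes a steady transfer rate, which real connections rarely deliver.

```python
import hashlib
from pathlib import Path


def folder_manifest(folder):
    """Return {relative_path: sha256_hex} for every file under folder.

    Hypothetical helper: comparing two manifests taken at different
    times shows whether the shared folder has changed.
    """
    root = Path(folder)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large data files fit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def folder_size_bytes(folder):
    """Total size in bytes of all files under folder, subfolders included."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def transfer_hours(size_bytes, rate_kib_per_s):
    """Hours needed to move size_bytes at a steady rate given in KiB/s.

    Assumption: the rate is constant; treat the result as a rough lower bound.
    """
    return size_bytes / (rate_kib_per_s * 1024) / 3600
```

A possible way to use it: call folder_manifest on the dedicated folder when you create the Torrent, store the result, and call it again later; if the two dictionaries differ, the folder has changed and the Torrent needs to be replaced. Passing the result of folder_size_bytes to transfer_hours together with the speed limit you plan to set in your BitTorrent client gives a rough idea of how long one full upload or download would take.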
How to?
Here you will find some illustrated tutorials about our peer-to-peer network: