P2P Backup System w/o SPOF for Work Group

This is another of my half-finished ideas from years ago. Two recent stories brought it back to mind:

Stories 

#1: One of my teammates lost his Outlook email archive through a mistaken operation. He is very upset because all of his email from the past two years is gone, unrecoverably. The suggested size for an Outlook mail file is 2GB, but people can easily push it up to 10GB and often notice too late. As he put it in his MSN signature, “my archive, 555…”

#2: While browsing the blog of one of my favorite bloggers, Brad Abrams, I ran into a shocking post: Help! My hard disk crashed. He can afford a new disk, but he needs the data, especially “the pictures of the kids recent birthday party”. In the comments, people suggested various ways to recover it; unfortunately, it looks like he had no luck.

This leads to the most basic question: why don’t people back up their data even when they know they risk losing it? Probably because:

1) The chance of losing data is very small

2) Current backup solutions are not “good enough”

For 1), that may be true, but remember that the result can be catastrophic. Let us look at what is going on with 2).

Existing Backup Solutions

Traditional C/S – Most current backup solutions are based on a traditional client/server (C/S) architecture. Assume each person uses 40% of a 500GB disk drive on average, i.e. 200GB; for an enterprise with thousands of staff that still adds up to about 200TB per thousand people. Backup time is another concern. Let me illustrate with some math. For a backup server with two 1-Gbit NICs, the maximum throughput is about 200MB/s. If the backup window (while people are sleeping) is 10 hours, the server can back up about 7.2TB (2*100*3600*10 MB) of data, which covers only about 36 machines at 200GB each. Obviously this hardly scales as the business expands.
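The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes full (not incremental) backups, 1TB = 1,000,000MB, and the roughly 100MB/s per NIC used in the text. The function names are made up for illustration.

```python
def nightly_capacity_mb(nics: int, mb_per_sec_per_nic: int, window_hours: int) -> int:
    """MB that one backup server can receive during one backup window."""
    return nics * mb_per_sec_per_nic * 3600 * window_hours


def machines_covered(capacity_mb: int, disk_gb: int, used_percent: int) -> int:
    """How many machines can be fully backed up with that capacity."""
    per_machine_mb = disk_gb * 1000 * used_percent // 100
    return capacity_mb // per_machine_mb


capacity = nightly_capacity_mb(nics=2, mb_per_sec_per_nic=100, window_hours=10)
print(capacity / 1_000_000, "TB per night")                        # 7.2 TB per night
print(machines_covered(capacity, disk_gb=500, used_percent=40),
      "machines per night")                                        # 36 machines per night
```

Incremental backup would shrink the nightly volume a lot, but the first full backup still has to go through the same two NICs.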

Online backup – People can turn to AWS S3 or Azure SQL Data Services to simplify things on their own side. But this does not make things any better given current network speeds. Computing on data is much easier than moving it around.
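To get a feeling for the network cost, here is a rough estimate of how long one person's 200GB (40% of a 500GB disk, as assumed earlier) would take to upload. The 10 Mbit/s uplink is purely a hypothetical figure picked for illustration, and protocol overhead is ignored.

```python
def upload_seconds(size_gb: int, uplink_mbit_per_sec: int) -> int:
    """Seconds needed to push size_gb through the uplink (1GB = 1000MB, 1 byte = 8 bits)."""
    size_mbit = size_gb * 1000 * 8
    return size_mbit // uplink_mbit_per_sec


seconds = upload_seconds(size_gb=200, uplink_mbit_per_sec=10)  # hypothetical uplink speed
print(seconds, "seconds, about", round(seconds / 3600, 1), "hours")
```

Under these assumptions a single machine's first upload takes well over a day, and that is before anybody else in the office shares the same link.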

My P2P Backup System

P2P backup comes with several inherent advantages:

1) No additional hardware expense. People often have at least 30% capacity left; we can leverage that for backup space.

2) No obvious network bottleneck.

Some people might also list “no single point of failure”, but that is not true. Let us look at how BitTorrent (BT), a typical P2P network, works.

Basically, here is what happens in a P2P network: peers need to talk to a central coordination server (i.e., the Tracker in a BT system) to get information about other peers, and then talk to individual peers to exchange the real data. Both steps are essential. A single point of failure (SPOF) can occur at the coordination server. You may argue that you can enhance its reliability with techniques like mirroring, but doesn’t that look too heavyweight? We already have many reliable services run by full-time IT staff, so why bother re-inventing the wheel? Several services that allow custom data to be written can be used, such as AD, SharePoint, and Exchange. The data that needs to be written to a central place would be:

1) Peer list by <IP, port>

2) Other metadata, such as the backup network name and software RAID level
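The two items above are small enough to fit in one record. Below is a minimal sketch of what that record could look like, serialized to JSON so it can be stored as plain text in whichever existing service is chosen. The class and field names (Peer, BackupNetwork, raid_level) are my own placeholders, and the actual write to AD, SharePoint or Exchange is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class Peer:
    ip: str
    port: int


@dataclass
class BackupNetwork:
    name: str                 # backup network name
    raid_level: int           # software RAID level used across peers
    peers: List[Peer] = field(default_factory=list)

    def add_peer(self, ip: str, port: int) -> None:
        """Register a peer by <IP, port>; duplicates are ignored."""
        peer = Peer(ip, port)
        if peer not in self.peers:
            self.peers.append(peer)

    def to_json(self) -> str:
        """Text form to be written into the central place."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "BackupNetwork":
        raw = json.loads(text)
        peers = [Peer(**p) for p in raw["peers"]]
        return cls(raw["name"], raw["raid_level"], peers)


net = BackupNetwork(name="team-a", raid_level=5)   # example values only
net.add_peer("10.0.0.2", 7000)
net.add_peer("10.0.0.3", 7000)
print(net.to_json())
```

A peer that joins would read this record, add itself, and write it back; how to handle two peers writing at the same time depends on the service used and is not covered here.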

Of course, the solution also needs compression, encryption, incremental backup, and scheduled backup. I may investigate further later and update the post.

Also see other backup solutions

https://www.storegrid.com/index.html - select a folder, then back it up; traditional

www.streamload.com - Large net disk, Upload your files to streamload, then view it, or email it to someone else.

https://www.beinsync.com/ - Access your computer: installs a mini PHP web server on it.

https://base.google.com/base/about.html - In about 15 minutes, your item will have a unique web address and be visible to the world.

https://www.openomy.com/ - openomy is an online file storage system designed to be a platform for Web 2.0 applications, built by Ian Sefferman