Compression through lookups
This topic contains 7 replies, has 2 voices, and was last updated by miguel89 4 months, 2 weeks ago.
January 11, 2016 at 8:09 pm #5978
Jeremy Swartwood (Kaltkinrikl)
I don’t know enough about your algorithm, it’s abilities, and the length of a book references.
Could something like this be used to create a compression format?
Existing systems use lookups to compress commonly used words/phrases in a file.
1. Take 3200 characters of a file.
2. Find the shelf/book/page/etc that it is on.
3. Create a new file with the reference data separated with some character.
1. read file and find real 3200 character output
2. create file with real characters
But, if your shelf/book/page is more than 3200 characters, there would be no compression.January 11, 2016 at 9:43 pm #5979
The reference data has much mor characters than 3200. So this not a way to compress data but to inflate.January 12, 2016 at 1:42 am #5982
Yeah, it definitely doesn’t perform any compression. But – if you were to use something like bit.ly to save the URLs…it wouldn’t be performing any compression, but someone else would be doing your storage for you.March 26, 2017 at 8:53 am #14158
Gertjan van den Broek
I had a very similar idea to Jeremy, and I’ve been thinking on it for a very long time now.
I agree it’s not really compression, it’s more like storing all your information in a generated sequence, where all you need is the adress of the first and last character in that sequence.
It might inflate your data when working with 3200 characters.
But if you ramp up the sequence to say something of the size of a few gigatbytes (10^9)
Then wouldn’t adresses start becoming smaller than the data they respresent?
I’d love to give this thing a shot, but sadly I don’t know what kind of algorythms I’d need to use to reverse find a gigabyte worth of data in an infinite pool of data.March 27, 2017 at 8:30 am #14189
Your address only links to the content of one single page, which has 3200 characters, which means that if you want to link mulitple gigabytes, it will require lots of different pages, so lots of different addresses which are all above 3200 characters in length.
If you, on the other hand, had a version of the library, where every page has gigabyte lengths, you would have way more pages than before (Every single character in length you add to the pages, will multiply the amount of possible pages by the amount of characters you have. That means the amount of pages will higher and higher, and thus you will need longer and longer addresses to map all of those pages.
You will find that the shortest possible address format that uniquely identifies a page of 3200 bytes among all others, is 3200 bytes in length. If the page size is 10 gigabyte, your shortest possible address format will be… 10 gigabyte. Do you see the problem here?April 3, 2017 at 7:12 am #14549
It’s such an intriguing idea though. If an algorithm existed that could shuffle the 29^3200 values so that it’s 100% effective and the size of the resulting set is exactly 29^3200. For any given string there would be a chance that it can be compressed by up to 99.9%. But that probably breaks logic or something. Of course on the other hand shorter messages would occasionally get inflated.April 3, 2017 at 7:41 am #14552
Well, there is certainly some strings that can be “compressed” like this in the library, for example 0-w1-s1-v01:1 is way shorter than the string it represents. This is essentially because leading zeros are emitted. So, to get a sort-of compression going, you’d need to generate a special library of babel like formula, that would have this string early on.
This may be possible, but is unplausible, because the formula would probably be bigger than the string in the first place, which is a dealbreaker since you’d have to store/send the formula along the “compressed” data, and generating a special formula for every string you want to compress sounds like a lot of work. Conventional compressing algorithms are faster, more reliable, and, presumably, more effective.April 5, 2017 at 5:34 pm #14607
>So, to get a sort-of compression going, you’d need to generate a special library of babel like formula, that would have this string early on.
Like you say, that would defeat the whole purpose. The idea is that the magic formula doesn’t have to change. And then when your string is 3200 characters long, there’s still always a 50% chance that the key is shorter than 1600 characters.
Probably impossible. Still fun to think about. Of course most things can already be compressed greatly with existing algorithms, because data has patterns.