Zip File Parser

mountainyman · 08-06-2006 01:15PM #1

Hello,

I need to write a parser in VB 6 that will read text from a zip file. Does anyone know of any documentation that describes what the data stored in a zip file looks like?

If so thanks; if not thanks anyway.

MM

aidan_walsh · 08-06-2006 01:32PM

http://www.pkware.com/business_and_developers/developer/popups/appnote.txt

mountainyman · 08-06-2006 01:37PM

thanks very much

bonkey · 08-06-2006 01:37PM

http://www.google.com/search?hl=en&q=visual+basic+zip

Seriously.

mountainyman · 08-06-2006 02:42PM

Thanks Bonkey,
You obviously don't know what a parser is or what I am trying to do but thanks for your help.

Read<>Write

Thanks Again

bonkey · 08-06-2006 03:57PM

mountainyman wrote:

Thanks Bonkey,
You obviously don't know what a parser is or what I am trying to do but thanks for your help.

Read<>Write

Thanks Again

I missed where you said parser. D'oh. My (dumb) bad.

mountainyman · 08-06-2006 04:09PM

bonkey wrote:

I missed where you said parser. D'oh. My (dumb) bad.

Anyhoo if anyone is reading I need to page the file across and read it without unzipping.

Of course if I could unzip it that'd be too easy.

I think that this may be impossible (not is impossible, may be impossible)

After all just because you can parse a text file this does not imply that you can parse a zip file. The contents of the text file are nothing like the contents of a zip file.

Now there may be an API call that I can refer to in order to read the zip and this is my only hope.

Thanks for Listening.

Hobbes · 08-06-2006 04:46PM

mountainyman wrote:

Now there may be an API call that I can refer to in order to read the zip and this is my only hope.

Thanks for Listening.

XP supports opening zip files as file folders, so it's possible that it's accessible via an API call that way.

Although it's probably not much help, the ability to parse a zip file is built into the Java API kit.

had a quick look.. this may help you find what you need.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vbcompii.asp

mountainyman · 09-06-2006 07:54AM

Thanks for your help guys. .NET is not an option, unfortunately, but I will certainly examine the MSDN stuff.

thanks

bonkey · 09-06-2006 09:06AM

OK...having made my initial blunder, I'm getting totally confused here, and I'm not at all sure what it is you want/need to do.

Using a Windows API call (should one exist) to manipulate a ZIP file is no different to using a 3rd-party library to manipulate a zip file.

If that's what you want to do, then ok...but your description of what to do said that you wanted to write a parser to read the text in a zip. You later clarified this to say you want to read the text without unzipping.

It doesn't matter if you use PKZIP, inbuilt-windows functionality, some other 3rd-party library or a hand-rolled solution, the only way you can read something that has been compressed is to decompress it....and unless you write the hand-rolled solution, you're not writing a parser, you're using one.

You can decompress on the fly, in memory, and never write a file from the output, but you're still applying the decompression routine - you're still unzipping.
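A minimal sketch of decompressing on the fly, written in Python purely for illustration (the thread is about VB6, so the language is an assumption, and `iter_entry_chunks` and the entry name are hypothetical). It reads one entry in fixed-size pieces through the standard `zipfile` module and never writes the decompressed output to disk:

```python
import zipfile


def iter_entry_chunks(zip_path, entry_name, chunk_size=64 * 1024):
    """Yield decompressed chunks of one zip entry without extracting it to disk.

    zip_path may be a filename or a seekable file-like object.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open() returns a file-like object that inflates as it is read
        with archive.open(entry_name) as entry:
            while True:
                chunk = entry.read(chunk_size)
                if not chunk:
                    break  # end of the entry
                yield chunk
```

Only one chunk is held in memory at a time, but the decompression routine is still being applied to every byte, which is the point made above.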

If what you want to do is have the functionality to unzip, then what you want is a Visual Basic library with the Zip / Unzip functionality already built in. That's the type of stuff my original suggestion would point you towards. One example I found by quickly refining the search was:

http://www.vbaccelerator.com/home/VB/Code/Libraries/Compression/Zipping_Files/article.asp
http://www.vbaccelerator.com/home/VB/Utilities/VBPZip/Info-ZIP_Unzip_DLL_(Renamed_vbuzip10_dll).asp

If this isn't what you want, because you're not supposed to use someone else's Zip functionality - because you're supposed to write your own parser - then there is simply no way you can use a Windows API or other API call to do it, because that's exactly the same.

jc

mountainyman · 09-06-2006 03:44PM

Hi All

It is obviously me that isn't explaining things properly. I currently have a parser written in VB6 which takes a text file of a pretty weird format and processes it into a CSV.

This weird output is an ancient proprietary system of my client's and is used to model interactions of molecules.

My client now wants to change their product so that it will zip up the weirdly formatted text file, as these files are huge: over 2 GB in some cases.

The reason they want to do this is that they are modelling in a more complex way and using a visual display. So where 2 years ago they would manually generate one file, now they run batches that generate, say, ten.

The files are larger because they are now asking more complex questions.

So basically heavier usage of the system. They are using the parser more than they thought they would.

I would like to work with these larger numbers of weird files to import and reformat them.

In order to do that I would like to
A) Find the Zip file
B) read a chunk of that file
C) process that chunk
D) repeat till EOF

If this is impossible without unzipping I will be narked.

If it is impossible without unzipping can I:
A) Unzip a part of the zip file
B) Process that part
C) destroy that part
D) repeat till EOF

I hope that is clearer.

MM

aidan_walsh · 09-06-2006 03:53PM

Basically, zip files look like this:

Header
Entry
-Entry details
-Entry data <- This is where the actual compressed file is
Entry
...
EOF.

What you want to do is open the relevant entry, get at the data inside, and move on. But decompressing a 2GB file is going to take time.
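To make the "entry details" part concrete, here is a rough sketch of decoding the fixed 30-byte local file header that precedes each entry's data, following the field layout in the APPNOTE linked earlier in the thread. The outline above is a simplification: per that document, each entry starts with its own local header and a central directory listing all entries sits at the end of the archive. Python is used only for illustration, and `read_local_header` is a hypothetical helper, not part of any library:

```python
import struct

# Local file header: signature, version needed, flags, compression method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length (all little-endian, 30 bytes total).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_local_header(data, offset=0):
    """Decode one local file header from a bytes object holding zip data."""
    (signature, _version, flags, method, _mtime, _mdate,
     _crc, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if signature != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    raw_name = data[name_start:name_start + name_len]
    # Flag bit 11 marks a UTF-8 file name; otherwise assume code page 437.
    encoding = "utf-8" if flags & 0x800 else "cp437"
    return {
        "name": raw_name.decode(encoding),
        "method": method,             # 0 = stored, 8 = deflate
        "compressed_size": csize,     # may be 0 if flag bit 3 (data descriptor) is set
        "uncompressed_size": usize,
        "data_offset": name_start + name_len + extra_len,
    }
```

The sketch ignores ZIP64 extra fields, which matter for entries larger than 4 GB, so treat it as a starting point for reading the format rather than a complete parser.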

The only way I can think of off hand that doesn't require decompressing the entire file is to process buffered sections of the compressed data in memory, do as much processing as you can on those, and then move on. Obviously, though, this is really only practical as a read-only solution - without a way of mapping the data you will have to decompress the entire file if you want to edit the data.

Hobbes · 09-06-2006 03:54PM

I think what might be better to do is to convert that CSV file into some kind of binary format and parse that.

That should cause a huge drop in size. Also the earlier link I posted shows you how to zip the contents of a file within the file and not just into a zipfile.

aidan_walsh · 09-06-2006 03:57PM

Hobbes wrote:

I think what might be better to do is to convert that CSV file into some kind of binary format and parse that.

That should cause a huge drop in size. Also the earlier link I posted shows you how to zip the contents of a file within the file and not just into a zipfile.

Well, yeah, assuming the company doesn't have anything else that also acts on the files as they are. But I guess they wouldn't be zipping them in that case...

GreeBo · 09-06-2006 03:57PM

Well, I guess it depends on the structure of a zip file.
From my interpretation of zipping, part of a zip file tells you nothing.
You need to decompress the whole thing to get something sensible.

Looking at this problem from another perspective, can you modify whatever is generating these files to just create more, smaller files that are easier to work with?
Are the files/results sequential?
Can they be volumised?

tibor · 09-06-2006 03:57PM

You have a system that is constantly doing reads/writes to large text files,
which is leading to heavy load on your system, possibly impacting other things running on it?

And your solution is to compress these large files?
And read/write from the compressed large files in an attempt to reduce load?

Are you serious?

edit: if that is *REALLY* what you want, zlib FTW
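For anyone wondering what the zlib route looks like, here is a small sketch of incremental inflation using Python's `zlib` module (the language and the name `inflate_stream` are assumptions for illustration; VB6 would need a zlib DLL binding instead). Deflate-compressed zip entries hold raw deflate data with no zlib header, which is why the window-bits argument is negative:

```python
import zlib


def inflate_stream(compressed_chunks):
    """Yield decompressed bytes as raw-deflate chunks are fed in one at a time."""
    # Negative window bits select raw deflate (no zlib header or checksum),
    # which is how deflated entries are stored inside a zip archive.
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    for chunk in compressed_chunks:
        out = inflater.decompress(chunk)
        if out:
            yield out
    tail = inflater.flush()  # anything still buffered at end of stream
    if tail:
        yield tail
```

The caller is responsible for locating the entry's compressed bytes in the archive and feeding them in; the decompressor keeps only its sliding window, not the whole file.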

Hobbes · 09-06-2006 06:50PM

aidan_walsh wrote:

Well, yeah, assuming the company doesn't have anything else that also acts on the files as they are. But I guess they wouldn't be zipping them in that case...

Just create a program to turn them back into the insanely large files again. Tibor has a point though.

Gosh · 09-06-2006 07:08PM

HTH

http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx

Torak · 10-06-2006 11:44PM

mountainyman wrote:

In order to do that I would like to
A) Find the Zip file
B) read a chunk of that file
C) process that chunk
D) repeat till EOF

If this is impossible without unzipping I will be narked.

It IS impossible without unzipping.. as mr bonkey said if you want to interpret the data then it has to be unzipped.

It is however (regardless of the difficulties involved) possible to process the contents of an entry in a zip file whilst it is being decompressed, rather than storing it all in memory: each chunk can be thrown away after it has been processed.

Let us call this ability a stream. Note that the ZipInputStream in Java is not what you want, as it gives you an entry at a time. You need a similar product which gives you entries in chunks that are configurable.
potential result from google

I have no idea if the program does what you ask. If it does not, then chances are that you will need to follow the very first link, learn the decompression algorithm and implement it in a fashion that allows a visitor to access each chunk...
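As a rough illustration of the visitor idea, here is a sketch in Python (an assumption, since the original parser is VB6) that streams a text entry line by line and hands each line to a caller-supplied function, writing the result as CSV. `convert_entry_to_csv` and `parse_line` are hypothetical names, and the ASCII encoding is a guess about the proprietary format:

```python
import csv
import io
import zipfile


def convert_entry_to_csv(zip_path, entry_name, out_file, parse_line):
    """Stream one zipped text entry through parse_line and write CSV rows.

    parse_line is the visitor: it receives one line of text and returns
    a list of fields. Returns the number of rows written.
    """
    rows = 0
    writer = csv.writer(out_file)
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(entry_name) as raw:
            # Wrap the inflating byte stream so it can be iterated as text lines
            text = io.TextIOWrapper(raw, encoding="ascii", errors="replace")
            for line in text:
                writer.writerow(parse_line(line.rstrip("\r\n")))
                rows += 1
    return rows
```

Each line is discarded once its row has been written, so memory use stays flat regardless of how large the decompressed entry is.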

tibor wrote:

You have a system that is constantly doing reads/writes to large text files,
which is leading to heavy load on your system, possibly impacting other things running on it?

And your solution is to compress these large files?
And read/write from the compressed large files in an attempt to reduce load?

Are you serious?

You assume too much. It is quite possible and indeed probable all things considered, that the issue is storage of the large files rather than processing of them. The solution which is required then is to remove the storage/transport issue without overly impacting the processing time.

tibor · 11-06-2006 12:28PM

Torak wrote:

You assume too much.

Do I?

mountainyman wrote:

So basically heavier usage of the system. They are using the parser more than they thought they would.

Torak · 11-06-2006 05:17PM

tibor wrote:

Do I?

mountainyman wrote:

So basically heavier usage of the system. They are using the parser more than they thought they would.

Absolutely, -- and hence, perhaps, because the parser is used far more in far more intensive fashion, the output is larger and therefore causing storage problems..

tbh I don't want to bicker with you and I apologise that my initial response was written somewhat rudely. It is no excuse, however I was quite tired...

My thought process stands -- although at first glance it seems to be what you are saying, I believe that there is enough information in what is not said directly to determine that the real cause of the problem is not a lack of clock cycles.

I could be completely wrong..

bonkey · 12-06-2006 08:59AM

Torak wrote:

It is quite possible and indeed probable all things considered, that the issue is storage of the large files rather than processing of them. The solution which is required then is to remove the storage/transport issue without overly impacting the processing time.

If that's the case, then my immediate reaction is that buying a bigger disk array is probably cheaper than paying a programmer to write an application to use existing disk more efficiently.

As a second option, I'd use NTFS file-compression, rather than using a system like ZIP. This way, the stuff is stored in a compressed format, which can be massively successful with large text files, but can still be read from / written to as though it were a regular file.

Similarly, if the problem is having enough memory in the machine to handle loading one complete file (thus leading to the "block" approach), my immediate thought would be first to buy more memory. My second thought would be to buy disk, store the files uncompressed or NTFS-compressed, and write your code to block-access the uncompressed files.

While many people argue that throwing hardware at a problem is the wrong way to solve it, sometimes hardware *is* the problem.

jc

Torak · 12-06-2006 09:27AM

bonkey wrote:

If that's the case, then my immediate reaction is that buying a bigger disk array is probably cheaper than paying a programmer to write an application to use existing disk more efficiently.

true, as long as transportation of the large files is not also an issue.. different physical locations are a fact of life and it may be an issue here..

perhaps network bandwidth in the office is being chewed up transporting the files across the network and it is now taking 2 hours every morning just to get the application to start as everybody starts the app at the same time..

who knows really.

bonkey wrote:

As a second option, I'd use NTFS file-compression, rather than using a system like ZIP. This way, the stuff is stored in a compressed format, which can be massively successful with large text files, but can still be read from / written to as though it were a regular file.

true.. makes sense.. and more than worth investigating i'd imagine

bonkey wrote:

Similarly, if the problem is having enough memory in the machine to handle loading one complete file (thus leading to the "block" approach), my immediate thought would be first to buy more memory. My second thought would be to buy disk, store the files uncompressed or NTFS-compressed, and write your code to block-access the uncompressed files.

unless an amount of this work is performed on people's workstations, in which case this might require a full upgrade of every system on everybody's desk.. The inexpensive memory upgrade can quickly become a nightmare for everybody involved.

bonkey wrote:

While many people argue that throwing hardware at a problem is the wrong way to solve it, sometimes hardware *is* the problem.

absolutely...

My point was that I assumed that the OP had done his research at least to some degree correctly and that his post on here was either concerned with evaluating a specific solution as an option or implementation after exhausting other options.

The comment previously posted by tibor was based on the statement about "heavier usage of the system", which could result in a hell of a lot of symptoms. Simplifying it in the manner specified was an assumption, and there are many questions that would need to be asked to get to the root cause..

that's all..

respectfully,
T
