Database of PDF files

Ok, so I'm brand new to Access 2007 (though working my way through an online learning package from work), and thought I would check whether what I want to do is POSSIBLE before spending days/weeks trying to achieve it.

I'm just looking for a 'That sounds possible', maybe with 'but really hard for a noob' or 'and this link might help', not the required code - I'm keen to learn a new thing so want to have SOME understanding of what I'm making...

Goal: To make finding the required files easier than looking in two reference books and finding the right CD.

Proposed solution: Copy PDFs from CDs - around 4,000 per disc, 6 discs. Create database linking to each file. Over time populate database with relevant info to make it more easily searchable.

I realise this is a fairly big job, but I have time at the moment and there is no rush. This is an improvement, not essential work. I have already potentially found code to do the imports (the 'List files to table' solution from http://allenbrowne.com/ser-59alt.html). Just want to know from people who know this stuff if this is a ridiculous idea, or if it's got legs!

Thanks for any pointers!
Mar 6 '13 #1

Seth Schrock
2,965 Expert 2GB
This isn't a ridiculous idea. In fact, this wouldn't have to be very technically difficult. However, there is always room for it to grow to do many things.

Since you are new to Access/relational databases and you are coming from Excel, which is a flat database, here is a link that you should really look into. By following these "rules", you will save yourself many headaches in the future.
Database Normalization and Table Structures
You can also do a web search about database normalization. I have found that sometimes different perspectives can help me understand the issue better.
Mar 6 '13 #2
If you try to get into Access and VBA programming, that's a nice thing to work on. You'll find a lot of examples on the internet which will help you to make it work.
Otherwise I would have recommended a desktop search tool :-)

The question I have for you is: Are you just trying to solve a concrete problem, or do you want to use your concrete problem to learn the concepts of software development?
Mar 6 '13 #3
mshmyob
904 Expert 512MB
It's never a ridiculous idea if you need it or it will help you accomplish something more efficiently.

I don't see it being very difficult, so I believe it would be a nice little app to start learning with.

Good luck and have fun.

cheers,
Mar 7 '13 #4
zmbd
5,501 Expert Mod 4TB
Your idea absolutely has merit... in fact, there are already off-the-shelf solutions on the market that do just what you describe. I would suggest that you take a look at the current electronic document management systems. Some are available for little to no cost for personal use and then they increase in price quite quickly.

If none of these off-the-shelf solutions work for you, or are cost prohibitive, at least you will have a starting point as to the table structure and user interface. Just a heads up though... you need a really firm understanding of database design for such a huge undertaking.

Another option, should you have it, is Office OneNote and the "generic" competitors to OneNote. These may already have the functionality that you're after without having to invest a lot of time in the VBA/SQL universe.
Mar 7 '13 #5
Thanks for the comments!

I'm doing this for work, so I doubt I'll be able to use any of the free options. I don't see Office OneNote anywhere on this machine, so guess that's no good.

I'm kinda caught between finding the lazy option and learning a load of new stuff! I think I should talk to my manager, there could be different software the company would prefer to use if a database is gonna be built (it's not the brief, but looking at the task it seems ridiculous to continue hunting for the right CD for various things).

Anyway, thanks again for the pointers, I'll have a think and a chat with the boss and see what to do, I may be back with lots of questions at some point!

Cheers :)
Mar 7 '13 #6
ADezii
8,834 Expert 8TB
@neobrainless
The most logical approach as I see it is to:
  1. Insert each CD into the Drive.
  2. Read all *.pdfs in each CD and populate a Table with at least the Name of each *.pdf (including the Absolute PATH to it) along with a Unique Identifier for each CD.
  3. Every *.pdf, along with the ID of the CD to which it belongs, is now contained within a single Table.
  4. From this point the Database can easily be expanded but the hard work is done. Better yet, this process is fully automated.
P.S. - This was your initial approach to the problem via referencing Allen Browne's Code. Should you decide to go this route, I'm sure that we can assist you.
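For illustration only, steps 2 and 3 above can be sketched in Python with the standard-library sqlite3 module instead of Access VBA. This is not Allen Browne's code; the table name pdf_files, its columns, and the function name index_disc are all assumptions made up for this sketch.

```python
import os
import sqlite3


def index_disc(conn, disc_id, root):
    """Record every PDF found under root, tagged with the ID of its disc."""
    # One table holds every file from every disc (hypothetical layout).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_files ("
        " id INTEGER PRIMARY KEY,"
        " disc_id TEXT NOT NULL,"
        " folder TEXT NOT NULL,"
        " file_name TEXT NOT NULL)"
    )
    rows = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                rows.append((disc_id, os.path.abspath(folder), name))
    conn.executemany(
        "INSERT INTO pdf_files (disc_id, folder, file_name) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

Calling index_disc once per disc, with a different disc_id each time, would leave all files from all six discs in the one table.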
Mar 7 '13 #7
Well, after a busy spell with the work that I'm actually supposed to be doing, I'm finally having a look at this again.

The first issue I want to solve (there's loads I need to do, barely started, but this is step one for me to sort), I know exactly what I want to do - just can't get my head around the change between VBA in Excel and Access...

The import code I used is almost perfect, except I want to have hyperlinks that open each file individually. At the moment the code gives me the file names in one column and the FOLDER path in another.
Either I want to adjust the import code I have from Allen Browne, so that I have just one column with the FULL path to each file, which I can make a hyperlink, or do a slightly bodgy after-fix copy/pasting the file names onto the end of the folder path.

Sadly my Access VBA skills are woefully lacking - I can't even RUN the first attempt code for the bodgy version, and can't work out which bit of the import code to change - I'm sure it should be fairly simple, but I'm lost!

If anyone can point me right for this I would be very grateful! If possible, at least an attempt to explain the code would be great, cause that way I can learn!

Cheers, Rohan
May 15 '13 #8
Oralloy
988 Expert 512MB
Rohan,

I realize that I am a bit late in responding to you here, but it seems that you are spinning your wheels here.

zmbd likely had the right of it in his post of March 7 - you are essentially creating a Document Management System (DMS) using Microsoft Access (Access).

As you are aware, Microsoft Access is licensed software, so you will have to have a legal copy on all systems that your DMS application will be run on.

Since you haven't told us about your meta-requirements (# users, licensing, etc) and available resources, it is difficult to make a reasonable recommendation.

My experience is that the cost of re-implementing an established technology (in your case, a DMS) is very high, relative to implementing an available open-source or commercial package.

Sit down and try to figure out what your real requirements are, before starting any serious coding effort on this project. Also, what will your users' expectations of the application's behaviour and polish be? Access 2007 is a high-quality application implementation platform, but it does have its limits and costs.

Gather the honest requirements, then spend your efforts wisely.

Regards,
Oralloy
May 15 '13 #9
mshmyob
904 Expert 512MB
If he uses the runtime then he does not need to concern himself with licensing issues. The Runtime starting with Access 2007 is free and has no licensing restrictions.

You still need at least one legit copy to develop your app with.

cheers,
May 15 '13 #10
Oralloy
988 Expert 512MB
mshmyob,

That is good to know. Thank you.

There is a huge cost difference between one copy total and one copy for every system.

Still, the software cost is nothing compared to the cost of building a new application from the ground up.

We are both in the dark, however, without a clear and comprehensive understanding of the project requirements.

Regards,
Oralloy
May 15 '13 #11
Hi, thanks for the responses, the number of users is going to be maybe 10, but everyone will have easy access to a PC with Access on as it's only specialist PCs here that DON'T have the full version of Office on - so the cost is just my time. And as this project is not too time limited (I am making an improvement to a working process - currently it takes AGES to find any of these archived files, I'm trying to speed that up) I am working on it in periods of downtime - so as far as my manager is concerned my time isn't a cost as I'm filling time that would otherwise be waiting for work. Also, the cost benefit of saved time once I'm finished should cover it, along with the fact I'm still in training and as such my time isn't worth as much as the others at the moment.

As for time it'll take, it shouldn't be that much, should it? The lengthiest process is going to be going through the files and adding the relevant data from the PDF to the database (or whatever programme is used) so they can be searched - and that will be required regardless of method used as the files are definitely NOT machine readable. Also, the requirements SHOULD be fairly simple - just a searchable list of relevant tags with a link to the appropriate file. From what I've learned about Access this should be fairly simple once I've got the data imported and the links sorted out...?
May 16 '13 #12
Seth Schrock
2,965 Expert 2GB
I don't want to get this thread off the original topic (possibly a new thread asking how to open files from inside the database would be a good idea), but if you already have the file name and the folder path in the database (separated is fine), then it is very simple to use that information to open the files even though it isn't a hyperlink. All you would have to do is concatenate the two fields together and then you have the full file path. You would probably need to test if the right-most character in the folder path was a back slash (\) before you concatenate them together, but that isn't hard and then you can use an API to open the file.
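The concatenation with the trailing back slash test might look like the following minimal sketch. It is written in Python rather than Access VBA, and the function name full_path and the example paths are hypothetical.

```python
def full_path(folder, file_name):
    """Join a Windows folder path and a file name with exactly one back slash."""
    # Add the separator only when the folder does not already end with one.
    if not folder.endswith("\\"):
        folder += "\\"
    return folder + file_name


# Both forms of the folder path give the same full path.
print(full_path("C:\\Archive\\CD1", "scan001.pdf"))
print(full_path("C:\\Archive\\CD1\\", "scan001.pdf"))
```

On Windows, the resulting string could then be handed to something like Python's os.startfile to open the file in its default viewer; that call is Windows-only and is not exercised here.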

So, in answer to your question about how much time it would take to create the database, the answer is it depends on how fancy you make it. A very simple version might take less than a day, or you could make it really nice and have it take much longer.
May 16 '13 #13
Ok, I'll remember not to hijack my own threads in future, Seth!

Thanks for the tip - I did manage to sort it by copypasta to Excel and a macro there, but the concatenate idea is a much more satisfying option - I'll try and get that working! (Especially with another 5+ loads of 2,000+ files to import once I've got it all up and running!)

Cheers, Rohan
May 16 '13 #14
Oralloy
988 Expert 512MB
Gentlemen,

I have to thank you all for an interesting thread.

[Seth] I believe that you can just force a backslash between the path and the file name, regardless, as doubled backslashes are gracefully handled in the windoze file-system. I tested the following example under windoze-7:
    notepad C:\\Users\\Oralloy\\Desktop\\desktop.ini
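If the doubled separators are unwanted in stored paths, they can also be collapsed before saving. A small sketch, assuming Python's standard-library ntpath module (which applies Windows path rules on any platform); it shows string normalisation only and does not test what the Windows file-system itself accepts.

```python
import ntpath  # Windows path rules, importable on any platform

# The same path as in the notepad test above, doubled separators included.
doubled = r"C:\\Users\\Oralloy\\Desktop\\desktop.ini"

# normpath collapses each run of separators into a single one.
print(ntpath.normpath(doubled))
```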
[neobrainless] Acrobat files are, in fact, machine readable, but you probably aren't up to the task. What do you think the Adobe Reader does with one? You can look up the file format specification, which (when I dealt with it) was over 1000 pages long.

[neobrainless] I think that you can put together a reasonable prototype within a man-week's effort. This includes:
  1. Database architecture - straightforward, but non-trivial.
    • Your table structure should be straightforward.
    • A good architecture is non-trivial.
    • Normal form is your friend, trust me on this.
    • A well thought out schema simplifies coding; should you find yourself writing lots of code to glue tables together, you will want to re-examine your data model.
  2. Data security model
    • This is a non-trivial topic for any successful application.
    • One goal is to prevent inadvertent data destruction.
    • Another goal (probably not in your case) is to limit data visibility.
  3. Likely you will require forms for
    • Document Search - find documents matching a list of words.
    • Document Display - once the search is done, here is the list of documents.
    • Data Entry - Enter document, title, and keywords - provides error checking - hides the database structure (schema)
  4. Structured (pretty) reports
  5. Bulk data entry method(s)
    • You cannot reasonably enter thousands of documents manually.
    • Provides repeatability, when revised keyword extraction methods are identified.
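
To make items 1 and 3 a little more concrete, here is one possible normalised layout with a keyword search, sketched in Python with the standard-library sqlite3 module rather than Access. Every table, column, and function name below is an assumption made for illustration, not a recommended final design.

```python
import sqlite3

# Hypothetical normalised schema: documents and keywords are stored once
# each, and a link table records which keywords apply to which document.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    file_path TEXT NOT NULL UNIQUE
);
CREATE TABLE keywords (
    id INTEGER PRIMARY KEY,
    word TEXT NOT NULL UNIQUE
);
CREATE TABLE document_keywords (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    keyword_id INTEGER NOT NULL REFERENCES keywords(id),
    PRIMARY KEY (document_id, keyword_id)
);
"""


def add_document(conn, title, file_path, words):
    """Data entry: store one document and link it to its keywords."""
    doc_id = conn.execute(
        "INSERT INTO documents (title, file_path) VALUES (?, ?)",
        (title, file_path),
    ).lastrowid
    for word in {w.lower() for w in words}:
        # Reuse an existing keyword row if the word is already known.
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO document_keywords (document_id, keyword_id)"
            " SELECT ?, id FROM keywords WHERE word = ?",
            (doc_id, word),
        )
    return doc_id


def search(conn, words):
    """Document search: (title, file_path) of documents tagged with every word."""
    wanted = sorted({w.lower() for w in words})
    if not wanted:
        return []
    marks = ", ".join("?" for _ in wanted)
    sql = (
        "SELECT d.title, d.file_path FROM documents d"
        " JOIN document_keywords dk ON dk.document_id = d.id"
        " JOIN keywords k ON k.id = dk.keyword_id"
        f" WHERE k.word IN ({marks})"
        " GROUP BY d.id HAVING COUNT(DISTINCT k.id) = ?"
        " ORDER BY d.title"
    )
    return conn.execute(sql, (*wanted, len(wanted))).fetchall()
```

With this layout the search needs no glue code beyond a single query, which is the point of the normal-form advice above.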

And, finally, a story from my sordid history:
Once upon a time, when I was between jobs and reading craigslist, I came upon an advertisement for someone to write a web page using Python for $200.

Being curious, and wanting to learn this wonderful new language, I decided that I would apply for the work. At the outset I realized that I was going to be learning on the customer's project, and thus I was willing to do the work at a rate that was essentially gratis.

This web page was supposed to connect to a SQL database and display the result of a query.

Oralloy sends his resume to Mr. Client and is asked to come over to discuss the details of the project.

The next morning, we meet at Mr. Client's place to discuss what needs to be done.

First off, what is the page supposed to do? Fairly straightforward - a call center is supposed to be automatically informed that a customer's payment has cleared, so that they can call the customer back and complete the details of the work paid for. Ok, simple enough - database query and periodic "meta refresh" of a page.

Do you have the database information?

No, but the hosting site does. You will have to figure out how to connect to the database.

Ok, no problem, do you have the details of the schema?

You have to figure that out.

What do you mean?

You have to set up the database so that you can get the correct information from the payment page to the call center.

Ok - who do I talk to about the payment page and how those data are processed?

You have to write that.

So I actually have to build two pages - the payment page of your Internet site and the call-center page?

Right.

--snip--

If you do a good job, you will be a prime candidate for site maintenance.

--snip--

Thank you - this task is pretty big. Just to be sure we don't miss anything, can you work up a complete list of requirements tonight, so we can look at storyboards in the morning?

Of course. I'll e-Mail it tonight.

Thank you. Have a lovely day.

Good bye.

--time passes--

Around midnight I receive an e-Mail providing a list of the web-page requirements. (You'd have thought my polite walk-out would have given Mr. Client a clue.) There were seventeen requirements on this list describing what amounted to a comprehensive application, including three levels of access security and financial transactions.

My reply was something on the order of "At this time, I am unable to support the required level of effort to complete your web-page in a timely fashion."
The moral of the story being that CLEAR REQUIREMENTS are a necessary part of understanding and building a successful project.

The sad part is that Mr. Client didn't seem to have a clue about what was really needed when he posted his request for help. When I walked into his office, the man honestly seemed to think that the level of effort required was about two days from an entry level programmer. And the fact that he presented that list of seventeen (SEVENTEEN!) non-trivial requirements after what was (apparently) hours of hard work on his part, was shocking and bold faced.

So the other moral is simply - don't get caught up in a project where you cannot succeed.

Cheers and Good Luck,
Oralloy

p.s. If you need to discuss PDF file data extraction, let's do that on a different thread.
May 16 '13 #15
Oralloy: Thanks for the well thought out response!

First point is: I said the files aren't machine readable because they are scans of handwritten documents - some of which I have struggled to read correctly - rather than because I didn't think the file type was useable.

From what I have worked out so far, your list of what is needed seems about right. I just need to get on with it. A week's work to set it up is probably a bit hopeful for where I'm at understanding Access, but time isn't critical at this point.

Thanks again for the pointers - and the words of warning! I have a tendency to run in a bit headlong so they're a good reminder to cool my heels and think it through a bit! The comment about succeeding is exactly the reason for my original question - but it sounds like it's do-able, might just take a while - which is fine!

Cheers, Rohan
May 16 '13 #16
Oralloy
988 Expert 512MB
Rohan,

I'm worried that hand-indexing 4000+ documents is a task which no one will want to do or even fund. It would be a shame to build a lovely tool and then find your employers don't want to use it.

I'm not sure of the quality of available, open-source OCR (optical character recognition) software, but it might be something to consider. After all, indexing large numbers of written documents is a problem that many, many small and mid-sized companies have to deal with. Heck, you might even find a reasonably affordable commercial system that provides good quality recognition.

My personal experience with OCR is from 25 years ago, and the quality of the technology at that time was pretty frightening. Still, for cost reasons, to index thousands of documents we needed to use automated methods, realizing that there would be mistakes and problems.

Nowadays, the capability of OCR is fantastic. For example, I've seen bank ATMs correctly read checks that were in handwriting worse than my own.

Regardless of how things turn out, you will have learned how to build applications in Access, and that is a valuable accomplishment in itself.

Have Fun!
Oralloy
May 20 '13 #17
