Thursday, 14 December 2017

Comparing PDF documents with Python

I want to write a program in Python, and since I am a beginner I would like to check first whether it is feasible.

Background information:

There is a report (.docx) that contains many summaries (text and numbers) of scientific papers (.pdf). I always have to check whether the numbers in each summary in the report match the numbers in the corresponding scientific paper. Since I have to do this a lot, I want to try to write a program in Python.

The program should do the following:

  1. Count all scientific articles (.pdf, .docx and .txt files, excluding folders) collected in a folder and listed alphabetically with names like "author_year.pdf".

  2. Depending on the number of files, generate a set of random, non-repeating numbers (the largest possible number is the file count):

     - 0 files: 0 numbers
     - 1 file: 1 number
     - fewer than 9 files: 2 numbers
     - fewer than 15 files: 3 numbers
     - fewer than 26 files: 5 numbers
     - fewer than 51 files: 8 numbers
     - fewer than 91 files: 13 numbers
     - fewer than 151 files: 20 numbers
     - fewer than 281 files: 32 numbers

  3. Pick the articles at the positions given by the generated numbers and list their authors. For example, if there are 67 articles in the folder, 13 random non-repeating numbers are generated, e.g. 2, 4, 13, 28, 34, 37, 45, 58, 61, 62, 64, 66, 67. Here 2 corresponds to the second article, 4 to the fourth, 13 to the thirteenth, and so on.

  4. Find each selected author's paragraph in the report (each author's summary is written in one paragraph) and compare the numbers in that paragraph with the numbers in the author's scientific article. For simplicity, all numbers should be checked, and the program should display any discrepancies.
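The counting and selection steps above need only the standard library. Here is a minimal sketch, assuming the file names really follow the "author_year" pattern. The function names (`list_articles`, `sample_size`, `pick_articles`, `author_of`) are made up for illustration, and the list does not say what should happen with 281 or more files, so the sketch simply raises an error in that case.

```python
import random
from pathlib import Path

# (exclusive upper bound on the file count, how many numbers to draw),
# copied from the thresholds in the list above.
SAMPLE_SIZES = [(1, 0), (2, 1), (9, 2), (15, 3), (26, 5),
                (51, 8), (91, 13), (151, 20), (281, 32)]
EXTENSIONS = {".pdf", ".docx", ".txt"}


def list_articles(folder):
    """Return the article files in `folder`, sorted by name; subfolders are skipped."""
    files = (p for p in Path(folder).iterdir()
             if p.is_file() and p.suffix.lower() in EXTENSIONS)
    return sorted(files, key=lambda p: p.name.lower())


def sample_size(n_files):
    """Look up how many random numbers to draw for `n_files` articles."""
    for upper, size in SAMPLE_SIZES:
        if n_files < upper:
            return size
    # Assumption: the case of 281 or more files is not specified, so fail loudly.
    raise ValueError("no sample size defined for 281 or more files")


def pick_articles(files, rng=random):
    """Draw distinct 1-based positions and return (position, file) pairs."""
    k = sample_size(len(files))
    # random.sample never returns the same element twice.
    positions = sorted(rng.sample(range(1, len(files) + 1), k))
    return [(pos, files[pos - 1]) for pos in positions]


def author_of(path):
    """Assumption: everything before the last underscore is the author."""
    return path.stem.rsplit("_", 1)[0]
```

With 67 files in a folder, `pick_articles(list_articles("articles"))` would return 13 pairs, and `author_of` applied to each file gives the names to look up in the report. The folder name "articles" is only a placeholder.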

My question: Is this idea feasible and realizable using Python?
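To make the comparison step concrete, here is a rough sketch of the number check. It assumes the report paragraph and the article text are already available as plain strings. Getting text out of .docx and .pdf files needs third-party libraries and is not shown. The names `extract_numbers` and `find_discrepancies` are hypothetical, and the pattern only recognises plain integers and decimals written with a point.

```python
import re

# Assumption: numbers look like "42" or "3.5"; thousands separators,
# decimal commas and signs are not handled.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of number strings found in `text`."""
    return set(NUMBER.findall(text))


def find_discrepancies(paragraph, article_text):
    """Return numbers that appear in the report paragraph but not in the article."""
    missing = extract_numbers(paragraph) - extract_numbers(article_text)
    return sorted(missing, key=float)


paragraph = "Smith (2017) reports 42 patients and a mean of 3.5."
article = "In 2017 we enrolled 42 patients; the mean was 3.6."
print(find_discrepancies(paragraph, article))  # prints ['3.5']
```

Numbers are compared as strings, so "3.50" and "3.5" would count as different. Whether that is acceptable depends on how the report rounds its values.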



