Automate Google OCR using Python – demo in Tamil


Hello. In this video, let us see how we can automate Google OCR using Python. OCR means extracting text from images, PDF files, etc. Until now, typing the text out by hand was the only way. Now Google offers free OCR.

Go to http://drive.google.com. Let us take an example image and extract the text from it. Select New -> upload file -> select image. Now select the image file, right-click it, and open it with Google Docs. The image will be OCRed and the text displayed along with the original image.

It is easy to do this for a single image. But a typical PDF can have 50-60 pages, and it is tough to OCR huge PDF files this way. I automated this task using the Python programming language. Let us see how to do this.

The source code is at github.com/tshrinivasan/google-ocr-python. It is a very small program, but it needs a Python library, gdcmdtools, which is used to interact with the Google Drive API from Python. Let us install gdcmdtools. I placed the PDF files in the folder ‘demo’. Extract the tarball and install it with the “sudo python setup.py install” command. Now it is installed. Check its documentation on how to use it.

Let us see how to set up the API with Google. Go to http://console.developers.google.com and create a new project. Give the project any name. Now we have created a project. We have to enable a few APIs in it. Go to APIs in the project page and enable two APIs: the Drive API and the Fusion Tables API. Go to Credentials and enable OAuth credentials. Give a product name. We get a client ID and a client secret. Let us download them; it is a JSON file.

We can use that file with the gdauth.py script. It gives a URL. Open that URL in a browser, choose a Google account, and allow this program to access Google Drive. Copy the secret code and paste it in the terminal. OAuth is done. From now on, we can access Google Drive from the command line using gdcmdtools.

For example, let us upload a file. Run gdput.py, mention the type of the file (use ocr as the file type), then give the filename. Now it is uploading the file.
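The single-file upload described above can also be driven from Python. This is only a minimal sketch, not the code from the repository: it assumes gdput.py is on the PATH, that OAuth has already been completed with gdauth.py, and that the file type is passed with the `-t ocr` option (check the gdcmdtools documentation for your version). The function names `build_gdput_ocr_command` and `upload_for_ocr` are made up for illustration.

```python
import subprocess


def build_gdput_ocr_command(image_path):
    """Return the argument list for uploading one image with OCR as the file type.

    Assumption: gdput.py is on the PATH and takes the file type as `-t ocr`.
    """
    return ["gdput.py", "-t", "ocr", image_path]


def upload_for_ocr(image_path):
    """Run gdput.py for one image and return whatever it printed.

    In the demo, that output contains the links to the OCRed document.
    Raises subprocess.CalledProcessError if gdput.py exits with an error.
    """
    result = subprocess.run(
        build_gdput_ocr_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `upload_for_ocr("page.jpg")` would then do the same thing as typing the command in the terminal, with the output captured as a string instead of printed.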
It OCRs the file and gives the required links. Click the plain text link. It asks to save the file; click OK. The image is OCRed and given as plain text. So we have uploaded a single file from the command line.

Let us see how to work on PDF files. We have a PDF file, for example shrini-articles-malaigal.pdf. It has 6 pages. We have to convert it to 6 individual images. We can automate this. Make a new folder and let us convert the pages to images. Linux provides ImageMagick for this; convert is its command-line tool. It is used to change file type, file size, etc., and it can convert a PDF into images. The density option sets the DPI; OCR needs at least 300 DPI. Give the PDF file name, the quality, and then the output filename format. This converts the 6 pages into 6 images. Then we can upload them all. See the 6 images.

See this URL to get the program for bulk upload. It just automates gdput.py. Download the script; I already have it. Go to the folder with the images, copy the script there, and run it as python google-ocr.py. It calls gdput.py as we did manually, but for all the images, automatically. No big stuff 🙂 It loops over each file, uploads it, and downloads the result as a text file with the same name.

The code is very small: loop over all the jpg files, pass each one as input to gdput.py to upload it, use gdget to download it as a text file, then merge all the text files into a single file. The time taken depends on the file size and internet speed.

gdcmdtools is the great tool behind this. Thanks to all its creators. We can automate all tasks related to Google Drive using gdcmdtools. Google OCR supports 200 languages, including Tamil. Convert all your images and documents to text as soon as possible. We don't know how long Google will offer this service; they can shut it down at any time.

Now all 6 files are uploaded. Check ocr-result.txt. We got the individual text files too, along with one merged file: the 6 images are now in one single text file. You can download the script and use it.

We have more things to do to improve it. Improve the code. Ask for a folder name, create a folder in Google Drive, and upload the images there.
An ODT file keeps the formatting better than a text file. There is no blank line between paragraphs in the text files, so we have to add line breaks manually. ODT keeps the formatting, but I am still searching for a way to merge ODT files. Fork the repo and contribute. Mail me if you have any queries. Let us improve it further. Thanks.
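The two local steps of the PDF workflow in the transcript, splitting the PDF into page images and merging the per-page text files, can be sketched in Python as below. This is an illustration under assumptions, not the code of google-ocr.py: the function names and the `page-%03d.jpg` naming pattern are hypothetical, the quality value of 100 is just an example, and ImageMagick's convert is assumed to be installed. The upload and download step with gdput.py and gdget would run between the two.

```python
import glob
import os


def pdf_to_images_command(pdf_path, out_dir, density=300, quality=100):
    """Build the ImageMagick command that renders each PDF page as a JPEG.

    density is the DPI; the transcript says OCR needs at least 300.
    The output pattern (page-000.jpg, page-001.jpg, ...) is a hypothetical choice.
    """
    pattern = os.path.join(out_dir, "page-%03d.jpg")
    return ["convert", "-density", str(density), pdf_path,
            "-quality", str(quality), pattern]


def merge_text_files(folder, merged_name="ocr-result.txt"):
    """Concatenate every per-page .txt file in folder, in file-name order.

    The merged file itself is skipped, so the function can be re-run safely.
    Returns the path of the merged file.
    """
    merged_path = os.path.join(folder, merged_name)
    pages = sorted(
        path for path in glob.glob(os.path.join(folder, "*.txt"))
        if os.path.abspath(path) != os.path.abspath(merged_path)
    )
    with open(merged_path, "w", encoding="utf-8") as merged:
        for page in pages:
            with open(page, encoding="utf-8") as source:
                text = source.read()
            # Make sure each page ends with a newline so pages do not run together.
            merged.write(text if text.endswith("\n") else text + "\n")
    return merged_path
```

The command list returned by `pdf_to_images_command("shrini-articles-malaigal.pdf", "images")` can be passed to `subprocess.run(..., check=True)` to do the conversion. Zero-padded page numbers keep the sorted file order equal to the page order, which matters when the text files are merged.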

10 comments

  1. It was good that you spoke in Tamil. I feel it would have been even better if the pages were in English. I really appreciate the work done.

  2. Congratulations. A really wonderful post 👏👏👏

  3. Please avoid posting in pure Tamil and use simple Tamil, friend. Not everyone can understand pure Tamil.

  4. If you also gave the addresses of the web pages shown in this video lesson, it would be convenient for many people to open them right away. It would save searching time. E.g. https://github.com/tshrinivasan/google-ocr-python
