[Solved] Search for PDF files with content:
I tried to search for PDF files containing some text in pages.
I installed Adobe PDF iFilter 64 v11.0.01
With 1.5.0.1295a x64 I receive: Querying ... 0 objects even if the text is present in PDF.
Last edited by w64bit on Fri Jan 21, 2022 3:28 pm, edited 2 times in total.
Re: Search for PDF files with content:
1. What search query did you use?
2. What is defined under Menu:Tools > Options > Indexes > Content?
Re: Search for PDF files with content:
D: *.pdf content:text
No file content is indexed.
I am trying to avoid indexing file content and instead search inside the files (querying) when I need to, even if it takes longer. This works with DWG files, as I have the DWG iFilter installed with AutoCAD.
Re: Search for PDF files with content:
What OS are you using?
-I would like to test my end.
The Adobe PDF iFilter might not like the COM multithreaded concurrency model, please try disabling content_pdf_ifilter_coinit_multithreaded:
- In Everything, type in the following search and press ENTER:
/content_pdf_ifilter_coinit_multithreaded=0
- If successful, content_pdf_ifilter_coinit_multithreaded=0 is shown in the status bar for a few seconds.
- Please restart Everything, type in the following search and press ENTER:
/restart-now
Everything might be getting stuck on a specific file.
Please try running Everything in verbose debug mode:
- In Everything, from the Tools menu, under the Debug submenu, check Console.
- From the Tools menu, under the Debug submenu, check Verbose.
- Perform your pdf content search.
- What is shown in the Debug console when Everything gets stuck showing Querying...?
Re: Search for PDF files with content:
Win 10 21H2 x64
/content_pdf_ifilter_coinit_multithreaded=0 did not help. Same Querying ... 0 objects.
Debug for content_pdf_ifilter_coinit_multithreaded=1
Last edited by void on Thu Jan 20, 2022 9:54 am, edited 1 time in total.
Reason: remove logs
Re: Search for PDF files with content:
EDIT:
I mixed up a couple of things. Removed my answer to avoid sending other people reading this in the wrong direction ...
Re: Search for PDF files with content:
Thank you for the debug logs w64bit,
From memory, w64bit runs as the true admin.
It looks like the query eventually completes (after ~60 seconds) with: failed to load stream
The issue is Everything is not finding content in PDF files at all.
The iFilter fails to load my file stream.
I'm unsure of the reason...
Everything 1.5.0.1296a adds more debug information.
Could you please try running this Everything version in verbose debug mode again:
- In Everything, from the Tools menu, under the Debug submenu, check Console.
- From the Tools menu, under the Debug submenu, check Verbose.
- Perform your pdf content search.
- What is shown in the Debug console after the query completes? In particular, look for lines of the form:
failed to load stream <filename> <failure-reason>
Re: Search for PDF files with content:
With:
Adobe PDF iFilter 64 v11.0.01
content_ifilter=1
content_pdf_ifilter_coinit_multithreaded=0
content_ifilter_coinit_multithreaded=0
=> failure-reason 80004005
If I uninstall the Adobe PDF iFilter and use the Windows default iFilter, searching does not get stuck but fails to find all PDF files: some of them are missing from the result list.
It seems that with the Windows default iFilter it can find PDF files when I type the first 3 letters of a word, but nothing if I type 4 or more letters.
Re: Search for PDF files with content:
Thank you for the debug info w64bit,
80004005 is a generic error code.
It's a common issue with the Adobe PDF iFilter.
Everything is already using IPersistStream.
I am testing this on my end and will get back to you.
Could you please send me a PDF file where a 4 letter word doesn't match (with the default iFilter) to support@voidtools.com
It's most likely a bad break injecting a space or newline.
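The "bad break" explanation can be illustrated with a small sketch. The strings below are hypothetical and are not real iFilter output; they only show why a break injected inside a word would let a 3-letter prefix match while the full 4-letter word does not.

```python
def matches(extracted_text: str, term: str) -> bool:
    """Case-insensitive substring match, similar in spirit to a plain content search."""
    return term.lower() in extracted_text.lower()


# Hypothetical extraction results for the same page
clean = "This is a test page"
broken = "This is a tes\nt page"  # assumption: a newline was injected inside "test"

print(matches(clean, "test"))   # True
print(matches(broken, "tes"))   # True: the first three letters are still contiguous
print(matches(broken, "test"))  # False: the break splits the word
```

Inspecting the extracted text directly is the way to confirm whether this is what actually happens for a given file.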
Re: Search for PDF files with content:
Everything 1.5.0.1297a fixes an issue with the Adobe PDF iFilter not loading correctly.
The Adobe PDF iFilter is loading a dll dependency from the current directory.
Everything 1.5 previously prevented this type of dll loading.
Re: Search for PDF files with content:
this file has a problem with first e
Attachments:
text - Pieces.zip (58.33 KiB)
Re: Search for PDF files with content:
Thank you for the test1.pdf sample.
This appears to work fine for me on Windows 10 21H1 with the stock PDF iFilter.
Do you have any Search options enabled under the Search menu?
Could you please send the verbose debug output when searching content in this file?:
- In Everything, from the Tools menu, under the Debug submenu, check Console.
- From the Tools menu, under the Debug submenu, check Verbose.
- Search for:
Test1.pdf content:test
- What is shown in the Debug console after the query completes?
Re: Search for PDF files with content:
Another trick with Everything 1.5 that might be helpful here:
- In Everything, search for:
Test1.pdf dotall:regex:content:(.*)
- Show the Regular Expression Match 1 column.
- What is shown for you in this column?
Re: Search for PDF files with content:
No Search options enabled under the Search menu.
Test1.pdf dotall:regex:content:(.*) + Regular Expression Match 1 => nothing found in search list
debug.txt attached
It seems that it has to do with my fresh install of Win 10 21H2 x64 from a clean ISO, dated dec 2021.
I don't remember this PDF problem on my previous install of Win 10 21H2 obtained from 20H2 x64 + all updates done by WU.
Last edited by void on Fri Jan 21, 2022 11:46 am, edited 1 time in total.
Reason: removed debug logs
Re: Search for PDF files with content:
Thanks for the debug logs.
LoadIFilter Test1.pdf 80004005
The PDF iFilter straight up fails to load. (Generic Failure error code)
The stock PDF iFilter does not like running in a STA thread.
Please try re-enabling content_pdf_ifilter_coinit_multithreaded:
- In Everything, type in the following search and press ENTER:
/content_pdf_ifilter_coinit_multithreaded=1
- If successful, content_pdf_ifilter_coinit_multithreaded=1 is shown in the status bar for a few seconds.
- Please restart Everything, type in the following search and press ENTER:
/restart-now
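For readers trying to interpret codes such as 80004005 in the debug output, here is a minimal sketch that splits an HRESULT into its standard fields. The function name and the small lookup table are illustrative only and are not part of Everything. The bit layout is the standard Windows one: 0x80004005 is E_FAIL, the generic "unspecified failure".

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its failure flag, facility and code fields."""
    value &= 0xFFFFFFFF
    return {
        "failure": bool(value >> 31),       # bit 31: 1 = failure, 0 = success
        "facility": (value >> 16) & 0x7FF,  # bits 16-26: subsystem that raised it
        "code": value & 0xFFFF,             # bits 0-15: the error number itself
    }


# Illustrative lookup table; only a well-known value is listed here
KNOWN = {0x80004005: "E_FAIL (unspecified failure)"}

if __name__ == "__main__":
    hr = int("80004005", 16)  # the debug console prints the code as bare hex
    print(decode_hresult(hr), KNOWN.get(hr, "not in this table"))
```

A facility of 0 (FACILITY_NULL) with E_FAIL carries no further detail, which is why the code alone does not explain why the iFilter failed to load.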
Re: Search for PDF files with content:
I checked, added and corrected PDF PersistentHandler registry entries and now it's all OK with Win 10 default IFilter.
Thank you very much.
Re: [Solved] Search for PDF files with content:
Thanks for the update w64bit,
I am glad to hear PDF content searching is working now.
Re: [Solved] Search for PDF files with content:
Hi,
I would like to show the files that have the result of "failed to load stream".
Is this possible?
In my case, they are the ones that I'm looking for: basically, dirty, bad or textless PDFs. They usually share this trait of failing to load.
I saw error 8004807a in the logs in my case.
Re: [Solved] Search for PDF files with content:
Everything doesn't have a search function that will find PDF files that fail to load.
Everything 1.5 might help to find bad PDF files with the new PDF properties:
- In Everything 1.5, right click the result list column header and click Add columns....
- Click application/pdf on the left.
- Select all properties and click OK.
- Examine these new columns in Everything for your PDF files. (eg: search for *.pdf )
A missing File Signature will definitely indicate a bad PDF file.
Thank you for the suggestion.
Re: [Solved] Search for PDF files with content:
Thanks, but in this case, the file shows application/pdf as File Signature and content-type just fine.
But in the log it says:
Code: Select all
failed to load stream F:\Seminary\TheologyBible\theolibrary.shc.edu\www.duq.edu\documents\theology\_pdf\faculty-publications\Bulletin_of_Ecumenical_Theology_21_2009.pdf 8004807a
It seems that the iFilter (content search) fails and Everything then falls back to searching the file as raw, literal file content: my regex search returns the actual raw file content, whereas a PDF file with actual body text (in this case, the one with the -text.pdf suffix) returns the body text and not raw text, see screenshot:
Here is the source file that I'm trying to detect as having no content in it. My usual search of regex:content:\A\s*\z or regex:content:^$ doesn't work for this case, because it seems to return the raw contents of the file.
Update:
Oh, I think this would work: if I search for regex:content:^\%PDF or content:%PDF then it finds this file, which fails to load as an actual PDF and is searched as binary/bytes/raw instead.
Update2:
But this doesn't really solve the problem, because it still causes a full content search of the whole file for those files that are "valid" PDFs, just to be able to search for the regex that I'm looking for.
I guess what I want is not to read the whole file contents, but to search just the start of the PDFs that have an issue, or only the start of the PDFs that have full text, so that the content search fails quicker on big files with no match and then moves on to the next file.
Here's the source pdf file: https://www.duq.edu/Documents/theology/ ... 1_2009.pdf
------------
Last edited by defza on Mon Apr 18, 2022 1:58 pm, edited 1 time in total.
Re: [Solved] Search for PDF files with content:
Error 3 is path not found.
Is the U: drive online?
Does Everything have access to your U: drive? (are you running Everything as an administrator?)
Does forcing a rebuild from Tools -> Options -> Indexes -> Force Rebuild help?
Re: [Solved] Search for PDF files with content:
Your example pdf file can't be indexed as it contains no searchable text at all.
Try to select some text from it in your PDF tool and you will see.
If I run my PDF-XChange OCR tool on it, the file becomes fully searchable and can be indexed with Everything.
Its size also shrinks this way from 9.1 MB to 1.1 MB.
Btw. the original has some structural errors which can be fixed.
Re: [Solved] Search for PDF files with content:
Sorry, that was my mistake; ignore that part, I've edited it out now. It was a copy of the file on a non-existent disk.
"Your example pdf file can't be indexed as it contains no searchable text at all."
Yes, I know, because that's exactly the type of file that I'm trying to find with an Everything search.
Re: [Solved] Search for PDF files with content:
Actually, I've got a global filter on showing results only from the available online: only files... strange that it was trying to access the file on that drive?
Seems that maybe content searches are ignoring the online attribute?
Re: [Solved] Search for PDF files with content:
I use a modified script from NotNull to create a list of files which need OCR.
It uses pdftotext tool and creates a file need_ocr.txt in the dir with the PDFs.
The script originally added an unwanted space to the end of every name in the list; writing the redirection before echo, as below, avoids that.
Code: Select all
@echo off
setlocal
rem echo on
pushd "%~dp0"
cls
::____________________________________________________________
::
:: SETTINGS
::____________________________________________________________
::
rem Switch the console to code page 1252 (Western European)
chcp 1252 >nul
set "OUT-List=.\need_ocr.txt"
rem Start with a fresh list; skip the delete if no old list exists
if exist "%OUT-List%" del "%OUT-List%"
::____________________________________________________________
::
:: ACTION!
::____________________________________________________________
::
for %%X in (*.pdf) do (
echo. [%%X]
rem Extract the text of the PDF into a temporary file
C:\Tools\xpdf-tools\pdftotext.exe -simple "%%X" .\checkthis.txt
rem Fewer than 25 bytes of extracted text: treat the PDF as needing OCR
rem The redirection comes first so that no trailing space is written after the name
for %%C in (checkthis.txt) DO if %%~zC LSS 25 ( >>"%OUT-List%" echo %~dp0%%X)
del checkthis.txt
)
pause
goto :EOF
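For readers who prefer a scripting language, the same idea can be sketched in Python. This is only a sketch under assumptions: the function names are made up for illustration, pdftotext is assumed to be the xpdf build on the PATH (the -simple option comes from the batch script above, and "-" as the output name sends the text to stdout), and the 25-byte threshold is copied from the batch script.

```python
import subprocess
from pathlib import Path

MIN_TEXT_BYTES = 25  # same threshold as the batch script


def extract_with_pdftotext(pdf: Path, exe: str = "pdftotext") -> bytes:
    """Run pdftotext on one file and return the extracted text as bytes."""
    result = subprocess.run(
        [exe, "-simple", str(pdf), "-"],  # "-" writes the text to stdout
        capture_output=True,
        check=False,  # a failing PDF simply yields little or no text
    )
    return result.stdout


def list_needing_ocr(folder: Path, extract=extract_with_pdftotext) -> list:
    """Return the PDFs in folder whose extracted text is shorter than the threshold."""
    return [
        pdf for pdf in sorted(folder.glob("*.pdf"))
        if len(extract(pdf)) < MIN_TEXT_BYTES
    ]


def write_list(paths, out_file: Path) -> None:
    """Write one full path per line, with no trailing spaces."""
    out_file.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")


if __name__ == "__main__":
    here = Path.cwd()
    write_list(list_needing_ocr(here), here / "need_ocr.txt")
```

The extractor is passed in as a parameter so that the size check can be tried without pdftotext installed.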
Re: [Solved] Search for PDF files with content:
"Actually, I've got a global filter on showing results only from the available online: only files... strange that it was trying to access the file on that drive?
Seems that maybe content searches are ignoring the online attribute?"
online: also matches files where the online status is unknown.
I will change online: in the next alpha update to match only files that are known to be online.
Re: [Solved] Search for PDF files with content:
horst.epp wrote: ↑Mon Apr 18, 2022 5:42 pm I use a modified script from NotNull to create a list of files which need OCR.
It uses pdftotext tool and creates a file need_ocr.txt in the dir with the PDFs.
I believe Everything itself can find all PDFs that are just images or are corrupt with these two search queries:
1. regex:content:\A\s*\z (finds the PDFs with no content, i.e. only whitespace is returned) or regex:content:^$
2. regex:content:^\%PDF (Everything first tries to search the content through the iFilter; when that fails, it reverts to a raw binary search and then finds the PDF header. This only happens with files for which the iFilter/system PDF search returns an error.)
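Outside Everything, the two checks behind these queries can be sketched in a few lines. This is an illustration only, with made-up function names: the first function reads just the start of a file and looks for the PDF header, and the second tests whether already-extracted text is empty or whitespace only. Neither one reproduces how Everything or an iFilter works internally.

```python
import re

PDF_MAGIC = b"%PDF-"


def starts_like_pdf(path, probe_bytes=1024):
    """True if the PDF header appears within the first probe_bytes of the file."""
    with open(path, "rb") as handle:
        return PDF_MAGIC in handle.read(probe_bytes)


def is_blank(extracted_text):
    """True if the extracted text is empty or whitespace only (the idea behind query 1)."""
    return re.fullmatch(r"\s*", extracted_text) is not None
```

Reading only the first bytes keeps the header check cheap even for large files, which is the "fail quicker" behaviour wished for earlier in the thread.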