Finding a file containing only one type of character (nul) doesn't match all the right files regex:content

defza
Posts: 30
Joined: Thu Apr 18, 2019 12:49 pm

Post by defza »

If I search for

regex:content:^\x00+$
on a certain folder, then it finds only one file; the many others are never matched.
All the files are just a bunch of nul characters, no newlines, nothing except nul characters.

Here are 2 samples that I expected to match but are not matching: https://drive.google.com/drive/folders/ ... sp=sharing
defza
Posts: 30
Joined: Thu Apr 18, 2019 12:49 pm

Re: Finding a file containing only one type of character (nul) doesn't match all the right files regex:content

Post by defza »

The solution is to add the "binary:" option, ie.

regex:binary:content:^\x00+$

(This is a bit odd, since \x00 is essentially a binary character, so I'm not sure why it wouldn't match without the binary: switch.)
void
Developer
Posts: 16680
Joined: Fri Oct 16, 2009 11:31 pm

Re: Finding a file containing only one type of character (nul) doesn't match all the right files regex:content

Post by void »

binary: is the correct answer.

binary: will treat your search and the file content as a byte stream.

Without binary:, Everything will treat the file as text.
You can still match NUL characters in text mode.

The issue occurs because there is a system iFilter for tff files, and this iFilter returns empty content when the file contains only NULs.

Alternative searches:
regex:binarycontent:^\x00+$
regex:utf8content:^\x00+$

(treat the content as binary or UTF-8).
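
For anyone following along, here is a minimal Python sketch of why empty content from a filter defeats the pattern. It uses Python's re module on bytes purely as a stand-in; it is not Everything's regex engine, and the name is_all_nul is made up for illustration.

```python
import re

# Byte-oriented equivalent of ^\x00+$ : one or more NUL bytes and nothing else.
ALL_NUL = re.compile(rb"\x00+")


def is_all_nul(data: bytes) -> bool:
    """Return True if data is non-empty and consists only of NUL bytes."""
    return ALL_NUL.fullmatch(data) is not None


# Raw bytes of an all-NUL file match the pattern.
raw_bytes = b"\x00" * 16

# Assumed stand-in for what a filter hands back when it returns empty content.
filtered = b""
```

With raw_bytes the check succeeds, but with the empty filtered content it fails, because + requires at least one NUL. That is consistent with the search only matching once the content is read as bytes.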
therube
Posts: 4955
Joined: Thu Sep 03, 2009 6:48 pm

Re: Finding a file containing only one type of character (nul) doesn't match all the right files regex:content

Post by therube »

I'll note that on a long Query, even though the progress dialog comes up, it may not be apparent that the "progress" is being displayed, as in, there is not enough progress for the progress bar to display anything.

(And if you don't realize that, you might try Canceling the Query at the 'Querying... 0 items' dialog, but that is for naught ;-).)


Does regex:binarycontent:^\x00+$ "kick out" on a non-match, or does it continue to check the remainder of a file, regardless?
(I assume the latter.)


Thinking that something like
size:>0 regex:binarycontent:^\x00+$
is not going to be a very efficient search method on a large data set.


grep -L -z -e . ...

https://askubuntu.com/questions/1066057 ... ome-others
(Now let me remember how to do a "filelist", so I can see what it returned...
Ah, it was the 'ol right-click context menu, Edit File List Slot..., that I wanted.)

TimeThis :  Command Line :  grep -R -L -z -e . * > ALLNUL
TimeThis :    Start Time :  Mon May 27 13:00:00 2024
TimeThis :      End Time :  Mon May 27 13:07:05 2024
TimeThis :  Elapsed Time :  00:07:05.897
So 7 minutes to search my K: drive, & assuming that it is searching everything, & finding expected results.
(Compared with what would seem to be an hours-long (?) search in Everything.)

(Granted, you do need to add the volume name to all files [in ALLNUL, the input to Everything's File List], s/^/K:/, & you also need to switch the slashes [as output by grep], s/\//\\/g.)


("File List" does not support forward slashes, / :()
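
A rough Python equivalent of the two sed substitutions mentioned above, in case sed is not at hand. The function names and the sample paths are made up for illustration; only the K: prefix and the slash swap come from the post, and the code mirrors those substitutions exactly without adding any separator of its own.

```python
def to_file_list_path(grep_path: str, volume: str = "K:") -> str:
    # Same effect as s/^/K:/ followed by s/\//\\/g on one line of grep output.
    return volume + grep_path.replace("/", "\\")


def convert_lines(lines, volume: str = "K:"):
    # Skip blank lines and drop the trailing newline before converting.
    return [
        to_file_list_path(line.rstrip("\n"), volume)
        for line in lines
        if line.strip()
    ]
```

Something like convert_lines(open("ALLNUL")) would then give the lines to paste into the file list.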
void
Developer
Posts: 16680
Joined: Fri Oct 16, 2009 11:31 pm

Re: Finding a file containing only one type of character (nul) doesn't match all the right files regex:content

Post by void »

Thank you for your post therube,



Everything 1.5.0.1379a improves searching for content.

I have added optimized search paths for the following searches:

regex:binarycontent:^\x00+$

(filled with the same byte character - one or more)

regex:binarycontent:^\x00*$

(filled with the same byte character - zero or more)

regex:binarycontent:\x00

(contains the specified single byte character)

regex:binarycontent:\x00\x00\x00...

(contains the specified multiple byte characters)



The above searches will no longer read the entire content and will instead read chunks at a time.
\x00 can be replaced with any byte character.
By default this excludes a-zA-Z, which must be handled the old/slow way for case comparisons; a-zA-Z are included if you use the case: search modifier or the binary: search modifier.
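
As a rough illustration of reading chunks at a time, here is a Python sketch of the "filled with the same byte" and "contains the byte" cases. This is not Everything's actual code; the function names and the 64 KB chunk size are assumptions.

```python
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # assumed chunk size; the thread does not state one


def is_filled_with(stream: BinaryIO, byte: int = 0x00) -> bool:
    """True if the stream is non-empty and every byte equals `byte`."""
    seen_any = False
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            # End of data: a match only if at least one byte was read.
            return seen_any
        if chunk.count(byte) != len(chunk):
            # A different byte was found, so stop without reading the rest.
            return False
        seen_any = True


def contains_byte(stream: BinaryIO, byte: int = 0x00) -> bool:
    """True as soon as any chunk contains `byte`."""
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            return False
        if byte in chunk:
            return True
```

Both functions take a stream opened in binary mode, for example open(path, "rb"), and stop early once the answer is known.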



I recommend using binary:content: instead of binarycontent:
binary:content: will treat your search and the content as a byte-stream.
binarycontent: will treat the search as text and the content as ASCII/ANSI/UTF-8/UTF-16/UTF-16BE/UTF-16+1/UTF-16BE+1
regex:binarycontent: will now treat the search as a byte stream if there are no a-z or A-Z characters in your search.
case:regex:binarycontent: is also treated the same as binary:regex:content:



1379a fixes an issue with content: not searching binary content correctly.


therube wrote:
I'll note that on a long Query, even though the progress dialog comes up, it may not be apparent that the "progress" is being displayed, as in, there is not enough progress for the progress bar to display anything.

I will consider showing 1% instead of 0%.

Thanks for the suggestion.


therube wrote:
Does regex:binarycontent:^\x00+$ "kick out" on a non-match, or does it continue to check the remainder of a file, regardless?
(I assume the latter.)

Old versions would read the whole file then perform the regex search.
1379a will be a little smarter and read chunks of data until a hit is found. (if no hit is found the entire file is read)

binarycontent: will also try to convert the binary file to ansi text, UTF-8 text, UTF-16 text etc...

Use binary:regex:content: to treat the search and content as a byte stream.


therube wrote:
("File List" does not support forward slashes, /

File lists do support /
Unfortunately, enabling Tools -> Options -> Search -> Replace forward slashes with backslashes will make them impossible to match.
therube
Posts: 4955
Joined: Thu Sep 03, 2009 6:48 pm

Re: Finding a file containing only one type of character (nul) doesn't match all the right files regex:content

Post by therube »

therube wrote:
Thinking that something like size:>0 regex:binarycontent:^\x00+$ is not going to be a very efficient search method on a large data set.

Not any more.
Now under 1 minute :-).