support for fuzzy logic search ?

General discussion related to "Everything".
Andreas Sachse
Posts: 5
Joined: Sat Feb 29, 2020 2:24 pm

support for fuzzy logic search ?

Post by Andreas Sachse »

maybe in the next release ?

thx
void
Developer
Posts: 16698
Joined: Fri Oct 16, 2009 11:31 pm

Re: support for fuzzy logic search ?

Post by void »

I have experimented with fuzzy searching for Everything 1.5.
I've found fuzzy searching for filenames not very useful (unless you sort by relevance).
I've tried Levenshtein distance/soundex/metaphone.

What might be useful is dictionary suggestions.
For example, if you search for 'curiousity', Everything would suggest: did you mean 'curiosity'?
The problem here is that the Windows dictionary API is not very useful, and my own dictionary would use more disk space than the Everything executable (i.e. bloat).

I've added options to ignore spaces and punctuation for the next major release. This seems the most useful, e.g.:
spiderman
spider-man
spider.man
spider man
are all equal (when ignore spaces and ignore punctuation are enabled from the Search menu).
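The effect of these two options can be sketched in a few lines of Python. This is only an illustration of the idea, not how Everything implements it: the function name `normalize_name` is made up, and lower-casing plus the use of Unicode categories to decide what counts as a space or punctuation are assumptions.

```python
import unicodedata

def normalize_name(name):
    """Lower-case a name and drop spaces and punctuation (assumed behaviour)."""
    kept = []
    for ch in name.lower():
        category = unicodedata.category(ch)
        # Unicode categories starting with "P" are punctuation,
        # those starting with "Z" are separators such as spaces.
        if category[0] in ("P", "Z"):
            continue
        kept.append(ch)
    return "".join(kept)

names = ["spiderman", "spider-man", "spider.man", "spider man"]
# All four spellings collapse to the same key.
print({normalize_name(n) for n in names})
```

Comparing the normalized keys instead of the raw names makes all four spellings compare equal.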

I am looking into user-defined synonym lists and will look into supplying some basic localized synonym lists.
eg:
and
&
an
'n
would be a user-defined list of words to treat as equal.
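A rough sketch of how such a list might be applied, assuming whole-word matching on whitespace-separated tokens. The structure `SYNONYM_GROUPS` and both function names are hypothetical and say nothing about how Everything would store or apply its lists.

```python
# Hypothetical synonym groups; the first word of each group is used
# as the canonical form for every word in that group.
SYNONYM_GROUPS = [["and", "&", "an", "'n"]]

CANONICAL = {word: group[0] for group in SYNONYM_GROUPS for word in group}

def canonical_tokens(text):
    """Split on whitespace and replace each token by its canonical synonym."""
    return [CANONICAL.get(token, token) for token in text.lower().split()]

def equal_with_synonyms(a, b):
    """True when both texts are the same after synonym replacement."""
    return canonical_tokens(a) == canonical_tokens(b)

print(equal_with_synonyms("Rock and Roll", "rock & roll"))
print(equal_with_synonyms("rock 'n roll", "Rock AND Roll"))
```

Mapping every word to one canonical form keeps the comparison a plain equality test, so a synonym list of any size costs one dictionary lookup per token.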
Link
Posts: 21
Joined: Thu Nov 03, 2011 10:08 pm

Re: support for fuzzy logic search ?

Post by Link »

What issue did you have with Levenshtein distance? I've played around with that one and it works. It's just very slow (at least my implementation) unless you do fancy optimizations. Everyone has more than one core now, so parallelizing the search could speed it up.
You could try Damerau–Levenshtein distance. With the way Everything stores strings it may be fast.
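For reference, here is a minimal Python sketch of the restricted Damerau–Levenshtein distance (also known as optimal string alignment). It differs from plain Levenshtein by counting a swap of two adjacent characters as a single edit. The function name `osa_distance` is arbitrary, and this plain dynamic-programming version is not optimized in any way.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and swaps of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "mna" -> "man".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("mna", "man"))
print(osa_distance("tonic", "sonic"))
```

With plain Levenshtein the swapped letters in "mna" would count as two edits; here they count as one.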
void
Developer
Posts: 16698
Joined: Fri Oct 16, 2009 11:31 pm

Re: support for fuzzy logic search ?

Post by void »

Performance was fine.

It doesn't work so well for millions of filenames. Too many unwanted results, even when tuned to 1 edit per 9+ characters.

For example, I might search for "tonic" and get 100,000 "sonic" results with 100 "tonic" results.
Everything would need a ranking system to make Levenshtein distance useful, eg: show "tonic" results first.
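A ranking of this kind can be sketched as sorting the matches by their edit distance, so exact matches come first. This is only an illustration in Python: `ranked_matches` and the `max_edits` parameter are made-up names, and whole names are compared rather than substrings.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ranked_matches(query, names, max_edits=1):
    """Keep names within max_edits of the query, closest first."""
    scored = []
    for name in names:
        dist = levenshtein(query.lower(), name.lower())
        if dist <= max_edits:
            scored.append((dist, name))
    scored.sort()  # sort by distance, then alphabetically
    return [name for _, name in scored]

print(ranked_matches("tonic", ["sonic", "tonic", "ionic", "stone"]))
```

The "sonic" results are still returned, but they sort below the exact "tonic" match instead of burying it.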
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: support for fuzzy logic search ?

Post by aviasd »

I've been using fzf for fuzzy searching, which uses the Smith-Waterman algorithm.

Matched results were quite good and performance was fast (<1s) on my file list (~5.5M files).
Maybe it's worth checking out.
The following code was used to test (Powershell):

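Code:

# Export the Everything file list to a text file, then filter it with fzf.
$export = "FileList.txt"
"Exporting"
es -sort-path-ascending -export-txt $export
"Loading results"
$file = Get-Item $export
# Pass the full path: .NET's working directory can differ from PowerShell's.
[IO.File]::ReadAllText($file.FullName, [Text.Encoding]::Default) | fzf
Note: fzf needs to be on %PATH%
Note 2: This implementation does not handle typos.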
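The reason typos are not handled can be shown with a much simplified Python sketch. fzf's fuzzy mode requires the typed characters to appear in the candidate in the same order, so a wrong letter that is missing from the name means no match. The function below only tests that condition; it does not reproduce fzf's scoring, and the name `is_subsequence_match` is made up.

```python
def is_subsequence_match(query, candidate):
    """True when every query character occurs in the candidate,
    in the same order (case-insensitive). No scoring is done."""
    remaining = iter(candidate.lower())
    # "ch in remaining" advances the iterator past the first hit,
    # so later characters must be found after earlier ones.
    return all(ch in remaining for ch in query.lower())

print(is_subsequence_match("spdrmn", "C:\\Movies\\Spider-Man.mkv"))
print(is_subsequence_match("zpider", "Spider-Man.mkv"))
```

Skipped letters are tolerated, but a substituted letter such as the "z" in "zpider" is not.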
nspp
Posts: 10
Joined: Tue Oct 27, 2020 8:57 am

Re: support for fuzzy logic search ?

Post by nspp »

The sample you give is not the most relevant case. Fuzzy search matters when you search for files from approximate memory, when names are not in your native language, or when you make typos, i.e. searching for Zpider instead of Spider, or Skizzy instead of Squeezy. It also matters when you misspell a word by swapping letters, like Mna instead of man. A list of synonyms alone is not sufficient in these cases.

Fuzzy search algorithms like those in fzf or agrep are able to give weighted results.
therube
Posts: 4967
Joined: Thu Sep 03, 2009 6:48 pm

Re: support for fuzzy logic search ?

Post by therube »

Heh.
Could someone clue me in on how to use fzf?

What to do, what to expect, & how to "interact" with it from there?

Typing 'fzf' displays the files in the current directory.
Up/down arrow moves the "selector" up/down. (Sometimes display was not correct at that?)

But, then what? What am I supposed to do, what is supposed to happen, what am I supposed to see?

Oh. It interactively filters the results as you type in.
(But in my case, at least when running sandboxed, Sandboxie, the screen does not repaint correctly, so it was never clear to me that it was filtering results.)

OK. So now what?
wason92
Posts: 8
Joined: Tue May 10, 2022 1:35 pm

Re: support for fuzzy logic search ?

Post by wason92 »

Fzf is basically a simple filter: it takes an input and you filter it (using fuzzy matching by default, though you can change it to be exact by default).
You can pipe stdout to it, or have it read a file.
Most people use it like 'dir /b /s /a | fzf'
You pipe a list of all the files under the current path to fzf, then filter it.
That's fine, but it can take a while. What you can do is combine it with es.exe.

Export a list
es.exe -exporttxt some\file - on my machine this takes about 3.2 seconds for 2m files
then just fzf that file
fzf < some\file
This avoids the pipe which can take a very long while if you have a large filelist.

If you use clink, there's a nice fzf plugin:
https://github.com/chrisant996/clink-fzf
You can easily change this to use es instead of dir to get something like this:
Ctrl+S pops up fzf with a list of every file from Everything, with a preview window.
https://streamable.com/cjy8h5

Also, 3 seconds is a long time to wait each time you want to find something, so you can just run es.exe -exporttxt some\file every minute or so to get a relatively up-to-date list of files to search.
ChrisGreaves
Posts: 684
Joined: Wed Jan 05, 2022 9:29 pm

Re: support for fuzzy logic search ?

Post by ChrisGreaves »

void wrote: Mon Apr 06, 2020 4:28 am I've tried Levenshtein distance/soundex/metaphone.
I am late to the party, but at least I did a forum search for "Soundex" before sounding off.
Soundex was magic when first I met it, a lovely example of its use against three Aussies in Paris trying to flummox the British Airways Clerk.

Just ten minutes ago Online Soundex coded "Greaves" into G612.

I was thinking it might help narrow down duplicate files names of MP3 tracks, and it wouldn't take much (hint, hint!) to persuade me to experiment with music tracks:-
[Attached image: Soundex_01.png]
Of course there is a difference between trying to corral a herd of tracks when the first keyword is already in the filter.

Then I wondered how Soundex might be applied to text content within a file, rather than the name of the file.

I suppose that someone who understands regex already has a Soundex encoder hidden away somewhere?
Cheers, Chris
ChrisGreaves
Posts: 684
Joined: Wed Jan 05, 2022 9:29 pm

Re: support for fuzzy logic search ?

Post by ChrisGreaves »

void wrote: Tue Apr 07, 2020 10:26 am It doesn't work so well for millions of filenames..
... but on just 19,190 MP3 music tracks, Soundex or similar might trim the results down to a size from which the user could save some time.

Now that I think of it: "sonata" reduces to "S530". If I then turned around, fed "S530" in, and asked Everything to return all names that map to "S530", that might uncover a slew of errors in my file names, like "somata", "sonada" and the like.
Please note that I am not the one keying in these names; they are often keyed in by point-amassing YouTubers around the world.

It might also help with names with diacritics.
Cheers, Chris
P.S. Do these two responses count as two more votes for Fuzzy? :twisted: :twisted:
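The grouping idea can be tried outside Everything with a small Python sketch of the common American Soundex rules. Soundex has several variants, so treat this as one reasonable reading, not a reference implementation; the helper names `soundex` and `group_by_soundex` are made up, and anything outside A-Z is simply dropped.

```python
from collections import defaultdict

# Letter groups of American Soundex; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return the 4-character Soundex code, or "" if there are no A-Z letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]

def group_by_soundex(names):
    """Map each Soundex code to the list of names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)

print(soundex("Greaves"))
print(group_by_soundex(["sonata", "somata", "sonada", "symphony"]))
```

Any group holding more than one spelling is a candidate list of misspelled duplicates to review by hand.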
void
Developer
Posts: 16698
Joined: Fri Oct 16, 2009 11:31 pm

Re: support for fuzzy logic search ?

Post by void »

I will consider adding soundex:/metaphone: search functions.

Thank you for the suggestion.
void
Developer
Posts: 16698
Joined: Fri Oct 16, 2009 11:31 pm

Re: support for fuzzy logic search ?

Post by void »

Everything 1.5.0.1339a adds a soundex: search modifier.

When enabled, Everything will match files/properties by SQL soundex.

The whole name is matched.
Only A-Z, a-z letters are currently supported.

For example:
soundex:david
soundex:carpenter
soundex:artist:michael

Soundex



I'll add metaphone support eventually.
metaphone will support ignoring diacritics. (I should support ignoring diacritics in soundex too, but this is non-standard)
ChrisGreaves
Posts: 684
Joined: Wed Jan 05, 2022 9:29 pm

Re: support for fuzzy logic search ?

Post by ChrisGreaves »

void wrote: Thu Mar 02, 2023 6:50 am When enabled, Everything will match files/properties by SQL soundex.
Thanks Void. I have d/l Everything-1.5.0.1339a.x64-Setup and will install it after lunch.
That is after my walk to the PO to collect the new laptop that still won't have arrived! :lol: If only Canada Post could deliver parcels as fast as you deliver new features ... :roll:
Cheers, Chris
void
Developer
Posts: 16698
Joined: Fri Oct 16, 2009 11:31 pm

Re: support for fuzzy logic search ?

Post by void »

Everything 1.5.0.1340a improves soundex:

trailing vowels are ignored. (davide will now match david)
added support for soundex format, for example: soundex:d13
added nodiacritics support.
added highlighting.
meteorquake
Posts: 496
Joined: Thu Dec 15, 2016 9:44 pm

Re: support for fuzzy logic search ?

Post by meteorquake »

I'd be quite interested to hear from experienced people about the fuzzy search I created for my own purposes, as I've found it extremely simple and effective. How does it compare to other methods? Does it already exist (I'm sure someone will have thought of it before me), and does it have a name? At heart it just counts up the two-letter combinations found in the target, so abcde has the sequences ab bc cd de. The resulting score consequently diminishes according to how many letters are altered (e.g. Helllo Worrld) or blocks rearranged (e.g. World Helllo).

I provided JS of the whole and part match functions at viewtopic.php?p=65661#p65661

David
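Counting shared two-letter sequences is usually called bigram (or n-gram) similarity. The description above can be sketched in Python as follows. This is not a port of the linked JS: the scoring rule (the fraction of query bigrams that occur in the target) and the function names are assumptions chosen for illustration.

```python
def bigrams(text):
    """All overlapping two-character sequences, lower-cased."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def bigram_score(query, target):
    """Fraction of the query's bigrams that also occur in the target."""
    query_pairs = bigrams(query)
    if not query_pairs:
        return 0.0
    target_pairs = set(bigrams(target))
    hits = sum(1 for pair in query_pairs if pair in target_pairs)
    return hits / len(query_pairs)

print(bigrams("abcde"))
print(bigram_score("Hello World", "Hello World"))
print(bigram_score("Helllo Worrld", "Hello World"))
print(bigram_score("World Helllo", "Hello World"))
```

An exact match scores 1.0, while altered letters or rearranged blocks lose only the bigrams that span the changes, so the score degrades gradually.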
avi
Posts: 30
Joined: Sat Aug 19, 2023 6:06 pm

Re: support for fuzzy logic search ?

Post by avi »

void wrote: Thu Mar 02, 2023 6:50 am Everything 1.5.0.1339a adds a soundex: search modifier.

1. This option (soundex) is very useful! Is there a way to make it the default?
2. Could you please add support for letters in the Hebrew language to this?
Thanks!
ChrisGreaves
Posts: 684
Joined: Wed Jan 05, 2022 9:29 pm

Re: support for fuzzy logic search ?

Post by ChrisGreaves »

void wrote: Thu Mar 16, 2023 5:06 am Everything 1.5.0.1340a improves soundex:
(((I think that this is called a tautology: Everything improves everything!)))
I have been lurking in this topic and thinking about diacritical characters, Hebrew, Greek of course and so this is my logical/theoretical take:-

SOUNDEX as I heard of it/met it in 1973 was used by airline ticketing systems, and as such it was based on the 26-symbol roman alphabet A-Z, majuscules only. For that reason I have capitalized SOUNDEX in this paragraph to signal that SOUNDEX is an algorithm based on the 26 capital letters of the roman alphabet. My understanding is that even within this definition there are varieties of the algorithm, but they all make use of only those 26 capital roman letters.

We might expand such an algorithm to include minuscules, using 26+26 letters of the roman alphabet and call such an algorithm Soundex
Soundex as such would not be the same as the SOUNDEX algorithm, because SOUNDEX recognizes only 26 capital letters.

I can see rational/logical reasons for multiple top-level variations on the original algorithm, some based on Celtic populations (so making use of the apostrophe to make Charles Yelverton O'Connor happy), and that variant algorithm might recognize the hyphen-dash character as well as the apostrophe.

The Greek alphabet has upper- and lower-case letters, but I lack knowledge of how to write SOUNDEX in Upper-case Greek, or Soundex in a mixture of upper- and lower-case Greek. Perhaps someone can oblige? I worked on/in APL back then, but what do I know about Greek?!!

But never have I worked in Hebrew, although I feel confident in asking avi for help here. My level of ignorance is appalling: does Hebrew use upper- and lower-case alphabetic symbols?

By now the floodgates are open, and I suggest that there can be as many variations of the soundex algorithm as there are sets of symbols. A version of soundex that uses only the 20 non-vowel symbols of the original SOUNDEX ought to be doable. ("a vowel will not be encoded unless it is the first letter")

All of this suggests that a suitable algorithm would need to know what character set is driving the beast, and we could identify each supported version by its character set:-
Soundex-ABCDEFGHIJKLMNOPQRSTUVWXYZ
Soundex-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Soundex-ABCDEFGHIJKLMNOPQRSTUVWXYZ'-
Soundex-ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
and so on. (Yes, we might use a soundex algorithm to make concise identifiers ...)

Where does that leave Void and Everything? I doubt that the algorithm could be driven simply by the end-user supplying a set of symbols. After all, the algorithm has to know to ignore-vowels-unless-the-first-letter. And of course those of us running machine shops will want a set of symbols that correspond to the product "numbers" of our inventory.

But then, I can see the advantage of a supply of most-common-symbols sets in Everything. I'd love to find that a soundex variation could just suck in the diacritics in the folder and file names of my 19,686 music tracks. OK. I'd not get the Chinese characters, but an expanded alphabet of 26+26+diacritics would be helpful in resolving duplicates.

Just a few thoughts.
Cheers, Chris
void
Developer
Posts: 16698
Joined: Fri Oct 16, 2009 11:31 pm

Re: support for fuzzy logic search ?

Post by void »

Only A-Z letters are supported with soundex:

soundex: discussion
soundex:
https://en.wikipedia.org/wiki/Soundex

I can't use metaphone in Everything due to license issues.
I will look into support for other algorithms.
avi
Posts: 30
Joined: Sat Aug 19, 2023 6:06 pm

Re: support for fuzzy logic search ?

Post by avi »

ChrisGreaves wrote: Mon Jun 24, 2024 3:56 pm
But never have I worked in Hebrew; although i feel confident in asking avi for help here. My level of ignorance is appalling; does Hebrew use upper- and lower-case alphabetic symbols?.

In Hebrew there is something similar to lowercase letters in English, but it is only used when writing by hand (in a notebook and the like); it is not used on computers.
avi
Posts: 30
Joined: Sat Aug 19, 2023 6:06 pm

Re: support for fuzzy logic search ?

Post by avi »

void wrote: Mon Jun 24, 2024 11:48 pm Only A-Z letters are supported with soundex:

I can't use metaphone in Everything due to license issues.
I will look into support for other algorithms.
If it is possible, with the option to make it the default (with a setting in "Advanced"?), that would be great!
This is especially necessary in Hebrew, because the letters "י" and "ו" are written by some people and omitted by others. For example, some write "ביאור" and others write "באור"; some write "שלחן" and others write "שולחן", and there are many more. (You can search for "ב*אור" and "ש*לחן", but because there are so many such words, a default option would be more helpful.)
I also use a program called "Fluent Search" a lot and select your search engine there; if "Everything" had such a default setting, it would be useful there too.

On this occasion I would like to personally thank you for your software, I use it a lot!
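The spelling variation described above can be illustrated with a small Python sketch that drops the two optional letters before comparing. This is purely hypothetical and not an Everything feature; the names `OPTIONAL_LETTERS`, `skeleton` and `same_ignoring_optional` are made up. Because "י" and "ו" also act as ordinary consonants, this simple rule will over-match.

```python
# Letters that some writers include and others omit (assumed from the post).
OPTIONAL_LETTERS = {"י", "ו"}

def skeleton(word):
    """Drop every optional letter, keeping the rest in order."""
    return "".join(ch for ch in word if ch not in OPTIONAL_LETTERS)

def same_ignoring_optional(a, b):
    """True when both words are equal once the optional letters are removed."""
    return skeleton(a) == skeleton(b)

print(same_ignoring_optional("ביאור", "באור"))
print(same_ignoring_optional("שלחן", "שולחן"))
```

Both example pairs from the post compare equal under this rule, without needing a wildcard in the search.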
ChrisGreaves
Posts: 684
Joined: Wed Jan 05, 2022 9:29 pm

Re: support for fuzzy logic search ?

Post by ChrisGreaves »

avi wrote: Tue Jun 25, 2024 4:43 pm In Hebrew there is something similar to small letters in English, but it is only used when writing by hand in a notebook and the like, but it is not used on computers.
Thank you avi. So for purposes of this thread and other discussion of computer-based analysis of text, we could consider Hebrew to have but the one set of symbols, and it might be wrong to label those Hebrew symbols as Majuscules, because there are no minuscules.
Thanks again, Chris