Performance for hashes (SHA-256 and others)

Discussion related to "Everything" 1.5 Alpha.
tirael
Posts: 7
Joined: Mon Sep 30, 2024 9:19 am


Post by tirael »

First of all, I would like to note the excellent quality of your program Everything!

First question.

Are there any plans to introduce the ability to set the number of threads for individual folders/disks that will be used to index properties?

A real-life example: there are several network storage devices, each exposing several network folders via Samba. Some devices work well with about 10-20 simultaneous requests. Others work well only with one thread at a time: with, for example, 10 simultaneous requests, the response time increases 50-fold or more. Finally, some devices reach maximum performance with 4-6 simultaneous requests.

In my case, the main time cost is getting SHA-256 file hashes.

It seems that the ability to set the number of threads individually for each device could increase the performance of obtaining hashes.
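
To make the request concrete, here is a minimal Python sketch of per-device concurrency limits. It is not how Everything schedules property indexing; the class name PerDeviceLimiter, the device labels and the limits are hypothetical and chosen only for illustration. One shared thread pool does the work, and a semaphore per device caps how many hash jobs touch that device at once.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class PerDeviceLimiter:
    """Caps concurrent work per device with one semaphore per device (hypothetical helper)."""

    def __init__(self, limits, default=1):
        # limits: {device_label: max_simultaneous_requests}
        self._sems = {dev: threading.BoundedSemaphore(n) for dev, n in limits.items()}
        self._default = default
        self._lock = threading.Lock()

    def _sem(self, device):
        # Unknown devices fall back to the conservative default limit.
        with self._lock:
            if device not in self._sems:
                self._sems[device] = threading.BoundedSemaphore(self._default)
            return self._sems[device]

    def run(self, device, func, *args):
        # Blocks until the device has a free slot, then runs the job.
        with self._sem(device):
            return func(*args)


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_all(jobs, limiter, max_workers=32):
    """jobs: iterable of (device, path) pairs. Returns {path: hex digest}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            path: pool.submit(limiter.run, device, sha256_file, path)
            for device, path in jobs
        }
        return {path: fut.result() for path, fut in futures.items()}


# Example limits mirroring the three kinds of devices described above.
limiter = PerDeviceLimiter({"nas-fast": 16, "nas-slow": 1, "nas-medium": 5})
```

In this sketch a job waiting for a slot still occupies a pool thread, so max_workers should be at least the sum of the per-device limits if every device is meant to stay busy.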

And the second question.

Will it be possible to add a new property for folders: calculate a hash based on existing (previously calculated) hashes of all files inside the folder?

At the moment, the folder hash can be calculated:
* based on file names, including the folder name - this allows you to search for identical copies by name, but excludes the ability to search for copies with different names of base (root, parent) folders
* based on file hashes - but for each folder, all file contents are read anew. For example, if you add a folder containing the following structure to the indexed ones:

my-app/
├─ X/
│ ├─ Y/
│ │ ├─ Z/
│ │ │ ├─ file_1_terabyte
├─ file_1_megabyte


then file_1_terabyte will be read from disk at least 3 times - to get the SHA-256 hash of folders X, X/Y and X/Y/Z. And if you add the SHA-256 property for files, then file_1_terabyte will be read a fourth time.

It seems that it would be great if there was an option to calculate the folder hash based on a list of file hashes that have already been calculated - this way you would have to read the contents of each file only once.
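
A minimal sketch of what I mean, in Python. This is only an illustration of the idea, not the algorithm Everything or 7-Zip uses, and the resulting values would not match the existing "Folder Data SHA-256" property. The function name folder_digests and the combining rule (hash the sorted digests of the children, ignoring names) are my own assumptions. The folders are walked bottom-up, so each file is read exactly once and every parent folder reuses the digests of its children.

```python
import hashlib
import os


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def folder_digests(root):
    """Return {folder_path: hex digest} for root and every subfolder.

    Assumed combining rule (hypothetical, content-only): a folder digest is
    the SHA-256 of the sorted digests of its children, each tagged as file
    (b"f") or folder (b"d"). Names are ignored on purpose.
    """
    raw = {}
    # topdown=False yields subfolders before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        entries = [b"f" + sha256_file(os.path.join(dirpath, n)) for n in filenames]
        for d in dirnames:
            sub = os.path.join(dirpath, d)
            if sub in raw:  # symlinked folders are not walked, so skip them
                entries.append(b"d" + raw[sub])
        h = hashlib.sha256()
        for entry in sorted(entries):  # fixed-size entries, order-independent
            h.update(entry)
        raw[dirpath] = h.digest()
    return {path: digest.hex() for path, digest in raw.items()}
```

With this scheme, the structure above would read file_1_terabyte once; the digests for X/Y/Z, X/Y, X and my-app are then computed from 32-byte values only.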
void
Developer
Posts: 16745
Joined: Fri Oct 16, 2009 11:31 pm

Re: Performance for hashes (SHA-256 and others)

Post by void »

Thank you for your feedback, tirael.


Are there any plans to introduce the ability to set the number of threads for individual folders/disks that will be used to index properties?
Yes.
Everything 1.5 -> Tools -> Options -> NTFS/Folders -> Volume/Folder -> Right click -> Advanced -> Threads -> Multiple threads.

Set the maximum number of indexing threads with Tools -> Options -> Advanced -> index_max_threads
Set the maximum number of property request threads with Tools -> Options -> Advanced -> content_max_threads

Everything will automatically use multiple threads for SSDs.


Will it be possible to add a new property for folders: calculate a hash based on existing (previously calculated) hashes of all files inside the folder?
Everything 1.5 has folder name and folder data hashes.

Please try one of the following properties:

Folder Data and Names SHA-256
Folder Data SHA-256
Folder Names SHA-256

The hashes will match those produced by 7-Zip.



Thank you for the suggestions.
tirael
Posts: 7
Joined: Mon Sep 30, 2024 9:19 am

Re: Performance for hashes (SHA-256 and others)

Post by tirael »

Thank you for your reply!
Everything 1.5 -> Tools -> Options -> NTFS/Folders -> Volume/Folder -> Right click -> Advanced -> Threads -> Multiple threads.
I know about this setting; I was asking about extending its functionality.
Will it be possible to set a different number of threads for each folder? For example, for \\server\share1 - 2 threads, for \\server\share2 - 8 threads, for \\super-server\supershare - 20 threads?


Folder Data and Names SHA-256
Folder Data SHA-256
Folder Names SHA-256
Of these properties, only "Folder Data SHA-256" is relevant for comparison strictly by content. But, as I mentioned above, all files will be read as many times as there are folder nesting levels.

For example, if a 1 terabyte file is at the 20th nesting level, it will be read 20 (twenty) times in total, i.e. 20 terabytes will be read, to calculate the hashes for all folders. This is clearly suboptimal, so it cannot be used for large volumes (hundreds of gigabytes or more).
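
The arithmetic can be checked with a short Python sketch. It assumes, as described above, that every hashed ancestor folder re-reads the full contents of each file below it; the function names are hypothetical.

```python
def bytes_read_naive(files):
    """files: iterable of (size_bytes, depth), where depth is the number of
    hashed ancestor folders. Each ancestor re-reads the whole file."""
    return sum(size * depth for size, depth in files)


def bytes_read_once(files):
    """Total bytes read if every file is read a single time."""
    return sum(size for size, _ in files)


TB = 10**12
files = [(1 * TB, 20)]  # one 1 TB file at the 20th nesting level

print(bytes_read_naive(files) // TB)  # 20
print(bytes_read_once(files) // TB)   # 1
```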

Will it be possible to change the algorithm for calculating folder hashes so as not to make meaningless multiple reads of the same file? Or add a new property that would be calculated using the new algorithm.
void
Developer
Posts: 16745
Joined: Fri Oct 16, 2009 11:31 pm

Re: Performance for hashes (SHA-256 and others)

Post by void »

Will it be possible to set a different number of threads for each folder?
Currently, no.

I will consider an option to set the number of threads per folder.


Will it be possible to change the algorithm for calculating folder hashes so as not to make meaningless multiple reads of the same file?
Currently, no.

I will look into caching the sha256 value for each file.

For now, gather the information on the root folder only.
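
As a rough illustration of per-file caching (a sketch only, not Everything's implementation), a cache could key each digest on the file's size and modification time and re-read the file only when either changes. The class name FileHashCache and the invalidation rule are assumptions made for this example.

```python
import hashlib
import os


class FileHashCache:
    """Caches SHA-256 digests per path; invalidated by size or mtime changes.

    Hypothetical helper for illustration. A file rewritten with the same
    size and the same mtime would not be detected by this rule.
    """

    def __init__(self):
        self._entries = {}  # path -> (size, mtime_ns, hex digest)
        self.reads = 0      # number of full file reads performed

    def sha256(self, path):
        st = os.stat(path)
        key = (st.st_size, st.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[:2] == key:
            return cached[2]  # cache hit: no file contents are read
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        self.reads += 1
        digest = h.hexdigest()
        self._entries[path] = (key[0], key[1], digest)
        return digest
```

A folder hash computed for each nesting level could then call cache.sha256(path) repeatedly while the file itself is read from disk only on the first call.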
tirael
Posts: 7
Joined: Mon Sep 30, 2024 9:19 am

Re: Performance for hashes (SHA-256 and others)

Post by tirael »

Thank you!