How to find similar names and remove?

Post by **Debugger** » Tue Nov 05, 2024 3:09 pm

I have tried various "best of the best" tools to delete similar MP3 files, each time with lamentable results: countless Python scripts and all sorts of programs. I can never properly find and remove the duplicates.

Suppose I have two folders. One holds files at 320 kb/s (false quality, up-converted from 64 kb/s to 320 kb/s),

and the other holds actual 320 kb/s files.

The file names differ, except that artist 1, the title, and artist 2 are the same; everything else in the names is different. It is not possible to sort by name in Everything, because it sorts only by path and not by name (regardless of the path!).

I have tried everything, and it is impossible to pick out, among the 6000 names, only those that are actually duplicates, even before deleting anything. The inferior-quality copy should be removed only if a better-quality one exists in the other path.

A manual search would take me many months, maybe years, just to compare the 6000 names against the other names.

I could get back e.g. 60 GB of disk space, since I must have many duplicates that differ only in quality.
How can I find similar names and remove them? It is a very complicated situation.
A difference of just one character in the name is enough to complicate matters.

This is especially so when different artists remix a given song title and the remixer is named in brackets.

^(g:\\【Tracks独家】\\|i:\\music\.com Electro\\)([^\\]*) - ([^\\]*)\s*\((.*?)\)\.mp3$
------------------

import os
import re
import sys
from difflib import SequenceMatcher

def similar(a, b):
    """Compute the similarity of two strings."""
    return SequenceMatcher(None, a, b).ratio()

def extract_key_info(filename):
    """Extract the key information (artist, title, remixer) from a file name."""
    match = re.match(r'^(.*? - .*?)(?:\s*\(.*?\))?\.mp3$', filename)
    if match:
        return match.group(1)  # Returns 'artist - title'
    return None

# Make sure the folder paths are correct
folder_high_quality = r'I:\PATH'  # Make sure this folder exists
folder_low_quality = r'G:\【PATH】'  # Lower-quality folder

files_to_delete = []

# Load files from the high-quality folder
try:
    high_quality_files = [f for f in os.listdir(folder_high_quality) if f.endswith('.mp3')]
    print(f'Files in the high-quality folder: {high_quality_files}')  # List the high-quality files
except OSError as e:
    print(f'Error reading files from the high-quality folder: {e}')
    sys.exit(1)

# Load files from the low-quality folder
try:
    low_quality_files = [f for f in os.listdir(folder_low_quality) if f.endswith('.mp3')]
    print(f'Files in the low-quality folder: {low_quality_files}')  # List the low-quality files
except OSError as e:
    print(f'Error reading files from the low-quality folder: {e}')
    sys.exit(1)

# Dictionary mapping the key info of each high-quality file to its file name
high_quality_keys = {extract_key_info(f): f for f in high_quality_files if extract_key_info(f)}

for low_filename in low_quality_files:
    low_path = os.path.join(folder_low_quality, low_filename)
    low_key = extract_key_info(low_filename)

    if low_key in high_quality_keys:
        # A matching file was found, so mark this one for deletion
        files_to_delete.append(low_path)

# Write the files to delete to a UTF-8 encoded file
with open('files_to_delete.txt', 'w', encoding='utf-8') as f:
    for file in files_to_delete:
        f.write(f"{file}\n")

print(f'\nFiles to delete written to files_to_delete.txt: {len(files_to_delete)} files.')

Post by **therube** » Tue Nov 05, 2024 4:17 pm

You ought to post links to your other threads on this matter.

BEST OF BEST tool

Which are?

How do you determine "false quality" (which 64->320 certainly is) vs. real 320?
Are they simply in separate directories or... ?

Examples of actual name pairs?

it is not possible to sort in Everything by name, because it sorts only Path and not names (regardless of the path!)

Explain?

actually duplicates

Actual duplicates in what manner, hash, or by "sound" or... ?

Post by **Debugger** » Tue Nov 05, 2024 4:30 pm

Example duplicate files

H:\Low\叶一茜 - 风浪(衍 Electro Mix国语女).mp3
h:\hIGH\99659-叶一茜 - 风浪(衍 Electro remix)[music.].mp3

So it should remove the file in the Low path, because the files are the same song, only the one in Low has poorer quality.

No tool detects such duplicates:

Duplicate Cleaner Pro
Easy Duplicate Finder
dupeGuru
Duplicate Audio Finder

and others: ALWAYS 0 results!

It should find duplicates based on 3 important details, no matter how the file name is structured or what else the name contains:

黎恩 - 雨再来(Dj小 rmx
0876-黎恩 - 雨再来(Bas rmx

黎恩 - example Artist 1
雨再来 - Title
(Bas - Artist 2

xxx - xxx(xxx
Artist 1 - Title(Artist 2
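The `Artist 1 - Title(Artist 2` scheme above can be sketched as a key-extraction function. This is only a rough sketch under assumptions that are not stated in the thread: a leading run of digits followed by a hyphen is treated as a track-number prefix, Artist 2 is taken to be the first whitespace-delimited word inside the bracket, and both ASCII and full-width brackets are accepted. `NAME_RE` and `name_key` are hypothetical names.

```python
import re

# Hypothetical pattern for "Artist 1 - Title(Artist 2 ...".
# Assumptions: an optional numeric prefix such as "99659-" is noise, and
# Artist 2 is the first whitespace-delimited word after the opening bracket.
NAME_RE = re.compile(
    r'^(?:\d+\s*-\s*)?'               # optional track-number prefix
    r'(?P<artist1>.+?)\s+-\s+'        # Artist 1, up to the " - " separator
    r'(?P<title>[^((]+?)\s*'         # Title, up to the first bracket
    r'[((]\s*(?P<artist2>[^\s))]+)' # first word inside the bracket
)

def name_key(filename):
    """Return (artist1, title, artist2) in case-folded form, or None."""
    m = NAME_RE.match(filename)
    if not m:
        return None
    return tuple(m.group(g).casefold() for g in ('artist1', 'title', 'artist2'))

low = '叶一茜 - 风浪(衍 Electro Mix国语女).mp3'
high = '99659-叶一茜 - 风浪(衍 Electro remix)[music.].mp3'
print(name_key(low) == name_key(high))
```

With this key, the two 叶一茜 names above compare equal, while the two 黎恩 names get different keys because the word after the bracket differs (Dj小 vs. Bas). Whether those should count as duplicates depends on how Artist 2 is meant to be compared.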

Post by **therube** » Tue Nov 05, 2024 5:28 pm

And you're wanting to dupe on what?
On part of the name, so 叶一茜 - 风浪(衍 Electro Mix国语女).mp3 and 99659-叶一茜 - 风浪(衍 Electro remix)[music.].mp3

And then, after that, on the "sound" (or bit rates or directory location)?
Is partial name dupe sufficient, so, 叶一茜 - 风浪(衍 Electro?
Or name & bitrate, or name & bitrate & directory?

Duplicate Cleaner, Similar Audio, with an (audio) Match threshold of xx/100 won't find those two songs to be "the same", or are they too different to be the "same" (similar), even if they are the "same"?
(I don't know how Similar Audio works, how it determines "sameness"?)

?

Czkawka 6.0.0 is ready to download with some interesting changes:
- Add finding similar audio files by content - allows to find remixes, a little changed or shortened versions of music

https://github.com/qarmin/czkawka

(I've downloaded it, at some point, never gotten around to really messing with it.)

?

audiomatch

https://old.reddit.com/r/Python/comment ... d_similar/

https://superuser.com/questions/11586/s ... ng-to-them

https://dsp.stackexchange.com/questions ... rity-check

Post by **Debugger** » Tue Nov 05, 2024 6:02 pm

I have tried all the top tools on the internet. A single differing character is enough for them to fail to match a file. All the tools that people propose are useless for this: no matter what method I use, none of them finds just what needs to be removed.

Everything cannot sort by name across two folders (e.g. PATH:Folder|PATH2:Folder 2),

that is, sort by name regardless of the path.

It sorts by path instead of by name.

So even in Everything I cannot compare two paths and files with similar names.

The results must alternate between the folders, e.g.

C:Folder1
D:Folder2
C:Folder1
D:Folder2

Then I could compare the two folders by similar names, but it doesn't work in Everything.
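Outside Everything, the alternating order described above can be approximated by sorting full paths on the file name alone. A minimal sketch, assuming the paths are Windows-style strings and that a leading numeric prefix such as `0876-` should be ignored when sorting; `sort_key` and `sorted_by_name` are hypothetical helpers, not features of Everything.

```python
import re
from pathlib import PureWindowsPath

def sort_key(path):
    """Sort key: the file name only, without folder or numeric prefix."""
    name = PureWindowsPath(path).name
    name = re.sub(r'^\d+\s*-\s*', '', name)  # assumed noise, e.g. "0876-"
    return name.casefold()

def sorted_by_name(paths):
    """Return the paths ordered by file name, regardless of folder."""
    return sorted(paths, key=sort_key)

paths = [
    r'h:\hIGH\99659-叶一茜 - 风浪(衍 Electro remix)[music.].mp3',
    r'H:\Low\叶一茜 - 风浪(衍 Electro Mix国语女).mp3',
]
for p in sorted_by_name(paths):
    print(p)
```

Files from both folders that share the same artist and title end up next to each other in the output, which makes a manual side-by-side check easier.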

Post by **incans** » Tue Nov 05, 2024 7:58 pm

I think what you are looking for is a "fuzzy match" on the file basename only.
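As a rough illustration of a fuzzy match on the basename only, here is a sketch that uses `difflib.SequenceMatcher` from the Python standard library. The function names and the threshold of 0.6 are assumptions chosen for the example, not values recommended anywhere in this thread.

```python
from difflib import SequenceMatcher
from pathlib import PureWindowsPath

def basename_similarity(path_a, path_b):
    """Similarity (0.0 to 1.0) of two file names, ignoring folder and extension."""
    name_a = PureWindowsPath(path_a).stem.casefold()
    name_b = PureWindowsPath(path_b).stem.casefold()
    return SequenceMatcher(None, name_a, name_b).ratio()

def fuzzy_pairs(low_paths, high_paths, threshold=0.6):
    """Return (low, high, score) tuples whose score reaches the threshold."""
    pairs = []
    for low in low_paths:
        for high in high_paths:
            score = basename_similarity(low, high)
            if score >= threshold:
                pairs.append((low, high, score))
    # Best candidates first, so they can be reviewed before anything is deleted.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Every low-quality name is compared with every high-quality name, so the run time grows with the product of the two list sizes. The threshold trades missed pairs against false positives and would need tuning on real names.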

For what it's worth, the later Enterprise versions of Microsoft Excel (I think from 2013 and up) have a basic fuzzy search capability in the Power Query plugin, although it's not very well documented or easy to control. So, in principle, if you can get your candidate file lists exported into CSV, you could then "have a go" if you have access to the appropriate Excel license.

In my limited experience of playing with the Excel feature (in a different context to yours) I found it was remarkably hard to get a fuzzy match set up to reliably capture names that are "obviously" related as far as a human observer is concerned, without introducing a high level of false positives.

It's a surprisingly tricky task. My suggestion of a possible approach would be something like:

- Extract your "candidate" files into a structured table including key parameters such as path (your "(Low)" directory for example); filename; encoding; file size; artist; album; cover art etc.
- Preprocess the table content to remove common confusing factors, especially in file and artist names (e.g. "Remix", "Featuring"), and/or harmonise the way common terms are represented. E.g. "Junior" and "Jr." in an artist name mean the same to a human, but look very little alike to a fuzzy search algorithm, so these things need to be harmonised before the comparison.
- Feed the result into a tool with a decent fuzzy match capability, possibly as an extension to one of the well-known database tools such as MySQL/MariaDB.
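The preprocessing step above could look something like the following sketch. The replacement table is a made-up example and would have to be extended for real data; `REPLACEMENTS`, `harmonise` and `group_by_harmonised` are hypothetical names. After harmonising, the sketch simply groups exactly equal keys; a fuzzy matcher could be applied to the harmonised names instead.

```python
import re
import unicodedata
from collections import defaultdict
from pathlib import PureWindowsPath

# Hypothetical harmonisation table: map variants to one spelling.
REPLACEMENTS = {'junior': 'jr', 'featuring': 'feat', 'remix': 'rmx'}

def harmonise(filename):
    """Reduce a file name to lower-case words with common variants unified."""
    text = unicodedata.normalize('NFKC', filename).casefold()
    text = re.sub(r'\.mp3$', '', text)   # drop the extension
    words = re.findall(r'\w+', text)     # drop punctuation and brackets
    return ' '.join(REPLACEMENTS.get(word, word) for word in words)

def group_by_harmonised(paths):
    """Group full paths whose harmonised file names are identical."""
    groups = defaultdict(list)
    for path in paths:
        groups[harmonise(PureWindowsPath(path).name)].append(path)
    # Keep only the keys that occur more than once.
    return {key: found for key, found in groups.items() if len(found) > 1}

print(harmonise('Some Artist Junior - Song (Remix).mp3'))
print(harmonise('Some Artist Jr. - Song (rmx).mp3'))
```

Both example names reduce to the same string, so they would land in the same group even though the raw file names differ.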

As I said, it's a tricky task, and you'd think one of the "music library management" tools like MediaMonkey would have come up with a dedicated capability to handle it. But I know MediaMonkey (which I happen to use) only has very basic capabilities, based either on finding exactly matching (identical) file names across folders, or on some kind of audio comparison that I'm guessing will not be able to match the same track in a different encoding or bitrate.

voidtools forum
