Converting Recorded Sound to MIDI using Fast Fourier Transforms

An Android app named FFT Midi Converter attempts something quite ambitious: converting any recording into MIDI format. To find out more about MIDI and the standard that defines it, see my blog article on Pitch Shift Calculations. Effectively, MIDI is an abstracted representation of music (or sound) built from discrete tones. Turning the complex audio signal of a live recording into MIDI will always be an approximate process, one that deals in assumptions, abstractions and inaccuracies. The app itself runs various band-pass filters over the initial recording to try to isolate the frequencies with the largest amplitudes, cutting out as much of the background noise as possible.

Once these adaptive noise filters have been run over the recorded sound data, the entire recording is processed through a Fast Fourier Transform (FFT) algorithm. This will be very familiar to any electrical or Digital Signal Processing (DSP) engineer, but is perhaps a bit esoteric to regular users. I'll attempt to explain it briefly:

The waveform in the top half of Figure 1 is a classic slice-in-time representation of the recorded signal's amplitude. The default input sample rate of the phone's microphone (the phone in this case is a Nexus 5) is 8000 Hz, or 8000 samples per second. The display updates in real time to reflect the waveform currently being recorded, and the tone it corresponds to is extracted as it goes; for the snippet shown, an instantaneous FFT has calculated a frequency of 179.7 Hz. The FFTs are calculated with a window size of 1024 samples, which means that the entire recording is divided into 1024-sample chunks that are each processed with an FFT in near-real time. The simplified output waveform is then displayed, which is what can be seen in the top half of Figure 1.
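To make the windowing arithmetic concrete, here is a minimal Python sketch. The sample rate and window size are the figures quoted above; the helper name `split_into_windows`, the use of non-overlapping blocks and the dropping of a trailing partial block are assumptions for illustration, not details taken from the app.

```python
SAMPLE_RATE_HZ = 8000   # default microphone rate quoted in the article
WINDOW_SIZE = 1024      # FFT window size quoted in the article


def split_into_windows(samples, window_size=WINDOW_SIZE):
    """Yield consecutive, non-overlapping blocks of window_size samples.

    A trailing partial block is dropped. This is an assumption: a real
    app might zero-pad it or use overlapping windows instead.
    """
    for start in range(0, len(samples) - window_size + 1, window_size):
        yield samples[start:start + window_size]


# Time covered by one window, and the spacing between FFT frequency bins.
window_duration_s = WINDOW_SIZE / SAMPLE_RATE_HZ   # 1024 / 8000 = 0.128 s
bin_spacing_hz = SAMPLE_RATE_HZ / WINDOW_SIZE      # 8000 / 1024 = 7.8125 Hz

# Three seconds of silence as a stand-in for a real recording.
recording = [0.0] * (3 * SAMPLE_RATE_HZ)
windows = list(split_into_windows(recording))
```

With these numbers each window covers 0.128 seconds and the FFT can only distinguish frequencies about 7.8 Hz apart, which is one reason the conversion to discrete tones is approximate, especially for low notes.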

The bottom half is effectively a histogram of the magnitude of each frequency component in that 1024-sample block. Each frequency corresponds to a tone. In Figure 1 there is a large peak at 179.7 Hz, so this is deemed to be the primary frequency (and hence tone) for that 1024-sample segment. At 8000 samples per second, each segment lasts 1024 / 8000 = 0.128 seconds. FFTs convert raw signal data (amplitude vs. time) into the frequency domain, where its constituent frequencies can be represented as a histogram, as in the bottom half of Figure 1.
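The peak-picking step can be sketched in a few lines of Python. This is not the app's code: it is a textbook recursive radix-2 FFT using only the standard library, and the function names `fft` and `dominant_frequency` are made up for illustration. The test tone is placed exactly on bin 23 of a 1024-point FFT at 8000 Hz, which is 179.6875 Hz; that would round to the 179.7 Hz shown in Figure 1 if the app reports bin centre frequencies (an assumption).

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(window, sample_rate_hz):
    """Return the centre frequency (Hz) of the strongest non-DC bin."""
    n = len(window)
    spectrum = fft(window)
    # Real input gives a mirrored spectrum, so only bins 1 .. n/2 are needed.
    magnitudes = [abs(spectrum[k]) for k in range(1, n // 2 + 1)]
    peak_bin = 1 + max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return peak_bin * sample_rate_hz / n


SAMPLE_RATE_HZ = 8000
WINDOW_SIZE = 1024

# Synthetic sine wave sitting exactly on bin 23 (179.6875 Hz).
tone_hz = 23 * SAMPLE_RATE_HZ / WINDOW_SIZE
window = [math.sin(2 * math.pi * tone_hz * i / SAMPLE_RATE_HZ)
          for i in range(WINDOW_SIZE)]
peak_hz = dominant_frequency(window, SAMPLE_RATE_HZ)
```

A real recording will rarely land exactly on a bin, so production code would typically apply a window function and interpolate between neighbouring bins; both are left out here to keep the sketch short.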

Figure 1: FFT Midi Converter Recording Screen

Figure 2: After recording a clip of sound, it can then be processed into a MIDI format

Figure 3: The MIDI output can then be played and, if desired, saved.

The MIDI sound file will probably not sound very much like whatever was recorded, as the conversion does not capture varying loudness, timbre and so on. However, if a very simple melody is played on a clean instrument (e.g. a piano with low reverb), it will be converted faithfully into the constituent tones (musical notes) that it comprises. In Figure 3, behind the popup player, the FFT histogram shows multiple peaks, which means multiple tones are being played at once (polyphony).
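Mapping a detected frequency onto a musical note is a small calculation. The sketch below uses the standard equal-temperament relationship, in which MIDI note 69 is A4 at 440 Hz and each semitone is a factor of 2^(1/12). Whether the app uses exactly this formula is an assumption, the function names are hypothetical, and the octave labels follow the common convention that note 60 is C4 (some software calls it C3).

```python
import math

A4_HZ = 440.0    # reference pitch for equal temperament
A4_MIDI = 69     # MIDI note number assigned to A4

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]


def frequency_to_midi_note(freq_hz):
    """Return the nearest MIDI note number for a frequency in Hz."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return round(A4_MIDI + 12 * math.log2(freq_hz / A4_HZ))


def midi_note_name(note):
    """Return a name such as 'C4', assuming MIDI note 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


middle_c = frequency_to_midi_note(261.63)   # frequency of middle C
middle_c_name = midi_note_name(middle_c)
```

For a polyphonic passage the same mapping would be applied to each significant peak in the histogram rather than only the largest one.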

What can the MIDI files be used for? Well, music production in the digital age relies on discrete tones and time intervals, and the MIDI sequence can be imported directly onto an instrument in any decent music production software. That turns it back into something truly musical, with all of the dimensions that make music enjoyable (timbre, loudness, reverb, contour and even spatial location).

incidentNormal._github.io
