Intelligent task vs Creative task
intelligent task
- ability to define an objective success metric that we can use to evaluate the quality of an algorithm
- object detection, speech recognition . . .
- objective success metric
- well-defined problem description (formalized)
creative tasks
- don't have an objective success metric
- can we formalize music? we can't . .
-> difficult to reproduce with a machine
GM Era
Big tech experiments era
AWS DeepComposer(Amazon, 2019)
Jukebox(OpenAI, 2020)
- raw-audio generation / advanced deep learning / full piece + lead vocals
- ! turning point !
Music AI hype (2023~)
generative AI (ChatGPT, DALL-E...)
Text-to-music
- MusicLM(Google, 2023)
- MusicGen(Meta, 2023)
Generative audio models
- Moûsai, AudioLDM, SingSong, RAVE 2, Riffusion . . . .
Second startup wave
- SOUNDRAW, Riffusion, boomy, beatoven.ai, WAVEAI . .
Aiva.ai
Classifying GM systems
- goal of system? : melody, chord progressions, full tracks, jazz improv, loops, drums, ... / video games, movies, ads, concerts, SNS, ...
- who's the users? : composers, songwriters, developers, consumers, researchers, marketing agency, ...
- how autonomous is the system? :
human-machine co-creation ----------- human supervision -------------- fully autonomous
- how is music generated?: machine learning, deep learning...
- how is music represented? : symbolic representation, media, forms, ....
how about ours?
- goal: full-tracks
- users: consumers? writers
- human-machine collab
- audio representation
Use cases
Text-to-music generation
- textual inputs (descriptions of music) -> generate music
- can be used by people on social media who have very little understanding of music (minimal human input)
- deep learning based
- Audio representation
- MusicLM, MusicGen, Mubert
Singing voice cloning
- Generate / clone voice
- producers, wanna-be musicians
- human-machine collab
- deep learning, audio representation
Automatic accompaniment
- Instrumental accompaniment of lead vocals
- Amateur musicians
- human-machine collab
- deep learning, rule-based techniques / symbolic representation
- Nootone
Sound synthesis
- generation of alien sounds
- mid / pro producers
- human-machine collab
- deep learning / audio representation
- NSynth(Google)
Open Source Research (The Sound of AI)
- voice-to-sound synthesizer
- community-driven research project
Representation
Symbolic representation
- Symbols (notes, instruments...)
- Similar to a score
- MIDI, MusicXML, Piano-roll, ABC notation...
- Discipline connections : music theory, composition, (computational) musicology
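A minimal sketch of a symbolic melody using music21 (the toolkit referenced later in these notes); the specific pitches and durations are just illustrative:

from music21 import note, stream

# Symbolic representation in practice: a melody as (pitch, duration) symbols,
# which music21 can render as a score or write out as MIDI.
melody = stream.Stream()
for pitch, quarter_length in [("C4", 1.0), ("E4", 1.0), ("G4", 2.0)]:
    melody.append(note.Note(pitch, quarterLength=quarter_length))

melody.show("text")                       # print the symbolic content
# melody.write("midi", fp="melody.mid")   # or export as a MIDI file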
Symbolic generation
- MuseNet(OpenAI, 2019) : GPT-2 architecture / trained on MIDI files / predicts the next token
- Pros : compact. easy to manipulate. clear, precise. lots of compositional info. captures long-term dependencies. small models
- Cons: oversimplified. musical limitations. limited performance info. no production info. output isn't audio.
- when is it ideal? : when structure + composition is the focus, notated western music (classical, jazz, ...)
- when isn't it ideal? : when performance + production is the focus. EDM, drone, ... beauty not captured in the notes...
Audio representation
- waveform, spectrogram, audio embeddings, music cognition
- Audio generation
- example audio-based models: Jukebox, MusicLM, MusicGen, RAVE
- Pros : Lots of performance/production info, complex and rich, Audio output
- Cons: large dimension/size, difficult to manipulate. no compositional info. model size is big (high compute requirements), difficult to capture long-term dependencies
* A good music representation solves 50% of GM
* Symbolic is like a score, audio is like a waveform
* Symbolic suits music with lots of compositional detail; audio suits music with lots of performance detail
Generative music taxonomy
- traditional(symbolic) : Symbolic AI, Optimization, Complex systems, Statistical methods
- cutting edge(symbolic + audio) : Deep learning
Deep learning
- Artificial neural nets
- Learn from massive datasets
- Imitate target style
- Audio / symbolic generation
- Computationally demanding
- Learn long-term dependencies
- No manual input
- Architectures
ㄴ Recurrent neural nets (DeepBach, symbolic), Variational autoencoders (Jukebox), Diffusion models (Riffusion), Transformers (MusicGen) - the last three are audio
Limitations
Text-to-music (MusicLM, MusicGen, ...)
- Long-term structure
- Audio fidelity
- Semantic mapping
- Minimal creative control (the expectation that it will just work if you ignore musical knowledge and feed in data)
Deep learning models
- Music is high-dimensional (harmony, melody, rhythm, ...)
- Network can't learn all dimensions
- DL model has no musical knowledge
- Massive datasets
- Lack of musical coherence
- Black box -> difficult to steer
Solving the curse of DL?
- hybrid systems
- merge DL and symbolic AI (Neuro-Symbolic Integration)
Music representation
- audio is too complex, symbolic is too simple
- no representation captures all music details efficiently
ㄴ hybrid symbolic + audio representations
ㄴ embeddings(symbolic + audio + context)
ㄴ custom representations
Grammar-based generation (formal grammars)
T = {C, D, E, F, G, A, B, Whole, Half, Quarter}
N = {Melody, Phrase, Pitch, Duration}
S = Melody
P = {
Melody -> Phrase Phrase
Phrase -> Pitch Duration | Pitch Pitch Duration
Pitch -> C | D | E | F | G | A | B
Duration -> Whole | Half | Quarter
}
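A minimal sketch of expanding this grammar into a melody (random rule choice is one simple strategy; the symbol names follow the definition above):

import random

# Toy generative grammar from above: expand the start symbol S = Melody
# by rewriting nonterminals until only terminals remain.
PRODUCTIONS = {
    "Melody": [["Phrase", "Phrase"]],
    "Phrase": [["Pitch", "Duration"], ["Pitch", "Pitch", "Duration"]],
    "Pitch": [["C"], ["D"], ["E"], ["F"], ["G"], ["A"], ["B"]],
    "Duration": [["Whole"], ["Half"], ["Quarter"]],
}

def expand(symbol):
    if symbol not in PRODUCTIONS:              # terminal symbol
        return [symbol]
    rule = random.choice(PRODUCTIONS[symbol])  # pick a production at random
    return [t for s in rule for t in expand(s)]

print(expand("Melody"))  # e.g. ['E', 'Half', 'C', 'G', 'Quarter']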
- How do we determine the production rules?
ㄴ Extract manually(music theory)
ㄴ Learn from dataset
Generative tasks
- Melody generation, Chord progressions, Music structure, Full track generation ....
Designing a generative grammar
- Finding the correct music representation is key
- What musical dimensions ?
- what do symbols represent ?
Lindenmayer system (L-system) (a type of formal grammar)
- used to produce musical output
- applies all production rules at once at each iteration
*L-System for chord generation
- A(alphabet) = {A,B,C,D,E,F,G}
- S(axiom) = A
- P = {
A->ABC
B->BA
C->EF
F->GFD
}
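A minimal sketch of running this L-system (symbols without a rule are copied unchanged; the iteration count is arbitrary):

# L-system expansion: rewrite every symbol in parallel at each iteration.
RULES = {"A": "ABC", "B": "BA", "C": "EF", "F": "GFD"}

def iterate(axiom, steps):
    s = axiom
    for _ in range(steps):
        s = "".join(RULES.get(ch, ch) for ch in s)  # symbols without a rule stay fixed
    return s

print(iterate("A", 3))  # each output symbol can be mapped to a chord root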
music21: a Toolkit for Computer-Aided Musicology (web.mit.edu) - a set of tools for helping scholars and other active listeners answer questions about music quickly and simply
Markov chain (MC)
- mathematical system that undergoes transitions from one state to another
- models sequences of events probabilistically
* The next state depends only on the current state
(independent of the earlier history of the sequence)
- states: the possible conditions
- initial probabilities : likelihood of starting the sequence in a state
- transition probabilities: likelihood of moving from one state to another
Modelling music with MCs
- melody : sequence of notes (parameter : duration + pitch)
- generate a melody based on the probability of one note following another
- chord progression : sequence of chords
- generate a chord based on the probability of one chord following another
ex) C major pentatonic scale
/ simplifications : pitches in one octave / focus on pitch (ignore duration for now)
- S = {C, D, E, G, A}
- Ip = (pC pD pE pG pA) (vector of initial probabilities, one per pitch)
- Tp = (pCC pCD pCE ... pDC pDD ... pEC ... pGC ... pAC ...) (5x5 matrix of transition probabilities over all pitch pairs)
1) First pitch
- use the Ip vector -> roll dice -> get pitch from Ip
2) Subsequent pitches
- use the Tp matrix -> go to the row of the current pitch (if E was picked, the E row: pEC pED pEE pEG pEA) -> roll dice -> get new pitch (i.e., determined only by the current pitch)
3) repeat until the end (see the sketch below)
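A minimal sketch of this sampling loop; the probability values are made up for illustration:

import random

STATES = ["C", "D", "E", "G", "A"]
IP = [0.4, 0.1, 0.2, 0.2, 0.1]      # initial probabilities (illustrative)
TP = {                              # transition probabilities per current pitch (illustrative)
    "C": [0.1, 0.3, 0.3, 0.2, 0.1],
    "D": [0.3, 0.1, 0.3, 0.2, 0.1],
    "E": [0.2, 0.3, 0.1, 0.3, 0.1],
    "G": [0.2, 0.1, 0.3, 0.1, 0.3],
    "A": [0.3, 0.1, 0.2, 0.3, 0.1],
}

def generate_melody(length):
    pitch = random.choices(STATES, weights=IP)[0]   # 1) first pitch from Ip
    melody = [pitch]
    for _ in range(length - 1):                     # 2) next pitch from the current pitch's row of Tp
        pitch = random.choices(STATES, weights=TP[pitch])[0]
        melody.append(pitch)
    return melody                                   # 3) repeat until the end

print(generate_melody(8))  # e.g. ['C', 'E', 'G', 'E', 'D', 'C', 'D', 'E']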
+) other models with MCs
- Rhythms, Octaves, Dynamics (piano, fortissimo, ...), Simple melodic patterns, Instrumentation, Articulations (staccato, legato, ...), Form, ...
* Two modelling approaches
- with multiple parameters, use 1 MC per parameter
-> makes the problem more tractable
Pros
- Simple, Flexible, Fun and creative(..?), OK for ambient
Cons
- Random walk, Lack of musical context, Bad for genres with strong musical direction
Melody generation with MC
a melody generator using an MC model (training song: Twinkle Twinkle Little Star, ...)
Cellular automata(CA)
- models used to simulate complex systems using rules on a grid of cells
- cells change state based on their own and neighbors' states
- complex patterns emerge from simple rules
CA formalisation
- Grid : line of cells(1D), plane of cells(2D)
- Cell: each cell is identified by its row + column position
- States : each cell can be in one of a finite number of states
- Neighborhood : set of cells around cell whose states influence the cell's next state
- Transition rules
a) dictate how the state of a cell changes
b) functions of the states of the cell and its neighbors at time t to determine the state at time t+1
- Initial conditions: initial states of the grid
CA for music generation
1) Map axes to different musical params (pitch, inst, time...)
2) Assign states to musical events (on/off, pitches...)
3) Design rules for musical evolution - may or may not be music-based rules
4) Map time (e.g. 1 beat = 1 step)
e.g.) time on the x-axis, drum parts on the y-axis (floor, hihat, snare, kick, ...)
e.g.) melody generation by CA
States={C,D,E,F,G,A,None(rest)}
time on the x-axis, one state per cell
e.g.) expressive chord generation
States={pp, p, mf, f, ff, None}
pitch on the x-axis (C, D, E, G, A), instrument on the y-axis (synth, piano, organ, ...)
Music strategies for CA
- Generate entire score
- Guideline for improvisation
- Integrate CA-generated instruments into a composition
- Pros: Flexible, Experimentation, OK for raw material
- Cons: Bad musical output, No music knowledge(just mechanism)
Drum Generation with Cellular Automata
States={ON, OFF}
xaxis: time, yaxis: hihat / snare / kick
transition rules: Syncopation resolution / filling gaps / accenting / mutation
CellularAutomatonDrumGenerator
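A minimal sketch of such a generator (the class name comes from these notes; the transition rules here are simplified stand-ins for the syncopation / gap-filling / accent / mutation rules listed above):

import random

class CellularAutomatonDrumGenerator:
    # Grid: rows = drum parts, columns = time steps; states = ON/OFF (1/0).
    def __init__(self, steps=16, parts=("hihat", "snare", "kick"), mutation_rate=0.05):
        self.steps = steps
        self.mutation_rate = mutation_rate
        self.grid = {p: [random.randint(0, 1) for _ in range(steps)] for p in parts}

    def _next_cell(self, row, i):
        left, me, right = row[i - 1], row[i], row[(i + 1) % len(row)]  # wrap around
        if me == 0 and left == 1 and right == 1:
            return 1        # fill gaps between two hits
        if me == 1 and left == 1 and right == 1:
            return 0        # break up long runs (crude syncopation)
        return me

    def step(self):
        for part, row in self.grid.items():
            new_row = [self._next_cell(row, i) for i in range(self.steps)]
            for i in range(self.steps):          # rare random flips keep the pattern evolving
                if random.random() < self.mutation_rate:
                    new_row[i] ^= 1
            self.grid[part] = new_row

gen = CellularAutomatonDrumGenerator()
for _ in range(4):                               # evolve the pattern for a few generations
    gen.step()
for part, row in gen.grid.items():
    print(f"{part:>5}: " + "".join("x" if s else "." for s in row))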
Genetic algorithms (GA)
- Optimization techniques inspired by the process of natural selection (with limited resources, the fittest individuals survive; over many generations, these processes can result in adaptation and the evolution of species)
- solutions evolve over generations to optimize a specific objective
- Aerospace design, routing problems, DNA sequence alignment, Art/music generation, ....
Formalising GA
- Population : a set of candidate solutions (individuals)
- Chromosomes : encoded version of the candidate solution (e.g., designing a table: five legs, 80 cm tall, red, etc.)
- Fitness function : Measures how effective a solution is
- Genetic operators
1) Selection : choose fittest individuals for producing offspring
- the fitter an individual, the higher its likelihood of being selected
- roulette wheel selection, ...
2) Crossover : combine the genetic info of two parents to produce new offspring
- offspring exchange parts of the parents' genes. one-point crossover, two-point crossover, ...
- creates genetic diversity, can lead to new solutions
3) Mutation : introduce variation into the offspring's genetic makeup
- random changes are made to parts of the genetic code of the offspring
- mutation rate is kept low to avoid degenerating into random search
Pros: Flexible, Explore unconventional ideas, Good results
Cons: Crafting the fitness function is complex, Subjectivity
GA step by step
Create initial population (randomly, by criteria, ...) -> Evaluate fitness -> Select parents -> Generate offspring (crossover) -> Mutate offspring -> Replace population -> Check termination condition (see the sketch below)
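A minimal sketch of this loop for evolving melodies (the toy fitness rewards stepwise motion; truncation selection is used instead of roulette wheel for brevity):

import random

PITCHES = ["C4", "D4", "E4", "F4", "G4", "A4", "B4"]

def fitness(melody):
    # Toy fitness: count moves between adjacent scale degrees (smooth contour).
    steps = [abs(PITCHES.index(a) - PITCHES.index(b)) for a, b in zip(melody, melody[1:])]
    return sum(1 for s in steps if s == 1)

def crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)       # one-point crossover
    return p1[:point] + p2[point:]

def mutate(melody, rate=0.1):
    return [random.choice(PITCHES) if random.random() < rate else n for n in melody]

def evolve(pop_size=50, length=16, generations=100):
    population = [[random.choice(PITCHES) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]    # truncation selection
        population = [mutate(crossover(*random.sample(parents, 2))) for _ in range(pop_size)]
    return max(population, key=fitness)

print(evolve())  # fittest melody found, as a list of pitch names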
What are GAs good for?
Problems where traditional optimization techniques fail
Large, complex, multimodal spaces
Diversity and adaptation
GA for music generation
1) Encode music elements as chromosomes
- melody : 1 note per gene, pitch+duration (C4 - 0.5, D4 - 1.0, ...)
- chords : 1 chord per gene (Cm, Dm, D, ...)
- sound synthesis : 1 synth parameter per gene (cut-off frequency 0.34, reverb 0.46, delay 0.22, ...)
2) Craft the fitness function
- evaluates the aesthetic value of a composition
- infer from music theory
- learn from data
- subjective : what is a good melody?
e.g.) for melody
Linear combination of multiple criteria: scale conformity (SC), melodic contour (MC), rhythmic variation (RV), dissonance resolution (DR) - see the sketch below
F = w1 * SC + w2 * MC + w3 * RV + w4 * DR
3) Run the algorithm
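A sketch of the weighted fitness from step 2 (the weights and criterion functions are placeholders; each criterion is assumed to score a melody in [0, 1]):

WEIGHTS = {"scale_conformity": 0.4, "melodic_contour": 0.3,
           "rhythmic_variation": 0.2, "dissonance_resolution": 0.1}

def weighted_fitness(melody, criteria):
    # criteria: dict mapping criterion name -> function(melody) -> score in [0, 1]
    return sum(WEIGHTS[name] * criteria[name](melody) for name in WEIGHTS)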
Melody Harmonization with GA
MelodyData (dataclass holding the input song's info)
GeneticMelodyHarmonizer (runs the generations)
FitnessEvaluator (fitness function)
Transformer
Text-to-music Generation with Mustango
- The MusicBench dataset : audio tracks with corresponding natural language descriptions of music features
- Architecture
Two components : Latent Diffusion Model / MuNET
1) Latent Diffusion Model
(Audio) -> Audio Encoder -> Diffusion Model (Forward / Reverse Process) -> Audio Decoder -> (Audio output)
2) MuNet (Music Domain Knowledge Informed UNet)
- conditioning of the audio synthesis
- beat timestamps : t1, t1+t2, t1+t2+t3, ... up to the max beat
https://replicate.com/declare-lab/mustango
declare-lab/mustango - Run with an API on Replicate (runs on Nvidia A40 (Large) GPU hardware; predictions typically complete within 4 minutes)
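A sketch of calling the hosted model with the Replicate Python client; the input key "prompt" and the example text are assumptions, so check the model page above for the exact schema:

import replicate  # pip install replicate; requires REPLICATE_API_TOKEN to be set

# NOTE: the "prompt" input key is an assumption, not confirmed from the model page.
output = replicate.run(
    "declare-lab/mustango",
    input={"prompt": "a calm lo-fi beat with soft piano, 80 bpm"},
)
print(output)  # typically a URL to the generated audio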
Want to read authoritative writing?
- works whose copyright has expired (a contest?) / excerpts from contemporary novels / once data accumulates, recommend works that similar users liked or upvoted
- authoritative content as the baseline / amateur writing as an additional feature?
- surface authoritative content by default, plus user participation