There is no rule for why high vowels generally sound better coming first in the case of ablaut reduplication, which is the official name for things like hip-hop and flip-flop. Sure, you can say it simply sounds weird the other way, but why? It's a bad linguist who's satisfied with "Well, that's just how it do."
A popular explanation is Optimality Theory, which is sort of a catchall for inexplicable linguistic phenomena. To paraphrase, it says that speakers do the shit they do because it's the laziest way to do it ("optimal" here meaning "least amount of effort"). You have to work fractionally less to say tick-tock or clip-clop than you do to say tock-tick or clop-clip. So we just evolved an affinity for the sound of the laziest way to do things.
But then there's another theory which says that words for things that are spatially nearer to the speaker usually have higher vowels (me versus you, here versus there, this versus that). This sounds pretty dumb until you learn that it actually holds water across different languages. For example, in French, "I" is je, "you" is tu, "this" is ce, and "that" is ça. In German, it's ich/du, hier/da, dies/das. You get the idea, which is good, because we've exhausted my remedial knowledge of non-English languages. Now if you combine that fun party trick with the fact that English is read left to right, it sort of tracks that we would prefer to read a "near-sounding" word before a "far-sounding" word.
If that sounds like an extremely slipshod attempt at an explanation, welcome to the wacky world of social sciences, baby.