Step #2 - Combine Multiple Datasets into One

In many cases, the data needed for a statistical analysis come from different sources. For example, if you want to analyze international growth, you might find economic indicators in a World Bank dataset, political indicators in datasets from think tanks such as Freedom House, and climate data somewhere else entirely. Another common case is a single dataset that is divided into multiple files. In this post I will try to elaborate a bit on how to make it all work together.

Types of Datasets Combinations

There are actually two main types of combinations:
  1. "Vertical" combination - You want to do this when you want to add observations from one file to another file. For instance, if you are working on a sports statistics project and you have data for players performance in four separate files, one for each year between 2001 and 2004. Another possibility is that the data is separated according to different leagues, groups, etc. As long as the variables in the files are the same and the only thing you need to do is to add observations, this is vertical combination. The command in Stata we will use is append. We will explore this command later.


  2. "Horizontal" combination - This is the kind of combinations in which you want to add variables, and not observations. The observations appear in both files (at least most of them), but in each file there is different information about them. For example, if we're dealing with high school students and we have one file with their personal information and grades, and another file with SAT scores only. If we have an identifying variable in both files (e.g Social Security Number), we can assign each student his/her SAT score. This example is a One-to-One matching. There are three types of matches of this kind:

    1. One-to-One matching: If the identifying variable which appears in the files is unique in both files, then it's a one-to-one match. Unique means that for each value of this variable, there is only one observation that contains it. In the figure below, country is the identifying variable. In both datasets, each country has only one observation.



    2. One-to-Many matching: If the identifying variable is unique in one file, but not in the other, then it's a one-to-many match. This is very common when you have groups of observations in one file (the file in which the identifying variable is not unique) and information regarding each group in the other file. The following figure will make it clearer:



    As you can see, one can group the individuals into households. The household identifying variable (fam_ID) is common to both files. It is not unique in the individuals file, but it is unique in the households file. This enables Stata to assign the same value of each of the household variables to all the members of the household. Note that although we have a unique identifier for the individuals (indiv_ID), it is irrelevant for this merge.

    3. Many-to-Many matching: This is very rare. It is also problematic, since there is no unambiguous rule for assigning values from observations in one file to observations in the other. I will not elaborate on this matching too much.

Commands Syntax

There are three commands you should know if you want to combine datasets: append, merge and joinby. All three of them combine the dataset currently in memory with data from a file you specify. We will name the data in memory "Master Data" and the data to combine from the specified file "Using Data". It will be clear why we use the word Using here.

Append

The append command does what we called "vertical" combination. It adds observations. Its syntax, in a simple form (for options not covered in this tutorial, you can always type help append in the command line to explore more about the command), goes like this:

append using <filename>

Example:

append using "C:\more_observations.dta"

append using "C:\more_observations" // (this is equivalent)

This will add the observations from the file C:\more_observations.dta to the data in memory. In case no extension is specified (i.e. no .dta at the end of the filename), Stata assumes it's .dta, so you can omit it.

Now you understand why we call the data in C:\more_observations.dta "Using Data".

What happens if you have variables in the Master Data which do not exist in the Using Data? The observations from the Using Data will be assigned missing values in those variables. If there are additional variables in the Using Data which do not appear in the Master Data, the observations from the Master Data will have missing values in them.

Tip: Before you append, you might want to make sure you can tell the source file of each observation. For example, if you append 2008 data to the 2007 data currently in memory, make sure you have a year variable in each of the datasets prior to incorporating the Using Data.
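For example, a minimal sketch of this tip (the filenames data2007.dta and data2008.dta are made up for illustration):

use "C:\data2008", clear
generate year = 2008            // tag the Using Data before saving it
save "C:\data2008", replace
use "C:\data2007", clear
generate year = 2007            // tag the Master Data
append using "C:\data2008"
tab year                        // every observation now reports its source year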

Merge

For "horizontal" combination of datasets you will need either merge or joinby. The difference between them is the method they use in order to do the merging, but in one-to-one or one-to-many merges, they give almost the same functionality. We will start with the merge command. The syntax, in its simplest form, is:

merge <identifying variable(s)> using <filename>

Examples:

(1)
use "D:\geography", clear // Assumes D:\geography.dta"
merge country using "D:\economy"
// Assumes "D:\economy.dta"

(2)
merge fam_id using "K:\households.dta"

(3)
merge state year using "K:\USA_data\precipitation.dta"

In the first example, Stata first loads observations from a file called geography and then matches them to observations in the economy.dta file. This will do what the figure in the one-to-one section above shows.
Note: anything that comes after the double forward-slash (//) is ignored by Stata; it is used to make the code clearer to the human reader.

In the second example, assume the individuals dataset is already in memory. I tried to do what the figure in the one-to-many section above shows. Notice that there is no difference in the syntax of the command. The only difference is in the structure of the files you are operating on.

In the third example, I wanted to show that you can use more than one identifying variable. When only the combination of variables is unique (and you want to identify observations uniquely), you can specify all of them. In this example, suppose you have data on a state-year basis (this is called panel data, because the same subjects reappear in different periods) - say, car accident data (number of accidents, injuries, etc.) - and you need to add data about the weather conditions in each state and year. You then need to tell Stata to match the datasets according to both state and year.

Important: The merge command requires that both the Master and Using Data be sorted by the identifying variables. If the Master Data isn't sorted, run sort <identifying variable(s)> before the merge command. If the Using Data isn't sorted, open it first (use <filename>, clear), then run the sort command, then save it (save <filename>, replace), open the Master Data and run the merge command. Here's an example:

use "D:\economy", clear
sort country
save "D:\economy", replace
use "D:\geography", clear
sort country
merge country using "D:\economy"

1) Since you saved D:\economy.dta sorted in the third line, you will not need to open and sort it again in future runs.
2) If you are doing a one-to-one match (i.e. if the identifying variable(s) are unique in both datasets), you can run the merge command with the sort option. It will automatically sort the datasets within the merge command. The sort option will not work if the identifying variables are not unique.
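For instance, a minimal sketch of the sort option applied to the one-to-one example from above:

use "D:\geography", clear
merge country using "D:\economy", sort   // merge sorts both datasets by country itself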

The _merge variable:

The merge command automatically creates a variable named _merge, which contains information regarding the observation's existence in each of the two datasets. In the simple cases I mentioned above, it will contain, for each of the observations, one of the following values:
1 => the observation (the identifying variable(s) values) appeared only in the Master Data
2 => the observation (the identifying variable(s) values) appeared only in the Using Data
3 => the observation (the identifying variable(s) values) appeared in both datasets

It is up to you to decide what to do with each of these cases. In some projects you will not want observations with the value 2 in the _merge variable. Take example 2 above: if you have household data in the Using Data but your interest is individuals (in the Master Data), you don't need observations with household data but no individuals linked to them. If you want to get rid of them, you can either type drop if _merge == 2 after the merge command, or, even better, run the merge command with the option nokeep. That is:
merge fam_id using "K:\households.dta", nokeep

You can also decide that observations in the Master Data that have no corresponding observations in the Using Data are irrelevant for your research. In that case there is no special merge option, so you need to add the command drop if _merge == 1 after the merge command.
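For example, continuing example 2 above (a sketch only):

merge fam_id using "K:\households.dta"
drop if _merge == 1    // individuals whose household was not found in the Using Data
drop _merge            // so that a later merge can recreate the variable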

Other options of interest

update and replace

What happens if you have some overlap between the variables in the files? Say, when you are merging data from the CIA World Factbook and the World Bank, you might have GNI in both datasets. If you specify neither option, Stata will keep the values that were in the Master Data (in memory). If you specify the options update replace (replace can't be specified without update), Stata will instead take the values in the Using Data and put them in place of the Master Data values. If you specify just update (without replace), however, Stata will put the Using Data values only in observations where the Master Data values are missing.

So in case you have the same variable but different values, use neither option when you think the Master Data is more reliable. Use the update replace options if you think the Using Data is more reliable. If they are equally reliable, use just update.

If you specified the update option, _merge will contain 5 possible values:
1 => the observation (the identifying variable(s) values) appeared only in the Master Data
2 => the observation (the identifying variable(s) values) appeared only in the Using Data
3 => the observation (the identifying variable(s) values) appeared in both datasets and the values are the same in both
4 => the observation (the identifying variable(s) values) appeared in both datasets and the value in the Master Data is missing.
5 => the observation (the identifying variable(s) values) appeared in both datasets but the values in the datasets are not missing and not the same.

Examples:

merge country using "D:\Economy", update replace

merge id using "K:\second_version", update

keep

If you want only some variables to be merged, instead of all of them, you can specify keep().

Example:

merge country year using "F:\intl_health_stats.dta", keep(birth_rate death_rate)

unique, uniqmaster, uniqusing and sort

In order to make sure the one-to-one or one-to-many matches are really unambiguously defined, you can have Stata verify that the identifying variables are unique in the Master Data (uniqmaster), in the Using Data (uniqusing) or in both datasets (unique). Specifying these options is highly recommended, even though they don't change the functionality: their main contribution is to make Stata print an error and exit if what you think is unique is not really unique. The sort option can make the merge command sort the datasets on its own, but it is only possible if you're running a one-to-one match (in other words, sort implies unique).
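For instance, a minimal sketch of a one-to-many merge that asks Stata to verify uniqueness on the Using side (the individuals filename here is hypothetical):

use "K:\individuals.dta", clear    // hypothetical filename for the individuals file
sort fam_id
merge fam_id using "K:\households.dta", uniqusing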

More than one dataset

You can merge more than one file in a single command: instead of specifying one filename after using, you can list several. Unless the nosummary option is specified, the command will create _merge1, _merge2, ... , _mergen variables, where _mergek equals 1 if the k-th Using dataset contained the observation and 0 otherwise. The _merge variable will still be there, but now the value 3 means that the observation appeared in at least one of the Using datasets.

Personally, I prefer running the merge command iteratively, adding one dataset at a time. It requires dropping the _merge variable each time, and it might take longer, but I can better report and deal with the merging outcomes.
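A rough sketch of that iterative approach (the base and population filenames here are made up):

use "D:\base", clear                    // hypothetical master file
sort country
merge country using "D:\economy"
tab _merge                              // inspect the outcome of this step
drop _merge                             // merge refuses to run if _merge already exists
sort country
merge country using "D:\population"     // hypothetical second Using file
tab _merge
drop _merge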

Joinby

The joinby command does almost the same job merge does, but its internal working is different, so there might be differences in terms of processing time. Its main difference arises when you're dealing with many-to-many matches, but it can be used for one-to-one and one-to-many matches too. The simple syntax is:

joinby <identifying variable(s)> using <filename>

Example:

joinby country using "D:\economy"

Unlike merge, the default of joinby is to drop all observations that do not appear in both datasets. In order to keep those observations, you need to use the unmatched() option. This option has four possible variations:

  • unmatched(none) - Keep none of the unmatched observations (this is the default)
  • unmatched(master) - Keep observations in Master Data that have no match in Using Data (but not vice versa)
  • unmatched(using) - Keep observations from Using Data that have no match in Master Data (but not vice versa)
  • unmatched(both) - Keep all unmatched observations, from both Using and Master Data

So if you want to do the same thing as in the first example of the merge command, use the following command:

joinby country using "D:\economy", unmatched(both)

There is no need for the datasets to be sorted by the identifying variable(s), which is an advantage over merge.

The update and replace options are available for joinby too.

As I said, more details with:

help joinby

Many-to-Many Merge

Although I have never needed it, this is where merge and joinby will give you totally different results. The question is how to match values from one dataset to the other. I think the best way to explain the difference between the commands is graphically:

Now you can understand the meaning of the sentence describing the joinby command in the help reference: "Form all pairwise combinations within groups".
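Since the figure may not come through here, a tiny made-up example (all names and values are invented) can illustrate what joinby does when the key is duplicated on both sides:

clear
input str1 id x
"A" 1
"A" 2
end
sort id
save master_demo, replace       // hypothetical temporary file

clear
input str1 id y
"A" 10
"A" 20
end
sort id
save using_demo, replace        // hypothetical temporary file

use master_demo, clear
joinby id using using_demo
list                            // 4 rows: every x is paired with every y within id "A"

With merge, by contrast, the duplicated observations are roughly paired off in the order they appear, so you would end up with only two matched rows here - which is exactly why many-to-many matches are ambiguous.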

Conclusion

If you want to add observations: append.
If you want to add variables: merge or joinby.

As always, before you celebrate, make sure you got the combination of the files right by looking at the means, counts, minimum and maximum values (the sum command) and tabulations (the tab command). Take a special look at the _merge variable. Look for missing values or other outlying observations. If you have too many of them, you might have made a mistake along the way. Browse the data a bit and see that it merged correctly.
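For example, a few quick checks right after the merge:

tab _merge      // how many observations matched, and where the rest came from
sum             // means, counts, minimum and maximum values of every variable
browse          // eyeball the combined data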

Don't forget to save the file (that is, if you don't want to rerun the merge command later).

(go on to Step #3)

62 comments:

Katherine said...

nice one dude! I will be visiting this post quite a few times methinks :)

Dany Bahar said...

STATAMAN, you are brilliant!!!

Katherine said...

hey stataman
I do a merge of a bunch of files, which are all already sorted by the personid I'm using. here's the command and the output:
merge personid using idS7G__IND.DTA idS7B__IND.DTA idS19C_IND.DTA idS7F__IND.DTA idS9___IND.DTA idS7H__IND.DTA idS3___IND.DTA, _merge(ind)
(label timeunit already defined)
(label yesno already defined)
(label timeunit already defined)
(label yesno already defined)

do you have any clue what this label already defined thing is?

hey and when are we getting our post on thank god for the egen command? hmm?

stataman said...

Hi Katherine,

I haven't seen an error like this before. My guess is that it talks about labels defined in each file. These labels are later attached to variables and then numeric values are displayed with their corresponding label.

Is this actually an error or a warning? If it's an error, it appears in red color and stops the program. If it's a warning, it's in green and you can go on with your program without a problem.

If it is indeed an error, try to run the merge with the option "nolabel". The help file says it will not copy value labels from the using files.

Does this help?

Katherine said...

hey you
so yes it was just a warning and not an error, and using nolabel did fix the problem.

here's a suggestion for a post - the importance of using log files. I just had crimson go loopy on me, and it deleted all the pretty code I had written over the past 5 days. It was pretty code! luckily I could use my log file to retrieve the code and recreate my file - thank goodness. So now I have put into place a proper repository backup system but in the meantime I am happy I was using log files!

sofie said...

Hi Stataman,

nice blog with interesting articles. I started a similar project a while ago, but I didn't descover yours until now. Keep up the good work!

Sofie

newyorkbus said...

Hi Stataman,

Your blog is terrific and thank you for your time and efforts on putting it together!

I have to append 230 datasets together (using vertical combinition). Do you have any tips on doing it all at once?

stataman said...

Thanks!

To combine 230 I'd recommend looking at stage #6 in this tutorial. It shows how to use loops. If your dataset files have a systematic name (file1.dta file2.dta ... file230.dta) it would really be easy with a forvalues loop. Otherwise you can construct a long macro with all the filenames one after the other (except for the first). Load the first by the "use" command and then use a foreach loop to joinby, or merge, the other files to the accumulated dataset in memory.
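A rough sketch of the forvalues idea (assuming the hypothetical names file1.dta ... file230.dta):

use "file1", clear
forvalues i = 2/230 {
    append using "file`i'"
}
save "all_files", replace      // hypothetical name for the combined dataset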

nkincler said...

You're awesome! Thank you so much!

Anees said...

Hi Stataman!
I need some urgent help in understand of the mergins many datasets. I have to merge 6 to 7 datasets in fact. i these files like 1 to 7. I started with merging 1 with 2 using the code
use 1
sort id
save 1, clear
use 2
sort id
save 2, clear
use 1
merge id using 2, no keep
tab _merge
keep if _merge==3
use 1
.
.
.
same pattern till I merged all the 7 files to 1. I got finally a merged datasets.
My question is should i drop _merge=1 if i have to use repeated cross sectional sample.

stataman said...

First of all you probably need to drop _merge, if it exists from previous merges, before any merge.

As to the _merge == 1 (those in memory that did not find a match in the file on disk you are merging into memory), it's your decision. I don't think there's a rule. Maybe they were missing from the first dataset but have observations in datasets 2 to 7. Still in some projects dataset 1 might be crucial, so you might want to drop them after all.

What I usually do is look at the most inclusive dataset (with all the ones that did not find a match), try to understand why there is no match and then decide according to what I got what I want to keep. Some times it's only _merge==3, other times not.

Anees said...

Thank you Statsman for the reply. I think I should give some more explanation to my query. I have 28 quartely collected data in 5 waves each and only 20% of the individuals repeated each wave so that whoever entered in the first wave, 20% of them are interviewed in 2nd waves and this over the 5th wave they exit. Now that there seems to be panel touch in it but it more generally used as a cross sectionally so I do not need to drop if an individual was contacted once. Now in this case if my datasets do not exactly match still I need to keep only _merge=3 and drop else. can you help me with the choice of merge or append command in that case. I have many variables with the same name and coding over the quarters.

stataman said...

Hi again.

Wait, if you have a recurring cross-section, why are you merging it "horizontally" instead of "vertically"? Usually you will have the same variables, right? Just use the append command and add each wave below the other. You can add a variable that indicates which wave did the observation come from.

Does this help?

Anees said...

I am sorry for late reply to your reply but I was unable and away so could not made that in time. Now after following some hints from these posting I think have to use the same append command and I can only have some slight confusion and I hope you would finally help me sort that out also. Ok I appended 28 waves only one wave have such recoded variables which are different from the other codes. For example rest waves codes countries by names and one have have numeric codes. I know I have to recode by tostring and replace commands but as there are more than 100 countries in the names so is there any way which will directly recodes these countries into naming codes instead of digits. I know there might not be but still want to confirm. Also would it be fine to use both the codes for the same named variables.

stataman said...

I would recommend creating a dataset that will be like code dictionary. In it you can have a variable for each coding method. One for the numeric codes, another one for three-character country code, another for two-character etc (only if you need to). Then, if your original datasets are tidy, you can merge the relevant variables from the dictionary according to the code you have in the original file and the one you want in the big destination file. After you create the dictionary you only need to merge each file.

One more thing to remember, though, is that some commands in Stata don't like string values (for example, if you try fixed effects regression with xtreg). So maybe the best thing is to keep the numeric country code and maybe label the values with some string format of the country name - so that human eyes can read it easily too.

I hope this helps, but I'm less and less sure.

Anees said...

Hi Again!
I am really thanking you for your guidance which let me to work out most of the issues by now. Here the last thing I would like you to confirm for is that if I have the same type of variables like country and there are different answers to this questions like
use dataset1
list country
UK
USA
France
Spain

and

use dataset2
list countru
UK
USA
Spain
Germany.

Would the apending the command would replace not being alike entries in the dataset or it would creat another category in the same variable. eg
use apndeddataset
list country
UK
USA
Spain
Germany
France

or it would add the entries alike and superimpose the dataset1 entry of france with germany. Please confirm it for me as I have more than hundred countries in my country variables I could not figured out how that appending the country variable in 8 different quarterly data would be consistent.

stataman said...

The best way to learn that is to experiment. Try to construct datasets as you gave in the example and then do the append and see what happens.

Append does not superimpose datasets on each other. It just puts the appended dataset below the dataset in memory. If you have the same variable name for country, it will put the appended observations' countries in the same variable, but in the appended observations. If there are two names (country and countru), then a new variable named countru will be created and the first dataset's observations will have missing values for countru whereas the appended dataset's observations will have missing values for country.

I'm pretty sure experimenting will be much more helpful than my comments.

bayhaqi said...

Nice Blog indeed! Thanks!

bayhaqi said...

nice blog!!!

Becca said...

THANK YOU SO SO SO MUCH! Your site (the merge/append post) just saved me from hours & hours or further struggling (I've already spent many such hours). Thanks!

Eileen said...

Thank you, I really needed a refresher on Stata. :) Your blog is wonderful.

Nafees said...

Hi Stataman, I am working on a project. I need to make combinations of variables of the common values in those variables and create new variables from these. For example, in one dataset, I have 8 variables so possible number of combinations would be 28 for two, 56 for three, 70 for four etc. I have worked out a way but this takes a long time. Can you help me write a shorter code or guide me which command(s) should be used to accomplish this. Thanks. Nafees

Anees said...

You can use the gen or egen command where gen newvar= var1 if var2==varvar3 format. This way all equal in values variables will be generated.

David Blake Jones said...

Hi, I hope this question is not too basic, but I am new to Stata and don't really know how to search for help with this question. I am analyzing data from the American National Election Study of 2008. In the post election part of the survey, respondents are asked two questions about their perception of government responsiveness.

The problem is that about half of the respondents are asked one version (labeled "old" question) of the first question. The other half are asked another version (labeled "new") of the first question. The only difference between the two versions, however, is the presence of the word "about" in one and its absence in the other. Thus, I want to assume that the questions are asking essentially the same thing.

The second of these Government Responsiveness questions (the actual second question, not the second version of the first question) just has one version. I want to create a scale to combine the responses to the two Government Responsiveness questions , but don't know how given the two versions of the first question.

Normally, if two questions only have one version each, I would generate a new scaled variable to combine the two questions, as in gen NewScale = (Question1 + Question2). However, given that there are two versions of question 1, I don't know how to do this.

If you would help me I would be most helpful.

Thanks for your time.

kabaso said...

I am merging data on 1 to 1, 1 to many, and many to one but i a m getting the message "variable hhid does not uniquely identify observations in the master data"
When i merge on m to m data especially on group variables is becoming correlated. what can i do?

I used the following commands:

use "C:\Users\MWENIAK\Documents\LCMS2006\Education 14.08.2010.dta", clear
rename SEC4_PID pid
rename HID hhid
sort hhid pid
save newfile1.dta, replace

use "C:\Users\MWENIAK\Documents\LCMS2006\Household Roster and migration and poverty.dta", clear
sort hhid pid
save newfile2.dta, replace

Thanx

Kabaso Nkandu

kabaso said...

I am merging data on 1 to 1, 1 to many, and many to one but i a m getting the message "variable hhid does not uniquely identify observations in the master data"
When i merge on m to m there is no problem and it is successful, but data especially on group variables is becoming correlated. what can i do?

I used the following commands:

use "C:\Users\MWENIAK\Documents\LCMS2006\Education 14.08.2010.dta", clear
rename SEC4_PID pid
rename HID hhid
sort hhid pid
save newfile1.dta, replace

use "C:\Users\MWENIAK\Documents\LCMS2006\Household Roster and migration and poverty.dta", clear
sort hhid pid
save newfile2.dta, replace
/*Merges the three new files generated*/

use newfile1.dta, clear
merge 1:1 hhid using newfile2.dta
tab _merge /*check the file to verify that _merge takes the appropriate value*/
drop if _merge!=3
drop _merge

Thanx

stataman said...

Try to merge according to both hhid and pid:

merge hhid pid using ...

kabaso said...

Thanx for your quick response. I tried merging using both hhid and pid but i am getting the following error message:

merge 1:1 hhid pid using newfile2.dta
variables hhid pid do not uniquely identify observations in the master data

stataman said...

This means your dataset has at least one case in which at least two observations share the same combination of hhid and pid. Stata doesn't know which one of them to choose for the merge. You need to figure out exactly how your datasets are constructed. Using different egen commands can help you learn more about it. For example:

egen c = count(_n), by(hhid pid)
tab c

browse if c > 1

Will show you the cases that confuse the merge

kabaso said...

Thanx once again. I have managed to use the egen and got the following results:

use "C:\Users\MWENIAK\Documents\LCMS2006\Education 14.08.2010.dta", clear

. rename SEC4_PID pid

. rename HID hhid

. sort hhid pid

. egen c = count(_n), by(hhid pid)

. tab c
        c       Freq.     Percent        Cum.
        1       95009       99.82       99.82
        2         170        0.18         100
    Total       95179         100


what can i do to make merge 1 to 1 possible. please advise!

stataman said...

I'm sorry I can't help more, but I'd look at the 170 cases of 2 obs per hhid-pid combination and see why you have them. If they are just duplicates, drop one of each (duplicates command can help with that). If they are not exact duplicates, try to find out what distinguishes each observation in the pair and see maybe there's a third variable you need to merge by.

kabaso said...

Thanx very much stataman. may almighty God bless you. your advice worked. i dropped the 170 cases and a 1to 1 merge worked.

stataman said...

Hi kabaso,

I'd drop only half of the 170 cases (those that are duplicates), not all of them. There is still valuable information in them. To keep just one instance of every group of the same hhid-pid you can:

egen tag = tag(hhid pid)
keep if tag == 1
drop tag

Good luck

kabaso said...

Hi stataman. with your advise i managed to merge the first four files successfully. when i decided to merge three extra files to make 7 files there is a problem. variables from the second and third file were dropped from the final merged file. what can i do to retain all the variables in the seven files?

kabaso said...

hi stataman i want to withdraw my earlier post. You took too long to reply. Therefore i made so many tries and research only to discover a typographical error in my do file. it is working perfectly. you are genius

Khan Hidayat said...

Hi Stataman!
I have two datasets, one baseline and one follow up each of these have unique ID for household (hhid). I want to merge these to construct a panel of it. I need your suggestions. Thanking you in anticipation.

武忱 said...

Stataman!!!You briliant!!! Thanks a lot!

nada said...

Hi,

I have a question regarding how to merge datasets. I want to combine datasets (individual data) from different countries where the categories for each variable will be different, for example with "political party" or "province". Although they are the same variables, what do I do so that all of the categories for all three countries appear in the 'base' dataset? Right now I am trying to do this in SPSS but I am not sure how to continue or if I should try this in STATA. In one dataset I have added more categories for the political parties in each country, but do I have to recode them then in the original dataset before merging? I hope this makes sense and thanks in advance for any advice you can give me!

nada said...

Sorry, I meant to elaborate, I think this would be either a one to many merge or many to many merge. Another example like I said is the province variable where for one country there are certain provinces and for another country there are others. So the variable is the same, but the categories are different. I would really appreciate specifically on the best method to use and the commands I would need to do this. I have read over the post but any extra advice regarding my examples would help!

micky said...

Our SLM household survey data contains a number of files pertaining to
various socioeconomic aspects of the population. We have managed to merge
different files with the master file by jointly using HHcode and IDC (the
personal identifier). However, we are finding difficulty in merging the file
containing data on remittances with the master file. This remittance file
has only HHcode as identifier, and as is the case with other files, is not
unique. One solution that works is to drop all non-unique HHcode
observations in the remittance file, and then do a m:1 merge with the master
file. We are wondering if there exists a better solution to the problem.

Geoy said...

Hi Stataman!!

I have a huge problem!!

I`m using data from WB and because it`s too big they divide it into 45 files. I merged them one by one...but then they have 2 files at the end with the weights. I`m stuck, I really need the weights but how can I merge them since the variable don`t correspond? any little help would be highly appreciated

Dj said...

hi, if i need to merge data based on more than one key variable, hw do i do it?

Muhammad Anees said...

You can use options like 1:m, m:m and m:1.

for more details, see help merge in Stata.


Anees
aneconomist dot com

Allison Fernandez said...

Thanks for your help regarding the already defined error when merging datasets! That was helpful!

Muhammad Anees said...

Econometricians Club (www.econometricians.club) offers an online course in Stata for Econometrics and as I am member of this blog since long, I wish to offer a discount to any of the blog-member in an online, one to one and private online course to be recorded for the participant for future use with custom module based on the participant specialization of research. The courses include:

1. Data Cleaning, Merging, Appending, Managing, Graphing
2. Analysis, Regression, Correlation, Hypothesis Testing
3. Regression Evaluation, Assumptions and Specification Tests
4. Modification of Models based on 3 where needed
5. Writing of Results in Academic Standards

Those who register for this course and mention STATMANBLOG, I will give him a discount for around 50% of the course fee charged from normal students.

You can see more about my club at htt://www.econometrician.club

Thang Viet said...

Stataman: Your blog is really informative. How often do you clean spams nowadays? There appear several spams: people are trying to sell their junk training courses.

Could you help explain the difference in the following merging commands?

The first merge command I experimented is:
. merge 1:m idgr using ... /*idgr is the identifying var, which is created by grouping two vars, id and session*/

The second is:
. merge 1:m id session using... /* id and session are the two identifying vars*/

The results of the two merging process are not the same. The first one gives less merged obs (_merge==3) than the second one does.

Should I keep the second merge result or the first one?

Muhammad Anees said...

#ThangViet:

Your point of selling junk courses can be true but for my own comment as an instructor of Econometrics using Staa at www.econometricians.club might be exclusion as it is fully relevant as I am always looking to this forum since 2009/2010.

Now, the two codes are difference as the first one matches each observation/variable based on only on idgr while the second one makes pairs for unique combinations using the id session.

Initially, the two datasets are compared for idgr only for first set of code and if that matches between the data, it is merged accordingly and _merge results will be ==3. Otherwise, it can be only in main/parent data or merging data.

The second code first makes unique ids based on the pairs of id and session and where both the id and session matches between the two datasets, then it creates the _merge ==3 or it might be to the one or other datasets.

I wish this explain simple explanation helps you understand the issue.
