Introduction


This blog is a free Stata tutorial. I have been using Stata for the last two years for applied work in economics and other social sciences. If you are an undergraduate or graduate student, or if you work for an agency that performs social research, you will probably need to use Stata for your project. Stata has an extensive manual which is, in my opinion, very accessible, but to use it you need to already know the names of the commands.

However, if you are new to Stata and have a project to do, there is a sequence of actions you will probably need to take. This tutorial follows that sequence: data assembly, then construction of additional variables. I deliberately skip the commands that perform statistical analyses and leave them to your statistics or econometrics courses. The second part of the tutorial (steps #5-#8) is dedicated to automating those commands and creating the tables that report the results. Along the way there are best practices for writing code that is easy to follow and to change if needed.

I am assuming the reader has basic knowledge of Econometrics (regressions etc.) and I will not get into issues of how to specify an appropriate model. I will concentrate, though, on the practical steps one needs to do before and after the regressions, and how to organize the code so as to minimize mistakes.

The tutorial is divided into steps. You might not need to go through all of them, so feel free to move on if a step is irrelevant for you. You can also navigate through the tutorial with the labels on the left bar. They consist of keywords (like an index) and step numbers (like a table of contents).

The steps are as follows (keep in mind that the tutorial is still under construction):

  • Step #7: Exporting Results to a Spreadsheet - Excel as an example
  • Step #8: Program Definition - if you start to see the same code in many .do files, maybe you should read this step.
I did not find time to fill in steps #7 and #8 here. For step #7, and other good things, I prepared some slides for a short Stata sequence I gave at the department. You can find them here.

Good luck!

173 comments:

Katherine said...

hey stata man!
I have a question. I have 4 samples, each of which use different years. So of course I have to repeat my code 4 times cos I'm not smart enough to figure out how to use a list of years or something.

I repeat this code four times, just using different years each time:

use sample1_1999.dta, clear
foreach year in 2000 2001 2002 {
append using sample1_`year'.dta
}
foreach year in 1999 2000 2001 2002 {
gen year`year' = (year==`year')
}
egen count_year = count(year), by(zehut)
unique zehut
tab count_year
drop if count_year > 1
unique zehut
save stack_sample1.dta, replace

I know I need to make some sort of macro list, and then loop through the list, but I don't know quite how. also I have to use the first year each time to start the stacked file, but then I don't append it, so I have to somehow get rid of that first element of the years list, or just add it in twice and then not worry because in any case I get rid of the duplicates later on.

any ideas?

stataman said...

Hi Katherine,

If I understand correctly, you need to change the list of years every time.
To do this, define four local macros:

local sample1_years "1999 2000 2001 2002"
local sample2_years "1998 2002 2003"
local sample3_years "..."
local sample4_years "..."

foreach sample in sample1 sample2 sample3 sample4 {
   local first = 1
   foreach year in ``sample'_years' {
      if `first' {
         use `sample'_`year', clear
         local first = 0
      }
      else {
         append using `sample'_`year'
      }
   }

   // The following command creates the year`year' dummies
   xi i.year, prefix("year") noomit

   egen count_year = count(year), by(zehut)
   unique zehut
   tab count_year
   drop if count_year > 1
   unique zehut
   save stack_`sample'.dta, replace
}
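
A variation on the loop above, sketched under the assumption that the yearly files follow the sample1_<year>.dta naming from the question: gettoken splits the first year off the list, so that file can be opened with use and only the remaining years are appended. The toy files built at the top are made up and are only there so the sketch runs on its own; note that it writes them to the current working directory.

```stata
* Build four toy yearly files (hypothetical data) so the sketch is self-contained
foreach year in 1999 2000 2001 2002 {
    clear
    set obs 3
    gen zehut = _n
    gen year = `year'
    save sample1_`year', replace
}

local years 1999 2000 2001 2002
gettoken first rest : years          // first holds 1999, rest holds the other years
use sample1_`first', clear           // start the stacked file with the first year
foreach year of local rest {
    append using sample1_`year'      // append every remaining year
}
```

The same two lines (gettoken, then use) could replace the first/else logic inside the loop over samples, if that reads more clearly to you.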

Keith said...

Hi, I am using stata to analyse a discrete choice experiment. I am using the clogit function and I am totally stuck!
I have got my main variable effects by this:
clogit y x11 x12 x21 x31 x32 x41,group (setid)
I now need to get the effects for just the males ect...
I rally do not know what to do.
Please help me!!

stataman said...

Hi Keith,

I'm not sure I understand exactly where the males come into the regression, but I am assuming you need to run this regressions for males only? Or do you want to run the regression for all observations and allow just some of the variables to have a different effect for males and females?

If you want to run the regression just for males, add an if qualifier to the command:

clogit y x11 x12 x21 x31 x32 x41 if male==1, group(setid)

If you are looking for the second option, you will need to interact the variables you want to allow the effect to vary between males and females. For example, if you want to see how the effect of x11 changes between males and females (while the rest of the variables have the same effect on both), you should run:

gen male_x11 = male * x11
clogit y x11 male_x11 male x12 x21 x31 x32 x41, group(setid)

The effect of x11 for females will be the coefficient reported for x11, and the effect of x11 for males will be the sum of the x11 coefficient and the male_x11 coefficient.
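
To get that sum together with a standard error, lincom can be run right after the estimation. The following is a minimal sketch on simulated data: the data-generating lines, the seed and the coefficient values are all made up for illustration, and only x11 and its interaction are included to keep it short.

```stata
* Simulated choice data (hypothetical): 200 choice sets with 3 alternatives each
clear
set seed 12345
set obs 600
gen setid = ceil(_n/3)
gen male  = mod(setid, 2)                  // constant within a choice set
gen x11   = runiform() > 0.5
gen u     = 0.5*x11 + 0.5*male*x11 - ln(-ln(runiform()))
bysort setid (u): gen y = (_n == _N)       // the alternative with the highest utility is chosen

gen male_x11 = male * x11
clogit y x11 male_x11, group(setid)
lincom x11 + male_x11                      // effect of x11 for males
```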

I hope that answers your question.

Galia said...

Hello Stataman :)

I love your Stata Blog, it look like the clearest, i.e. best guide i could find on the net. So firstly a big thank you!

I am having one small problem however.... I am using Stata on a Mac computer. I am trying to load an excel file onto Stata and therefore have been reading your guide for that. The problem is that i cannot input the 'location' of my file to type it into the command window... The only location name my Mac gives me doesn't seem to be working when i type it in exactly as you describe.

I wonder if it would be easier, instead of typing the precise commands into the commands window, you know how to go about doing it in the more tedious way? i.e. through the panel at the top? i.e. 'file', 'open' etc. ????

The deadline for my project is in a few days, i would be extremely grateful if you could reply before then! Fingers crossed,

Galia

stataman said...

Thanks Galia!
I have got to say I have no experience with Mac whatsoever. I worked with UNIX, some Linux and mainly Windows, but never on a Mac, so I don't know even how files are saved and referred to in Mac. The best I could find was here: http://www.macworld.com/article/57685/2007/05/copyfilepath.html (maybe this will help you to get the location of the file).

There are other ways to get the data. I'm assuming you wanted the location (or path) of the file for the insheet command, but you can try to use StatTransfer if you have it (the best way that I know of), or you can simply write "edit" in the command line, then open the Excel file, copy the columns you need and paste it into the data editor that Stata opened after you entered "edit". In Windows the paste action is done by clicking Ctrl-V. I understand that the Mac counterpart is Cmd-V.

Does that help in any way?

Galia said...

Hello,

Yes, i have now solved the problem, thank you. :D I decided to try pasting the data and it worked very well just as you described. So i will stick to that method since finding the location of the file seems a little complicated!

Looking forward to reading the next of your articles now :)

Odovakar said...

Hi Stataman,

I am new to Stata (have been using Eviews earlier), and despite of your excellent blog I'm still facing some basic trouble getting my panel data right into the editor. I have a "normal" panel so to speak, with "Country" in the first column, "Year" in the second, and then the variables. When I try to get it into Stata from Excel, either by using Stattransfer or by Copy/Paste, "Country" is all the time being treated as a string variable, no matter what I do, i.e. the panel can't be estimated. "Year" is being treated as 'Int'. What am I doing wrong?

stataman said...

Hi Odovakar,

I don't know how your country is coded. If it's coded like "ARG", "USA", etc., then there is no other choice but to treat the original country variable as a string. If there is a number code that StatTransfer fails to convert into a numeric Stata variable, then look again at your Excel file - there is probably a tiny green triangle in the top-left corner of each cell, which says that the numbers are stored as text. You can click on the warning next to the cell and tell Excel to treat them as numbers.

But, in any case, even if the country is coded as string, and you want to give it a numeric code, you can use the following command in Stata (assuming the original country variable is named "country"):

egen country_code = group(country)

And then you have a code for each distinct string stored under country.
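
An alternative is encode, which also creates a numeric code but keeps the original strings as value labels, so output still shows "ARG" and "USA". A minimal sketch on a made-up four-row panel:

```stata
clear
input str3 country int year
"ARG" 2000
"ARG" 2001
"USA" 2000
"USA" 2001
end

encode country, gen(country_code)    // numeric codes, assigned in alphabetical order
xtset country_code year              // the panel can now be declared
```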

Good luck

Alyssa said...

hey.. this is a great help. your blog inspired a breakthrough for me, over something that had me blocked for a week. thanks man!

stataman said...

Cool! More than happy to help.

Odovakar said...

Hi again Stataman,

Thank you very much for your previous help - I immediately got it right! You are the best.

Now, a rude leecher as I am (as well as a Stata newbie...), I have to ask you once again for some help. I have in vain for several days now been reading everywhere on the Internet and in manuals, and have been experimenting with commands such as vif, xtserial, hettest and xtgls, but without luck. All I want to do is to test my panel models for autoregression, multicollinearity and heteroscedasticity. Please tell me simply: How do I do this? Nothing seems to work and I get different kinds of error messages all the time.

I have estimated a few log-log models with fixed effects, and one with random effects as well.

Also: Is there any simple command in Stata with which to estimate an EGLS panel model, such as the settings available in Eviews?

All the best to you!

stataman said...

Pfew.... Sorry for this huge delay in answer, but first year PhD is tough here.

In any case, it would have been more helpful if you could copy & paste the errors you got.

To tell you the truth, I didn't have the chance to use any of the commands you mentioned, but I experimented with some now and hopefully I can help with some of them.

1. xtserial - you first need to declare the structure of your panel with the tsset command. For example, suppose I have a panel of countries and years. Then I should first run:

tsset country year

and then:

xtserial <variable>

where <variable> is the variable you want to check for autocorrelation.

A note about tsset. If your panel is defined by either a string variable or more than one variable (country, province, county, town), then to get one numeric variable with a code for each of the combinations you can run:

egen <varname> = group(<varlist>)

where <varname> is the name of the new variable that will hold the code, and <varlist> is something like "country province county town".

Then you run tsset <varname> year

2. xtgls In xtgls you just need to specify the group-identifying-variable (country in our case) with the i() option:

xtgls agri_gdp rainfall, i(country)

3. EGLS? I never heard of EGLS. I know FGLS and you have several ways to implement it. I guess xtgls implements one of them.

Again, sorry for the long delay.

marie said...

Hi!
I've tried to use xtserial to test for serialcorrelation in a paneldataset, but I get an error: r(2000) no observations. All the variables are either float or byte. Can you help me?

stataman said...

hmmm... pretty hard for me to tell. If you're working on a panel data, make sure you do the tsset before the command with both the repeated-observation identifying variable and the time variable.

This is what I did to see if the command works:

set mem 20m
set obs 20000
gen year = 2000 + mod(_n, 3)
gen x = mod(_n,6667)
drawnorm u
tsset x year
xtserial u

Then I got:

Wooldridge test for autocorrelation in panel data
H0: no first order autocorrelation
F( 1, 6665) = 0.108
Prob > F = 0.7427

(as expected, as I drew u regardless of years)

Try to copy the list of commands above and see how the structure of the dataset is different than yours.
In general, the "no observations" error comes up in two cases: (1) there are actually no observations in the data (2) there are observations, but they have missing values (the annoying "."). Maybe one of the years has missing values throughout, even though other years are ok. Make sure all years have non-missing values (at least some - probably more than one, but no need to have them all full).

Marleen said...

hi Stataman and Marie,

I was going to write that I had the same problem of "no observations" after the xtserial command.. However I solved it with the advice that there should be no missing data..

Allthough I have no missing data in my years, I have gaps (1981, 1984, 1987...). Just by creating another timevariable (1,2,3...) the problem solved =)

So thanx a lot Stataman and good luck with the serial correlation tests.
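
The fix Marleen describes can be sketched as follows, on a made-up panel observed every third year; egen's group() function numbers the distinct years 1, 2, 3, ... in ascending order. The variable names are hypothetical.

```stata
clear
input id year
1 1981
1 1984
1 1987
2 1981
2 1984
2 1987
end

egen period = group(year)     // 1981 -> 1, 1984 -> 2, 1987 -> 3
xtset id period               // consecutive time variable, so lags are defined
```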

Elena said...

Hey,

I am trying to do a factor analysis with Stata but I keep on receiving a r(2000)- no observation- error message. Does anybody have an idea what goes wrong here?

Elena

stataman said...

I haven't done much factor analysis in the past. My first two guesses would be to either check that you have nonmissing values for the varlist you give or check that your conditions (if you have any) match at least some observations (and that the value of the variables for those observations specifically is nonmissing). To check that your condition is ok, you can do:

sum varlist if <condition>

If you get 0 in the number of observations, you should check your condition again.
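
A related check, sketched on the auto dataset that ships with Stata: many estimation commands drop any observation that has a missing value in any of the listed variables, so it can help to count the observations that are complete on all of them. The three variables here are just placeholders for your own varlist.

```stata
sysuse auto, clear
count if !missing(price, mpg, rep78)    // observations with no missing value in any of the three
```

If this count is 0 even though each variable has nonmissing values on its own, the missing values are spread so that no single observation is complete.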

Any other suggestions are, as always, welcome.

Michael said...
This comment has been removed by the author.
stataman said...

Delete your comment? I never deleted any comment... I don't know, perhaps a blogspot bug.

Maybe you should write it again?

Michael said...

Elena: Make sure your variables are defined as numeric. If they are defined as 'string,' one option is to -destring-

Everyone: Stata has a discussion board that is another resource for technical questions about Stata.

ken1088 said...

hello!
i have a question. i used the xtserial command to determine autocorrelation on my panel data but i'm getting the no observations error. how can i correct this? there were no missing values in my data and i did not put any conditions. i have also tried destringing my data.

Michael said...

Ken1088:
Did you define your data as time series?
-tsset-

la geconde said...

I have been using STATA to develop a logit model. The variables in my model are all binomial.Could you please suggest how I show the regression fitness graphically? Is it at all possible?

stataman said...

Well, you can graph something, but it will not give you extra information. Remember, the binary dependent variables models, such as probit or logit, will give you an estimate of the probability to get 1 in the dependent variable given your independent variables. If your independent variable is also binary then you can plot a graph that connects between the dots: (0,Pr[y=1|x=0]) and (1,Pr[y=1|x=1]). As you can see, in this case you can simply report those probabilities in your text or table.
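
A minimal sketch of those two probabilities, using the auto dataset that ships with Stata and a made-up binary regressor (heavy) purely for illustration:

```stata
sysuse auto, clear
gen heavy = weight > 3000           // hypothetical binary independent variable
logit foreign heavy
predict p                           // Pr[foreign = 1 | heavy]
tabulate heavy, summarize(p)        // one predicted probability for each value of heavy
```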

I hope that answers the question.

stataman said...

ken1088, sorry but without seeing the data I can only guess. I also learned that there is an illegal copy of Stata10 going around that does funky stuff - loses variables, drops observations, I don't know.

I'm not saying you are using an illegal copy, but in case you do, make sure you get a legal one.

Benjamin said...

Stataman, have a question for you. I am trying to create a local varlist that contains all of my variables
such as
local varlist1 v1 v2 v3 v4

and then at times be able to remove a variable or two for instance

local varlist2 v2 v4

The code I have been trying to use is

local varlist1 v1 v2 v3 v4
local varlist2 v2 v3
local varlist: list varlist1 - varlist2

but it doesn't some seem to work. I can't find any examples despite a couple hours of googling.

stataman said...

Hi Benjamin,

To give a general answer I would need to know more about which variables you are trying to remove. Is there any criterion for them? Maybe construct two lists - one for each subset of variables - and combine them when you need all.

In short, it depends on what you're trying to do.

In general, if you need to put a varlist into a local (perhaps there's a direct command for it, but if there isn't:) you can fill it with a loop:

local myVarlist ""
foreach var of varlist v1-v3 v4-v8 {
local myVarlist "`myVarlist' `var'"
}

will do the job.
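
For removing names from a list, the macro list syntax in the question is valid Stata. One possible reason for it not to work is how the lines are run: locals defined in a do-file only live for that run, so executing the lines one at a time from the do-file editor leaves the earlier locals empty. A minimal sketch, meant to be run as one block (the name remaining is just an example):

```stata
local varlist1 v1 v2 v3 v4
local varlist2 v2 v3
local remaining : list varlist1 - varlist2    // elements of varlist1 that are not in varlist2
display "`remaining'"
```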

Victoria Reyes said...

Hi, I just discovered your blog! I'm trying to run a fixed effects model on cross sectional data (not panel data), but every time I run the model, I get this error:

. xtreg ratio_arriv gdp_origin hdi_origin gini_origin empire_o perc_mfg_o cul_sit
> e_o travel_o embassies region_d region_o pop_mill_o gdp_dest hdi_dest gini_dest
> empire_d perc_mfg_d cul_site_d travel_d pop_mill_d empire_same region_same, fe
must specify panelvar; use xtset
r(459);

Why is this and how can I "fix" it?

Thanks!

Lim said...

Hey Stata man,

I have a panel data with "ethnicity" coded for only one year of the individual id but missing for the other 3 years. How do I fill in the missing values with the observed value for ethnicity?

Luisa Fernanda said...

Hi Stata man!!!!
I am working on a document for the central bank of Colombia. I am trying to estimate the following model:

. xi: xtpcse endcp1 vario crec tam end1 z pi diftasa pib z2 i. nit i. ao, correlation(ar1)

or

. xi: xtgls endcp1 vario crec tam end1 z pi diftasa pib z2 i. nit i. ao, panels (correlated) corr(ar1)

but I get this error:
varlist required
r(100);

could you help me please?

thanks

Luisa

stataman said...

Victoria, to run fixed effects you must specify which variable holds the group code for the groups you want fixed effects for. Suppose it is a country code (it should be numeric; if you have a string code such as a three-letter code, run "egen country_num = group(country)").
After you have the country_num as the variable that contains each country's distinct code, you can run fixed effects in two ways:

1. Directly, with an i(.) argument:

xtreg dep_var indep_vars, fe i(country_num)

2. Indirectly, by specifying the panel's structure.

xtset country_num

Or if there's a time variable too, like year:

xtset country_num year

see "help xtset" for more details on the second approach.


-=-=-=-=-=-=

Lim, I think I have something in the egen chapter about how to populate a value. It's pretty simple. Suppose the individual id is stored in a variable named indiv_id:

egen ethnicity_temp = max(ethnicity), by(indiv_id)

replace ethnicity = ethnicity_temp

drop ethnicity_temp
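
The same fill can be done in one line with by and explicit subscripting. This is a sketch on made-up data, assuming ethnicity is numeric and observed at least once per person: sorting by ethnicity within person puts the nonmissing value first, because missing values sort last.

```stata
clear
input indiv_id year ethnicity
1 2001 3
1 2002 .
1 2003 .
2 2001 .
2 2002 1
2 2003 .
end

* ethnicity[1] is the first observation within each indiv_id after the sort
bysort indiv_id (ethnicity): replace ethnicity = ethnicity[1] if missing(ethnicity)
```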

-=-=-=-=-=-=-=-=-=-

Luisa,

Try to avoid using the space bar after the "i." Stata reads commands in words, so if you do "i. variable" it reads the i. separately and expects a variable name to come right after the dot.

Try running:

xi: xtpcse endcp1 vario crec tam end1 z pi diftasa pib z2 i.nit i.ao, correlation(ar1)

-=-=-=-=-=-

Luisa Fernanda said...

Thanks for answer!!

I tried to estimate the model again with your recommendation but I get another error:

. xi: xtpcse endcp1 vario crec tam end1 z pi diftasa pib z2 i.nit i.ao, correlation(ar1)
no room to add more variables
An attempt was made to add a variable that would have resulted in more than 5000 or 4999 variables (Stata reserves
one variable for its own use). You have the following alternatives:

1. Drop some variables; see help drop.

2. If you are using Stata/SE, increase maxvar; see help maxvar.
r(900);

I drop the variables that I am not using and I set the maximum variables possible (5000)

What else can I do?

Thanks!!

stataman said...

Well, it looks like you have too many categories in your fixed effects (total possible values in the nit and ao variables is too high). This poses a computational problem for Stata.

As for a solution, I am not familiar with the command you are using, neither am I familiar with the dataset and the econometric model.

If your data is truly a panel, maybe nit or ao are the variables that define the group? Try to use xtset to define the dataset as a panel and get rid of the explicit use of the i.var coefficients.
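
To see roughly how many dummies an i.var term would create, you can count the distinct values of the variable. A minimal sketch on the auto dataset that ships with Stata, with rep78 standing in for nit or ao:

```stata
sysuse auto, clear
egen first_rep78 = tag(rep78)    // 1 for one observation per distinct value, 0 otherwise
count if first_rep78             // number of distinct nonmissing values of rep78
```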

Sorry I can't help more.

Eric said...

Hi Stataman,
I am working on panel data with missing observations. When I declare the dataset as panel data using xtset, the data is described as strongly balanced, which should not be the case. Secondly, I'm trying to fill in the missing data with the tsfill, full command, but nothing really happens: the missing data is not filled.

Please suggest!

Thanks!
Eric

stataman said...

Sorry, Eric. I can't really tell why it says it's balanced without seeing the data. My guess is that your missing values are not in the variables that contain the group and individual code. But then again, that's just a guess.

Never used tsfill before. Sorry.

Eric said...

Hi Stataman,

Thanks! You are right and I realised that I dont require to use tsfill.

Cheers!
Eric

Eva said...

Hey!
So, you'll probably think this is a ridiculous question, but I haven't been able to input data. I downloaded .dat and .do files from ipums then ran the do file in stata. The file ran and the variables appeared, but when I went to data editor, the variables were listed, but no values. I also tried to use tabulate but it said no observations. Help please! Thanks!!

Kristine said...

Hey StataMan

I must say thank you for the Guide.

But i have this task at hand - i have two datasets (panel Data) where 2009 data set is a follow up survey. So what i want to do is only select the HouseHolds (HH) in 2007 data set that are also in 2009 dataset. The unique variable is HH id. Which is the best way around this assignment?

stataman said...

Hi Kristine,

Sorry for the delay. I had some personal stuff to attend to.

What you can do is use the following commands:

gen in2009 = year==2009
egen hh_in2009 = max(in2009), by(hh)
keep if year == 2007 & hh_in2009

The first command will create a dummy variable with 1's for the observations for which the year variable contains 2009. The second command will take each household and give it the maximum of the in2009 variable, which is 1 if this household has an observation in 2009 or 0 if there is no observation in 2009. Finally, the last command will keep the 2007 observations of households that also appear in 2009.
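
If the two waves sit in two separate files rather than in one stacked file, a merge can do the same selection. A minimal sketch, assuming Stata 11 or later (for the merge 1:1 syntax), a household id named hh as above, and made-up income variables:

```stata
* Toy 2009 wave (hypothetical data), saved to a temporary file
clear
input hh income09
1 10
3 30
end
tempfile hh2009
save `hh2009'

* Toy 2007 wave (hypothetical data)
clear
input hh income07
1 8
2 9
3 25
end

* keep(match) keeps only households that appear in both waves
merge 1:1 hh using `hh2009', keep(match) nogenerate
```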

I hope this helps.

Victoria Reyes said...

Hi. I was wondering if I have unstandardized data (on tourism), how can I normalize it by population?

I want to make a graph that is

number of dyads (cumulative) by tourist arrivals so it reflects a power log

Thanks!

stataman said...

Sorry, Victoria, but I don't think I can help without knowing what the data looks like. It sounds like your question is less about Stata and more about the analysis at hand, and I'm afraid I can't help you with that.

Good luck!

Victoria Reyes said...

Thanks Stata Man...perhaps framing it like this will help:

how can I make a graph where the number of observations (e.g. dyads in my case) is the x? (with y being the percent of tourist flows)


I cannot just use the option "scatter percent" since that's too few variables and because I want the x to be number of dyads (observations)....

What I want is a graph where x is 1, 2, 3, etc (or 5, 10, 15 - something to that effect)

stataman said...

It's a bit hard without knowing what the original dataset looks like. You say that every dyad appears several times (hence "number of observations" on the x-axis). If this is the case, you probably want to collapse the dataset so that every dyad becomes one observation, with a new variable holding the number of observations this dyad had in the original dataset. In addition you get some y variable (tourist flow percentage). I'm not sure how this y variable appears in the original dataset. Suppose you have it for each observation in the original dataset and you want to present the mean y (mean tourist flow percentage)... you can select statistics other than the mean.

So if this is what your original dataset looks like (every dyad appears several times, with some y-variable for each observation), you can do:

/* The following command will collapse the dataset from multiple observations per dyad to one observation per dyad, where we take the mean of the tourist_flows variable, count the number of observations in each dyad, and put the count in the variable obs_num. */

collapse tourist_flows (count) obs_num=dyad_id, by(dyad_id)

/* The following line will draw the plot */
twoway (scatter tourist_flows obs_num)

I hope this is what you wanted to do. If not, I hope this gives you a hint at what you're aiming to.

Sorry I can't help more,
Roy

Victoria Reyes said...

Hi,

Each dyad appears only once (unique destination/origin combo) - but your advice did give me good ideas on how to move forward.

Thanks!
Victoria

David said...

Hey stataman!

I have a noob question concerning the estimation of a random effects model (xtreg re). I hope you can help me: I have run the random effects model and accordingly a fixed effects model. Afterwards I performed a Hausman test which was insignificant (confirming that my data can be estimated with random effects). I also confirmed this with the Breusch-Pagan Lagrange multiplier test for random effects. I am now testing the assumptions for my model, but I am unsure about all the assumptions that should be tested. For now I have tested my models for collinearity (collin), heteroscedasticity (lrtest) and autocorrelation (xtserial). I am however unsure whether I have to test for other assumptions? Maybe you have a clue. I have some textbooks (Wooldridge & Baltagi) concerning random effects modelling but since I am an econometrics-illiterate, I do not understand everything. thnx in advance!
Kind regards
David.

stataman said...

Hi David,

Hmmm... I'm not sure what to say. It's very hard to run tests without knowing what your maintained assumptions are. Moreover, "testing assumptions" is something that you can't do without maintaining others. For example, testing for endogeneity of some variables requires you to assume that the instrumental variables you use are exogenous (which you can't test, but... well... assume). In other words, if it's something that you can check, it's not really an assumption. It is more of a result.

This is an econometrics question and not a Stata question. My take is that you can't really do econometrics without having some economic/behavioral model (even if simple or implicit) behind. This model, and your understanding of the data, should give you ideas about what assumptions are required for you to test your hypotheses using the data at hand.

It might be just my take on it (or my econometrics professors') and others will think that you can test hypotheses (or assumptions if you insist) without thinking about the mechanism or a model.

I'm sorry I can't help you more on this.

Roy

kim said...

Hi stataman!
How do I know if the runs test has a autocorrelation? my runs test looks like this


. runtest residual
N(residual <= -.006206203950569) = 158
N(residual > -.006206203950569) = 158
obs = 316
N(runs) = 79
z = -9.02
Prob>|z| = 0

stataman said...

hi Kim,

Sorry, I have never used this command.

It looks like this page addresses this issue http://www.stata.com/support/faqs/stat/panel.html

sorry I can't help more.
Roy

kathy said...

Hello Stataman,
I am running some data and applying a kmeans cluster analysis. I got 3 different groups. My further step is to apply a fuzzy c-means cluster analysis and get the degree of membership of each observation in each fuzzy cluster, but I do not know how to get it. For your reference, I applied the following command:

cluster kmeans stu_pdifte pub_pdifte patappl_pdifte, k(3) measure(L2) name(prueba) start(prandom)

And for the fuzzy option, I read something like that in the Stata help:

cluster set myclus, addname type(fuzzy) method(kmeans) dissimilarity(L1) var(group g2l1)

But nothing happens. Please, could you help me?
Thanks in advance.

Gaëlle said...

Hey stata man!

First I used xtreg and the results were ok.
Second I'm trying to use xtabond but Stata said:
no observations
r(2000).
:-(
I can't understand why xtreg runs and xtabond doesn't...

Please help me.

Gaelle

daticon said...

Hi stataman...

Thanks for your BLOG. Good reading.

I've been having trouble with a command and was hoping you might be able to help.

Quite simply, all I want to be able to do is rename var1, var2, var3,... var_n to the corresponding data within the first observation.

Thus, var1's first obs might be "user_id", var2's is "lastname", var3's is "firstname".

The closest I have gotten to accomplishing this is:

foreach v of varlist var* {
di "`v'"
di `v'
}

This command will correctly display:
var1
user_id
var2
lastname
var3
firstname

However, to get it to rename var1 to user_id... if I add the following line to the foreach statement:

rename `v' `v'

..it says 'var1 already defined'

If I change it to
rename "`v'" `v'

...it says ' "var1 invalid name'

Any suggestions?

stataman said...

Good question, daticon, and the answer lies in the nature of macros (locals or globals).

When you invoke a macro in a command (say, " `mymacro'"), what happens is that Stata replaces all macros with their content BEFORE running the command. After all macros have been substituted by their content, then the command will run as if there weren't any macros used at all.

So, for example in the first iteration of the loop, when you run:
> di "`v'"

Stata will replace `v' with var1 and then will run:
> di "var1"

which will display the string var1.

When you run
> di `v'

It will do the same replacement:
> di var1

but now that there are no quotes, Stata does not treat var1 as a literal string whose meaning it can ignore. It treats it as a word that is part of the command, in this case a variable's name. Commands that take a single value (like di, for example) instead of whole variables (like corr, reg, su, and so on) use the first observation of the variable (unless you add the observation number in brackets after the variable's name). This is why you're getting the value stored in the first row of var1, which you set to be the name.
Finally, when you run:
> rename `v' `v'

Stata will translate this into
> rename var1 var1

which is not what you had in mind.


Generally speaking, putting the variable's name as the observation is not a very good strategy. It's usually an outcome of a flawed data import (there's a way to tell insheet to treat the first row as variable names, look at the help file). Otherwise, all the variables become strings and you probably don't want that.

If the problem wasn't a problematic data import, but rather your own idea (to put the variables names in the first row), a better approach would have been to make a long local with all the names ordered, and then loop like this:

local varnames "user_id lastname firstname ..."
local totvars : word count `varnames'

forvalues varnum = 1/`totvars' {
local renameto : word `varnum' of `varnames'
rename var`varnum' `renameto'
}

(I haven't checked that code so it might have a bug in it, but that's the spirit, anyways)

Good luck

daticon said...

Thanks for the reply stataman. Yes, the problem is that I have to 'import' data from web-based tables that all have different column headings. The issue is compounded by the fact that the leading 0 or 0s get dropped on import, but which are vital to maintain, column headings are not STATA friendly, etc. etc.

I have used a similar technique to your suggestion that basically involves importing the data (by cutting and pasting into the data editor) into STATA 11 two times. On the first run I treat the first row of data as an observation, so that the leading 0's don't drop. Clear. On the second run I treat the first row as headings. Then it is matter of having to click each heading to get the list of variables (like in your example).

Of course, this takes a lot of time and is tedious. The only way around it is to find a programmatic way. Unfortunately, neither of our efforts yet solves it so that I only have to import the data once, treating the first row as data, and then simply rename var1-var(n) to the contents of the first row.

Would be so easy to accomplish in Java, C, etc. that it is quite frustrating how impossible it seems in STATA.

If you (or any of your readers) can think of an answer... please post back! Thanks again, daticon.

daticon said...

...should have said with the first method..the STATA command is then quite simply (for example)

renvars var1-var100 / user_id lastname firstname ... etc.

stataman said...

if that's the problem you can take the first line (or any line) of the data (in the format from which you want to import from), and add some character (non-numeric character) to the value of the variable you want to be imported as a string.
Save your file, import it to Stata, and fix the value of the first line that you changed (take the character away).

Alternatively, if your leading zeros create a number of equal length across the dataset (say 9 digits), you can run:

gen stringvar = string(var, "%09.0f")

(right after you import)

I hope this helps.
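
A minimal sketch of the string() approach above. The variable id_num and its values are made up for illustration, and the sketch assumes every code should be padded to exactly 9 digits.

```stata
clear
input long id_num
123
45678
123456789
end

* %09.0f pads with leading zeros to a width of 9 and shows no decimals
gen str9 id_str = string(id_num, "%09.0f")
list
```

The padded codes live in a new string variable, so the original numeric column stays available for checking.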

daticon said...

I wish it would...but that would take longer than the 'paste into STATA 2 times' option. I may look at ways I could fix files using Java first, and then import into STATA...but that will have to be some weekend fun.

Unfortunately, the only 'fix' that will save me time is the ability to paste once into STATA and then rename all the column headers (var1 - var(n)) to the contents of the first row.

I do appreciate all the suggestions, but I really do need an automatic / programmatic answer to this.

stataman said...

if you insist, you can do what you suggested, but instead of running the

rename 'v' 'v'

first, do this
local newname = 'v'[1]

And then
rename 'v' 'newname'

Sorry for the lack of backticks on my current keyboard.

daticon said...

Hooray! Many thanks stataman for your help! That will save me loads of time.

The final code was:

foreach v of varlist var* {
local newname = strtoname(`v'[1])
rename `v' `newname'
}

The strtoname was needed to convert non-stata friendly names to stata friendly ones.
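
For readers who want to try the final loop on a toy example, here is a self-contained sketch. The three columns and their values are invented; the only additions to the code above are the example data and a drop in 1 to discard the header row once it has been used (an assumption about what one would want next).

```stata
clear
input str12 var1 str12 var2 str12 var3
"user id" "lastname" "firstname"
"007" "Bond" "James"
"012" "Smith" "Anna"
end

foreach v of varlist var* {
    * take the first observation of each variable and make it a legal name
    local newname = strtoname(`v'[1])
    rename `v' `newname'
}

* the header row has served its purpose
drop in 1
describe
```

Because everything was read in as strings, the leading zeros in the first column survive.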

Paul de Boer said...

Hi Stata man,

I'm trying to conduct a negative binomial regression. With the emphasis on trying!

I get stuck before even running it...

When I enter the syntax (or via the tabs above)I get:

number of dependent variables for equation 1 must be greater than zero. This, while I most certainly did select a dependent variable!

I cannot find out what that means (beyond what it literally says) and how to solve it.

I can conduct, for example, a linear regression and create graphs, so I don't think that it's a data-import problem.

And yes, I'm new to Stata (or negative binomial regression for that matter :)

I hope you can help me

Sincerely yours,

Paul de Boer

Michael said...

Hi, stata man,
I have a question that I can't solve. I've got panel data of company stocks and their month-end prices. For instance, for firm ABC, its stock prices are Jan 31, 20x1 $10.00, Feb 28, 20x1 $11.20, and so on. The data set is in a vertical form.

For each year, I'm trying to calculate each stock's average monthly return based on the change in stock price between Dec 31, 20x1 and Dec 31, 20x2. That is, ([Pr(month 13) / Pr(month 1)] - 1) / 12.

I can't figure out how to do this in Stata but I'm guessing its in relative observations values.

I appreciate your help. Thanks!

Lika said...

Hi Stataman,

First of all, thanks a lot for this very useful blog!
I have some troubles performing a unit root test for an unbalanced panel with xtunitroot and would appreciate your help. My original variable is numeric and log-transformed ("logsalesperemp"). I tried to run a unit root test (fisher pperron) using the following command but got the error "r(2000) no observations". Note that I tsset the data before running the program and also tried with ips test but no panel was used and no result returned....
Have you already encountered this error with xtunitroot? Any idea where the problem comes from? Thank you

. xtunitroot fisher logsalesperemp, pperron lags(2)

Woody said...

hey stata man!

Is there a command in Stata that solves heteroskedasticity and autocorrelation problems for binary dependent variable models in panel data?

Thank you for reading this...

Michael said...

Woody:
Tell us more about what sort of model are you using. It's not obvious to me since your dependent variable is binary.

Woody said...

I am trying to model the determinants of implementation of clean tech based on several firm characteristics. As such, my dependent variable is structured as follows: 1 indicates the presence of clean tech in firm i, and 0 indicates no clean tech is present in the firm.

Michael said...

Woody:
Are you trying to use OLS regression or logistic regression? Logit is for categorical dependent variables like you have. Don't think you can properly use OLS regression with a binary dependent variable.

Woody said...

Indeed, it has to be a logistic, because xtgls would be the solution for a continuous dependent variable. That's why I'm currently at a dead end...

Michael said...

Woody:
Now that I know that you are using a model with the usual linear assumptions, back to your original question. A stats program can't automatically fix problems of heteroskedasticity and autocorrelation. Solutions come from the human.

Heteroskedasticity can mean that you have a missing variable in your model or your data set comes from different populations. In the first case, you add the missing explanatory variable to your model if you can identify it. In the second case, you can split your data into two (or more) data sets and run your model separately on each set.

Regarding the problem of autocorrelation, it seems to me that you have cross-sectional data rather than time-series data, so I'm not clear on why you have an autocorrelation problem in your logit model.

Long said...

Hi Stataman,

I use the panel data and I try to use the command

XTABOND

but it turns out the message that "no observations". Please advise me what should I do.

Thanks a lot!

Max said...

Hi, nice guide.

I'm just looking for one solution. I have a huge data set with many N/A, N/R and some strings saying "data not found for year 20xx for company with code asdf".

Of course I could replace that in Excel with a rather complicated IF command, but since it is so easy to replace the N/A and N/R in Excel with .a and .b for missing values, I would just like to replace any string left in my data with .c directly in Stata.

So to sum up: is it possible to use, e.g., the replace command to just replace any string in any variable?

stataman said...

Hi Max,

destring <varlist>, replace force

should work.

That is, if your variable name is, say, "amount", then:

destring amount, replace force

Max said...

Thx it does work, could i tell him to replace the missings as .c directly?

But ok i can rename them later easily.
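
One possible way to get .c directly, sketched on invented data: flag the cells that hold text before running destring, then overwrite those cells afterwards. The variable name amount follows the example above; was_text is a hypothetical helper name.

```stata
clear
input str30 amount
"12.5"
"N/A"
"data not found for year 2005"
"7"
""
end

* flag cells that hold text rather than a number or an empty string
gen byte was_text = missing(real(amount)) & amount != ""

* force turns every non-numeric cell into the plain missing value .
destring amount, replace force

* recode only the cells that used to hold text
replace amount = .c if was_text
drop was_text
list
```

Cells that were empty to begin with stay as the plain missing value, so they remain distinguishable from the recoded text cells.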

kyoko said...

Hi, stataman,

I'm studying economics in university and I start to use STATA in
recent days.

I have a question.
I get a "no observations" error message -r(2000) when I gave the
following command (and had the shown output):
probit bio09 emagri08 exrate08 upov08

but
When I gave the following command (and had the shown output):
probit bio09 emagri08 upov08,
I can get result.

supposedly,
the data "exrate08" has a problem.
Do you have some solutions?
This data's type is "str2" and format is "%92."

"bio09" is dummy variable whether a country cultivate GM crops.
"emagri08" is employment in agriculture by country.
"exrate08" is exchange rate by country.
"upov08" is dummy variable whether a country affiliates UPOV.

Michael said...

Kyoko:
Stata is reading exrate08 as a string variable meaning that it is reading the field as non-numeric. Functions like -probit- need numeric data to run. That's why you get the "no observations" error. Stata doesn't see any numeric data in that field.

As a first step to solve this problem, I suggest that you look at the data in this field to see what characters or entries are being interpreted as non-numeric.
-list exrate08-

The solution depends on what's causing Stata to read the data as string data. You could have alpha characters in this field. This link may help:

http://www.stata.com/support/faqs/data/allstring.html

Also, Stataman gave someone else a solution to a similar problem on November 30, 2010 so you might want to read that discussion. But I'd suggest that you understand what's causing Stata to read the field as string data before using -destring-. -destring- alters your data set.

kyoko said...

Michael,

Thank you very much for your answer.
I'll try at the university on next Monday.

kyoko said...

Michael,

There are blank data cells entered as "..";
maybe they are read as strings.
So I can use the destring command,
and I can get a result.

Thank you for your kindness.
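
A small sketch of the diagnosis and fix discussed in this exchange, on invented exchange-rate values. It assumes the only non-numeric entries are cells containing "..", as reported above; the list line is how one might check that assumption first.

```stata
clear
input str6 exrate08
"1.25"
".."
"103.4"
end

* show the entries that cannot be read as numbers
list exrate08 if missing(real(exrate08)) & exrate08 != ""

* blank out the placeholder, then convert without needing force
replace exrate08 = "" if trim(exrate08) == ".."
destring exrate08, replace
describe exrate08
```

Running destring without force here is deliberate: it refuses to convert if any unexpected text is still present, which guards against silently losing data.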

Max said...

Is there a way to tell stata to use a minimum group size in xtreg?

stataman said...

I don't think there's a direct option, but how about doing:

local min_group_size = 10
egen group_size = count(group_id), by(group_id)
xtreg y x1 x2 if group_size >= `min_group_size', fe i(group_id)

Max said...

Yeah, I thought about something like that, but my problem is that I have many ., .a, .b, .c values. So out of my 24000-something observations I can only use around 3000. Is it possible to change that command so that it only counts observations where y, x, and z are not missing?

stataman said...

Sure.

First do:

mark inSample
markout inSample x y z

Then condition the egen with "if inSample"
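
Put together, the two suggestions might look like the sketch below. The data are simulated, the variable names y, x1, x2 and group_id are placeholders, and the sketch uses egen's total() on the marker variable (rather than count() with an if condition) so that group_size is defined on every row.

```stata
clear
set seed 12345
set obs 60
gen group_id = ceil(_n/20)            // three groups of 20 rows
gen x1 = rnormal()
gen x2 = rnormal()
gen y  = 1 + x1 - x2 + rnormal()
* make most of group 3 unusable, leaving it with 5 complete rows
replace y = .a if group_id == 3 & mod(_n, 20) >= 5

local min_group_size = 10

* inSample is 1 only where y, x1 and x2 are all non-missing
mark inSample
markout inSample y x1 x2

* number of usable rows per group
egen group_size = total(inSample), by(group_id)

xtset group_id
xtreg y x1 x2 if inSample & group_size >= `min_group_size', fe
```

In this toy data only the first two groups pass the threshold, so the regression uses 40 observations in 2 groups.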

Max said...

thx that helped

Max said...

Hey me again :) i hope u dont mind.

I have a problem generating growth rates of some of my variables. I am worried that Stata does not respect the groups, because my results so far don't seem to be correct.

I thought I could just take d.sales/l.sales for the growth rate,
or (sales/l.sales)-1. When I look at the results, it seems that Stata considers the groups, but the summary shows me a mean and a std. dev. that are not plausible for a growth rate.

Is there any other way doing this correctly?

mary said...

hi there

thanks for you useful blog
I want to know why, when I run the sum command, it shows 0 observations for all my variables, even though I have entered the data correctly and it's there. Is it because I have entered them as string variables? Please help.

somayeh said...

SARA
HI
I have a problem,in my panel model I have many observation for every variable in a year ,how do I edit my observation in data editor and run this panel?

Victoria said...

I have a question: how do i create a frequency line graph by groups where the x axis is the year (its longitudinal), the y axis is frequency, and the body of the graph are 5 different lines, each representing a group.

I want to trace how the frequency of something changes by year and by group. Does this make sense?

Michael said...

Hi, Victoria:
I'm not clear on what you are doing. Are you wanting a frequency distribution or a count of records?

Victoria said...

I'm essentially trying to replicate the "Number of World Heritage properties inscribed each year by region"graph on this website: http://whc.unesco.org/en/list/stat#s12

I want this:

hist(date_inscr~d), by(region) freq

But instead of bars, I'd like just lines, and I want all the regions to be in one graph. Does this make sense?

Michael said...

Victoria:
The graph on the website looks like count information rather than a frequency distribution. Histograms are graphics showing distributions. You'll probably need to create count information in your dataset and then graph that information. I believe Stataman has a tutorial on creating count information on this website.

Michael said...

One more thing, if you have the data on the website already in your dataset, try -twoway connected y1 y2 y3 year- to create the graph.

Victoria said...

Thanks. I searched the blog and didn't find the count data tutorial that you mentioned...do you have a link (sorry)

Victoria said...

So I just created a variable "counts" with a 1 in each cell - I assume that's the same thing?

I can do:
scatter counts region, by(date_inscr~d)

where date_inscr~d is the year, but then that gives me different graphs, one for each year, and it won't let me do "by(region)" because its a factor variable.

Victoria said...

Sorry for so many comments! I basically want to graph the outcome of :

table region date_inscr~d

Where the frequencies in the table are the number of sites in each region by year (correct?)

eg
                                | date_inscribed
name_en                         | 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
--------------------------------+------------------------------------------------------------
Africa                          | 4 3 7 5 3 2 3 1 2 2 3 2
Arab States                     | 9 4 2 9 4 6 3 2 3 1
Asia and the Pacific            | 5 3 5 5 4 3 5 5 11 5 1
Europe and North America        | 7 25 10 11 3 18 10 14 17 17 12 3
Latin America and the Caribbean | 2 2 3 3 4 5 2 4 2 9 4

Michael said...

Does your data set look like the data under the graph? If it does, try:
-sort year-
-twoway connected EU_NA AS_PAC LA_CAR ARAB AFR year-

Michael said...

Victoria:
This should do it:

clear*
input year EU_NA AS_PAC LA_CAR AR AFR
1978 7 0 2 0 4
1979 25 5 2 9 3
1980 10 3 3 4 7
1981 11 5 3 2 5
1982 3 5 4 9 3
end
sort year
twoway connected EU_NA AS_PAC LA_CAR AR AFR year

Victoria said...

Hmm...when I try that

twoway connected Africa Arab Asia_Pac E_NA LA_C year

Where: e.g.
gen Arab = 1 if name_en == "Arab States"

(and name_en lists each region)

I just get a single (the Asia Pac color) straight dot line, and the y axis goes from -1 to 1

Victoria said...

Thanks - that does work. So each time, I essentially have to create a new data set? Or create the graphs in excel? I can't take my larger data set and create it?

Thanks again for the help

Michael said...

I'm afraid that I don't know what your data set looks like. If it looks like the data under the online graph, then, yes, it is easily done with -twoway connected-. I essentially recreated that dataset from under the graph with -input-.
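
If the larger data set has one row per site, with the region in name_en and the year in date_inscribed (an assumption based on the table command quoted earlier), a sketch like the following could build the counts without retyping them. The rows below are invented, and n_sites and region are hypothetical names.

```stata
clear
input str30 name_en int date_inscribed
"Africa" 1978
"Africa" 1978
"Africa" 1979
"Europe and North America" 1978
"Europe and North America" 1979
"Europe and North America" 1979
"Europe and North America" 1979
end

* one row per region-year, with the number of sites in n_sites
contract name_en date_inscribed, freq(n_sites)

* turn the region name into a numeric code so it can index the wide columns
encode name_en, gen(region)
drop name_en
reshape wide n_sites, i(date_inscribed) j(region)

* a region-year with no sites is absent after contract, so fill it with zero
foreach v of varlist n_sites* {
    replace `v' = 0 if missing(`v')
}

sort date_inscribed
twoway connected n_sites* date_inscribed
```

Since contract replaces the data in memory, it is worth wrapping this in preserve and restore (or working on a copy) when the full data set is still needed.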

somayeh said...

I have a question. When I run a panel model in Stata, I get the message "repeated time values within panel". How do I produce a composite identifier to use with "xtset" in panel data?

Samuel Chase said...

Quick question, is it possible to form a frequency distribution for a string variable? Thanks

stataman said...

Did you try to tabulate it?

tab country3letcode

LLA said...

Hi stata man!
May I ask a question that has been evading me for days, but I can't believe is that hard!

I have a data set with 3 variables:
1) "number weeks treatment": in weeks so ordinal or categorical rather than continuous I assume
codebook agrees it is numerical
2) "strain bacteria" either A or B
3) "Time to detect bacteria" in days: a continuous variable

I would like to test whether the strain of bacteria (subgroup A or subgroup B) affects the time to detect bacteria (y axis) over number of weeks treatment (x axis)

I think this must be a multiple regression so have created 2 dummy variables, one that contains all the time to detect bacteria if strain=A and one all the time to detect bacteria if strain=B I used:
gen strain_A=time_to_detect_bac
replace strain_A=. if strain!=A
gen strain_B=time_to_detect_bac
replace strain_B=. if strain!=B

when I type
regress strain_A strain_B num_weeks_rx
it consistently replies "no observations"
can't work out why, have tried various things, any ideas?

Michael said...

LLA:
Are any of your variables in your regression model a string variable? You can use -codebook- to examine your variables. Variables in a regression model must be numeric.

LLA said...

Thank you for the thought, but have checked them all with codebook & they are all numeric (byte).

Michael said...

Next step, which seems basic, is examining your data using -browse y x1 x2-. You are looking at the data for your regression variables (y, x1, x2 = strain_A strain_B num_weeks_rx) for some clue why Stata cannot see any observations. All these values should be either numbers or . and in black (not red).

LLA said...

Thanks, yes, they're all black.

I don't understand why they're called y, x1, x2 (as Stata says in the help section too) when really it's y1, y2, x, i.e. dependent var1 strain_A, dependent var2 strain_B, independent variable number weeks treatment. That makes me concerned the error is because I'm doing the wrong test. There are quite a lot of missing values: 76 obs in strain A and 343 in strain B, but surely this doesn't matter, as 2 groups are rarely the same size and tests still work..
Thanks very much for your help

Michael said...

No need to worry about the naming of y, x1, x2. In my background, y= dependent variable. Just make sure the order of your variables for -regress- is correct.

Next step, what type of data do you have for your dependent variable: continuous, categorical, etc. And what type of data for your two predictor variables?

LLA said...

Thank you, sorry for some reason didn't see you reply that time.
My variables are as follows;
dependent var1 strain_A: continuous (time)
dependent var2 strain_B: continuous (time)
independent variable number weeks treatment: ordinal (or categorical), I think, as it is time but in weeks: 0 weeks, 1 week, 2 weeks, etc., with no fractions of weeks

Michael said...

Are you trying to model two dependent variables in one regression? If you are, that won't work. Perhaps you need two models: (1) - reg strain_A number_weeks_treatment-
(2) - reg strain_B number_weeks_treatment-
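
On why the original command reports "no observations": regress only uses rows where every listed variable is non-missing, and with the split shown earlier each row is missing either strain_A or strain_B. The sketch below reproduces that on invented data, then shows one alternative layout (a single outcome column plus a 0/1 strain indicator, here called is_B) purely as an illustration. Whether that model suits the research question is a separate matter.

```stata
clear
input str1 strain num_weeks_rx time_to_detect_bac
"A" 0 5
"A" 1 7
"A" 2 10
"B" 0 6
"B" 1 10
"B" 2 13
end

gen strain_A = time_to_detect_bac if strain == "A"
gen strain_B = time_to_detect_bac if strain == "B"

* rows where all three variables are non-missing: there are none
count if !missing(strain_A, strain_B, num_weeks_rx)

* alternative layout: one outcome, weeks, strain indicator and their interaction
gen byte is_B = (strain == "B")
regress time_to_detect_bac c.num_weeks_rx##i.is_B
```

The count line is a quick check to run before any regression that unexpectedly returns r(2000).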

LLA said...

I'm not sure, Don't think the 2 models will give me the answer I want, as I want to see if strain is correlated to num weeks rx but also if strain type (A or B) affects the time to detect bacteria.. Maybe I've ordered the data wrongly?

Michael said...

I'm afraid I can't help there. That's a research methodology issue and not my field. Perhaps you ought to talk with someone in your field. I guess that ANOVA might be a technique to explore. Good luck.

LLA said...

Thank you very much for your advice, Michael. I think I need to look over the data again to decide what I have..

LLA said...

May I take advantage of your stata knowledge one more time?
I have some survival data which I'd like to plot Kaplan-Meier curves for. Because the 'event' in time-to-event is a positive one, I think the graph would make more sense if it started low (i.e. crossing the x axis at 0) and went to 1 at the end. I think this means I would like to plot analysis time against 1-survival. Do you know if it's possible to ask Stata to do this?
Many thanks
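
For what it's worth, sts graph has a failure option that plots 1 minus the Kaplan-Meier survivor estimate. A minimal sketch using the cancer example dataset that ships with Stata (not the commenter's data):

```stata
sysuse cancer, clear

* declare the survival-time data: time variable and event indicator
stset studytime, failure(died)

* failure plots 1 - S(t), so the curve starts at 0 and rises
sts graph, failure
```

Adding by(drug) to the sts graph line would draw one rising curve per treatment group.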

PV said...

Hi Stata Man,

I would like to second Paul de Boer's request.

I have a dummy variable. I can run "tab dummy", I can run "reg dummy x1 x2", I can run "cor dummy x1".

But I CANNOT run "logit dummy x1", nor "xtlogit dummy x1".

Everytime I try a logit command, I get the error message "number of dependent variables for equation 1 must be greater than zero r(198)"

What am I doing wrong? I used to run logits all of the time, and I am not aware of doing anything differently.

Thank you,
PV

stataman said...

Sorry. I have no idea.

I guess you can call me statapig now... ?
http://www.youtube.com/watch?v=714-Ioa4XQw

PV said...

Thanks for trying, Statapig.

I just ran the same do file with the same dataset on a different computer, and everything worked perfectly. I guess at some point a computer just decides it won't run another logit model?

And there's your solution Paul: buy another computer.

ncarvalho said...

To PV and Paul de Boer (and anyone who was getting this error:)

Everytime I try a logit command, I get the error message "number of dependent variables for equation 1 must be greater than zero r(198)"

You need to uninstall and reinstall your version of Stata b/c for some reason it must have become corrupted. I had the same issue and uninstalled/reinstalled it and the issue has resolved.

Good luck!

priya said...

Hi!

I am glad to see the blog helping people out. I am running xtlogit and want to know how I can check for autocorrelation and heteroskedasticity. I tried xtserial, but it gives me no result, just a dot in place of the F and p values. What other test could I use in case this is not possible?

priya said...

I have another query. I want to run DEA in Stata, but I have a lot of missing values in my data. Stata keeps running but gives no output for almost five to six hours. I have around 3000 observations in a panel dataset.

Adam said...

stata man, blog is great. Question for you. I am looking at a time series of data (15years or so) for a bunch of firms. I am trying to estimate a left hand side variable for firm i and I am supposed to estimate my coefficients by industry and year, excluding firm i.

So essentially, I've got 76,000 observations and lets say I've got 20 observations in industry "a" in year "t". I've got to run a regression for firms 2-20 in that industry/year to estimate my coefficients (I've got all relevant left and right hand side variables), then plug in those estimates with my right hand side variables for firm 1 in that industry/year to calculate my left hand side dependent variable. Then repeat for firm 2, etc...

Any suggestions on how to write this code?

Thanks

zaber said...

Hi I am trying to find effect of a policy variable (dummy) that does not change over time, on a pooled cross sectional database consisting of more than 100 countries. I have 500 observations.
In my regression I tried to use country dummies (1 for a specific country, 0 for others), but Stata says "no observations". When I run the regression without these country dummies, the results are there. Can you please tell me what is going wrong? Is there any other way to include country dummies than the technique I just described?

mac said...

How did you create your dummies? Also, look at the properties of each
dummy variable to make sure they are numeric rather than string variables.

AG said...

Hey Stata Man!

I want to make a loop that has inside of it:

forvalues k=1/6{
clear all
use data`k'.dta
save data`k'.dta,replace
}

and I can't get the program to work. It does everything, except that when reading data`k'.dta it reads data.dta instead of data1.dta.

Do you know how to make a loop with clear all and save inside?

Thank you,
Ana

Unknown said...

Hi Stata Man!

I would like to make a loop like the following:

forvalues v=1/6{
clear all
use data`v'.dta
save data`v'.dta
}

but when I write it like this instead of saving it as data1.dta it saves it as data.dta for all cases. Would you happen to know how I can fix this?

Thank you,
AG
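
A sketch of one way to write such a loop, assuming files named data1.dta to data6.dta in the working directory (the first loop below only creates toy versions of them). It loads each file with use ..., clear instead of a separate clear all, and adds replace to save because the file already exists. It also assumes the whole loop is run in one go: a local macro such as `k' only exists while the block that defines it is executing, so running the inner lines on their own from the do-file editor would leave it empty, which may be one reason for seeing data.dta.

```stata
* create six small example files (toy stand-ins for the real data)
forvalues k = 1/6 {
    clear
    set obs 3
    gen x = `k'
    save data`k'.dta, replace
}

* open each file, change it, and save it back under the same name
forvalues k = 1/6 {
    use data`k'.dta, clear
    gen x2 = x^2
    save data`k'.dta, replace
}
```

The gen x2 line stands in for whatever processing each file needs.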

Peter Pan said...

Maybe someone knows if there is a command for that.

if i wanna create a dummy var for a set of countries is there a way to create it for a list?

Say i have the ISO 3 digit Numerical codes and i want a dummy for Countries 110-120 or for different numbers.

stataman said...

sure.

look up the command xi

Edson C. Araujo said...

Hey, well done for the site, very useful!

I'm not sure if this is still active, but would you mind giving me a hand with a Stata issue?

If so, can I send through here or to your e-mail?
Thanks

araujoec@gmail.com

Peter Pan said...

Thanks for the tip, but I want to create one dummy for a group of countries, not separate ones for each country. So let's say I have ISO codes for all countries and I want a dummy for all G77 countries. I'd like something like

gen Dummy_G77=0
replace Dummy_G77=1 if IS0==(...)

In the brackets I would like to put a list of the G77 ISO codes or all numbers. I would not like to write a loop or add a | condition for every country.

stataman said...

Try:

egen g7= anymatch(iso), ....


I put the dots because I don't remember the option, but you can give it a list of numbers. Look it up.
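
The option is values(), which takes a list of integers (ranges such as 110/120 are allowed). A minimal sketch on invented codes; the numbers below are placeholders and not the actual G77 membership list.

```stata
clear
input int iso
4
8
12
110
115
120
840
end

* 1 if iso equals any of the listed integers, 0 otherwise
egen byte Dummy_G77 = anymatch(iso), values(4 12 110/120)
list
```

For a short list, gen byte d = inlist(iso, 4, 12) | inrange(iso, 110, 120) gives the same result without egen.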

Help Me said...

Help me Stataman,
all my variables are binary, I keep running the logit command and i get no observations

MVS said...

Dear Stataman,
Please help me. I’m stuck. I intend to do discrete choice modeling. I have thousands of patients (say 5,000) and a few hospitals and tens of surgeons. Hospital-surgeon pair is a choice. Each patient faces around 50 choices. My dataset has one observation for each patient which corresponds to the chosen hospital and the chosen surgeon i.e. the variable choice = 1 in all these observations. There are at least 10 variables that describe hospital and surgeon characteristics in all the observations. Now I want to create a choice set for each one of the patients. That means adding around 50 observations per patient ID where the patient did not choose the remaining hospital-surgeon pairs in the choice dataset (for the non-chosen alternatives). The challenge I’m facing is how to create this dataset of 250,000 observations from my dataset. How do I do it in STATA or SAS?
Mahesh

Lizard said...

Hi Stata man,

Does collapsing a data set create problems with variable estimation if my original data contained some missing values? For example: collapse...(mean)age. And some age variables are missing. Will my mean age be affected by the missing values?

Thanks.
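
A quick way to see what collapse does with missing values is to try it on a tiny invented example. The mean is taken over the non-missing values only, and asking for (count) alongside it shows how many values went into each mean.

```stata
clear
input id age
1 20
1 .
1 40
2 50
end

* mean ignores the missing age; count reports the non-missing values used
collapse (mean) mean_age = age (count) n_age = age, by(id)
list
```

Here id 1 gets a mean of 30 based on 2 values, so the missing row neither counts as zero nor drops the whole group.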

Thomas said...

Hello everybody

I am new in using Stata but already got a huge problem. I try to do a unit root (ADF) for some time series. The problem is, i get the error message:

. dfuller i_spiindex, trend lags(2)
no observations
r(2000);

what is the problem here? the time series is already numerical and when using summarize it tells me to have 141 observations?

thanks for your help

mac said...

Did you -tsset- the data?
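
A sketch of what declaring the time variable might look like, on a simulated series of 141 observations. The index t is an assumption: it simply numbers the observations consecutively. If the real time variable jumps between every pair of observations (for example monthly data stored as daily dates), every lag is missing, which is one way dfuller can end up with no observations.

```stata
clear
set seed 2024
set obs 141

* consecutive time index with no gaps
gen t = _n

* simulated random walk standing in for the real series
gen i_spiindex = sum(rnormal())

tsset t
dfuller i_spiindex, trend lags(2)
```

With real dates, tsset's delta() option or a proper monthly or quarterly date variable serves the same purpose as the consecutive index.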

Buddy said...

Hi Stataman!

I need some advice on some very basic things. This is my first time to do a choice experiment and to use Stata.

I already have my raw data in my computer and I need to set it up for multinomial logistic regression.
My choice experiment is about vehicle choice.
I have 6 attributes with the following levels (5x3x3x2x2x2)
Each respondent has to answer 10 choice sets. Each choice set has 2 alternatives with the opt out option.
This is how I initially set up my data (just an illustration)

id choice P1 P2 P3 P4 P5 T1 T2 T3..
1 0 1 0 0 0 0 0 1 0
1 1 0 0 0 1 0 1 0 0
2 1 0 1 0 0 0 0 0 1
2 0 0 0 1 0 0 0 1 0
.
.
P1..P5, T1-T3 are attribute levels

So I made a column for each attribute level and made it a dummy variable.
Each row is a type of vehicle. If the vehicle possesses a certain attribute level (e.g. eng2), that variable has value=1, otherwise 0.

If the respondent chooses an alternative, choice=1; if not, choice=0.
If the respondent "opts out" (e.g. ce_id 3), both alternatives have 0 value for choice.

Am I doing this data setup right?

Plus, how do I treat the "opt out" option?

Many thanks!
BudZ

SBB said...

Hi! I'm wondering if there is a way to collapse a list of dummy variables while turning them into a count of how many times that dummy variable appears as 1 in the original data. For instance, if my panel data where A is a country is:
Dum1 Dum2 Dum3
A 1 0 1
A 1 1 0

I want to get:
A 2 1 1

I tried
collapse (sum) Dum*, by(country)
and
collapse (count) Dum*, by(country)
neither of which work.
I look forward to your suggestion,
Best.
SBB
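
For reference, with numeric 0/1 dummies collapse (sum) does produce the counts asked for, as the sketch below shows on the two rows from the comment. (count) instead counts non-missing values, which would give 2 2 2 here. The sketch assumes the dummies are numeric; if they were imported as strings they would need destring first.

```stata
clear
input str1 country Dum1 Dum2 Dum3
"A" 1 0 1
"A" 1 1 0
end

* add up the ones within each country
collapse (sum) Dum*, by(country)
list
```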

Unknown said...

Stata Man, I hope you can help. I work with a dataset with a lot of binary data. I want to create a connected line graph of the "mean", aka proportion, of one of my binary variables by year. I'm having a terrible time with it (using Stata 12). Any advice?
LB
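
One possible approach, sketched on invented data: collapse the binary variable to its yearly mean and plot that. The names year, flag and share are placeholders.

```stata
clear
input int year byte flag
2010 1
2010 0
2010 0
2010 1
2011 1
2011 1
2011 0
2011 1
2012 0
2012 0
2012 1
2012 0
end

* the mean of a 0/1 variable is the proportion of ones
collapse (mean) share = flag, by(year)

twoway connected share year
```

As with any collapse, the original rows are replaced, so preserve and restore are useful around it.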

Has said...

Hi, can you please tell me how to use the yes/no responses in survey data?

Thannaletchimy said...

Hey Stata man,

I would like to clarify my doubts regarding running the DW test on my panel. I have a panel data for bilateral trade of 11 countries disaggregated over 20 sectors. To create the IDs, I simply assumed that each bilateral trade pair for a particular sector is coded as an id. So I ended up having 200 ids ! Is that the correct way to shape my data ? When I ran the regressions and when I try to do dwstat, I get r 459. So I am wondering if it is the format of my data that is wrong.

Thanks in advance for your help.

Jonathan Haskel said...

Thanks STATAMAN my search for how to code "country" turned you up and you have saved me days. thanks. One thing: in panel data on stata, i think i am right, you DO have to code your own time dummies, is that right?
thanks again, Jonathan

Stef Salvez said...

Dear Stataman!

I have a panel data on prices that vary across country and time:

clear all

input id str8 (dates) variable
1 "23/11/08" 2
1 "28/12/08" 3
1 "25/01/09" 4
1 "22/02/09" 5
1 "29/03/09" 6
1 "26/04/09" 32
1 "24/05/09" 23
1 "28/06/09" 32
2 "26/10/08" 45
2 "23/11/08" 46
2 "21/12/08" 90
2 "18/01/09" 54
2 "15/02/09" 65
2 "16/03/09" 77
2 "12/04/09" 7
2 "10/05/09" 6
end


As you can see, the start and end dates of the time series differ between countries 1 and 2. For example, for country 1 the series begins on "23/11/08", while for country 2 it begins on "26/10/08".

My price data are available every 28 days (or, equivalently, every 4 weeks), so each observation is a 4-week average. In some cases, though, there are jumps (35 or 29 days instead of 28). For example, in the table above there are jumps from "23/11/08" to "28/12/08", from "22/02/09" to "29/03/09", etc.

My goal is to create a unified sequence of dates across countries; otherwise I cannot do further econometric/data analysis. Unless you have a different suggestion, I want to take what I have and calculate monthly average prices. So I want to change the data frequency (via interpolation?) so that, instead of 4-week or 5-week averages, I have monthly averages.
Please, I would be grateful if you could provide some code to achieve this.
Thank you for your understanding.
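
A rough sketch of one way to get a common monthly index, under a simplifying assumption: each 4-week average is assigned to the calendar month of its date, and observations falling in the same month are averaged. This is not interpolation, only a first approximation. The variable is renamed price here for readability, and only a few rows of the posted data are used.

```stata
* A few rows from the posted data (value variable renamed price)
clear
input id str8 dates price
1 "23/11/08" 2
1 "28/12/08" 3
2 "26/10/08" 45
2 "23/11/08" 46
end

* Convert the string to a Stata daily date; "20Y" reads 08 as 2008
gen ddate = date(dates, "DM20Y")
format ddate %td

* Map each daily date to its calendar month
gen mdate = mofd(ddate)
format mdate %tm

* One observation per country and month
collapse (mean) price, by(id mdate)
xtset id mdate
```

Months with no observation remain absent after this step; tsfill followed by ipolate would be one way to fill them, if interpolation is acceptable for the analysis.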

Ali Nadeem said...

Hi Stataman, I could really use your suggestion! I am using a data set of 209 countries, and I have run a fixed effects regression for non-OECD countries. When I summarize my variables, Stata gives me all the variables in the data set, but I only need a summary of the variables that were used in my fixed effects regression. For instance, when I run fixed effects, the number of groups is 138. I believe Stata drops some observations or some countries, so my total number of observations for a variable should be 138 multiplied by 26 years (1980-2006), which is only 3588 observations. How can I ask Stata to generate this? Please help.
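
A minimal sketch for restricting summary statistics to the estimation sample: after an estimation command, the function e(sample) marks the observations that were actually used. The data below are simulated and the names y, x, id and year are hypothetical.

```stata
* Simulated panel with some missing values (hypothetical data)
clear
set seed 2024
set obs 60
gen id = ceil(_n/6)
bysort id: gen year = 1980 + _n
gen x = rnormal()
replace x = . in 1/6
gen y = 2 + x + rnormal()
xtset id year

* Observations with missing y or x are dropped from the regression
xtreg y x, fe

* Summarize only the variables and observations used in the regression
summarize y x if e(sample)
count if e(sample)
```

To keep the marker after running further estimation commands, gen byte used = e(sample) stores it as an ordinary variable.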

Lino said...

Hi Stataman,

I need your help urgently!

I am dealing with an unbalanced panel data (30 countries and 25 years).

My questions:

1. I tested my panel dataset for stationarity with the Levin-Lin-Chu test. Stata requires one variable at a time. I had to use the dependent variable because it has no gaps; I could not use any other variable because they have gaps, and the software does not run the test when it finds gaps. Performing the test as described, I get a p-value of 0.0014, which I interpret as rejecting the null hypothesis that the panels contain unit roots. In other words, the test confirms the absence of a unit root, which means that the specification in use is stationary/valid. Am I right?

2. I ran the LR test for heteroskedasticity as follows

xtgls depvar indepvars, igls panels(heteroskedastic)

estimates store hetero

xtgls depvar indepvars

local df = e(N_g)-1

lrtest hetero . , df(`df')

The p-value turns out to be 0.0000, which I interpret as rejecting the null hypothesis, meaning we have heteroskedasticity. Am I right?

3. I run the Wooldridge test for autocorrelation by using

xtserial depvar indepvars

and I get a p-value of 0.0000. Here again I reject the null hypothesis of no autocorrelation. In other words, I have autocorrelation. Am I right?

My panel, then, is stationary (which is good!), but has heteroskedasticity and autocorrelation problems.

To overcome these problems, I think I am right to rely on the results from the following estimation (which could be my final task ... right?)

xtgls depvar indepvars, igls panels(heteroskedastic)

which gives me Cross-sectional time-series FGLS regression

I wonder if I can consider (although the signs of the coefficients are different)

xtmixed depvar indepvars || _all: R.id || _all: R.year,mle

Regarding this, I would like to know whether this command already corrects for heteroskedasticity and autocorrelation, or whether I should specify some option to do so.

Thanks in advance.

Journey Chronicle in Science and Letters said...

. alpha v243 v52 v223 v183, casewise item std
no observations
r(2000);
. alpha v243 v52 v223 v183
cannot determine the sense empirically; must specify option asis
r(459);
. corr v243 v52 v223 v183
no observations
r(2000);
. fuck you and your no observations stata you fucking bastard there is nothing wrong with the fucking data
unrecognized command: fuck
r(199);

John said...

I am working with the UCDP Battle Related Deaths dataset. It has data on conflicts grouped by warring party, location and year. I am comparing it with another dataset, and my unit of analysis is the country-year. I want to find a way to break out the warring party observations in the UCDP dataset by individual country.

Just to clarify: what I am looking for is the total number of battle deaths in a country-year, whether the country was on either side of the battle or the battle took place in that country. The command (using the variable names from the dataset):

total(bdBest) if strpos(SideA, "Burundi") | strpos(SideA2nd, "Burundi")| strpos(SideB, "Burundi") | strpos(SideB2nd, "Burundi") | strpos(Battlelocation, "Burundi"), over(YEAR)

Gives me a table of what I am looking for but of course I would have to do this for every single country and I would like to get the info from that table inserted into the dataset somehow.

Thanks
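
A sketch of one way to loop over countries and store the yearly totals as variables in the dataset, instead of reading them off a table. It assumes the variable names quoted in the comment (only three of the five string variables are used to keep it short), invented toy values, and country names without spaces, so that they can serve as variable-name suffixes.

```stata
* Toy data in the shape described in the comment (hypothetical values)
clear
input str20 SideA str20 SideB str20 Battlelocation YEAR bdBest
"Burundi" "Rebels"  "Burundi" 2000 100
"Rwanda"  "Burundi" "Rwanda"  2000 50
"Rwanda"  "Rebels"  "Rwanda"  2001 30
end

foreach c in Burundi Rwanda {
    * 1 if the country appears on either side or as the battle location
    gen byte involved = strpos(SideA, "`c'") > 0 | ///
                        strpos(SideB, "`c'") > 0 | ///
                        strpos(Battlelocation, "`c'") > 0
    * Yearly total of deaths over the rows involving this country
    egen double bd_`c' = total(bdBest * involved), by(YEAR)
    drop involved
}
list
```

For names containing spaces, a cleaned suffix (for example via strtoname()) would be needed before building the new variable name. Note also that strpos() matches substrings, so a short name that is part of a longer one would be matched as well.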

HHB said...

Hello Stataman - thank you for this great BLOG. I am using STATA for the first time to analyze some discrete choice experiment data. I can run McFadden's cond logit using the clogit command (with choices grouped into choice sets), but I also want to run a random effects model (I have 200 respondents who each performed the same 13 choice tasks). When I use xtset or xtlogit with "re", STATA treats my data as binary choice- not as choice experiment data. I cannot find the appropriate command for this anywhere. Appreciate your help - Henry

M said...

Do you know what "variable mode has replicate levels for one or more cases; this is not allowed
r(459);" means is wrong with my data?

M said...

I'm sorry but I realized I didn't mention that I was trying to run a nested logit regression. My data looks just like the one in the restaurant example in STATA help so I can't figure out what's wrong.

xarifx said...

Hi Stataman!

Being fairly new to Stata, I'm having a difficulty figuring out how to do the following:

I have time-series data on selling price (p) and quantity sold (q) for 10 products in a single data file (i.e., 20 variables, p01-p10 and q01-q10). I am struggling to find the appropriate Stata command to compute sales revenue (pq) for each of these 10 products (i.e., pq01-pq10).

I would greatly appreciate your help.

Thank you.
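
A minimal sketch using forvalues, assuming the variables are named exactly p01-p10 and q01-q10 as described. The first loop only builds toy data so that the example runs on its own; the second loop is the part that computes revenue.

```stata
* Toy data with p01-p10 and q01-q10 (hypothetical values)
clear
set obs 3
forvalues i = 1/10 {
    local k = string(`i', "%02.0f")   // 1 -> "01", ..., 10 -> "10"
    gen p`k' = `i'
    gen q`k' = _n
}

* Revenue for each product: pq01 = p01*q01, ..., pq10 = p10*q10
forvalues i = 1/10 {
    local k = string(`i', "%02.0f")
    gen pq`k' = p`k' * q`k'
}
```

The string() call with the "%02.0f" format is what produces the leading zero in the variable names.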

Vid DD said...

In my data, I have variables as follows: household ID, ID of persons in household, father ID, years of education, who is the father. So person 3 in house 23 for example might say that person 1 is his or her father, while person 6 and 7 and 8 also in house 23 says that person 9 is their father. This is likely a joint family.

So I can't make a new column eduF in the usual way, since for person 3 and 6/7/8 in the same household, the father is different so the eduF level varies even in the same household. I need however this new column eduF saying, for each member of the family, what is the education level of the person they list to be their father.

I think this requires forvalues or foreach and loops, but am not sure what would be the code! Any help would be SINCERELY appreciated.
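
One possible way to do this without loops is to merge the data with a copy of itself, matching each person's father ID to the person ID of the father within the same household. The sketch below assumes hypothetical variable names hhid, pid, fatherid and edu, invented toy values, and that pid is unique within a household.

```stata
* Toy household data (hypothetical names and values)
clear
input hhid pid fatherid edu
23 1 . 10
23 3 1 12
23 9 . 5
23 6 9 8
23 7 9 9
end

* Build a lookup file: every person is a potential father
preserve
keep hhid pid edu
rename pid fatherid
rename edu eduF
tempfile fathers
save `fathers'
restore

* Attach the education of the person listed as father
merge m:1 hhid fatherid using `fathers', keep(master match) nogenerate
list
```

People with no father listed in the household keep a missing eduF.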

vazir98 said...

I am using time-series data to analyse a bilateral relationship. Because the distance variable is constant, Stata drops it from the analysis. How can I include the distance variable so that Stata uses it?

Ambrose

boeme said...

Hello,

I am trying to test my panel data models for serial correlation with xtserial, but when I use the command I get the result: unrecognized command: xtserial. And when I tried ssc install xtserial, I got the result: ssc install: "xtserial" not found at SSC. I wonder if there is an alternative to this test, or if the syntax of the command has changed?
I am using Stata 12 and I have done the updates.
Many thanks in advance.

daticon said...

Boeme, try any of these:

findit xtserial
net sj 3-2 st0039
net install st0039

boeme said...

Thank you very much. The last two commands worked and I ran the test. Many thanks.

Menka said...

Hello,
I have a panel data set and, after the Hausman and LM tests, I now have to carry out a pooled OLS regression. I need to test for autocorrelation. I only wanted to confirm whether the command "xtserial" is for fixed and random effects only, or whether I can use it for pooled OLS as well.
Thanks in advance!
